Re: Question about VACUUM behavior with sub-transactions in stored procedures
Hello everyone,

I'd like to revisit the topic of autovacuum's interaction with stored procedures that perform transactions, with a more technical clarification as suggested earlier. Let's consider the behavior of VACUUM and system table updates after transaction commits in procedures that frequently open and commit transactions.

As I understand it, the statistics updates in PostgreSQL, which autovacuum later relies on, are performed in pgstat_report_stat, called within db/src/backend/tcop/postgres.c in the PostgresMain function. Specifically:

    stats_timeout = pgstat_report_stat(false);
    if (stats_timeout > 0)
    {
        if (!get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
            enable_timeout_after(IDLE_STATS_UPDATE_TIMEOUT, stats_timeout);
    }
    else
    {
        /* all stats flushed, no need for the timeout */
        if (get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
            disable_timeout(IDLE_STATS_UPDATE_TIMEOUT, false);
    }

Inside procedures, when _SPI_commit is called in db/src/backend/executor/spi.c, the main command responsible for completing a transaction is CommitTransactionCommand().

My questions are the following:

1. Is it expected behavior that system table updates are deferred until all nested transactions are complete? This would mean that autovacuum might not account for dead tuples accumulated during procedure execution until the entire main transaction is finished.
2. Is it possible or advisable to call pgstat_report_stat after each CommitTransactionCommand() within procedures so that autovacuum can track intermediate changes and prevent an accumulation of dead rows?
3. To what extent would this approach be viable in terms of performance and correctness?

I look forward to any insights and advice you can offer on this matter.

Best regards, Vyacheslav Kirillov

From: David G. Johnston
Sent: October 21, 2024, 16:55
To: Кириллов Вячеслав
Cc: pgsql-hack...@postgresql.org
Subject: Re: Question about VACUUM behavior with sub-transactions in stored procedures

On Monday, October 21, 2024, Кириллов Вячеслав <vkiril...@diasoft.ru> wrote:
> I have a question regarding the behavior of the auto VACUUM in PostgreSQL in the context of using stored procedures with sub-transactions.

This is a general usage inquiry not suited to discussion on -hackers. We have a -general mailing list to discuss how to use the product. This list is for discussing patches.

> Here is the scenario: we have several stored procedures that modify or update table data. These procedures use sub-transactions, which are committed via COMMIT.

This isn't how sub-transactions work. They are created mainly by save points and are not independently committed (by the user in SQL). What you are using are full transactions. https://www.postgresql.org/docs/17/plpgsql-transactions.html

David J.
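For anyone who wants to reproduce the workload being described, here is a minimal sketch; the table and procedure names are hypothetical, it only reproduces the pattern of many commits inside one procedure call, and it makes no claim about when the statistics actually become visible:

```
-- Hypothetical workload: a procedure that updates rows and commits
-- repeatedly within a single CALL. Each COMMIT below goes through
-- _SPI_commit / CommitTransactionCommand() as discussed above.
CREATE TABLE IF NOT EXISTS churn_tab (id int PRIMARY KEY, payload int);
INSERT INTO churn_tab SELECT g, 0 FROM generate_series(1, 10000) g
ON CONFLICT DO NOTHING;

CREATE OR REPLACE PROCEDURE churn(iterations int)
LANGUAGE plpgsql
AS $$
BEGIN
    FOR i IN 1..iterations LOOP
        UPDATE churn_tab SET payload = payload + 1;  -- creates dead tuples
        COMMIT;                                      -- intra-procedure commit
    END LOOP;
END;
$$;

CALL churn(50);
-- While churn() runs, watch n_dead_tup for churn_tab from another session:
--   SELECT n_dead_tup FROM pg_stat_user_tables WHERE relname = 'churn_tab';
```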
Re: Count and log pages set all-frozen by vacuum
Hi Melanie

On Wed, 30 Oct 2024 at 21:42, Melanie Plageman wrote:
> ...
> The number of pages set all-frozen in the visibility map by a given
> vacuum can be useful when analyzing which normal vacuums reduce the
> number of pages future aggressive vacuum need to scan.
>
> Also, empty pages that are set all-frozen in the VM don't show up in
> the count of pages with newly frozen tuples. When making sense of the
> result of visibilitymap_summary() for a relation, it's helpful to know
> how many pages were set all-frozen in the VM by each vacuum.
>

I agree that this data would be useful for analysing the impact of vacuum operations. The values returned in the case where pages are removed (cases where the empty pages are at the end of the table) are a bit confusing.

Here is an example similar to yours, but with a normal vacuum operation, since that seems to be the most useful case for this feature:

CREATE TABLE the_table (first int, second int) WITH (autovacuum_enabled = false);
INSERT INTO the_table SELECT generate_series(1,1000), 1;
DELETE FROM the_table WHERE first > 50;
VACUUM (VERBOSE) the_table;

pages: 4 removed, 1 remain, 5 scanned (100.00% of total)
tuples: 950 removed, 50 remain, 0 are dead but not yet removable
removable cutoff: 763, which was 1 XIDs old when operation ended
new relfrozenxid: 761, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen.
4 pages set all-frozen in the VM

It looks like 4 pages out of 1 are set all-frozen. So there are -3 to scan for the next aggressive vacuum? The four pages which were set to all-frozen are the same four which have been removed from the end of the table.

For an example where the empty pages which are marked all-frozen are at the start of the table (deleting values < 950 in the example), the results are more intuitive:

pages: 0 removed, 5 remain, 5 scanned (100.00% of total)
tuples: 949 removed, 51 remain, 0 are dead but not yet removable
removable cutoff: 768, which was 0 XIDs old when operation ended
new relfrozenxid: 766, which is 1 XIDs ahead of previous value
frozen: 0 pages from table (0.00% of total) had 0 tuples frozen.
4 pages set all-frozen in the VM

Can the pages which are removed from the end of the table not be counted towards those set all-frozen, to make the arithmetic a bit more intuitive for this edge case?

Thanks
Alastair
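As a side note, the visibility map state of the pages that remain after truncation can be inspected directly with the pg_visibility contrib module, which makes it easier to see which surviving pages actually carry the all-frozen bit (a sketch using the example table above):

```
-- Requires the pg_visibility contrib module.
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Summary counts of all-visible / all-frozen pages for the relation.
SELECT * FROM pg_visibility_map_summary('the_table');

-- Per-page view: only pages still present in the relation are listed, so
-- pages truncated away from the end of the table no longer show up here.
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('the_table');
```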
Re: Index AM API cleanup
> On Oct 30, 2024, at 12:54 PM, Peter Eisentraut wrote: > > So this is the idea. To take a step back, I can see the following > options: > > 1. Require all index AMs to use btree-compatible strategy numbers. > (Previously rejected, too much upgrading mess.) > > 2. Provide mapping between index AM strategy numbers and btree strategy > numbers. (This doesn't have a space for non-btree operations like > overlaps.) I agree that neither of these options are good, for the reasons you mention. > 3. Provide mapping between index AM strategy numbers and the existing > RT*StrategyNumber numbers. (We can expand the set as we want.) > (This is what the existing mapping function for gist does.) The point of such a mapping is that core code outside any index AM can know what an AM is claiming it can do when it advertises that it implements one of these strategy numbers. We don't need any new entries in that mapping until some core feature depends on the semantics of the new entry. Right now, all six of the btree comparators (including not-equal) have semantics that are used outside AMs by functions like match_clause_to_ordering_op(). If any index AM author comes along and creates an index AM which purports to provide these six semantics but actually does something semantically inconsistent with what the backend thinks these mean, that index AM is totally at fault when, for example, ORDER BY returns the wrong results. On the other hand, if we add RTOverlapStrategyNumber to the common framework of strategy numbers, without anything outside an index AM depending on that, how is an index AM author to know exactly how an "overlaps" operator is supposed to behave? No doubt, brin, gist, spgist, and friends all have their own understanding of what RTOverlapStrategyNumber means, but how is a new index AM supposed to know if it has analogized that concept correctly to its own operator? And if several major versions later, you come along to create some feature, let's say a logical replication feature depending on "overlaps" semantics, how are you to know whether all the index AMs in the wild which advertise they provide an "overlaps" operator will work correctly with your new feature? When logical replication breaks, who is at fault? Perversely, knowing that RTOverlapStrategyNumber is already in the list, you would likely implement the new logical replication feature on some new strategy number, perhaps naming it RTLogicalOverlapStrategyNumber, to avoid such conflicts. The RT*StrategyNumber list is much too long, containing many such problematic entries. We should not, in my opinion, add these to the list prior to some new feature which depends on them, such as a planner or executor optimization. > 4. Provide mapping between index AM strategy numbers and some > completely new set of numbers/symbols. This is fine, if the new set is sufficiently restricted. However, as mentioned below, the set of sufficiently restricted values is identical to what we currently define as a RowCompareType. It creates needless code churn to throw that away and replace it with a new name. > 5. Provide mapping between index AM strategy numbers and > RowCompareType (with some small extensions). This is what this > patch does. As the patch author, obviously this is the one I chose. The "small extensions" are just to handle "no such value" type cases. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Segfault in jit tuple deforming on arm64 due to LLVM issue
On Thu, Oct 31, 2024 at 6:49 AM Thomas Munro wrote: > There are a couple of cases of dual-licensed code in our tree where we > explicitly used the Boost alternative instead of Apache 2. I plead > complete ignorance of this topic and defer to those who know about > such things: can we actually do this? I guess at a minimum a copy of > the licence would need to appear somewhere -- perhaps under > src/backend/jit/llvm? I'm also not super knowledgeable about the licensing intricacies but I read it the same way - a license file has to be provided due to the 4a clause. llvmlite did this when they added the patched memory manager[1] > 4d says that if you modified the code you have > to say so prominently, but I did that at the top (and the changes are > completely trivial, just some #ifdef swizzling to massage some > function prototypes to suit older LLVMs). Otherwise I understand it > to be generally "BSD-like" (sans advert clause) but there is also some > stuff about patents, which surely aren't relevant to this in > practice... but I know that some projects object to it on principle > and because it smells like contract law, or something not an area > I am well informed about. Who should I be asking? (Naively, I > wondered: could there be some kind of fair use concept for > back-patching fixes to broken libraries that you're merely a user of > where you can be excused from the burdens of a distributor? Yeah > wishful thinking I'm sure.) You mean 4b, right? LLVM doesn't seem to have any NOTICE files so the 4d clause shouldn't apply. The top comment looks fine to notify the source of the modified file and how it was changed. But again, I don't have much experience in this so I can't be sure. [1]: https://github.com/numba/llvmlite/pull/1009/files#diff-80b149f35cebd583e21dfc49c0007a7fab89c3c6d07c028e4a87de0848aa2ed8
Re: Test to dump and restore objects left behind by regression
On Mon, Sep 09, 2024 at 03:43:58PM +0530, Ashutosh Bapat wrote: > 894be11adfa60ad1ce5f74534cf5f04e66d51c30 changed the schema in which > objects in test genereated_stored.sql are created. Because of this the > new test added by the patch was failing. Fixed the failure in the > attached. On my laptop, testing the plain format adds roughly 12s, in a test that now takes 1m20s to run vs 1m32s. Enabling regress_dump_formats and adding three more formats counts for 45s of runtime. For a test that usually shows up as the last one to finish for a heavily parallelized run. So even the default of "plain" is going to be noticeable, I am afraid. + test_regression_dump_restore($oldnode, %node_params); Why is this only done for the main regression test suite? Perhaps it could be useful as well for tests that want to check after their own custom dumps, as a shortcut? Linked to that. Could there be some use in being able to pass down a list of databases to this routine, rather than being limited only to "regression"? Think extension databases with USE_MODULE_DB that have unique names. + # Dump the original database in "plain" format for comparison later. The + # order of columns in COPY statements dumped from the original database and [...] + # Adjust the CREATE TABLE ... INHERITS statements. + if ($original) + { + $dump =~ s/(^CREATE\sTABLE\sgenerated_stored_tests\.gtestxx_4\s\() + (\n\s+b\sinteger), + (\n\s+a\sinteger)/$1$3,$2/mgx; The reason why $original exists is documented partially in both 002_pg_upgrade.pl and AdjustDump.pm. It would be better to consolidate that only in AdjustDump.pm, I guess. Isn't the name "$original" a bit too general when it comes to applying filters to the dumps to as the order of the column re-dumped is under control? Perhaps it would be adapted to use a hash that can be extended with more than one parameter to control which filters are applied? For example, imagine a %params where the caller of adjust_dumpfile() can pass in a "filter_column_order => 1". The filters applied to the dump are then self-documented. We could do with a second for the whitespaces, as well. What's the advantage of testing all the formats? Would that stuff have been able to catch up more issues related to specific format(s) when it came to the compression improvements with inheritance? I'm wondering if it would make sense to also externalize the dump comparison routine currently in the pg_upgrade script. Perhaps we should be more ambitious and move more logic into AdjustDump.pm? If we think that the full cycle of dump -> restore -> dump -> compare could be used elsewhere, this would limit the footprint of what we are doing in the pg_upgrade script in this patch and be able to do similar stuff in out-of-core extensions or other tests. Let's say a PostgreSQL::Test::Dump.pm? -- Michael signature.asc Description: PGP signature
Re: Pgoutput not capturing the generated columns
On Thu, Oct 31, 2024 at 9:55 PM Ajin Cherian wrote: > I ran some tests and verified that the patch works with previous versions > of PG12 and PG17 > 1. Verified with publications with generated columns and without generated > columns on patched code and subscriptions on PG12 and PG17 > Observations: > a. If publication is created with publish_generated_columns=true or > with generated columns mentioned explicitly, then tablesync will not copy > generated columns but post tablesync the generated columns are replicated > b. Column list override (publish_generated_columns=false) behaviour > > These seem expected. > > Currently the documentation does not talk about this behaviour; I suggest this be added, similar to how such a behaviour was documented when the original row-filter version was committed. Suggestion: "If a subscriber is a pre-18 version, the initial table synchronization won't publish generated columns even if they are defined in the publisher." regards, Ajin Cherian Fujitsu Australia
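For context, here is a sketch of the kind of setup being tested above; the publish_generated_columns option belongs to the patch under discussion rather than to any released version, and the object names are made up:

```
-- Publisher (patched build): a table with a stored generated column,
-- published with the option under discussion.
CREATE TABLE tab_gen (a int PRIMARY KEY,
                      b int GENERATED ALWAYS AS (a * 2) STORED);
CREATE PUBLICATION pub_gen FOR TABLE tab_gen
    WITH (publish_generated_columns = true);

-- Per the observations above, when the subscriber is a pre-18 version
-- (e.g. PG12 or PG17), the initial table synchronization does not copy
-- the generated column, while changes replicated after the sync do
-- include it.
```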
Time to add a Git .mailmap?
When looking at our Git tree for a recent conference presentation I happened to notice that we have recently gained duplicate names in the shortlog. Not sure if we care enough to fix that with a .mailmap, but if we do, the attached diff makes sure that all commits are attributed to a single committer entry. -- Daniel Gustafsson mailmap.diff Description: Binary data
Re: Count and log pages set all-frozen by vacuum
On Wed, Oct 30, 2024 at 2:42 PM Melanie Plageman wrote: > > Vacuum currently counts and logs the number of pages of a relation > with newly frozen tuples. It does not count the number of pages newly > set all-frozen in the visibility map. > > The number of pages set all-frozen in the visibility map by a given > vacuum can be useful when analyzing which normal vacuums reduce the > number of pages future aggressive vacuum need to scan. +1 Some small suggestions: + vmbits = visibilitymap_set(vacrel->rel, blkno, buf, + InvalidXLogRecPtr, + vmbuffer, InvalidTransactionId, + VISIBILITYMAP_ALL_VISIBLE | + VISIBILITYMAP_ALL_FROZEN); How about using "old_vmbits" or something along the same lines to make it clear that this value has the bits before visibilitymap_set()? --- + /* If it wasn't all-frozen before, count it */ + if (!(vmbits & VISIBILITYMAP_ALL_FROZEN)) To keep consistent with surrounding codes in lazyvacuum.c, I think we can write "if ((vmbits & VISIBILITYMAP_ALL_FROZEN) == 0)" instead. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: UUID v7
On Sat, Oct 26, 2024 at 10:05 AM Andrey M. Borodin wrote: > > > > > --- > > - oid | proname | oid | proname > > --+-+-+- > > -(0 rows) > > + oid | proname | oid | proname > > +--+-+--+- > > + 9896 | uuidv7 | 9897 | uuidv7 > > +(1 row) > > > > I think that we need to change these functions so that this check > > query doesn't return anything, no? > > We have 4 options: > 0. Remove uuidv7(interval). But it brings imporatne functionality to the > table: we can avoid contention points while massively insert data. > 1. Give different names to uuidv7() and uuidv7(interval). > 2. Allow importing pg_node_tree (see v7 of the patch) > 3. Change this query. Comment to this query suggest that it checks for > exactly this case: same function is declared with different number of > arguments. > > IMO approach number 3 is best. However, I do not understand why this query > check was introduced in the first place. Maybe, there are string arguments > why we should not do same-named functions with different number of arguments. > I think we typically avoid this kind of check failure by assigning uuidv7() and uuidv7(interval) different C functions that call the common function. That is, we have pg_proc entries like: { oid => '9896', descr => 'generate UUID version 7', proname => 'uuidv7', proleakproof => 't', provolatile => 'v', prorettype => 'uuid', proargtypes => '', prosrc => 'uuidv7' }, { oid => '9897', descr => 'generate UUID version 7 with a timestamp shifted on specific interval', proname => 'uuidv7', proleakproof => 't', provolatile => 'v', prorettype => 'uuid', proargtypes => 'interval', prosrc => 'uuidv7_interval' }, And then have C codes like: static Datum generate_uuidv7(FunctionCallInfo fcinfo) { static uint64 previous_ns = 0;: : PG_RETURN_UUID_P(uuid); } Datum uuidv7(PG_FUNCTION_ARGS) { return generate_uuidv7(fcinfo); } Datum uuidv7_interval(PG_FUNCTION_ARGS) { return generate_uuidv7(fcinfo); } > > > > --- > > + if (version == 6) > > + { > > + tms = ((uint64) uuid->data[0]) << 52; > > + tms += ((uint64) uuid->data[1]) << 44; > > + tms += ((uint64) uuid->data[2]) << 36; > > + tms += ((uint64) uuid->data[3]) << 28; > > + tms += ((uint64) uuid->data[4]) << 20; > > + tms += ((uint64) uuid->data[5]) << 12; > > + tms += (((uint64) uuid->data[6]) & 0xf) << 8; > > + tms += ((uint64) uuid->data[7]); > > + > > + /* convert 100-ns intervals to us, then adjust */ > > + ts = (TimestampTz) (tms / 10) - > > + ((uint64) POSTGRES_EPOCH_JDATE - GREGORIAN_EPOCH_JDATE) * > > SECS_PER_DAY * USECS_PER_SEC; > > + > > + PG_RETURN_TIMESTAMPTZ(ts); > > + } > > > > It's odd to me that only uuid_extract_timestamp() supports UUID v6 in > > spite of not supporting UUID v6 generation. I think it makes more > > sense to support UUID v6 generation as well, if the need for it is > > high. > > RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with > providing implementation, it's trivial. PFA patch with implementation. > My point is that we should either support full functionality for UUIDv6 (such as generation and extraction) or none of them. I'm not really sure we want UUIDv6 as well, but if we want it, it should be implemented in a separate patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
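For reference, the user-visible behaviour these catalog entries are meant to provide is plain overloading by argument count; a sketch of the intended usage, based on the patch as posted (uuidv7() is not in any released version, while uuid_extract_timestamp() already exists):

```
-- Generate a version-7 UUID for the current time.
SELECT uuidv7();

-- Generate a version-7 UUID with the timestamp shifted by an interval,
-- e.g. to spread values and avoid contention on the same index page.
SELECT uuidv7(INTERVAL '1 hour');

-- The embedded timestamp can be recovered with the existing extraction
-- function.
SELECT uuid_extract_timestamp(uuidv7());
```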
Re: MultiXact\SLRU buffers configuration
On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin wrote: > > > > > On 29 Oct 2024, at 21:45, Thom Brown wrote: > > > > It clearly checks for interrupts, but when I saw this issue happen, it > > wasn't interruptible. > > Perhaps, that was different multixacts? > When I was observing this pathology on Standby, it was a stream of different > reads encountering different multis. > > Either way startup can cancel locking process on it's own. Or is it the case > that cancel was not enough, did you actually need termination, not cancel? Termination didn't work on either of the processes. Thom
Re: Using read stream in autoprewarm
> On 8 Aug 2024, at 11:32, Nazir Bilal Yavuz wrote: > > Any feedback would be appreciated. I've taken a look at the patch. It seems to me that you add new block numbers to the read stream as long as you have buffers. So when there are no more buffers you will still have some queued blocks. Maybe you can change the logic so that the number of free buffers must be enough to allocate all blocks in the look-ahead distance? Thanks! Best regards, Andrey Borodin.
Re: [PATCH] Improve code coverage of network address functions
Hi Stepan, > Hello Aleksander! I'm a beginner and I would like to see and try your patch. > However, I have never worked with coverage in regression tests for > PostgreSQL. Could you please tell me how it works and help me understand the > process? Thanks! You are going to need some Linux distribution, GCC stack and lcov 1.16 [1]. (Lcov doesn't work with Clang; We seem to experience some problems with lcov 2.0+, this is being investigated [2]) Apply the patch as usual ( git clone git://git.postgresql.org/git/postgresql.git ; git am ... ) and run: ``` git clean -dfx && meson setup --buildtype debug -DPG_TEST_EXTRA="kerberos ldap ssl" -Db_coverage=true -Dldap=disabled -Dssl=openssl -Dcassert=true -Dtap_tests=enabled -Dprefix=/home/eax/pginstall build && ninja -C build && PG_TEST_EXTRA=1 meson test -C build && ninja -C build coverage-html ``` Note the `-Db_coverage=true` and `ninja -C build coverage-html` parts. Change `prefix` to an appropriate one. This will take longer than usual. Your coverage report will be located in build/meson-logs/coveragereport. You can compare the reports with and without the patch to ensure that the patch improves code coverage. [1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16 [2]: https://postgr.es/m/CAJ7c6TN%2BMCh99EZ8YGhXZAdnqvNQYir6E34B_mmcB5KsxCB00A%40mail.gmail.com -- Best regards, Aleksander Alekseev
Re: "command cannot affect row a second time" in INSERT ... ON CONFLICT
Hi Karthik, > I am looking to better understand the applicability of the error message > "command cannot affect row a second time". > > Consider the following table and data: > CREATE TABLE ioc (i int, UNIQUE(i)); > INSERT INTO ioc VALUES (1); > > The following two queries produce different errors: > Query 1 > postgres=# INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET i = > 20; > ERROR: 21000: ON CONFLICT DO UPDATE command cannot affect row a second time > HINT: Ensure that no rows proposed for insertion within the same command > have duplicate constrained values. > > Query 2 > postgres=# INSERT INTO ioc VALUES (20), (1) ON CONFLICT (i) DO UPDATE SET i = > 20; > ERROR: 23505: duplicate key value violates unique constraint "ioc_i_key" > DETAIL: Key (i)=(20) already exists. Not sure if it will answer your question *entirely* but you will find a bit more detail about "cannot affect row a second time" in the discussion [1]. This error has nothing to do with unique constraints, so I think you trigger one of two errors depending on the order of inserted rows and the content of your table. This being said, I didn't investigate your scenario in much detail. [1]: https://www.postgresql.org/message-id/flat/CAJ7c6TPQJNFETz9H_qPpA3x7ybz2D1QMDtBku_iK33gT3UR34Q%40mail.gmail.com -- Best regards, Aleksander Alekseev
Re: WIP: parallel GiST index builds
> On 8 Oct 2024, at 17:05, Tomas Vondra wrote:
>
> Here's an updated patch adding the queryid.

I've taken another round of looking through the patch. Everything I see seems correct to me.

It just occurred to me that we will have: buffered build, parallel build, sorting build. All 3 aiming to speed things up. I really hope that we will find a way to parallelize the sorting build, because it will be the fastest for sure.

Currently we have some instances of such code... or similar... or related code:

    if (is_build)
    {
        if (is_parallel)
            recptr = GetFakeLSNForUnloggedRel();
        else
            recptr = GistBuildLSN;
    }
    else
    {
        if (RelationNeedsWAL(rel))
        {
            recptr = actuallyWriteWAL();
        }
        else
            recptr = gistGetFakeLSN(rel);
    }
    // Use recptr

In a previous review I was a proponent of not adding arguments to gistGetFakeLSN(). But now I see that the logic of choosing the LSN source is sprawling across various places... Now I do not have a strong point of view on this. Do you think something like the following would be clearer?

    if (!is_build && RelationNeedsWAL(rel))
    {
        recptr = actuallyWriteWAL();
    }
    else
        recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea. I'm mostly looking at the GiST-specific parts of the patch, while the things around entering parallel mode seem much more complicated. But as far as I can see, large portions of this code are taken from B-tree/BRIN.

Best regards, Andrey Borodin.
Re: [PATCH] Improve code coverage of network address functions
Hello Aleksander! I'm a beginner and I would like to see and try your patch. However, I have never worked with coverage in regression tests for PostgreSQL. Could you please tell me how it works and help me understand the process? Thanks! Best Regards, Stepan Neretin!
Re: RFC: Extension Packaging & Lookup
Fellow Humans,

I'm working on an updated proposal that is more detailed and more comprehensive. But I keep getting a bit caught up on this bit:

On Oct 28, 2024, at 18:19, David E. Wheeler wrote:
>> * Binary-only extensions might also be installed here; the difference is
>> they have no control file. The LOAD command and shared_preload_libraries
>> would need to know to look here, too.
>
> Or perhaps we should require a control file for these, too, but add a "type"
> key or some such? Maybe such a shared module could be supported by CREATE
> EXTENSION as well, but not include SQL files?

I'm trying to imagine how this ought to work. The challenge is that, with the layout I propose here, shared module files will no longer always be in `$dynamic_library_path`, but in any `$extension/pkglib` subdirectory of each subdirectory of `extension_path`, as well. Is that desirable?

Let's say we want to load a module named "semver" that's included in the semver extension. With the proposed organization up-thread, the module would be installed in:

```
$extdir_user/semver/pkglib/semver.(so|dylib|dll|etc)
```

What should be passed to preload/LOAD to load it? A few options:

Option 1

* Specify the module name "semver" in the `LOAD` command or in `*_preload_libraries` (same as in 17 and earlier)
* Teach the preload/LOAD code to search for the module file in `*/pkglib/` under each extension path

Pros:

* Follows the existing module name specification in preload/LOAD

Cons:

* Potentially huge number of directories to search, when lots of extensions are installed.
* Depending on search order, the wrong module may be loaded if two extensions have a module file with the same name

Option 2

* Specify the module name to include the extension name. Perhaps something like `$extension:$module`.
* Teach the preload/LOAD code to detect the extension name as part of the command and only look for the DSO in that extension's `pkglibdir`.

Pros:

* Searches at most the list of directories in the `extension_path`.
* No conflicts with any other module files from other extensions.

Cons:

* Overloads the meaning of preload/LOAD strings, which might be confusing to some.
* Upgrades might need these values to change from the old to the new syntax.

Other Options?
--

I kind of like Option 2, as it would allow us to eventually support non-`CREATE EXTENSION` modules as extensions, too. I imagine distributing, say, `auto_explain` in an extension directory of its own, with an `auto_explain.control` file that identifies it as a LOAD-only extension. Then specifying `auto_explain:auto_explain` would work as expected. Or perhaps just `auto_explain:` could load *all* the modules included in the auto_explain "extension".

But then maybe I'm starting to talk myself into arguing that `LOAD` ought to be deprecated, `CREATE EXTENSION` could support non-SQL extensions, and the `*preload*` GUCs would contain a list of extensions rather than module files. But I digress.

Any ideas about other options to address this design challenge?

Thanks, David
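To make Option 2 concrete, here is what the proposed spelling might look like from SQL; none of this exists today, it only illustrates the `$extension:$module` idea under the hypothetical "semver" layout above:

```
-- Hypothetical Option 2 syntax: qualify the module name with the extension
-- that ships it, so only that extension's pkglib directory is searched.
LOAD 'semver:semver';

-- The same spelling could then be used in the preload GUCs, e.g.:
--   shared_preload_libraries = 'auto_explain:auto_explain'
```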
Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description
On Thu, Oct 24, 2024 at 08:32:54AM +1100, Peter Smith wrote: > Hi. Here are a couple of minor comments. > > 1. > +The time slot synchronization (see +linkend="logicaldecoding-replication-slots-synchronization"/>) > +was most recently stopped. > > /The time slot/The time when slot/ > > ~~~ > > 2. > - /* The time since the slot has become inactive */ > + /* The time slot sychronized was stopped. */ > > > Maybe just make this comment the same as the sentence used in the docs: > - e.g. add "when"; fix, typo "sychronized", etc. > > > /The time slot sychronized was stopped./The time when slot > synchronization was most recently stopped./ Yes, all good suggestions, updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?" diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 61d28e701f2..d853f7fe49b 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2435,7 +2435,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active bool - True if this slot is currently actively being used + True if this slot is currently being streamed @@ -2444,9 +2444,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active_pid int4 - The process ID of the session using this slot if the slot - is currently actively being used. NULL if - inactive. + The process ID of the session streaming data for this slot. + NULL if inactive. @@ -2566,15 +2565,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx inactive_since timestamptz -The time since the slot has become inactive. -NULL if the slot is currently being used. -Note that for slots on the standby that are being synced from a -primary server (whose synced field is -true), the -inactive_since indicates the last -synchronization (see -) -time. +The time when slot synchronization (see ) +was most recently stopped. NULL if the slot +has always been synchronized. This is useful for slots on the +standby that are being synced from a primary server (whose +synced field is true) +so they know when the slot stopped being synchronized. diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index d62186a5107..f4f80b23129 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1515,7 +1515,7 @@ update_synced_slots_inactive_since(void) * correctly interpret the inactive_since if the standby gets promoted * without a restart. We don't want the slots to appear inactive for a * long time after promotion if they haven't been synchronized recently. - * Whoever acquires the slot i.e.makes the slot active will reset it. + * Whoever acquires the slot, i.e., makes the slot active, will reset it. */ if (!StandbyMode) return; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 45582cf9d89..90ea2979b87 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -205,7 +205,7 @@ typedef struct ReplicationSlot */ XLogRecPtr last_saved_confirmed_flush; - /* The time since the slot has become inactive */ + /* The time when slot synchronization was most recently stopped. */ TimestampTz inactive_since; } ReplicationSlot;
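For anyone verifying the behaviour being documented, the relevant columns can be inspected on the standby with a query along these lines; the interpretation of inactive_since for synced slots is exactly what the wording above tries to pin down:

```
-- On a standby with slot synchronization enabled, show which slots are
-- synced from the primary and when synchronization last stopped.
SELECT slot_name, active, active_pid, synced, inactive_since
FROM pg_replication_slots;
```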
Re: "command cannot affect row a second time" in INSERT ... ON CONFLICT
On Thu, Oct 31, 2024 at 9:52 AM Karthik Ramanathan < karthikram.3...@gmail.com> wrote: > I am looking to better understand the applicability of the error message > "command cannot affect row a second time". > > Consider the following table and data: > CREATE TABLE ioc (i int, UNIQUE(i)); > INSERT INTO ioc VALUES (1); > > The following two queries produce different errors: > *Query 1* > postgres=# INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET > i = 20; > ERROR: 21000: ON CONFLICT DO UPDATE command cannot affect row a second > time > HINT: Ensure that no rows proposed for insertion within the same command > have duplicate constrained values. > Right, id 1 exists, you insert id 1 again, that row becomes id 20, then you attempt to insert id 20 again, which conflicts, and the system attempts to update the 1-become-20 row to 20 but fails to perform the update since that now-existing row was already modified in this statement (it was inserted). You don't get a duplicate key error because the second modification condition is more general and thus triggers first. I.e., that error has to happen regardless of whether a duplicate key error condition was going to happen or not (e.g., you could have done something like "set i = i * 20" - not tested) > *Query 2* > postgres=# INSERT INTO ioc VALUES (20), (1) ON CONFLICT (i) DO UPDATE SET > i = 20; > ERROR: 23505: duplicate key value violates unique constraint "ioc_i_key" > DETAIL: Key (i)=(20) already exists. > Here the insertion of id 20 happens just fine, then inserting id 1 conflicts, the existing row with id 1 gets updated to id 20 which results in a duplicate key violation. > 1. Is there a different reason the two queries produce a different error? > First error condition wins. Multiple modification gets tested first, before checking whether the outcome of a modification would result in a duplicate. 2. Is there a better way to think about the "command cannot affect row a > second time"? Appreciate any guidance. Thanks. > > A row inserted or updated in a statement cannot be subsequently modified in that same statement. I don't actually understand how you are presently thinking about this... Apparently the algorithm for merge is able to avoid impacting the same row twice and thus if the underlying DML is going to produce a duplicate key violation that is what you will see. I hesitate to claim you'd never see a multi-update scenario but do find it reasonable that it would be less prone to it. David J.
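A quick way to see that the 21000 error is about modifying the same row twice rather than about duplicates is to use an update expression that only collides with a row already touched earlier in the same statement, along the lines of the untested "set i = i * 20" idea above; the expected outcome is hedged here as well, since it was not run:

```
-- Starting from the same setup, where table ioc contains only the row (1).
-- Row (1) conflicts and is updated to 20; the later row (20) then finds that
-- freshly updated tuple as its conflict target, so this would presumably
-- fail with "ON CONFLICT DO UPDATE command cannot affect row a second time"
-- even though no duplicate key error is involved.
INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET i = i * 20;
```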
Re: Changing the default random_page_cost value
On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote: > On Mon, Oct 14, 2024 at 5:15 PM Bruce Momjian wrote: > > I am not a fan of this patch. I don't see why _removing_ the magnetic > part helps because you then have no logic for any 1.2 was chosen. > > > Okay, but we have no documented logic on why 4.0 was chosen either. :) Uh, we do, and it is in the docs: Random access to mechanical disk storage is normally much more expensive than four times sequential access. However, a lower default is used (4.0) because the majority of random accesses to disk, such as indexed reads, are assumed to be in cache. The default value can be thought of as modeling random access as 40 times slower than sequential, while expecting 90% of random reads to be cached. You surely saw this when you created the patch and removed the text. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Re: meson and check-tests
Hi Ashutosh and Jian, Sorry for the late reply and thanks for the feedback! On Fri, 4 Oct 2024 at 16:13, jian he wrote: > > v3, 0001 documentation: > We can at least write something on > https://wiki.postgresql.org/wiki/Meson about this feature. I agree. It seems that I need to apply for an editor access on the wiki page. I am planning to do that after the commit. > TESTS='check check_btree' meson test amcheck/regress --verbose > works, but I feel like there is a discoverability issue . > > TESTS='check check_btree' meson test amcheck/regress --list --verbose > shows: > postgresql:amcheck / amcheck/regress > > > For small tests, listing all the available tests would be great. I agree but unfortunately this seems not possible. 'meson test --list' option only shows tests that are registered to meson build (i.e. test() calls in the meson.build files). On Fri, 4 Oct 2024 at 18:40, Ashutosh Bapat wrote: > --schedule or the test selection becomes part of the test command > itself in the current master. By passing it as an argument, in these > patches, we are allowing those to be overridden if TESTS is set at the > time of running the test. I like this idea. I was pondering whether we > really need two separate arguments --schedule and --tests > respectively. IIUC, we can't pass them as a single argument (say > --test-selection or --selection) because the subsequent --schedule > would be treated as a separate argument if not quoted correctly. That > isn't a problem with value of tests. To avoid quoting '--schedule > ...', we have two separate arguments. Am I right? Yes, that is exactly why we have both '--schedule' and '--tests' flags. Also, a comment is added to clarify this. > It might be better to make this explicit in the code -- by making sure > that only one of them is passed and writing a comment about it. > ArgumentParser might have some trick to specify that passing both the > arguments is an error. I did not understand why only one of them needed to be passed at a time. For example in ecpg tests (src/interfaces/ecpg/test/meson.build), both '--schedule' and '--tests' options are passed. > Also I don't think "Note that setup > # suite tests (at least tmp_install and initdb_cache tests) may need to be run > # before running these tests." is needed. This is true irrespective of > whether we specify TESTS or not. Done. -- Regards, Nazir Bilal Yavuz Microsoft From f0e035636ae02b7fe0668a0b2246b080656a26e5 Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 31 Oct 2024 16:21:23 +0300 Subject: [PATCH v4 1/2] Add 'make check-tests' behavior to the meson based builds There was no way to run specific regression tests in the regress/regress tests in the meson based builds. Add this behavior. Author: Nazir Bilal Yavuz Reviewed-by: Ashutosh Bapat Reviewed-by: Jian He Discussion: postgr.es/m/CAExHW5tK-QqayUN0%2BN3MF5bjV6vLKDCkRuGwoDJwc7vGjwCygQ%40mail.gmail.com --- meson.build| 14 -- src/tools/testwrap | 10 ++ 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/meson.build b/meson.build index bb9d7f5a8e8..c9f91a0ee22 100644 --- a/meson.build +++ b/meson.build @@ -3399,11 +3399,11 @@ foreach test_dir : tests '--dbname', dbname, ] + t.get('regress_args', []) - test_selection = [] - if t.has_key('schedule') -test_selection += ['--schedule', t['schedule'],] - endif + # To avoid '--schedule' option to be treated as a separate argument in + # the testwrap script if not quoted correctly. 
+ test_schedule = t.get('schedule', []) + test_selection = [] if kind == 'isolation' test_selection += t.get('specs', []) else @@ -3427,12 +3427,13 @@ foreach test_dir : tests testwrap_base, '--testgroup', test_group, '--testname', kind, + '--schedule', test_schedule, + '--tests', test_selection, '--', test_command_base, '--outputdir', test_output, '--temp-instance', test_output / 'tmp_check', '--port', testport.to_string(), - test_selection, ], suite: test_group, kwargs: test_kwargs, @@ -3447,10 +3448,11 @@ foreach test_dir : tests testwrap_base, '--testgroup', test_group_running, '--testname', kind, +'--schedule', test_schedule, +'--tests', test_selection, '--', test_command_base, '--outputdir', test_output_running, -test_selection, ], is_parallel: t.get('runningcheck-parallel', true), suite: test_group_running, diff --git a/src/tools/testwrap b/src/tools/testwrap index 9a270beb72d..0ab9f5dada9 100755 --- a/src/tools/testwrap +++ b/src/tools/testwrap @@ -12,6 +12,8 @@ parser.add_argument('--srcdir', help='source directory of test', type=str) parser.add_argument('--basedir', help='base directory of tes
[PATCH] Add support for displaying database service in psql prompt
Hi, I think it might be useful to sometimes display the database service (i.e. what is defined in ~/.pg_service.conf and used via psql service=foo) in the psql prompt, e.g. if you have 'test' and 'prod' services, but both have the same database name. This was also suggested to me during a recent customer training. I chose the '%s' tag for it. I had to add the service to PGConn as PQservice (first patch) to libpq and then use it in psql in the second patch. Michael >From f876195acb797a5ac58c17409fdd75d18581c292 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Thu, 31 Oct 2024 18:27:52 +0100 Subject: [PATCH v1 1/2] Add PQservice to PGConn. This adds the content of the database service (if any) to PGConn. One use for this would be for psql to display the service as part of the prompt. --- src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-connect.c | 11 ++- src/interfaces/libpq/libpq-fe.h | 1 + src/interfaces/libpq/libpq-int.h | 1 + 4 files changed, 13 insertions(+), 1 deletion(-) diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index 5d8213e0b5..2ad2cbf5ca 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -205,3 +205,4 @@ PQcancelFinish202 PQsocketPoll 203 PQsetChunkedRowsMode 204 PQgetCurrentTimeUSec 205 +PQservice 206 diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 61c025ff3b..8d809ee4cb 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -190,7 +190,8 @@ typedef struct _internalPQconninfoOption static const internalPQconninfoOption PQconninfoOptions[] = { {"service", "PGSERVICE", NULL, NULL, - "Database-Service", "", 20, -1}, + "Database-Service", "", 20, + offsetof(struct pg_conn, pgservice)}, {"user", "PGUSER", NULL, NULL, "Database-User", "", 20, @@ -7052,6 +7053,14 @@ PQdb(const PGconn *conn) return conn->dbName; } +char * +PQservice(const PGconn *conn) +{ + if (!conn) + return NULL; + return conn->pgservice; +} + char * PQuser(const PGconn *conn) { diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h index 15012c770c..5947e7c766 100644 --- a/src/interfaces/libpq/libpq-fe.h +++ b/src/interfaces/libpq/libpq-fe.h @@ -385,6 +385,7 @@ extern int PQrequestCancel(PGconn *conn); /* Accessor functions for PGconn objects */ extern char *PQdb(const PGconn *conn); +extern char *PQservice(const PGconn *conn); extern char *PQuser(const PGconn *conn); extern char *PQpass(const PGconn *conn); extern char *PQhost(const PGconn *conn); diff --git a/src/interfaces/libpq/libpq-int.h b/src/interfaces/libpq/libpq-int.h index 08cc391cbd..dcebca9898 100644 --- a/src/interfaces/libpq/libpq-int.h +++ b/src/interfaces/libpq/libpq-int.h @@ -394,6 +394,7 @@ struct pg_conn char *fbappname; /* fallback application name */ char *dbName; /* database name */ char *replication; /* connect as the replication standby? */ + char *pgservice; /* Postgres service, if any */ char *pguser; /* Postgres username and password, if any */ char *pgpass; char *pgpassfile; /* path to a file containing password(s) */ -- 2.39.5 >From 3257e93d20353ff348b15df9b45207ec45839ed5 Mon Sep 17 00:00:00 2001 From: Michael Banck Date: Thu, 31 Oct 2024 18:49:14 +0100 Subject: [PATCH v1 2/2] Add support for database service to psql prompt. This adds a new psql variable SERVICE as well as the string '%s' for the psql PROMPT, displaying the service (from PGSERVICEFILE) if so desired. 
--- src/bin/psql/command.c | 2 ++ src/bin/psql/prompt.c | 6 ++ 2 files changed, 8 insertions(+) diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 328d78c73f..e8f2573649 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -4085,6 +4085,7 @@ SyncVariables(void) pset.sversion = PQserverVersion(pset.db); SetVariable(pset.vars, "DBNAME", PQdb(pset.db)); + SetVariable(pset.vars, "SERVICE", PQservice(pset.db)); SetVariable(pset.vars, "USER", PQuser(pset.db)); SetVariable(pset.vars, "HOST", PQhost(pset.db)); SetVariable(pset.vars, "PORT", PQport(pset.db)); @@ -4118,6 +4119,7 @@ void UnsyncVariables(void) { SetVariable(pset.vars, "DBNAME", NULL); + SetVariable(pset.vars, "SERVICE", NULL); SetVariable(pset.vars, "USER", NULL); SetVariable(pset.vars, "HOST", NULL); SetVariable(pset.vars, "PORT", NULL); diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c index 0d99d00ac9..de3d976103 100644 --- a/src/bin/psql/prompt.c +++ b/src/bin/psql/prompt.c @@ -33,6 +33,7 @@ * %p - backend pid * %> - database server port number * %n - database user name + * %s - service * %/ - current database * %~ - like %/ but "~" when database name equals user name * %w - whitespace of the same width as the most recent output of PROMPT1 @@ -246,6 +247,11 @@ get_prompt(promptStatus_t status, ConditionalStack c
Re: Count and log pages set all-frozen by vacuum
On Thu, 31 Oct 2024 at 15:26, Melanie Plageman wrote: > On Thu, Oct 31, 2024 at 11:15 AM Peter Geoghegan wrote: > > > > On Thu, Oct 31, 2024 at 10:22 AM Melanie Plageman > > wrote: > > > It seems a better solution would be to find a way to document it or > > > phrase it clearly in the log. It is true that these pages were set > > > all-frozen in the VM. It is just that some of them were later removed. > > > And we don't know which ones those are. Is there a way to make this > > > clear? > > > > Probably not, but I don't think that it's worth worrying about. ISTM > > that it's inevitable that somebody might get confused whenever we > > expose implementation details such as these. This particular example > > doesn't seem particularly concerning to me. > > > > Fundamentally, the information that you're showing is a precisely > > accurate account of the work performed by VACUUM. If somebody is > > bothered by the fact that we're setting VM bits for pages that just > > get truncated anyway, then they're bothered by the reality of what > > VACUUM does -- and not by the instrumentation itself. > > Makes sense to me. Though, I'm looking at it as a developer. > For user consumption, to reduce the number of puzzled questions, I'd suggest adding a line to the log output of the form visibility map: %u pages set all frozen, up to %u may have been removed from the table rather than appending the info to the frozen: line of the output. By spelling visibility map out in full it gives the curious user something specific enough to look up. It also at least alerts the user to the fact that the number can't just be subtracted from a total.
Re: Changing the default random_page_cost value
Bruce Momjian writes: > On Thu, Oct 24, 2024 at 08:01:11PM -0400, Greg Sabino Mullane wrote: >> Okay, but we have no documented logic on why 4.0 was chosen either. :) > Uh, we do, and it is in the docs: > Random access to mechanical disk storage is normally much more > expensive > than four times sequential access. However, a lower default is used > (4.0) because the majority of random accesses to disk, such as indexed > reads, are assumed to be in cache. The default value can be thought > of > as modeling random access as 40 times slower than sequential, while > expecting 90% of random reads to be cached. Meh. Reality is that that is somebody's long-after-the-fact apologia for a number that was obtained by experimentation. regards, tom lane
Re: Fix for consume_xids advancing XIDs incorrectly
On Wed, Oct 30, 2024 at 12:00 AM Yushi Ogiwara wrote: > > I made a new patch (skip_xid_correctly.diff) that incorporates the > points we discussed: > > 1. Fix the issue that consume_xids consumes nxids+1 XIDs. > 2. Update lastxid when calling GetTopFullTransactionId() to support > nxids==1 case. > 3. Forbid consume_xids when nxids==0. > 4. Add comments explaining the return values of consume_xids and > consume_xids_until, and the rationale for incrementing consumed++ when > GetTopFullTransactionId() is called. > > Also, I made another patch (support_blksz_32k.diff) that supports the > block size == 32K case. > Thank you for making patches! Here are review comments. skip_xid_correctly.diff: - if (nxids < 0) + if (nxids <= 0) elog(ERROR, "invalid nxids argument: %lld", (long long) nxids); - - if (nxids == 0) - lastxid = ReadNextFullTransactionId(); else lastxid = consume_xids_common(InvalidFullTransactionId, (uint64) nxids); After calling elog(ERROR) we don't return here, so the "else" is not necessary. That is, we can rewrite it to: if (nxids <= 0) elog(ERROR, "invalid nxids argument: %lld", (long long) nxids); lastxid = consume_xids_common(InvalidFullTransactionId, (uint64) nxids); --- +* +* If no top-level XID is assigned, a new one is obtained, +* and the consumed XID counter is incremented. I'm not sure this comment is appropriate since what we do here is obvious and the comment doesn't explain why we do that. IMO we don't need to update these comments. support_blksz_32k.diff: - if (xids_left > 2000 && + if (xids_left > COMMIT_TS_XACTS_PER_PAGE && I think it's better to just replace 2000 with 4000 and explain why 4000 is sufficient. Probably we can define '#define XID_SHORTCUT_THRESHOLD 4000" and put an assertion to XidSkip() or consume_xids_common() to check if the number of consumed XIDs doesn't exceed XID_SHORTCUT_THRESHOLD. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
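For anyone following along, these helpers are exposed as SQL functions by the xid_wraparound test module (src/test/modules/xid_wraparound), so the behaviour changes above can be exercised roughly like this, assuming that test module has been built and installed (it is not shipped by default):

```
CREATE EXTENSION xid_wraparound;

-- Consume exactly 1000 XIDs and return the last assigned transaction id.
SELECT consume_xids(1000);

-- With the patch, asking for zero XIDs is rejected instead of silently
-- returning the next transaction id.
SELECT consume_xids(0);  -- expected to fail with "invalid nxids argument"
```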
Re: Multi delete wal IDEA
On Thu, Oct 31, 2024 at 11:42 AM Stepan Neretin wrote:
>
> Hi there, hackers! How about trying out an idea to add an analogue of the
> multi-insert optimization to save WAL volume when deleting records? This
> patch is trying to do just that.

Hi, Thanks for the patch! Could you explain a bit more how this patch works? I don't see changes in your patch to heap_delete() where WAL is emitted, so I'm not sure I understand how it works. And could you provide a small example or repro? It makes it much easier to review and understand.

- Melanie
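As a generic way to quantify what such an optimization could save, one can compare the WAL generated by a bulk DELETE on an unpatched and a patched build; this is only a measurement sketch, not the author's benchmark:

```
-- Measure WAL volume produced by a bulk DELETE.
CREATE TABLE wal_probe AS
    SELECT g AS id, repeat('x', 100) AS filler
    FROM generate_series(1, 100000) g;

CREATE TEMP TABLE lsn_before AS SELECT pg_current_wal_insert_lsn() AS lsn;

DELETE FROM wal_probe;

SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), lsn))
       AS wal_generated_by_delete
FROM lsn_before;
```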
Re: Using read stream in autoprewarm
Dear Nazir, At a first quick look it looks good. I will take a closer look at it tomorrow. Could you please let me know about the performance tests and graphs? Best regards, Stepan Neretin!
Re: UUID v7
> > I think we typically avoid this kind of check failure by assigning > > uuidv7() and uuidv7(interval) different C functions that call the > > common function. That is, we have pg_proc entries like: > > > > Done. > > >>> > >>> It's odd to me that only uuid_extract_timestamp() supports UUID v6 in > >>> spite of not supporting UUID v6 generation. I think it makes more > >>> sense to support UUID v6 generation as well, if the need for it is > >>> high. > >> > >> RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with > providing implementation, it's trivial. PFA patch with implementation. > >> > > > > My point is that we should either support full functionality for > > UUIDv6 (such as generation and extraction) or none of them. I'm not > > really sure we want UUIDv6 as well, but if we want it, it should be > > implemented in a separate patch. > > Make sense. I've removed all traces of v6. > Hi there, Firstly, I'd like to discuss the increased_clock_precision variable, which currently divides the timestamp into milliseconds and nanoseconds. However, this approach only approximates the extra bits for sub-millisecond precision, leading to imprecise timestamps in high-frequency UUID generation. To address this issue, we could consider using a more accurate method for calculating the timestamp. For instance, we could utilize a higher resolution clock or implement a more precise algorithm to ensure accurate timestamps. Additionally, it would be beneficial to add validation checks for the interval argument. These checks could verify that the input interval is within reasonable bounds and that the calculated timestamp is accurate. Examples of checks could include verifying if the interval is too small, too large, or exceeds the maximum possible number of milliseconds and nanoseconds in a timestamp. By implementing these changes, we can improve the accuracy and reliability of UUID generation, making it more suitable for high-frequency usage scenarios. What do you think about these suggestions? Let me know your thoughts! Best Regards, Stepan Neretin!
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, Oct 31, 2024 at 7:51 AM Heikki Linnakangas wrote: > Or keep using "protocol extension" and add a paragraph to the docs to > say explicitly that there's no support for extensions to create protocol > extensions. TLS extensions is a good comparison. Of the three proposed, this last one is my preference. I think it'd be good to draw very clear lines between the transport level, the protocol level, and the application level. Thanks, --Jacob
Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description
On Fri, Oct 18, 2024 at 04:25:33PM +0530, Amit Kapila wrote: > Few comments: > = > 1. > > - True if this slot is currently actively being used > + True if this slot is currently currently being streamed > > > currently shouldn't be used twice. > > 2. > - /* The time since the slot has become inactive */ > + /* The time slot sychronized was stopped. */ > TimestampTz inactive_since; > > Would it be better to say: "The time slot synchronization was stopped."? > > 3. > This is useful for slots on the > +standby that are intended to be synced from a primary server > > I think it is better to be explicit here and probably say: "This is > useful for slots on the standby that are being synced from a primary > server .." Agreed, fixed in the attached patch. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?" diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml index 61d28e701f2..349f8b3d064 100644 --- a/doc/src/sgml/system-views.sgml +++ b/doc/src/sgml/system-views.sgml @@ -2435,7 +2435,7 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active bool - True if this slot is currently actively being used + True if this slot is currently being streamed @@ -2444,9 +2444,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx active_pid int4 - The process ID of the session using this slot if the slot - is currently actively being used. NULL if - inactive. + The process ID of the session streaming data for this slot. + NULL if inactive. @@ -2566,15 +2565,13 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx inactive_since timestamptz -The time since the slot has become inactive. -NULL if the slot is currently being used. -Note that for slots on the standby that are being synced from a -primary server (whose synced field is -true), the -inactive_since indicates the last -synchronization (see -) -time. +The time slot synchronization (see ) +was most recently stopped. NULL if the slot +has always been synchronized. This is useful for slots on the +standby that are being synced from a primary server (whose +synced field is true) +so they know when the slot stopped being synchronized. diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c index d62186a5107..f4f80b23129 100644 --- a/src/backend/replication/logical/slotsync.c +++ b/src/backend/replication/logical/slotsync.c @@ -1515,7 +1515,7 @@ update_synced_slots_inactive_since(void) * correctly interpret the inactive_since if the standby gets promoted * without a restart. We don't want the slots to appear inactive for a * long time after promotion if they haven't been synchronized recently. - * Whoever acquires the slot i.e.makes the slot active will reset it. + * Whoever acquires the slot, i.e., makes the slot active, will reset it. */ if (!StandbyMode) return; diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 45582cf9d89..5666fcfd3cf 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -205,7 +205,7 @@ typedef struct ReplicationSlot */ XLogRecPtr last_saved_confirmed_flush; - /* The time since the slot has become inactive */ + /* The time slot synchronization was stopped. */ TimestampTz inactive_since; } ReplicationSlot;
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, 31 Oct 2024 at 17:09, Robert Haas wrote: > I don't think it's really viable to promise that we'll never talk > about extending anything other than in the context of what CREATE > EXTENSION does. Agreed, but it seems nice to avoid confusion by not overloading terminology, if we can find a good/decent alternative. When we do have an overloaded term, we should definitely document what we mean with it and how it differs from other usage of the term. But even then many people won't have read those docs and assume it means the thing that they expect it to mean. A good example of this is the fact that initdb creates a "database cluster". We definitely document that we use the term that way. But if you talk/write about a database cluster many people (understandably) expect you to mean a very different thing.
Re: MultiXact\SLRU buffers configuration
> On 31 Oct 2024, at 17:29, Thom Brown wrote: > > On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin wrote: >> >> >> >>> On 29 Oct 2024, at 21:45, Thom Brown wrote: >>> >>> It clearly checks for interrupts, but when I saw this issue happen, it >>> wasn't interruptible. >> >> Perhaps, that was different multixacts? >> When I was observing this pathology on Standby, it was a stream of different >> reads encountering different multis. >> >> Either way startup can cancel locking process on it's own. Or is it the case >> that cancel was not enough, did you actually need termination, not cancel? > > Termination didn't work on either of the processes. How did you force the process to actually terminate? Did you observe repeated read of the same multixact? Was offending process holding any locks while waiting? Best regards, Andrey Borodin.
Re: UUID v7
> On 31 Oct 2024, at 22:15, Masahiko Sawada wrote: > > On Sat, Oct 26, 2024 at 10:05 AM Andrey M. Borodin > wrote: > > I think we typically avoid this kind of check failure by assigning > uuidv7() and uuidv7(interval) different C functions that call the > common function. That is, we have pg_proc entries like: > Done. >>> >>> It's odd to me that only uuid_extract_timestamp() supports UUID v6 in >>> spite of not supporting UUID v6 generation. I think it makes more >>> sense to support UUID v6 generation as well, if the need for it is >>> high. >> >> RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with >> providing implementation, it's trivial. PFA patch with implementation. >> > > My point is that we should either support full functionality for > UUIDv6 (such as generation and extraction) or none of them. I'm not > really sure we want UUIDv6 as well, but if we want it, it should be > implemented in a separate patch. Make sense. I've removed all traces of v6. Thanks! Best regards, Andrey Borodin. v29-0001-Implement-UUID-v7.patch Description: Binary data
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:56 AM Alastair Turner wrote: > > For user consumption, to reduce the number of puzzled questions, I'd suggest > adding a line to the log output of the form > > visibility map: %u pages set all frozen, up to %u may have been removed from > the table > > rather than appending the info to the frozen: line of the output. By spelling > visibility map out in full it gives the curious user something specific > enough to look up. It also at least alerts the user to the fact that the > number can't just be subtracted from a total. Would it also be useful to have the number set all-visible? It seems like if we added a new line prefixed with visibility map, it ought to include all-visible and all-frozen then. Something like this: visibility map: %u pages set all-visible, %u pages set all-frozen. I find it more confusing to say "up to X may have been removed from the table." It's unclear what that means -- especially since we already have "pages removed" in another part of the log message. We actually could call visibilitymap_count() at the beginning of the vacuum and then log the difference between that and the results of calling it after finishing the vacuum. We currently call it after truncating the table anyway. That won't tell you how many pages were set all-frozen by this vacuum, as pages could have been unfrozen by DML that occurred after the page was vacuumed. It might be useful in addition to the line about the visibility map. This is somewhat in conflict with Robert and Peter's points about how autovacuum logging should be about what this vacuum did. But, we do have lines that talk about the before and after values: new relfrozenxid: 748, which is 3 XIDs ahead of previous value So, we could do something like: visibility map before: %u pages all-visible, %u pages all-frozen visibility map after: %u pages all-visible, %u pages all-frozen or visibility map after: %u pages all-visible (%u more than before), %u pages all-frozen (%u more than before) I still prefer adding how many pages were set all-frozen by this vacuum, though. - Melanie
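As an aside, anyone who wants to eyeball the cumulative VM state of a table (as opposed to the per-vacuum deltas discussed above) can already do so with the pg_visibility extension, e.g. for the table from the earlier example:

  CREATE EXTENSION IF NOT EXISTS pg_visibility;
  -- total count of all-visible and all-frozen pages according to the VM
  SELECT * FROM pg_visibility_map_summary('the_table');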
Re: Time to add a Git .mailmap?
On Thu, Oct 31, 2024 at 11:37:13AM +0100, Daniel Gustafsson wrote: > When looking at our Git tree for a recent conference presentation I happened > to > notice that we have recently gained duplicate names in the shortlog. Not sure > if we care enough to fix that with a .mailmap, but if we do the attached diff > makes sure that all commits are accounted for a single committer entry. Seems reasonable to me. -- nathan
Re: make all ereport in gram.y print out relative location
jian he writes: > now changed to > static PartitionStrategy parsePartitionStrategy(char *strategy, int location, > core_yyscan_t yyscanner); I can take a look at this, since it's basically a followup to 774171c4f. regards, tom lane
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 10:22 AM Melanie Plageman wrote: > It seems a better solution would be to find a way to document it or > phrase it clearly in the log. It is true that these pages were set > all-frozen in the VM. It is just that some of them were later removed. > And we don't know which ones those are. Is there a way to make this > clear? Probably not, but I don't think that it's worth worrying about. ISTM that it's inevitable that somebody might get confused whenever we expose implementation details such as these. This particular example doesn't seem particularly concerning to me. Fundamentally, the information that you're showing is a precisely accurate account of the work performed by VACUUM. If somebody is bothered by the fact that we're setting VM bits for pages that just get truncated anyway, then they're bothered by the reality of what VACUUM does -- and not by the instrumentation itself. Why not just reuse visibilitymap_count for this (and have it count the number of all-frozen pages when called by heap_vacuum_rel)? That'll change the nature of the information shown, but that might actually make it slightly more useful. -- Peter Geoghegan
Re: Consider the number of columns in the sort cost model
I played around with the examples a bit and couldn't figure out something. When I added the same values to different columns - firstly in a, later in b, the order of the columns for sort operation doesn't change. Isn't this a mistake? create table a (x1 int, y1 int); create table b (x2 int, y2 int); insert into a values (NULL, NULL); insert into a values (NULL, 1); insert into a values (1, 1); insert into a values (1, NULL); create index a_x1_idx on a(x1); create index b_x2_idx on b(x2); create index a_y1_idx on a(y1); create index b_y2_idx on b(y2); insert into b select 1, 2 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 1, 3 from generate_series(3,30) as id; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN Hash Right Join (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1) Hash Cond: (b.x2 = a.x1) -> HashAggregate (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 40kB -> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1) -> Hash (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1) Planning Time: 1.649 ms Execution Time: 0.485 ms (11 rows) delete from b; insert into b select 2, 1 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 3, 1 from generate_series(3,30) as id; vacuum analyze; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN --- Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1) Hash Cond: (a.x1 = b.x2) -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.011 rows=4 loops=1) -> Hash (cost=1.75..1.75 rows=3 width=8) (actual time=0.067..0.068 rows=3 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> HashAggregate (cost=1.72..1.75 rows=3 width=8) (actual time=0.063..0.064 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 24kB -> Seq Scan on b (cost=0.00..1.48 rows=48 width=8) (actual time=0.006..0.014 rows=48 loops=1) Planning Time: 0.391 ms Execution Time: 0.151 ms (11 rows) Sorry, I missed vacuum analyze before deleting all data from table b, but after running it I still got the same plan. 
alena@postgres=# create table a (x1 int, y1 int); create table b (x2 int, y2 int); insert into a values (NULL, NULL); insert into a values (NULL, 1); insert into a values (1, 1); insert into a values (1, NULL); create index a_x1_idx on a(x1); create index b_x2_idx on b(x2); create index a_y1_idx on a(y1); create index b_y2_idx on b(y2); insert into b select 1, 2 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 1, 3 from generate_series(3,30) as id; alena@postgres=# vacuum analyze; VACUUM alena@postgres=# explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN --- Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.168..0.185 rows=8 loops=1) Hash Cond: (a.x1 = b.x2) -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.027..0.029 rows=4 loops=1) -> Hash (cost=1.75..1.75 rows=3 width=8) (actual time=0.129..0.130 rows=3 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> HashAggregate (cost=1.72..1.75 rows=3 width=8) (actual time=0.119..0.123 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 24kB -> Seq Scan on b (cost=0.00..1.48 rows=48 width=8) (actual time=0.013..0.029 rows=48 loops=1) Planning Time: 1.464 ms Execution Time: 0.352 ms (11 rows) -- Regards, Alena Rybakina Postgres Professional
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:15 AM Peter Geoghegan wrote: > > On Thu, Oct 31, 2024 at 10:22 AM Melanie Plageman > wrote: > > It seems a better solution would be to find a way to document it or > > phrase it clearly in the log. It is true that these pages were set > > all-frozen in the VM. It is just that some of them were later removed. > > And we don't know which ones those are. Is there a way to make this > > clear? > > Probably not, but I don't think that it's worth worrying about. ISTM > that it's inevitable that somebody might get confused whenever we > expose implementation details such as these. This particular example > doesn't seem particularly concerning to me. > > Fundamentally, the information that you're showing is a precisely > accurate account of the work performed by VACUUM. If somebody is > bothered by the fact that we're setting VM bits for pages that just > get truncated anyway, then they're bothered by the reality of what > VACUUM does -- and not by the instrumentation itself. Makes sense to me. Though, I'm looking at it as a developer. > Why not just reuse visibilitymap_count for this (and have it count the > number of all-frozen pages when called by heap_vacuum_rel)? That'll > change the nature of the information shown, but that might actually > make it slightly more useful. I'm biased because I want the count of pages newly set all-frozen in the VM for another patch. You think exposing the total number of all-frozen pages at the end of the vacuum is more useful to users? - Melanie
Re: Making error message more user-friendly with spaces in a URI
> > I made a patch to make the error message more user-friendly when using a > URI to connect a database with psql. > Hi, Yushi! I could not find this line in the master branch and I couldn't apply this patch. I only see this error in the test (I think the test should also be changed), but the idea of fixing the error looks good to me. Best Regards, Stepan Neretin!
Re: Having problems generating a code coverage report
On Thu, Oct 31, 2024 at 10:02 PM Aleksander Alekseev wrote: > > Hi, > > > On Wed, Oct 30, 2024 at 05:26:49PM -0400, Peter Geoghegan wrote: > > > I use Debian unstable for most of my day to day work. Apparently > > > Debian unstable has exactly the same version of lcov as Ubuntu 24.04. > > > > > > I've also been unable to generate coverage reports for some time (at > > > least on Debian, with LCOV version 2.0-1). > > > > So do I, for both the debian SID and the lcov parts. > > I downgraded to lcov 1.16 [1] and it helped. This is merely a > workaround of course, not a long-time solution. > > [1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16 > > -- My Ubuntu meson version is 0.61.2, which also fails with errors similar to the ones you've posted. Judging from these two posts, https://github.com/mesonbuild/meson/pull/12345 https://github.com/mesonbuild/meson/issues/11995 maybe upgrading meson can solve this problem.
Multi delete wal IDEA
Hi there, hackers! How about trying out an idea to add an analog to save memory in WAL files for deleting records, similar to multi-insert optimization? This patch is trying to do just that. Best Regards, Stepan Neretin! diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 1161520f76..fb2d4563fd 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -3816,13 +3816,14 @@ ExecModifyTable(PlanState *pstate) CmdType operation = node->operation; ResultRelInfo *resultRelInfo; PlanState *subplanstate; - TupleTableSlot *slot; + TupleTableSlot *slot = NULL; TupleTableSlot *oldSlot; ItemPointerData tuple_ctid; HeapTupleData oldtupdata; HeapTuple oldtuple; ItemPointer tupleid; bool tuplock; + List *items_to_delete = NULL; CHECK_FOR_INTERRUPTS(); @@ -4179,8 +4180,15 @@ ExecModifyTable(PlanState *pstate) break; case CMD_DELETE: -slot = ExecDelete(&context, resultRelInfo, tupleid, oldtuple, - true, false, node->canSetTag, NULL, NULL, NULL); +ItemPointer item_ptr = (ItemPointer) palloc0(sizeof(ItemPointerData)); +item_ptr->ip_blkid = tupleid->ip_blkid; +item_ptr->ip_posid = tupleid->ip_posid; + +if (!items_to_delete) + items_to_delete = list_make1(item_ptr); +else + items_to_delete = lappend(items_to_delete, item_ptr); + break; case CMD_MERGE: @@ -4197,10 +4205,28 @@ ExecModifyTable(PlanState *pstate) * If we got a RETURNING result, return it to caller. We'll continue * the work on next call. */ - if (slot) + if (slot && !(operation == CMD_DELETE)) return slot; } + if (list_length(items_to_delete) > 0) + { + ListCell *cell; + ereport(WARNING, errmsg("NUM ITEMS TO DELETE = %d", list_length(items_to_delete))); + foreach(cell, items_to_delete) + { + ItemPointer item_ptr = (ItemPointer) lfirst(cell); + if (!slot) +slot = ExecDelete(&context, resultRelInfo, item_ptr, NULL, + true, false, node->canSetTag, NULL, NULL, NULL); + else +ExecDelete(&context, resultRelInfo, item_ptr, NULL, + true, false, node->canSetTag, NULL, NULL, NULL); + } + + return slot; + } + /* * Insert remaining tuples for batch insert. */
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:26 AM Melanie Plageman wrote: > I'm biased because I want the count of pages newly set all-frozen in > the VM for another patch. You think exposing the total number of > all-frozen pages at the end of the vacuum is more useful to users? The emphasis on the work that one particular VACUUM operation performed doesn't seem like the most relevant thing to users (I get why you'd care about it in the context of your work, though). What matters to users is that the overall picture over time is one where VACUUM doesn't leave an excessive number of pages not-all-frozen-in-VM. What if we're just setting the same few pages all-frozen, again and again? And what about normal (non-aggressive) VACUUMs that effectively *increase* the number of pages future aggressive VACUUMs need to scan? As you well know, by setting some pages all-visible (not all-frozen), VACUUM essentially guarantees that those same pages can only get frozen by future aggressive VACUUMs. All these factors seem to argue for using visibilitymap_count for this (which is not to say that I am opposed to instrumented pages newly marked all-frozen in the VM, if it makes sense as part of some much larger project). -- Peter Geoghegan
Re: Having problems generating a code coverage report
Hi Jian, > > I downgraded to lcov 1.16 [1] and it helped. This is merely a > > workaround of course, not a long-time solution. > > > > [1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16 > > > > -- > > my ubuntu meson version: 0.61.2, which also fails. > with similar errors you've posted. > > > from these two posts, > https://github.com/mesonbuild/meson/pull/12345 > https://github.com/mesonbuild/meson/issues/11995 > > Maybe upgrading meson can solve this problem. Thanks for the hint. I'm using Meson 1.3.2. Although it's not ancient (Feb 2024) there is 1.6.0 available. I'll try upgrading and let you know the results. -- Best regards, Aleksander Alekseev
"command cannot affect row a second time" in INSERT ... ON CONFLICT
Hello hackers, I am looking to better understand the applicability of the error message "command cannot affect row a second time". Consider the following table and data: CREATE TABLE ioc (i int, UNIQUE(i)); INSERT INTO ioc VALUES (1); The following two queries produce different errors: *Query 1* postgres=# INSERT INTO ioc VALUES (1), (20) ON CONFLICT (i) DO UPDATE SET i = 20; ERROR: 21000: ON CONFLICT DO UPDATE command cannot affect row a second time HINT: Ensure that no rows proposed for insertion within the same command have duplicate constrained values. *Query 2* postgres=# INSERT INTO ioc VALUES (20), (1) ON CONFLICT (i) DO UPDATE SET i = 20; ERROR: 23505: duplicate key value violates unique constraint "ioc_i_key" DETAIL: Key (i)=(20) already exists. INSERT ... ON CONFLICT does not support deferrable unique constraints, and so the two errors appear to be logically equivalent. However, the MERGE command which does support deferring unique constraints, consistently produces the duplicate key violation for similar queries [1] but also raises "command cannot affect row a second time" in other scenarios as demonstrated by regress tests in merge.sql. Naively, it seems to me that attempting to take a tuple lock on both: 1. The conflicting tuple (i = 1 in the second tuple in Query 2) as well as 2. The tuple it updates into (i = 20 in the second tuple in Query 2) (which may or may not exist) in ExecOnConflictUpdate could yield a consistent error message in both scenarios but it offers no real functional gains. 1. Is there a different reason the two queries produce a different error? 2. Is there a better way to think about the "command cannot affect row a second time"? Appreciate any guidance. Thanks. Warm regards, Karthik Ramanathan [1] MERGE command example CREATE TABLE source (sid INT); CREATE TABLE target (tid INT, UNIQUE (tid)); INSERT INTO target VALUES (1); *Query 1a* postgres=# INSERT INTO source VALUES (20), (1); postgres=# MERGE INTO target t USING source AS s ON t.tid = s.sid WHEN MATCHED THEN UPDATE SET tid = 20 WHEN NOT MATCHED THEN INSERT VALUES (s.sid); ERROR: 23505: duplicate key value violates unique constraint "target_tid_key" DETAIL: Key (tid)=(20) already exists. *Query 1b* postgres=# INSERT INTO source VALUES (1), (20); postgres=# MERGE INTO target t USING source AS s ON t.tid = s.sid WHEN MATCHED THEN UPDATE SET tid = 20 WHEN NOT MATCHED THEN INSERT VALUES (s.sid); ERROR: 23505: duplicate key value violates unique constraint "target_tid_key" DETAIL: Key (tid)=(20) already exists.
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On 30/10/2024 15:58, Jelte Fennema-Nio wrote: It was pointed out by Heikki in the thread around protocol-level wait-for-LSN that "protocol extensions" is a pretty confusing name, since it suggests a relation to Postgres extensions. Even though there is no such relationship at all. Attached is a small patch that aligns on the name "protocol options" instead. This terminology is already used in a bunch of the docs. Since no protocol options have been introduced yet, it seems like now is a good time to align on what to call them. It might even be worth backporting this to have our docs of previous versions be consistent. Bikeshedding time: "protocol option" makes me think of GUCs. "optional protocol features" perhaps. A bit long though.. Or keep using "protocol extension" and add a paragraph to the docs to say explicitly that there's no support for extensions to create protocol extensions. TLS extensions is a good comparison. I don't have a strong opinion, all of those would work for me. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 12:02 PM Robert Haas wrote: > I don't see it quite the same way. I agree that what users are really > concerned about is the excessive number of unfrozen pages in the VM. > But that's not the question here. The question is what should > log_autovacuum_min_duration log. And I think it should log what the > vacuum itself did, not what the state of the table ended up being > around the time the vacuum ended. I don't fundamentally disagree that the actual work performed by VACUUM could also be useful. It's a question of emphasis. FWIW I do disagree with the principle that log_autovacuum_min_duration should only log things that are work performed by VACUUM. While most things that it reports on currently do work that way, that in itself doesn't seem like it should preclude reporting on visibilitymap_count now. -- Peter Geoghegan
Re: [PoC] Federated Authn/z with OAUTHBEARER
On Thu, Oct 31, 2024 at 4:05 AM Antonin Houska wrote: > > > And regardless, the library appears to be loaded by every backend during > > > authentication. Why isn't it loaded by postmaster like libraries listed > > > in > > > shared_preload_libraries? fork() would then ensure that the backends do > > > have > > > the library in their address space. > > > > It _can_ be, if you want -- there's nothing that I know of preventing > > the validator from also being preloaded with its own _PG_init(), is > > there? But I don't think it's a good idea to force that, for the same > > reason we want to allow SIGHUP. > > Loading the library by postmaster does not prevent the backends from reloading > it on SIGHUP later. I was simply concerned about performance. (I proposed > loading the library at another stage of backend initialization rather than > adding _PG_init() to it.) Okay. I think this is going to be one of the slower authentication methods by necessity: the builtin flow in libpq requires a human in the loop, and an online validator is going to be making several HTTP calls from the backend. So if it turns out later that we need to optimize the backend logic, I'd prefer to have a case study in hand; otherwise I think we're likely to optimize the wrong things. > Easiness of reading is the only "question" here :-) It's might not always be > obvious why a variable should have some particular value. In general, the > Assert() statements are almost always preceded with a comment in the PG > source. Oh, an assertion label! I can absolutely add one; I originally thought you were proposing a label for the case itself. > > > Or, doesn't the FE_OAUTH_INIT branch of the switch statement actually > > > fit > > > better into oauth_init()? > > > > oauth_init() is the mechanism initialization for the SASL framework > > itself, which is shared with SCRAM. In the current architecture, the > > init callback doesn't take the initial client response into > > consideration at all. > > Sure. The FE_OAUTH_INIT branch in oauth_exchange() (FE) also does not generate > the initial client response. It might, if it ends up falling through to FE_OAUTH_REQUESTING_TOKEN. There are two paths that can do that: the case where we have no discovery URI, and the case where a custom user flow returns a token synchronously (it was probably cached). > Anyway, are you sure that pg_SASL_continue() can also receive the SASL_ASYNC > value from oauth_exchange()? My understanding is that pg_SASL_init() receives > it if there is no token, but after that, oauth_exchange() is not called util > the token is available, and thus it should not return SASL_ASYNC anymore. Correct -- the only way for the current implementation of the OAUTHBEARER mechanism to return SASL_ASYNC is during the very first call. That's not an assumption I want to put into the higher levels, though; I think Michael will be unhappy with me if I introduce additional SASL coupling after the decoupling work that's been done over the last few releases. :D Thanks again, --Jacob
[PATCH] Improve code coverage of network address functions
Hi hackers, Recently I played with lcov [1]. In the process it was discovered that the following functions are not executed by our tests: - abbrev(inet) - set_masklen(cidr,int4) - netmask(inet) - hostmask(inet) - inet_client_addr() - inet_client_port() - inet_server_addr() - inet_server_port() The proposed patch fixes this. For the last four functions the return values are not checked, only the fact that the functions are callable and don't crash. This improves code coverage of src/backend/utils/adt/network.c from 69.8% to 80.1%. [1]: https://postgr.es/m/CAJ7c6TPYPF93%2ByWi%3DThKiOsnhqLpeTmctMrJWz3xRQobGSY6BA%40mail.gmail.com -- Best regards, Aleksander Alekseev From e3a19689a335bf5c0c6aecd19e6fe3d17d456b2b Mon Sep 17 00:00:00 2001 From: Aleksander Alekseev Date: Thu, 31 Oct 2024 17:54:56 +0300 Subject: [PATCH v1] Improve code coverage of network address functions The following functions were not covered by tests: - abbrev(inet) - set_masklen(cidr,int4) - netmask(inet) - hostmask(inet) - inet_client_addr() - inet_client_port() - inet_server_addr() - inet_server_port() Correct this. Aleksander Alekseev, reviewed by TODO FIXME Discussion: TODO FIXME --- src/test/regress/expected/inet.out | 104 + src/test/regress/sql/inet.sql | 14 2 files changed, 118 insertions(+) diff --git a/src/test/regress/expected/inet.out b/src/test/regress/expected/inet.out index b6895d9ced..c903c6c93b 100644 --- a/src/test/regress/expected/inet.out +++ b/src/test/regress/expected/inet.out @@ -190,6 +190,72 @@ SELECT c AS cidr, masklen(c) AS "masklen(cidr)", 10.0.0.0/8 | 8 | 9.1.2.3/8 | 8 (4 rows) +SELECT i AS inet, abbrev(i) AS "abbrev(inet)" FROM INET_TBL; + inet | abbrev(inet) +--+-- + 192.168.1.226/24 | 192.168.1.226/24 + 192.168.1.226| 192.168.1.226 + 192.168.1.0/24 | 192.168.1.0/24 + 192.168.1.0/25 | 192.168.1.0/25 + 192.168.1.255/24 | 192.168.1.255/24 + 192.168.1.255/25 | 192.168.1.255/25 + 10.1.2.3/8 | 10.1.2.3/8 + 10.1.2.3/8 | 10.1.2.3/8 + 10.1.2.3 | 10.1.2.3 + 10.1.2.3/24 | 10.1.2.3/24 + 10.1.2.3/16 | 10.1.2.3/16 + 10.1.2.3/8 | 10.1.2.3/8 + 11.1.2.3/8 | 11.1.2.3/8 + 9.1.2.3/8| 9.1.2.3/8 + 10:23::f1/64 | 10:23::f1/64 + 10:23:: | 10:23:: + ::4.3.2.1/24 | ::4.3.2.1/24 +(17 rows) + +SELECT i AS inet, netmask(i) AS "netmask(inet)" FROM INET_TBL; + inet | netmask(inet) +--+- + 192.168.1.226/24 | 255.255.255.0 + 192.168.1.226| 255.255.255.255 + 192.168.1.0/24 | 255.255.255.0 + 192.168.1.0/25 | 255.255.255.128 + 192.168.1.255/24 | 255.255.255.0 + 192.168.1.255/25 | 255.255.255.128 + 10.1.2.3/8 | 255.0.0.0 + 10.1.2.3/8 | 255.0.0.0 + 10.1.2.3 | 255.255.255.255 + 10.1.2.3/24 | 255.255.255.0 + 10.1.2.3/16 | 255.255.0.0 + 10.1.2.3/8 | 255.0.0.0 + 11.1.2.3/8 | 255.0.0.0 + 9.1.2.3/8| 255.0.0.0 + 10:23::f1/64 | ::::: + 10:23:: | ::::::: + ::4.3.2.1/24 | :ff00:: +(17 rows) + +SELECT i AS inet, hostmask(i) AS "hostmask(inet)" FROM INET_TBL; + inet | hostmask(inet) +--+ + 192.168.1.226/24 | 0.0.0.255 + 192.168.1.226| 0.0.0.0 + 192.168.1.0/24 | 0.0.0.255 + 192.168.1.0/25 | 0.0.0.127 + 192.168.1.255/24 | 0.0.0.255 + 192.168.1.255/25 | 0.0.0.127 + 10.1.2.3/8 | 0.255.255.255 + 10.1.2.3/8 | 0.255.255.255 + 10.1.2.3 | 0.0.0.0 + 10.1.2.3/24 | 0.0.0.255 + 10.1.2.3/16 | 0.0.255.255 + 10.1.2.3/8 | 0.255.255.255 + 11.1.2.3/8 | 0.255.255.255 + 9.1.2.3/8| 0.255.255.255 + 10:23::f1/64 | ::::: + 10:23:: | :: + ::4.3.2.1/24 | 0:ff:::::: +(17 rows) + SELECT c AS cidr, i AS inet FROM INET_TBL WHERE c = i; cidr | inet @@ -261,6 +327,44 @@ SELECT set_masklen(inet(text(i)), 24) FROM INET_TBL; ::4.3.2.1/24 (17 rows) +SELECT 
set_masklen(cidr(text(c)), 24) FROM INET_TBL; + set_masklen + + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 192.168.1.0/24 + 10.0.0.0/24 + 10.0.0.0/24 + 10.1.2.0/24 + 10.1.2.0/24 + 10.1.0.0/24 + 10.0.0.0/24 + 10.0.0.0/24 + 10.0.0.0/24 + 10::/24 + 10::/24 + ::/24 +(17 rows) + +-- just making sure these functions are callable and don't crash +SELECT 'ok' AS result +FROM unnest(ARRAY[ +text(inet_client_addr()), +text(inet_client_port()), +text(inet_server_addr()), +text(inet_server_port()) + ]) AS x; + result + + ok + ok + ok + ok +(4 rows) + -- check that btree index works correctly CREATE INDEX inet_idx1 ON inet_tbl(i); SET enable_seqscan TO off; diff --git a/src/test/regress/sql/inet.sql b/src/test/regr
Re: Relcache refactoring
On 31/10/2024 12:06, Heikki Linnakangas wrote: On 31/10/2024 10:14, Heikki Linnakangas wrote: Committed with those fixes, thanks for the review! There is a failure in the buildfarm animal 'prion', which builds with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. I'll look into that later today. (I did test with -DRELCACHE_FORCE_RELEASE on my laptop, but must've missed something). I was finally able to reproduce this, by running the failing pg_regress tests concurrently with repeated "vacuum full pg_database" calls. It's curious that 'prion' failed on the first run after the commit, I was not able to reproduce it by just running the unmodified regression tests with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE. Committed a fix. There was one codepath that was missing a call to RelationInitPhysicalAddr(relation) after the patch. I alluded to this earlier in this thread: ## RelationInitPhysicalAddr() call in RelationReloadNailed() One question or little doubt I have: Before these patches, RelationReloadNailed() calls RelationInitPhysicalAddr() even when it leaves the relation as invalidated because we're not in a transaction or if the relation isn't currently in use. That's a bit surprising, I'd expect it to be done when the entry is reloaded, not when it's invalidated. That's how it's done for non-nailed relations. And in fact, for a nailed relation, RelationInitPhysicalAddr() is called *again* when it's reloaded. Is it important? Before commit a54e1f1587, nailed non-index relations were not reloaded at all, except for the call to RelationInitPhysicalAddr(), which seemed consistent. I think this was unintentional in commit a54e1f1587, or perhaps just overly defensive, as there's no harm in some extra RelationInitPhysicalAddr() calls. This patch removes that extra call from the invalidation path, but if it turns out to be wrong, we can easily add it to RelationInvalidateRelation. Now this wasn't exactly that, but related. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:15 AM Peter Geoghegan wrote: > Probably not, but I don't think that it's worth worrying about. ISTM > that it's inevitable that somebody might get confused whenever we > expose implementation details such as these. This particular example > doesn't seem particularly concerning to me. +1. We could possibly make this less confusing by reworking the output so that we first talk about what the vacuuming did in one set of log lines and then talk about what the truncation did afterward. But that's a lot of work, and I don't feel like it's "must do" work. -- Robert Haas EDB: http://www.enterprisedb.com
Re: small pg_combinebackup improvements
On Thu, Oct 31, 2024 at 11:41 AM Bertrand Drouvot wrote: > 0001 looks pretty straightforward and good to me. Thanks for the review. > What about moving the new comment just before the new "else if"? Well, the block comment applies to the whole if-else if-else construction. If we get too many else-ifs here it may need restructuring, but I don't think it needs that now. > Yeah, "copy" is needed. I tested to "just" create an empty incremental file > and got: > > $ pg_combinebackup backup1 backup2 -o restored_full > pg_combinebackup: error: could not read file > "backup1/base/1/INCREMENTAL.6113": read 0 of 4 > > Which is not the error we want to produce. Right, it needs to look like a valid incremental file. > s/pg_combinebackup fails/pg_combinebackup fails due to an unexpected > incremental file/? OK > I'm not sure about 0001 but I think 0002 deserves a back patch as I think it > fits > into the "low-risk fixes" category [0]. I'm inclined to back-patch both, then. We might have more small fixes and they'll be less risky to back-patch if we back-patch them all. -- Robert Haas EDB: http://www.enterprisedb.com
Re: minor doc issue in 9.16.2.1.1. Boolean Predicate Check Expressions
On Fri, Oct 18, 2024 at 10:00:54AM +0800, jian he wrote: > On Fri, Oct 18, 2024 at 2:05 AM Bruce Momjian wrote: > > Yes, updated patch attached. > > > looks good. > > in the meantime, do you think it's necessary to slightly rephrase > jsonb_path_match doc entry: > > currently doc entry: > jsonb_path_match ( target jsonb, path jsonpath [, vars jsonb [, silent > boolean ]] ) → boolean > Returns the result of a JSON path predicate check for the specified JSON > value. > > > "the result of a JSON path predicate check for the specified JSON > value." is a jsonb boolean. > but jsonb_path_match returns sql boolean. > maybe add something to describe case like: "if JSON path predicate > check return jsonb null, jsonb_path_match will return SQL null". Yes, I think that is a good point, updated patch attached. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?" diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 05f630c6a6c..25e445467c3 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -17178,8 +17178,8 @@ ERROR: value too long for type character(2) boolean -Returns the result of a JSON path predicate check for the specified -JSON value. +Returns the SQL boolean result of a JSON path predicate check +for the specified JSON value. (This is useful only with predicate check expressions, not SQL-standard JSON path expressions, @@ -17646,9 +17646,9 @@ SELECT '{ Boolean predicate, whereas the SQL standard allows predicates only within filters. While SQL-standard path expressions return the relevant element(s) of the queried JSON value, predicate check expressions - return the single three-valued result of the + return the single three-valued jsonb result of the predicate: true, - false, or unknown. + false, or null. For example, we could write this SQL-standard filter expression: => select jsonb_path_query(:'json', '$.track.segments ?(@[*].HR > 130)');
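To make the mapping concrete, here is a small illustration of the three possible outcomes (my example, not part of the patch; the last case relies on the fact that comparing a number with a string yields the jsonpath "unknown" result):

  SELECT jsonb_path_match('{"a": 2}', '$.a > 1');   -- t
  SELECT jsonb_path_match('{"a": 0}', '$.a > 1');   -- f
  -- the predicate result is unknown (jsonb null), returned as SQL NULL
  SELECT jsonb_path_match('{"a": 2}', '$.a > "b"'); -- NULL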
Re: small pg_combinebackup improvements
Hi, On Wed, Oct 30, 2024 at 03:50:53PM -0400, Robert Haas wrote: > Hi, > > Here are two small patches to improve pg_combinebackup slightly. > > 0001: I noticed that there is some logic in reconstruct.c that > constructs a pathname of the form a/b//c instead of a/b/c. AFAICT, > this still works fine; it just looks ugly. It's possible to get one of > these pathnames to show up in an error message if you have a corrupted > backup, e.g.: > > pg_combinebackup: error: could not open file > "x/base/1//INCREMENTAL.6229": No such file or directory > pg_combinebackup: removing output directory "xy" > > So 0001 fixes this. Yeah, I don't think that was an issue per say but better to display a path of the form a/b/c (to not be "distracted" by the output at the first place). 0001 looks pretty straightforward and good to me. > 0002: If you try to do something like pg_combinebackup incr1 incr2 -o > result, you'll get an error saying that the first backup must be a > full backup. This is an implementation restriction that might be good > to lift, but right now that's how it is. However, if you do > pg_combinebackup full incr -o result, but the full backup happens to > contain an INCREMENTAL.* file that also exists in the incremental > backup, then you won't get an error. Instead you'll reconstruct a > completely bogus full file and write it to the result directory. Now, > this should really be impossible unless you're intentionally trying to > break pg_combinebackup, but it's also pretty lame, so 0002 fixes > things so that you get a proper error, and also adds a test case. > 1 === +* If we have only incremental files, and there's no full file at any +* point in the backup chain, something has gone wrong. Emit an error. +* * Otherwise, reconstruct. */ if (copy_source != NULL) copy_file(copy_source->filename, output_filename, &checksum_ctx, copy_method, dry_run); + else if (sidx == 0 && source[0]->header_length != 0) + { + pg_fatal("full backup contains unexpected incremental file \"%s\"", +source[0]->filename); + } What about moving the new comment just before the new "else if"? 2 === + copy("$backup2path/base/1/$iname", "$backup1path/base/1/$iname") Yeah, "copy" is needed. I tested to "just" create an empty incremental file and got: $ pg_combinebackup backup1 backup2 -o restored_full pg_combinebackup: error: could not read file "backup1/base/1/INCREMENTAL.6113": read 0 of 4 Which is not the error we want to produce. 3 === + qr/full backup contains unexpected incremental file/, + "pg_combinebackup fails"); s/pg_combinebackup fails/pg_combinebackup fails due to an unexpected incremental file/? > Review appreciated. My plan is to commit at least to master and > possibly back-patch. Opinions on whether to back-patch are also > appreciated. I'm not sure about 0001 but I think 0002 deserves a back patch as I think it fits into the "low-risk fixes" category [0]. [0]: https://www.postgresql.org/support/versioning/ Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Count and log pages set all-frozen by vacuum
On Thu, Oct 31, 2024 at 11:49 AM Peter Geoghegan wrote: > The emphasis on the work that one particular VACUUM operation > performed doesn't seem like the most relevant thing to users (I get > why you'd care about it in the context of your work, though). What > matters to users is that the overall picture over time is one where > VACUUM doesn't leave an excessive number of pages > not-all-frozen-in-VM. I don't see it quite the same way. I agree that what users are really concerned about is the excessive number of unfrozen pages in the VM. But that's not the question here. The question is what should log_autovacuum_min_duration log. And I think it should log what the vacuum itself did, not what the state of the table ended up being around the time the vacuum ended. And I think there is certainly a use case for knowing how much work of each particular kind vacuum did. You might for example be trying to judge whether a particular vacuum was useless. Knowing the cumulative state of the table around the time the vacuum finished doesn't help you figure that out; a count of how much work the vacuum itself did does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, Oct 31, 2024 at 10:51 AM Heikki Linnakangas wrote: > Bikeshedding time: > > "protocol option" makes me think of GUCs. > > "optional protocol features" perhaps. A bit long though.. > > Or keep using "protocol extension" and add a paragraph to the docs to > say explicitly that there's no support for extensions to create protocol > extensions. TLS extensions is a good comparison. > > I don't have a strong opinion, all of those would work for me. I don't particularly like "optional protocol features". I find "protocol extensions" to be mildly clearer than "protocol options," but only mildly. I don't think it's really viable to promise that we'll never talk about extending anything other than in the context of what CREATE EXTENSION does. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Use "protocol options" name instead of "protocol extensions" everywhere
On Thu, 31 Oct 2024 at 15:50, Heikki Linnakangas wrote: > Bikeshedding time: Another few options: 4. Protocol enhancement 5. Protocol flag 6. Protocol feature-flag 7. Protocol configuration 8. Protocol parameter One thing to consider is that there are two ways of using them: 1. Turning an optional protocol feature on/off (send LSN yes/no) 2. Configuring an optional protocol feature (compress with gzip/lz4/zstd) I think "protocol extension" is a good name for the first, but it reads/writes a bit awkwardly for the second usage, imo: 1. The wait_for_lsn protocol extension needs to be enabled. 2. I configured the compression protocol extension to be gzip. I like "protocol option" because it works for both: 1. The wait_for_lsn protocol option needs to be enabled. 2. I set the compression protocol option to gzip. I still think of these "protocol xyzs" as essentially being GUCs for the protocol, especially because they are configured the same way as GUCs in the StartupMessage. So having "protocol option" make you think of GUCs doesn't necessarily seem like a bad thing to me.
Re: Fix typos where 'the' was repeated
> On 31 Oct 2024, at 10:24, vignesh C wrote: > I noticed a couple of typos in code. "the the" should have been "the", > attached patch has the changes for the same. Fixed, thanks. -- Daniel Gustafsson
Re: Eager aggregation, take 3
hi. still trying to understand v13. found a bug. minimum test : drop table if exists t1, t2; CREATE TABLE t1 (a int, b int, c int); CREATE TABLE t2 (a int, b int, c int); SET enable_eager_aggregate TO on; explain(costs off, settings) SELECT avg(t2.a), t1.c FROM t1 JOIN t2 ON t1.b = t2.b GROUP BY t1.c having grouping(t1.c) > 0; create_agg_clause_infos foreach(lc, tlist_exprs) { Expr *expr = (Expr *) lfirst(lc); if (IsA(expr, GroupingFunc)) return; } if (root->parse->havingQual != NULL) { List *having_exprs; having_exprs = pull_var_clause((Node *) root->parse->havingQual, PVC_INCLUDE_AGGREGATES | PVC_RECURSE_PLACEHOLDERS); if (having_exprs != NIL) { tlist_exprs = list_concat(tlist_exprs, having_exprs); list_free(having_exprs); } } havingQual can have GroupingFunc. if that happens, then segmentation fault.
Re: IPC::Run::time[r|out] vs our TAP tests
> On 28 Oct 2024, at 11:56, Heikki Linnakangas wrote: > > On 09/04/2024 20:10, Daniel Gustafsson wrote: >> Turning the timeout into a timer and returning undef along with logging a >> test >> failure in case of expiration seems a bit saner (maybe Andrew can suggest an >> API which has a better Perl feel to it). Most callsites don't need any >> changes >> to accommodate for this, the attached 0002 implements this timer change and >> modify the few sites that need it, converting one to plain query() where the >> added complexity of query_until isn't required. > > +1. The patch looks good to me. I didn't comb through the tests to check for > bugs of omission, i.e. cases where the caller would need adjustments because > of this, I trust that you found them all. Thanks for review! >> =item $session->quit >> Close the session and clean up resources. Each test run must be closed with >> C. Returns TRUE when the session was cleanly terminated, otherwise >> FALSE. A testfailure will be issued in case the session failed to finish. > > What does "session failed to finish" mean? Does it mean the same as "not > cleanly terminated", i.e. a test failure is issued whenever this returns > FALSE? It was very literally referring to the finish() method. I've reworded the comment to indicated that it throws a failure in case the process returns a non-zero exit status to finish(). > typo: testfailure -> test failure Fixed. >> my $pid = $bgpsql->query('SELECT pg_backend_pid()'); >> +isnt($pid, undef, 'Get backend PID'); >> # create the database, prevent drop database via lock held by a 2PC >> transaction >> ok( $bgpsql->query_safe( > > I'm not sure I understand these changes. These can only fail if the query() > or query_until() function times out, right? In that case, the query() or > query_until() would also report a test failure, so these additional checks > after the call seem redundant. They don't do any harm either, but I wonder > why have them in these particular call sites and not in other query() or > query_until() calls. Fair point, they were added to provide additional detail in case of failure, but they are to some degree overzealous and definitely not required. Attached is a v2 with the above changes and 0003 dropped due to already being implemented. -- Daniel Gustafsson v2-0001-Configure-interactive-instance-to-restart-timer.patch Description: Binary data v2-0002-Report-test-failure-rather-than-aborting-in-case-.patch Description: Binary data
Re: MultiXact\SLRU buffers configuration
> On 29 Oct 2024, at 21:45, Thom Brown wrote: > > It clearly checks for interrupts, but when I saw this issue happen, it > wasn't interruptible. Perhaps, that was different multixacts? When I was observing this pathology on Standby, it was a stream of different reads encountering different multis. Either way startup can cancel locking process on it's own. Or is it the case that cancel was not enough, did you actually need termination, not cancel? Thanks! Best regards, Andrey Borodin.
Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Hackers, I'd like to discuss a problem with replication slots' restart LSN. Physical slots are saved to disk at the beginning of a checkpoint. At the end of the checkpoint, old WAL segments are recycled or removed from disk if they are not kept by slots' restart_lsn values. If an existing physical slot is advanced in the middle of checkpoint execution, WAL segments related to the restart LSN that was saved on disk may be removed. This is because the minimal LSN across replication slots is calculated at the end of the checkpoint, just before the old WAL segments are removed. If the postgres instance is hard-stopped (pg_ctl stop -m immediate) right after the checkpoint and then restarted, the slot's restart_lsn may point to a removed WAL segment. I believe such behaviour is not good. The doc [0] describes that restart_lsn may be set to some past value after reload. There is a discussion [1] on pgsql-hackers where this behaviour is discussed. The main reason for not flushing physical slots on advancing is performance. I'm OK with that behaviour, except that the corresponding WAL segments should not be removed. I propose to keep WAL segments based on the saved-on-disk (flushed) restart_lsn of slots: add a new field restart_lsn_flushed to the ReplicationSlot structure and copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the on-disk format of the slot contents. I attached a patch. It is not yet complete, but it demonstrates a way to solve the problem. I reproduced the problem in the following way: * Add some delay in CheckPointBuffers (pg_usleep) to emulate a long checkpoint. * Execute a checkpoint, and run pg_replication_slot_advance from another connection right after the checkpoint starts. * Hard-restart the server right after the checkpoint completes. * After the restart, the slot's restart_lsn may point to a removed WAL segment. The proposed patch fixes it. [0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html [1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru
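For reference, the SQL side of the reproduction above looks roughly like this (assuming a physical slot named s1 and the artificial pg_usleep delay patched into CheckPointBuffers):

  -- session 1: create the slot and start the (artificially slow) checkpoint
  SELECT pg_create_physical_replication_slot('s1', true);
  CHECKPOINT;

  -- session 2, while the checkpoint above is still running:
  SELECT pg_replication_slot_advance('s1', pg_current_wal_lsn());

  -- then: pg_ctl stop -m immediate, start the server again, and check
  -- whether the WAL segment containing restart_lsn still exists on disk
  SELECT slot_name, restart_lsn FROM pg_replication_slots;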
Re: Relcache refactoring
On 31/10/2024 10:14, Heikki Linnakangas wrote: Committed with those fixes, thanks for the review! There is a failure in the buildfarm animal 'prion', which builds with -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE options. I'll look into that later today. (I did test with -DRELCACHE_FORCE_RELEASE on my laptop, but must've missed something). -- Heikki Linnakangas Neon (https://neon.tech)
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Sorry, attached the missed patch. On Thursday, October 31, 2024 13:18 MSK, "Vitaly Davydov" wrote: Dear Hackers, I'd like to discuss a problem with replication slots's restart LSN. Physical slots are saved to disk at the beginning of checkpoint. At the end of checkpoint, old WAL segments are recycled or removed from disk, if they are not kept by slot's restart_lsn values. If an existing physical slot is advanced in the middle of checkpoint execution, WAL segments, which are related to saved on disk restart LSN may be removed. It is because the calculation of the replication slot miminal LSN is occured at the end of checkpoint, prior to old WAL segments removal. If to hard stop (pg_stl -m immediate) the postgres instance right after checkpoint and to restart it, the slot's restart_lsn may point to the removed WAL segment. I believe, such behaviour is not good. The doc [0] describes that restart_lsn may be set to the some past value after reload. There is a discussion [1] on pghackers where such behaviour is discussed. The main reason of not flushing physical slots on advancing is a performance reason. I'm ok with such behaviour, except of that the corresponding WAL segments should not be removed. I propose to keep WAL segments by saved on disk (flushed) restart_lsn of slots. Add a new field restart_lsn_flushed into ReplicationSlot structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the format of storing the slot contents on disk. I attached a patch. It is not yet complete, but demonstate a way to solve the problem. I reproduced the problem by the following way: * Add some delay in CheckPointBuffers (pg_usleep) to emulate long checkpoint execution. * Execute checkpoint and pg_replication_slot_advance right after starting of the checkpoint from another connection. * Hard restart the server right after checkpoint completion. * After restart slot's restart_lsn may point to removed WAL segment. The proposed patch fixes it. [0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html [1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru From acae6b55fc766d2fe1bfe85eb8af85110f55dcc8 Mon Sep 17 00:00:00 2001 From: Vitaly Davydov Date: Thu, 31 Oct 2024 12:29:12 +0300 Subject: [PATCH] Keep WAL segments by slot's flushed restart LSN --- src/backend/replication/slot.c | 9 +++-- src/include/replication/slot.h | 4 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 6828100cf1..ee7ab3678e 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -1148,7 +1148,9 @@ ReplicationSlotsComputeRequiredLSN(void) continue; SpinLockAcquire(&s->mutex); - restart_lsn = s->data.restart_lsn; + restart_lsn = s->restart_lsn_flushed != InvalidXLogRecPtr ? + s->restart_lsn_flushed : + s->data.restart_lsn; invalidated = s->data.invalidated != RS_INVAL_NONE; SpinLockRelease(&s->mutex); @@ -1207,7 +1209,9 @@ ReplicationSlotsComputeLogicalRestartLSN(void) /* read once, it's ok if it increases while we're checking */ SpinLockAcquire(&s->mutex); - restart_lsn = s->data.restart_lsn; + restart_lsn = s->restart_lsn_flushed != InvalidXLogRecPtr ? 
+ s->restart_lsn_flushed : + s->data.restart_lsn; invalidated = s->data.invalidated != RS_INVAL_NONE; SpinLockRelease(&s->mutex); @@ -2097,6 +2101,7 @@ SaveSlotToPath(ReplicationSlot *slot, const char *dir, int elevel) SpinLockAcquire(&slot->mutex); memcpy(&cp.slotdata, &slot->data, sizeof(ReplicationSlotPersistentData)); + slot->restart_lsn_flushed = slot->data.restart_lsn; SpinLockRelease(&slot->mutex); diff --git a/src/include/replication/slot.h b/src/include/replication/slot.h index 45582cf9d8..ca4c3aab3b 100644 --- a/src/include/replication/slot.h +++ b/src/include/replication/slot.h @@ -207,6 +207,10 @@ typedef struct ReplicationSlot /* The time since the slot has become inactive */ TimestampTz inactive_since; + + /* Latest restart LSN that was flushed to disk */ + XLogRecPtr restart_lsn_flushed; + } ReplicationSlot; #define SlotIsPhysical(slot) ((slot)->data.database == InvalidOid) -- 2.34.1
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
On Wed, Oct 30, 2024 at 9:36 PM Junwang Zhao wrote: > On Wed, Oct 30, 2024 at 11:58 AM jian he wrote: > > > > I missed a case when column collation and partition key collation are > > the same and indeterministic. > > that should be fine for partition-wise join. > > so v2 attached. > > > > have_partkey_equi_join, match_expr_to_partition_keys didn't do any > > collation related check. > > propose v2 change disallow partitionwise join for case when > > column collation is indeterministic *and* is differ from partition > > key's collation. > > > > the attached partition_wise_join_collation.sql is the test script. > > you may use it to compare with the master behavior. > > What if the partkey collation and column collation are both deterministic, > but with different sort order? > > I'm not familiar with this part of the code base, but it seems to me the > partition wise join should use partkey collation instead of column collation, > because it's the partkey collation that decides which partition a row to > be dispatched. > > What Jian proposed is also reasonable but seems another aspect of $subject? I think we should insist that the join key collation and the partition collation are exactly the same and refuse to match them if they are not. + { + Oid colloid = exprCollation((Node *) expr); + + if ((partcoll != colloid) && +OidIsValid(colloid) && +!get_collation_isdeterministic(colloid)) + *coll_incompatiable = true; I am not quite sure what is the point of checking whether or not the expression collation is deterministic after confirming that it's not the same as partcoll. Attached 0002 is what I came up with. One thing that's different from what Jian proposed is that match_expr_to_partition_keys() returns -1 (expr not matched to any key) when the collation is also not matched instead of using a separate output parameter for that. 0001 is the patch for the partitionwise grouping issue (bug #18568). I concluded that even partial aggregate should be disallowed when the grouping collation doesn't match partitioning collation. The result with partial partitionwise grouping is not the same as when enable_partitionwise_aggregate is off. -- Thanks, Amit Langote v3-0001-Disallow-partitionwise-grouping-when-collation-do.patch Description: Binary data v3-0002-Disallow-partitionwise-join-when-collation-doesn-.patch Description: Binary data
Re: Statistics Import and Export
> > > (c) we are considering whether to use an in-place heap update for the > relation stats, so that a large restore doesn't bloat pg_class -- I'd > like feedback on this idea > I'd also like feedback, though I feel very strongly that we should do what ANALYZE does. In an upgrade situation, nearly all tables will have stats imported, which would result in an immediate doubling of pg_class - not the end of the world, but not great either. Given the recent bugs associated with inplace updates and race conditions, if we don't want to do in-place here, we should also consider getting rid of it for ANALYZE. I briefly pondered if it would make sense to vertically partition pg_class into the stable attributes and the attributes that get modified in-place, but that list is pretty long: relpages, reltuples, relallvisible, relhasindex, reltoastrelid, relhasrules, relhastriggers, relfrozenxid, and relminmxid. If we don't want to do inplace updates in pg_restore_relation_stats(), then we could mitigate the bloat with a VACUUM FULL pg_class at the tail end of the upgrade if stats were enabled. > pg_restore_*_stats() functions. But there's a lot of overlap, so it may > be worth discussing again whether we should only have one set of > functions. > For the reason of in-place updates and error tolerance, I think they have to remain separate functions, but I'm also interested in hearing others' opinions.
Re: Count and log pages set all-frozen by vacuum
Thanks for taking a look, Alastair!

On Thu, Oct 31, 2024 at 5:47 AM Alastair Turner wrote:
>
> The values returned in a case pages are removed (cases where the empty pages
> are at the end of the table) are a bit confusing.
>
> In an example similar to yours, but with a normal vacuum operation, since
> that seems to be the most useful case for this feature:
>
> CREATE TABLE the_table (first int, second int) WITH (autovacuum_enabled = false);
> INSERT INTO the_table SELECT generate_series(1,1000), 1;
> DELETE FROM the_table WHERE first > 50;
> VACUUM (VERBOSE) the_table;
>
> pages: 4 removed, 1 remain, 5 scanned (100.00% of total)
> tuples: 950 removed, 50 remain, 0 are dead but not yet removable
> removable cutoff: 763, which was 1 XIDs old when operation ended
> new relfrozenxid: 761, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen. 4 pages set all-frozen in the VM
>
> It looks like 4 pages out of 1 are set all-frozen. So there are -3 to scan
> for the next aggressive vacuum? The four pages which were set to all frozen
> are the same four which have been removed from the end of the table.
>
> For an example where the empty pages which are marked all frozen are at the
> start of the table (deleting values < 950 in the example), the results are
> more intuitive
>
> pages: 0 removed, 5 remain, 5 scanned (100.00% of total)
> tuples: 949 removed, 51 remain, 0 are dead but not yet removable
> removable cutoff: 768, which was 0 XIDs old when operation ended
> new relfrozenxid: 766, which is 1 XIDs ahead of previous value
> frozen: 0 pages from table (0.00% of total) had 0 tuples frozen. 4 pages set all-frozen in the VM
>
> Can the pages which are removed from the end of the table not be counted
> towards those set all frozen to make the arithmetic a bit more intuitive for
> this edge case?

This is a good point. It could be confusing to see that 1 page remains but 4
were set all-frozen in the VM.

From the perspective of the code, this is because each page is set
all-frozen/all-visible in the VM after it is pruned or vacuumed. Truncating
of the end of the table happens at the end of vacuum -- after all pages have
been processed. So, these pages are set all-frozen in the VM.

I actually don't see a good way that we could accurately decrement the
count. We have LVRelState->removed_pages but we have no idea which of those
pages are all-frozen. We could have visibilitymap_prepare_truncate() count
and return to RelationTruncate() how many of the truncated pages were
all-frozen. But we have no way of knowing how many of those pages were newly
frozen by this vacuum. And if we try to track it from the other direction,
when freezing pages, we would have to keep track of all the block numbers of
pages in the relation that were empty and set frozen and then when
truncating the relation find the overlap. That sounds hard and expensive.

It seems a better solution would be to find a way to document it or phrase
it clearly in the log. It is true that these pages were set all-frozen in
the VM. It is just that some of them were later removed. And we don't know
which ones those are. Is there a way to make this clear? And, if not, is it
worse than not having the feature at all?

- Melanie
Re: Test to dump and restore objects left behind by regression
Michael Paquier writes: > On my laptop, testing the plain format adds roughly 12s, in a test > that now takes 1m20s to run vs 1m32s. Enabling regress_dump_formats > and adding three more formats counts for 45s of runtime. For a test > that usually shows up as the last one to finish for a heavily > parallelized run. So even the default of "plain" is going to be > noticeable, I am afraid. Yeah, that's what I've been afraid of from the start. There's no way that this will buy us enough new coverage to justify that sort of addition to every check-world run. I'd be okay with adding it in a form where the default behavior is to do no additional checking. Whether that's worth maintaining is hard to say though. regards, tom lane
Re: Proper object locking for GRANT/REVOKE
Hi, On Mon, Oct 28, 2024 at 04:20:42PM +0100, Peter Eisentraut wrote: > This patch started out as a refactoring, thinking that objectNamesToOids() > in aclchk.c should really mostly be a loop around get_object_address(). > This is mostly true, with a few special cases because the node > representations are a bit different in some cases, and OBJECT_PARAMETER_ACL, > which is obviously very different. This saves a bunch of duplicative code, > which is nice. > > Additionally, get_object_address() handles locking, which > objectNamesToOids() somewhat famously does not do, and there is a code > comment about it. With this refactoring, we get the locking pretty much for > free. Thanks for the patch, this refactoring makes sense to me. A few random comments: 1 === + default: I like the idea of using default as the first "case" as a way to emphasize that this is the most "common" behavior. 2 === Nit + address = get_object_address(objtype, lfirst(cell), &relation, lockmode, false); + Assert(relation == NULL); Worth to explain why we do expect relation to be NULL here? (the comment on top of get_object_address() says it all, but maybe a few words here could be worth it). 3 === - - /* Used GRANT DOMAIN on a non-domain? */ - if (istmt->objtype == OBJECT_DOMAIN && - pg_type_tuple->typtype != TYPTYPE_DOMAIN) - ereport(ERROR, - (errcode(ERRCODE_WRONG_OBJECT_TYPE), -errmsg("\"%s\" is not a domain", - NameStr(pg_type_tuple->typname; Yeah, get_object_address()->get_object_address_type() does take care of it. 4 === > > Interestingly, this changes the output of the intra-grant-inplace isolation > test, which is recent work. It might be good to get some review from those > who worked on that, to make sure that the new behavior is still correct > and/or to check whether those test cases are still applicable. -s4: WARNING: got: cache lookup failed for relation REDACTED +s4: WARNING: got: relation "intra_grant_inplace" does not exist I did not work on this but out of curiosity I looked at it, and IIUC this change comes from: - relOid = RangeVarGetRelid(relvar, NoLock, false); + relOid = RangeVarGetRelid(relvar, lockmode, false); So without the lock, the error is coming from ExecGrant_Relation() that way: [local] DO(+0x22529b) [0x5c56afdb629b] [local] DO(+0x2220b0) [0x5c56afdb30b0] [local] DO(ExecuteGrantStmt+0x696) [0x5c56afdb3047] [local] DO(+0x6dc8cb) [0x5c56b026d8cb] [local] DO(standard_ProcessUtility+0xd38) [0x5c56b026b9da] [local] DO(ProcessUtility+0x13a) [0x5c56b026ac9b] [local] DO(+0x460073) [0x5c56afff1073] With the lock, the error comes from RangeVarGetRelidExtended() that way: [local] DO(RangeVarGetRelidExtended+0x4e6) [0x5a1c3ac49d36] [local] DO(+0x2223fe) [0x5a1c3ac2a3fe] [local] DO(ExecuteGrantStmt+0x111) [0x5a1c3ac29ac2] [local] DO(+0x6dc1d2) [0x5a1c3b0e41d2] [local] DO(standard_ProcessUtility+0xd38) [0x5a1c3b0e22e1] [local] DO(ProcessUtility+0x13a) [0x5a1c3b0e15a2] That's due to (in RangeVarGetRelidExtended()): " * But if lockmode = NoLock, then we assume that either the caller is OK * with the answer changing under them, or that they already hold some * appropriate lock, and therefore return the first answer we get without * checking for invalidation messages. " So, in the RangeVarGetRelid(relvar, NoLock, false) case (without the patch) then if we are able to reach "relId = RelnameGetRelid(relation->relname);" before the drop gets committed then we get the "cache lookup failed" error. 
But if we are slow enough so that we don't reach "relId = RelnameGetRelid(relation->relname);" before the drop get committed then we would also get the "relation "intra_grant_inplace" does not exist" error (tested manually, attaching gdb on s4 and breakpoint before the RelnameGetRelid(relation->relname) call in RangeVarGetRelidExtended()). The fact that we use lockmode != NoLock in the patch produces a lock followed by a "retry" in RangeVarGetRelidExtended() and so we get the "relation "intra_grant_inplace" does not exist" error. I think that the new behavior is still correct and in fact is more "appropriate" ( I mean that's the kind of error I expect to see from a user point of view). Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Separate memory contexts for relcache and catcache
Hi Jeff,

Jeff Davis wrote on Wed, 30 Oct 2024 at 01:00:

> On Wed, 2024-04-03 at 16:12 +0300, Melih Mutlu wrote:
> > Rebased. PSA.
>
> Thank you. I missed your patch and came up with a similar patch over
> here:
>
> https://www.postgresql.org/message-id/flat/78599c442380ddb5990117e281a4fa65a74231af.ca...@j-davis.com
>
> I closed my thread and we can continue this one.

Thanks for being interested in this patch. I simply merged your patch and
mine, which was pretty easy as your patch is quite similar to mine but
covers many more caches.

> One difference is that I tried to capture almost all uses of
> CacheMemoryContext so that it would become just a parent context
> without many allocations of its own.
>
> The plan cache and SPI caches can be important, too. Or, one of the
> other caches that we expect to be small might grow in some edge cases
> (or due to a bug), and it would be good to be able to see that.

My only concern would be allocating too much memory for each cache type
unnecessarily. Especially for the ones that are expected to be small most of
the time, would we see cases where we waste too much memory when the number
of backends increases? Or maybe having separate contexts for each cache
wouldn't hurt at all. I don't really have enough knowledge about each cache
type.

I only quickly checked the memory usage right after starting a backend to
ensure that the patch does not slow down backend starts. We now have an
additional TypCacheContext, which was previously using CacheMemoryContext,
with 8KB init size. I'm not sure how much we should be worried about this
additional 8KB. I think we can reduce the initial size of CacheMemoryContext
instead, assuming that CacheMemoryContext wouldn't have many allocations of
its own anymore.

> I agree with others that we should look at changing the initial size or
> type of the contexts, but that should be a separate commit.

Attached a separate patch to change initial sizes for relcache and catcache
contexts as they grow large from the start. This was suggested in the thread
previously [1]. Also changed CacheMemoryContext to use
ALLOCSET_START_SMALL_SIZES, so it starts from 1KB.

Regards,
--
Melih Mutlu
Microsoft

v4-0002-Adjusting-cache-memory-context-sizes.patch Description: Binary data
v4-0001-Separate-memory-contexts-for-caches.patch Description: Binary data
Re: Inval reliability, especially for inplace updates
Hello Noah, 31.10.2024 04:39, Noah Misch wrote: I had pushed this during the indicated week, before your mail. Reverting it is an option. Let's see if more opinions arrive. I've accidentally discovered an incorrect behaviour caused by commit 4eac5a1fa. Running this script: for ((j=1; j<=100; j++)); do echo "iteration $j" cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration $j"; break; } SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g \gexec SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g \gexec EOF done with autovacuum = on autovacuum_naptime = 1s autovacuum_vacuum_threshold = 1 autovacuum_analyze_threshold = 1 in parallel using separate servers (the full script is attached), like: parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40` I can catch the following: ... 3 hanged on iteration 51 ... 19 hanged on iteration 64 ... 39 hanged on iteration 99 And after the script run, I see the server processes hanging: law 1081433 1 0 16:22 ? 00:00:00 .../usr/local/pgsql/bin/postgres law 1081452 1081433 0 16:22 ? 00:00:00 postgres: checkpointer law 1081453 1081433 0 16:22 ? 00:00:00 postgres: background writer law 1081460 1081433 0 16:22 ? 00:00:00 postgres: walwriter law 1081462 1081433 0 16:22 ? 00:00:00 postgres: autovacuum launcher law 1081464 1081433 0 16:22 ? 00:00:00 postgres: logical replication launcher law 1143065 1081433 0 16:32 ? 00:00:00 postgres: postgres postgres [local] CREATE TABLE law 1143263 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres law 1143320 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres law 1143403 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker Attaching to process 1143065 ... (gdb) bt #0 __futex_abstimed_wait_common64 (private=, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=) at ./nptl/futex-internal.c:139 #3 0x7feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111 #4 0x7feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183 #5 0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327 #6 0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318 #7 0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182 #8 0x561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146 ... (the full backtrace is attached) All three autovacuum workers (1143263, 1143320, 1143403) are also waiting for the same buffer lock: #5 0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327 #6 0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318 #7 0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182 Probably, this can be reproduced with VACUUM pg_class/pg_type/..., but I haven't found out the exact combination needed yet. Also as a side note, these processes can't be terminated with SIGTERM, I have to kill them. 
Initially I saw this on a slowed down VM, but with the attached patch applied I could reproduce it on my workstation too. Best regards, Alexander reproi.sh Description: application/shellscript diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 2168259247..730ef9b5a2 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -138,6 +138,10 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf); } +uint64 r = 0; +for (uint64 i = 0; i < 10; i++) r += (r ^ i); +Assert(r != 0); + /* Obtain share-lock on the buffer so we can examine visibility */ LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE); got_heap_tuple = heap_hot_search_buffer(tid, (gdb) bt #0 __futex_abstimed_wait_common64 (private=, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b
Re: Consider the number of columns in the sort cost model
On 29.10.2024 05:47, Andrei Lepikhov wrote: On 10/28/24 16:48, Alena Rybakina wrote: On 23.10.2024 04:39, Andrei Lepikhov wrote: On 15/10/2024 12:15, David Rowley wrote: And the last patch is a demo of how I'm going to use the previous three patches and add one more strategy to improve the order of columns in the GROUP-BY clause. To be honest, I didn’t find information about this in the code, but did I understand everything correctly? Yes Thanks for the confirmation. 2. I noticed that statistics of distinct values are calculated several times. Maybe I miss something, but this can be avoided if these statistics can be saved after calculation. For example, I saw that it is possible that you calculate the same statistic information for the same equivalence members in cost_incremental_sort and identify_sort_ecmember. Is it possible to store information in a structure and use it later? Hmm, I don't see multiple calculations. em_distinct has made specifically for this sake. Can you provide specific case and code lines? When you said that, I saw it. You are right, checking for 0 is what stops us from useless recalculation. I noticed it right away. 3. I think you should initialize the variable ndist in your patch. I faced the compile complaining during your code compilation. costsize.c: In function ‘identify_sort_ecmember’: costsize.c:6694:42: error: ‘ndist’ may be used uninitialized [-Werror=maybe-uninitialized] 6694 | em->em_ndistinct = ndist; | ~^ costsize.c:6680:33: note: ‘ndist’ was declared here 6680 | double ndist; | ^~~ cc1: all warnings being treated as errors gmake[4]: *** [: costsize.o] Error 1 I think you can just update your compiler. But I added the ndist initialisation to make more compilers happy :). Thank you :) + Assert (node != NULL); + examine_variable(root, node, 0, &vardata); if (!HeapTupleIsValid(vardata.statsTuple)) continue; I don't think so. At least until you provide the case when the get_sortgroupclause_expr function returns NULL. That's more, remember - the patch 0004 here is just to show the perspective and still under construction. Anyway, thanks, I found out that the patch set doesn't apply correctly because of 828e94c. So, see the new version in the attachment. Yes, you are right, if this case is possible, then it is connected with an error in the formation of the target list. I played around with the examples a bit and couldn't figure out something. When I added the same values to different columns - firstly in a, later in b, the order of the columns for sort operation doesn't change. Isn't this a mistake? 
create table a (x1 int, y1 int); create table b (x2 int, y2 int); insert into a values (NULL, NULL); insert into a values (NULL, 1); insert into a values (1, 1); insert into a values (1, NULL); create index a_x1_idx on a(x1); create index b_x2_idx on b(x2); create index a_y1_idx on a(y1); create index b_y2_idx on b(y2); insert into b select 1, 2 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 1, 3 from generate_series(3,30) as id; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN Hash Right Join (cost=44.99..48.15 rows=5 width=12) (actual time=0.225..0.250 rows=8 loops=1) Hash Cond: (b.x2 = a.x1) -> HashAggregate (cost=43.90..46.16 rows=226 width=8) (actual time=0.117..0.123 rows=3 loops=1) Group Key: b.x2, b.y2 Batches: 1 Memory Usage: 40kB -> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (actual time=0.030..0.044 rows=48 loops=1) -> Hash (cost=1.04..1.04 rows=4 width=4) (actual time=0.073..0.074 rows=4 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.047..0.051 rows=4 loops=1) Planning Time: 1.649 ms Execution Time: 0.485 ms (11 rows) delete from b; insert into b select 2, 1 from generate_series(11,20) as id; insert into b select 1, 1 from generate_series(1,10) as id; insert into b select 3, 1 from generate_series(3,30) as id; vacuum analyze; explain analyze select a.x1, s.x2, s.y2 from a left join (select distinct * from b) s on a.x1=s.x2; QUERY PLAN --- Hash Left Join (cost=1.79..2.86 rows=4 width=12) (actual time=0.083..0.090 rows=4 loops=1) Hash Cond: (a.x1 = b.x2) -> Seq Scan on a (cost=0.00..1.04 rows=4 width=4) (actual time=0.010..0.0
Re: Having problems generating a code coverage report
Hi,

> On Wed, Oct 30, 2024 at 05:26:49PM -0400, Peter Geoghegan wrote:
> > I use Debian unstable for most of my day to day work. Apparently
> > Debian unstable has exactly the same version of lcov as Ubuntu 24.04.
> >
> > I've also been unable to generate coverage reports for some time (at
> > least on Debian, with LCOV version 2.0-1).
>
> So do I, for both the debian SID and the lcov parts.

I downgraded to lcov 1.16 [1] and it helped. This is merely a workaround of
course, not a long-term solution.

[1]: https://github.com/linux-test-project/lcov/releases/tag/v1.16

--
Best regards,
Aleksander Alekseev
Re: protocol-level wait-for-LSN
On Thu, 31 Oct 2024 at 13:59, Jesper Pedersen wrote:
> And, it will be confusing to clean-room implementations: When you have
> this startup parameter then you get these message formats, when you have
> this startup parameter then you get these message formats -- and what
> about combinations ? Like Tatsuo-san stated up-thread.

I really don't understand why you think that's so difficult. To be clear, no
client is forced to implement any of these protocol options. And all of
these protocol options would be documented in the official protocol docs.
For instance the ReadyForQuery docs on the "Message Formats" page in the
docs could easily be made to look like the following, which imho would be
very clear to any implementer of the protocol about ordering of these
fields:

ReadyForQuery (B)

    Byte1('Z')
        Identifies the message type. ReadyForQuery is sent whenever the
        backend is ready for a new query cycle.

    Int32
        Length of message contents in bytes, including self.

    Int64: Only present if protocol option wait_for_lsn is set to 1 by the client
        The LSN at time of commit

    Int64: Only present if protocol option query_time is set to 1 by the client
        Time it took to run the query in microseconds

    Byte1
        Current backend transaction status indicator. Possible values are
        'I' if idle (not in a transaction block); 'T' if in a transaction
        block; or 'E' if in a failed transaction block (queries will be
        rejected until block is ended).

> You are also assuming that all PostgreSQL protocol implementations uses
> the Length (Int32) field very strict - so when one developer adds the
> startup parameter, but doesn't change the underlying implementation
> everything will break.

Yes... But that seems equivalent to saying: If a developer of a Postgres
client advertises that they support protocol v4, but don't actually
implement it, then everything will break. i.e. it's the job of the client
author to not send protocol options that it doesn't know anything about.
Just like it's the job of the client author not to request versions that it
does not know anything about.

> The protocol format must be 100% clear and well-documented in all cases.

Agreed. See above.

> If this door is open then it has to very clear how multiple startup
> parameters are handled at the protocol level, and that is a much bigger
> fish because what happens if extensions add startup parameters as well.

Postgres extensions **cannot** add such startup parameters. Heikki already
mentioned that the naming was confusing in the docs. At this point in time
we're only discussing protocol changes that are coming from Postgres core
(which is already a contentious enough topic).
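To make the field-ordering rule concrete, here is a minimal client-side
sketch (hypothetical helper names, not libpq code, and the optional-field
layout is only the draft sketched above): the client knows which optional
Int64 fields to expect in the ReadyForQuery body because it knows which
protocol options it requested.

/*
 * Sketch only: parse the body of the hypothetical ReadyForQuery layout
 * above (after the 'Z' type byte and the Int32 length have been consumed).
 * Optional Int64 fields appear, in documented order, only when the client
 * requested the corresponding protocol option.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t
get_uint64_be(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)			/* network byte order */
		v = (v << 8) | p[i];
	return v;
}

static void
parse_ready_for_query(const uint8_t *body, bool wait_for_lsn, bool query_time)
{
	if (wait_for_lsn)
	{
		printf("commit LSN: %llu\n", (unsigned long long) get_uint64_be(body));
		body += 8;
	}
	if (query_time)
	{
		printf("query time: %llu us\n", (unsigned long long) get_uint64_be(body));
		body += 8;
	}
	printf("transaction status: %c\n", (char) *body);
}

int
main(void)
{
	/* fake body: Int64 LSN = 42, Int64 duration = 7, status byte 'I' */
	uint8_t		body[17] = {0};

	body[7] = 42;
	body[15] = 7;
	body[16] = 'I';
	parse_ready_for_query(body, true, true);
	return 0;
}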
Re: UUID v7
> On 1 Nov 2024, at 03:00, Masahiko Sawada wrote:
>
> Therefore, if the system clock moves backward due to NTP, we cannot
> guarantee monotonicity and sortability. Is that right?

Not exactly. Monotonicity is ensured for a given backend. We make sure that
timestamp is advanced at least for ~250ns forward on each UUID generation.
60 bits of time are unique and ascending for a given backend.

Thanks!

Best regards, Andrey Borodin.
Re: Pgoutput not capturing the generated columns
On Thu, Oct 31, 2024 at 3:16 AM vignesh C wrote: > Thanks for committing this patch, here is a rebased version of the > remaining patches. > Hi Vignesh. Here are my review comments for the docs patch v1-0002. == Commit message 1. This patch updates docs to describe the new feature allowing replication of generated columns. This includes addition of a new section "Generated Column Replication" to the "Logical Replication" documentation chapter. ~ That first sentence was correct previously when this patch contained *all* the gencols documentation, but now some of the feature docs are already handled by previous patches, so the first sentence can be removed. Now patch 0002 is only for adding the new chapter, plus the references to it. ~ /This includes addition of a new section/This patch adds a new section/ == doc/src/sgml/protocol.sgml 2. - Next, one of the following submessages appears for each column (except generated columns): + Next, one of the following submessages appears for each column: AFAIK this simply cancels out a change from the v1-0001 patch which IMO should have not been there in the first place. Please refer to my v1-0001 review for the same. == Kind Regards, Peter Smith. Fujitsu Australia
Re: define pg_structiszero(addr, s, r)
On Thu, 31 Oct 2024 at 19:17, Michael Paquier wrote:
> Seems kind of OK seen from here. I am wondering if others more
> comments about the name of the macro, its location, the fact that we
> still have pagebytes in bufpage.c, etc.

This change looks incorrect:

@@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 	}

 	/* Check all-zeroes case */
-	all_zeroes = true;
 	pagebytes = (size_t *) page;
-	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
-	{
-		if (pagebytes[i] != 0)
-		{
-			all_zeroes = false;
-			break;
-		}
-	}
-
-	if (all_zeroes)
+	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
 		return true;

I think this should be passing BLCKSZ rather than (BLCKSZ /
sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
zero rather than the entire page.

However, I've also performance concerns as if I look at the assembly
of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
at a time:

(gcc 11.4)

	leaq	1024(%rbx), %rdx	<-- 1KB bug
	.p2align 4,,10
	.p2align 3
.L60:
	cmpb	$0, (%rax)		<-- check 1 byte is zero.
	jne	.L59
	addq	$1, %rax		<-- increment by 1 byte.
	cmpq	%rax, %rdx		<-- check if we've done 1024 bytes yet.
	jne	.L60

Whereas previously it was doing:

	movq	%rbx, %rax
	leaq	8192(%rbx), %rdx	<-- 8KB
	jmp	.L60
	.p2align 4,,10
	.p2align 3
.L90:
	addq	$8, %rax		<-- increment by 8 bytes (or sizeof(size_t))
	cmpq	%rax, %rdx
	je	.L61
.L60:
	cmpq	$0, (%rax)		<-- checks an entire 8 bytes are zero.

I didn't test how performance-critical this is, but the header comment
for this function does use the words "cheaply detect".

 * This is called when a page has just been read in from disk.  The idea is
 * to cheaply detect trashed pages before we go nuts following bogus line
 * pointers, testing invalid transaction identifiers, etc.

so it seems a bit dangerous to switch this loop to byte-at-a-time
rather than doing 8 bytes at once without testing the performance
isn't affected.

David
Re: Showing applied extended statistics in explain Part 2
Hi All, I apologize for not being able to continue development due to various circumstances. The attached file is the rebased patch. I will now catch up on the discussion and try to revise the patch. Regards, Tatsuro Yamada On Fri, Jul 19, 2024 at 7:17 PM wrote: > > On 7/18/24 12:37, masahiro.ik...@nttdata.com wrote: > > >> Let me share my opinion on those questions ... > > > ...> > > >> For ndistinct, I think we don't show this because it doesn't go > > >> through clauselist_selectivity, which is the only thing I modified in > the PoC. > > >> But I guess we might improve estimate_num_groups() to track the stats > > >> in a similar way, I guess. > > > > > > Thanks. IIUC, the reason is that it doesn't go through > > > statext_clauselist_selectivity() because the number of clauses is one > though it goes > > through clauselist_selectivity(). > > > > > > > Ah, I see I misunderstood the original report. The query used was > > > > EXPLAIN (STATS, ANALYZE) SELECT * FROM t3 > > WHERE date_trunc('month', a) = '2020-01-01'::timestamp; > > > > And it has nothing to do with the number of clauses being one neither. > > > > The problem is this estimate is handled by examine_variable() matching > the expression > > to the "expression" stats, and injecting it into the variable, so that > the > > clauselist_selectivity() sees these stats. > > > > This would happen even if you build just expression statistics on each > of the > > date_trunc() calls, and then tried a query with two clauses: > > > > CREATE STATISTICS s4 ON date_trunc('day', a) FROM t3; > > CREATE STATISTICS s3 ON date_trunc('month', a) FROM t3; > > > > EXPLAIN SELECT * FROM t3 > > WHERE date_trunc('month', a) = '2020-01-01'::timestamp > > AND date_trunc('day', 'a') = '2020-01-01'::timestamp; > > > > Not sure how to handle this - we could remember when explain_variable() > injects > > statistics like this, I guess. But do we know that each call to > > examine_variable() is for estimation? And do we know for which clause? > > I see. The issue is related to extended statistics for single expression. > As a > first step, it's ok for me that we don't support it. > > The below is just an idea to know clauses... > Although I'm missing something, can callers of examine_variable() > for estimation to rebuild the clauses from partial information of "OpExpr"? > > Only clause_selectivity_ext() knows the information of actual full clauses. > But we don't need full information. It's enough to know the information > to show "OpExpr" for EXPLAIN. > > get_oper_expr() deparse "OpExpr" using only the operator oid and arguments > in get_oper_expr(). > > If so, the caller to estimate, for example eqsel_internal(), > scalarineqsel_wrapper() > and so on, seems to be able to know the "OpExpr" information, which are > operator > oid and arguments, and used extended statistics easily to show for EXPLAIN. > > # Memo: the call path of the estimation function > caller to estimate selectivity > (eqsel_internal()/scalargtjoinsel_wrappter()/...) > -> get_restriction_variable()/get_join_valiables() >-> examine_variable() > > > > >>> ERROR: unrecognized node type: 268 > > > > > > Regarding the above error, do "applied_stats" need have the list of > "StatisticExtInfo" > > > because it's enough to have the list of Oid(stat->statOid) for EXPLAIN > output in the > > current patch? > > > change_to_applied_stats_has_list_of_oids.diff is the change I assumed. > > > Do you have any plan to show extra information for example "kind" of > > "StatisticExtInfo"? 
> > > > > > The above is just one idea came up with while I read the following > > > comments of header of pathnodes.h, and to support copy > "StatisticExtInfo" will leads > > many other nodes to support copy. > > > * We don't support copying RelOptInfo, IndexOptInfo, or Path nodes. > > > * There are some subsidiary structs that are useful to copy, though. > > > > > > > I do think tracking just the OID would work, because we already know how > to copy List > > objects. But if we want to also track the clauses, we'd have to keep > multiple lists, right? > > That seems a bit inconvenient. > > Understood. In future, we might show not only the applied_clauses but also > the clauses of > its extended statistics (StatisticExtInfo->exprs). > > > > > By the way, I found curios result while I tested with the above patch. > It shows same > > "Ext Stats" twice. > > > I think it's expected behavior because the stat is used when estimate > the cost of > > "Partial HashAggregate" and "Group". > > > I've shared the result because I could not understand soon when I saw > > > it first time. I think it's better to let users understand when the > stats are used, but I > > don't have any idea now. > > > > > > -- I tested with the example of CREATE STATISTICS documentation. > > > psql=# EXPLAIN (STATS, ANALYZE) SELECT date_trunc('month', a), > date_trunc('day', > > a) FROM t3 GROUP BY 1, 2; > > >
Re: Eager aggregation, take 3
On Thu, Oct 31, 2024 at 12:25 AM Robert Haas wrote: > Well, the key thing here is the reasoning, which you don't really > spell out either here or in the patch. The patch's justification for > the use of BTEQUALIMAGE_PROC Is that, if we use non-equal-image > operators, "we may lose some information that could be needed to > evaluate upper qual clauses." But there's no explanation of why that > would happen, and I think that makes it difficult to have a good > discussion. > > Here's an example from the optimizer README: > > # 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) > # = A leftjoin (B leftjoin C on (Pbc)) on (Pab) > # > # Identity 3 only holds if predicate Pbc must fail for all-null B rows > # (that is, Pbc is strict for at least one column of B). If Pbc is not > # strict, the first form might produce some rows with nonnull C columns > # where the second form would make those entries null. > > This isn't the clearest justification for a rule that I've ever read, > but it's something. Someone reading this can think about whether the > two sentences of justification are a correct argument that Pbc must be > strict in order for the identity to hold. > > I think you should be trying to offer similar (or better) > justifications, and perhaps similar identities as well. It's a little > bit hard to think about how to write the identities for this patch, > although you could start with aggregate(X) = > finalizeagg(partialagg(X)) and then explain how the partialagg can be > commuted with certain joins and what is required for it to do so. The > argument needs to be detailed enough that we can evaluate whether it's > correct or not. > > Personally, I'm still stuck on the point I wrote about yesterday. When > you partially aggregate a set of rows, the resulting row serves as a > proxy for the original set of rows. So, all of those rows must have > the same destiny. As I mentioned then, if you ever partially aggregate > a row that should ultimately be filtered out with one that shouldn't, > that breaks everything. But also, I need all the rows that feed into > each partial group to be part of the same set of output rows. For > instance, suppose I run a report like this: > > SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l > WHERE o.order_id = l.order_id GROUP BY 1; > > If I'm thinking of executing this as FinalizeAgg(order JOIN > PartialAgg(order_lines)), what must be true in order for that to be a > valid execution plan? I certainly can't aggregate on some unrelated > column, say part_number. If I do, then first of all I don't know how > to get an order_id column out so that I can still do the join. But > even if that were no issue, you might have two rows with different > values of the order_id column and the same value for part_number, and > that the partial groups that you've created on the order_lines table > don't line up properly with the groups that you need to create on the > orders table. Some particular order_id might need some of the rows > that went into a certain part_number group and not others; that's no > good. > > After some thought, I think the process of deduction goes something > like this. We ultimately need to aggregate on a column in the orders > table, but we want to partially aggregate the order_lines table. So we > need a set of columns in the order_lines table that can act as a proxy > for o.order_number. Our proxy should be all the columns that appear in > the join clauses between orders and order_lines. 
That way, all the > rows in a single partially aggregate row will have the "same" order_id > according to the operator class implementing the = operator, so for a > given row in the "orders" table, either every row in the group has > o.order_id = l.order_id or none of them do; hence they're all part of > the same set of output rows. > > It doesn't matter, for example, whether o.order_number is unique. If > it isn't, then we'll flatten together some different rows from the > orders table -- but each of those rows will match all the rows in > partial groupings where o.order_id = partially_grouped_l.order_id and > none of the rows where that isn't the case, so the optimization is > still valid. Likewise it doesn't matter whether o.order_id is unique: > if order_id = 1 occurs twice, then both of the rows will match the > partially aggregated group with order_id = 1, but that's fine. The > only thing that's a problem is if the same row from orders was going > to match some but not all of the partially aggregate rows from some > order_lines group, and that won't happen here. > > Now consider this example: > > SELECT o.order_number, SUM(l.quantity) FROM orders o, order_lines l > WHERE o.order_id = l.order_id AND o.something < l.something GROUP BY > 1; > > Here, we cannot partially group order_lines on just l.order_id, > because we might have a row in the orders table with order_id = 1 and > something = 1 and two rows in the order_lines table that both have
Re: define pg_structiszero(addr, s, r)
On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Ugh, Friday brain fart. The attached should be able to fix that, this
brings back the movl to its correct path:
-	movl	$1024, %esi
+	movl	$8192, %esi

Does that look fine to you?

> I didn't test how performance-critical this is, but the header comment
> for this function does use the words "cheaply detect".

Under gcc -O2 or -O3, the single-byte check and the 8-byte check don't make
a difference. Please see the attached (allzeros.txt) for a quick check if
you want to check by yourself. Both average around 3ms for 1M iterations on
my laptop (not the fastest thing around).

Under -O0, though, the difference is noticeable:
- 1-byte check: 3.52s for 1M iterations, averaging one check at 3.52ns.
- 8-byte check: 0.46s for 1M iterations, averaging one check at 0.46ns.

Even for that, I doubt that this is going to be noticeable in practice,
still the difference exists.
--
Michael

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192
#define LOOPS 1000000 /* 1M */

static inline bool
allzeros_char(const void *ptr, size_t len)
{
	const char *p = (const char *) ptr;

	for (size_t i = 0; i < len; i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

static inline bool
allzeros_size_t(const void *ptr, size_t len)
{
	const size_t *p = (const size_t *) ptr;

	for (size_t i = 0; i < len / sizeof(size_t); i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

int
main()
{
	size_t pagebytes[BLCKSZ] = {0};

	for (int i = 0; i < LOOPS; i++)
	{
		allzeros_char(pagebytes, BLCKSZ);
		//allzeros_size_t(pagebytes, BLCKSZ);
	}

	return 0;
}

diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index 5ee1e58cd4..db5954def2 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -88,7 +88,6 @@ bool
 PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 {
 	PageHeader	p = (PageHeader) page;
-	size_t	   *pagebytes;
 	bool		checksum_failure = false;
 	bool		header_sane = false;
 	uint16		checksum = 0;
@@ -124,9 +123,7 @@ PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags)
 	}
 
 	/* Check all-zeroes case */
-	pagebytes = (size_t *) page;
-
-	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
+	if (pg_memory_is_all_zeros(page, BLCKSZ))
 		return true;
 
 	/*
Re: Eager aggregation, take 3
On Thu, Oct 31, 2024 at 9:16 PM jian he wrote: > > hi. > still trying to understand v13. found a bug. > > minimum test : > drop table if exists t1, t2; > CREATE TABLE t1 (a int, b int, c int); > CREATE TABLE t2 (a int, b int, c int); > SET enable_eager_aggregate TO on; > explain(costs off, settings) SELECT avg(t2.a), t1.c FROM t1 JOIN t2 ON > t1.b = t2.b GROUP BY t1.c having grouping(t1.c) > 0; > > havingQual can have GroupingFunc. > if that happens, then segmentation fault. Nice catch. Thanks. create_agg_clause_infos does check for GROUPING() expressions, but not in the right place. Will fix it in the next version. Thanks Richard
Re: UUID v7
On Thu, Oct 31, 2024 at 9:53 PM Andrey M. Borodin wrote:
>
> > On 1 Nov 2024, at 03:00, Masahiko Sawada wrote:
> >
> > Therefore, if the system clock moves backward due to NTP, we cannot
> > guarantee monotonicity and sortability. Is that right?
>
> Not exactly. Monotonicity is ensured for a given backend. We make sure that
> timestamp is advanced at least for ~250ns forward on each UUID generation.
> 60 bits of time are unique and ascending for a given backend.
>

Thank you for your explanation. I now understand this code guarantees the
monotonicity:

+/* minimum amount of ns that guarantees step of increased_clock_precision */
+#define SUB_MILLISECOND_STEP (1000000/4096 + 1)
+	ns = get_real_time_ns();
+	if (previous_ns + SUB_MILLISECOND_STEP >= ns)
+		ns = previous_ns + SUB_MILLISECOND_STEP;
+	previous_ns = ns;

I think that one of the most important parts in UUIDv7 implementation is
which method (1, 2, or 3 described in RFC 9562) we use to guarantee the
monotonicity. The current patch employs method 3 with the assumption that
12 bits of sub-millisecond information is available on most of the systems
we support. However, as far as I tested, on MacOS, values returned by
clock_gettime(CLOCK_REALTIME) are only microsecond precision, meaning that
we could waste some randomness. Has this point been considered?

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
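For anyone who wants to check this on their own platform, here is a small
standalone probe (an illustration only, not part of the patch): it reports
both the resolution advertised by clock_getres() and the smallest nonzero
step actually observed from clock_gettime(CLOCK_REALTIME). On a clock that
only ticks in microseconds, the observed deltas are multiples of 1000 ns,
so part of the 12-bit sub-millisecond field that method 3 relies on carries
no real timing information.

/*
 * Standalone probe: report the advertised and observed precision of
 * CLOCK_REALTIME.
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec res;
	struct timespec prev;
	struct timespec cur;
	int64_t		min_step = INT64_MAX;

	clock_getres(CLOCK_REALTIME, &res);
	printf("advertised resolution: %ld s %ld ns\n",
		   (long) res.tv_sec, (long) res.tv_nsec);

	clock_gettime(CLOCK_REALTIME, &prev);
	for (int i = 0; i < 1000000; i++)
	{
		int64_t		delta;

		clock_gettime(CLOCK_REALTIME, &cur);
		delta = (int64_t) (cur.tv_sec - prev.tv_sec) * 1000000000 +
			(cur.tv_nsec - prev.tv_nsec);
		if (delta > 0 && delta < min_step)
			min_step = delta;
		prev = cur;
	}
	printf("smallest observed step: %lld ns\n", (long long) min_step);

	return 0;
}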
Re: define pg_structiszero(addr, s, r)
Hi,

On Fri, Nov 01, 2024 at 05:45:44PM +1300, David Rowley wrote:
> On Thu, 31 Oct 2024 at 19:17, Michael Paquier wrote:
> > Seems kind of OK seen from here. I am wondering if others more
> > comments about the name of the macro, its location, the fact that we
> > still have pagebytes in bufpage.c, etc.
>
> This change looks incorrect:
>
> @@ -126,18 +124,9 @@ PageIsVerifiedExtended(Page page, BlockNumber
> blkno, int flags)
>  	}
>
>  	/* Check all-zeroes case */
> -	all_zeroes = true;
>  	pagebytes = (size_t *) page;
> -	for (i = 0; i < (BLCKSZ / sizeof(size_t)); i++)
> -	{
> -		if (pagebytes[i] != 0)
> -		{
> -			all_zeroes = false;
> -			break;
> -		}
> -	}
> -
> -	if (all_zeroes)
> +	if (pg_memory_is_all_zeros(pagebytes, (BLCKSZ / sizeof(size_t))))
>  		return true;
>
> I think this should be passing BLCKSZ rather than (BLCKSZ /
> sizeof(size_t)), otherwise, it'll just check the first 1 kilobyte is
> zero rather than the entire page.

Thanks for looking at it! Nice catch, indeed by using the new function we
are changing the pointer arithmetic here and then we should pass BLCKSZ
instead.

> However, I've also performance concerns as if I look at the assembly
> of PageIsVerifiedExtended, I see the zero checking is now done 1 byte
> at a time:
>
> (gcc 11.4)
>
> 	leaq	1024(%rbx), %rdx	<-- 1KB bug
> 	.p2align 4,,10
> 	.p2align 3
> .L60:
> 	cmpb	$0, (%rax)		<-- check 1 byte is zero.
> 	jne	.L59
> 	addq	$1, %rax		<-- increment by 1 byte.
> 	cmpq	%rax, %rdx		<-- check if we've done 1024 bytes yet.
> 	jne	.L60
>
> Whereas previously it was doing:
>
> 	movq	%rbx, %rax
> 	leaq	8192(%rbx), %rdx	<-- 8KB
> 	jmp	.L60
> 	.p2align 4,,10
> 	.p2align 3
> .L90:
> 	addq	$8, %rax		<-- increment by 8 bytes (or sizeof(size_t))
> 	cmpq	%rax, %rdx
> 	je	.L61
> .L60:
> 	cmpq	$0, (%rax)		<-- checks an entire 8 bytes are zero.
>
> I didn't test how performance-critical this is, but the header comment
> for this function does use the words "cheaply detect".
>
>  * This is called when a page has just been read in from disk.  The idea is
>  * to cheaply detect trashed pages before we go nuts following bogus line
>  * pointers, testing invalid transaction identifiers, etc.
>
> so it seems a bit dangerous to switch this loop to byte-at-a-time
> rather than doing 8 bytes at once without testing the performance
> isn't affected.

Agree, I did a quick test (see [0]) and it looks like it's significantly
slower using the new inline function.

We could try to write a more elaborate version of pg_memory_is_all_zeros(),
but as it looks like there is only one use case, then it's probably better
to not implement (revert) this change here and "just" add a comment as to
why pg_memory_is_all_zeros() should not be used here, thoughts?

[0]: https://godbolt.org/z/xqnW4MPY5

Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
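For reference, a word-at-a-time variant of the idea is not much code. The
sketch below is a standalone illustration only; the name and structure are
mine, and it is neither the in-tree pg_memory_is_all_zeros() nor a concrete
proposal for it. It compares size_t-sized chunks and falls back to single
bytes only for the tail, so the loop advances eight bytes per comparison by
construction instead of relying on the optimizer to widen a byte loop.

/*
 * Sketch only -- not the in-tree pg_memory_is_all_zeros().
 * Check size_t-sized chunks while enough bytes remain, then finish the
 * remaining tail one byte at a time, so any length is handled.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static inline bool
memory_is_all_zeros_words(const void *ptr, size_t len)
{
	const unsigned char *p = (const unsigned char *) ptr;
	const unsigned char *end = p + len;

	while ((size_t) (end - p) >= sizeof(size_t))
	{
		size_t		chunk;

		/* memcpy sidesteps alignment and aliasing concerns */
		memcpy(&chunk, p, sizeof(chunk));
		if (chunk != 0)
			return false;
		p += sizeof(chunk);
	}

	while (p < end)
	{
		if (*p++ != 0)
			return false;
	}
	return true;
}

int
main(void)
{
	unsigned char page[8192] = {0};

	printf("%d\n", (int) memory_is_all_zeros_words(page, sizeof(page)));	/* 1 */
	page[sizeof(page) - 1] = 1;
	printf("%d\n", (int) memory_is_all_zeros_words(page, sizeof(page)));	/* 0 */
	return 0;
}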
RE: Popcount optimization using AVX512
> Here is an updated patch with this change. LGTM. Raghuveer
Re: MultiXact\SLRU buffers configuration
On Thu, 31 Oct 2024, 17:33 Andrey M. Borodin, wrote:
>
> > On 31 Oct 2024, at 17:29, Thom Brown wrote:
> >
> > On Thu, 31 Oct 2024 at 10:47, Andrey M. Borodin wrote:
> >>
> >>
> >>
> >>> On 29 Oct 2024, at 21:45, Thom Brown wrote:
> >>>
> >>> It clearly checks for interrupts, but when I saw this issue happen, it
> >>> wasn't interruptible.
> >>
> >> Perhaps, that was different multixacts?
> >> When I was observing this pathology on Standby, it was a stream of
> >> different reads encountering different multis.
> >>
> >> Either way startup can cancel locking process on it's own. Or is it the
> >> case that cancel was not enough, did you actually need termination, not
> >> cancel?
> >
> > Termination didn't work on either of the processes.
>
> How did you force the process to actually terminate?
> Did you observe repeated read of the same multixact?
> Was offending process holding any locks while waiting?

Unfortunately I didn't gather much information when it was occurring, and
prioritised getting rid of the process blocking replay. I just attached gdb
to it, got a backtrace and then "print ProcessInterrupts()".

Thom
Re: Popcount optimization using AVX512
On Wed, Oct 30, 2024 at 04:10:10PM -0500, Nathan Bossart wrote: > On Wed, Oct 30, 2024 at 08:53:10PM +, Raghuveer Devulapalli wrote: >> BTW, I just realized function attributes for xsave and avx512 don't work >> on MSVC (see >> https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630). >> Not sure if you care about it. Its an easy fix (see >> https://gcc.godbolt.org/z/Pebdj3vMx). > > Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the > configure programs for meson. pg_attribute_target will be empty on MSVC, > and I believe we only support meson builds there. Here is an updated patch with this change. -- nathan >From 8cf7c08220a9c0a1dec809794af2dfb719981923 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v2 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 60 +- configure| 163 ++- configure.ac | 17 +-- meson.build | 21 ++-- src/Makefile.global.in | 5 - src/include/c.h | 10 ++ src/makefiles/meson.build| 4 +- src/port/Makefile| 12 +- src/port/meson.build | 7 +- src/port/pg_popcount_avx512.c| 86 +- src/port/pg_popcount_avx512_choose.c | 102 - 11 files changed, 175 insertions(+), 312 deletions(-) delete mode 100644 src/port/pg_popcount_avx512_choose.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 10f8c7bd0a..aa90f8ef33 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -700,20 +700,20 @@ undefine([Ac_cachevar])dnl # Check if the compiler supports the XSAVE instructions using the _xgetbv # intrinsic function. # -# An optional compiler flag can be passed as argument (e.g., -mxsave). If the -# intrinsic is supported, sets pgac_xsave_intrinsics and CFLAGS_XSAVE. +# If the intrinsics are supported, sets pgac_xsave_intrinsics. AC_DEFUN([PGAC_XSAVE_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _xgetbv with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [return _xgetbv(0) & 0xe0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_xsave_intrinsics])])dnl +AC_CACHE_CHECK([for _xgetbv], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("xsave"))) +static int xsave_test(void) +{ + return _xgetbv(0) & 0xe0; +}], + [return xsave_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cachevar" = x"yes"; then - CFLAGS_XSAVE="$1" pgac_xsave_intrinsics=yes fi undefine([Ac_cachevar])dnl @@ -725,29 +725,27 @@ undefine([Ac_cachevar])dnl # _mm512_setzero_si512, _mm512_maskz_loadu_epi8, _mm512_popcnt_epi64, # _mm512_add_epi64, and _mm512_reduce_add_epi64 intrinsic functions. # -# Optional compiler flags can be passed as argument (e.g., -mavx512vpopcntdq -# -mavx512bw). If the intrinsics are supported, sets -# pgac_avx512_popcnt_intrinsics and CFLAGS_POPCNT. +# If the intrinsics are supported, sets pgac_avx512_popcnt_intrinsics. 
AC_DEFUN([PGAC_AVX512_POPCNT_INTRINSICS], -[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics_$1])])dnl -AC_CACHE_CHECK([for _mm512_popcnt_epi64 with CFLAGS=$1], [Ac_cachevar], -[pgac_save_CFLAGS=$CFLAGS -CFLAGS="$pgac_save_CFLAGS $1" -AC_LINK_IFELSE([AC_LANG_PROGRAM([#include ], - [const char buf@<:@sizeof(__m512i)@:>@; - PG_INT64_TYPE popcnt = 0; - __m512i accum = _mm512_setzero_si512(); - const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); - const __m512i cnt = _mm512_popcnt_epi64(val); - accum = _mm512_add_epi64(accum, cnt); - popcnt = _mm512_reduce_add_epi64(accum); - /* return computed value, to prevent the above being optimized away */ - return popcnt == 0;])], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx512_popcnt_intrinsics])])dnl +AC_CACHE_CHECK([for _mm512_popcnt_epi64], [Ac_cachevar], +[AC_LINK_IFELSE([AC_LANG_PROGRAM([#include +__attribute__((target("avx512vpopcntdq","avx512bw"))) +static int popcount_test(void) +{ + const char buf@<:@sizeof(__m512i)@:>@; + PG_INT64_TYPE popcnt = 0; + __m512i accum = _mm512_setzero_si512(); + const __m512i val = _mm512_maskz_loadu_epi8((__mmask64) 0xf0f0f0f0f0f0f0f0, (const __m512i *) buf); + const __m512i cnt = _mm512_popcnt_epi64(val); + accum = _mm512_add_epi64(accum, cnt); + popcnt = _mm512_reduce_add_epi64(accum); + return (int) popcnt; +}], + [return popcount_test();])], [Ac_cachevar=yes], - [Ac_cachevar=no]) -CFLAGS="$pgac_save_CFLAGS"]) + [Ac_cachevar=no])]) if test x"$Ac_cach
Re: pg_parse_json() should not leak token copies on failure
On Mon, Oct 7, 2024 at 3:45 PM Jacob Champion wrote: > I've added a 0002 as well. 0002 has since been applied [1] so I'm reattaching v4-0001 to get the cfbot happy again. --Jacob [1] https://git.postgresql.org/cgit/postgresql.git/commit/?id=41b023946d v4-0001-jsonapi-add-lexer-option-to-keep-token-ownership.patch Description: Binary data
Re: pg_upgrade check for invalid databases
On Fri, Oct 25, 2024 at 01:55:57PM +0200, Daniel Gustafsson wrote: > > On 14 Oct 2024, at 18:57, Bruce Momjian wrote: > > > What might be acceptable would be to add an option that would make > > pg_upgrade more tolerant of problems in many areas, that is a lot more > > research and discussion. > > I agree that the concept of having pg_upgrade perform (opt-in) skipping and/or > repairs of the old cluster warrants a larger discussion in its own thread. > There has been significant amount of energy spent recently to add structure to > the checks, any new feature should be properly designed for the get-go. > > In the meantime, the OP has a good point that it's a tad silly that pg_upgrade > fails hard on invalid databases instead of detecting and reporting like how > other errors are handled. The attached adds this check and expands the report > wording to cover it. Agreed. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Re: Time to add a Git .mailmap?
On Thu, Oct 31, 2024 at 02:45:02PM -0500, Nathan Bossart wrote: > On Thu, Oct 31, 2024 at 11:37:13AM +0100, Daniel Gustafsson wrote: >> When looking at our Git tree for a recent conference presentation I happened >> to >> notice that we have recently gained duplicate names in the shortlog. Not >> sure >> if we care enough to fix that with a .mailmap, but if we do the attached diff >> makes sure that all commits are accounted for a single committer entry. > > Seems reasonable to me. +1. -- Michael signature.asc Description: PGP signature
Re: Making error message more user-friendly with spaces in a URI
On Thu, Oct 31, 2024 at 10:35:41PM +0700, Stepan Neretin wrote: > Hi, Yushi! I could not find this line in the master branch and I couldn't > apply this patch. I only see this error in the test (I think the test > should also be changed), but the idea of fixing the error looks good to me. This line exists on HEAD and the patch applies cleanly today as of 2d8bff603c9e. So perhaps just make sure that you are up to date? -- Michael signature.asc Description: PGP signature
Re: libedit history seems to be misbehaving / broken
On Sun, Oct 27, 2024 at 09:07:44PM +0100, Tomas Vondra wrote: > On 10/27/24 20:03, Tom Lane wrote: > > FWIW, I don't observe any particular misbehavior with the very ancient > > libedit that macOS ships. On Fedora 39, I notice something related to > > what you say: it seems like the "\q" ending a session gets repeated > > into .psql_history by the next session. I'm surprised that it's not > > exactly the same as your results though, because it seems to be the > > same source version: > > > > $ rpm -q libedit > > libedit-3.1-53.20240808cvs.fc39.x86_64 > > > > That's probably because I usually terminate psql by Ctrl-D, not by > typing "\q". But yeah, if I use "\q" it gets added twice. > > > Didn't try the too-many-lines behavior, but it looks like that > > is implemented totally by history_truncate_file(), so if that's > > busted it's surely their fault. > > > > Sounds likely. What surprises me a bit, I haven't found any reports of > particularly similar bugs ... I'd have expected other people to hit this > too, but who knows. I wonder if our previous libedit workarounds aren't needed anymore and are causing the bugs. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com When a patient asks the doctor, "Am I going to die?", he means "Am I going to die soon?"
Re: Inval reliability, especially for inplace updates
On Thu, Oct 31, 2024 at 05:00:02PM +0300, Alexander Lakhin wrote:
> I've accidentally discovered an incorrect behaviour caused by commit
> 4eac5a1fa. Running this script:

Thanks. This looks important.

> parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

This didn't reproduce it for me, at -j20, -j40, or -j80. I tested at commit
fb7e27a. At what commit(s) does it reproduce for you? At what commits, if
any, did your test not reproduce this?

> All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
> for the same buffer lock:
> #5 0x561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
> #6 0x561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318
> #7 0x561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182

Can you share the full backtrace for the autovacuum workers? This looks like
four backends all waiting for BUFFER_LOCK_SHARE on the same pg_class page.
One backend is in CREATE TABLE, and three are in autovacuum. There are no
other apparent processes that would hold the BUFFER_LOCK_EXCLUSIVE blocking
these four processes.

> Also as a side note, these processes can't be terminated with SIGTERM, I
> have to kill them.

That suggests they're trying to acquire one LWLock while holding another.
I'll recreate your CREATE TABLE stack trace and study its conditions. It's
not readily clear to me how that would end up holding relevant lwlocks.

Guessing how this happened did lead me to a bad decision in commit a07e03f,
but I expect fixing that bad decision won't fix the hang you captured. That
commit made index_update_stats() needlessly call RelationGetNumberOfBlocks()
and visibilitymap_count() with a pg_class heap buffer lock held. Both do
I/O, and the latter can exclusive-lock a visibility map buffer. The attached
patch corrects that. Since the hang you captured involved a pg_class heap
buffer lock, I don't think this patch will fix that hang. The other inplace
updaters are free from similar badness.

Author:     Noah Misch
Commit:     Noah Misch

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 74d0f30..bef1af6 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2806,6 +2806,9 @@ index_update_stats(Relation rel,
 					bool hasindex,
 					double reltuples)
 {
+	bool		update_stats;
+	BlockNumber relpages;
+	BlockNumber relallvisible;
 	Oid			relid = RelationGetRelid(rel);
 	Relation	pg_class;
 	ScanKeyData key[1];
@@ -2815,6 +2818,42 @@ index_update_stats(Relation rel,
 	bool		dirty;
 
 	/*
+	 * As a special hack, if we are dealing with an empty table and the
+	 * existing reltuples is -1, we leave that alone.  This ensures that
+	 * creating an index as part of CREATE TABLE doesn't cause the table to
+	 * prematurely look like it's been vacuumed.  The final rd_rel may be
+	 * different from rel->rd_rel due to e.g. commit of concurrent GRANT, but
+	 * the commands that change reltuples take locks conflicting with ours.
+	 * (Even if a command changed reltuples under a weaker lock, this affects
+	 * only statistics for an empty table.)
+	 */
+	if (reltuples == 0 && rel->rd_rel->reltuples < 0)
+		reltuples = -1;
+
+	/*
+	 * Don't update statistics during binary upgrade, because the indexes are
+	 * created before the data is moved into place.
+	 */
+	update_stats = reltuples >= 0 && !IsBinaryUpgrade;
+
+	/*
+	 * Finish I/O and visibility map buffer locks before
+	 * systable_inplace_update_begin() locks the pg_class buffer.  The final
+	 * rd_rel may be different from rel->rd_rel due to e.g. commit of
+	 * concurrent GRANT, but no command changes a relkind from non-index to
+	 * index.  (Even if one did, relallvisible doesn't break functionality.)
+	 */
+	if (update_stats)
+	{
+		relpages = RelationGetNumberOfBlocks(rel);
+
+		if (rel->rd_rel->relkind != RELKIND_INDEX)
+			visibilitymap_count(rel, &relallvisible, NULL);
+		else	/* don't bother for indexes */
+			relallvisible = 0;
+	}
+
+	/*
 	 * We always update the pg_class row using a non-transactional,
 	 * overwrite-in-place update.  There are several reasons for this:
 	 *
@@ -2858,15 +2897,6 @@ index_update_stats(Relation rel,
 	/* Should this be a more comprehensive test? */
 	Assert(rd_rel->relkind != RELKIND_PARTITIONED_INDEX);
 
-	/*
-	 * As a special hack, if we are dealing with an empty table and the
-	 * existing reltuples is -1, we leave that alone.  This ensures that
-	 * creating an ind
Re: make all ereport in gram.y print out relative location
I wrote:
> I can take a look at this, since it's basically a followup
> to 774171c4f.

Pushed with some cleanup and further hacking.

			regards, tom lane
Re: IPC::Run::time[r|out] vs our TAP tests
On 31/10/2024 14:27, Daniel Gustafsson wrote:
> On 28 Oct 2024, at 11:56, Heikki Linnakangas wrote:
>> On 09/04/2024 20:10, Daniel Gustafsson wrote:
>>> =item $session->quit
>>>
>>> Close the session and clean up resources. Each test run must be closed
>>> with C<quit>. Returns TRUE when the session was cleanly terminated,
>>> otherwise FALSE. A test failure will be issued in case the session
>>> failed to finish.
>>
>> What does "session failed to finish" mean? Does it mean the same as "not
>> cleanly terminated", i.e. a test failure is issued whenever this returns
>> FALSE?
>
> It was very literally referring to the finish() method. I've reworded the
> comment to indicate that it throws a failure in case the process returns a
> non-zero exit status to finish().

I see.

@@ -148,7 +148,9 @@ sub _wait_connect

 =item $session->quit

 Close the session and clean up resources. Each test run must be closed with
-C<quit>.
+C<quit>. Returns TRUE when the session was cleanly terminated, otherwise
+FALSE. A test failure will be issued in case the session exited with a non-
+zero exit status (the finish() method returns TRUE for 0 exit status).

I still find that confusing. What finish() method? Yes, there's a finish()
method in IPC::Run, but that's BackgroundPsql's internal affair, not exposed
to the callers in any other way. And why do I care what that finish() returns
for 0 exit status? That's not visible to the quit method's caller.

Perhaps something like this:

"Close the psql session and clean up resources. Each psql session must be
closed with C<quit> before the end of the test. Returns TRUE if psql exited
successfully (i.e. with zero exit code), otherwise returns FALSE and reports
a test failure."

Would that be accurate?

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: vacuumdb --analyze-only (e.g., after a major upgrade) vs. partitioned tables: pg_statistic missing stats for the partitioned table itself
On Thu, Oct 24, 2024 at 02:48:42PM -0700, Nikolay Samokhvalov wrote:
> I just learned that vacuumdb --analyze-only doesn't update stats for the
> partitioned table itself, taking care only about individual partitions:

Yes, this is covered in the ANALYZE manual page:

    https://www.postgresql.org/docs/current/sql-analyze.html

    For partitioned tables, ANALYZE gathers statistics by sampling rows
    from all partitions; in addition, it will recurse into each partition
    and update its statistics. Each leaf partition is analyzed only once,
    even with multi-level partitioning. No statistics are collected for
    only the parent table (without data from its partitions), because with
    partitioning it's guaranteed to be empty.

It is discussed here:

    https://www.postgresql.org/message-id/flat/CAB%2B%3D1TULKjDSBHxqSQVQstxcHshGzQUnHfp45GSESAu2qm0VKg%40mail.gmail.com#586bc5deef05c35ac16100dee99f6e9e

> I propose considering 3 fixes:
>
> 1) vacuumdb --analyze / --analyze-only to update stats for the partitioned
>    table, so people using pg_upgrade are not in trouble
> 2) present the ANALYZE metadata for partitioned tables in pg_stat_all_tables
> 3) for old versions, either backpatch with fix (1) OR just add to the docs
>    (and maybe to the final words pg_upgrade prints), suggesting something
>    like this in addition to vacuumdb --analyze-only:
>
> -- psql snippet
> select format(
>     'analyze verbose %I.%I;',
>     relnamespace::oid::regnamespace,
>     oid::regclass
> ) as vacuum_command
> from pg_class
> where relkind = 'p' \gexec
>
> Additionally, I do like the idea of ANALYZE ONLY from the -general
> discussion above (though, there might be confusion with the logic of
> --analyze and --analyze-only in vacuumdb).
>
> Does it make sense?

I certainly would like to see this improved.

--
  Bruce Momjian                            https://momjian.us
  EDB                                      https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means
  "Am I going to die soon?"
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Amit Langote wrote on Thursday, October 31, 2024 at 21:09:
> On Wed, Oct 30, 2024 at 9:36 PM Junwang Zhao wrote:
> > On Wed, Oct 30, 2024 at 11:58 AM jian he wrote:
> > >
> > > I missed a case when column collation and partition key collation are
> > > the same and indeterministic.
> > > that should be fine for partition-wise join.
> > > so v2 attached.
> > >
> > > have_partkey_equi_join, match_expr_to_partition_keys didn't do any
> > > collation related check.
> > > propose v2 change: disallow partitionwise join for the case when the
> > > column collation is indeterministic *and* differs from the partition
> > > key's collation.
> > >
> > > the attached partition_wise_join_collation.sql is the test script.
> > > you may use it to compare with the master behavior.
> >
> > What if the partkey collation and column collation are both deterministic,
> > but with different sort order?
> >
> > I'm not familiar with this part of the code base, but it seems to me the
> > partition-wise join should use the partkey collation instead of the column
> > collation, because it's the partkey collation that decides which partition
> > a row is dispatched to.
> >
> > What Jian proposed is also reasonable but seems another aspect of $subject?
>
> I think we should insist that the join key collation and the partition
> collation are exactly the same and refuse to match them if they are
> not.
>
> + {
> +     Oid     colloid = exprCollation((Node *) expr);
> +
> +     if ((partcoll != colloid) &&
> +         OidIsValid(colloid) &&
> +         !get_collation_isdeterministic(colloid))
> +         *coll_incompatiable = true;
>
> I am not quite sure what is the point of checking whether or not the
> expression collation is deterministic after confirming that it's not
> the same as partcoll.

Me, too.

> Attached 0002 is what I came up with. One thing that's different from
> what Jian proposed is that match_expr_to_partition_keys() returns -1
> (expr not matched to any key) when the collation is also not matched
> instead of using a separate output parameter for that.

Agree.

In have_partkey_equi_join():

    ...
    if (exprs_known_equal(root, expr1, expr2, btree_opfamily))
    {
        Oid partcoll2 = rel1->part_scheme->partcollation[ipk];

I think we should use rel2 here, not rel1.

--
Thanks,
Tender Wang
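For readers following along, here is a rough backend-style sketch of the
exact-match rule Amit describes: an expression only matches a partition key
if its collation equals the partitioning collation, with collation-less keys
(e.g. int) matching trivially. The helper name is invented for illustration
and is not taken from the attached 0002 patch.

    /*
     * Hypothetical helper, for illustration only -- not from the 0002 patch.
     * Refuse to treat a join clause expression as matching a partition key
     * unless its collation is exactly the partitioning collation.
     */
    static bool
    partkey_collation_matches(Oid partcoll, Node *expr)
    {
        Oid     exprcoll = exprCollation(expr);

        /* Neither side has a collation: nothing can disagree. */
        if (!OidIsValid(partcoll) && !OidIsValid(exprcoll))
            return true;

        /* Otherwise insist on an exact match, deterministic or not. */
        return partcoll == exprcoll;
    }

With a check along these lines, match_expr_to_partition_keys() can simply
return -1 when the collation differs, which is how the 0002 patch is
described above.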
Re: Wrong result when enable_partitionwise_join is on if collation of PartitionKey and Column is different.
Amit Langote wrote on Thursday, October 31, 2024 at 21:09:
> 0001 is the patch for the partitionwise grouping issue (bug #18568).
> I concluded that even partial aggregate should be disallowed when the
> grouping collation doesn't match the partitioning collation. The result
> with partial partitionwise grouping is not the same as when
> enable_partitionwise_aggregate is off.

LGTM

--
Thanks,
Tender Wang
Re: UUID v7
On Thu, Oct 31, 2024 at 11:46 AM Andrey M. Borodin wrote:
>
> > On 31 Oct 2024, at 22:15, Masahiko Sawada wrote:
> >
> > On Sat, Oct 26, 2024 at 10:05 AM Andrey M. Borodin wrote:
> >
> > I think we typically avoid this kind of check failure by assigning
> > uuidv7() and uuidv7(interval) different C functions that call the
> > common function. That is, we have pg_proc entries like:
>
> Done.
>
> >>> It's odd to me that only uuid_extract_timestamp() supports UUID v6 in
> >>> spite of not supporting UUID v6 generation. I think it makes more
> >>> sense to support UUID v6 generation as well, if the need for it is
> >>> high.
> >>
> >> RFC urges to use UUIDv7 instead of UUIDv6 when possible. I'm fine with
> >> providing implementation, it's trivial. PFA patch with implementation.
> >
> > My point is that we should either support full functionality for
> > UUIDv6 (such as generation and extraction) or none of them. I'm not
> > really sure we want UUIDv6 as well, but if we want it, it should be
> > implemented in a separate patch.
>
> Make sense. I've removed all traces of v6.

Thank you for updating the patch.

I've been studying UUID v7 and have a question about the current (v29)
implementation. IIUC the current implementation uses nanosecond-precision
timestamps for both the first 48-bit space and the 12 bits of pseudorandom
data space (referred to as the 'rand_a' space in RFC 9562). IOW, all data
except for the random, version, and variant parts consists of a timestamp.

The nanosecond-precision timestamp is generated using clock_gettime() with
CLOCK_REALTIME on Linux, which however could be affected by time adjustment
by NTP. Therefore, if the system clock moves backward due to NTP, we cannot
guarantee monotonicity and sortability. Is that right?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
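To make the concern concrete, here is a standalone sketch (not code from the
patch; the variable names are invented) of deriving the two timestamp-based
fields from clock_gettime(CLOCK_REALTIME): the 48-bit Unix-time-in-
milliseconds prefix and a 12-bit sub-millisecond fraction of the kind RFC
9562 permits in the rand_a space (replacing leftmost random bits with
increased clock precision).

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec ts;
	uint64_t	ns;
	uint64_t	unix_ts_ms;
	uint32_t	sub_ms;

	if (clock_gettime(CLOCK_REALTIME, &ts) != 0)
		return 1;

	ns = (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;

	/* 48-bit unix_ts_ms field of UUIDv7 */
	unix_ts_ms = (ns / 1000000) & 0xFFFFFFFFFFFFULL;

	/* 12-bit sub-millisecond fraction, one way to fill the rand_a space */
	sub_ms = (uint32_t) (((ns % 1000000) << 12) / 1000000);

	printf("unix_ts_ms=%llu sub_ms=%u\n",
		   (unsigned long long) unix_ts_ms, sub_ms);
	return 0;
}

If NTP steps CLOCK_REALTIME backwards between two generations, both fields
can decrease, so any monotonicity guarantee would have to come from extra
logic (e.g. remembering the previous timestamp) rather than from the clock
itself -- which is the sortability question raised above.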
Re: Add ExprState hashing for GROUP BY and hashed SubPlans
On Thu, 31 Oct 2024 at 15:30, Andrei Lepikhov wrote:
> Comparing the master with and without your patch, the first thing I see is
> more extensive usage of memory (see complete explains in the attachment):
>
> Current master:
>   Batches: 1  Memory Usage: 74513kB
> Patched:
>   Batches: 261  Memory Usage: 527905kB  Disk Usage: 4832656kB

Thanks for testing that. I had forgotten to adjust the storage location for
the hash initial value when I rebased the patch against the changes made in
9ca67658d. I've fixed that in the attached patch.

The patched version comes out faster for me now:

master:   Execution Time: 15515.401 ms
v4 patch: Execution Time: 15024.395 ms

David

Attachment: v4-0001-Use-ExprStates-for-hashing-in-GROUP-BY-and-SubPla.patch
Re: In-placre persistance change of a relation
On 31/10/2024 10:01, Kyotaro Horiguchi wrote:
> After some delays, here's the new version. In this update, UNDO logs are
> WAL-logged and processed in memory under most conditions. During
> checkpoints, they're flushed to files, which are then read when a specific
> XID's UNDO log is accessed for the first time during recovery.
>
> The biggest changes are in patches 0001 through 0004 (equivalent to the
> previous 0001-0002). After that, there aren't any major changes. Since
> this update involves removing some existing features, I've split these
> parts into multiple smaller identity transformations to make them clearer.
>
> As for changes beyond that, the main one is lifting the previous
> restriction on PREPARE for transactions after a persistence change. This
> was made possible because, with the shift to in-memory processing of UNDO
> logs, commit-time crash recovery detection is now simpler. Additional
> changes include completely removing the abort-handling portion from the
> pendingDeletes mechanism (0008-0010).

In this patch version, the undo log is kept in dynamic shared memory. It can
grow indefinitely. On a checkpoint, it's flushed to disk. If I'm reading it
correctly, the undo records are kept in the DSA area even after they have
been flushed to disk. That's not necessary: the system never needs to read
the undo log unless there's a crash, so there's no need to keep it in memory
after it has been flushed to disk. That's true today; we could start relying
on the undo log to clean up on abort even when there's no crash, but I think
it's a good design not to do that and to rely on backend-private state for
non-crash transaction abort.

I'd suggest doing this the other way 'round. Let's treat the on-disk
representation as the primary representation, not the in-memory one. Let's
use a small fixed-size shared memory area just as a write buffer to hold the
dirty undo log entries that haven't been written to disk yet. Most
transactions are short, so most undo log entries never need to be flushed to
disk, but I think it'll be simpler to think of it that way. On checkpoint,
flush all the buffered dirty entries from memory to disk and clear the
buffer. Also do that if the buffer fills up.

A high-level overview comment of the on-disk format would be nice. If I
understand correctly, there's a magic constant at the beginning of each undo
file, followed by UndoLogRecords. There are no other file headers and no
page structure within the file. That format seems reasonable. For
cross-checking, maybe add the XID to the file header too.

There is a separate CRC value on each record, which is nice, but not
strictly necessary since the writes to the UNDO log are WAL-logged. The WAL
needs CRCs on each record to detect the end of the log, but the UNDO log
doesn't need that. Anyway, it's fine.

I somehow dislike the file-per-subxid design. I'm sure it works; it just
doesn't feel right to me. I'm somewhat worried about ending up with lots of
files if you e.g. use temporary tables with subtransactions heavily. Could
we have just one file per top-level XID? I guess that can become a problem
too, if you have a lot of aborted subtransactions: the UNDO records for the
aborted subtransactions would bloat the undo file. But maybe that's
nevertheless better?

--
Heikki Linnakangas
Neon (https://neon.tech)
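To make the suggestion concrete, here is a rough sketch of what a per-XID
undo file header with the proposed cross-check XID, plus a small fixed-size
write buffer that is flushed at checkpoint time or when it fills up, could
look like. All type names, sizes, and the flush routine are invented for
illustration; none of this is taken from the patch.

    /* Sketch only -- hypothetical types and names, not from the patch. */
    typedef struct UndoFileHeader
    {
        uint32          magic;      /* identifies an undo file */
        TransactionId   xid;        /* owning transaction, for cross-checking */
    } UndoFileHeader;

    #define UNDO_WRITE_BUFFER_SIZE  (64 * 1024)     /* illustrative size */

    typedef struct UndoWriteBuffer
    {
        TransactionId   xid;        /* whose records are buffered */
        uint32          used;       /* bytes of valid data in buf */
        char            buf[UNDO_WRITE_BUFFER_SIZE];
    } UndoWriteBuffer;

    /*
     * Append one record to the buffer; if it would not fit, first flush the
     * buffered records to the transaction's undo file and empty the buffer.
     * A checkpoint would call the same (hypothetical) flush routine for all
     * buffers and then clear them.
     */
    static void
    undo_buffer_append(UndoWriteBuffer *ub, const void *rec, uint32 len)
    {
        if (ub->used + len > UNDO_WRITE_BUFFER_SIZE)
        {
            undo_buffer_flush(ub);      /* hypothetical: write buf to the file */
            ub->used = 0;
        }
        memcpy(ub->buf + ub->used, rec, len);
        ub->used += len;
    }

Under a scheme like this, the on-disk file is the authoritative copy and the
shared-memory footprint stays bounded no matter how much undo a long
transaction generates.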