Re: New strategies for freezing, advancing relfrozenxid early
On Tue, Aug 30, 2022 at 11:28 PM Jeff Davis wrote:
> That clarifies your point. It's still a challenge for me to reason
> about which of these potential new problems really need to be solved in
> v1, though.

I don't claim to understand it that well myself -- not just yet. I feel like I have the right general idea, but the specifics aren't all there (which is very often the case for me at this point in the cycle). That seems like a good basis for further discussion.

It's going to be quite a few months before some version of this patchset is committed, at the very earliest. Obviously these are questions that need answers, but the process of getting to those answers is a significant part of the work itself IMV.

> > Why stop at a couple of dozen lines of code? Why not just change
> > the default of vacuum_freeze_min_age and
> > vacuum_multixact_freeze_min_age to 0?
>
> I don't think that would actually solve the unbounded buildup of
> unfrozen pages. It would still be possible for pages to be marked all
> visible before being frozen, and then end up being skipped until an
> aggressive vacuum is forced, right?

With the 15 work in place, and with the insert-driven autovacuum behavior from 13, it is likely that this would be enough to avoid all antiwraparound vacuums in an append-only table. There is still one case where we can throw away the opportunity to advance relfrozenxid during non-aggressive VACUUMs for no good reason -- I didn't fix them all just yet. But the remaining case (which is in lazy_scan_skip()) is very narrow.

With vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age = 0, any page that is eligible to be set all-visible is also eligible to have its tuples frozen and be set all-frozen instead, immediately. When it isn't, we'll scan it in the next VACUUM anyway. (I'm also ignoring some subtleties with MultiXacts that could make this not quite happen, but that's only a very obscure corner case.)

The idea that just setting vacuum_freeze_min_age = 0 and vacuum_multixact_freeze_min_age = 0 will be enough is definitely true in spirit. You don't need to touch vacuum_freeze_table_age (if you did, then you'd get aggressive VACUUMs, and one goal here is to avoid those whenever possible -- especially aggressive antiwraparound autovacuums).

-- 
Peter Geoghegan
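To make the settings being discussed concrete, here is a minimal sketch (the table name is invented, and the per-table reloptions shown are just one way to try this; the same effect can be had by setting the GUCs globally):

    -- freeze tuples as soon as a page can be set all-visible
    ALTER TABLE append_only_events
        SET (autovacuum_freeze_min_age = 0,
             autovacuum_multixact_freeze_min_age = 0);

    -- or instance-wide, e.g. in postgresql.conf:
    --   vacuum_freeze_min_age = 0
    --   vacuum_multixact_freeze_min_age = 0
    -- vacuum_freeze_table_age is deliberately left at its default, so
    -- VACUUMs stay non-aggressive.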
Re: pg_stat_have_stats() returns true for dropped indexes (or for index creation transaction rolled back)
At Thu, 25 Aug 2022 11:44:34 +0200, "Drouvot, Bertrand" wrote in
> Looking closer at it, I think we are already good for the drop case on
> the tables (by making direct use of the pg_stat_get_* functions on the
> before dropped oid).
>
> So I think we can remove all the "table" new checks: new patch
> attached is doing so.
>
> On the other hand, for the index case, I think it's better to keep the
> "committed index creation one".

I agree.

> Indeed, to check that the drop behaves correctly I think it's better
> in "the same test" to ensure we've had the desired result before the
> drop (I mean having pg_stat_have_stats() returning false after a drop
> does not really help if we are not 100% sure it was returning true for
> the exact same index before the drop).

Sounds reasonable.

> > We have other variable-numbered stats kinds
> > FUNCTION/REPLSLOT/SUBSCRIPTION. Don't we need the same for these?
>
> I don't think we need more tests for the FUNCTION case (as it looks to
> me it is already covered in stats.sql by the
> pg_stat_get_function_calls() calls on the dropped functions' oids).

Right.

> For SUBSCRIPTION, I think this is covered in 026_stats.pl:
>
> # Subscription stats for sub1 should be gone
> is( $node_subscriber->safe_psql(
>     $db, qq(SELECT pg_stat_have_stats('subscription', 0, $sub1_oid))),
>     qq(f),
>     qq(Subscription stats for subscription '$sub1_name' should be removed.));
>
> For REPLSLOT, I agree that we can add one test: I added it in
> contrib/test_decoding/sql/stats.sql. It relies on pg_stat_have_stats()
> (as relying on pg_stat_replication_slots and/or
> pg_stat_get_replication_slot() would not help that much for this test
> given that the slot has been removed from ReplicationSlotCtl)

Thanks for searching.

+-- pg_stat_have_stats returns true for regression_slot_stats1
+-- Its index is 1 in ReplicationSlotCtl->replication_slots
+select pg_stat_have_stats('replslot', 0, 1);

This is wrong. The index is actually 0. We cannot know the id reliably since we don't expose it at all. We could slightly increase robustness by assuming the range of the id, but that is just moving the problem to another place. If the test is broken by a change of the replslot id assignment policy, it would be easily found and fixed. So is it fine to simply fix the comment with the correct ID?

Or, contrarily, we can be more sensitive to a change of the ID assignment policy by checking all the replication slots:

select count(n) from generate_series(0, 2) as n where pg_stat_have_stats('replslot', 0, n);

The number changes from 3 to 0 across the slot drops. If any of the slots has gone out of the range, the number before the drop decreases.

> Attaching v3-0001 (with the right "numbering" this time ;-) )

Yeah, looks fine :p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
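A sketch of what that more policy-agnostic check could look like in contrib/test_decoding/sql/stats.sql (only regression_slot_stats1 is named upthread; the other two slot names here are assumed for illustration):

    -- stats should exist for all three regression slots, whatever internal
    -- index each one was assigned
    select count(n) from generate_series(0, 2) as n
     where pg_stat_have_stats('replslot', 0, n);   -- expect 3

    select pg_drop_replication_slot('regression_slot_stats1');
    select pg_drop_replication_slot('regression_slot_stats2');  -- assumed name
    select pg_drop_replication_slot('regression_slot_stats3');  -- assumed name

    -- after the drops, no replslot stats entries should remain in that range
    select count(n) from generate_series(0, 2) as n
     where pg_stat_have_stats('replslot', 0, n);   -- expect 0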
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Sun, Aug 28, 2022 at 02:52:43PM -0700, Nathan Bossart wrote:
> On Sat, Aug 27, 2022 at 02:06:32PM +0530, Bharath Rupireddy wrote:
>> PSA v17 patch with reorderbuffer.c changes reverted.
>
> LGTM

Yeah, what you have here looks pretty fine to me, so this is IMO committable. Let's wait a bit to see if there are any objections from the others, just in case.

-- 
Michael
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 3:51 PM Amit Langote wrote:
> On Wed, Aug 31, 2022 at 6:25 AM Nikita Glukhov wrote:
> > v10 patches
>
> Finally, I get this warning:
>
> execExprInterp.c: In function ‘ExecJsonCoerceCStringToText’:
> execExprInterp.c:4765:3: warning: missing braces around initializer [-Wmissing-braces]
>    NameData encoding = {0};
>    ^
> execExprInterp.c:4765:3: warning: (near initialization for ‘encoding.data’) [-Wmissing-braces]

Given the time constraints on making a decision on this, I'd like to also mention that other than the things mentioned in my last email, which don't sound like a big deal for Nikita to take care of, I don't have any further comments on the patches.

-- 
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Re: [RFC] building postgres with meson - v12
I found that the perl test modules are not installed. See the attached patch to correct this.

To the patches:

4e15ee0e24 Don't hardcode tmp_check/ as test directory for tap tests
1a3169bc3f Split TESTDIR into TESTLOGDIR and TESTDATADIR

It's a bit weird that the first patch changes the meaning of TESTDIR and the second patch removes it. Maybe these patches should be squashed together?

96d1d0a0cf meson: prereq: Extend gendef.pl in preparation for meson

ok

581721fa99 meson: prereq: Add src/tools/gen_export.pl

Still wondering about the whitespace changes I reported recently, but that can also be fine-tuned later.

4245cc888e meson: prereq: Refactor PG_TEST_EXTRA logic in autoconf build

ok

3afe803e0f meson: prereq: Fix warning compat_informix/rnull.pgc with msvc

ok

ae7733f46c meson: prereq: Move darwin sysroot determination into separate file

ok

a1fb97a81b meson: Add meson based buildsystem

I'm not a fan of all this business to protect the two build systems from each other. I don't like the build process touching a file under version control every time. How necessary is this? What happens otherwise?

conversion_helpers.txt: should probably be removed now.

doc/src/sgml/resolv.xsl: I don't understand what this is doing. Maybe at least add a comment in the file.

src/common/unicode/meson.build: The comment at the top of the file should be moved next to the files it is describing (similar to how it is in the makefile). I don't see CLDR_VERSION set anywhere. Is that part implemented?

src/port/win32ver.rc.in: This is redundant with src/port/win32ver.rc. (Note that the latter is also used as an input file for text substitution. So having another file named *.in next to it would be super confusing.)

src/tools/find_meson: Could use a brief comment on what it does.

src/tools/pgflex: Could use a not-brief comment about what it does and why it's needed. Also a comment where it's used. Also run this through pycodestyle.

src/tools/rcgen: This is connected with the comment on win32ver.rc.in above. We already have this equivalent code in src/makefiles/Makefile.win32. Let's figure out a way to share this code. (It could be a Perl script, which is already required on Windows.) Also pycodestyle.

src/tools/testwrap: also documentation/comments/pycodestyle

cd193eb3e8 meson: ci: Build both with meson and as before

I haven't reviewed this one in detail. Maybe add a summary in the commit message, like: these are the new jobs, these are the changes to existing jobs. It looks like there is more in there than just adding a few meson jobs.

If the above are addressed, I think this will be just about at the point where the above patches can be committed.
Everything past these patches I'm mentally postponing right now.

From 80d6f4f9a574ddb6250e06490c610969aacdb824 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut
Date: Wed, 31 Aug 2022 09:16:04 +0200
Subject: [PATCH] meson: Install test perl modules

---
 src/test/meson.build      |  2 ++
 src/test/perl/meson.build | 12
 2 files changed, 14 insertions(+)
 create mode 100644 src/test/perl/meson.build

diff --git a/src/test/meson.build b/src/test/meson.build
index b86a0f3889..241d9d48aa 100644
--- a/src/test/meson.build
+++ b/src/test/meson.build
@@ -21,3 +21,5 @@ endif
 if icu.found()
   subdir('icu')
 endif
+
+subdir('perl')
diff --git a/src/test/perl/meson.build b/src/test/perl/meson.build
new file mode 100644
index 00..901bae7a56
--- /dev/null
+++ b/src/test/perl/meson.build
@@ -0,0 +1,12 @@
+# could use install_data's preserve_path option in >=0.64.0
+
+install_data(
+  'PostgreSQL/Version.pm',
+  install_dir: dir_pgxs / 'src/test/perl/PostgreSQL')
+
+install_data(
+  'PostgreSQL/Test/Utils.pm',
+  'PostgreSQL/Test/SimpleTee.pm',
+  'PostgreSQL/Test/RecursiveCopy.pm',
+  'PostgreSQL/Test/Cluster.pm',
+  install_dir: dir_pgxs / 'src/test/perl/PostgreSQL/Test')
-- 
2.37.1
Re: [RFC] building postgres with meson - v11
On 24.08.22 17:30, Andres Freund wrote:

0545eec895 meson: Add docs

We should think more about how to arrange the documentation. We probably don't want to copy-and-paste all the introductory and requirements information. I think we can make this initially much briefer, like the Windows installation chapter. For example, instead of documenting each setup option again, just mention which ones exist and then point (link) to the configure chapter for details.

The current docs, including the windows ones, are already hard to follow. I think we should take some care to not make the meson bits even more confusing. Cross referencing left and right seems problematic from that angle.

If you look at the current structure of the installation chapter

17.1. Short Version
17.2. Requirements
17.3. Getting the Source
17.4. Installation Procedure
17.5. Post-Installation Setup
17.6. Supported Platforms
17.7. Platform-Specific Notes

only 17.1, small parts of 17.2, and 17.4 should differ between make and meson. There is no conceivable reason why the meson installation chapter should have a different "Getting the Source" section. And some of the post-installation and platform-specific information doesn't appear at all in the meson chapter.

I think we can try to be a bit more ingenious in how we weave this together in the best way. What I really wouldn't want is two separate chapters that duplicate the entire process. I think we could do one chapter, like

- Short Version
- Requirements
- Getting the Source
- Installation Procedure
- Installation Procedure using Meson
- Post-Installation Setup
- Supported Platforms
- Platform-Specific Notes

Alternatively, if people prefer two separate chapters, let's think about some source-code level techniques to share the common contents.
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
On 30.08.22 15:16, Christoph Berg wrote:
> I found the list of TG_ variables on
> https://www.postgresql.org/docs/current/plpgsql-trigger.html#PLPGSQL-DML-TRIGGER
> hard to read for several reasons: too much whitespace, all the lines
> start with "Data type", and even after that, the actual content is
> hiding behind some extra "variable that..." boilerplate.
>
> The attached patch formats the list as a table, and removes some of
> the clutter from the text.
>
> I reused the catalog_table_entry table machinery, that is probably not
> quite the correct thing, but I didn't find a better variant, and the
> result looks ok.

I find the new version even harder to read. The catalog_table_entry stuff doesn't really make sense here, since what you have before is already a definition list, and afterwards you have the same, just marked up "incorrectly".

We could move the data type into the <term>, similar to how you did it in your patch.

I agree the whitespace layout is weird, but that's a problem of the website CSS stylesheet. I think it looks a bit better with the local stylesheet, but that can all be tweaked.
Re: First draft of the PG 15 release notes
On 30.08.22 22:42, Bruce Momjian wrote:
> Good question --- the full text is:
>
>     Record and check the collation of each database (Peter Eisentraut)
>
>     This is designed to detect collation mismatches to avoid data
>     corruption. Function pg_database_collation_actual_version() reports
>     the underlying operating system collation version, and ALTER
>     DATABASE ... REFRESH sets the database to match the operating
>     system collation version.
>
> DETAILS? I just can't figure out what the user needs to understand
> about this, and I understand very little of it.

We already had this feature for (schema-level) collations, now we have it on the level of the database collation.
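For context, a sketch of how those two pieces fit together (the database name is invented):

    -- compare the collation version recorded for the database with what the
    -- operating system currently reports
    SELECT datcollversion,
           pg_database_collation_actual_version(oid) AS os_version
      FROM pg_database WHERE datname = 'appdb';

    -- after rebuilding affected objects, record the new version
    ALTER DATABASE appdb REFRESH COLLATION VERSION;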
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
> On 31 Aug 2022, at 11:35, Peter Eisentraut wrote:
>
> On 30.08.22 15:16, Christoph Berg wrote:
>> I found the list of TG_ variables on
>> https://www.postgresql.org/docs/current/plpgsql-trigger.html#PLPGSQL-DML-TRIGGER
>> hard to read for several reasons: too much whitespace, all the lines
>> start with "Data type", and even after that, the actual content is
>> hiding behind some extra "variable that..." boilerplate.
>> The attached patch formats the list as a table, and removes some of
>> the clutter from the text.
>> I reused the catalog_table_entry table machinery, that is probably not
>> quite the correct thing, but I didn't find a better variant, and the
>> result looks ok.
>
> I find the new version even harder to read. The catalog_table_entry stuff
> doesn't really make sense here, since what you have before is already a
> definition list, and afterwards you have the same, just marked up
> "incorrectly".

If we change variable lists they should get their own formatting in the xsl and css stylesheets.

> We could move the data type in the <term>, similar to how you did it in your
> patch.

That will make this look different from the trigger variable lists for other languages (which typically don't list the type), but I think it's worth it to avoid the boilerplate, which is a bit annoying.

Another thing we should change while there (but it's not directly related to this patch) is that we document TG_RELID and $TG_relid as "object ID" but TD["relid"] and $_TD->{relid} as "OID". Punctuation of item descriptions is also not consistent.

-- 
Daniel Gustafsson   https://vmware.com/
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
Re: Peter Eisentraut > I find the new version even harder to read. The catalog_table_entry stuff > doesn't really make sense here, since what you have before is already a > definition list, and afterwards you have the same, just marked up > "incorrectly". Fair enough. For comparison, this is what yesterday's patch looked like: https://www.df7cb.de/s/2022-08-31.115813.w5UvAS.png > We could move the data type in the , similar to how you did it in your > patch. The new version of the patch just moves up the data types, and removes the extra clutter from the beginnings of each description: https://www.df7cb.de/s/2022-08-31.115857.LkkKl8.png Christoph >From 61bcef4b3a0a8a17c86114ef9747c6e793a5b48f Mon Sep 17 00:00:00 2001 From: Christoph Berg Date: Wed, 31 Aug 2022 11:52:50 +0200 Subject: [PATCH] plpgsql-trigger.html: Use more concise wording for TG_ variables To improve readability of the TG_ variables definition list, this moves the datatypes up to the defined term to avoid having each entry start with "Data type", and removes some more redundant clutter ("Variable holding...") from the descriptions that didn't carry any information. --- doc/src/sgml/plpgsql.sgml | 48 +++ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml index cf387dfc3f..b862af35a7 100644 --- a/doc/src/sgml/plpgsql.sgml +++ b/doc/src/sgml/plpgsql.sgml @@ -4032,10 +4032,10 @@ ASSERT condition , - NEW + NEW (record) - Data type RECORD; variable holding the new + new database row for INSERT/UPDATE operations in row-level triggers. This variable is null in statement-level triggers and for DELETE operations. @@ -4044,10 +4044,10 @@ ASSERT condition , - OLD + OLD (record) - Data type RECORD; variable holding the old + old database row for UPDATE/DELETE operations in row-level triggers. This variable is null in statement-level triggers and for INSERT operations. @@ -4056,20 +4056,20 @@ ASSERT condition , - TG_NAME + TG_NAME (name) - Data type name; variable that contains the name of the trigger actually + name of the trigger actually fired. - TG_WHEN + TG_WHEN (text) - Data type text; a string of + string BEFORE, AFTER, or INSTEAD OF, depending on the trigger's definition. @@ -4077,10 +4077,10 @@ ASSERT condition , - TG_LEVEL + TG_LEVEL (text) - Data type text; a string of either + string ROW or STATEMENT depending on the trigger's definition. @@ -4088,10 +4088,10 @@ ASSERT condition , - TG_OP + TG_OP (text) - Data type text; a string of + string INSERT, UPDATE, DELETE, or TRUNCATE telling for which operation the trigger was fired. @@ -4100,20 +4100,20 @@ ASSERT condition , - TG_RELID + TG_RELID (oid, references pg_class.oid) - Data type oid; the object ID of the table that caused the + object ID of the table that caused the trigger invocation. - TG_RELNAME + TG_RELNAME (name) - Data type name; the name of the table that caused the trigger + name of the table that caused the trigger invocation. This is now deprecated, and could disappear in a future release. Use TG_TABLE_NAME instead. @@ -4121,40 +4121,40 @@ ASSERT condition , - TG_TABLE_NAME + TG_TABLE_NAME (name) - Data type name; the name of the table that + name of the table that caused the trigger invocation. - TG_TABLE_SCHEMA + TG_TABLE_SCHEMA (name) - Data type name; the name of the schema of the + name of the schema of the table that caused the trigger invocation. 
- TG_NARGS + TG_NARGS (integer) - Data type integer; the number of arguments given to the trigger + number of arguments given to the trigger function in the CREATE TRIGGER statement. - TG_ARGV[] + TG_ARGV[] (text[]) - Data type array of text; the arguments from + arguments from the CREATE TRIGGER statement. The index counts from 0. Invalid indexes (less than 0 or greater than or equal to tg_nargs) -- 2.35.1
Re: [PATCH] Fix alter subscription concurrency errors
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

The patch applies with a few "Hunk succeeded, offset -3 lines" warnings. Tested against master '7d5852ca'.

+	if (!HeapTupleIsValid(tup))
+	{
+		if (!missing_ok)
+			ereport(ERROR,
+					(errcode(ERRCODE_UNDEFINED_OBJECT),
+					 errmsg("subscription \"%s\" does not exist",
+							subname)));
+		else
+			ereport(NOTICE,
+					(errmsg("subscription \"%s\" does not exist, skipping",
+							subname)));
+
+		return InvalidOid;
+	}
+

I think 'tup' should be released before returning, or we should break out of the loop instead so it can be released there.

The new status of this patch is: Waiting on Author
Re: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 at 11:35 AM Peter Smith wrote: > > Here are my review comments for v43-0001. > > == > > 1. Commit message > > The initial copy phase has no way to know the origin of the row data, > so if 'copy_data = true' in the step 4 below, log a warning to notify user > that potentially non-local data might have been copied. > > 1a. > "in the step 4" -> "in step 4" > > ~ > > 1b. > "notify user" -> "notify the user" > > == > > 2. doc/src/sgml/ref/create_subscription.sgml > > + > + If the subscription is created with origin = none and > + copy_data = true, it will check if the publisher has > + subscribed to the same table from other publishers and, if so, log a > + warning to notify the user to check the publisher tables. Before > continuing > + with other operations the user should check that publisher tables did not > + have data with different origins to prevent data inconsistency issues on > the > + subscriber. > + > > I am not sure about that wording saying "to prevent data inconsistency > issues" because I thought maybe is already too late because any > unwanted origin data is already copied during the initial sync. I > think the user can do it try to clean up any problems caused before > things become even worse (??) > Oh no, it is not too late in all cases. The problem can only occur if the user will try to subscribe from all the different publications with copy_data = true. We are anyway trying to clarify in the second patch the right way to accomplish this. So, I am not sure what better we can do here. The only bulletproof way is to provide some APIs where users don't need to bother about all these cases, basically something similar to what Kuroda-San is working on in the thread [1]. > ~~~ > > You asked for my thoughts (see [1] 6b): > > > There is nothing much a user can do in this case. Only option would be > > to take a backup before the operation and restore it and then recreate > > the replication setup. I was not sure if we should document these > > steps as similar inconsistent data could get created without the > > origin option . Thoughts? > > I think those cases are a bit different: > > - For a normal scenario (i.e. origin=ANY) then inconsistent data just > refers to problems like maybe PK violations so I think you just get > what you get and have to deal with the errors... > > - But when the user said origin=NONE they are specifically saying they > do NOT want any non-local data, yet they still might end up getting > data contrary to their wishes. So I think maybe there ought to be > something documented to warn about this potentially happening. My > first impression is that all this seems like a kind of terrible > problem because if it happens then cleanup could be difficult - if > that subscriber in turn was also a publisher then this bad data might > have propagated all over the place already! Anyway, I guess all this > could be mentioned in some (new ?) section of the > logical-replication.sgml file (i.e. patch 0002). > > IDEA: I was wondering if the documentation should recommend (or maybe > we can even enforce this) that when using origin=NONE the user should > always create the subscription with enabled=FALSE. That way if there > are some warnings about potential bad origin data then there it might > be possible to take some action or change their mind *before* any > unwanted data pollutes everything. > If we want to give this suggestion, then we need to also say that this is only valid when copy_data is true. 
The other drawback is if users always need to do this to use origin=NONE then this can be a burden to the user. I am not against having such a NOTE but the tone should not be to enforce users to do this. [1] - https://www.postgresql.org/message-id/CAHut%2BPuwRAoWY9pz%3DEubps3ooQCOBFiYPU9Yi%3DVB-U%2ByORU7OA%40mail.gmail.com -- With Regards, Amit Kapila.
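To make the workflow being discussed concrete, a sketch of what the documentation could suggest (the connection string, subscription, and publication names here are invented):

    CREATE SUBSCRIPTION sub_from_node2
        CONNECTION 'host=node2 dbname=postgres'
        PUBLICATION pub_node2
        WITH (origin = none, copy_data = true, enabled = false);

    -- Review any WARNING about tables on the publisher that may already
    -- contain data from other origins, then either enable the subscription
    -- or drop it and clean up:
    ALTER SUBSCRIPTION sub_from_node2 ENABLE;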
Re: Small cleanups to tuplesort.c and a bonus small performance improvement
On Fri, 26 Aug 2022 at 16:48, David Rowley wrote: > 0003: Changes writetuple to tell it what it should do in regards to > freeing and adjusting the memory accounting. > > Probably 0003 could be done differently. I'm certainly not set on the > bool args. I understand that I'm never calling it with "freetup" == > true. So other options include 1) rip out the pfree code and that > parameter; or 2) just do the inlining manually at both call sites. This patch series needed to be rebased and on looking it at again, since the pfree() code is never used I felt it makes very little sense to keep it, so I decided that it might be better just to keep the WRITETUP macro and just completely get rid of the writetuple function and have the macro call the function pointed to be the "writetup" pointer. The only extra code we needed from writetuple() was the memory accounting code which was only used in dumptuples(), so I've just included that code in that function instead. I also noticed that dumptuples() had a pretty braindead method of zeroing out state->memtupcount by subtracting 1 from it on each loop. Since that's not being used to keep track of the loop's progress, I've just moved it out the loop and changed the code to set it to 0 once the loop is done. > I'll throw this in the September CF to see if anyone wants to look. > There's probably lots more cleaning jobs that could be done in > tuplesort.c. My current thoughts are that this is a very trivial patch and unless there's any objections I plan to push it soon. David From 7d9d960c6080f9511ecb2514defed386b9b65cdb Mon Sep 17 00:00:00 2001 From: David Rowley Date: Wed, 31 Aug 2022 18:52:11 +1200 Subject: [PATCH v2] Be smarter about freeing tuples during tuplesorts During dumptuples() the call to writetuple() would pfree any non-null tuple. This was quite wasteful as this happens just before we perform a reset of the context which stores all of those tuples. It seems to make sense to do a bit of a code refactor to make this work, so here we just get rid of the writetuple function and adjust the WRITETUP macro to call the state's writetup function. The WRITETUP usage in mergeonerun() always has state->slabAllocatorUsed == true, so writetuple() would never free the tuple or do any memory accounting. The only call path that needs memory accounting done is in dumptuples(), so let's just do it manually there. In passing, let's get rid of the state->memtupcount-- code that counts the memtupcount down to 0 one tuple at a time inside the loop. That seems to be a rather inefficient way to set memtupcount to 0, so let's just zero it after the loop instead. --- src/backend/utils/sort/tuplesort.c | 38 -- 1 file changed, 15 insertions(+), 23 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 66950983e6..416f02ba3c 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -395,7 +395,7 @@ struct Sharedsort #define REMOVEABBREV(state,stup,count) ((*(state)->base.removeabbrev) (state, stup, count)) #define COMPARETUP(state,a,b) ((*(state)->base.comparetup) (a, b, state)) -#define WRITETUP(state,tape,stup) (writetuple(state, tape, stup)) +#define WRITETUP(state,tape,stup) ((*(state)->base.writetup) (state, tape, stup)) #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len)) #define FREESTATE(state) ((state)->base.freestate ? 
(*(state)->base.freestate) (state) : (void) 0) #define LACKMEM(state) ((state)->availMem < 0 && !(state)->slabAllocatorUsed) @@ -453,8 +453,6 @@ struct Sharedsort static void tuplesort_begin_batch(Tuplesortstate *state); -static void writetuple(Tuplesortstate *state, LogicalTape *tape, - SortTuple *stup); static bool consider_abort_common(Tuplesortstate *state); static void inittapes(Tuplesortstate *state, bool mergeruns); static void inittapestate(Tuplesortstate *state, int maxTapes); @@ -1339,24 +1337,6 @@ tuplesort_puttuple_common(Tuplesortstate *state, SortTuple *tuple, bool useAbbre MemoryContextSwitchTo(oldcontext); } -/* - * Write a stored tuple onto tape. Unless the slab allocator is - * used, after writing the tuple, pfree() the out-of-line data (not the - * SortTuple struct!), and increase state->availMem by the amount of - * memory space thereby released. - */ -static void -writetuple(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup) -{ - state->base.writetup(state, tape, stup); - - if (!state->slabAllocatorUsed && stup->tuple) - { - FREEMEM(state, GetMemoryChunkSpace(stup->tuple)); - pfree(stup->tuple); - } -} - static bool consider_abort_common(Tuplesortstate *state) { @@ -2260,6 +2240,8 @@ mergeonerun(Tuplesortstate *state) */ beginmerge(state); + Assert(state->slabAllocatorUsed); + /*
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 3:51 PM Amit Langote wrote: > SELECT JSON_VALUE(jsonb '"aaa"', '$' RETURNING int DEFAULT 111 ON ERROR); > - json_value > - > -111 > -(1 row) > - > +ERROR: syntax error at or near "DEFAULT" > +LINE 1: ...ELECT JSON_VALUE(jsonb '"aaa"', '$' RETURNING int DEFAULT 11... > > Is it intentional that you left many instances of the regression test > output changes like the above? Actually, thinking more about this, I am wondering if we should not remove the DEFAULT expression productions in gram.y. Maybe we can keep the syntax and give an unsupported error during parse-analysis, like the last version of the patch did for DEFAULT ON EMPTY. Which also means to also leave JsonBehavior alone but with default_expr always NULL for now. -- Thanks, Amit Langote EDB: http://www.enterprisedb.com
Avoid overhead open-close indexes (catalog updates)
Hi, The commit https://github.com/postgres/postgres/commit/b17ff07aa3eb142d2cde2ea00e4a4e8f63686f96 Introduced the CopyStatistics function. To do the work, CopyStatistics uses a less efficient function to update/insert tuples at catalog systems. The comment at indexing.c says: "Avoid using it for multiple tuples, since opening the indexes * and building the index info structures is moderately expensive. * (Use CatalogTupleInsertWithInfo in such cases.)" So inspired by the comment, changed in some fews places, the CatalogInsert/CatalogUpdate to more efficient functions CatalogInsertWithInfo/CatalogUpdateWithInfo. With quick tests, resulting in small performance. head: 1. REINDEX TABLE CONCURRENTLY pgbench_accounts; Time: 77,805 ms Time: 74,836 ms Time: 73,480 ms 2. REINDEX TABLE CONCURRENTLY pgbench_tellers; Time: 22,260 ms Time: 22,205 ms Time: 21,008 ms patched: 1. REINDEX TABLE CONCURRENTLY pgbench_accounts; Time: 65,048 ms Time: 61,853 ms Time: 61,119 ms 2. REINDEX TABLE CONCURRENTLY pgbench_tellers; Time: 15,999 ms Time: 15,961 ms Time: 13,264 ms There are other places that this could be useful, but a careful analysis is necessary. regards, Ranier Vilela diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 9b03579e6e..cf15051a05 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -2856,8 +2856,7 @@ CopyStatistics(Oid fromrelid, Oid torelid) SysScanDesc scan; ScanKeyData key[1]; Relation statrel; - - statrel = table_open(StatisticRelationId, RowExclusiveLock); + CatalogIndexState indstate; /* Now search for stat records */ ScanKeyInit(&key[0], @@ -2865,6 +2864,9 @@ CopyStatistics(Oid fromrelid, Oid torelid) BTEqualStrategyNumber, F_OIDEQ, ObjectIdGetDatum(fromrelid)); + statrel = table_open(StatisticRelationId, RowExclusiveLock); + indstate = CatalogOpenIndexes(statrel); + scan = systable_beginscan(statrel, StatisticRelidAttnumInhIndexId, true, NULL, 1, key); @@ -2878,13 +2880,14 @@ CopyStatistics(Oid fromrelid, Oid torelid) /* update the copy of the tuple and insert it */ statform->starelid = torelid; - CatalogTupleInsert(statrel, tup); + CatalogTupleInsertWithInfo(statrel, tup, indstate); heap_freetuple(tup); } systable_endscan(scan); + CatalogCloseIndexes(indstate); table_close(statrel, RowExclusiveLock); } diff --git a/src/backend/catalog/pg_enum.c b/src/backend/catalog/pg_enum.c index 114715498d..785fbc1669 100644 --- a/src/backend/catalog/pg_enum.c +++ b/src/backend/catalog/pg_enum.c @@ -69,6 +69,7 @@ EnumValuesCreate(Oid enumTypeOid, List *vals) bool nulls[Natts_pg_enum]; ListCell *lc; HeapTuple tup; + CatalogIndexState indstate; num_elems = list_length(vals); @@ -113,6 +114,9 @@ EnumValuesCreate(Oid enumTypeOid, List *vals) /* and make the entries */ memset(nulls, false, sizeof(nulls)); + /* open the Catalog Indexes for Insert Tuples */ + indstate = CatalogOpenIndexes(pg_enum); + elemno = 0; foreach(lc, vals) { @@ -137,7 +141,7 @@ EnumValuesCreate(Oid enumTypeOid, List *vals) tup = heap_form_tuple(RelationGetDescr(pg_enum), values, nulls); - CatalogTupleInsert(pg_enum, tup); + CatalogTupleInsertWithInfo(pg_enum, tup, indstate); heap_freetuple(tup); elemno++; @@ -145,6 +149,7 @@ EnumValuesCreate(Oid enumTypeOid, List *vals) /* clean up */ pfree(oids); + CatalogCloseIndexes(indstate); table_close(pg_enum, RowExclusiveLock); } diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index a7966fff83..5cfcdf8ba7 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -1624,12 +1624,14 @@ 
static void update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats) { Relation sd; + CatalogIndexState indstate; int attno; if (natts <= 0) return; /* nothing to do */ sd = table_open(StatisticRelationId, RowExclusiveLock); + indstate = CatalogOpenIndexes(sd); for (attno = 0; attno < natts; attno++) { @@ -1735,18 +1737,19 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats) nulls, replaces); ReleaseSysCache(oldtup); - CatalogTupleUpdate(sd, &stup->t_self, stup); + CatalogTupleUpdateWithInfo(sd, &stup->t_self, stup, indstate); } else { /* No, insert new tuple */ stup = heap_form_tuple(RelationGetDescr(sd), values, nulls); - CatalogTupleInsert(sd, stup); + CatalogTupleInsertWithInfo(sd, stup, indstate); } heap_freetuple(stup); } + CatalogCloseIndexes(indstate); table_close(sd, RowExclusiveLock); } diff --git a/src/backend/commands/tsearchcmds.c b/src/backend/commands/tsearchcmds.c index 4cc4e3c00f..b95b989c5a 100644 --- a/src/backend/commands/tsearchcmds.c +++ b/src/backend/commands/tsearchcmds.c @@ -1004,14 +1004,16 @@ DefineTSConfiguration(List *names, List *parameters, ObjectAddress *copied) ScanKeyData skey; SysScanDesc scan; HeapTuple m
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
Re: To Peter Eisentraut > The new version of the patch just moves up the data types, and removes > the extra clutter from the beginnings of each description: The last version had the brackets in TG_ARGV[] (text[]) duplicated. Christoph >From 1355794d56919a015bd1528f62428beaab0a681b Mon Sep 17 00:00:00 2001 From: Christoph Berg Date: Wed, 31 Aug 2022 11:52:50 +0200 Subject: [PATCH] plpgsql-trigger.html: Use more concise wording for TG_ variables To improve readability of the TG_ variables definition list, this moves the datatypes up to the defined term to avoid having each entry start with "Data type", and removes some more redundant clutter ("Variable holding...") from the descriptions that didn't carry any information. --- doc/src/sgml/plpgsql.sgml | 48 +++ 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml index cf387dfc3f..13b2ab8829 100644 --- a/doc/src/sgml/plpgsql.sgml +++ b/doc/src/sgml/plpgsql.sgml @@ -4032,10 +4032,10 @@ ASSERT condition , - NEW + NEW (record) - Data type RECORD; variable holding the new + new database row for INSERT/UPDATE operations in row-level triggers. This variable is null in statement-level triggers and for DELETE operations. @@ -4044,10 +4044,10 @@ ASSERT condition , - OLD + OLD (record) - Data type RECORD; variable holding the old + old database row for UPDATE/DELETE operations in row-level triggers. This variable is null in statement-level triggers and for INSERT operations. @@ -4056,20 +4056,20 @@ ASSERT condition , - TG_NAME + TG_NAME (name) - Data type name; variable that contains the name of the trigger actually + name of the trigger actually fired. - TG_WHEN + TG_WHEN (text) - Data type text; a string of + string BEFORE, AFTER, or INSTEAD OF, depending on the trigger's definition. @@ -4077,10 +4077,10 @@ ASSERT condition , - TG_LEVEL + TG_LEVEL (text) - Data type text; a string of either + string ROW or STATEMENT depending on the trigger's definition. @@ -4088,10 +4088,10 @@ ASSERT condition , - TG_OP + TG_OP (text) - Data type text; a string of + string INSERT, UPDATE, DELETE, or TRUNCATE telling for which operation the trigger was fired. @@ -4100,20 +4100,20 @@ ASSERT condition , - TG_RELID + TG_RELID (oid, references pg_class.oid) - Data type oid; the object ID of the table that caused the + object ID of the table that caused the trigger invocation. - TG_RELNAME + TG_RELNAME (name) - Data type name; the name of the table that caused the trigger + name of the table that caused the trigger invocation. This is now deprecated, and could disappear in a future release. Use TG_TABLE_NAME instead. @@ -4121,40 +4121,40 @@ ASSERT condition , - TG_TABLE_NAME + TG_TABLE_NAME (name) - Data type name; the name of the table that + name of the table that caused the trigger invocation. - TG_TABLE_SCHEMA + TG_TABLE_SCHEMA (name) - Data type name; the name of the schema of the + name of the schema of the table that caused the trigger invocation. - TG_NARGS + TG_NARGS (integer) - Data type integer; the number of arguments given to the trigger + number of arguments given to the trigger function in the CREATE TRIGGER statement. - TG_ARGV[] + TG_ARGV (text[]) - Data type array of text; the arguments from + arguments from the CREATE TRIGGER statement. The index counts from 0. Invalid indexes (less than 0 or greater than or equal to tg_nargs) -- 2.35.1
Re: replacing role-level NOINHERIT with a grant-level option
On Tue, Aug 30, 2022 at 5:20 PM Nathan Bossart wrote:
> On Tue, Aug 30, 2022 at 03:24:56PM -0400, Robert Haas wrote:
> > - william => charles => elizabeth => uk is OK because the first two
> >   hops have ADMIN and the last hop has INHERIT
>
> Don't you mean the first two hops have INHERIT and the last has ADMIN?

I do.

> > - william => charles => elizabeth => uk => parliament is probably not
> >   OK because although the first two hops have ADMIN and the last has
> >   INHERIT, the third hop probably lacks INHERIT
>
> Same here. I don't mean to be pedantic, I just want to make sure I'm
> thinking of this correctly.

Right. The first two hops have INHERIT and the last has ADMIN, but the third hop probably lacks INHERIT.

> Yes, this is very helpful. I always appreciate your detailed examples. I
> think what you are describing matches the mental model I was beginning to
> form.

Cool.

> Okay, now to take a closer look at the patch...

Thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
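For readers following along, a sketch of the corrected chains in GRANT terms (role names come from the example above; the grant-level options use the syntax discussed in this thread, and the hop-3 settings are the assumptions that decide each case):

    GRANT charles   TO william   WITH INHERIT TRUE;  -- hop 1: INHERIT
    GRANT elizabeth TO charles   WITH INHERIT TRUE;  -- hop 2: INHERIT
    GRANT uk        TO elizabeth WITH ADMIN TRUE;    -- hop 3: ADMIN, so william
                                                     -- can administer uk

    -- For william => charles => elizabeth => uk => parliament, hop 3 would also
    -- need INHERIT TRUE; if it only has ADMIN, william cannot reach parliament's
    -- ADMIN grant through inheritance:
    GRANT parliament TO uk WITH ADMIN TRUE;          -- hop 4: ADMIN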
pg_upgrade generated files in subdir follow-up
Commit 38bfae36526 moved the .txt files pg_upgrade generates to a separate subdir, but there are a few left which are written to cwd. The thread resulting in that patch doesn't discuss these files specifically, so it seems they are just an oversight. Unless I'm missing something.

Should something like the attached be applied to ensure all generated files are placed in the subdirectory?

-- 
Daniel Gustafsson   https://vmware.com/

pg_upgrade_generated_files.diff
Description: Binary data
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
> On 31 Aug 2022, at 13:55, Christoph Berg wrote: > > Re: To Peter Eisentraut >> The new version of the patch just moves up the data types, and removes >> the extra clutter from the beginnings of each description: > > The last version had the brackets in TG_ARGV[] (text[]) duplicated. This, and the other string variables, now reads a bit strange IMO: - Data type text; a string of + string INSERT, UPDATE, DELETE, or TRUNCATE Wouldn't it be better with "string containing INSERT.." or something similar? -- Daniel Gustafsson https://vmware.com/
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
Re: Daniel Gustafsson > This, and the other string variables, now reads a bit strange IMO: > > - Data type text; a string of > + string > INSERT, UPDATE, > DELETE, or TRUNCATE > > Wouldn't it be better with "string containing INSERT.." or something > similar? Right, that felt strange to me as well, but I couldn't think of something better. "string containing" is again pretty boilerplatish, how about just "contains"? Christoph >From b675a9e63ac663817cf90ca3aee2acf398b3fa97 Mon Sep 17 00:00:00 2001 From: Christoph Berg Date: Wed, 31 Aug 2022 11:52:50 +0200 Subject: [PATCH] plpgsql-trigger.html: Use more concise wording for TG_ variables To improve readability of the TG_ variables definition list, this moves the datatypes up to the defined term to avoid having each entry start with "Data type", and removes some more redundant clutter ("Variable holding...") from the descriptions that didn't carry any information. --- doc/src/sgml/plpgsql.sgml | 52 +++ 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml index cf387dfc3f..b5cfe3c63d 100644 --- a/doc/src/sgml/plpgsql.sgml +++ b/doc/src/sgml/plpgsql.sgml @@ -4032,10 +4032,10 @@ ASSERT condition , - NEW + NEW (record) - Data type RECORD; variable holding the new + new database row for INSERT/UPDATE operations in row-level triggers. This variable is null in statement-level triggers and for DELETE operations. @@ -4044,10 +4044,10 @@ ASSERT condition , - OLD + OLD (record) - Data type RECORD; variable holding the old + old database row for UPDATE/DELETE operations in row-level triggers. This variable is null in statement-level triggers and for INSERT operations. @@ -4056,20 +4056,20 @@ ASSERT condition , - TG_NAME + TG_NAME (name) - Data type name; variable that contains the name of the trigger actually + name of the trigger actually fired. - TG_WHEN + TG_WHEN (text) - Data type text; a string of + contains BEFORE, AFTER, or INSTEAD OF, depending on the trigger's definition. @@ -4077,43 +4077,43 @@ ASSERT condition , - TG_LEVEL + TG_LEVEL (text) - Data type text; a string of either - ROW or STATEMENT + contains + ROW or STATEMENT, depending on the trigger's definition. - TG_OP + TG_OP (text) - Data type text; a string of + contains INSERT, UPDATE, - DELETE, or TRUNCATE + DELETE, or TRUNCATE, telling for which operation the trigger was fired. - TG_RELID + TG_RELID (oid, references pg_class.oid) - Data type oid; the object ID of the table that caused the + object ID of the table that caused the trigger invocation. - TG_RELNAME + TG_RELNAME (name) - Data type name; the name of the table that caused the trigger + name of the table that caused the trigger invocation. This is now deprecated, and could disappear in a future release. Use TG_TABLE_NAME instead. @@ -4121,40 +4121,40 @@ ASSERT condition , - TG_TABLE_NAME + TG_TABLE_NAME (name) - Data type name; the name of the table that + name of the table that caused the trigger invocation. - TG_TABLE_SCHEMA + TG_TABLE_SCHEMA (name) - Data type name; the name of the schema of the + name of the schema of the table that caused the trigger invocation. - TG_NARGS + TG_NARGS (integer) - Data type integer; the number of arguments given to the trigger + number of arguments given to the trigger function in the CREATE TRIGGER statement. - TG_ARGV[] + TG_ARGV (text[]) - Data type array of text; the arguments from + arguments from the CREATE TRIGGER statement. The index counts from 0. 
Invalid indexes (less than 0 or greater than or equal to tg_nargs) -- 2.35.1
Re: replacing role-level NOINHERIT with a grant-level option
On Tue, Aug 30, 2022 at 6:10 PM Nathan Bossart wrote: > On Tue, Aug 30, 2022 at 08:33:46AM -0400, Robert Haas wrote: > > Third try. > > Now that I have this grantor stuff fresh in my mind, this patch looks good > to me. Thanks for reviewing. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
Re: To Daniel Gustafsson > "string containing" is again pretty boilerplatish, how about just > "contains"? Actually, just omitting the whole prefix works best. TG_WHEN (text) BEFORE, AFTER, or INSTEAD OF, depending on the trigger's definition. I also shortened some "name of table" to just "table". Since the data type is "name", it's clear what "table" means. Christoph >From a281bbb3ba80e14645f83d464ca6df892a85c1bb Mon Sep 17 00:00:00 2001 From: Christoph Berg Date: Wed, 31 Aug 2022 11:52:50 +0200 Subject: [PATCH] plpgsql-trigger.html: Use more concise wording for TG_ variables To improve readability of the TG_ variables definition list, this moves the datatypes up to the defined term to avoid having each entry start with "Data type", and removes some more redundant clutter ("Variable holding...") from the descriptions that didn't carry any information. --- doc/src/sgml/plpgsql.sgml | 52 +++ 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml index cf387dfc3f..b5cfe3c63d 100644 --- a/doc/src/sgml/plpgsql.sgml +++ b/doc/src/sgml/plpgsql.sgml @@ -4032,10 +4032,10 @@ ASSERT condition , - NEW + NEW (record) - Data type RECORD; variable holding the new + new database row for INSERT/UPDATE operations in row-level triggers. This variable is null in statement-level triggers and for DELETE operations. @@ -4044,10 +4044,10 @@ ASSERT condition , - OLD + OLD (record) - Data type RECORD; variable holding the old + old database row for UPDATE/DELETE operations in row-level triggers. This variable is null in statement-level triggers and for INSERT operations. @@ -4056,20 +4056,20 @@ ASSERT condition , - TG_NAME + TG_NAME (name) - Data type name; variable that contains the name of the trigger actually + name of the trigger actually fired. - TG_WHEN + TG_WHEN (text) - Data type text; a string of + contains BEFORE, AFTER, or INSTEAD OF, depending on the trigger's definition. @@ -4077,43 +4077,43 @@ ASSERT condition , - TG_LEVEL + TG_LEVEL (text) - Data type text; a string of either - ROW or STATEMENT + contains + ROW or STATEMENT, depending on the trigger's definition. - TG_OP + TG_OP (text) - Data type text; a string of + contains INSERT, UPDATE, - DELETE, or TRUNCATE + DELETE, or TRUNCATE, telling for which operation the trigger was fired. - TG_RELID + TG_RELID (oid, references pg_class.oid) - Data type oid; the object ID of the table that caused the + object ID of the table that caused the trigger invocation. - TG_RELNAME + TG_RELNAME (name) - Data type name; the name of the table that caused the trigger + name of the table that caused the trigger invocation. This is now deprecated, and could disappear in a future release. Use TG_TABLE_NAME instead. @@ -4121,40 +4121,40 @@ ASSERT condition , - TG_TABLE_NAME + TG_TABLE_NAME (name) - Data type name; the name of the table that + name of the table that caused the trigger invocation. - TG_TABLE_SCHEMA + TG_TABLE_SCHEMA (name) - Data type name; the name of the schema of the + name of the schema of the table that caused the trigger invocation. - TG_NARGS + TG_NARGS (integer) - Data type integer; the number of arguments given to the trigger + number of arguments given to the trigger function in the CREATE TRIGGER statement. - TG_ARGV[] + TG_ARGV (text[]) - Data type array of text; the arguments from + arguments from the CREATE TRIGGER statement. The index counts from 0. Invalid indexes (less than 0 or greater than or equal to tg_nargs) -- 2.35.1
Re: SQL/JSON features for v15
On 2022-08-31 We 07:01, Amit Langote wrote: > On Wed, Aug 31, 2022 at 3:51 PM Amit Langote wrote: >> SELECT JSON_VALUE(jsonb '"aaa"', '$' RETURNING int DEFAULT 111 ON ERROR); >> - json_value >> - >> -111 >> -(1 row) >> - >> +ERROR: syntax error at or near "DEFAULT" >> +LINE 1: ...ELECT JSON_VALUE(jsonb '"aaa"', '$' RETURNING int DEFAULT 11... >> >> Is it intentional that you left many instances of the regression test >> output changes like the above? > Actually, thinking more about this, I am wondering if we should not > remove the DEFAULT expression productions in gram.y. Maybe we can > keep the syntax and give an unsupported error during parse-analysis, > like the last version of the patch did for DEFAULT ON EMPTY. Which > also means to also leave JsonBehavior alone but with default_expr > always NULL for now. > Producing an error in the parse analysis phase seems best to me. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
allowing for control over SET ROLE
Hi, There are two ways in which a role can exercise the privileges of some other role which has been granted to it. First, it can implicitly inherit the privileges of the granted role. Second, it can assume the identity of the granted role using the SET ROLE command. It is possible to control the former behavior, but not the latter. In v15 and prior release, we had a role-level [NO]INHERIT property which controlled whether a role automatically inherited the privileges of any role granted to it. This was all-or-nothing. Beginning in e3ce2de09d814f8770b2e3b3c152b7671bcdb83f, the inheritance behavior of role-grants can be overridden for individual grants, so that some grants are inherited and others are not. However, there is no similar facility for controlling whether a role can SET ROLE to some other role of which it is a member. At present, if role A is a member of role B, then A can SET ROLE B, and that's it. In some circumstances, it may be desirable to control this behavior. For example, if we GRANT pg_read_all_settings TO seer, we do want the seer to be able to read all the settings, else we would not have granted the role. But we might not want the seer to be able to do this: You are now connected to database "rhaas" as user "seer". rhaas=> set role pg_read_all_settings; SET rhaas=> create table artifact (a int); CREATE TABLE rhaas=> \d List of relations Schema | Name | Type |Owner +--+---+-- public | artifact | table | pg_read_all_settings (1 row) I have attached a rather small patch which makes it possible to control this behavior: You are now connected to database "rhaas" as user "rhaas". rhaas=# grant pg_read_all_settings to seer with set false; GRANT ROLE rhaas=# \c - seer You are now connected to database "rhaas" as user "seer". rhaas=> set role pg_read_all_settings; ERROR: permission denied to set role "pg_read_all_settings" I think that this behavior is generally useful, and not just for the predefined roles that we ship as part of PostgreSQL. I don't think it's too hard to imagine someone wanting to use some locally created role as a container for privileges but not wanting the users who possess this role to run around creating new objects owned by it. To some extent that can be controlled by making sure the role in question doesn't have any excess privileges, but that's not really sufficient: all you need is one schema anywhere in the system that grants CREATE to PUBLIC. You could avoid creating such a schema, which might be a good idea for other reasons anyway, but it feels like approaching the problem from the wrong end. What you really want is to allow the users to inherit the privileges of the role but not use SET ROLE to become that role, so that's what this patch lets you do. There's one other kind of case in which this sort of thing might be somewhat useful, although it's more dubious. Suppose you have an oncall group where you regularly add and remove members according to who is on call. Naturally, you have an on-call bot which performs this task automatically. The on-call bot has the ability to manage memberships in the oncall group, but should not have the ability to access any of its privileges, either by inheritance or via SET ROLE. 
This patch KIND OF lets you accomplish this: rhaas=# create role oncall; CREATE ROLE rhaas=# create role oncallbot login; CREATE ROLE rhaas=# grant oncall to oncallbot with inherit false, set false, admin true; GRANT ROLE rhaas=# create role anna; CREATE ROLE rhaas=# create role eliza; CREATE ROLE rhaas=# \c - oncallbot You are now connected to database "rhaas" as user "oncallbot". rhaas=> grant oncall to anna; GRANT ROLE rhaas=> revoke oncall from anna; REVOKE ROLE rhaas=> grant oncall to eliza; GRANT ROLE rhaas=> set role oncall; ERROR: permission denied to set role "oncall" The problem here is that if a nasty evil hacker takes over the oncallbot role, nothing whatsoever prevents them from executing "grant oncall to oncallbot with set true" after which they can then "SET ROLE oncall" using the privileges they just granted themselves. And even if under some theory we blocked that, they could still maliciously grant the sought-after on-call privileges to some other role i.e. "grant oncall to accomplice". It's fundamentally difficult to allow people to administer a set of privileges without giving them the ability to usurp those privileges, and I wouldn't like to pretend that this patch is in any way sufficient to accomplish such a thing. Nevertheless, I think there's some chance it might be useful to someone building such a system, in combination with other safeguards. Or maybe not: this isn't the main reason I'm interested in this, and it's just an added benefit if it turns out that someone can do something like this with it. In order to apply this patch, we'd need to reach a conclusion about the matters mentioned in http://postgr.es/m/ca+tgmobheyynw9vrhvolvd8odspbjuu9
Re: Slight refactoring of state check in pg_upgrade check_ function
> On 30 Aug 2022, at 23:08, Bruce Momjian wrote: > On Sun, Aug 28, 2022 at 03:06:09PM -0700, Nathan Bossart wrote: >> The patch looks reasonable to me. > > +1. Those checks have accumulated over time with different authors, > hence the stylistic differences. Pushed, thanks for review! -- Daniel Gustafsson https://vmware.com/
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
>> For the benefit of anyone who may be looking at this thread in the >> archive later, I believe this commit will have fixed this issue: I can also confirm that indeed the commit https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=6672d79 does fix this issue. Thanks! -- Sami Imseih Amazon Web Services (AWS)
Re: pg_upgrade generated files in subdir follow-up
Daniel Gustafsson writes: > Commit 38bfae36526 moved the .txt files pg_upgrade generates to a separate > subdir, but there are a few left which are written to cwd. The thread > resulting in that patch doesn't discuss these files specifically so it seems > they are just an oversight. Unless I'm missing something. > Should something the attached be applied to ensure all generated files are > placed in the subdirectory? It certainly looks inconsistent ATM. I wondered if maybe the plan was to put routine output into the log directory but problem-reporting files into cwd --- but that isn't what's happening now. As long as we report the path to where the file is, I don't see a reason not to put problem-reporting files in the subdir too. regards, tom lane
pg15b3: recovery fails with wal prefetch enabled
An internal VM crashed last night due to OOM. When I tried to start postgres, it failed like:

< 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
< 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
< 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
< 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed

I was able to start it with -c recovery_prefetch=no, so it seems like prefetch tried to do too much. The VM runs centos7 under qemu.

I'm making a copy of the data dir in case it's needed.

-- Justin
Re: plpgsql-trigger.html: Format TG_ variables as table (patch)
Christoph Berg writes: > Re: To Daniel Gustafsson >> "string containing" is again pretty boilerplatish, how about just >> "contains"? > > Actually, just omitting the whole prefix works best. > > TG_WHEN (text) > > BEFORE, AFTER, or INSTEAD OF, depending on the trigger's definition. The attached patch does not reflect this, did you attach an old version? > I also shortened some "name of table" to just "table". Since the data > type is "name", it's clear what "table" means. I think it reads better with the definite article and initial capital, e.g. "The table that triggered ….". > > > - NEW > + NEW (record) The type names should still be wrapped in , like they were before. - ilmari
Re: SQL/JSON features for v15
On 8/31/22 8:38 AM, Andrew Dunstan wrote: On 2022-08-31 We 07:01, Amit Langote wrote: On Wed, Aug 31, 2022 at 3:51 PM Amit Langote wrote: SELECT JSON_VALUE(jsonb '"aaa"', '$' RETURNING int DEFAULT 111 ON ERROR); - json_value - -111 -(1 row) - +ERROR: syntax error at or near "DEFAULT" +LINE 1: ...ELECT JSON_VALUE(jsonb '"aaa"', '$' RETURNING int DEFAULT 11... Is it intentional that you left many instances of the regression test output changes like the above? Actually, thinking more about this, I am wondering if we should not remove the DEFAULT expression productions in gram.y. Maybe we can keep the syntax and give an unsupported error during parse-analysis, like the last version of the patch did for DEFAULT ON EMPTY. Which also means to also leave JsonBehavior alone but with default_expr always NULL for now. Producing an error in the parse analysis phase seems best to me. Andres, Robert, Tom: With this recent work, have any of your opinions changed on including SQL/JSON in v15? Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: pg_upgrade generated files in subdir follow-up
> On 31 Aug 2022, at 15:59, Tom Lane wrote: > > Daniel Gustafsson writes: >> Commit 38bfae36526 moved the .txt files pg_upgrade generates to a separate >> subdir, but there are a few left which are written to cwd. The thread >> resulting in that patch doesn't discuss these files specifically so it seems >> they are just an oversight. Unless I'm missing something. > >> Should something the attached be applied to ensure all generated files are >> placed in the subdirectory? > > It certainly looks inconsistent ATM. I wondered if maybe the plan was to > put routine output into the log directory but problem-reporting files > into cwd --- but that isn't what's happening now. Right, check_proper_datallowconn and check_for_isn_and_int8_passing_mismatch and a few other check functions already place error reporting in the subdir. > As long as we report the path to where the file is, I don't see a reason > not to put problem-reporting files in the subdir too. Agreed. The documentation states: "pg_upgrade creates various working files, such as schema dumps, stored within pg_upgrade_output.d in the directory of the new cluster. Each run creates a new subdirectory named with a timestamp formatted as per ISO 8601 (%Y%m%dT%H%M%S), where all the generated files are stored." The delete_old_cluster and reindex_hash scripts are still placed in CWD, which isn't changed by this patch, as that seems correct (and might break scripts if we move them). Maybe we should amend the docs to mention that scripts aren't generated in the subdir? -- Daniel Gustafsson https://vmware.com/
Re: introduce bufmgr hooks
Hi, On 2022-08-29 15:24:49 -0700, Nathan Bossart wrote: > I'd like to propose some new hooks for the buffer manager. My primary goal > is to allow users to create an additional caching mechanism between the > shared buffers and disk for evicted buffers. I'm very doubtful this is a good idea. These are quite hot paths. While not a huge cost, adding an indirection isn't free nonetheless. I also think it'll make it harder to improve things in this area, which needs quite a bit of work. Greetings, Andres Freund
Re: SQL/JSON features for v15
Hi, On 2022-08-31 10:20:24 -0400, Jonathan S. Katz wrote: > Andres, Robert, Tom: With this recent work, have any of your opinions > changed on including SQL/JSON in v15? I don't really know what to do here. It feels blatantly obvious that this code isn't even remotely close to being releasable. I'm worried about the impact of the big revert at this stage of the release cycle, and that's not getting better by delaying further. And I'm getting weary of being asked to make the obvious call that the authors of this feature as well as the RMT should have made a while ago. From my POV the only real discussion is whether we'd want to revert this in 15 and HEAD or just 15. There's imo a decent point to be made to just revert in 15 and aggressively press forward with the changes posted in this thread. Greetings, Andres Freund
Re: [PATCH] Query Jumbling for CALL and SET utility statements
Hi st 31. 8. 2022 v 17:34 odesílatel Drouvot, Bertrand napsal: > Hi hackers, > > While query jumbling is provided for function calls that’s currently not > the case for procedures calls. > The reason behind this is that all utility statements are currently > discarded for jumbling. > > We’ve recently seen performance impacts (LWLock contention) due to the > lack of jumbling on procedure calls with pg_stat_statements and > pg_stat_statements.track_utility enabled (think an application with a high > rate of procedure calls with unique parameters for each call). > > Jeremy has had this conversation on twitter (see > https://twitter.com/jer_s/status/1560003560116342785) and Nikolay > reported that he also had to work on a similar performance issue with SET > being used. > > That’s why we think it would make sense to allow jumbling for those 2 > utility statements: CALL and SET. > > Please find attached a patch proposal for doing so. > > With the attached patch we would get things like: > CALL MINUS_TWO(3); > CALL MINUS_TWO(7); > CALL SUM_TWO(3, 8); > CALL SUM_TWO(7, 5); > set enable_seqscan=false; > set enable_seqscan=true; > set seq_page_cost=2.0; > set seq_page_cost=1.0; > > postgres=# SELECT query, calls, rows FROM pg_stat_statements; >query | calls | rows > ---+---+-- > set seq_page_cost=$1 | 2 |0 > CALL MINUS_TWO($1)| 2 |0 > set enable_seqscan=$1 | 2 |0 > CALL SUM_TWO($1, $2) | 2 |0 > > Looking forward to your feedback, > The idea is good, but I think you should use pg_stat_functions instead. Maybe it is supported already (I didn't test it). I am not sure so SET statement should be traced in pg_stat_statements - it is usually pretty fast, and without context it says nothing. It looks like just overhead. Regards Pavel > Thanks, > > Jeremy & Bertrand > > -- > Bertrand Drouvot > Amazon Web Services: https://aws.amazon.com > >
Re: SQL/JSON features for v15
On 2022-08-30 Tu 17:25, Nikita Glukhov wrote: > > > Patches 0001-0006: Yeah, these add the overhead of an extra function call (typin() -> typin_opt_error()) in possibly very common paths. Other than refactoring *all* places that call typin() to use the new API, the only other option seems to be to leave the typin() functions alone and duplicate their code in typin_opt_error() versions for all the types that this patch cares about. Though maybe, that's not necessarily a better compromise than accepting the extra function call overhead. >>> I think another possibility is to create a static inline function in the >>> corresponding .c module (say boolin_impl() in bool.c), which is called >>> by both the opt_error variant as well as the regular one. This would >>> avoid the duplicate code as well as the added function-call overhead. >> +1 > I always thought about such internal inline functions, I 've added them in > v10. > > A couple of questions about these: 1. Patch 5 changes the API of DecodeDateTime() and DecodeTimeOnly() by adding an extra parameter bool *error. Would it be better to provide _opt_error flavors of these? 2. Patch 6 changes jsonb_from_cstring so that it's no longer static inline. Shouldn't we have a static inline function that can be called from inside jsonb.c and is called by the extern function? changing both of these things would be quite trivial and should not hold anything up. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Tracking last scan time
On Tue, 30 Aug 2022 at 19:46, Bruce Momjian wrote: > On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote: > > On Thu, 25 Aug 2022 at 01:44, David Rowley wrote: > > I don't have a particular opinion about the patch, I'm just pointing > > out that there are other ways. Even just writing down the numbers on > a > > post-it note and coming back in a month to see if they've changed is > > enough to tell if the table or index has been used. > > > > > > There are usually other ways to perform monitoring tasks, but there is > > something to be said for the convenience of having functionality built > in and > > not having to rely on tools, scripts, or post-it notes :-) > > Should we consider using something cheaper like time() so we don't need > a GUC to enable this? > Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls takes about 2 seconds, whilst 100,000,000 time() calls takes 14(!) seconds. -- Dave Page Blog: https://pgsnake.blogspot.com Twitter: @pgsnake EDB: https://www.enterprisedb.com
Add tracking of backend memory allocated to pg_stat_activity
Hi Hackers, Attached is a patch to Add tracking of backend memory allocated to pg_stat_activity This new field displays the current bytes of memory allocated to the backend process. It is updated as memory for the process is malloc'd/free'd. Memory allocated to items on the freelist is included in the displayed value. Dynamic shared memory allocations are included only in the value displayed for the backend that created them, they are not included in the value for backends that are attached to them to avoid double counting. On occasion, orphaned memory segments may be cleaned up on postmaster startup. This may result in decreasing the sum without a prior increment. We limit the floor of backend_mem_allocated to zero. -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 1d9509a2f6..40ae638f25 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -947,6 +947,18 @@ postgres 27093 0.0 0.0 30096 2752 ?Ss 11:34 0:00 postgres: ser + + + backend_mem_allocated bigint + + + The byte count of memory allocated to this backend. Dynamic shared memory + allocations are included only in the value displayed for the backend that + created them, they are not included in the value for backends that are + attached to them to avoid double counting. + + + query text diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 5a844b63a1..d23f0e9dbb 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -863,6 +863,7 @@ CREATE VIEW pg_stat_activity AS S.backend_xid, s.backend_xmin, S.query_id, +S.backend_mem_allocated, S.query, S.backend_type FROM pg_stat_get_activity(NULL) AS S diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index e1b90c5de4..269ad2fe53 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -66,6 +66,7 @@ #include "postmaster/postmaster.h" #include "storage/dsm_impl.h" #include "storage/fd.h" +#include "utils/backend_status.h" #include "utils/guc.h" #include "utils/memutils.h" @@ -232,6 +233,13 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size, name))); return false; } + /* + * Detach and destroy pass through here, only decrease the memory + * shown allocated in pgstat_activity when the creator destroys the + * allocation. + */ + if (op == DSM_OP_DESTROY) + pgstat_report_backend_mem_allocated_decrease(*mapped_size); *mapped_address = NULL; *mapped_size = 0; if (op == DSM_OP_DESTROY && shm_unlink(name) != 0) @@ -332,6 +340,36 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size, name))); return false; } + + /* + * Attach and create pass through here, only update backend memory + * allocated in pgstat_activity for the creator process. + */ + if (op == DSM_OP_CREATE) + { + /* + * Posix creation calls dsm_impl_posix_resize implying that resizing + * occurs or may be added in the future. As implemented + * dsm_impl_posix_resize utilizes fallocate or truncate, passing the + * whole new size as input, growing the allocation as needed * (only + * truncate supports shrinking). We update by replacing the * old + * allocation with the new. + */ +#if defined(HAVE_POSIX_FALLOCATE) && defined(__linux__) + /* + * posix_fallocate does not shrink allocations, adjust only on + * allocation increase. 
+ */ + if (request_size > *mapped_size) + { + pgstat_report_backend_mem_allocated_decrease(*mapped_size); + pgstat_report_backend_mem_allocated_increase(request_size); + } +#else + pgstat_report_backend_mem_allocated_decrease(*mapped_size); + pgstat_report_backend_mem_allocated_increase(request_size); +#endif + } *mapped_address = address; *mapped_size = request_size; close(fd); @@ -537,6 +575,14 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size, name))); return false; } + + /* + * Detach and destroy pass through here, only decrease the memory + * shown allocated in pgstat_activity when the creator destroys the + * allocation. + */ + if (op == DSM_OP_DESTROY) + pgstat_report_backend_mem_allocated_decrease(*mapped_size); *mapped_address = NULL; *mapped_size = 0; if (op == DSM_OP_DESTROY && shmctl(ident, IPC_RMID, NULL) < 0) @@ -584,6 +630,13 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size, name))); return false; } + + /* + * Attach and create pass through here, only update backend memory + * allocated in pgstat_activity for the creator process. + */ + if (op == DSM_OP_CREATE) + pgstat_report_backend_mem_allocated_increase(req
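A minimal usage sketch for readers following the thread: assuming a build with the patch above applied, the proposed column can be read like any other pg_stat_activity field (the column name is taken from the patch and could still change):

-- Illustrative only: backend_mem_allocated exists only with the proposed
-- patch applied; ordering by it shows the largest consumers first.
SELECT pid, backend_type, backend_mem_allocated
FROM pg_stat_activity
ORDER BY backend_mem_allocated DESC;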
Re: SQL/JSON features for v15
Andres Freund writes: > On 2022-08-31 10:20:24 -0400, Jonathan S. Katz wrote: >> Andres, Robert, Tom: With this recent work, have any of your opinions >> changed on including SQL/JSON in v15? > I don't really know what to do here. It feels blatantly obvious that this code > isn't even remotely close to being releasable. I'm worried about the impact of > the big revert at this stage of the release cycle, and that's not getting > better by delaying further. And I'm getting weary of being asked to make the > obvious call that the authors of this feature as well as the RMT should have > made a while ago. I have to agree. There is a large amount of code at stake here. We're being asked to review a bunch of hastily-produced patches to that code on an even more hasty schedule (and personally I have other things I need to do today...) I think the odds of a favorable end result are small. > From my POV the only real discussion is whether we'd want to revert this in 15 > and HEAD or just 15. There's imo a decent point to be made to just revert in > 15 and aggressively press forward with the changes posted in this thread. I'm not for that. Code that we don't think is ready to ship has no business being in the common tree, nor does it make review any easier to be looking at one bulky set of already-committed patches and another bulky set of deltas. I'm okay with making an exception for the include/nodes/ and backend/nodes/ files in HEAD, since the recent changes in that area mean it'd be a lot of error-prone work to produce a reverting patch there. We can leave those in as dead code temporarily, I think. regards, tom lane
Re: [PATCH] Query Jumbling for CALL and SET utility statements
Hi, On 2022-08-31 17:33:44 +0200, Drouvot, Bertrand wrote: > While query jumbling is provided for function calls that’s currently not the > case for procedures calls. > The reason behind this is that all utility statements are currently > discarded for jumbling. > [...] > That’s why we think it would make sense to allow jumbling for those 2 > utility statements: CALL and SET. Yes, I've seen this be an issue repeatedly. Although more heavily for PREPARE / EXECUTE than either of the two cases you handle here. IME not tracking PREPARE / EXECUTE can distort statistics substantially - there's appears to be a surprising number of applications / frameworks resorting to them. Basically requiring that track utility is turned on. I suspect we should carve out things like CALL, PREPARE, EXECUTE from track_utility - it's more or less an architectural accident that they're utility statements. It's a bit less clear that SET should be dealt with that way. > @@ -383,6 +384,30 @@ JumbleExpr(JumbleState *jstate, Node *node) > APP_JUMB(var->varlevelsup); > } > break; > + case T_CallStmt: > + { > + CallStmt *stmt = (CallStmt *) node; > + FuncExpr *expr = stmt->funcexpr; > + > + APP_JUMB(expr->funcid); > + JumbleExpr(jstate, (Node *) expr->args); > + } > + break; Why do we need to take the arguments into account? > + case T_VariableSetStmt: > + { > + VariableSetStmt *stmt = (VariableSetStmt *) > node; > + > + APP_JUMB_STRING(stmt->name); > + JumbleExpr(jstate, (Node *) stmt->args); > + } > + break; Same? > + case T_A_Const: > + { > + int loc = ((const A_Const > *) node)->location; > + > + RecordConstLocation(jstate, loc); > + } > + break; I suspect we only need this because of the jumbling of unparsed arguments I questioned above? If we do end up needing it, shouldn't we include the type in the jumbling? Greetings, Andres Freund
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 05:02:33PM +0100, Dave Page wrote: > > > On Tue, 30 Aug 2022 at 19:46, Bruce Momjian wrote: > > On Fri, Aug 26, 2022 at 02:05:36PM +0100, Dave Page wrote: > > On Thu, 25 Aug 2022 at 01:44, David Rowley wrote: > > I don't have a particular opinion about the patch, I'm just pointing > > out that there are other ways. Even just writing down the numbers on > a > > post-it note and coming back in a month to see if they've changed is > > enough to tell if the table or index has been used. > > > > > > There are usually other ways to perform monitoring tasks, but there is > > something to be said for the convenience of having functionality built > in > and > > not having to rely on tools, scripts, or post-it notes :-) > > Should we consider using something cheaper like time() so we don't need > a GUC to enable this? > > > Interesting idea, but on my mac at least, 100,000,000 gettimeofday() calls > takes about 2 seconds, whilst 100,000,000 time() calls takes 14(!) seconds. Wow. I was just thinking you need second-level accuracy, which must be cheap somewhere. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Hi Hackers, This patch ensures get_database_list() switches back to the memory context in use upon entry rather than returning with TopMemoryContext as the context. This will address memory allocations in autovacuum.c being associated with TopMemoryContext when they should not be. autovacuum.c do_start_worker() with current context 'Autovacuum start worker (tmp)' invokes get_database_list(). Upon return, the current context has been changed to TopMemoryContext by AtCommit_Memory() as part of an internal transaction. Further down in do_start_worker(), pgstat_fetch_stat_dbentry() is invoked. Previously this didn't pose an issue; however, recent changes altered how pgstat_fetch_stat_dbentry() is implemented. The new implementation has a branch utilizing palloc. The patch ensures these allocations are associated with the 'Autovacuum start worker (tmp)' context rather than the TopMemoryContext. Prior to the change, leaving an idle laptop PG instance running just shy of 3 days saw the autovacuum launcher process grow to 42MB with most of that growth in TopMemoryContext due to the palloc allocations issued with autovacuum worker startup. -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index b3b1afba86..92b1ef45c1 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -1941,6 +1941,13 @@ get_database_list(void) CommitTransactionCommand(); + /* + * CommitTransactionCommand returns with CurrentMemoryContext set + * to TopMemoryContext. Switch back to the context that we entered + * with. + */ + MemoryContextSwitchTo(resultcxt); + return dblist; }
Re: Tracking last scan time
Hi, On 2022-08-23 10:55:09 +0100, Dave Page wrote: > Often it is beneficial to review one's schema with a view to removing > indexes (and sometimes tables) that are no longer required. It's very > difficult to understand when that is the case by looking at the number of > scans of a relation as, for example, an index may be used infrequently but > may be critical in those times when it is used. > > The attached patch against HEAD adds optional tracking of the last scan > time for relations. It updates pg_stat_*_tables with new last_seq_scan and > last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to > help with this. > > Due to the use of gettimeofday(), those values are only maintained if a new > GUC, track_scans, is set to on. By default, it is off. > > I did run a 12 hour test to see what the performance impact is. pgbench was > run with scale factor 1 and 75 users across 4 identical bare metal > machines running Rocky 8 in parallel which showed roughly a -2% average > performance penalty against HEAD with track_scans enabled. Machines were > PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data > directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source > is tsc.
>
>            HEAD         track_scans   Penalty (%)
> box1       19582.49735  19341.8881    -1.22869541
> box2       19936.55513  19928.07479   -0.04253664659
> box3       19631.78895  18649.64379   -5.002830696
> box4       19810.86767  19420.67192   -1.969604525
> Average    19740.42728  19335.06965   -2.05343896

Based on the size of those numbers this was a r/w pgbench. If it has this noticeable an impact for r/w, with a pretty low number of scans/sec, how's the overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It must be quite bad. I don't think we should accept this feature with this overhead - but I also think we can do better, by accepting a bit less accuracy. For this to be useful we don't need a perfectly accurate timestamp. The statement start time is probably not accurate enough, but we could just have bgwriter or such update one in shared memory every time we wake up? Or perhaps we could go to an even lower granularity, by putting in the current LSN or such? Greetings, Andres Freund
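A quick sketch of how the proposed columns might be consulted once such a patch is applied (column and GUC names are taken from the patch description quoted above and may change; track_scans must be enabled for the timestamps to be maintained):

-- Illustrative only: last_idx_scan exists only with the proposed patch applied.
-- List indexes that have never been scanned, or not in the last 90 days.
SELECT schemaname, relname, indexrelname, idx_scan, last_idx_scan
FROM pg_stat_user_indexes
WHERE last_idx_scan IS NULL
   OR last_idx_scan < now() - interval '90 days'
ORDER BY last_idx_scan NULLS FIRST;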
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz wrote: > Andres, Robert, Tom: With this recent work, have any of your opinions > changed on including SQL/JSON in v15? No. Nothing's been committed, and there's no time to review anything in detail, and there was never going to be. Nikita said he was ready to start hacking in mid-August. That's good of him, but feature freeze was in April. We don't start hacking on a feature 4 months after the freeze. I'm unwilling to drop everything I'm working on to review patches that were written in a last minute rush. Even if these patches were more important to me than my own work, which they are not, I couldn't possibly do a good job reviewing complex patches on top of other complex patches in an area that I haven't studied in years. And if I could do a good job, no doubt I'd find a bunch of problems - whether they would be large or small, I don't know - and then that would lead to more changes even closer to the intended release date. I just don't understand what the RMT thinks it is doing here. When a concern is raised about whether a feature is anywhere close to being in a releasable state in August, "hack on it some more and then see where we're at" seems like an obviously impractical way forward. It seemed clear to me from the moment that Andres raised his concerns that the only two viable strategies were (1) revert the feature and be sad or (2) decide to ship it anyway and hope that Andres is incorrect in thinking that it will become an embarrassment to the project. The RMT has chosen neither of these, and in fact, really seems to want someone else to make the decision. But that's not how it works. The RMT concept was invented precisely to solve problems like this one, where the patch authors don't really want to revert it but other people think it's pretty busted. If such problems were best addressed by waiting for a long time to see whether anything changes, we wouldn't need an RMT. That's exactly how we used to handle these kinds of problems, and it sucked. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SQL/JSON features for v15
Hi, On 2022-08-31 12:26:29 -0400, Robert Haas wrote: > On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz > wrote: > > Andres, Robert, Tom: With this recent work, have any of your opinions > > changed on including SQL/JSON in v15? > > No. Nothing's been committed, and there's no time to review anything > in detail, and there was never going to be. Nikita said he was ready > to start hacking in mid-August. That's good of him, but feature freeze > was in April. As additional context: I had started raising those concerns mid June. Greetings, Andres Freund
Re: SQL/JSON features for v15
On 8/31/22 12:26 PM, Robert Haas wrote: On Wed, Aug 31, 2022 at 10:20 AM Jonathan S. Katz wrote: Andres, Robert, Tom: With this recent work, have any of your opinions changed on including SQL/JSON in v15? No. Nothing's been committed, and there's no time to review anything in detail, and there was never going to be. OK. Based on this feedback, the RMT is going to request that this is reverted. With RMT hat on -- Andrew can you please revert the patchset? Nikita said he was ready to start hacking in mid-August. That's good of him, but feature freeze was in April. We don't start hacking on a feature 4 months after the freeze. I'm unwilling to drop everything I'm working on to review patches that were written in a last minute rush. Even if these patches were more important to me than my own work, which they are not, I couldn't possibly do a good job reviewing complex patches on top of other complex patches in an area that I haven't studied in years. And if I could do a good job, no doubt I'd find a bunch of problems - whether they would be large or small, I don't know - and then that would lead to more changes even closer to the intended release date. I just don't understand what the RMT thinks it is doing here. When a concern is raised about whether a feature is anywhere close to being in a releasable state in August, "hack on it some more and then see where we're at" seems like an obviously impractical way forward. It seemed clear to me from the moment that Andres raised his concerns that the only two viable strategies were (1) revert the feature and be sad or (2) decide to ship it anyway and hope that Andres is incorrect in thinking that it will become an embarrassment to the project. The RMT has chosen neither of these, and in fact, really seems to want someone else to make the decision. But that's not how it works. The RMT concept was invented precisely to solve problems like this one, where the patch authors don't really want to revert it but other people think it's pretty busted. If such problems were best addressed by waiting for a long time to see whether anything changes, we wouldn't need an RMT. That's exactly how we used to handle these kinds of problems, and it sucked. This is fair feedback. However, there are a few things to consider here: 1. When Andres raised his initial concerns, the RMT did recommend to revert but did not force it. Part of the RMT charter is to try to get consensus before doing so and after we've exhausted the community process. As we moved closer, the patch others proposed some suggestions which other folks were amenable to trying. Unfortunately, time has run out. However, 2. One of the other main goals of the RMT is to ensure the release ships "on time" which we define to be late Q3/early Q4. We factored that into the decision making process around this. We are still on time for the release. I take responsibility for the decision making. I would be open to discuss this further around what worked / what didn't with the RMT and where we can improve in the future. Thanks, Jonathan OpenPGP_signature Description: OpenPGP digital signature
Add the ability to limit the amount of memory that can be allocated to backends.
Hi Hackers, Add the ability to limit the amount of memory that can be allocated to backends. This builds on the work that adds backend memory allocated to pg_stat_activity https://www.postgresql.org/message-id/67bb5c15c0489cb499723b0340f16e10c22485ec.camel%40crunchydata.com Both patches are attached. Add GUC variable max_total_backend_memory. Specifies a limit to the amount of memory (MB) that may be allocated to backends in total (i.e. this is not a per user or per backend limit). If unset, or set to 0 it is disabled. It is intended as a resource to help avoid the OOM killer. A backend request that would push the total over the limit will be denied with an out of memory error causing that backends current query/transaction to fail. Due to the dynamic nature of memory allocations, this limit is not exact. If within 1.5MB of the limit and two backends request 1MB each at the same time both may be allocated exceeding the limit. Further requests will not be allocated until dropping below the limit. Keep this in mind when setting this value to avoid the OOM killer. Currently, this limit does not affect auxiliary backend processes, this list of non-affected backend processes is open for discussion as to what should/should not be included. Backend memory allocations are displayed in the pg_stat_activity view. -- Reid Thompson Senior Software Engineer Crunchy Data, Inc. reid.thomp...@crunchydata.com www.crunchydata.com diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a5cd4e44c7..caf958310a 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -2079,6 +2079,32 @@ include_dir 'conf.d' + + max_total_backend_memory (integer) + + max_total_backend_memory configuration parameter + + + + +Specifies a limit to the amount of memory (MB) that may be allocated to +backends in total (i.e. this is not a per user or per backend limit). +If unset, or set to 0 it is disabled. A backend request that would push +the total over the limit will be denied with an out of memory error +causing that backends current query/transaction to fail. Due to the dynamic +nature of memory allocations, this limit is not exact. If within 1.5MB of +the limit and two backends request 1MB each at the same time both may be +allocated exceeding the limit. Further requests will not be allocated until +dropping below the limit. Keep this in mind when setting this value. This +limit does not affect auxiliary backend processes + . Backend memory allocations +(backend_mem_allocated) are displayed in the +pg_stat_activity +view. + + + + diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c index 269ad2fe53..808ffe75f2 100644 --- a/src/backend/storage/ipc/dsm_impl.c +++ b/src/backend/storage/ipc/dsm_impl.c @@ -253,6 +253,10 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size, return true; } + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* * Create new segment or open an existing one for attach. * @@ -524,6 +528,10 @@ dsm_impl_sysv(dsm_op op, dsm_handle handle, Size request_size, int flags = IPCProtection; size_t segsize; + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* * Allocate the memory BEFORE acquiring the resource, so that we don't * leak the resource if memory allocation fails. 
@@ -718,6 +726,10 @@ dsm_impl_windows(dsm_op op, dsm_handle handle, Size request_size, return true; } + /* Do not exceed maximum allowed memory allocation */ + if (op == DSM_OP_CREATE && exceeds_max_total_bkend_mem(request_size)) + return false; + /* Create new segment or open an existing one for attach. */ if (op == DSM_OP_CREATE) { diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c index 17a00587f8..9137a000ae 100644 --- a/src/backend/utils/activity/backend_status.c +++ b/src/backend/utils/activity/backend_status.c @@ -44,6 +44,8 @@ */ bool pgstat_track_activities = false; int pgstat_track_activity_query_size = 1024; +/* Max backend memory allocation allowed (MB). 0 = disabled */ +int max_total_bkend_mem = 0; /* exposed so that backend_progress.c can access it */ @@ -1253,3 +1255,107 @@ pgstat_report_backend_mem_allocated_decrease(uint64 deallocation) beentry->backend_mem_allocated -= deallocation; PGSTAT_END_WRITE_ACTIVITY(beentry); } + +/* -- + * pgstat_get_all_backend_memory_allocated() - + * + * Return a uint64 representing the current shared memory allocated to all + * backends. This looks directly at the BackendStatusArray, and so will + * pro
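A brief usage sketch (assuming the patch above is applied; the GUC name and MB units are as proposed and may change). Since the parameter is defined at PGC_SIGHUP level, it can be adjusted without a restart:

-- Illustrative only: max_total_backend_memory exists only with the proposed patch.
ALTER SYSTEM SET max_total_backend_memory = 4096;  -- cap total backend allocations at roughly 4GB
SELECT pg_reload_conf();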
Re: [PATCH] Query Jumbling for CALL and SET utility statements
st 31. 8. 2022 v 17:50 odesílatel Pavel Stehule napsal: > Hi > > > st 31. 8. 2022 v 17:34 odesílatel Drouvot, Bertrand > napsal: > >> Hi hackers, >> >> While query jumbling is provided for function calls that’s currently not >> the case for procedures calls. >> The reason behind this is that all utility statements are currently >> discarded for jumbling. >> >> We’ve recently seen performance impacts (LWLock contention) due to the >> lack of jumbling on procedure calls with pg_stat_statements and >> pg_stat_statements.track_utility enabled (think an application with a high >> rate of procedure calls with unique parameters for each call). >> >> Jeremy has had this conversation on twitter (see >> https://twitter.com/jer_s/status/1560003560116342785) and Nikolay >> reported that he also had to work on a similar performance issue with SET >> being used. >> >> That’s why we think it would make sense to allow jumbling for those 2 >> utility statements: CALL and SET. >> >> Please find attached a patch proposal for doing so. >> >> With the attached patch we would get things like: >> CALL MINUS_TWO(3); >> CALL MINUS_TWO(7); >> CALL SUM_TWO(3, 8); >> CALL SUM_TWO(7, 5); >> set enable_seqscan=false; >> set enable_seqscan=true; >> set seq_page_cost=2.0; >> set seq_page_cost=1.0; >> >> postgres=# SELECT query, calls, rows FROM pg_stat_statements; >>query | calls | rows >> ---+---+-- >> set seq_page_cost=$1 | 2 |0 >> CALL MINUS_TWO($1)| 2 |0 >> set enable_seqscan=$1 | 2 |0 >> CALL SUM_TWO($1, $2) | 2 |0 >> >> Looking forward to your feedback, >> > The idea is good, but I think you should use pg_stat_functions instead. > Maybe it is supported already (I didn't test it). I am not sure so SET > statement should be traced in pg_stat_statements - it is usually pretty > fast, and without context it says nothing. It looks like just overhead. > I was wrong - there is an analogy with SELECT fx, and the statistics are in pg_stat_statements, and in pg_stat_function too. Regards Pavel > > Regards > > Pavel > > >> Thanks, >> >> Jeremy & Bertrand >> >> -- >> Bertrand Drouvot >> Amazon Web Services: https://aws.amazon.com >> >>
Re: Add tracking of backend memory allocated to pg_stat_activity
On Wed, Aug 31, 2022 at 12:03:06PM -0400, Reid Thompson wrote: > Hi Hackers, > > Attached is a patch to > Add tracking of backend memory allocated to pg_stat_activity > + proargmodes => > '{i,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}', In the past, there was concern about making pg_stat_activity wider by adding information that's less-essential than what's been there for years. This is only an int64, so it's not "wide", but I wonder if there's another way to expose this information? Like adding backends to pg_get_backend_memory_contexts() , maybe with another view on top of that ? +* shown allocated in pgstat_activity when the creator destroys the pg_stat > + * Posix creation calls dsm_impl_posix_resize implying that > resizing > + * occurs or may be added in the future. As implemented > + * dsm_impl_posix_resize utilizes fallocate or truncate, > passing the > + * whole new size as input, growing the allocation as needed * > (only > + * truncate supports shrinking). We update by replacing the * > old wrapping caused extraneous stars > + * Do not allow backend_mem_allocated to go below zero. ereport if we > + * would have. There's no need for a lock around the read here asit's as it's > + ereport(LOG, (errmsg("decrease reduces reported backend memory > allocated below zero; setting reported to 0"))); errmsg() doesn't require the outside paranthesis since a couple years ago. > + /* > + * Until header allocation is included in context->mem_allocated cast to > + * slab and decrement the headerSize add a comma before "cast" ? -- Justin
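For reference, the existing facility mentioned above covers only the backend that runs the query; a rough illustration against stock PostgreSQL 14 or later:

-- pg_backend_memory_contexts (built on pg_get_backend_memory_contexts())
-- reports memory contexts for the current backend only; the suggestion above
-- is about exposing comparable data for other backends as well.
SELECT name, parent, total_bytes, used_bytes
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 5;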
Re: SQL/JSON features for v15
I wrote: > Andres Freund writes: >> From my POV the only real discussion is whether we'd want to revert this in >> 15 >> and HEAD or just 15. There's imo a decent point to be made to just revert in >> 15 and aggressively press forward with the changes posted in this thread. > I'm not for that. Code that we don't think is ready to ship > has no business being in the common tree, nor does it make > review any easier to be looking at one bulky set of > already-committed patches and another bulky set of deltas. To enlarge on that a bit: it seems to me that the really fundamental issue here is how to catch datatype-specific input and conversion errors without using subtransactions, because those are too expensive and can mask errors we'd rather not be masking, such as OOM. (Andres had some additional, more localized concerns, but I think this is the one with big-picture implications.) The currently proposed patchset hacks up a relatively small number of core datatypes to be able to do that. But it's just a hack and there's no prospect of extension types being able to join in the fun. I think where we need to start, for v16, is making an API design that will let any datatype have this functionality. (I don't say that we'd convert every datatype to do so right away; in the long run we should, but I'm content to start with just the same core types touched here.) Beside the JSON stuff, there is another even more pressing application for such behavior, namely the often-requested COPY functionality to be able to shunt bad data off somewhere without losing the entire transfer. In the COPY case I think we'd want to be able to capture the error message that would have been issued, which means the current patches are not at all appropriate as a basis for that API design: they're just returning a bool without any details. So that's why I'm in favor of reverting and starting over. There are probably big chunks of what's been done that can be re-used, but it all needs to be re-examined with this sort of design in mind. As a really quick sketch of what such an API might look like: we could invent a new node type, say IOCallContext, which is intended to be passed as FunctionCallInfo.context to type input functions and perhaps type conversion functions. Call sites wishing to have no-thrown-error functionality would initialize one of these to show "no error" and then pass it to the data type's usual input function. Old-style input functions would ignore this and just throw errors as usual; sorry, you don't get the no-error functionality you wanted. But I/O functions that had been updated would know to store the report of a relevant error into that node and then return NULL. (Although I think there may be assumptions somewhere that I/O functions don't return NULL, so maybe "just return any dummy value" is a better idea? Although likely it wouldn't be hard to remove such assumptions from callers using this functionality.) The caller would detect the presence of an error by examining the node contents and then do whatever it needs to do. regards, tom lane
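To make the COPY pain point concrete, here is a small sketch of today's behavior (table and data are made up; exact error text can vary by version). The API sketched above is aimed at letting COPY capture this kind of per-row failure instead of aborting the whole transfer:

-- Illustrative only: one malformed value currently aborts the entire COPY.
CREATE TEMP TABLE copy_demo (i int);
COPY copy_demo FROM STDIN;
1
two
3
\.
-- ERROR:  invalid input syntax for type integer: "two"
-- CONTEXT:  COPY copy_demo, line 2, column i: "two"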
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: > The currently proposed patchset hacks up a relatively small number > of core datatypes to be able to do that. But it's just a hack > and there's no prospect of extension types being able to join > in the fun. I think where we need to start, for v16, is making > an API design that will let any datatype have this functionality. This would be really nice to have. > (I don't say that we'd convert every datatype to do so right away; > in the long run we should, but I'm content to start with just the > same core types touched here.) I would be in favor of making more of an effort than just a few token data types. The initial patch could just touch a few, but once the infrastructure is in place we should really make a sweep through the tree and tidy up. > Beside the JSON stuff, there is > another even more pressing application for such behavior, namely > the often-requested COPY functionality to be able to shunt bad data > off somewhere without losing the entire transfer. In the COPY case > I think we'd want to be able to capture the error message that > would have been issued, which means the current patches are not > at all appropriate as a basis for that API design: they're just > returning a bool without any details. Fully agreed. -- Robert Haas EDB: http://www.enterprisedb.com
Re: SQL/JSON features for v15
Robert Haas writes: > On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: >> (I don't say that we'd convert every datatype to do so right away; >> in the long run we should, but I'm content to start with just the >> same core types touched here.) > I would be in favor of making more of an effort than just a few token > data types. The initial patch could just touch a few, but once the > infrastructure is in place we should really make a sweep through the > tree and tidy up. Sure, but my point is that we can do that in a time-extended fashion rather than having a flag day where everything must be updated. The initial patch just needs to update a few types as proof of concept. regards, tom lane
Re: [PATCH] Add native windows on arm64 support
Michael Paquier writes: > At the end, I'd like to think that it would be wiser to just remove > entirely the test for node() or reduce the contents of the string to > be able to strictly control the output order (say a simple > 'prepost' would do the trick). I cannot think > about a better idea now. Note that removing the test case where we > have node() has no impact on the coverage of xml.c. Yeah, I confirm that: local code-coverage testing shows no change in the number of lines reached in xml.c when I remove that column: -SELECT * FROM XMLTABLE('*' PASSING 'pre&deeppost' COLUMNS x xml PATH 'node()', y xml PATH '/'); +SELECT * FROM XMLTABLE('*' PASSING 'pre&deeppost' COLUMNS y xml PATH '/'); Dropping the query altogether does result in a reduction in the number of lines hit. I did not check the other XMLTABLE infrastructure, so what probably is best to do is keep the two output columns but change 'node()' to something with a more stable result; any preferences? regards, tom lane
Re: Add the ability to limit the amount of memory that can be allocated to backends.
On Wed, Aug 31, 2022 at 12:50:19PM -0400, Reid Thompson wrote: > Hi Hackers, > > Add the ability to limit the amount of memory that can be allocated to > backends. > > This builds on the work that adds backend memory allocated to > pg_stat_activity > https://www.postgresql.org/message-id/67bb5c15c0489cb499723b0340f16e10c22485ec.camel%40crunchydata.com > Both patches are attached. You should name the patches with different prefixes, like 001,002,003 Otherwise, cfbot may try to apply them in the wrong order. git format-patch is the usual tool for that. > +Specifies a limit to the amount of memory (MB) that may be allocated > to MB are just the default unit, right ? The user should be allowed to write max_total_backend_memory='2GB' > +backends in total (i.e. this is not a per user or per backend limit). > +If unset, or set to 0 it is disabled. A backend request that would > push > +the total over the limit will be denied with an out of memory error > +causing that backends current query/transaction to fail. Due to the > dynamic backend's > +nature of memory allocations, this limit is not exact. If within > 1.5MB of > +the limit and two backends request 1MB each at the same time both > may be > +allocated exceeding the limit. Further requests will not be > allocated until allocated, and exceed the limit > +bool > +exceeds_max_total_bkend_mem(uint64 allocation_request) > +{ > + bool result = false; > + > + if (MyAuxProcType != NotAnAuxProcess) > + return result; The double negative is confusing, so could use a comment. > + /* Convert max_total_bkend_mem to bytes for comparison */ > + if (max_total_bkend_mem && > + pgstat_get_all_backend_memory_allocated() + > + allocation_request > (uint64)max_total_bkend_mem * 1024 * 1024) > + { > + /* > + * Explicitely identify the OOM being a result of this > + * configuration parameter vs a system failure to allocate OOM. > + */ > + elog(WARNING, > + "request will exceed postgresql.conf defined > max_total_backend_memory limit (%lu > %lu)", > + pgstat_get_all_backend_memory_allocated() + > + allocation_request, (uint64)max_total_bkend_mem * 1024 > * 1024); I think it should be ereport() rather than elog(), which is internal-only, and not-translated. > + {"max_total_backend_memory", PGC_SIGHUP, RESOURCES_MEM, > + gettext_noop("Restrict total backend memory > allocations to this max."), > + gettext_noop("0 turns this feature off."), > + GUC_UNIT_MB > + }, > + &max_total_bkend_mem, > + 0, 0, INT_MAX, > + NULL, NULL, NULL I think this needs a maximum like INT_MAX/1024/1024 > +uint64 > +pgstat_get_all_backend_memory_allocated(void) > +{ ... > + for (i = 1; i <= NumBackendStatSlots; i++) > + { It's looping over every backend for each allocation. Do you know if there's any performance impact of that ? I think it may be necessary to track the current allocation size in shared memory (with atomic increments?). Maybe decrements would need to be exactly accounted for, or otherwise Assert() that the value is not negative. I don't know how expensive it'd be to have conditionals for each decrement, but maybe the value would only be decremented at strategic times, like at transaction commit or backend shutdown. -- Justin
Re: PostgreSQL 15 release announcement draft
Hi, On Tue, Aug 30, 2022 at 03:58:48PM -0400, Jonathan S. Katz wrote: > ### Other Notable Changes > > PostgreSQL server-level statistics are now collected in shared memory, > eliminating the statistics collector process and writing these stats to disk. > PostgreSQL 15 also revokes the `CREATE` permission from all users except a > database owner from the `public` (or default) schema. It's a bit weird to lump those two in the same paragraph, but ok. However, I think the "and writing these stats to disk." might not be very clear to people not familiar with the feature; they might think writing stats to disk is part of the new feature. So I propose "as well as writing these stats to disk" instead or something. Michael
Re: Tracking last scan time
On Wed, 31 Aug 2022 at 18:21, Andres Freund wrote: > > Hi, > > On 2022-08-23 10:55:09 +0100, Dave Page wrote: > > Often it is beneficial to review one's schema with a view to removing > > indexes (and sometimes tables) that are no longer required. It's very > > difficult to understand when that is the case by looking at the number of > > scans of a relation as, for example, an index may be used infrequently but > > may be critical in those times when it is used. > > > > The attached patch against HEAD adds optional tracking of the last scan > > time for relations. It updates pg_stat_*_tables with new last_seq_scan and > > last_idx_scan columns, and pg_stat_*_indexes with a last_idx_scan column to > > help with this. > > > > Due to the use of gettimeofday(), those values are only maintained if a new > > GUC, track_scans, is set to on. By default, it is off. > > > > I did run a 12 hour test to see what the performance impact is. pgbench was > > run with scale factor 1 and 75 users across 4 identical bare metal > > machines running Rocky 8 in parallel which showed roughly a -2% average > > performance penalty against HEAD with track_scans enabled. Machines were > > PowerEdge R7525's with 128GB RAM, dual 16C/32T AMD 7302 CPUs, with the data > > directory on 6 x 800GB 12Gb/s SSD SAS drives in RAID 0. Kernel time source > > is tsc. > > > >HEAD track_scans Penalty (%) > > box1 19582.4973519341.8881 -1.22869541 > > box2 19936.5551319928.07479-0.04253664659 > > box3 19631.7889518649.64379-5.002830696 > > box4 19810.8676719420.67192-1.969604525 > > Average 19740.4272819335.06965-2.05343896 > > Based on the size of those numbers this was a r/w pgbench. If it has this > noticable an impact for r/w, with a pretty low number of scans/sec, how's the > overhead for r/o (which can have 2 orders of magnitude more scans/sec)? It > must be quite bad. > > I don't think we should accept this feature with this overhead - but I also > think we can do better, by accepting a bit less accuracy. For this to be > useful we don't need a perfectly accurate timestamp. The statement start time > is probably not accurate enough, but we could just have bgwriter or such > update one in shared memory every time we wake up? Or perhaps we could go to > an even lower granularity, by putting in the current LSN or such? I don't think that LSN is precise enough. For example, if you're in a (mostly) read-only system, the system may go long times without any meaningful records being written. As for having a lower granularity and preventing the one-syscall-per-Relation issue, can't we reuse the query_start or state_change timestamps that appear in pg_stat_activity (potentially updated immediately before this stat flush), or some other per-backend timestamp that is already maintained and considered accurate enough for this use? Regardless, with this patch as it is we get a new timestamp for each relation processed, which I think is a waste of time (heh) even in VDSO-enabled systems. Apart from the above, I don't have any other meaningful opinion on this patch - it might be a good addition, but I don't consume stats often enough to make a good cost / benefit comparison. Kind regards, Matthias van de Meent
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 07:52:49PM +0200, Matthias van de Meent wrote: > As for having a lower granularity and preventing the > one-syscall-per-Relation issue, can't we reuse the query_start or > state_change timestamps that appear in pg_stat_activity (potentially Yeah, query start should be fine, but not transaction start time. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: [RFC] building postgres with meson - v12
Hi, On 2022-08-31 10:28:05 +0200, Peter Eisentraut wrote: > I found that the perl test modules are not installed. See attached patch to > correct this. > > To the patches: > > 4e15ee0e24 Don't hardcode tmp_check/ as test directory for tap tests > 1a3169bc3f Split TESTDIR into TESTLOGDIR and TESTDATADIR > > It's a bit weird that the first patch changes the meaning of TESTDIR > and the second patch removes it. Maybe these patches should be > squashed together? Hm, to me they seem topically separate enough, but I don't have a strong opinion on it. > 581721fa99 meson: prereq: Add src/tools/gen_export.pl > > Still wondering about the whitespace changes I reported recently, but > that can also be fine-tuned later. I'll look into it in a bit. > a1fb97a81b meson: Add meson based buildsystem > > I'm not a fan of all this business to protect the two build systems > from each other. I don't like the build process touching a file under > version control every time. How necessary is this? What happens > otherwise? I added it after just about everyone trying meson hit problems due to conflicts between (past) in-tree configure builds and meson, due to files left in tree (picking up the wrong .h files, cannot entirely be fixed with -I arguments, due to the "" includes). By adding the relevant check to the meson configure phase, and by triggering meson re-configure whenever an in-tree configure build is done, these issues can be detected. It'd of course be nicer to avoid the potential for such conflicts, but that appears to be a huge chunk of work, see the bison/flex subthread. So I don't really see an alternative. > conversion_helpers.txt: should probably be removed now. Done. > doc/src/sgml/resolv.xsl: I don't understand what this is doing. Maybe > at least add a comment in the file. It's only used for building epubs. Perhaps I should extract that into a separate patch as well? The relevant section is: > # > # epub > # > > # This was previously implemented using dbtoepub - but that doesn't seem to > # support running in build != source directory (i.e. VPATH builds already > # weren't supported). > if pandoc.found() and xsltproc.found() > # XXX: Wasn't able to make pandoc successfully resolve entities > # XXX: Perhaps we should just make all targets use this, to avoid repeatedly > # building whole thing? It's comparatively fast though. > postgres_full_xml = custom_target('postgres-full.xml', > input: ['resolv.xsl', 'postgres.sgml'], > output: ['postgres-full.xml'], > depends: doc_generated + [postgres_sgml_valid], > command: [xsltproc, '--path', '@OUTDIR@/', xsltproc_flags, > '-o', '@OUTPUT@', '@INPUT@'], > build_by_default: false, > ) A noted, I couldn't make pandoc resolve our entities, so I used resolv.xsl them, before calling pandoc. I'll rename it to resolve-entities.xsl and add a comment. > src/common/unicode/meson.build: The comment at the top of the file > should be moved next to the files it is describing (similar to how it > is in the makefile). Done. > I don't see CLDR_VERSION set anywhere. Is that part implemented? No, I didn't implement the generation parts of contrib/unaccent. I started tackling the src/common/unicode bits after John Naylor asked whether that could be done, but considered that good enough... > src/port/win32ver.rc.in: This is redundant with src/port/win32ver.rc. > (Note that the latter is also used as an input file for text > substitution. So having another file named *.in next to it would be > super confusing.) Yea, this stuff isn't great. 
I think the better solution, both for meson and for configure, would be to move to do all the substitution to the C preprocessor. > src/tools/find_meson: Could use a brief comment what it does. Added. > src/tools/pgflex: Could use a not-brief comment about what it does, > why it's needed. Also a comment where it's used. Also run this > through pycodestyle. Working on that. > cd193eb3e8 meson: ci: Build both with meson and as before > > I haven't reviewed this one in detail. Maybe add a summary in the > commit message, like these are the new jobs, these are the changes to > existing jobs. It looks like there is more in there than just adding > a few meson jobs. I don't think we want to commit this as-is. It contains CI for a lot of platforms - that's very useful for working on meson, but too much for in-tree. I guess I'll split it into two, one patch for converting a reasonable subset of the current CI tasks to meson and another to add (back) the current array of tested platforms. > If the above are addressed, I think this will be just about at the > point where the above patches can be committed. Woo! > Everything past these patches I'm mentally postponing right now. Makes sense. Greetings, Andres Freund
Re: SQL/JSON features for v15
On 2022-08-31 We 12:48, Jonathan S. Katz wrote: > > > With RMT hat on -- Andrew can you please revert the patchset? :-( Yes, I'll do it, starting with the v15 branch. Might take a day or so. cheers (kinda) andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 12:26:29PM -0400, Robert Haas wrote: > someone else to make the decision. But that's not how it works. The > RMT concept was invented precisely to solve problems like this one, > where the patch authors don't really want to revert it but other > people think it's pretty busted. If such problems were best addressed > by waiting for a long time to see whether anything changes, we > wouldn't need an RMT. That's exactly how we used to handle these kinds > of problems, and it sucked. I saw the RMT/Jonathan stated August 28 as the cut-off date for a decision, which was later changed to September 1: > The RMT is still inclined to revert, but will give folks until Sep 1 0:00 > AoE[1] to reach consensus on if SQL/JSON can be included in v15. This matches > up to Andrew's availability timeline for a revert, and gives enough time to > get through the buildfarm prior to the Beta 4 release[2]. I guess you are saying that setting a cut-off was a bad idea, or that the cut-off was too close to the final release date. For me, I think there were three questions: 1. Were subtransactions acceptable, consensus no 2. Could trapping errors work for PG 15, consensus no 3. Could the feature be trimmed back for PG 15 to avoid these, consensus ? I don't think our community works well when there are three issues in play at once. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: SQL/JSON features for v15
On Wed, Aug 31, 2022 at 12:04:44PM -0400, Tom Lane wrote: > Andres Freund writes: > > From my POV the only real discussion is whether we'd want to revert this in > > 15 > > and HEAD or just 15. There's imo a decent point to be made to just revert in > > 15 and aggressively press forward with the changes posted in this thread. > > I'm not for that. Code that we don't think is ready to ship > has no business being in the common tree, nor does it make > review any easier to be looking at one bulky set of > already-committed patches and another bulky set of deltas. Agreed on removing from PG 15 and master --- it would be confusing to have lots of incomplete code in master that is not in PG 15. > I'm okay with making an exception for the include/nodes/ and > backend/nodes/ files in HEAD, since the recent changes in that > area mean it'd be a lot of error-prone work to produce a reverting > patch there. We can leave those in as dead code temporarily, I think. I don't have an opinion on this. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: effective_multixact_freeze_max_age issue
On Tue, Aug 30, 2022 at 8:56 PM Nathan Bossart wrote: > LGTM Pushed, thanks. -- Peter Geoghegan
Re: SQL/JSON features for v15
Bruce Momjian writes: > I guess you are saying that setting a cut-off was a bad idea, or that > the cut-off was too close to the final release date. For me, I think > there were three questions: > 1. Were subtransactions acceptable, consensus no > 2. Could trapping errors work for PG 15, consensus no > 3. Could the feature be trimmed back for PG 15 to avoid these, consensus ? We could probably have accomplished #3 if there was more time, but we're out of time. (I'm not entirely convinced that spending effort towards #3 was productive anyway, given that we're now thinking about a much differently-scoped patch with API changes.) > I don't think our community works well when there are three issues in > play at once. To the extent that there was a management failure here, it was that we didn't press for a resolution sooner. Given the scale of the concerns raised in June, I kind of agree with Andres' opinion that fixing them post-freeze was doomed to failure. It was definitely doomed once we reached August with no real work done towards it. regards, tom lane
Re: Tracking last scan time
Hi, On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote: > As for having a lower granularity and preventing the > one-syscall-per-Relation issue, can't we reuse the query_start or > state_change timestamps that appear in pg_stat_activity (potentially > updated immediately before this stat flush), or some other per-backend > timestamp that is already maintained and considered accurate enough > for this use? The problem is that it won't change at all for a query that runs for a week - and we'll report the timestamp from a week ago when it finally ends. But given this is done when stats are flushed, which only happens after the transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if we got to flushing the transaction stats we'll already have computed that. > tabentry->numscans += lstats->t_counts.t_numscans; > + if (pgstat_track_scans && lstats->t_counts.t_numscans) > + tabentry->lastscan = GetCurrentTimestamp(); Besides replacing GetCurrentTimestamp() with GetCurrentTransactionStopTimestamp(), this should then also check if tabentry->lastscan is already newer than the new timestamp. Greetings, Andres Freund
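For illustration, a minimal sketch of the flush-time update being discussed, using the field names from the quoted diff; this is not taken from a posted patch, and the surrounding relation-stats flush code is assumed:

    /*
     * Hedged sketch only: reuse the transaction stop timestamp (already
     * computed by the time table stats are flushed) instead of calling
     * GetCurrentTimestamp(), and never move lastscan backwards.
     */
    if (pgstat_track_scans && lstats->t_counts.t_numscans)
    {
        TimestampTz ts = GetCurrentTransactionStopTimestamp();

        if (tabentry->lastscan < ts)
            tabentry->lastscan = ts;
    }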
Re: [PATCH] Query Jumbling for CALL and SET utility statements
Hi, On 2022-08-31 11:00:05 -0700, Jeremy Schneider wrote: > On 8/31/22 9:08 AM, Andres Freund wrote: > > > > I suspect we should carve out things like CALL, PREPARE, EXECUTE from > > track_utility - it's more or less an architectural accident that they're > > utility statements. It's a bit less clear that SET should be dealt with > > that > > way. > > Regarding SET, the compelling use case was around "application_name" > whose purpose is to provide a label in pg_stat_activity and on log > lines, which can be used to improve observability and connect queries to > their source in application code. I wasn't saying that SET shouldn't be jumbled, just that it seems more reasonable to track it only when track_utility is enabled, rather than doing so even when that's disabled. Which I do think makes sense for executing a prepared statement and calling a procedure, since they're really only utility statements by accident. > Personally, at this point, I think pg_stat_statements is critical > infrastructure for anyone running PostgreSQL at scale. The information > it provides is indispensable. I don't think it's really defensible to > tell people that if they want to scale, then they need to fly blind on > any utility statements. I wasn't suggesting doing so at all. Greetings, Andres Freund
Re: SQL/JSON features for v15
On 2022-08-31 We 14:45, Tom Lane wrote: > To the extent that there was a management failure here, it was that > we didn't press for a resolution sooner. Given the scale of the > concerns raised in June, I kind of agree with Andres' opinion that > fixing them post-freeze was doomed to failure. It was definitely > doomed once we reached August with no real work done towards it. I'm not going to comment publicly in general about this, you might imagine what my reaction is. The decision is the RMT's to make and I have no quarrel with that. But I do want it understood that there was work being done right from the time in June when Andres' complaints were published. These were difficult issues, and we didn't let the grass grow looking for a fix. I concede that might not have been visible until later. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi, On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: > @@ -50,7 +51,7 @@ copydir(char *fromdir, char *todir, bool recurse) > > while ((xlde = ReadDir(xldir, fromdir)) != NULL) > { > - struct stat fst; > + PGFileType xlde_type; > > /* If we got a cancel signal during the copy of the directory, > quit */ > CHECK_FOR_INTERRUPTS(); > @@ -62,18 +63,15 @@ copydir(char *fromdir, char *todir, bool recurse) > snprintf(fromfile, sizeof(fromfile), "%s/%s", fromdir, > xlde->d_name); > snprintf(tofile, sizeof(tofile), "%s/%s", todir, xlde->d_name); > > - if (lstat(fromfile, &fst) < 0) > - ereport(ERROR, > - (errcode_for_file_access(), > - errmsg("could not stat file \"%s\": > %m", fromfile))); > + xlde_type = get_dirent_type(fromfile, xlde, false, ERROR); > > - if (S_ISDIR(fst.st_mode)) > + if (xlde_type == PGFILETYPE_DIR) > { > /* recurse to handle subdirectories */ > if (recurse) > copydir(fromfile, tofile, true); > } > - else if (S_ISREG(fst.st_mode)) > + else if (xlde_type == PGFILETYPE_REG) > copy_file(fromfile, tofile); > } > FreeDir(xldir); It continues to make no sense to me to add behaviour changes around error-handling as part of a conversion to get_dirent_type(). I don't at all understand why e.g. the above change to make copydir() silently skip over files it can't stat is ok? Greetings, Andres Freund
Re: [PATCH] Add sortsupport for range types and btree_gist
Hi! Sorry for the long delay. This fixes the crashes, now all tests run fine w/ and w/o debug configuration. That shouldn't even have slipped through as such. Notable changes from v1: - gbt_enum_sortsupport() now passes on fcinfo->flinfo enum_cmp_internal() needs a place to cache the typcache entry. - inet sortsupport now uses network_cmp() directly Thanks, Christoph HeissFrom 667779bc0761c1356141722181c5a54ac46d96b9 Mon Sep 17 00:00:00 2001 From: Christoph Heiss Date: Wed, 31 Aug 2022 19:20:43 +0200 Subject: [PATCH v2 1/1] Add sortsupport for range types and btree_gist Incrementally building up a GiST index can result in a sub-optimal index structure for range types. By sorting the data before inserting it into the index will result in a much better index. This can provide sizeable improvements in query execution time, I/O read time and shared block hits/reads. Signed-off-by: Christoph Heiss --- contrib/btree_gist/Makefile | 3 +- contrib/btree_gist/btree_bit.c | 19 contrib/btree_gist/btree_bool.c | 22 contrib/btree_gist/btree_cash.c | 22 contrib/btree_gist/btree_enum.c | 26 + contrib/btree_gist/btree_gist--1.7--1.8.sql | 110 contrib/btree_gist/btree_gist.control | 2 +- contrib/btree_gist/btree_inet.c | 19 contrib/btree_gist/btree_interval.c | 19 contrib/btree_gist/btree_macaddr8.c | 19 contrib/btree_gist/btree_time.c | 19 src/backend/utils/adt/rangetypes_gist.c | 70 + src/include/catalog/pg_amproc.dat | 3 + src/include/catalog/pg_proc.dat | 3 + 14 files changed, 354 insertions(+), 2 deletions(-) create mode 100644 contrib/btree_gist/btree_gist--1.7--1.8.sql diff --git a/contrib/btree_gist/Makefile b/contrib/btree_gist/Makefile index 48997c75f6..4096de73ea 100644 --- a/contrib/btree_gist/Makefile +++ b/contrib/btree_gist/Makefile @@ -33,7 +33,8 @@ EXTENSION = btree_gist DATA = btree_gist--1.0--1.1.sql \ btree_gist--1.1--1.2.sql btree_gist--1.2.sql btree_gist--1.2--1.3.sql \ btree_gist--1.3--1.4.sql btree_gist--1.4--1.5.sql \ - btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql + btree_gist--1.5--1.6.sql btree_gist--1.6--1.7.sql \ + btree_gist--1.7--1.8.sql PGFILEDESC = "btree_gist - B-tree equivalent GiST operator classes" REGRESS = init int2 int4 int8 float4 float8 cash oid timestamp timestamptz \ diff --git a/contrib/btree_gist/btree_bit.c b/contrib/btree_gist/btree_bit.c index 5b246bcde4..bf32bfd628 100644 --- a/contrib/btree_gist/btree_bit.c +++ b/contrib/btree_gist/btree_bit.c @@ -7,6 +7,7 @@ #include "btree_utils_var.h" #include "utils/builtins.h" #include "utils/bytea.h" +#include "utils/sortsupport.h" #include "utils/varbit.h" @@ -19,10 +20,17 @@ PG_FUNCTION_INFO_V1(gbt_bit_picksplit); PG_FUNCTION_INFO_V1(gbt_bit_consistent); PG_FUNCTION_INFO_V1(gbt_bit_penalty); PG_FUNCTION_INFO_V1(gbt_bit_same); +PG_FUNCTION_INFO_V1(gbt_bit_sortsupport); /* define for comparison */ +static int +bit_fast_cmp(Datum x, Datum y, SortSupport ssup) +{ + return DatumGetInt32(DirectFunctionCall2(byteacmp, x, y)); +} + static bool gbt_bitgt(const void *a, const void *b, Oid collation, FmgrInfo *flinfo) { @@ -208,3 +216,14 @@ gbt_bit_penalty(PG_FUNCTION_ARGS) PG_RETURN_POINTER(gbt_var_penalty(result, o, n, PG_GET_COLLATION(), &tinfo, fcinfo->flinfo)); } + +Datum +gbt_bit_sortsupport(PG_FUNCTION_ARGS) +{ + SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); + + ssup->comparator = bit_fast_cmp; + ssup->ssup_extra = NULL; + + PG_RETURN_VOID(); +} diff --git a/contrib/btree_gist/btree_bool.c b/contrib/btree_gist/btree_bool.c index 8b2af129b5..3a9f230e68 100644 --- a/contrib/btree_gist/btree_bool.c +++ 
b/contrib/btree_gist/btree_bool.c @@ -6,6 +6,7 @@ #include "btree_gist.h" #include "btree_utils_num.h" #include "common/int.h" +#include "utils/sortsupport.h" typedef struct boolkey { @@ -23,6 +24,16 @@ PG_FUNCTION_INFO_V1(gbt_bool_picksplit); PG_FUNCTION_INFO_V1(gbt_bool_consistent); PG_FUNCTION_INFO_V1(gbt_bool_penalty); PG_FUNCTION_INFO_V1(gbt_bool_same); +PG_FUNCTION_INFO_V1(gbt_bool_sortsupport); + +static int +bool_fast_cmp(Datum x, Datum y, SortSupport ssup) +{ + bool arg1 = DatumGetBool(x); + bool arg2 = DatumGetBool(y); + + return arg1 - arg2; +} static bool gbt_boolgt(const void *a, const void *b, FmgrInfo *flinfo) @@ -167,3 +178,14 @@ gbt_bool_same(PG_FUNCTION_ARGS) *result = gbt_num_same((void *) b1, (void *) b2, &tinfo, fcinfo->flinfo); PG_RETURN_POINTER(result); } + +Datum +gbt_bool_sortsupport(PG_FUNCTION_ARGS) +{ + SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0); + + ssup->comparator = bool_fast_cmp; + ssup->ssup_extra = NULL; + + PG_RETURN_VOID(); +} diff --git a/contrib/btree_gist/btree_cash.c b/contrib/btree_gist/btree_cash.c index 6ac97e2b12..389b725130 1006
Re: Tracking last scan time
On Wed, Aug 31, 2022 at 11:56:29AM -0700, Andres Freund wrote: > Hi, > > On 2022-08-31 19:52:49 +0200, Matthias van de Meent wrote: > > As for having a lower granularity and preventing the > > one-syscall-per-Relation issue, can't we reuse the query_start or > > state_change timestamps that appear in pg_stat_activity (potentially > > updated immediately before this stat flush), or some other per-backend > > timestamp that is already maintained and considered accurate enough > > for this use? > > The problem is that it won't change at all for a query that runs for a week - > and we'll report the timestamp from a week ago when it finally ends. > > But given this is done when stats are flushed, which only happens after the > transaction ended, we can just use GetCurrentTransactionStopTimestamp() - if > we got to flushing the transaction stats we'll already have computed that. Oh, good point --- it is safer to show a more recent time than a too-old time. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote: > On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: >> -if (lstat(fromfile, &fst) < 0) >> -ereport(ERROR, >> -(errcode_for_file_access(), >> - errmsg("could not stat file \"%s\": >> %m", fromfile))); >> +xlde_type = get_dirent_type(fromfile, xlde, false, ERROR); >> >> -if (S_ISDIR(fst.st_mode)) >> +if (xlde_type == PGFILETYPE_DIR) >> { >> /* recurse to handle subdirectories */ >> if (recurse) >> copydir(fromfile, tofile, true); >> } >> -else if (S_ISREG(fst.st_mode)) >> +else if (xlde_type == PGFILETYPE_REG) >> copy_file(fromfile, tofile); >> } >> FreeDir(xldir); > > It continues to make no sense to me to add behaviour changes around > error-handling as part of a conversion to get_dirent_type(). I don't at all > understand why e.g. the above change to make copydir() silently skip over > files it can't stat is ok? In this example, the call to get_dirent_type() should ERROR if the call to lstat() fails (the "elevel" argument is set to ERROR). -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: pg_stat_have_stats() returns true for dropped indexes (or for index creation transaction rolled back)
Hi, On 2022-08-25 11:44:34 +0200, Drouvot, Bertrand wrote: > For REPLSLOT, I agree that we can add one test: I added it in > contrib/test_decoding/sql/stats.sql. It relies on pg_stat_have_stats() (as > relying on pg_stat_replication_slots and/or pg_stat_get_replication_slot() > would not help that much for this test given that the slot has been removed > from ReplicationSlotCtl) As Horiguchi-san noted, we can't rely on specific indexes being used. I feel ok with the current coverage in that area, but if we *really* feel we need to test it, we'd need to count the number of indexes with slots before dropping the slot and after the drop. > +-- pg_stat_have_stats returns true for committed index creation Maybe another test for an uncommitted index creation would be good too? Could you try running this test with debug_discard_caches = 1 - it's pretty easy to write tests in this area that aren't reliable timing-wise. > +CREATE table stats_test_tab1 as select generate_series(1,10) a; > +CREATE index stats_test_idx1 on stats_test_tab1(a); > +SELECT oid AS dboid from pg_database where datname = current_database() \gset Since you introduced this, maybe convert the other instance of this query at the end of the file as well? Greetings, Andres Freund
Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi, On 2022-08-31 12:24:28 -0700, Nathan Bossart wrote: > On Wed, Aug 31, 2022 at 12:14:33PM -0700, Andres Freund wrote: > > On 2022-08-27 14:06:32 +0530, Bharath Rupireddy wrote: > > It continues to make no sense to me to add behaviour changes around > > error-handling as part of a conversion to get_dirent_type(). I don't at all > > understand why e.g. the above change to make copydir() silently skip over > > files it can't stat is ok? > > In this example, the call to get_dirent_type() should ERROR if the call to > lstat() fails (the "elevel" argument is set to ERROR). Oh, oops. Skimmed code too quickly... Greetings, Andres Freund
Re: First draft of the PG 15 release notes
On Wed, Aug 31, 2022 at 11:38:33AM +0200, Peter Eisentraut wrote: > On 30.08.22 22:42, Bruce Momjian wrote: > > Good question --- the full text is: > > > > > > > > Record and check the collation of each > linkend="sql-createdatabase">database (Peter Eisentraut) > > > > > > > > This is designed to detect collation > > mismatches to avoid data corruption. Function > > pg_database_collation_actual_version() > > reports the underlying operating system collation version, and > > ALTER DATABASE ... REFRESH sets the database > > to match the operating system collation version. DETAILS? > > > > > > > > I just can't figure out what the user needs to understand about this, > > and I understand very little of it. > > We already had this feature for (schema-level) collations, now we have it on > the level of the database collation. Okay, I figured out the interplay between OS collation version support, collation libraries, and collation levels. Here is an updated patch for the release notes. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson diff --git a/doc/src/sgml/release-15.sgml b/doc/src/sgml/release-15.sgml index e3ab13e53c..dca4550c03 100644 --- a/doc/src/sgml/release-15.sgml +++ b/doc/src/sgml/release-15.sgml @@ -578,17 +578,17 @@ Author: Peter Eisentraut - Record and check the collation of each database (Peter Eisentraut) - This is designed to detect collation + This is designed to detect collation version mismatches to avoid data corruption. Function pg_database_collation_actual_version() reports the underlying operating system collation version, and ALTER DATABASE ... REFRESH sets the database - to match the operating system collation version. DETAILS? + to match the operating system collation version. @@ -605,9 +605,11 @@ Author: Peter Eisentraut - Previously, ICU collations could only be - specified in CREATE - COLLATION and used with the + Previously, only libc-based + collations could be set at the cluster and database levels. + ICU collations were previously limited + to CREATE + COLLATION and referenced by the COLLATE clause.
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Reid Thompson writes: > This patch ensures get_database_list() switches back to the memory > context in use upon entry rather than returning with TopMemoryContext > as the context. > This will address memory allocations in autovacuum.c being associated > with TopMemoryContext when they should not be. > autovacuum.c do_start_worker() with current context > 'Autovacuum start worker (tmp)' invokes get_database_list(). Upon > return, the current context has been changed to TopMemoryContext by > AtCommit_Memory() as part of an internal transaction. Further down > in the do_start_worker(), pgstat_fetch_stat_dbentry() is invoked. > Previously this didn't pose a issue, however recent changes altered > how pgstat_fetch_stat_dbentry() is implemented. The new > implementation has a branch utilizing palloc. The patch ensures these > allocations are associated with the 'Autovacuum start worker (tmp)' > context rather than the TopMemoryContext. Prior to the change, > leaving an idle laptop PG instance running just shy of 3 days saw the > autovacuum launcher process grow to 42MB with most of that growth in > TopMemoryContext due to the palloc allocations issued with autovacuum > worker startup. Yeah, I can reproduce noticeable growth within a couple minutes after setting autovacuum_naptime to 1s, and I confirm that the launcher's space consumption stays static after applying this. Even if there's only a visible leak in v15/HEAD, I'm inclined to back-patch this all the way, because it's certainly buggy on its own terms. It's just the merest accident that neither caller is leaking other stuff into TopMemoryContext, because they both think they are using a short-lived context. Thanks for the report! regards, tom lane
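For illustration, a minimal sketch of the fix described in the report, assuming the usual transaction pattern inside get_database_list(); the body that scans pg_database is elided, and this is a sketch rather than the committed change:

    List *
    get_database_list(void)
    {
        List           *dblist = NIL;
        MemoryContext   resultcxt = CurrentMemoryContext;   /* caller's context */

        StartTransactionCommand();
        /* ... scan pg_database, switching to resultcxt while appending entries ... */
        CommitTransactionCommand();

        /*
         * AtCommit_Memory() has left us in TopMemoryContext; switch back so
         * that later allocations in the caller land in its short-lived
         * context rather than accumulating in TopMemoryContext.
         */
        MemoryContextSwitchTo(resultcxt);

        return dblist;
    }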
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
Hi, On 2022-08-31 16:05:03 -0400, Tom Lane wrote: > Even if there's only a visible leak in v15/HEAD, I'm inclined > to back-patch this all the way, because it's certainly buggy > on its own terms. It's just the merest accident that neither > caller is leaking other stuff into TopMemoryContext, because > they both think they are using a short-lived context. +1 > Thanks for the report! +1 Greetings, Andres Freund
Re: SQL/JSON features for v15
On 8/31/22 3:08 PM, Andrew Dunstan wrote: On 2022-08-31 We 14:45, Tom Lane wrote: To the extent that there was a management failure here, it was that we didn't press for a resolution sooner. Given the scale of the concerns raised in June, I kind of agree with Andres' opinion that fixing them post-freeze was doomed to failure. It was definitely doomed once we reached August with no real work done towards it. I'm not going to comment publicly in general about this, you might imagine what my reaction is. The decision is the RMT's to make and I have no quarrel with that. But I do want it understood that there was work being done right from the time in June when Andres' complaints were published. These were difficult issues, and we didn't let the grass grow looking for a fix. I concede that might not have been visible until later. June was a bit of a rough month too -- we had the issues that spawned the out-of-cycle release at the top of the month, which started almost right after Beta 1, and then we moved almost immediately into Beta 2 after 14.4. I know that consumed a lot of my cycles. At that point in the v15 release process I was primarily focused on monitoring open items, so I missed the June comments. Jonathan OpenPGP_signature Description: OpenPGP digital signature
Re: SQL/JSON features for v15
On 31.08.2022 20:14, Tom Lane wrote: Robert Haas writes: On Wed, Aug 31, 2022 at 1:06 PM Tom Lane wrote: The currently proposed patchset hacks up a relatively small number of core datatypes to be able to do that. But it's just a hack and there's no prospect of extension types being able to join in the fun. I think where we need to start, for v16, is making an API design that will let any datatype have this functionality. (I don't say that we'd convert every datatype to do so right away; in the long run we should, but I'm content to start with just the same core types touched here.) Beside the JSON stuff, there is another even more pressing application for such behavior, namely the often-requested COPY functionality to be able to shunt bad data off somewhere without losing the entire transfer. In the COPY case I think we'd want to be able to capture the error message that would have been issued, which means the current patches are not at all appropriate as a basis for that API design: they're just returning a bool without any details. I would be in favor of making more of an effort than just a few token data types. The initial patch could just touch a few, but once the infrastructure is in place we should really make a sweep through the tree and tidy up. Sure, but my point is that we can do that in a time-extended fashion rather than having a flag day where everything must be updated. The initial patch just needs to update a few types as proof of concept. And here is a quick POC patch with an example for COPY and float4: =# CREATE TABLE test (i int, f float4); CREATE TABLE =# COPY test (f) FROM stdin WITH (null_on_error (f)); 1 err 2 \. COPY 3 =# SELECT f FROM test; f --- 1 2 (3 rows) =# COPY test (i) FROM stdin WITH (null_on_error (i)); ERROR: input function for datatype "integer" does not support error handling PG_RETURN_ERROR() is a reincarnation of ereport_safe() macro for returning ErrorData, which was present in older versions (~v18) of SQL/JSON patches. Later it was replaced with `bool *have_error` and less magical `if (have_error) ... else ereport(...)`. Obviously, this needs a separate thread. -- Nikita Glukhov Postgres Professional:http://www.postgrespro.com The Russian Postgres Company From 537e65e1ebcccdf4a760d3aa743c88611e7c Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Wed, 31 Aug 2022 22:24:33 +0300 Subject: [PATCH 1/5] Introduce PG_RETURN_ERROR and FuncCallError node --- src/include/nodes/execnodes.h | 7 +++ src/include/utils/elog.h | 25 + 2 files changed, 32 insertions(+) diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h index 01b1727fc09..b401d354656 100644 --- a/src/include/nodes/execnodes.h +++ b/src/include/nodes/execnodes.h @@ -2729,4 +2729,11 @@ typedef struct LimitState TupleTableSlot *last_slot; /* slot for evaluation of ties */ } LimitState; +typedef struct FuncCallError +{ + NodeTag type; + ErrorData *error; /* place where function should put + * error data instead of throwing it */ +} FuncCallError; + #endif /* EXECNODES_H */ diff --git a/src/include/utils/elog.h b/src/include/utils/elog.h index 56398176901..5fd3deed61f 100644 --- a/src/include/utils/elog.h +++ b/src/include/utils/elog.h @@ -159,6 +159,31 @@ #define ereport(elevel, ...) \ ereport_domain(elevel, TEXTDOMAIN, __VA_ARGS__) +/* + * PG_RETURN_ERROR() -- special macro for copying error info into the + * FuncCallError node, if it was as FunctionCallInfo.context, + * instead of throwing it with ordinary ereport(). 
+ * + * This is intended for handling of errors of categories like + * ERRCODE_DATA_EXCEPTION without PG_TRY/PG_CATCH and substransactions, + * but not for errors like ERRCODE_OUT_OF_MEMORY. + */ +#define PG_RETURN_ERROR(...) \ + do { \ + if (fcinfo->context && IsA(fcinfo->context, FuncCallError)) { \ + pg_prevent_errno_in_scope(); \ + \ + errstart(ERROR, TEXTDOMAIN); \ + (__VA_ARGS__); \ + castNode(FuncCallError, fcinfo->context)->error = CopyErrorData(); \ + FlushErrorState(); \ + \ + PG_RETURN_NULL(); \ + } else { \ + ereport(ERROR, (__VA_ARGS__)); \ + } \ + } while (0) + #define TEXTDOMAIN NULL extern bool message_level_is_interesting(int elevel); -- 2.25.1 From 7831b2c571784a614dd4ee4b48730187b0c0a8d1 Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Wed, 31 Aug 2022 23:02:06 +0300 Subject: [PATCH 2/5] Introduce pg_proc.proissafe and func_is_safe() --- src/backend/utils/cache/lsyscache.c | 19 +++ src/include/catalog/pg_proc.h | 3 +++ src/include/utils/lsyscache.h | 1 + 3 files changed, 23 insertions(+) diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c index a16a63f4957..ff3005077b2 100644 --- a/src/backend/utils/cache/lsyscache.c +++ b/src/backend/utils/cache/lsyscache.c @@ -1830,6 +1830,25 @@ get_func_leakproof(Oid funcid) return result; } +/* + * func_is_safe +
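For illustration, a hedged sketch of how a datatype input function might use the proposed PG_RETURN_ERROR() macro; the function name, parsing details, and error text here are hypothetical and not taken from the attached patches:

    Datum
    float4in_safe(PG_FUNCTION_ARGS)
    {
        char   *num = PG_GETARG_CSTRING(0);
        char   *endptr;
        float   val;

        errno = 0;
        val = strtof(num, &endptr);
        if (errno != 0 || endptr == num || *endptr != '\0')
            /* hands back ErrorData via fcinfo->context if the caller set up a
             * FuncCallError node, otherwise throws with ereport(ERROR, ...) */
            PG_RETURN_ERROR(errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                            errmsg("invalid input syntax for type real: \"%s\"", num));

        PG_RETURN_FLOAT4(val);
    }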
Re: Reducing the chunk header sizes on all memory context types
On 8/31/22 00:40, David Rowley wrote: > On Wed, 31 Aug 2022 at 02:17, Tom Lane wrote: >> >> I wrote: >>> So maybe we should revisit the question. It'd be worth collecting some >>> stats about how much extra space would be needed if we force there >>> to be room for a sentinel. >> >> Actually, after ingesting more caffeine, the problem with this for aset.c >> is that the only way to add space for a sentinel that didn't fit already >> is to double the space allocation. That's a little daunting, especially >> remembering how many places deliberately allocate power-of-2-sized >> arrays. > > I decided to try and quantify that by logging the size, MAXALIGN(size) > and the power of 2 size during AllocSetAlloc and GenerationAlloc. I > made the pow2_size 0 in GenerationAlloc and in AlloocSetAlloc when > size > allocChunkLimit. > > After running make installcheck, grabbing the records out the log and > loading them into Postgres, I see that if we did double the pow2_size > when there's no space for the sentinel byte then we'd go from > allocating a total of 10.2GB all the way to 16.4GB (!) of > non-dedicated block aset.c allocations. > > select > round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size, > round(sum(case when maxalign_size=pow2_size then pow2_size*2 else > pow2_size end)::numeric/1024/1024/1024,3) as method1, > round(sum(case when maxalign_size=pow2_size then pow2_size+8 else > pow2_size end)::numeric/1024/1024/1024,3) as method2 > from memstats > where pow2_size > 0; > pow2_size | method1 | method2 > ---+-+- > 10.194 | 16.382 | 10.463 > > if we did just add on an extra 8 bytes (or or MAXALIGN(size+1) at > least), then that would take the size up to 10.5GB. > I've been experimenting with this a bit too, and my results are similar, but not exactly the same. I've logged all Alloc/Realloc calls for the two memory contexts, and when I aggregated the results I get this: f| size | pow2(size) | pow2(size+1) -+--++-- AllocSetAlloc |23528 | 28778 |31504 AllocSetRelloc | 761 |824 | 1421 GenerationAlloc | 68 | 90 | 102 So the raw size (what we asked for) is ~23.5GB, but in practice we allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra 1B we end up allocating 31.5GB. That doesn't seem like a huge increase, and it's far from the +60% you got. I wonder where does the difference come - I did make installcheck too, so how come you get 10/16GB, and I get 28/31GB? My patch is attached, maybe I did something silly. I also did a quick hack to see if always having the sentinel detects any pre-existing issues, but that didn't happen. I guess valgrind would find those, but not sure? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Companydiff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index b6eeb8abab..e6d86a54d7 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -674,6 +674,15 @@ AllocSetDelete(MemoryContext context) free(set); } +static Size pow2_size(Size size) +{ + Size s = 1; + while (s < size) + s *= 2; + + return s; +} + /* * AllocSetAlloc * Returns pointer to allocated memory of given size or NULL if @@ -699,13 +708,16 @@ AllocSetAlloc(MemoryContext context, Size size) AssertArg(AllocSetIsValid(set)); + fprintf(stderr, "AllocSetAlloc %ld %ld %ld\n", size, pow2_size(size), pow2_size(size+1)); + fflush(stderr); + /* * If requested size exceeds maximum for chunks, allocate an entire block * for this request. 
*/ - if (size > set->allocChunkLimit) + if (size + 1 > set->allocChunkLimit) { - chunk_size = MAXALIGN(size); + chunk_size = MAXALIGN(size + 1); blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; block = (AllocBlock) malloc(blksize); if (block == NULL) @@ -725,7 +737,16 @@ AllocSetAlloc(MemoryContext context, Size size) chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ if (size < chunk_size) + { set_sentinel(MemoryChunkGetPointer(chunk), size); + fprintf(stderr, "sentinel added\n"); + fflush(stderr); + } + else + { + fprintf(stderr, "A sentinel not added\n"); + fflush(stderr); + } #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -767,7 +788,7 @@ AllocSetAlloc(MemoryContext context, Size size) * If one is found, remove it from the free list, make it again a member * of the alloc set and return its data address. */ - fidx = AllocSetFreeIndex(size); + fidx = AllocSetFreeIndex(size+1); chunk = set->freelist[fidx]; if (chunk != NULL) { @@ -784,7 +805,16 @@ AllocSetAlloc(MemoryContext context, Size size) chunk->requested_size = size; /* set mark to catch clobber of "unused" space */ if (size < GetChunkSizeFromFreeListIdx(fidx)) + { set_senti
Re: Trivial doc patch
On Fri, Aug 19, 2022 at 10:42:45AM -0400, Bruce Momjian wrote: > > Given that 'optional, optional' has no independent meaning from > > 'optional'; it requires one to scan the entire set looking for > > the non-optional embedded in the option. So no gain. > > I originally had the same reaction Michael Paquier did, that having one > big optional block is nice, but seeing that 'CREATE DATABASE name WITH' > actually works, I can see the point in having our syntax be accurate, > and removing the outer optional brackets now does seem like an > improvement to me. Backpatched to PG 10. Thanks. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 08:53, Tomas Vondra wrote: > So the raw size (what we asked for) is ~23.5GB, but in practice we > allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra > 1B we end up allocating 31.5GB. That doesn't seem like a huge increase, > and it's far from the +60% you got. > > I wonder where does the difference come - I did make installcheck too, > so how come you get 10/16GB, and I get 28/31GB? My patch is attached, > maybe I did something silly. The reason my reported results were lower is because I ignored size > allocChunkLimit allocations. These are not raised to the next power of 2, so I didn't think they should be included. I'm not sure why you're seeing only a 3GB additional overhead. I noticed a logic error in my query where I was checking maxaligned_size=pow2_size and doubling that to give sentinel space. That really should have been "case size=pow2_size then pow2_size * 2 else pow2_size end", However, after adjusting the query, it does not seem to change the results much: postgres=# select postgres-# round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size, postgres-# round(sum(case when size=pow2_size then pow2_size*2 else pow2_size end)::numeric/1024/1024/1024,3) as method1, postgres-# round(sum(case when size=pow2_size then pow2_size+8 else pow2_size end)::numeric/1024/1024/1024,3) as method2 postgres-# from memstats postgres-# where pow2_size > 0; pow2_size | method1 | method2 ---+-+- 10.269 | 16.322 | 10.476 I've attached the crude patch I came up with for this. For some reason it was crashing on Linux, but it ran ok on Windows, so I used the results from that instead. Maybe that accounts for some differences as e.g sizeof(long) == 4 on 64-bit windows. I'd be surprised if that accounted for so many GBs though. I also forgot to add code to GenerationRealloc and AllocSetRealloc David diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index b6eeb8abab..f4977f9bcc 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -46,6 +46,7 @@ #include "postgres.h" +#include "miscadmin.h" #include "port/pg_bitutils.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -696,9 +697,22 @@ AllocSetAlloc(MemoryContext context, Size size) int fidx; Sizechunk_size; Sizeblksize; + static int rlevel = 1; AssertArg(AllocSetIsValid(set)); + if (rlevel == 1 && GetProcessingMode() == NormalProcessing) + { + rlevel++; + elog(LOG, "AllocSetAlloc,%s,%zu,%zu,%zu,%zu", +context->name, +size, +MAXALIGN(size), +size > set->allocChunkLimit ? 0 : GetChunkSizeFromFreeListIdx(AllocSetFreeIndex(size)), +set->allocChunkLimit); + rlevel--; + } + /* * If requested size exceeds maximum for chunks, allocate an entire block * for this request. 
diff --git a/src/backend/utils/mmgr/generation.c b/src/backend/utils/mmgr/generation.c index b39894ec94..9c5ff3c095 100644 --- a/src/backend/utils/mmgr/generation.c +++ b/src/backend/utils/mmgr/generation.c @@ -36,6 +36,7 @@ #include "postgres.h" #include "lib/ilist.h" +#include "miscadmin.h" #include "port/pg_bitutils.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -344,6 +345,18 @@ GenerationAlloc(MemoryContext context, Size size) MemoryChunk *chunk; Sizechunk_size = MAXALIGN(size); Sizerequired_size = chunk_size + Generation_CHUNKHDRSZ; + static int rlevel = 1; + + if (rlevel == 1 && GetProcessingMode() == NormalProcessing) + { + rlevel++; + elog(LOG, "GenerationAlloc,%s,%zu,%zu,0,%zu", +context->name, +size, +MAXALIGN(size), +set->allocChunkLimit); + rlevel--; + } /* is it an over-sized chunk? if yes, allocate special block */ if (chunk_size > set->allocChunkLimit)
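To make the power-of-two rounding effect concrete, here is a small standalone illustration; pow2_size() mirrors the helper used in the measurement patches above, and the sizes are arbitrary examples rather than benchmark data:

    #include <stdio.h>

    /* Round a request up to the next power of two, as the measurement
     * patches in this thread do. */
    static size_t
    pow2_size(size_t size)
    {
        size_t  s = 1;

        while (s < size)
            s *= 2;
        return s;
    }

    int
    main(void)
    {
        size_t  requests[] = {24, 64, 100, 1024};

        for (int i = 0; i < 4; i++)
            printf("request %zu -> pow2 %zu, with sentinel byte %zu\n",
                   requests[i], pow2_size(requests[i]),
                   pow2_size(requests[i] + 1));

        /* Requests that are already powers of two (64, 1024) double once an
         * extra sentinel byte is reserved; the others are unaffected. */
        return 0;
    }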
Re: New strategies for freezing, advancing relfrozenxid early
On Thu, Aug 25, 2022 at 2:21 PM Peter Geoghegan wrote: > Attached patch series is a completely overhauled version of earlier > work on freezing. Related work from the Postgres 15 cycle became > commits 0b018fab, f3c15cbe, and 44fa8488. Attached is v2. This is just to keep CFTester happy, since v1 now has conflicts when applied against HEAD. There are no notable changes in this v2 compared to v1. -- Peter Geoghegan v2-0001-Add-page-level-freezing-to-VACUUM.patch Description: Binary data v2-0002-Teach-VACUUM-to-use-visibility-map-snapshot.patch Description: Binary data v2-0003-Add-eager-freezing-strategy-to-VACUUM.patch Description: Binary data v2-0004-Avoid-allocating-MultiXacts-during-VACUUM.patch Description: Binary data
Solaris "sed" versus pre-v13 plpython tests
I noticed that buildfarm member margay, which just recently started running tests on REL_12_STABLE, is failing the plpython tests in that branch [1], though it's happy in v13 and later. The failures appear due to syntax errors in Python "except" statements, and it's visible in some of the tests that we are failing to convert ancient Python "except" syntax to what Python 3 wants: diff -U3 /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/expected/python3/plpython_types_3.out /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/results/python3/plpython_types.out --- /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/expected/python3/plpython_types_3.out 2022-08-31 22:29:51.070597370 +0200 +++ /home/marcel/build-farm-14/buildroot/REL_12_STABLE/pgsql.build/src/pl/plpython/results/python3/plpython_types.out 2022-08-31 22:29:53.053544840 +0200 @@ -400,15 +400,16 @@ import marshal try: return marshal.loads(x) -except ValueError as e: +except ValueError, e: return 'FAILED: ' + str(e) $$ LANGUAGE plpython3u; +ERROR: could not compile PL/Python function "test_type_unmarshal" +DETAIL: SyntaxError: invalid syntax (, line 6) So it would seem that regress-python3-mangle.mk's sed command to perform this transformation isn't working on margay's sed. We've had to hack regress-python3-mangle.mk for Solaris "sed" before (cf commit c3556f6fa). This seems to be another instance of that same crummy implementation of '*' patterns. I suppose Noah missed this angle at the time because the problematic pattern had already been removed in v13 and up (45223fd9c). But it's still there in v12. I am not completely sure why buildfarm member wrasse isn't failing similarly, but a likely theory is that Noah has got some other sed in his search path there. I confirmed on the gcc farm's Solaris 11 box that the pattern doesn't work as expected with /usr/bin/sed: tgl@gcc-solaris11:~$ echo except foo, bar: | sed -e 's/except \([[:alpha:]][[:alpha:].]*\), *\([[:alpha:]][[:alpha:]]*\):/except \1 as \2:/g' except foo, bar: We could perhaps do this instead: $ echo except foo, bar: | sed -e '/^ *except.*,.*: *$/s/, / as /g' except foo as bar: It's a little bit more brittle, but Python doesn't allow any other commands on the same line does it? Another idea is to try to find some other sed to use. I see that configure already does that on most platforms, because ax_pthread.m4 has "AC_REQUIRE([AC_PROG_SED])". But if there's still someone out there using --disable-thread-safety, they might think this is an annoying movement of the build-requirements goalposts. regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=margay&dt=2022-08-31%2020%3A00%3A05
Re: cataloguing NOT NULL constraints
On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera wrote: > So I was wrong in thinking that "this case was simple to implement" as I > replied upthread. Doing that actually required me to rewrite large > parts of the patch. I think it ended up being a good thing, because in > hindsight the approach I was using was somewhat bogus anyway, and the > current one should be better. Please find it attached. > > There are still a few problems, sadly. Most notably, I ran out of time > trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode. > I have to review that again, but I think it'll need a deeper rethink of > how we pg_upgrade inherited constraints. So the pg_upgrade tests are > known to fail. I'm not aware of any other tests failing, but I'm sure > the cfbot will prove me wrong. > > I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull, > to allow setting pg_attribute.attnotnull without adding a CHECK > constraint (only used internally). I would like to find a better way to > go about this, so I may remove it again, therefore it's not fully > implemented. > > There are *many* changed regress expect files and I didn't carefully vet > all of them. Mostly it's the addition of CHECK constraints in the > footers of many \d listings and stuff like that. At a quick glance they > appear valid, but I need to review them more carefully still. > > We've had pg_constraint.conparentid for a while now, but for some > constraints we continue to use conislocal/coninhcount. I think we > should get rid of that and rely on conparentid completely. > > An easily fixed issue is that of constraint naming. > ChooseConstraintName has an argument for passing known constraint names, > but this patch doesn't use it and it must. > > One issue that I don't currently know how to fix, is the fact that we > need to know whether a column is a row type or not (because they need a > different null test). At table creation time that's easy to know, > because we have the descriptor already built by the time we add the > constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT > then we don't. > > Some ancient code comments suggest that allowing a child table's NOT > NULL constraint acquired from parent shouldn't be independently > droppable. This patch doesn't change that, but it's easy to do if we > decide to. However, that'd be a compatibility break, so I'd rather not > do it in the same patch that introduces the feature. > > Overall, there's a lot more work required to get this to a good shape. > That said, I think it's the right direction. > > -- > Álvaro Herrera 48°01'N 7°57'E — > https://www.EnterpriseDB.com/ > "La primera ley de las demostraciones en vivo es: no trate de usar el > sistema. > Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) > Hi, For findNotNullConstraintAttnum(): + if (multiple == NULL) + break; Shouldn't `pfree(arr)` be called before breaking ? +static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name, You used `NN` because there is method makeCheckNotNullConstraint, right ? I think it would be better to expand `NN` so that its meaning is easy to understand. Cheers
Re: cataloguing NOT NULL constraints
On Wed, Aug 31, 2022 at 4:08 PM Zhihong Yu wrote: > > > On Wed, Aug 31, 2022 at 3:19 PM Alvaro Herrera > wrote: > >> So I was wrong in thinking that "this case was simple to implement" as I >> replied upthread. Doing that actually required me to rewrite large >> parts of the patch. I think it ended up being a good thing, because in >> hindsight the approach I was using was somewhat bogus anyway, and the >> current one should be better. Please find it attached. >> >> There are still a few problems, sadly. Most notably, I ran out of time >> trying to fix a pg_upgrade issue with pg_dump in binary-upgrade mode. >> I have to review that again, but I think it'll need a deeper rethink of >> how we pg_upgrade inherited constraints. So the pg_upgrade tests are >> known to fail. I'm not aware of any other tests failing, but I'm sure >> the cfbot will prove me wrong. >> >> I reluctantly added a new ALTER TABLE subcommand type, AT_SetAttNotNull, >> to allow setting pg_attribute.attnotnull without adding a CHECK >> constraint (only used internally). I would like to find a better way to >> go about this, so I may remove it again, therefore it's not fully >> implemented. >> >> There are *many* changed regress expect files and I didn't carefully vet >> all of them. Mostly it's the addition of CHECK constraints in the >> footers of many \d listings and stuff like that. At a quick glance they >> appear valid, but I need to review them more carefully still. >> >> We've had pg_constraint.conparentid for a while now, but for some >> constraints we continue to use conislocal/coninhcount. I think we >> should get rid of that and rely on conparentid completely. >> >> An easily fixed issue is that of constraint naming. >> ChooseConstraintName has an argument for passing known constraint names, >> but this patch doesn't use it and it must. >> >> One issue that I don't currently know how to fix, is the fact that we >> need to know whether a column is a row type or not (because they need a >> different null test). At table creation time that's easy to know, >> because we have the descriptor already built by the time we add the >> constraints; but if you do ALTER TABLE .. ADD COLUMN .., ADD CONSTRAINT >> then we don't. >> >> Some ancient code comments suggest that allowing a child table's NOT >> NULL constraint acquired from parent shouldn't be independently >> droppable. This patch doesn't change that, but it's easy to do if we >> decide to. However, that'd be a compatibility break, so I'd rather not >> do it in the same patch that introduces the feature. >> >> Overall, there's a lot more work required to get this to a good shape. >> That said, I think it's the right direction. >> >> -- >> Álvaro Herrera 48°01'N 7°57'E — >> https://www.EnterpriseDB.com/ >> "La primera ley de las demostraciones en vivo es: no trate de usar el >> sistema. >> Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen) >> > > Hi, > For findNotNullConstraintAttnum(): > > + if (multiple == NULL) > + break; > > Shouldn't `pfree(arr)` be called before breaking ? > > +static Constraint *makeNNCheckConstraint(Oid nspid, char *constraint_name, > > You used `NN` because there is method makeCheckNotNullConstraint, right ? > I think it would be better to expand `NN` so that its meaning is easy to > understand. > > Cheers > Hi, For tryExtractNotNullFromNode , in the block for `if (rel == NULL)`: + return false; I think you meant returning NULL since false is for boolean. Cheers
Re: Support tls-exporter as channel binding for TLSv1.3
On Sun, Aug 28, 2022 at 11:02 PM Michael Paquier wrote: > RFC9266, that has been released not so long ago, has added > tls-exporter as a new channel binding type: > https://www.rfc-editor.org/rfc/rfc5929.html Hi Michael, thank you for sending this! > Note also that tls-exporter is aimed for > TLSv1.3 and newer protocols, but OpenSSL allows the thing to work with > older protocols (testable with ssl_max_protocol_version, for example), > and I don't see a need to prevent this scenario. For protocols less than 1.3 we'll need to ensure that the extended master secret is in use: This channel binding mechanism is defined only when the TLS handshake results in unique master secrets. This is true of TLS versions prior to 1.3 when the extended master secret extension of [RFC7627] is in use, and it is always true for TLS 1.3 (see Appendix D of [RFC8446]). OpenSSL should have an API for that (SSL_get_extms_support); I don't know when it was introduced. If we want to cross all our T's, we should also disallow tls-exporter if the server was unable to set SSL_OP_NO_RENEGOTIATION. > An extra thing is > that attempting to use tls-exporter with a backend <= 15 and a client > >= 16 causes a failure during the SASL exchange, where the backend > complains about tls-exporter being unsupported. Yep. -- Did you have any thoughts about contributing the Python tests (or porting them to Perl, bleh) so that we could test failure modes as well? Unfortunately those Python tests were also OpenSSL-based, so they're less powerful than an independent implementation... Thanks, --Jacob
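For illustration, a hedged sketch of the checks described above, assuming OpenSSL's SSL_version() and SSL_get_extms_support() APIs; the function name and its placement are hypothetical and not part of the proposed patch:

    #include <stdbool.h>
    #include <openssl/ssl.h>

    /*
     * tls-exporter channel binding requires a unique master secret: always
     * true for TLS 1.3, and true for older protocols only when the RFC 7627
     * extended master secret extension was negotiated.
     */
    static bool
    channel_binding_tls_exporter_usable(SSL *ssl)
    {
        if (SSL_version(ssl) >= TLS1_3_VERSION)
            return true;

        return SSL_get_extms_support(ssl) == 1;
    }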
Re: Small cleanups to tuplesort.c and a bonus small performance improvement
On Wed, 31 Aug 2022 at 22:39, David Rowley wrote: > My current thoughts are that this is a very trivial patch and unless > there's any objections I plan to push it soon. Pushed. David
Re: Doc patch
On Fri, Aug 19, 2022 at 12:04:54PM -0400, Bruce Momjian wrote: > You are right about the above to changes. The existing syntax shows > FROM/IN is only possible if a direction is specified, while > src/parser/gram.y says that FROM/IN with no direction is supported. > > I plan to apply this second part of the patch soon. Patch applied back to PG 10. Thanks for the research on this. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: Transparent column encryption
On Tue, Aug 30, 2022 at 4:53 AM Peter Eisentraut wrote: > I would be interested in learning more about such padding systems. I > have done a lot of reading for this development project, and I have > never come across a cryptographic approach to hide length differences by > padding. Of course, padding to the block cipher's block size is already > part of the process, but that is done out of necessity, not because you > want to disguise the length. Are there any other methods? I'm > interested to learn more. TLS 1.3 has one example. Here is a description from GnuTLS: https://gnutls.org/manual/html_node/On-Record-Padding.html (Note the option to turn on constant-time padding; that may not be a good tradeoff for us if we're focusing on offline attacks.) Here's a recent paper that claims to formally characterize length hiding, but it's behind a wall and I haven't read it: https://dl.acm.org/doi/abs/10.1145/3460120.3484590 I'll try to find more when I get the chance. --Jacob
Re: Inconsistent error message for varchar(n)
On Tue, Aug 16, 2022 at 09:56:17PM -0400, Bruce Momjian wrote: > On Sun, Nov 14, 2021 at 10:33:19AM +0800, Japin Li wrote: > > Oh! I didn't consider this situation. Since the max size of varchar cannot > > exceed 10485760, however, I cannot find this in documentation [1]. Is there > > something I missed? Should we mention this in the documentation? > > > > [1] https://www.postgresql.org/docs/devel/datatype-character.html > > Sorry for my long delay in reviewing this issue. You are correct this > should be documented --- patch attached. Patch applied back to PG 10. Thanks for the report. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Indecision is a decision. Inaction is an action. Mark Batterson
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote: > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery > immediate wait > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated > WAL; request 1201/1CAF84F0, current position 1201/1CADB730 > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > base/16881/2840_vm > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is > not satisfied --- flushed only to 1201/1CADB730 > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation > base/16881/2840_vm > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed > > I was able to start it with -c recovery_prefetch=no, so it seems like > prefetch tried to do too much. The VM runs centos7 under qemu. > I'm making a copy of the data dir in cases it's needed. Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a vm fork page (which recovery prefetch should ignore completely). Did you happen to get a copy before the successful recovery? After the successful recovery, what LSN does that page have, and can you find the references to it in the WAL with eg pg_waldump -R 1663/16681/2840 -F vm? Have you turned FPW off (perhaps this is on ZFS?)?
Re: Reducing the chunk header sizes on all memory context types
On 8/31/22 23:46, David Rowley wrote: > On Thu, 1 Sept 2022 at 08:53, Tomas Vondra > wrote: >> So the raw size (what we asked for) is ~23.5GB, but in practice we >> allocate ~28.8GB because of the pow-of-2 logic. And by adding the extra >> 1B we end up allocating 31.5GB. That doesn't seem like a huge increase, >> and it's far from the +60% you got. >> >> I wonder where does the difference come - I did make installcheck too, >> so how come you get 10/16GB, and I get 28/31GB? My patch is attached, >> maybe I did something silly. > > The reason my reported results were lower is because I ignored size > > allocChunkLimit allocations. These are not raised to the next power of > 2, so I didn't think they should be included. If I differentiate the large chunks allocated separately (v2 patch attached), I get this: f|t | count | s1 | s2 | s3 -+--+--+--+--+-- AllocSetAlloc | normal | 60714914 | 4982 | 6288 | 8185 AllocSetAlloc | separate | 824390 |18245 |18245 |18251 AllocSetRelloc | normal | 182070 | 763 | 826 | 1423 GenerationAlloc | normal | 2118115 | 68 | 90 | 102 GenerationAlloc | separate | 28 |0 |0 |0 (5 rows) Where s1 is the sum of requested sizes, s2 is the sum of allocated chunks, and s3 is chunks allocated with 1B sentinel. Focusing on the aset, vast majority of allocations (60M out of 64M) is small enough to use power-of-2 logic, and we go from 6.3GB to 8.2GB, so ~30%. Not great, not terrible. For the large allocations, there's almost no increase - in the last query I used the power-of-2 logic in the calculations, but that was incorrect, of course. > > I'm not sure why you're seeing only a 3GB additional overhead. I > noticed a logic error in my query where I was checking > maxaligned_size=pow2_size and doubling that to give sentinel space. > That really should have been "case size=pow2_size then pow2_size * 2 > else pow2_size end", However, after adjusting the query, it does not > seem to change the results much: > > postgres=# select > postgres-# round(sum(pow2_Size)::numeric/1024/1024/1024,3) as pow2_size, > postgres-# round(sum(case when size=pow2_size then pow2_size*2 else > pow2_size end)::numeric/1024/1024/1024,3) as method1, > postgres-# round(sum(case when size=pow2_size then pow2_size+8 else > pow2_size end)::numeric/1024/1024/1024,3) as method2 > postgres-# from memstats > postgres-# where pow2_size > 0; > pow2_size | method1 | method2 > ---+-+- > 10.269 | 16.322 | 10.476 > > I've attached the crude patch I came up with for this. For some > reason it was crashing on Linux, but it ran ok on Windows, so I used > the results from that instead. Maybe that accounts for some > differences as e.g sizeof(long) == 4 on 64-bit windows. I'd be > surprised if that accounted for so many GBs though. > I tried to use that patch, but "make installcheck" never completes for me, for some reason. It seems to get stuck in infinite_recurse.sql, but I haven't looked into the details. 
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Companydiff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c index b6eeb8abab..b215940062 100644 --- a/src/backend/utils/mmgr/aset.c +++ b/src/backend/utils/mmgr/aset.c @@ -674,6 +674,15 @@ AllocSetDelete(MemoryContext context) free(set); } +static Size pow2_size(Size size) +{ + Size s = 1; + while (s < size) + s *= 2; + + return s; +} + /* * AllocSetAlloc * Returns pointer to allocated memory of given size or NULL if @@ -703,9 +712,12 @@ AllocSetAlloc(MemoryContext context, Size size) * If requested size exceeds maximum for chunks, allocate an entire block * for this request. */ - if (size > set->allocChunkLimit) + if (size + 1 > set->allocChunkLimit) { - chunk_size = MAXALIGN(size); + fprintf(stderr, "AllocSetAlloc separate %ld %ld %ld\n", size, MAXALIGN(size), MAXALIGN(size+1)); + fflush(stderr); + + chunk_size = MAXALIGN(size+1); blksize = chunk_size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ; block = (AllocBlock) malloc(blksize); if (block == NULL) @@ -726,6 +738,11 @@ AllocSetAlloc(MemoryContext context, Size size) /* set mark to catch clobber of "unused" space */ if (size < chunk_size) set_sentinel(MemoryChunkGetPointer(chunk), size); + else + { + fprintf(stderr, "sentinel not added\n"); + fflush(stderr); + } #endif #ifdef RANDOMIZE_ALLOCATED_MEMORY /* fill the allocated space with junk */ @@ -761,13 +778,16 @@ AllocSetAlloc(MemoryContext context, Size size) return MemoryChunkGetPointer(chunk); } + fprintf(stderr, "AllocSetAlloc normal %ld %ld %ld\n", size, pow2_size(size), pow2_size(size+1)); + fflush(stderr); + /* * Request is small enough to be treated as a chunk. Look in the * corresponding
Re: PostgreSQL 15 release announcement draft
On Tue, Aug 30, 2022 at 03:58:48PM -0400, Jonathan S. Katz wrote:
> In this latest release, PostgreSQL improves on its in-memory and on-disk
> sorting algorithms, with benchmarks showing speedups of 25% - 400% based
> on sort types.

rather than "based on": "depending on the data types being sorted"

> Building on work from the previous PostgreSQL release for allowing async
> remote queries, the PostgreSQL foreign data wrapper, `postgres_fdw`, can
> now commit transactions in parallel.

asynchronous

> benefits for certain workloads. On certain operating systems, PostgreSQL 15

s/certain/some ?

> supports the ability to prefetch WAL file contents and speed up recovery
> times.
> PostgreSQL's built-in backup command, `pg_basebackup`, now supports
> server-side compression of backup files with a choice of gzip, LZ4, and
> zstd.

remove "server-side", since they're also supported on the client-side.

> PostgreSQL 15 lets user create views that query data using the permissions of

users

> the caller, not the view creator. This option, called `security_invoker`, adds
> an additional layer of protection to ensure view callers have the correct
> permissions for working with the underlying data.

ensure *that ?

> alter server-level configuration parameters. Additionally, users can now
> search for information about configuration using the `\dconfig` command
> from the `psql` command-line tool.

rather than "search for information about configuration", say "list
configuration information" ?

> PostgreSQL server-level statistics are now collected in shared memory,
> eliminating the statistics collector process and writing these stats to disk.

and *the need to periodically* write these stats to disk

-- 
Justin
Re: Handle infinite recursion in logical replication setup
On Wed, Aug 31, 2022 at 8:36 PM Amit Kapila wrote:
>
> On Wed, Aug 31, 2022 at 11:35 AM Peter Smith wrote:
> >
> > Here are my review comments for v43-0001.
> > ...
> > ==
> >
> > 2. doc/src/sgml/ref/create_subscription.sgml
> >
> > +
> > +   If the subscription is created with origin = none and
> > +   copy_data = true, it will check if the publisher has
> > +   subscribed to the same table from other publishers and, if so, log a
> > +   warning to notify the user to check the publisher tables. Before continuing
> > +   with other operations the user should check that publisher tables did not
> > +   have data with different origins to prevent data inconsistency issues on the
> > +   subscriber.
> > +
> >
> > I am not sure about that wording saying "to prevent data inconsistency
> > issues" because I thought maybe is already too late because any
> > unwanted origin data is already copied during the initial sync. I
> > think the user can do it try to clean up any problems caused before
> > things become even worse (??)
> >
>
> Oh no, it is not too late in all cases. The problem can only occur if
> the user will try to subscribe from all the different publications
> with copy_data = true. We are anyway trying to clarify in the second
> patch the right way to accomplish this. So, I am not sure what better
> we can do here. The only bulletproof way is to provide some APIs where
> users don't need to bother about all these cases, basically something
> similar to what Kuroda-San is working on in the thread [1].
>

The point of my review comment was only about the wording of the note -
specifically, you cannot "prevent" something (e.g. data inconsistency)
if it has already happened.

Maybe modify the wording like below?

BEFORE
Before continuing with other operations the user should check that
publisher tables did not have data with different origins to prevent
data inconsistency issues on the subscriber.

AFTER
If a publisher table with data from different origins was found to have
been copied to the subscriber, then some corrective action may be
necessary before continuing with other operations.

-- 
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Reducing the chunk header sizes on all memory context types
Tomas Vondra writes:
> Focusing on the aset, vast majority of allocations (60M out of 64M) is
> small enough to use power-of-2 logic, and we go from 6.3GB to 8.2GB, so
> ~30%. Not great, not terrible.

Not sure why this escaped me before, but I remembered another argument
for not forcibly adding space for a sentinel: if you don't have room,
that means the chunk end is up against the header for the next chunk,
which means that any buffer overrun will clobber that header. So we'll
detect the problem anyway if we validate the headers to a reasonable
extent.

The hole in this argument is that the very last chunk allocated in a
block might have no following chunk to validate. But we could probably
special-case that somehow, maybe by laying down a sentinel in the free
space, where it will get overwritten by the next chunk when that does
get allocated.

30% memory bloat seems like a high price to pay if it's adding
negligible detection ability, which it seems is true if this argument
is valid. Is there reason to think we can't validate headers enough to
catch clobbers?

			regards, tom lane
Re: Reducing the chunk header sizes on all memory context types
On Thu, 1 Sept 2022 at 12:23, Tom Lane wrote:
> Is there reason to think we can't validate headers enough to catch
> clobbers?

For non-sentinel chunks, the next byte after the end of the chunk will
be storing the block offset for the following chunk. I think:

if (block != MemoryChunkGetBlock(chunk))
	elog(WARNING, "problem in alloc set %s: bad block offset for chunk %p in block %p",
		 name, chunk, block);

should catch those.

Maybe we should just consider always making room for a sentinel for
chunks that are on dedicated blocks. At most that's an extra 8 bytes in
some allocation that's either over 1024 or 8192 (depending on
maxBlockSize).

David
Re: [PATCH] Add native windows on arm64 support
On Wed, Aug 31, 2022 at 01:33:53PM -0400, Tom Lane wrote:
> I did not check the other XMLTABLE infrastructure, so what probably
> is best to do is keep the two output columns but change 'node()'
> to something with a more stable result; any preferences?

The input object could also be reworked so as we would not have any
ordering issues, say "pre&deeppost". Changing only the path, you could
use "/e/n2" instead of "node()", so as we'd always get the most
internal member.
--
Michael

signature.asc
Description: PGP signature
Re: Bug fix. autovacuum.c do_worker_start() associates memory allocations with TopMemoryContext rather than 'Autovacuum start worker (tmp)'
On Wed, Aug 31, 2022 at 01:09:22PM -0700, Andres Freund wrote:
> On 2022-08-31 16:05:03 -0400, Tom Lane wrote:
>> Even if there's only a visible leak in v15/HEAD, I'm inclined
>> to back-patch this all the way, because it's certainly buggy
>> on its own terms. It's just the merest accident that neither
>> caller is leaking other stuff into TopMemoryContext, because
>> they both think they are using a short-lived context.
>
> +1

Ouch. Thanks for the quick fix and the backpatch.
--
Michael

signature.asc
Description: PGP signature
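For readers following along, the bug class being discussed is easy to see in
a sketch. The snippet below is illustrative only -- it is not the
autovacuum.c fix itself, and it assumes a normal PostgreSQL backend
environment: anything palloc'd while CurrentMemoryContext is
TopMemoryContext lives until the process exits, so code that believes it is
working in a short-lived context must actually switch into that context
before allocating.

/*
 * Illustrative sketch of the intended pattern (not the actual autovacuum.c
 * code). Without the MemoryContextSwitchTo() below, the palloc() would be
 * charged to whatever context happens to be current -- possibly
 * TopMemoryContext, producing a leak that lasts for the backend's lifetime.
 */
#include "postgres.h"
#include "utils/memutils.h"

static void
do_some_work(void)
{
	MemoryContext tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
												 "tmp work context",
												 ALLOCSET_DEFAULT_SIZES);
	MemoryContext oldcxt = MemoryContextSwitchTo(tmpcxt);

	char	   *buf = palloc(1024);	/* allocated in tmpcxt, freed with it */

	/* ... use buf ... */
	(void) buf;

	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(tmpcxt);	/* releases everything allocated above */
}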
Re: Reducing the chunk header sizes on all memory context types
David Rowley writes:
> Maybe we should just consider always making room for a sentinel for
> chunks that are on dedicated blocks. At most that's an extra 8 bytes
> in some allocation that's either over 1024 or 8192 (depending on
> maxBlockSize).

Agreed, if we're not doing that already then we should.

			regards, tom lane
Re: Reducing the chunk header sizes on all memory context types
On 9/1/22 02:23, Tom Lane wrote:
> Tomas Vondra writes:
>> Focusing on the aset, vast majority of allocations (60M out of 64M) is
>> small enough to use power-of-2 logic, and we go from 6.3GB to 8.2GB, so
>> ~30%. Not great, not terrible.
>
> Not sure why this escaped me before, but I remembered another argument
> for not forcibly adding space for a sentinel: if you don't have room,
> that means the chunk end is up against the header for the next chunk,
> which means that any buffer overrun will clobber that header. So we'll
> detect the problem anyway if we validate the headers to a reasonable
> extent.
>
> The hole in this argument is that the very last chunk allocated in a
> block might have no following chunk to validate. But we could probably
> special-case that somehow, maybe by laying down a sentinel in the free
> space, where it will get overwritten by the next chunk when that does
> get allocated.
>
> 30% memory bloat seems like a high price to pay if it's adding negligible
> detection ability, which it seems is true if this argument is valid.
> Is there reason to think we can't validate headers enough to catch
> clobbers?
>

I'm not quite convinced the 30% figure is correct - it might be if you
ignore cases exceeding allocChunkLimit, but that also makes it pretty
bogus (because large allocations represent ~2x as much space).

You're probably right we'll notice the clobber cases due to corruption
of the next chunk header. The annoying thing is that a corrupted header
only tells you there's a corruption somewhere, but it may be hard to
know which part of the code caused it. I was hoping the sentinel would
make it easier, because we mark it as NOACCESS for valgrind. But now I
see we mark the first part of a MemoryChunk too, so maybe that's
enough.

OTOH we have platforms where valgrind is either not supported or no one
runs tests with it (e.g. on rpi4 it'd take insane amounts of time). In
that case the sentinel might be helpful, especially considering that
alignment on those platforms can cause funny memory issues, as
evidenced by this thread.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
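As a reference point for the valgrind discussion, here is roughly what the
sentinel mechanism looks like. This is a simplified sketch written from
memory of src/include/utils/memdebug.h, not a verbatim copy: the exact byte
value and macro spellings should be treated as assumptions.

/*
 * Simplified sketch of the sentinel/valgrind interaction (assumptions, not
 * a quote of memdebug.h).
 */
#include "postgres.h"
#include "utils/memdebug.h"

static inline void
set_sentinel_sketch(void *base, Size offset)
{
	char	   *ptr = (char *) base + offset;

	VALGRIND_MAKE_MEM_UNDEFINED(ptr, 1);
	*ptr = 0x7E;						/* the trailing guard byte */
	VALGRIND_MAKE_MEM_NOACCESS(ptr, 1);	/* valgrind reports any later touch */
}

static inline bool
sentinel_ok_sketch(const void *base, Size offset)
{
	const char *ptr = (const char *) base + offset;
	bool		ret;

	VALGRIND_MAKE_MEM_DEFINED(ptr, 1);
	ret = (*ptr == 0x7E);				/* clobbered => an overrun happened */
	VALGRIND_MAKE_MEM_NOACCESS(ptr, 1);
	return ret;
}

The point relevant to the concern above is that on platforms where valgrind
never runs, only the byte comparison in the checking path catches the
overrun, and only after the fact.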
Re: POC: Better infrastructure for automated testing of concurrency issues
Hi!

On Tue, Feb 23, 2021 at 3:09 AM Peter Geoghegan wrote:
> On Tue, Dec 8, 2020 at 2:42 AM Alexander Korotkov
> wrote:
> > Thank you for your feedback!
>
> It would be nice to use this patch to test things that are important
> but untested inside vacuumlazy.c, such as the rare
> HEAPTUPLE_DEAD/tupgone case (grep for "Ordinarily, DEAD tuples would
> have been removed by..."). Same is true of the closely related
> heap_prepare_freeze_tuple()/heap_tuple_needs_freeze() code.

I'll continue work on this patch. The rebased patch is attached. It
implements stop events as a configure option (not a runtime GUC option).

-- 
Regards,
Alexander Korotkov

0001-Stopevents-v3.patch
Description: Binary data
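To give a concrete picture of what a configure-time stop-event facility
could look like, here is a purely hypothetical sketch. The names below
(USE_STOPEVENTS, STOPEVENT, stopevent_wait) are invented for illustration
and are not taken from the attached patch.

/*
 * Hypothetical sketch only: a probe that is compiled in when the build is
 * configured with stop events, and compiles away entirely otherwise.
 */
#ifdef USE_STOPEVENTS
extern void stopevent_wait(const char *event_name);
#define STOPEVENT(name) stopevent_wait(name)
#else
#define STOPEVENT(name) ((void) 0)
#endif

/*
 * Usage idea: a test places STOPEVENT("before_tupgone_check") at the point
 * of interest, makes a backend pause there, sets up the desired concurrent
 * state from another session (e.g. a tuple that becomes HEAPTUPLE_DEAD),
 * and then releases the waiting backend.
 */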
Re: pg15b3: recovery fails with wal prefetch enabled
On Thu, Sep 01, 2022 at 12:05:36PM +1200, Thomas Munro wrote:
> On Thu, Sep 1, 2022 at 2:01 AM Justin Pryzby wrote:
> > < 2022-08-31 08:44:10.495 CDT >LOG: checkpoint starting: end-of-recovery immediate wait
> > < 2022-08-31 08:44:10.609 CDT >LOG: request to flush past end of generated WAL; request 1201/1CAF84F0, current position 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT >ERROR: xlog flush request 1201/1CAF84F0 is not satisfied --- flushed only to 1201/1CADB730
> > < 2022-08-31 08:44:10.609 CDT >CONTEXT: writing block 0 of relation base/16881/2840_vm
> > < 2022-08-31 08:44:10.609 CDT >FATAL: checkpoint request failed
> >
> > I was able to start it with -c recovery_prefetch=no, so it seems like
> > prefetch tried to do too much. The VM runs centos7 under qemu.
> > I'm making a copy of the data dir in case it's needed.
>
> Hmm, a page with an LSN set 118208 bytes past the end of WAL. It's a
> vm fork page (which recovery prefetch should ignore completely). Did
> you happen to get a copy before the successful recovery? After the
> successful recovery, what LSN does that page have, and can you find
> the references to it in the WAL with eg pg_waldump -R 1663/16681/2840
> -F vm? Have you turned FPW off (perhaps this is on ZFS?)?

Yes, I have a copy that reproduces the issue:

#1  0x009a0df4 in errfinish (filename=<optimized out>, filename@entry=0xa15535 "xlog.c", lineno=lineno@entry=2671, funcname=funcname@entry=0xa1da27 <__func__.22763> "XLogFlush") at elog.c:588
#2  0x0055f1cf in XLogFlush (record=19795985532144) at xlog.c:2668
#3  0x00813b24 in FlushBuffer (buf=0x7fffdf1f8900, reln=<optimized out>) at bufmgr.c:2889
#4  0x00817a5b in SyncOneBuffer (buf_id=buf_id@entry=7796, skip_recently_used=skip_recently_used@entry=false, wb_context=wb_context@entry=0x7fffcdf0) at bufmgr.c:2576
#5  0x008181c2 in BufferSync (flags=flags@entry=358) at bufmgr.c:2164
#6  0x008182f5 in CheckPointBuffers (flags=flags@entry=358) at bufmgr.c:2743
#7  0x005587b2 in CheckPointGuts (checkPointRedo=19795985413936, flags=flags@entry=358) at xlog.c:6855
#8  0x0055feb3 in CreateCheckPoint (flags=flags@entry=358) at xlog.c:6534
#9  0x007aceaa in CheckpointerMain () at checkpointer.c:455
#10 0x007aac52 in AuxiliaryProcessMain (auxtype=auxtype@entry=CheckpointerProcess) at auxprocess.c:153
#11 0x007b0bd8 in StartChildProcess (type=<optimized out>) at postmaster.c:5430
#12 0x007b388f in PostmasterMain (argc=argc@entry=7, argv=argv@entry=0xf139e0) at postmaster.c:1463
#13 0x004986a6 in main (argc=7, argv=0xf139e0) at main.c:202

It's not on zfs, and FPW have never been turned off. I should add that
this instance has been pg_upgraded since v10.

BTW, base/16881 is the postgres DB (which has 43GB of logfiles imported
from CSV, plus 2GB of snapshots of pg_control_checkpoint, pg_settings,
pg_stat_bgwriter, pg_stat_database, pg_stat_wal).

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'main', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+------------
 1201/1CDD1F98 |    -6200 |     1 |    44 |   424 |    8192 |     8192 |       4 | 3681043287
(1 fila)

postgres=# SELECT * FROM page_header(get_raw_page('pg_toast.pg_toast_2619', 'vm', 0));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 1201/1CAF84F0 |    -6010 |     0 |    24 |  8192 |    8192 |     8192 |       4 |         0
(1 fila)

I found this in waldump (note that you had a typo - it's 16881).
[pryzbyj@template0 ~]$ sudo /usr/pgsql-15/bin/pg_waldump -R 1663/16881/2840 -F vm -p /mnt/tmp/15/data/pg_wal 00000001000012010000001C
rmgr: Heap2  len (rec/tot):  64/  122, tx:  0, lsn: 1201/1CAF2658, prev 1201/1CAF2618, desc: VISIBLE cutoff xid 3681024856 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0 FPW, blkref #1: rel 1663/16881/2840 blk 54
rmgr: Heap2  len (rec/tot):  59/   59, tx:  0, lsn: 1201/1CAF3AF8, prev 1201/1CAF2788, desc: VISIBLE cutoff xid 2 flags 0x03, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 0
rmgr: Heap2  len (rec/tot):  59/   59, tx:  0, lsn: 1201/1CAF3B70, prev 1201/1CAF3B38, desc: VISIBLE cutoff xid 3671427998 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 2
rmgr: Heap2  len (rec/tot):  59/   59, tx:  0, lsn: 1201/1CAF4DC8, prev 1201/1CAF3BB0, desc: VISIBLE cutoff xid 3672889900 flags 0x01, blkref #0: rel 1663/16881/2840 fork vm blk 0, blkref #1: rel 1663/16881/2840 blk 4
rmgr: Heap2  len (rec