Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors
Hello Tom,

> The buildfarm is still complaining about the synopsis being too wide for PDF
> format. I think what we ought to do is give up on using a <synopsis> for log
> lines at all, and instead convert the documentation into a tabular list of
> fields. Proposal attached, which also fixes a couple of outright errors.

Looks ok. Html doc generation ok.

While looking at the html output, the "pgbench" command line just below wraps strangely:

   pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000
   --latency-limit=10 --failures-detailed --max-tries=10 test

ISTM that there should be no nl in the pgbench … section, although maybe it would trigger a complaint in the pdf format.

> One thing that this doesn't fix is that the existing text appears to suggest
> that the "failures" column is something different from the sum of the
> serialization_failures and deadlock_failures columns, which it's obvious from
> the code is not so. If this isn't a code bug then I think we ought to just
> drop that column entirely, because it's redundant.

Ok. Fine with me. Possibly at some point there was the idea that there could be other failures counted, but there are none. Also, there have been questions about the failures-detailed option, or whether the reports should always be detailed, and the result may be some kind of not convincing compromise.

> (BTW, now that I've read this stuff I am quite horrified by how the
> non-aggregated log format has been mangled for error retries, and will
> probably be submitting a proposal to change that. But that's a different
> issue.)

Indeed, any improvement is welcome!

-- Fabien.
Re: Defer selection of asynchronous subplans until the executor initialization stage
On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita wrote: > On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby wrote: > > This patch seems to be causing the planner to crash. > > Here's a query reduced from sqlsmith. > > > > | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 > > <= pg_trigger_depth(); > > > > Program terminated with signal SIGSEGV, Segmentation fault. > > Reproduced. Will look into this. I think the cause of this is that mark_async_capable_plan() failed to take into account that when the given path is a SubqueryScanPath or ForeignPath, the given corresponding plan might include a gating Result node that evaluates pseudoconstant quals. My oversight. :-( Attached is a patch for fixing that. I think v14 has the same issue, so I think we need backpatching. Best regards, Etsuro Fujita prevent-async.patch Description: Binary data
Re: Defer selection of asynchronous subplans until the executor initialization stage
Hi, On Sat, Apr 9, 2022 at 1:24 AM Zhihong Yu wrote: > On Fri, Apr 8, 2022 at 5:43 AM Justin Pryzby wrote: >> This patch seems to be causing the planner to crash. >> Here's a query reduced from sqlsmith. >> >> | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 >> <= pg_trigger_depth(); >> >> Program terminated with signal SIGSEGV, Segmentation fault. > I logged the value of plan->scanstatus before the assertion : > > 2022-04-08 16:20:59.601 UTC [26325] LOG: scan status 0 > 2022-04-08 16:20:59.601 UTC [26325] STATEMENT: explain SELECT 1 FROM > information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth(); > 2022-04-08 16:20:59.796 UTC [26296] LOG: server process (PID 26325) was > terminated by signal 11: Segmentation fault > > It seems its value was SUBQUERY_SCAN_UNKNOWN. > > Still trying to find out the cause for the crash. I think the cause is an oversight in mark_async_capable_plan(). See [1]. Thanks! Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/CAPmGK15NkuaVo0Fu_0TfoCpPPJaJi4OMLzEQtkE6Bt6YT52fPQ%40mail.gmail.com
Re: Can we automatically add elapsed times to tap test log?
On 4/8/22 09:51, Andrew Dunstan wrote: > On 4/7/22 19:55, Andrew Dunstan wrote: >> On 4/7/22 17:58, Andres Freund wrote: >>> Hi, >>> >>> On 2022-04-07 17:45:09 -0400, Tom Lane wrote: Andres Freund writes: > On 2022-04-07 17:21:09 -0400, Tom Lane wrote: >> I too think that the elapsed time is useful. I'm less convinced >> that the time-of-day marker is useful. > I think it'd be quite useful if it had more precision - it's a pita to > correlate regress_log_* output with server logs. Fair point. Maybe we could keep the timestamp (with ms precision if possible) and then the parenthetical bit is time-since-last-line (also with ms precision)? I think that would more or less satisfy both uses. >>> Would work for me... >>> >> All doable. Time::HiRes gives us a higher resolution timer. I'll post a >> new version in a day or two. > > New version attached. > > > Sample traces: > > > andrew@emma:log $ egrep '^\[[0-9][0-9]:[00-9][0-9]:' > regress_log_020_pg_receivewal | tail -n 15 > [09:22:45.031](0.000s) ok 30 # skip postgres was not built with LZ4 support > [09:22:45.032](0.000s) ok 31 # skip postgres was not built with LZ4 support > [09:22:45.296](0.265s) ok 32 - streaming some WAL > [09:22:45.297](0.001s) ok 33 - check that previously partial WAL is now > complete > [09:22:45.298](0.001s) ok 34 - check stream dir permissions > [09:22:45.298](0.000s) # Testing pg_receivewal with slot as starting > streaming point > [09:22:45.582](0.284s) ok 35 - pg_receivewal fails with non-existing slot: > exit code not 0 > [09:22:45.583](0.001s) ok 36 - pg_receivewal fails with non-existing slot: > matches > [09:22:45.618](0.036s) ok 37 - WAL streamed from the slot's restart_lsn > [09:22:45.619](0.001s) ok 38 - WAL from the slot's restart_lsn has been > archived > [09:22:46.597](0.978s) ok 39 - Stream some wal after promoting, resuming from > the slot's position > [09:22:46.598](0.001s) ok 40 - WAL segment 0001000B archived > after timeline jump > [09:22:46.598](0.000s) ok 41 - WAL segment 0002000C archived > after timeline jump > [09:22:46.598](0.000s) ok 42 - timeline history file archived after timeline > jump > [09:22:46.599](0.001s) 1..42 > > In the absence of further comment I have pushed this. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Defer selection of asynchronous subplans until the executor initialization stage
On Sun, Apr 10, 2022 at 3:42 AM Etsuro Fujita wrote: > On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita > wrote: > > On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby > wrote: > > > This patch seems to be causing the planner to crash. > > > Here's a query reduced from sqlsmith. > > > > > > | explain SELECT 1 FROM information_schema.constraint_column_usage > WHERE 1 <= pg_trigger_depth(); > > > > > > Program terminated with signal SIGSEGV, Segmentation fault. > > > > Reproduced. Will look into this. > > I think the cause of this is that mark_async_capable_plan() failed to > take into account that when the given path is a SubqueryScanPath or > ForeignPath, the given corresponding plan might include a gating > Result node that evaluates pseudoconstant quals. My oversight. :-( > Attached is a patch for fixing that. I think v14 has the same issue, > so I think we need backpatching. > > Best regards, > Etsuro Fujita > Hi, Looking at the second hunk of the patch: FdwRoutine *fdwroutine = path->parent->fdwroutine; ... + if (IsA(plan, Result)) + return false; It seems the check of whether plan is a Result node can be lifted ahead of the switch statement (i.e. to the beginning of mark_async_capable_plan). This way, we don't have to check for every case in the switch statement. Cheers
Re: Can we automatically add elapsed times to tap test log?
On 2022-04-10 09:23:16 -0400, Andrew Dunstan wrote: > In the absence of further comment I have pushed this. Thanks!
Re: suboverflowed subtransactions concurrency performance optimize
On Thu, 7 Apr 2022 at 00:36, Michael Paquier wrote:
>
> On Mon, Mar 07, 2022 at 10:17:41PM +0800, Julien Rouhaud wrote:
> > On Mon, Mar 07, 2022 at 01:27:40PM +, Simon Riggs wrote:
> >> The patch was posted because TransactionLogFetch() has a cache, yet
> >> SubTransGetTopmostTransaction() does not, yet the argument should be
> >> identical in both cases.
> >
> > I totally agree with that.
>
> Agreed as well. That's worth doing in isolation and that will save
> some lookups of pg_subtrans anyway while being simple. As mentioned
> upthread, this needed an indentation, and the renaming of
> cachedFetchXid to cachedFetchSubXid looks adapted. So.. Applied all
> those things.

Thanks Michael, thanks all.

--
Simon Riggs
http://www.EnterpriseDB.com/
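The optimization applied here, a one-entry cache in front of the pg_subtrans lookup that mirrors the existing cachedFetchXid cache in TransactionLogFetch(), can be sketched as follows. This is a simplified, self-contained C illustration rather than the committed code: the TransactionId typedef, the slow_subtrans_lookup() stand-in, and the function names are invented for the example.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Stand-in for the expensive pg_subtrans (SLRU page) lookup; hypothetical. */
static TransactionId
slow_subtrans_lookup(TransactionId xid)
{
    printf("slow lookup for %u\n", (unsigned) xid);
    return xid - (xid % 10);    /* fake "topmost xid" answer for the demo */
}

/*
 * One-entry cache in front of the slow lookup: repeated calls with the same
 * subtransaction XID skip the page access entirely, which is the saving the
 * committed change provides for SubTransGetTopmostTransaction().
 */
static TransactionId cachedFetchSubXid = InvalidTransactionId;
static TransactionId cachedFetchTopXid = InvalidTransactionId;

static TransactionId
get_topmost_cached(TransactionId xid)
{
    if (xid == cachedFetchSubXid)
        return cachedFetchTopXid;                    /* hit: no lookup */

    cachedFetchTopXid = slow_subtrans_lookup(xid);   /* miss: real work */
    cachedFetchSubXid = xid;
    return cachedFetchTopXid;
}

int
main(void)
{
    get_topmost_cached(1234);   /* takes the slow path */
    get_topmost_cached(1234);   /* served from the one-entry cache */
    return 0;
}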
Re: [HACKERS] WIP aPatch: Pgbench Serialization and deadlock errors
Fabien COELHO writes:
> While looking at the html output, the "pgbench" command line just below
> wraps strangely:
>    pgbench --aggregate-interval=10 --time=20 --client=10 --log --rate=1000
>    --latency-limit=10 --failures-detailed --max-tries=10 test
> ISTM that there should be no nl in the pgbench …
> section, although maybe it would trigger a complaint in the pdf format.

PDF wraps that text where it wants to anyway, so I removed the newline.

regards, tom lane
GSoC: pgBackRest port to Windows
Hello everyone, My name is Samuel Bassaly, and I would like to submit my proposal for this year's GSoC. Kindly find my proposal under the following link: https://docs.google.com/document/d/1cEGLLJaRmb5hkpt7GayIJA4oCllYtchOU2p0FQpLZ3U/edit?usp=sharing Your feedback is highly appreciated. Thank you for your time. Best Regards, Samuel Bassaly
Re: Schema variables - new implementation for Postgres 15+1
On Sun, Apr 10, 2022 at 08:30:39PM +0200, Pavel Stehule wrote:
> I am sending fresh rebased patch + separation to more patches. This split
> is initial, and can be changed later

The 0001 patch requires this, but it's not included until 0003:
src/include/commands/session_variable.h

Each patch should compile and pass tests with the preceding patches, without the following patches. I think the regression tests should be included with their corresponding patch. Maybe it's ok to separate out the changes for pg_dump, docs, and psql - but they'd have to be merged together eventually. I realize some of this runs counter to Julien's suggestion to split patches.

The version should be changed:
+ if (fout->remoteVersion < 15)

I enabled these, which causes the regression tests to fail:
+#define COPY_PARSE_PLAN_TREES
+#define WRITE_READ_PARSE_PLAN_TREES
+#define RAW_EXPRESSION_COVERAGE_TEST

/home/pryzbyj/src/postgres/src/test/regress/results/session_variables.out 2022-04-10 15:37:32.926306124 -0500
@@ -16,7 +16,7 @@
 SET ROLE TO var_test_role;
 -- should fail
 LET var1 = 10;
-ERROR: permission denied for session variable var1
+ERROR: unrecognized node type: 368
...
Re: A qsort template
Hi, David Rowley privately reported a performance regression when sorting single ints with a lot of duplicates, in a case that previously hit qsort_ssup() but now hits qsort_tuple_int32() and then has to call the tiebreaker comparator. Note that this comes up only for sorts in a query, not for eg index builds which always have to tiebreak on item ptr. I don't have data right now but that'd likely be due to: + * XXX: For now, there is no specialization for cases where datum1 is + * authoritative and we don't even need to fall back to a callback at all (that + * would be true for types like int4/int8/timestamp/date, but not true for + * abbreviations of text or multi-key sorts. There could be! Is it worth it? Upthread we were discussing which variations it'd be worth investing extra text segment space on to gain speedup and we put those hard decisions off for future work, but on reflection, we probably should tackle this particular point to avoid a regression. I think something like the attached achieves that (draft, not tested much yet, could perhaps find a tidier way to code the decision tree). In short: variants qsort_tuple_{int32,signed,unsigned}() no longer fall back, but new variants qsort_tuple_{int32,signed,unsigned}_tiebreak() do. We should perhaps also reconsider the other XXX comment about finding a way to skip the retest of column 1 in the tiebreak comparator. Perhaps you'd just install a different comparetup function, eg comparetup_index_btree_tail (which would sharing code), so no need to multiply specialisations for that. Planning to look at this more closely after I've sorted out some other problems, but thought I'd post this draft/problem report early in case John or others have thoughts or would like to run some experiments. From 3ff95aedb1065393a0ae1496cedb463bc998a329 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 6 Apr 2022 11:38:31 +1200 Subject: [PATCH] Specialize tuplesort for keys without tiebreaker. Commit 69749243 noted a future opportunity to avoid calling the comparator function in the case of single column non-abbreviated sort, where datum1 alone is enough to decide the order. David Rowley reported a performance regression in single-column integer sorts with a lot of duplicates, where previously the qsort_ssup() specialization was reached, due to extra calls to the comparator. It seems like a good idea to fix that, so let's complete that TODO item now. XXX WIP, is the if-then-else getting too ugly, maybe switch to a table of functions to search? XXX needs testing, experiments --- src/backend/utils/sort/tuplesort.c | 110 +++-- 1 file changed, 89 insertions(+), 21 deletions(-) diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c index 571fb95532..dc16e6c242 100644 --- a/src/backend/utils/sort/tuplesort.c +++ b/src/backend/utils/sort/tuplesort.c @@ -312,6 +312,14 @@ struct Tuplesortstate */ bool haveDatum1; + /* + * Whether the comparator function must always be called for tuples with + * equal datum1. This should be set to true if the sorting has extra + * 'invisible' sort order beyond the regular keys, to disable optimizations + * that would otherwise skip the function call. + */ + bool alwaysTiebreak; + /* * This array holds the tuples now in sort memory. 
If we are in state * INITIAL, the tuples are in no particular order; if we are in state @@ -682,11 +690,6 @@ static void tuplesort_updatemax(Tuplesortstate *state); * * XXX: For now, these fall back to comparator functions that will compare the * leading datum a second time. - * - * XXX: For now, there is no specialization for cases where datum1 is - * authoritative and we don't even need to fall back to a callback at all (that - * would be true for types like int4/int8/timestamp/date, but not true for - * abbreviations of text or multi-key sorts. There could be! Is it worth it? */ /* Used if first key's comparator is ssup_datum_unsigned_compare */ @@ -740,10 +743,12 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) * qsort_ssup() is specialized for the case where the comparetup function * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts * and Datum sorts. qsort_tuple_{unsigned,signed,int32} are specialized for - * common comparison functions on pass-by-value leading datums. + * common comparison functions on pass-by-value leading datums. The _tiebreak + * variants are for when it is necessary to call the comparator function even + * if datum1 is equal. */ -#define ST_SORT qsort_tuple_unsigned +#define ST_SORT qsort_tuple_unsigned_tiebreak #define ST_ELEMENT_TYPE SortTuple #define ST_COMPARE(a, b, state) qsort_tuple_unsigned_compare(a, b, state) #define ST_COMPARE_ARG_TYPE Tuplesortstate @@ -752,7 +757,7 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state) #define ST_DEFINE #include "lib/sort_template.h" -#define
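Setting the attached patch text aside, the distinction the change introduces, between comparators that must call back into the full comparetup routine on equal leading datums and comparators where datum1 alone is authoritative, can be sketched in plain C. This is a hand-written illustration with invented SortTuple/SortState stand-ins, not the template-generated qsort_tuple_* specializations from the patch.

#include <stdint.h>

/* Simplified stand-ins for tuplesort's SortTuple and sort state (invented). */
typedef struct SortTuple
{
    int32_t     datum1;         /* leading key value, already extracted */
    void       *tuple;          /* the rest of the tuple, for the tiebreaker */
} SortTuple;

typedef struct SortState
{
    /* full comparator that looks at all sort keys, as comparetup_* would */
    int         (*comparetup) (const SortTuple *a, const SortTuple *b);
} SortState;

/*
 * Shape used when the leading datum is NOT authoritative (multi-key sorts,
 * abbreviated keys): equal datum1 values force a call back into the full
 * comparator, which is the cost David's duplicate-heavy test case ran into.
 */
int
cmp_int32_tiebreak(const SortTuple *a, const SortTuple *b, SortState *state)
{
    if (a->datum1 < b->datum1)
        return -1;
    if (a->datum1 > b->datum1)
        return 1;
    return state->comparetup(a, b);     /* fall back on equal datum1 */
}

/*
 * Shape used when datum1 alone decides the order (e.g. a single-column int4
 * sort in a query): duplicates compare equal with no function call at all,
 * which is what the new non-_tiebreak specializations aim to restore.
 */
int
cmp_int32(const SortTuple *a, const SortTuple *b, SortState *state)
{
    (void) state;               /* no callback needed */
    if (a->datum1 < b->datum1)
        return -1;
    if (a->datum1 > b->datum1)
        return 1;
    return 0;
}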
Re: Improving btree performance through specializing by key shape, take 2
On Fri, Apr 8, 2022 at 9:55 AM Matthias van de Meent wrote: > Here's generation 2 of this effort. Instead of proceeding to trying to > shoehorn all types of btree keys into one common code path, this > patchset acknowledges that there exist different shapes of keys that > each have a different best way of accessing each subsequent key > attribute. This patch achieves this by specializing the functions to > different key shapes. Cool. > These patches still have some rough edges (specifically: some > functions that are being generated are unused, and intermediate > patches don't compile), but I wanted to get this out to get some > feedback on this approach. I attempted to apply your patch series to get some general sense of how it affects performance, by using my own test cases from previous nbtree project work. I gave up on that pretty quickly, though, since the code wouldn't compile. That in itself might have been okay (some "rough edges" are generally okay). The real problem was that it wasn't clear what I was expected to do about it! You mentioned that some of the patches just didn't compile, but which ones? How do I quickly get some idea of the benefits on offer here, however imperfect or preliminary? Can you post a version of this that compiles? As a general rule you should try to post patches that can be "test driven" easily. An opening paragraph that says "here is why you should care about my patch" is often something to strive for, too. I suspect that you actually could have done that here, but you didn't, for whatever reason. > I expect the performance to be at least on par with current btree > code, and I'll try to publish a more polished patchset with > performance results sometime in the near future. I'll also try to > re-attach dynamic page-level prefix truncation, but that depends on > the amount of time I have and the amount of feedback on this patchset. Can you give a few motivating examples? You know, queries that are sped up by the patch series, with an explanation of where the benefit comes from. You had some on the original thread, but that included dynamic prefix truncation stuff as well. Ideally you would also describe where the adversized improvements come from for each test case -- which patch, which enhancement (perhaps only in rough terms for now). -- Peter Geoghegan
Re: A qsort template
On Sun, Apr 10, 2022 at 2:44 PM Thomas Munro wrote: > David Rowley privately reported a performance regression when sorting > single ints with a lot of duplicates, in a case that previously hit > qsort_ssup() but now hits qsort_tuple_int32() and then has to call the > tiebreaker comparator. That's not good. The B&M quicksort implementation that we adopted is generally extremely fast for that case, since it uses 3 way partitioning (based on the Dutch National Flag algorithm). This essentially makes sorting large groups of duplicates take only linear time (not linearithmic time). -- Peter Geoghegan
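For readers unfamiliar with the technique being referenced, here is a minimal, self-contained C sketch of 3-way partitioning quicksort. It uses the textbook Dutch National Flag formulation rather than the Bentley & McIlroy split-end scheme PostgreSQL's qsort template actually uses, but it shows why a large group of duplicates is finished in a single partitioning pass.

#include <stdio.h>

static void
swap(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

/*
 * After one partitioning pass: a[lo..lt-1] < pivot, a[lt..gt] == pivot,
 * a[gt+1..hi] > pivot.  The whole middle block of duplicates is already in
 * its final position, so inputs dominated by one value sort in roughly
 * linear time; recursion only continues into the < and > parts.
 */
static void
quicksort3(int *a, int lo, int hi)
{
    int lt, gt, i, pivot;

    if (lo >= hi)
        return;

    pivot = a[lo];
    lt = lo;
    gt = hi;
    i = lo + 1;
    while (i <= gt)
    {
        if (a[i] < pivot)
            swap(&a[i++], &a[lt++]);
        else if (a[i] > pivot)
            swap(&a[i], &a[gt--]);
        else
            i++;
    }
    quicksort3(a, lo, lt - 1);
    quicksort3(a, gt + 1, hi);
}

int
main(void)
{
    int v[] = {5, 1, 5, 5, 3, 5, 5, 2, 5, 5};
    int i, n = (int) (sizeof(v) / sizeof(v[0]));

    quicksort3(v, 0, n - 1);
    for (i = 0; i < n; i++)
        printf("%d ", v[i]);
    printf("\n");
    return 0;
}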
Re: Improving btree performance through specializing by key shape, take 2
On Sun, Apr 10, 2022 at 2:44 PM Peter Geoghegan wrote: > Can you post a version of this that compiles? I forgot to add: the patch also bitrot due to recent commit dbafe127. I didn't get stuck at this point (this is minor bitrot), but no reason not to rebase. -- Peter Geoghegan
Re: Improving btree performance through specializing by key shape, take 2
On Sun, 10 Apr 2022 at 23:45, Peter Geoghegan wrote: > > On Fri, Apr 8, 2022 at 9:55 AM Matthias van de Meent > wrote: > > Here's generation 2 of this effort. Instead of proceeding to trying to > > shoehorn all types of btree keys into one common code path, this > > patchset acknowledges that there exist different shapes of keys that > > each have a different best way of accessing each subsequent key > > attribute. This patch achieves this by specializing the functions to > > different key shapes. > > Cool. > > > These patches still have some rough edges (specifically: some > > functions that are being generated are unused, and intermediate > > patches don't compile), but I wanted to get this out to get some > > feedback on this approach. > > I attempted to apply your patch series to get some general sense of > how it affects performance, by using my own test cases from previous > nbtree project work. I gave up on that pretty quickly, though, since > the code wouldn't compile. That in itself might have been okay (some > "rough edges" are generally okay). The real problem was that it wasn't > clear what I was expected to do about it! You mentioned that some of > the patches just didn't compile, but which ones? How do I quickly get > some idea of the benefits on offer here, however imperfect or > preliminary? > > Can you post a version of this that compiles? As a general rule you > should try to post patches that can be "test driven" easily. An > opening paragraph that says "here is why you should care about my > patch" is often something to strive for, too. I suspect that you > actually could have done that here, but you didn't, for whatever > reason. Yes, my bad. I pulled one patch that included unrelated changes from the set; but I missed that it also contained some changes that should've been in an earlier commit, thus breaking the set. I'll send an updated patchset soon (I'm planning on moving around when what is changed/added); but before that the attached incremental patch should help. FYI, the patchset has been tested on commit 05023a23, and compiles (with unused function warnings) after applying the attached patch. > > I expect the performance to be at least on par with current btree > > code, and I'll try to publish a more polished patchset with > > performance results sometime in the near future. I'll also try to > > re-attach dynamic page-level prefix truncation, but that depends on > > the amount of time I have and the amount of feedback on this patchset. > > Can you give a few motivating examples? You know, queries that are > sped up by the patch series, with an explanation of where the benefit > comes from. You had some on the original thread, but that included > dynamic prefix truncation stuff as well. Queries that I expect to be faster are situations where the index does front-to-back attribute accesses in a tight loop and repeated index lookups; such as in index builds, data loads, JOINs, or IN () and = ANY () operations; and then specifically for indexes with only a single key attribute, or indexes where we can determine based on the index attributes' types that nocache_index_getattr will be called at least once for a full _bt_compare call (i.e. att->attcacheoff cannot be set for at least one key attribute). Queries that I expect to be slower to a limited degree are hot loops on btree indexes that do not have a specifically optimized path, as there is some extra overhead calling the specialized functions. 
Other code might also see a minimal performance impact due to an increased binary size resulting in more cache thrashing. > Ideally you would also describe where the adversized improvements come > from for each test case -- which patch, which enhancement (perhaps > only in rough terms for now). In the previous iteration, I discerned that there are different "shapes" of indexes, some of which currently have significant overhead in the existing btree infrastructure. Especially indexes with multiple key attributes can see significant overhead while their attributes are being extracted, which (for a significant part) can be attributed to the O(n) behaviour of nocache_index_getattr. This O(n) overhead is currently avoided only by indexes with only a single key attribute and by indexes in which all key attributes have a fixed size (att->attlen > 0). The types of btree keys I discerned were: CREATE INDEX ON tst ... ... (single_key_attribute) ... (varlen, other, attributes, ...) ... (fixed_size, also_fixed, ...) ... (sometimes_null, other, attributes, ...) For single-key-attribute btrees, the performance benefits in the patch are achieved by reducing branching in the attribute extraction: There are no other key attributes to worry about, so much of the code dealing with looping over attributes can inline values, and thus reduce the amount of code generated in the hot path. For btrees with multiple key attributes, benefits are achieved if some attributes are of variable length (e.g. tex
Re: make MaxBackends available in _PG_init
On Sat, Apr 09, 2022 at 09:24:42PM +0800, Julien Rouhaud wrote: > Yeah I would prefer this approach too, although it couldn't prevent extension > from directly modifying the underlying variables so I don't know how effective > it would be. > > On the bright side, I see that citus is using SetConfigOption() to increase > max_prepared_transactions [1]. That's the only extension mentioned in that > thread that does modify some GUC, and this GUC was already marked as > PGDLLIMPORT since 2017 so it probably wasn't done that way for Windows > compatibility reason. > > In the meantime, should we add an open item? Yes, I think that we need to discuss more the matter here, so added one. -- Michael signature.asc Description: PGP signature
Re: A qsort template
On Mon, 11 Apr 2022 at 09:44, Thomas Munro wrote: > David Rowley privately reported a performance regression when sorting > single ints with a lot of duplicates, in a case that previously hit > qsort_ssup() but now hits qsort_tuple_int32() and then has to call the > tiebreaker comparator. Note that this comes up only for sorts in a > query, not for eg index builds which always have to tiebreak on item > ptr. I don't have data right now but that'd likely be due to: I've now added this as an open item for v15. David
Re: REINDEX blocks virtually any queries but some prepared queries.
On Fri, Apr 08, 2022 at 04:23:48PM +0200, Frédéric Yhuel wrote: > Thank you Michael. And done as of 8ac700a. -- Michael signature.asc Description: PGP signature
Re: Commitfest Closed
On Fri, Apr 08, 2022 at 08:48:57AM -0400, Greg Stark wrote: > Thanks to all the reviewers and committers who put a lot of work in, > especially in the last two weeks. I especially want to thank Andres > who showed me how to use the cfbot to check on patch statuses and did > a lot of work doing that until I was up to speed. Thanks for going through the CF, Greg! -- Michael signature.asc Description: PGP signature
Re: Improving btree performance through specializing by key shape, take 2
On Sun, Apr 10, 2022 at 4:08 PM Matthias van de Meent wrote: > I'll send an updated patchset soon (I'm planning on moving around when > what is changed/added); but before that the attached incremental patch > should help. FYI, the patchset has been tested on commit 05023a23, and > compiles (with unused function warnings) after applying the attached > patch. I can get it to work now, with your supplemental patch. > Queries that I expect to be faster are situations where the index does > front-to-back attribute accesses in a tight loop and repeated index > lookups; such as in index builds, data loads, JOINs, or IN () and = > ANY () operations; and then specifically for indexes with only a > single key attribute, or indexes where we can determine based on the > index attributes' types that nocache_index_getattr will be called at > least once for a full _bt_compare call (i.e. att->attcacheoff cannot > be set for at least one key attribute). I did some quick testing of the patch series -- pretty much just reusing my old test suite from the Postgres 12 and 13 nbtree work. This showed that there is a consistent improvement in some cases. It also failed to demonstrate any performance regressions. That's definitely a good start. I saw about a 4% reduction in runtime for the same UK land registry test that you yourself have run in the past for the same patch series [1]. I suspect that there just aren't that many ways to get that kind of speed up with this test case, except perhaps by further compressing the on-disk representation used by nbtree. My guess is that the patch reduces the runtime for this particular test case to a level that's significantly closer to the limit for this particular piece of silicon. Which is not to be sniffed at. Admittedly these test cases were chosen purely because they were convenient. They were originally designed to test space utilization, which isn't affected either way here. I like writing reproducible test cases for indexing stuff, and think that it could work well here too (even though you're not optimizing space utilization at all). A real test suite that targets a deliberately chosen cross section of "index shapes" might work very well. > In the previous iteration, I discerned that there are different > "shapes" of indexes, some of which currently have significant overhead > in the existing btree infrastructure. Especially indexes with multiple > key attributes can see significant overhead while their attributes are > being extracted, which (for a significant part) can be attributed to > the O(n) behaviour of nocache_index_getattr. This O(n) overhead is > currently avoided only by indexes with only a single key attribute and > by indexes in which all key attributes have a fixed size (att->attlen > > 0). Good summary. > The types of btree keys I discerned were: > CREATE INDEX ON tst ... > ... (single_key_attribute) > ... (varlen, other, attributes, ...) > ... (fixed_size, also_fixed, ...) > ... (sometimes_null, other, attributes, ...) > > For single-key-attribute btrees, the performance benefits in the patch > are achieved by reducing branching in the attribute extraction: There > are no other key attributes to worry about, so much of the code > dealing with looping over attributes can inline values, and thus > reduce the amount of code generated in the hot path. I agree that it might well be useful to bucket indexes into several different "index shape archetypes" like this. Roughly the same approach worked well for me in the past. 
This scheme might turn out to be reductive, but even then it could still be very useful (all models are wrong, some are useful, now as ever). > For btrees with multiple key attributes, benefits are achieved if some > attributes are of variable length (e.g. text): > On master, if your index looks like CREATE INDEX ON tst (varlen, > fixed, fixed), for the latter attributes the code will always hit the > slow path of nocache_index_getattr. This introduces a significant > overhead; as that function wants to re-establish that the requested > attribute's offset is indeed not cached and not cacheable, and > calculates the requested attributes' offset in the tuple from > effectively zero. Right. So this particular index shape seems like something that we treat in a rather naive way currently. Can you demonstrate that with a custom test case? (The result I cited before was from a '(varlen,varlen,varlen)' index, which is important, but less relevant.) [1] https://www.postgresql.org/message-id/flat/CAEze2Whwvr8aYcBf0BeBuPy8mJGtwxGvQYA9OGR5eLFh6Q_ZvA%40mail.gmail.com -- Peter Geoghegan
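To make the attcacheoff discussion above concrete, here is a deliberately simplified, standalone C sketch of cached versus uncached attribute offset lookup. The AttrDesc struct, the one-byte length header for variable-length values, and the get_attr() helper are invented for the example; the real index_getattr()/nocache_index_getattr() logic also has to handle nulls and alignment padding.

#include <stddef.h>
#include <stdint.h>

/* Grossly simplified attribute descriptor (invented; no nulls, no alignment). */
typedef struct AttrDesc
{
    int16_t     attlen;         /* > 0: fixed size in bytes; -1: variable */
    int32_t     attcacheoff;    /* cached byte offset, or -1 if not cacheable */
} AttrDesc;

/*
 * Return a pointer to attribute attnum (0-based) inside tupledata.  In this
 * toy format a variable-length attribute stores its total size (header
 * included) in its first byte.
 *
 * With a cached offset the lookup is a single addition.  Once any earlier
 * attribute is variable-length, this attribute's offset differs from tuple
 * to tuple, can never be cached, and every call has to re-walk all
 * preceding attributes: the per-call O(n) cost the thread attributes to
 * nocache_index_getattr() for (varlen, fixed, fixed)-style keys.
 */
const char *
get_attr(const char *tupledata, const AttrDesc *atts, int attnum)
{
    size_t      off = 0;
    int         i;

    if (atts[attnum].attcacheoff >= 0)
        return tupledata + atts[attnum].attcacheoff;        /* fast path */

    for (i = 0; i < attnum; i++)                            /* slow path */
    {
        if (atts[i].attlen > 0)
            off += (size_t) atts[i].attlen;                 /* fixed width */
        else
            off += (size_t) (unsigned char) tupledata[off]; /* read length */
    }
    return tupledata + off;
}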
typos
In docs and comments. Mostly for v15. Maybe Fabien will want to comment on the pgbench one. >From b1fb6d9bae1b0d69a30d59e65ed4f5a10f4141da Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 25 Jan 2022 10:47:55 -0600 Subject: [PATCH 01/19] doc: database SYSTEM since aa01051418f10afbdfa781b8dc109615ca785ff9 --- doc/src/sgml/ref/create_database.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/create_database.sgml b/doc/src/sgml/ref/create_database.sgml index 255ad3a1ce0..b2bb3da5fdb 100644 --- a/doc/src/sgml/ref/create_database.sgml +++ b/doc/src/sgml/ref/create_database.sgml @@ -278,7 +278,7 @@ CREATE DATABASE name The object identifier to be used for the new database. If this - parameter is not specified, the database will choose a suitable + parameter is not specified, the database system will choose a suitable OID automatically. This parameter is primarily intended for internal use by pg_upgrade, and only pg_upgrade can specify a value less -- 2.17.1 >From 5631996e4ebf996ff89f19b52e9bf616cad09d6c Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 16 Feb 2022 21:07:53 -0600 Subject: [PATCH 02/19] doc: Remove 'synchronized' from --no-sync Since it would be more accurate to say "unsynchronized". The corresponding change was made for pgupgrade.sgml in commit 410aa248 --- doc/src/sgml/ref/pg_rewind.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index e808239aa5b..3231f67845a 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -210,7 +210,7 @@ PostgreSQL documentation to be written safely to disk. This option causes pg_rewind to return without waiting, which is faster, but means that a subsequent operating system crash can leave -the synchronized data directory corrupt. Generally, this option is +the data directory corrupt. Generally, this option is useful for testing but should not be used on a production installation. -- 2.17.1 >From a16152c556661be0217b689218def3ac0d308d96 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 16 Feb 2022 21:26:17 -0600 Subject: [PATCH 03/19] doc: json log dc686681e0799b12c40f44f85fc5bfd7fed4e57f --- doc/src/sgml/config.sgml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 6e3e27bed76..30ac8e0137f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -7657,8 +7657,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; - Each log line is serialized as a JSON object as of the following - set of keys with their values. + Each log line is serialized as a JSON object with the following + set of keys and their associated values. 
-- 2.17.1 >From 1d565b227c1184fd0e58b4acf69a289c3e2f3b35 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 17 Feb 2022 21:08:56 -0600 Subject: [PATCH 04/19] doc: s/in local server/on local server/ 94c49d53402240ad7ddbcae9049ff2840a54b9c6 many similar things were fixed in 410aa248e5a883fde4832999cc9b23c7ace0f2ff --- doc/src/sgml/postgres-fdw.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml index d8dc7155874..92d89c0fc99 100644 --- a/doc/src/sgml/postgres-fdw.sgml +++ b/doc/src/sgml/postgres-fdw.sgml @@ -1040,7 +1040,7 @@ postgres=# SELECT postgres_fdw_disconnect_all(); %C - Cluster name in local server + Cluster name on local server (see for details) -- 2.17.1 >From 732aebc934ac8539f49535cb3c17d67805fbd583 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 19 Feb 2022 11:47:35 -0600 Subject: [PATCH 05/19] doc: pg_column_compression(): we say method not algorithm everywhere else could backpatch to v14 --- doc/src/sgml/func.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2ecf0482d84..a34eb8b43e0 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -29285,7 +29285,7 @@ postgres=# SELECT * FROM pg_walfile_name_offset((pg_backup_stop()).lsn); text -Shows the compression algorithm that was used to compress +Shows the compression method that was used to compress an individual variable-length value. Returns NULL if the value is not compressed. -- 2.17.1 >From 0586643c7405d4d922c7cc0739a4d207d73fba8a Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 25 Mar 2022 12:25:04 -0500 Subject: [PATCH 06/19] doc review: postgres_fdw parallel commit This has a unicode apostrophe, unlike everywhere else in the docs. 04e70
Re: Defer selection of asynchronous subplans until the executor initialization stage
On Sun, Apr 10, 2022 at 07:43:48PM +0900, Etsuro Fujita wrote: > On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita wrote: > > On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby wrote: > > > This patch seems to be causing the planner to crash. > > > Here's a query reduced from sqlsmith. > > > > > > | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE > > > 1 <= pg_trigger_depth(); > > > > > > Program terminated with signal SIGSEGV, Segmentation fault. > > > > Reproduced. Will look into this. > > I think the cause of this is that mark_async_capable_plan() failed to > take into account that when the given path is a SubqueryScanPath or > ForeignPath, the given corresponding plan might include a gating > Result node that evaluates pseudoconstant quals. My oversight. :-( > Attached is a patch for fixing that. I think v14 has the same issue, > so I think we need backpatching. Thanks - this seems to resolve the issue. On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote: > Looking at the second hunk of the patch: > FdwRoutine *fdwroutine = path->parent->fdwroutine; > ... > + if (IsA(plan, Result)) > + return false; > > It seems the check of whether plan is a Result node can be lifted ahead of > the switch statement (i.e. to the beginning of mark_async_capable_plan). > > This way, we don't have to check for every case in the switch statement. I think you misread it - the other branch says: if (*not* IsA()) -- Justin
Re: Defer selection of asynchronous subplans until the executor initialization stage
On Sun, Apr 10, 2022 at 7:41 PM Justin Pryzby wrote: > On Sun, Apr 10, 2022 at 07:43:48PM +0900, Etsuro Fujita wrote: > > On Sat, Apr 9, 2022 at 1:58 AM Etsuro Fujita > wrote: > > > On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby > wrote: > > > > This patch seems to be causing the planner to crash. > > > > Here's a query reduced from sqlsmith. > > > > > > > > | explain SELECT 1 FROM information_schema.constraint_column_usage > WHERE 1 <= pg_trigger_depth(); > > > > > > > > Program terminated with signal SIGSEGV, Segmentation fault. > > > > > > Reproduced. Will look into this. > > > > I think the cause of this is that mark_async_capable_plan() failed to > > take into account that when the given path is a SubqueryScanPath or > > ForeignPath, the given corresponding plan might include a gating > > Result node that evaluates pseudoconstant quals. My oversight. :-( > > Attached is a patch for fixing that. I think v14 has the same issue, > > so I think we need backpatching. > > Thanks - this seems to resolve the issue. > > On Sun, Apr 10, 2022 at 06:46:25AM -0700, Zhihong Yu wrote: > > Looking at the second hunk of the patch: > > FdwRoutine *fdwroutine = path->parent->fdwroutine; > > ... > > + if (IsA(plan, Result)) > > + return false; > > > > It seems the check of whether plan is a Result node can be lifted ahead > of > > the switch statement (i.e. to the beginning of mark_async_capable_plan). > > > > This way, we don't have to check for every case in the switch statement. > > I think you misread it - the other branch says: if (*not* IsA()) > > No, I didn't misread: if (!IsA(plan, Result) && mark_async_capable_plan(plan, ((ProjectionPath *) path)->subpath)) return true; return false; If the plan is Result node, false would be returned. So the check can be lifted to the beginning of the func. Cheers
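For readers following along, the two code shapes being compared in this exchange can be compressed into a small sketch. The node tags, structs, and function below are hypothetical stand-ins rather than the real createplan.c code; the point is only where the IsA(plan, Result) test sits relative to the switch.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real planner node types or API. */
typedef enum NodeTag { T_Result, T_SubqueryScan, T_ForeignScan } NodeTag;
typedef struct Plan { NodeTag tag; } Plan;

typedef enum PathTag { PT_SubqueryScanPath, PT_ForeignPath, PT_ProjectionPath } PathTag;
typedef struct Path { PathTag tag; struct Path *subpath; } Path;

#define IsA(ptr, tag_)  ((ptr)->tag == (tag_))

bool
mark_async_capable_sketch(Plan *plan, Path *path)
{
    switch (path->tag)
    {
        case PT_SubqueryScanPath:
        case PT_ForeignPath:
            /* the fix: a gating Result on top means "not async-capable" */
            if (IsA(plan, T_Result))
                return false;
            break;

        case PT_ProjectionPath:
            /* the branch Justin points at: recurse unless plan is a Result */
            if (!IsA(plan, T_Result) &&
                mark_async_capable_sketch(plan, path->subpath))
                return true;
            return false;
    }

    /* ... mark the plan as async-capable here ... */
    return true;
}

Zhihong's suggestion amounts to hoisting a single IsA(plan, Result) test above the switch; whether that preserves the intended behaviour of the projection branch is the open question in this exchange.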
Re: generic plans and "initial" pruning
On Fri, Apr 8, 2022 at 8:45 PM Amit Langote wrote: > Most looked fine changes to me except a couple of typos, so I've > adopted those into the attached new version, even though I know it's > too late to try to apply it. > > + * XXX is it worth doing a bms_copy() on glob->minLockRelids if > + * glob->containsInitialPruning is true?. I'm slighly worried that the > + * Bitmapset could have a very long empty tail resulting in excessive > + * looping during AcquireExecutorLocks(). > + */ > > I guess I trust your instincts about bitmapset operation efficiency > and what you've written here makes sense. It's typical for leaf > partitions to have been appended toward the tail end of rtable and I'd > imagine their indexes would be in the tail words of minLockRelids. If > copying the bitmapset removes those useless words, I don't see why we > shouldn't do that. So added: > > + /* > + * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted > + * bit from it just above to prevent empty tail bits resulting in > + * inefficient looping during AcquireExecutorLocks(). > + */ > + if (glob->containsInitialPruning) > + glob->minLockRelids = bms_copy(glob->minLockRelids) > > Not 100% about the comment I wrote. And the quoted code change missed a semicolon in the v14 that I hurriedly sent on Friday. (Had apparently forgotten to `git add` the hunk to fix that). Sending v15 that fixes that to keep the cfbot green for now. -- Amit Langote EDB: http://www.enterprisedb.com v15-0001-Optimize-AcquireExecutorLocks-to-skip-pruned-par.patch Description: Binary data
Re: PG DOCS - logical replication filtering
On Fri, Apr 8, 2022 at 11:42 AM Peter Smith wrote: > > On Wed, Apr 6, 2022 at 8:58 PM Amit Kapila wrote: > > > > > 3. > > + > > +Whenever an UPDATE is processed, the row filter > > +expression is evaluated for both the old and new row (i.e. before > > +and after the data is updated). > > + > > + > > + > > +If both evaluations are true, it replicates the > > +UPDATE change. > > + > > + > > + > > +If both evaluations are false, it doesn't replicate > > +the change. > > + > > > > I think we can combine these three separate paragraphs. > > The first sentence is the explanation, then there are the 3 separate > “If …” cases mentioned. It doesn’t seem quite right to group just the > first 2 “if…” cases into one paragraph, while leaving the 3rd one > separated. OTOH combining everything into one big paragraph seems > worse. Please confirm if you still want this changed. > Yeah, I think we should have something like what Euler's version had and maybe keep a summary section from the current patch. > > 4. > > + > > +If the publication contains a partitioned table, the publication > > parameter > > +publish_via_partition_root determines which row > > filter > > +is used. > > + > > + > > + > > + > > + If publish_via_partition_root is > > false > > + (default), each partition's row filter is used. > > + > > + > > + > > + > > + > > + If publish_via_partition_root is > > true, > > + the root partitioned table's row filter is > > used. > > + > > + > > + > > + > > + > > > > I think we can combine this into single para as Euler had in his version. > > We can do it, but I am not sure if your review was looking at the > rendered HTML or just looking at the SGML text? IMO using bullets here > ended up being more readable (it is also consistent with other bullet > usages on this page). Please confirm if you still want this changed. > Fair enough. We can keep this part as it is. > > 5. > > + > > + > > + Publication publish operations are ignored > > when copying pre-existing table data. > > + > > + > > + > > + > > + > > + If the subscriber is in a release prior to 15, copy pre-existing data > > + doesn't use row filters even if they are defined in the publication. > > + This is because old releases can only copy the entire table data. > > + > > + > > > > I don't see the need for the first here, the second one seems > > to convey it. > > Well, the 2nd note is only about compatibility with older versions > doing the subscribe. But the 1st note is not version-specific at all. > It is saying that the COPY does not take the “publish” option into > account. If you know of some other docs already mentioning this subtle > behaviour of the COPY then I can remove this note and just > cross-reference to the other place. But I did not know anywhere this > is already mentioned, so that is why I wrote the note about it. > I don't see the need to say about general initial sync (COPY) behavior here. It is already defined at [1]. If we want to enhance, we can do that as a separate patch to make changes where the initial sync is explained. I am not sure that is required though. > > > > 7. > > + > > +The PSQL command \d shows what publications the > > table is > > +a member of, as well as that table's row filter expression (if > > defined) in > > +those publications. > > + > > + > > +- notice table t1 is a member of 2 publications, but > > +has a row filter only in p1. > > + > > + > > +- notice table t2 is a member of 2 publications, and > > +has a different row filter in each of them. > > > > This looks unnecessary to me. 
Let's remove this part. > > I thought something is needed to explain/demonstrate how the user can > know the different row filters for all the publications that the same > table is a member of. Otherwise, the user has to guess (??) what > publications are using their table and then use \dRp+ to dig at all > those publications to find the row filters. > > I can remove all this part from the Examples, but I think at least the > \d should still be mentioned somewhere. IMO I should put that "PSQL > commands" section back (which existed in an earlier version) and just > add a sentence about this. Then this examples part can be removed. > What do you think? > I think the way it is changed in the current patch by moving that explanation down seems okay to me. I feel in the initial "Row Filters" and "Row Filter Rules" sections, we don't need to have separate paragraphs. I think the same is pointed out by Alvaro as well. -- With Regards, Amit Kapila.
Re: generic plans and "initial" pruning
On Sun, Apr 10, 2022 at 8:05 PM Amit Langote wrote: > On Fri, Apr 8, 2022 at 8:45 PM Amit Langote > wrote: > > Most looked fine changes to me except a couple of typos, so I've > > adopted those into the attached new version, even though I know it's > > too late to try to apply it. > > > > + * XXX is it worth doing a bms_copy() on glob->minLockRelids if > > + * glob->containsInitialPruning is true?. I'm slighly worried that the > > + * Bitmapset could have a very long empty tail resulting in excessive > > + * looping during AcquireExecutorLocks(). > > + */ > > > > I guess I trust your instincts about bitmapset operation efficiency > > and what you've written here makes sense. It's typical for leaf > > partitions to have been appended toward the tail end of rtable and I'd > > imagine their indexes would be in the tail words of minLockRelids. If > > copying the bitmapset removes those useless words, I don't see why we > > shouldn't do that. So added: > > > > + /* > > + * It seems worth doing a bms_copy() on glob->minLockRelids if we > deleted > > + * bit from it just above to prevent empty tail bits resulting in > > + * inefficient looping during AcquireExecutorLocks(). > > + */ > > + if (glob->containsInitialPruning) > > + glob->minLockRelids = bms_copy(glob->minLockRelids) > > > > Not 100% about the comment I wrote. > > And the quoted code change missed a semicolon in the v14 that I > hurriedly sent on Friday. (Had apparently forgotten to `git add` the > hunk to fix that). > > Sending v15 that fixes that to keep the cfbot green for now. > > -- > Amit Langote > EDB: http://www.enterprisedb.com Hi, + /* RT index of the partitione table. */ partitione -> partitioned Cheers
Re: [PATCH] Full support for index LP_DEAD hint bits on standby
On Thu, Mar 31, 2022 at 4:57 PM Michail Nikolaev wrote: > I remember you had an idea about using the LP_REDIRECT bit in btree > indexes as some kind of “recently dead” flag (1). > Is this idea still in progress? Maybe an additional bit could provide > a space for a better solution. I think that the best way to make the patch closer to being committable is to make the on-disk representation more explicit. Relying on an implicit or contextual definition for anything seems like something to avoid. This is probably the single biggest problem that I see with the patch. I suggest that you try to "work backwards". If the patch was already committed today, but had subtle bugs, then how would we be able to identify the bugs relatively easily? What would our strategy be then? -- Peter Geoghegan
Re: MERGE bug report
On Sat, Apr 9, 2022 at 5:26 AM Alvaro Herrera wrote: > On 2022-Apr-06, Richard Guo wrote: > > > That's right. The varattno is set to zero for whole-row Var. And in this > > case these whole-row Vars are not included in the targetlist. > > > > Attached is an attempt for the fix. > > Wow, this is very interesting. I was surprised that this patch was > necessary at all -- I mean, if wholerow refs don't work, then why do > references to any other columns work? The answer is that parse_merge.c > is already setting up the subplan's targetlist by expanding all vars of > the source relation. I then remembered than in Simon's (or Pavan's) > original coding, parse_merge.c had a hack to include a var with the > source's wholerow in that targetlist, which I had later removed ... > At first I was wondering whether we need to also include vars used in each action's targetlist, just as what we did for each action's qual. Then later I realized parse_merge.c already did that. But now it looks much better to process them two in preprocess_targetlist. > > I eventually realized that there's no need for parse_merge.c to expand > the source rel at all, and indeed it's wasteful: we can just let > preprocess_targetlist include the vars that are referenced by either > quals or each action's targetlist instead. That led me to the attached > patch, which is not commit-quality yet but it should show what I have in > mind. > This patch looks in a good shape to me. A minor comment is that we can use list_concat_copy(list1, list2) instead of list_concat(list_copy(list1), list2) for better efficiency. Thanks Richard
Re: API stability
(a bit off-topic) I'm not sure where I am..

At Wed, 06 Apr 2022 10:36:30 +0900 (JST), Kyotaro Horiguchi wrote in
me> > this if nobody else would like to do it, but let me ask whether
me> > Kyotaro Horiguchi would like to propose a patch, since the original
me> > patch did, and/or whether you would like to propose a patch, as the
me> > person reporting the issue.
me>
me> I'd like to do that. Let me see.

At Thu, 7 Apr 2022 10:04:20 -0400, Robert Haas wrote in
> struct, which is what we now need to fix. Since I don't hear anyone
> else volunteering to take care of that, I'll go work on it.

Just to confirm: did my message above not look like a declaration that I'd like to volunteer? If so, please teach me the correct way to say that, since I don't want to repeat the same mistake. Or are there some other reasons? (Sorry if this looks like blame, but I'm asking plainly (really:).)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: typos
( Added Joe and Robert for 0011 )

On Mon, 11 Apr 2022 at 14:03, Justin Pryzby wrote:
> In docs and comments. Mostly for v15.

0001: Should this not use PostgreSQL? (new to master)

0002: The patch looks good. (new to v12)

0003: The patch looks good. (new to master)

0004: The patch looks good. (new to master)

0005: I'm not entirely certain this is an improvement. Your commit message I'd say is not true going by git grep "compression algorithm". There are 3 matches in the docs and take [1], for example. I'd say in that one it's better to use "algorithm". In that case, "method" could be talking about client or server. That makes me wonder if the change you're proposing is an improvement or not.

0006: The patch looks good. (new to master)

0007:
-When the --max-tries option is used, the transaction with
-serialization or deadlock error cannot be retried if the total time of
+When the --max-tries option is used, a transaction with
+serialization or deadlock error will not be retried if the total time of

Shouldn't this be slightly clearer and say "a transaction which fails due to a serialization anomaly or a deadlock"?

- database server / the syntax error in the meta command / thread
+ database server / syntax error in the meta command / thread

Should we not separate these items out with commas?

- the client is aborted. Otherwise, if an SQL fails with serialization or
+ the client is aborted. Otherwise, if an SQL command fails with serialization or
 deadlock errors, the client is not aborted. In such cases, the current

I'd say "if an SQL command fails due to a serialization anomaly or due to deadlocking". (new to master)

0008: The patch looks good. (new to master)

0009: The patch looks good. (new to master)

0010: I don't understand this change.

0011: I can't quite parse the original. I might not have enough context here. Robert, Joe? (new to master)

0012: This seems to contain some documentation fixes too. The patch claims it's just for comments.

- * pages that are outwith that range.
+ * pages that are outside that range.

I personally don't see the issue here, but I'm Scottish. I think the best translation is just "outside of" rather than replacing with just "outside". All the other changes look fine.

0013: I think we should fix all these, regardless of how old the mistake is.

0014:
- * shouldn't PANIC just because we can't guarantee the the backup has been
+ * shouldn't PANIC just because we can't guarantee the backup has been

"that the"

0015: Patch looks fine.

0016:
0017: I'm not really sure if we should fix these or not. From having a quick look at some of them it seems we'd be adding churn to some pretty old code. Happy to hear what others think.

0018: The patch looks good.

0019: -1. pgindent will fix these.

I will start pushing the less controversial of these, after a bit of squashing.

David

[1] https://www.postgresql.org/docs/devel/app-pgbasebackup.html
Re: Support logical replication of DDLs
On Thu, Apr 7, 2022 at 3:46 PM Amit Kapila wrote: > > On Wed, Mar 23, 2022 at 10:39 AM Japin Li wrote: > > 2. For DDL replication, do we need to wait for a consistent point of > snapshot? For DMLs, that point is a convenient point to initialize > replication from, which is why we export a snapshot at that point, > which is used to read normal data. Do we have any similar needs for > DDL replication? > I have thought a bit more about this and I think we need to build the snapshot for DML replication as we need to read catalog tables to decode the corresponding WAL but it is not clear to me if we have a similar requirement for DDL replication. If the catalog access is required then it makes sense to follow the current snapshot model, otherwise, we may need to think differently for DDL replication. One more related point is that for DML replication, we do ensure that we copy the entire data of the table (via initial sync) which exists even before the publication for that table exists, so do we want to do something similar for DDLs? How do we sync the schema of the table before the user has defined the publication? Say the table has been created before the publication is defined and after that, there are only Alter statements, so do we expect, users to create the table on the subscriber and then we can replicate the Alter statements? And even if we do that it won't be clear which Alter statements will be replicated after publication is defined especially if those Alters happened concurrently with defining publications? -- With Regards, Amit Kapila.
Re: Support logical replication of DDLs
On Fri, Apr 8, 2022 at 5:04 PM Alvaro Herrera wrote: > > On 2022-Apr-08, Amit Kapila wrote: > > > > For runtime conditions, one of the things you have mentioned in that > > thread is to add schema name in the statement at the required places > > which this patch deals with in a different way by explicitly sending > > it along with the DDL statement. > > Hmm, ok. The point of the JSON-blob route is that the publisher sends a > command representation that can be parsed/processed/transformed > arbitrarily by the subscriber using generic rules; it should be trivial > to use a JSON tool to change schema A to schema B in any arbitrary DDL > command, and produce another working DDL command without having to know > how to write that command specifically. So if I have a rule that > "schema A there is schema B here", all DDL commands can be replayed with > no further coding (without having to rely on getting the run-time > search_path correct.) > Okay, thanks for the clarification. -- With Regards, Amit Kapila.
Outdated copyright year in parse_jsontable.c
Hi, I just noticed that parse_jsontable.c was added with a wrong copyright year, trivial patch attached. While at it I also see that a lot of Spanish translation files seem to be stuck in 2019, but those were fixed in the pgtranslation repo last June so they will get synced eventually.

>From ffd6809ee725f186784fd141473c833b22a89a32 Mon Sep 17 00:00:00 2001
From: Julien Rouhaud
Date: Mon, 11 Apr 2022 14:01:59 +0800
Subject: [PATCH] Fix copyright year in parse_jsontable.c

Oversight in 4e34747c88a.
---
 src/backend/parser/parse_jsontable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/parser/parse_jsontable.c b/src/backend/parser/parse_jsontable.c
index c7dcefa11c..0dea7c998e 100644
--- a/src/backend/parser/parse_jsontable.c
+++ b/src/backend/parser/parse_jsontable.c
@@ -3,7 +3,7 @@
  * parse_jsontable.c
  *	  parsing of JSON_TABLE
  *
- * Portions Copyright (c) 1996-2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  *
--
2.35.0
Re: pg_receivewal fail to streams when the partial file to write is not fully initialized present in the wal receiver directory
At Sat, 9 Apr 2022 18:03:01 +0530, Bharath Rupireddy wrote in > On Tue, Jan 4, 2022 at 1:40 AM SATYANARAYANA NARLAPURAM > wrote: > > > > On Sun, Jan 2, 2022 at 11:56 PM Michael Paquier wrote: > >> Are you referring to the pre-padding when creating a new partial > >> segment, aka when we write chunks of XLOG_BLCKSZ full of zeros until > >> the file is fully created? What kind of error did you see? I guess > >> that a write() with ENOSPC would be more likely, but you got a > >> different problem? > > > > I see two cases, 1/ when no space is left on the device and 2/ when the > > process is taken down forcibly (a VM/container crash) > > Yeah, these cases can occur leaving uninitialized .partial files which > can be a problem for both pg_receivewal and pg_basebackup that uses > dir_open_for_write (CreateWalDirectoryMethod). > > >> I don't disagree with improving such cases, but we > >> should not do things so as there is a risk of leaving behind an > >> infinite set of segments in case of repeated errors > > > > Do you see a problem with the proposed patch that leaves the files behind, > > at least in my testing I don't see any files left behind? I guess that Michael read this patch as creating a temp file with a name like "tmp.n" every time it finds an incomplete file. > With the proposed patch, it doesn't leave the unpadded .partial files. > Also, the v2 patch always removes a leftover .partial.temp file before > it creates a new one. > > >> , and partial > >> segments are already a kind of temporary file. I'm not sure this is true for the pg_receivewal case. The .partial file is not a temporary file but the current working file for the tool. > > if the .partial file exists with not zero-padded up to the wal segment size > > (WalSegSz), then open_walfile fails with the below error. I have two > > options here, 1/ to continue padding the existing partial file and let it > > zero up to WalSegSz , 2/create a temp file as I did in the patch. I thought > > the latter is safe because it can handle corrupt cases as described below. > > Thoughts? I think this patch shouldn't involve pg_basebackup. I agree with Cary that deleting the offending file should be fine. We already skip complete (non-.partial) WAL files with a wrong size in FindStreamingStart, so we could instead error out there and suggest a hint: $ pg_receivewal -D xlog -p 5432 -h /tmp pg_receivewal: error: segment file "0001002200F5.partial" has incorrect size 8404992 hint: You can continue after removing the file. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
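To illustrate the suggestion, here is a minimal standalone sketch of such a check. This is not the proposed patch: the helper name, the use of plain stat()/fprintf() rather than pg_receivewal's internal routines, and the exact integration point in FindStreamingStart are all assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: verify that an existing .partial file is fully
 * pre-padded to the expected WAL segment size, and bail out with a hint
 * instead of silently skipping it, along the lines suggested above for
 * FindStreamingStart.
 */
static void
check_partial_file_size(const char *path, long long wal_seg_sz)
{
	struct stat statbuf;

	if (stat(path, &statbuf) != 0)
	{
		fprintf(stderr, "pg_receivewal: error: could not stat file \"%s\"\n", path);
		exit(1);
	}

	if ((long long) statbuf.st_size != wal_seg_sz)
	{
		fprintf(stderr,
				"pg_receivewal: error: segment file \"%s\" has incorrect size %lld\n",
				path, (long long) statbuf.st_size);
		fprintf(stderr, "hint: You can continue after removing the file.\n");
		exit(1);
	}
}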
Re: Practical Timing Side Channel Attacks on Memory Compression
Thanks for all of your opinions. I have much the same feeling. The best layer for mitigation is probably the user application: it can arrange the correct data layout in the database, set up access limits for the app, and apply many other mitigation mechanisms. -Filip- On Wed, 6 Apr 2022 at 17:24, Greg Stark wrote: > On Wed, 6 Apr 2022 at 10:29, Robert Haas wrote: > > > > I think that the paper shows that, under the right set of > > circumstances, a timing-based attack is possible here. > > Generally any argument that an attack is infeasible is risky and > usually leads to security professionals showing that surprisingly > difficult attacks are entirely feasible. > > However I think the opposite argument is actually much more > compelling. There are *so many* timing attacks on a general purpose > computing platform like Postgres that any defense against them can't > concentrate on just one code path and has to take a more comprehensive > approach. > > So for example a front-end can add some stochastic latency or perhaps > pad latency to a fixed amount. > > Perhaps postgres could offer some protection at that level by e.g. > offering a function to do it. For most users I think they're better > off implementing it at the application level but some people use > database stored functions as their application level so it might be > useful for them. > > Something like pg_sleep_until_multiple_of('50ms') which looks at the > transaction start time and calculates the amount of time to sleep > automatically. > > > -- > greg > >
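To make the idea concrete, here is a rough sketch of what such a function could look like as a C extension function. No such function exists in core; the name follows Greg's suggestion, and taking the interval as an int64 number of microseconds instead of parsing a text argument like '50ms' is a simplification.

#include "postgres.h"

#include "access/xact.h"
#include "fmgr.h"
#include "utils/timestamp.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_sleep_until_multiple_of);

/*
 * Hypothetical sketch: look at the transaction start time, round the
 * elapsed time up to the next multiple of the given interval (in
 * microseconds), and sleep for the remainder, so that the caller's
 * total latency is padded to a fixed granularity.
 */
Datum
pg_sleep_until_multiple_of(PG_FUNCTION_ARGS)
{
	int64		interval_us = PG_GETARG_INT64(0);	/* e.g. 50000 for 50ms */
	TimestampTz	xact_start = GetCurrentTransactionStartTimestamp();
	TimestampTz	now = GetCurrentTimestamp();
	int64		elapsed_us = now - xact_start;
	int64		target_us;

	if (interval_us <= 0)
		PG_RETURN_VOID();

	/* Round the elapsed time up to the next multiple of the interval. */
	target_us = ((elapsed_us / interval_us) + 1) * interval_us;

	pg_usleep(target_us - elapsed_us);

	PG_RETURN_VOID();
}

After declaring it with CREATE FUNCTION ... RETURNS void LANGUAGE C, a sensitive stored function could end with SELECT pg_sleep_until_multiple_of(50000) so that its observable latency is always rounded up to the next 50ms boundary.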
RE: Logical replication timeout problem
On Wed, Apr 7, 2022 at 1:34 PM Amit Kapila wrote: > On Wed, Apr 6, 2022 at 6:30 PM wangw.f...@fujitsu.com > wrote: > > > > On Wed, Apr 6, 2022 at 1:58 AM Amit Kapila wrote: > > On Wed, Apr 6, 2022 at 4:32 AM Amit Kapila wrote: > > > Also, let's try to evaluate how it impacts lag functionality for large > transactions? > > I think this patch will not affect lag functionality. It will updates the > > lag > > field of view pg_stat_replication more frequently. > > IIUC, when invoking function WalSndUpdateProgress, it will store the lsn of > > change and invoking time in lag_tracker. Then when invoking function > > ProcessStandbyReplyMessage, it will calculate the lag field according to the > > message from subscriber and the information in lag_tracker. This patch does > > not modify this logic, but only increases the frequency of invoking. > > Please let me know if I understand wrong. > > > > No, your understanding seems correct to me. But what I want to check > is if calling the progress function more often has any impact on > lag-related fields in pg_stat_replication? I think you need to check > the impact of large transaction replay. Thanks for the explanation. After doing some checks, I found that the v13 patch makes the lag calculation inaccurate. In short, the v13 patch makes us try to track lag more frequently and to send keepalive messages to subscribers. But in order to avoid flooding the lag tracker, we do not track lag more than once within WALSND_LOGICAL_LAG_TRACK_INTERVAL_MS (see function WalSndUpdateProgress). This means we may lose information that needs to be tracked. For example, suppose there is a large transaction spanning lsn1 to lsn3. In HEAD, when we calculate the lag time for lsn3, it is (now - lsn3.time). But with the v13 patch, because lag_tracker may contain no entry for lsn3 but only one for lsn2, the lag time calculated for lsn3 becomes (now - lsn2.time) (see function LagTrackerRead). Therefore, if we lose information that needs to be tracked, the reported lag becomes large and inaccurate. So, just like current HEAD, I skip tracking lag during a transaction. The new patch is attached. Regards, Wang wei v14-0001-Fix-the-logical-replication-timeout-during-large.patch Description: v14-0001-Fix-the-logical-replication-timeout-during-large.patch
Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman
On Sat, Apr 9, 2022 at 12:59 PM Andres Freund wrote: > On 2022-04-08 17:55:51 -0400, Tom Lane wrote: > > I tried adjusting the patch so it does guarantee that (as attached), > > and in two out of two tries it got past the archive_cleanup_command > > failure but then hung up waiting for standby2 to promote. > > Adding > > $node_standby->safe_psql('postgres', "SELECT pg_switch_wal()"); > just after > $node_standby2->start; > > makes the tests pass here. Sorry for the delay... I got a bit confused about the different things going on in this thread but I hope I've got it now: 1. This test had some pre-existing bugs/races, which hadn't failed before due to scheduling, even under Valgrind. The above changes appear to fix those problems. To Michael for comment. > What is that second test really testing? > > # Check the presence of temporary files specifically generated during > # archive recovery. To ensure the presence of the temporary history > # file, switch to a timeline large enough to allow a standby to recover > # a history file from an archive. As this requires at least two timeline > # switches, promote the existing standby first. Then create a second > # standby based on the promoted one. Finally, the second standby is > # promoted. > > Note "Then create a second standby based on the promoted one." - but that's > not actually what's happening: 2. There may also be other problems with the test but those aren't relevant to skink's failure, which starts on the 5th test. To Michael for comment. > But I think there's also something funky in the prefetching logic. I think it > may attempt restoring during prefetching somehow, even though there's code > that appears to try to prevent that? 3. Urghl. Yeah. There is indeed code to report XLREAD_WOULDBLOCK for the lastSourceFailed case, but we don't really want to do that more often than every 5 seconds. So I think I need to make that "sticky", so that we don't attempt to prefetch again (ie to read from the next segment, which invokes restore_command) until we've replayed all the records we already have, then hit the end and go to sleep. The attached patch does that, and makes the offending test pass under Valgrind for me, even without the other changes already mentioned. If I understand correctly, this is due to a timing race in the tests (though I didn't check where exactly), because all those extra fork/exec calls are extremely slow under Valgrind. > And interestingly I'm not seeing the > "switched WAL source from stream to archive after failure" > lines I'd expect. I see them now. It's because it gives up when it's reading ahead (nonblocking), which may not be strictly necessary but I found it simpler to think about. Then when it tries again in 5 seconds it's in blocking mode so it doesn't give up so easily. 2022-04-11 18:15:08.220 NZST [524796] DEBUG: switched WAL source from stream to archive after failure cp: cannot stat '/tmp/archive/00010017': No such file or directory 2022-04-11 18:15:08.226 NZST [524796] DEBUG: could not restore file "00010017" from archive: child process exited with exit code 1 2022-04-11 18:15:08.226 NZST [524796] DEBUG: could not open file "pg_wal/00010017": No such file or directory 2022-04-11 18:15:08.226 NZST [524796] DEBUG: switched WAL source from archive to stream after failure From e488e20f7c364ba1adaae8d1f6ea92a7f5d0bda6 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 11 Apr 2022 17:16:24 +1200 Subject: [PATCH] Don't retry restore_command while reading ahead. 
Suppress further attempts to read ahead in the WAL if we run out of data, until records already decoded have been replayed. When replaying from archives, this avoids repeatedly retrying the restore_command until after we've slept for 5 seconds. When replaying from a network stream, this makes us less aggressive in theory, but we probably can't benefit from being more aggressive yet anyway as the flushedUpto variable doesn't advance until we run out of data (though we could improve that in future). While here, fix a potential off-by one confusion with the no_readahead_until mechanism: we suppress readahead until after the record that begins at that LSN has been replayed (ie we are replaying a higher LSN), so change < to <=. Defect in commit 5dc0418f. Reported-by: Andres Freund Discussion: https://postgr.es/m/20220409005910.alw46xqmmgny2sgr%40alap3.anarazel.de --- src/backend/access/transam/xlogprefetcher.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c index f342d2..43e82b65c2 100644 --- a/src/backend/access/transam/xlogprefetcher.c +++ b/src/backend/access/transam/xlogprefetcher.c @@ -487,8 +487,8 @@ XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn) */ nonblocking = XLogReaderHasQueuedRecordOrError(reader); - /* Certain recor
Fixes for compression options of pg_receivewal and refactoring of backup_compression.{c,h}
Hi all, (Added Robert and Georgios in CC:) Since ba5 and the introduction of LZ4, I have reworked the way compression is controlled for pg_receivewal, with two options: - --compress-method, settable to "gzip", "none" or "lz4". - --compress, to pass down a compression level, where the allowed range is 1-9. If passing down 0, we'd get an error rather than implying no compression, contrary to what we did in ~14. I initially thought that this was fine as-is, but then Robert and others worked on client/server compression for pg_basebackup, introducing a much better design with centralized APIs where one can use METHOD:DETAIL as the compression value (e.g. "gzip:level=5"), where DETAIL is a comma-separated list of keyword=value (keyword = "level" or "workers"), with centralized checks and an extensible design. This is something I think we had better fix before beta1, because now we have binaries that use an inconsistent set of options. So, attached is a patch set aimed at reworking this option set from the ground up, taking advantage of the recent work done by Robert and others for pg_basebackup: - 0001 is a simple rename of backup_compression.{c,h} to compression.{c,h}, removing anything related to base backups from that. One extra reason behind this renaming is that I would like to use this infra for pg_dump, but that's material for 16~. - 0002 removes WalCompressionMethod, replacing it by pg_compress_algorithm as these are the same enums. Robert complained about the confusion that WalCompressionMethod could lead to as this could be used for the compression of data, and not only WAL. I have renamed some variables to be more consistent, while at it. - 0003 is the actual option rework for pg_receivewal. This takes advantage of 0001, removing --compress-method and replacing it with --compress, taking care of the backward compatibility problems for an integer value, aka 0 implies no compression and val > 0 implies gzip. One bonus reason to switch to that is that this would make the addition of zstd for pg_receivewal easier in the future. I am going to add an open item for this stuff. Comments or thoughts? Thanks, -- Michael From ccef39021a254c3ca4d0a380f138bf47e9b9b9c7 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 11 Apr 2022 13:50:18 +0900 Subject: [PATCH v1 1/3] Rename backup_compression.{c,h} to compression.{c,h} Compression option handling (level, algorithm or even workers) can be used across several parts of the system and not only base backups, with at least pg_receivewal and pg_dump. Structures, objects and routines are renamed in consequence, to remove the concept of base backups from that.
--- src/include/common/backup_compression.h | 46 - src/include/common/compression.h | 46 + src/include/replication/basebackup_sink.h | 8 +-- src/backend/replication/basebackup.c | 30 - src/backend/replication/basebackup_gzip.c | 4 +- src/backend/replication/basebackup_lz4.c | 4 +- src/backend/replication/basebackup_zstd.c | 10 +-- src/common/Makefile | 2 +- .../{backup_compression.c => compression.c} | 66 +-- src/bin/pg_basebackup/bbstreamer.h| 8 +-- src/bin/pg_basebackup/bbstreamer_gzip.c | 4 +- src/bin/pg_basebackup/bbstreamer_lz4.c| 4 +- src/bin/pg_basebackup/bbstreamer_zstd.c | 6 +- src/bin/pg_basebackup/nls.mk | 2 +- src/bin/pg_basebackup/pg_basebackup.c | 44 ++--- src/tools/msvc/Mkvcbuild.pm | 2 +- 16 files changed, 143 insertions(+), 143 deletions(-) delete mode 100644 src/include/common/backup_compression.h create mode 100644 src/include/common/compression.h rename src/common/{backup_compression.c => compression.c} (80%) diff --git a/src/include/common/backup_compression.h b/src/include/common/backup_compression.h deleted file mode 100644 index 6a0ecaa99c..00 --- a/src/include/common/backup_compression.h +++ /dev/null @@ -1,46 +0,0 @@ -/*- - * - * backup_compression.h - * - * Shared definitions for backup compression methods and specifications. - * - * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group - * - * IDENTIFICATION - * src/common/backup_compression.h - *- - */ - -#ifndef BACKUP_COMPRESSION_H -#define BACKUP_COMPRESSION_H - -typedef enum bc_algorithm -{ - BACKUP_COMPRESSION_NONE, - BACKUP_COMPRESSION_GZIP, - BACKUP_COMPRESSION_LZ4, - BACKUP_COMPRESSION_ZSTD -} bc_algorithm; - -#define BACKUP_COMPRESSION_OPTION_LEVEL (1 << 0) -#define BACKUP_COMPRESSION_OPTION_WORKERS (1 << 1) - -typedef struct bc_specification -{ - bc_algorithm algorithm; - unsigned options; /* OR of BACKUP_COMPRESSION_OPTION constants */ - int level; - int workers; - char *parse_error; /*
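As an illustration of where 0003 is headed, here is a hypothetical sketch of how pg_receivewal could parse a --compress value such as "gzip:level=5" with the renamed API. The enum and struct fields follow the header shown above (modulo the renaming described in 0001/0002, so the pg_compress_* type names are assumptions), and the helper names parse_compress_algorithm, parse_compress_specification and validate_compress_specification are assumptions mirroring the pre-rename backup_compression.c routines. This is not the attached patch.

#include "postgres_fe.h"

#include "common/compression.h"
#include "common/logging.h"

/*
 * Hypothetical sketch: split a METHOD[:DETAIL] value such as
 * "gzip:level=5", resolve the algorithm, then let the common code parse
 * and validate the DETAIL part (level, workers).  Helper and type names
 * are assumptions based on the renaming described in 0001/0002.
 */
static pg_compress_specification
parse_compress_option(char *value)
{
	pg_compress_specification spec;
	pg_compress_algorithm algorithm;
	char	   *detail;
	char	   *error;

	detail = strchr(value, ':');
	if (detail != NULL)
		*detail++ = '\0';		/* terminate METHOD, point at DETAIL */

	if (!parse_compress_algorithm(value, &algorithm))
	{
		pg_log_error("unrecognized compression algorithm \"%s\"", value);
		exit(1);
	}

	parse_compress_specification(algorithm, detail, &spec);
	error = validate_compress_specification(&spec);
	if (error != NULL)
	{
		pg_log_error("invalid compression specification: %s", error);
		exit(1);
	}

	return spec;
}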
Re: Outdated copyright year in parse_jsontable.c
On Mon, Apr 11, 2022 at 02:08:38PM +0800, Julien Rouhaud wrote: > I just noticed that parse_jsontable.c was added with a wrong copyright year, > trivial patch attached. Thanks, I'll go fix that in a bit. I am spotting four more in src/backend/replication/basebackup_*.c that point to 2020. -- Michael signature.asc Description: PGP signature