Re: Using COPY FREEZE in pgbench
Hi Fabien,

>> So overall gain by the patch is around 15%, whereas the last test
>> before the commit was 14%. It seems the patch is still beneficial
>> after the commit.
>
> Yes, that's good!

Yeah!

> I had a quick look again, and about the comment:
>
> /*
>  * If partitioning is not enabled and server version is 14.0 or later, we
>  * can take account of freeze option of copy.
>  */
>
> I'd suggest instead the shorter:
>
> /* use COPY with FREEZE on v14 and later without partitioning */
>
> Or maybe even to fully drop the comment, because the code is clear
> enough and the doc already says it.

I'd prefer to leave a comment. People (including me) tend to forget things in the future
that are obvious now :-)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 0c60077e1f..0f432767c2 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -220,6 +220,9 @@ pgbench options d
         data is generated in pgbench client and then sent to the server. This
         uses the client/server bandwidth extensively through a COPY.
+        pgbench uses the FREEZE option with 14 or later
+        versions of PostgreSQL to speed up
+        subsequent VACUUM, unless partitions are enabled.
         Using g causes logging to print one message
         every 100,000 rows while generating data for the
         pgbench_accounts table.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..54d993075f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4119,6 +4119,7 @@ initGenerateDataClientSide(PGconn *con)
 	PGresult   *res;
 	int			i;
 	int64		k;
+	char	   *copy_statement;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4165,7 +4166,15 @@ initGenerateDataClientSide(PGconn *con)
 	/*
 	 * accounts is big enough to be worth using COPY and tracking runtime
 	 */
-	res = PQexec(con, "copy pgbench_accounts from stdin");
+
+	/* use COPY with FREEZE on v14 and later without partitioning */
+	if (partitions == 0 && PQserverVersion(con) >= 140000)
+		copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+	else
+		copy_statement = "copy pgbench_accounts from stdin";
+
+	res = PQexec(con, copy_statement);
+
 	if (PQresultStatus(res) != PGRES_COPY_IN)
 	{
 		pg_log_fatal("unexpected copy in result: %s", PQerrorMessage(con));
Re: Numeric multiplication overflow errors
On Sat, 3 Jul 2021 at 11:04, Dean Rasheed wrote: > Thinking about this more, I think it's best not to risk back-patching. > It *might* be safe, but it's difficult to really be sure of that. The > bug itself is pretty unlikely to ever happen in practice, hence the > lack of prior complaints, and in fact I only found it by an > examination of the code. So it doesn't seem to be worth the risk. That seems like good logic to me. Perhaps we can reconsider that decision if users complain about it. David
Re: rand48 replacement
On Sat, 3 Jul 2021 at 08:06, Fabien COELHO wrote:
>
> Here is a v4, which:
>
>  - moves the stuff to common and fully removes random/srandom (Tom)
>  - includes a range generation function based on the bitmask method (Dean)
>    but iterates with splitmix so that the state always advances once (Me)

At the risk of repeating myself: do *not* invent your own scheme.

The problem with iterating using splitmix is that splitmix is a simple shuffling function
that takes a single input and returns a mutated output depending only on that input. So
let's say for simplicity that you're generating numbers in the range [0,N) with N=2^64-n
and n<2^63. Each of the n values in [N,2^64) that lie outside the range wanted are just
mapped in a deterministic way back onto (at most) n values in the range [0,N), making
those n values twice as likely to be chosen as the other N-n values.

So what you've just invented is an algorithm with the complexity of the unbiased bitmask
method, but basically the same bias as the biased integer multiply method.

I don't understand why you object to advancing the state more than once. Doing so doesn't
make the resulting sequence of numbers any less deterministic. In fact, I'm pretty sure
you *have to* advance the state more than once in *any* unbiased scheme. That's a common
characteristic of all the unbiased methods I've seen, and intuitively, I think it has to
be so.

Otherwise, I'm happy with the use of the bitmask method, as long as it's implemented
correctly.

Regards,
Dean
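For concreteness, here is a minimal, self-contained C sketch of the "state advances only
once" variant being discussed, as read from this exchange; the names and structure are
illustrative only and are not the patch's actual code. On rejection it re-rolls purely
from the rejected value with a splitmix64 step, which is exactly what makes some in-range
values twice as likely.

#include <stdint.h>

/* splitmix64 step: a fixed, deterministic shuffle driven by one 64-bit input */
static uint64_t
splitmix64(uint64_t *x)
{
	uint64_t	z = (*x += UINT64_C(0x9E3779B97F4A7C15));

	z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
	z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
	return z ^ (z >> 31);
}

/*
 * Contested variant: "first_draw" is the single value taken from the main
 * generator (whose state therefore advances exactly once); "mask" is the
 * smallest all-ones bitmask covering range-1.  Any out-of-range value is
 * re-rolled from the rejected value alone, so each out-of-range draw lands
 * on a value fully determined by that draw rather than on a fresh random
 * draw -- the source of the bias described above.
 */
static uint64_t
range_state_advances_once(uint64_t first_draw, uint64_t range, uint64_t mask)
{
	uint64_t	val = first_draw & mask;
	uint64_t	seed = first_draw;

	while (val >= range)
		val = splitmix64(&seed) & mask;

	return val;
}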
Re: Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values
On Sat, 3 Jul 2021 at 22:44, Peter Eisentraut wrote: > I don't think this is a good change. > I think we should leave it as is. I'm inclined to agree. When I mentioned adding a comment I'd not imagined it would be quite so verbose. Plus, I struggle to imagine there's any compiler out there that someone would use that wouldn't just remove the check anyway. I had a quick click around on https://godbolt.org/z/PnKeq5bsT and didn't manage to find any compilers that didn't remove the check. David
Re: rand48 replacement
Hello Dean,

>> - moves the stuff to common and fully removes random/srandom (Tom)
>> - includes a range generation function based on the bitmask method (Dean)
>>   but iterates with splitmix so that the state always advances once (Me)
>
> At the risk of repeating myself: do *not* invent your own scheme.
>
> The problem with iterating using splitmix is that splitmix is a simple
> shuffling function that takes a single input and returns a mutated output
> depending only on that input.

It also iterates over its 64 bits state in a round robin fashion so that the cycle size
is 2^64 (it is a simple add).

> So let's say for simplicity that you're generating numbers in the range
> [0,N) with N=2^64-n and n<2^63. Each of the n values in [N,2^64) that lie
> outside the range wanted are just mapped in a deterministic way back onto
> (at most) n values in the range [0,N), making those n values twice as
> likely to be chosen as the other N-n values.

I do understand your point. If the value is outside the range, splitmix iterates over its
seed and the extraction function produces a new number which is tested again. I do not
understand the "mapped back onto" part, the out of range value is just discarded and the
loop starts over with a new derivation, and why it would imply that some values are more
likely to come out.

> So what you've just invented is an algorithm with the complexity of the
> unbiased bitmask method,

That is what I am trying to implement.

> but basically the same bias as the biased integer multiply method.

I did not understand.

> I don't understand why you object to advancing the state more than once.
> Doing so doesn't make the resulting sequence of numbers any less
> deterministic.

It does, somehow, hence my struggle to try to avoid it.

  call seed(0xdeadbeef);
  x1 = somepseudorand();
  x2 = somepseudorand();
  x3 = somepseudorand();

I think we should want x3 to be the same result whatever the previous calls to the API.

> In fact, I'm pretty sure you *have to* advance the state more than once
> in *any* unbiased scheme. That's a common characteristic of all the
> unbiased methods I've seen, and intuitively, I think it has to be so.

Yes and no. We can advance another state seeded by the root prng.

> Otherwise, I'm happy with the use of the bitmask method, as long as it's
> implemented correctly.

I did not understand why it is not correct.

--
Fabien.
Re: Update maintenance_work_mem/autovacuum_work_mem to reflect the 1GB limitation with VACUUM
On Sat, 3 Jul 2021 at 00:40, Laurenz Albe wrote:
>
> On Fri, 2021-07-02 at 23:31 +1200, David Rowley wrote:
> > I had a look at the patch in [1] and I find it a bit weird that we'd
> > write the following about autovacuum_work_mem in our docs:
> >
> > +
> > +    Note that VACUUM has a hard-coded limit of 1GB
> > +    for the amount of memory used, so setting
> > +    autovacuum_work_mem higher than that has no effect.
> > +
> >
> > Since that setting is *only* used for auto vacuum, why don't we just
> > limit the range of the GUC to 1GB?
> >
> > Of course, it wouldn't be wise to backpatch the reduced limit of
> > autovacuum_work_mem as it might upset people who have higher values in
> > their postgresql.conf when their database fails to restart after an
> > upgrade. I think what might be best is just to reduce the limit in
> > master and apply the doc patch for just maintenance_work_mem in all
> > supported versions. We could just ignore doing anything with
> > autovacuum_work_mem in the back branches and put it down to a
> > historical mistake that can't easily be fixed now.
>
> I think that is much better.
> I am fine with that patch.

Thanks for looking. I've pushed the doc fix patch for maintenance_work_mem and
back-patched it to 9.6.

I could do with a second opinion on whether we should just adjust the maximum value of
the autovacuum_work_mem GUC to 1GB in master. I'm also not sure whether, since we'd not
backpatch the GUC max value adjustment, we need to document the upper limit in the
manual.

David

Attachment: set_maxvalue_for_autovacuum_work_mem_to_1gb.patch
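For reference, capping the GUC would be a small change to its config_int entry in
src/backend/utils/misc/guc.c. The following is only a rough sketch of the shape of that
entry (reconstructed from memory, so details may differ from the real table and from the
attached patch), with the maximum dropped from MAX_KILOBYTES to 1GB expressed in kB:

/* sketch only: approximate shape of the ConfigureNamesInt[] entry in guc.c */
{
	{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
		gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
		NULL,
		GUC_UNIT_KB
	},
	&autovacuum_work_mem,
	-1, -1, 1024 * 1024,		/* was MAX_KILOBYTES; 1GB in kB, matching VACUUM's hard-coded cap */
	check_autovacuum_work_mem, NULL, NULL
},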
Re: Reducing the cycle time for CLOBBER_CACHE_ALWAYS buildfarm members
On 7/3/21 6:59 PM, Tom Lane wrote:
> I wrote:
>> Andrew Dunstan writes:
>>> Seems reasonable. I don't have a CCA animal any more, but I guess I
>>> could set up a test.
>> I can run a test here --- I'll commandeer sifaka for awhile,
>> since that's the fastest animal I have.
>
> Done, and here's the results:
>
> Traditional way (-DCLOBBER_CACHE_ALWAYS): 32:53:24
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2021-07-01%2018%3A06%3A09
>
> New way: 16:15:43
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2021-07-03%2004%3A02%3A16
>
> That's running sifaka's full test schedule including TAP tests,
> which ordinarily takes it a shade under 10 minutes. The savings
> on a non-TAP run would be a lot less, of course, thanks to
> fewer initdb invocations.
>
> Although I lacked the patience to run a full back-branch cycle
> with -DCLOBBER_CACHE_ALWAYS, I did verify that the patch
> correctly injects that #define when running an old branch.
> So I think it's ready to go into the buildfarm, modulo any
> cosmetic work you might want to do.

Yeah, I'm looking at it now. A couple of things: I think we should probably call the
setting 'use_clobber_cache_always', since that's what it does. And I think we should
probably put in a sanity check to make it incompatible with any -DCLOBBER_CACHE_* define
in CPPFLAGS. Thoughts?

There is one item I want to complete before putting out a new client release - making
provision for a change in the name of the default git branch - the aim is that with the
new release in place that will be completely seamless whenever it happens and whatever
name is chosen. I hope to have that done in a week or so, so the new release would be out
in about two weeks, if all goes well.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: rand48 replacement
On Sun, 4 Jul 2021 at 10:35, Fabien COELHO wrote:
>
> I did not understand why it is not correct.

Well, to make it easier to visualise, let's imagine our word size is just 3 bits instead
of 64 bits, and that the basic prng() function generates numbers in the range [0,8).
Similarly, imagine a splitmix3() that operates on 3-bit values. So it might do something
like this (state offset and return values made up):

splitmix3(state):
  state=0 -> 5, return 2
  state=1 -> 6, return 5
  state=2 -> 7, return 0
  state=3 -> 0, return 3
  state=4 -> 1, return 6
  state=5 -> 2, return 1
  state=6 -> 3, return 7
  state=7 -> 4, return 4

Now suppose we want a random number in the range [0,6). This is what happens with your
algorithm for each of the possible prng() return values:

prng() returns 0 -- OK
prng() returns 1 -- OK
prng() returns 2 -- OK
prng() returns 3 -- OK
prng() returns 4 -- OK
prng() returns 5 -- OK
prng() returns 6 -- out of range so use splitmix3() with initial state=6:
  state=6 -> 3, return 7 -- still out of range, so repeat
  state=3 -> 0, return 3 -- now OK
prng() returns 7 -- out of range so use splitmix3() with initial state=7:
  state=7 -> 4, return 4 -- now OK

So, assuming that prng() chooses each of the 8 possible values with equal probability
(1/8), the overall result is that the values 0,1,2 and 5 are returned with a probability
of 1/8, whereas 3 and 4 are returned with a probability of 2/8.

Using the correct implementation of the bitmask algorithm, each iteration calls prng()
again, so in the end no particular return value is ever more likely than any other
(hence it's unbiased).

As for determinism, the end result is still fully deterministic. For example, lets say
that prng() returns the following sequence, for some initial state:

  1,0,3,0,3,7,4,7,6,6,5,3,7,7,7,0,3,6,5,2,3,3,4,0,0,2,7,4,...

then the bitmask method just returns that sequence with all the 6's and 7's removed:

  1,0,3,0,3,4,5,3,0,3,5,2,3,3,4,0,0,2,4,...

and that same sequence will always be returned, when starting from that initial seed.

Regards,
Dean
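To make the contrast concrete, here is a minimal, self-contained C sketch of the unbiased
bitmask method described above, where every rejection goes back to the same generator
(and therefore may advance its state more than once). The toy generator and names are
illustrative only; this is not the code under review.

#include <stdint.h>

/* toy generator state; a stand-in for the real generator's larger state */
typedef struct toy_prng_state
{
	uint64_t	s;
} toy_prng_state;

/* splitmix64 used here only as a compact stand-in generator */
static uint64_t
toy_prng_next(toy_prng_state *state)
{
	uint64_t	z = (state->s += UINT64_C(0x9E3779B97F4A7C15));

	z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
	z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
	return z ^ (z >> 31);
}

/* smallest all-ones bitmask covering v (v > 0) */
static uint64_t
mask_u64(uint64_t v)
{
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v;
}

/*
 * Unbiased bitmask rejection: mask the draw down to the smallest power of
 * two covering the range and, if it is still out of range, throw it away
 * and draw again from the *same* generator.  No out-of-range value is ever
 * remapped onto an in-range one, so all values in [0, range) stay equally
 * likely; the price is that the generator state may advance more than once
 * per call.
 */
static uint64_t
range_bitmask(toy_prng_state *state, uint64_t range)
{
	uint64_t	mask;
	uint64_t	val;

	if (range < 2)
		return 0;

	mask = mask_u64(range - 1);
	do
	{
		val = toy_prng_next(state) & mask;
	} while (val >= range);

	return val;
}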
Re: Mention --enable-tap-tests in the TAP section page
On Fri, Jul 02, 2021 at 09:52:10AM -0400, Andrew Dunstan wrote:
> Agreed.

Applied.
--
Michael
PostgreSQL-13.3 parser.y with positional references by named references
Here https://gist.github.com/mingodad/49291e0e9505522c66fcd3fcea4a939d I posted the
postgresql-13.3/src/backend/parser/gram.y with the positional references replaced by
named references, which bison has supported for some time now. It was done with a custom
script and some comments are missing; if there is any interest in accepting it, I could
try to work on it to include the missing comments and a different code layout. It
compiles on Ubuntu 18.04.

I did a similar contribution here https://github.com/facebookincubator/CG-SQL/pull/6

And here is a snippet of what it looks like:

opt_all_clause:
			ALL				{ $opt_all_clause = NIL; }
			| /*EMPTY*/		{ $opt_all_clause = NIL; }
		;

opt_sort_clause:
			sort_clause		{ $opt_sort_clause = $sort_clause; }
			| /*EMPTY*/		{ $opt_sort_clause = NIL; }
		;

sort_clause:
			ORDER BY sortby_list	{ $sort_clause = $sortby_list; }
		;

sortby_list:
			sortby							{ $sortby_list = list_make1($sortby); }
			| sortby_list[rhs_1] ',' sortby	{ $$ /* sortby_list */ = lappend($rhs_1, $sortby); }
		;

sortby:		a_expr USING qual_all_Op opt_nulls_order
				{
					$sortby = makeNode(SortBy);
					$sortby->node = $a_expr;
					$sortby->sortby_dir = SORTBY_USING;
					$sortby->sortby_nulls = $opt_nulls_order;
					$sortby->useOp = $qual_all_Op;
					$sortby->location = @qual_all_Op;
				}
			| a_expr opt_asc_desc opt_nulls_order
				{
					$sortby = makeNode(SortBy);
					$sortby->node = $a_expr;
					$sortby->sortby_dir = $opt_asc_desc;
					$sortby->sortby_nulls = $opt_nulls_order;
					$sortby->useOp = NIL;
					$sortby->location = -1;		/* no operator */
				}
		;

Cheers!
Re: ATTACH PARTITION locking documentation for DEFAULT partitions
On Sat, 17 Apr 2021 at 00:03, Matthias van de Meent wrote:
> PFA an updated patch. I've updated the wording of the previous patch,
> and also updated this section in alter_table.sgml, but with different
> wording, explictly explaining the process used to validate the altered
> default constraint.

I had to squint at this:

+ALTER TABLE measurement_default ADD CONSTRAINT excl_y2008m02
+    CHECK ( (logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01') IS FALSE );

I tried your example and it does not work.

set client_min_messages = 'debug1';
create table rp (dt date not null) partition by range(dt);
create table rp_default partition of rp default;
alter table rp_default add constraint rp_default_chk
  check ((dt >= '2022-01-01' and dt < '2023-01-01') is false);
create table rp_2022 partition of rp for values from ('2022-01-01') to ('2023-01-01');

There's no debug message to indicate that the constraint was used. Let's try again:

alter table rp_default drop constraint rp_default_chk;
drop table rp_2022;
alter table rp_default add constraint rp_default_chk
  check (not (dt >= '2022-01-01' and dt < '2023-01-01'));
create table rp_2022 partition of rp for values from ('2022-01-01') to ('2023-01-01');
DEBUG:  updated partition constraint for default partition "rp_default" is implied by existing constraints

The debug message indicates that it worked as expected that time.

But to be honest, I don't know why you've even added that. There's not even an example of
how to add a DEFAULT partition, so why should we include an example of how to add a CHECK
constraint on one?

I've spent a bit of time hacking at this and I've come up with the attached patch.

David

Attachment: partition_doc_fixes.patch
Re: Increase value of OUTER_VAR
On Sat, 3 Jul 2021 at 06:23, Tom Lane wrote: > So I'm inclined to propose pushing this and seeing what happens. Is this really sane? As much as I would like to see the 65k limit removed, I just have reservations about fixing it in this way. Even if we get all the cases fixed in core, there's likely a whole bunch of extensions that'll have bugs as a result of this for many years to come. "git grep \sIndex\s -- *.[ch] | wc -l" is showing me 77 matches in the Citus code. That's not the only extension that uses the planner hook. I'm really just not sure it's worth all the dev hours fixing the fallout. To me, it seems much safer to jump bump 65k up to 1m. It'll be a while before anyone complains about that. It's also not that great to see the number of locations that you needed to add run-time checks for negative varnos. That's not going to come for free. David
Re: Reducing the cycle time for CLOBBER_CACHE_ALWAYS buildfarm members
Andrew Dunstan writes: > On 7/3/21 6:59 PM, Tom Lane wrote: >> So I think it's ready to go into the buildfarm, modulo any >> cosmetic work you might want to do. > Yeah, I'm looking at it now. A couple of things: I think we should > probably call the setting 'use_clobber_cache_always' since that's what > it does. And I think we should probably put in a sanity check to make it > incompatible with any -DCLOBBER_CACHE_* define in CPPFLAGS. > Thoughts? No objections here. regards, tom lane
Re: Disable WAL logging to speed up data loading
Greetings,

* vignesh C (vignes...@gmail.com) wrote:
> On Wed, Apr 7, 2021 at 12:13 PM osumi.takami...@fujitsu.com wrote:
> > Mainly affected by a commit 9de9294,
> > I've fixed minor things to rebase the patch.
> > All modifications I did are cosmetic changes and
> > a little bit of documentation updates.
> > Please have a look at attached v09.
>
> Patch is not applying on Head, kindly post a rebased version. As this
> patch is in Ready for Committer state, it will help one of the
> committers to pick up this patch during commit fest.

Unless there's actually a committer who is interested and willing to take this on
(understanding that multiple committers have already said they don't agree with this
approach), I don't think it makes sense to spend additional time on this.

Rather than RfC, the appropriate status seems like it should be Rejected, as otherwise
it's just encouraging someone to ultimately waste their time rebasing and updating the
patch when it isn't going to ever actually be committed.

Thanks,

Stephen
Re: Increase value of OUTER_VAR
David Rowley writes: > Is this really sane? > As much as I would like to see the 65k limit removed, I just have > reservations about fixing it in this way. Even if we get all the > cases fixed in core, there's likely a whole bunch of extensions > that'll have bugs as a result of this for many years to come. Maybe. I'm not that concerned about planner hacking: almost all of the planner is only concerned with pre-setrefs.c representations and will never see these values. Still, the fact that we had to inject a couple of explicit IS_SPECIAL_VARNO tests is a bit worrisome. (I'm more surprised really that noplace in the executor needed it.) FWIW, experience with those places says that such bugs will be exposed immediately; it's not like they'd lurk undetected "for years". You might argue that the int-vs-Index declaration changes are something that would be much harder to get right, but in reality those are almost entirely cosmetic. We could make them completely so by changing the macro to #define IS_SPECIAL_VARNO(varno) ((int) (varno) < 0) so that it'd still do the right thing when applied to a variable declared as Index. (In the light of morning, I'm not sure why I didn't do that already.) But we've always been extremely cavalier about whether RT indexes should be declared as int or Index, so I felt that standardizing on the former was actually a good side-effect of the patch. Anyway, to address your point more directly: as I recall, the main objection to just increasing the values of these constants was the fear that it'd bloat bitmapsets containing these values. Now on the one hand, this patch has proven that noplace in the core code does that today. On the other hand, there's no certainty that someone might not try to do that tomorrow (if we don't fix it as per this patch); or extensions might be doing so. > I'm really just not sure it's worth all the dev hours fixing the > fallout. To me, it seems much safer to jump bump 65k up to 1m. It'll > be a while before anyone complains about that. TBH, if we're to approach it that way, I'd be inclined to go for broke and raise the values to ~2B. Then (a) we'll be shut of the problem pretty much permanently, and (b) if someone does try to make a bitmapset containing these values, hopefully they'll see performance bad enough to expose the issue immediately. > It's also not that great to see the number of locations that you > needed to add run-time checks for negative varnos. That's not going to > come for free. Since the test is just "< 0", I pretty much disbelieve that argument. There are only two such places in the patch, and neither of them are *that* performance-sensitive. Anyway, the raise-the-values solution does have the advantage of being a four-liner, so I can live with it if that's the consensus. But I do think this way is cleaner in the long run, and I doubt the argument that it'll create any hard-to-detect bugs. regards, tom lane
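For concreteness, here is a sketch (not an actual patch) of what the "raise the values"
alternative might look like for the special varno constants in primnodes.h; the
replacement numbers are purely illustrative, the point being only that they would sit
near INT_MAX rather than at ~65000:

/* sketch only: pushing the special varnos up near INT_MAX instead of making them negative */
#define    INNER_VAR	0x7FFFFFFD	/* was 65000 */
#define    OUTER_VAR	0x7FFFFFFE	/* was 65001 */
#define    INDEX_VAR	0x7FFFFFFF	/* was 65002 */

#define IS_SPECIAL_VARNO(varno)		((varno) >= INNER_VAR)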
Re: PostgreSQL-13.3 parser.y with positional references by named references
Domingo Alvarez Duarte writes: > Here https://gist.github.com/mingodad/49291e0e9505522c66fcd3fcea4a939d I > posted the postgresql-13.3/src/backend/parser/gram.y with positional > references by named references that is supported by bison for some time now. When is "some time now"? Currently, we support bison versions back to 1.875. While we'd be willing to raise that bar as soon as a good reason to do so comes along, I'm not sure that getting rid of $N notation is a sufficient reason. Indeed, I'd say getting rid of $$ is a strict loss; the changes you show make actions much more verbose but certainly not any more readable. Having a special notation for a rule's output seems to me like a good thing not a bad one. The examples of named notation in the Bison docs don't seem like unconditional wins either; they're not very concise, and the contortions you're forced into when the same nonterminal type is used more than once in a rule are just horrid. I do see the point about it being annoying to update $N references when a rule is changed. But this solution has enough downsides that I'm not sure it's a net win. Maybe if it were applied selectively, to just the longer DDL productions, it'd be worth doing? regards, tom lane
Re: rand48 replacement
> Now suppose we want a random number in the range [0,6). This is what
> happens with your algorithm for each of the possible prng() return values:
>
> prng() returns 0 -- OK
> prng() returns 1 -- OK
> prng() returns 2 -- OK
> prng() returns 3 -- OK
> prng() returns 4 -- OK
> prng() returns 5 -- OK
> prng() returns 6 -- out of range so use splitmix3() with initial state=6:
>   state=6 -> 3, return 7 -- still out of range, so repeat
>   state=3 -> 0, return 3 -- now OK
> prng() returns 7 -- out of range so use splitmix3() with initial state=7:
>   state=7 -> 4, return 4 -- now OK
>
> So, assuming that prng() chooses each of the 8 possible values with equal
> probability (1/8), the overall result is that the values 0,1,2 and 5 are
> returned with a probability of 1/8, whereas 3 and 4 are returned with a
> probability of 2/8.

Ok, I got that explanation.

> Using the correct implementation of the bitmask algorithm, each iteration
> calls prng() again, so in the end no particular return value is ever more
> likely than any other (hence it's unbiased).

Ok, you're taking into account the number of states of the PRNG, so this finite number
implies some bias on some values if you actually enumerate all possible cases, as you do
above. I was reasoning "as if" the splitmix PRNG was an actual random function, which is
obviously false, but is also somehow a usual (false) assumption with PRNGs, and with this
false assumption my implementation is perfect :-)

The defect of the modulo method for range extraction is that even with an actual (real)
random generator the results would be biased. The bias is in the method itself. Now you
are arguing for a bias linked to the internals of the PRNG. They are not the same in
nature, even if the effect is the same. Also the bias is significant for ranges close to
the limit, which is not the kind of use case I have in mind when looking at pgbench.

If you consider the PRNG internals, then the splitmix extraction function could also be
taken into account. If it is not invertible (I'm unsure), then assuming it is some kind
of hash function, about 1/e of output values would not be reachable, which is yet another
bias that you could argue against. Using the initial PRNG works better only because the
underlying 128 bit state is much larger than the output value. Which is the point of
having a larger state in the first place, anyway.

> As for determinism, the end result is still fully deterministic. For
> example, lets say that prng() returns the following sequence, for some
> initial state:
>
>   1,0,3,0,3,7,4,7,6,6,5,3,7,7,7,0,3,6,5,2,3,3,4,0,0,2,7,4,...
>
> then the bitmask method just returns that sequence with all the 6's and
> 7's removed:
>
>   1,0,3,0,3,4,5,3,0,3,5,2,3,3,4,0,0,2,4,...
>
> and that same sequence will always be returned, when starting from that
> initial seed.

Yes and no. The result is indeed deterministic if you call the function with the same
range. However, if you change the range value in one place then sometimes the state can
advance differently, so the determinism is lost, meaning that it depends on actual range
values.

Attached is a v7 which does as you wish, but also loses the deterministic
non-value-dependent property I was seeking. I would work around that by deriving another
128 bit generator instead of splitmix 64 bit, but that is overkill.
--
Fabien.

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 2c2f149fb0..146b524076 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1188,7 +1188,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
 			 * Found a suitable tuple, so save it, replacing one old tuple
 			 * at random
 			 */
-			int			k = (int) (targrows * sampler_random_fract(rstate.randstate));
+			int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
 
 			Assert(k >= 0 && k < targrows);
 			heap_freetuple(rows[k]);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fafbab6b02..3009861e45 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5152,7 +5152,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
 		if (astate->rowstoskip <= 0)
 		{
 			/* Choose a random reservoir element to replace. */
-			pos = (int) (targrows * sampler_random_fract(astate->rstate.randstate));
+			pos = (int) (targrows * sampler_random_fract(&astate->rstate.randstate));
 			Assert(pos >= 0 && pos < targrows);
 			heap_freetuple(astate->rows[pos]);
 		}
diff --git a/contrib/tsm_system_rows/tsm_system_rows.c b/contrib/tsm_system_rows/tsm_system_rows.c
index 4996612902..1a46d4b143 100644
--- a/contrib/tsm_system_rows/tsm_system_rows.c
+++ b/contrib/tsm_system_rows/tsm_system_rows.c
@@ -69,7 +69,7 @@ static BlockNumber system_rows_nextsampleblock(SampleScanState *node, BlockNumbe
 static OffsetNumber system_rows_nextsampletuple(SampleScanState *node,
 												BlockNumber blockno, Offs
Re: rand48 replacement
On Sun, 4 Jul 2021 at 17:03, Fabien COELHO wrote: > > > As for determinism, the end result is still fully deterministic. > > The result is indeed deterministic of you call the function with the same > range. However, if you change the range value in one place then sometimes > the state can advance differently, so the determinism is lost, meaning > that it depends on actual range values. Ah yes, that's true. I can trivially reproduce that in other languages too. For example, in python, if I call random.seed(0) and then random.randrange() with the inputs 10,10,10 then the results are 6,6,0. But if the randrange() inputs are 10,1000,10 then the results are 6,776,6. So the result from the 3rd call changes as a result of changing the 2nd input. That's not entirely surprising. The important property of determinism is that if I set a seed, and then make an identical set of calls to the random API, the results will be identical every time, so that it's possible to write tests with predictable/repeatable results. > I would work around that by > deriving another 128 bit generator instead of splitmix 64 bit, but that is > overkill. Not really relevant now, but I'm pretty sure that's impossible to do. You might try it as an interesting academic exercise, but I believe it's a logical impossibility. Regards, Dean
Excessive cost of OpClassCache flushes in CLOBBER_CACHE_ALWAYS mode
Over in [1] it is demonstrated that with CLOBBER_CACHE_ALWAYS enabled, initdb accounts
for a full 50% of the runtime of "make check-world" (well, actually of the buildfarm
cycle, which is not quite exactly that but close). Since initdb certainly doesn't cost
that much normally, I wondered why it is so negatively affected by CCA. Some perf
measuring led me to LookupOpclassInfo, and specifically this bit:

    /*
     * When testing for cache-flush hazards, we intentionally disable the
     * operator class cache and force reloading of the info on each call. This
     * is helpful because we want to test the case where a cache flush occurs
     * while we are loading the info, and it's very hard to provoke that if
     * this happens only once per opclass per backend.
     */
#ifdef CLOBBER_CACHE_ENABLED
    if (debug_invalidate_system_caches_always > 0)
        opcentry->valid = false;
#endif

Diking that out halves initdb's CCA runtime. Turns out it also roughly halves the runtime
of the core regression tests under CCA, so this doesn't explain why initdb seems so
disproportionately affected by CCA. However, seeing that this single choice is accounting
for half the cost of CCA testing, we really have to ask whether it's worth that. This
code was added by my commit 03ffc4d6d of 2007-11-28, about which I wrote:

    Improve test coverage of CLOBBER_CACHE_ALWAYS by having it also force
    reloading of operator class information on each use of LookupOpclassInfo.
    Had this been in place a year ago, it would have helped me find a bug
    in the then-new 'operator family' code. Now that we have a build farm
    member testing CLOBBER_CACHE_ALWAYS on a regular basis, it seems worth
    expending a little bit of effort here.

I'm now a little dubious about my claim that this would have helped find any bugs.
Invalidating a finished OpClassCache entry does not model any real-world scenario,
because as noted elsewhere in LookupOpclassInfo, once such a cache entry is populated it
is kept for the rest of the session. Also, the claim in the comment that we need this to
test a cache flush during load seems like nonsense: if we have
debug_invalidate_system_caches_always turned on, then we'll test the effects of such
flushes throughout the initial population of a cache entry. Doing it over and over again
adds nothing.

So I'm now fairly strongly tempted to remove this code outright (effectively reverting
03ffc4d6d). Another possibility now that we have debug_invalidate_system_caches_always is
to increase the threshold at which this happens, making it more like
CLOBBER_CACHE_RECURSIVE.

Thoughts?

			regards, tom lane

[1] https://www.postgresql.org/message-id/1289102.1625353189%40sss.pgh.pa.us
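If the raise-the-threshold route were taken, the change might look something like the
following sketch (not a proposed patch; it assumes the recursive level of
debug_invalidate_system_caches_always is 3, corresponding to the old
CLOBBER_CACHE_RECURSIVE):

#ifdef CLOBBER_CACHE_ENABLED
	/* only force opclass info reload at recursive clobber levels */
	if (debug_invalidate_system_caches_always > 2)
		opcentry->valid = false;
#endif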
"debug_invalidate_system_caches_always" is too long
As I've been poking around in this area, I find myself growing increasingly annoyed at the new GUC name "debug_invalidate_system_caches_always". It is too d*mn long. It's a serious pain to type in any context where you don't have autocomplete to help you. I've kept referring to this type of testing as CLOBBER_CACHE_ALWAYS testing, even though that name is now obsolete, just because it's so much shorter. I think we need to reconsider this name while we still can. I do agree with the "debug_" prefix given that it's now visible to users. However, it doesn't seem that hard to save some space in the rest of the name. The word "system" is adding nothing of value, and the word "always" seems rather confusing --- if it does something "always", why is there more than one level? So a simple proposal is to rename it to "debug_invalidate_caches". However, I think we should also give serious consideration to "debug_clobber_cache" or "debug_clobber_cache_always" for continuity with past practice (though it still feels like "always" is a good word to lose now). "debug_clobber_caches" is another reasonable variant. Thoughts? regards, tom lane
Re: rand48 replacement
> The important property of determinism is that if I set a seed, and then
> make an identical set of calls to the random API, the results will be
> identical every time, so that it's possible to write tests with
> predictable/repeatable results.

Hmm… I like my stronger determinism definition more than this one :-)

>> I would work around that by deriving another 128 bit generator instead
>> of splitmix 64 bit, but that is overkill.
>
> Not really relevant now, but I'm pretty sure that's impossible to do. You
> might try it as an interesting academic exercise, but I believe it's a
> logical impossibility.

Hmm… I was simply thinking of seeding a new pg_prng_state from the main pg_prng_state
with some transformation, and then iterate over the second PRNG, pretty much like I did
with splitmix, but with 128 bits so that your #states argument does not apply, i.e.
something like:

/* select in a range with bitmask rejection */
uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
	/* always advance state once */
	uint64 next = xoroshiro128ss(state);
	uint64 val;

	if (range >= 2)
	{
		uint64 mask = mask_u64(range-1);

		val = next & mask;

		if (val >= range)
		{
			/* copy and update current prng state */
			pg_prng_state iterator = *state;

			iterator.s0 ^= next;
			iterator.s1 += UINT64CONST(0x9E3779B97f4A7C15);

			/* iterate till val in [0, range) */
			while ((val = xoroshiro128ss(&iterator) & mask) >= range)
				;
		}
	}
	else
		val = 0;

	return val;
}

The initial pseudo-random sequence is left to proceed, and a new PRNG is basically forked
for iterating on the mask, if needed.

--
Fabien.
visibility map corruption
Hi hackers,

We recently ran into an issue where the visibility map of a relation was corrupt, running
Postgres 12.4. The error we'd get when running a SELECT * from this table is:

could not access status of transaction 3704450152
DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.

On the lists I could find several similar reports, but corruption like this could
obviously have a very wide range of root causes; see [1] [2] [3] for example - not all of
them have their root cause known. This particular case was similar to the reported cases
above, but also has some differences.

The following query returns ~21.000 rows, which indicates something inconsistent between
the visibility map and the pd_all_visible flag on the page:

select * from pg_check_frozen('tbl');

Looking at one of the affected pages with pageinspect:

=# SELECT lp,lp_off,lp_flags,lp_len,t_xmin,t_xmax,t_field3,t_ctid,t_infomask2,t_infomask,t_hoff,t_oid
   FROM heap_page_items(get_raw_page('tbl', 726127));

 lp | lp_off | lp_flags | lp_len |   t_xmin   |   t_xmax   | t_field3 |   t_ctid   | t_infomask2 | t_infomask | t_hoff | t_oid
----+--------+----------+--------+------------+------------+----------+------------+-------------+------------+--------+-------
  1 |   6328 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,1) |         249 |       8339 |     56 |     ∅
  2 |   4464 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,2) |         249 |       8339 |     56 |     ∅
  3 |   2600 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,3) |         249 |       8339 |     56 |     ∅
  4 |    680 |        1 |   1920 | 3704450155 | 3704450155 |        1 | (726127,4) |         249 |       8339 |     56 |     ∅

t_infomask shows that the HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID bits are both unset.
This pg_visibility() call shows the inconsistency between the VM and the page, with
PD_ALL_VISIBLE=false:

=# select * from pg_visibility('tbl', 726127);

 all_visible | all_frozen | pd_all_visible
-------------+------------+----------------
 t           | t          | f

Looking at other pages shows the same information.

What's interesting is that out of the affected tuples returned by pg_check_frozen, over
99% belong to one transaction (3704450155 as seen above) and the remaining few are from
one other transaction that occurred at roughly the same time. I find it hard to believe
that this is due to some random bit flipping, because many pages are affected, but the
"incorrect" ones are in total only from two specific transactions which occurred at
roughly the same time. There were also no server crashes or other known failures around
the time of this transaction. I'm not ruling out any other kind of failure still, but I
also cannot really explain how this could have happened.

The server has PG12.4 with full_page_writes=on, data_checksums=off. It's a large
analytics database. The VM inconsistencies also occur on the streaming replicas.

I realize these cases are pretty rare and hard to debug, but I wanted to share the
information I found so far here for reference. Maybe someone has an idea what occurred,
or maybe someone in the future finds it useful when they find something similar.

I have no idea how the inconsistency between the VM and PD_ALL_VISIBLE started - from
looking through the code I can't really find any way this could occur. However, for it to
lead to the problem described here, I believe there should be *no* SELECT that touches
that particular page after the insert/update transaction and before the transaction log
gets truncated. If the page is read before the transaction log gets truncated, then the
hint bit HEAP_XMIN_COMMITTED will get set and future reads will succeed regardless of tx
log truncation.
One of the replicas had this happen to it: the affected pages are identical to the
primary except that the HEAP_XMIN_COMMITTED flag is set (note that the VM inconsistency
is still there on the replica though: PD_ALL_VISIBLE=false even though the VM shows
all_frozen=all_visible=true). But I can query these rows on the replica without issues,
because it doesn't check the tx log when it sees that HEAP_XMIN_COMMITTED is set.

-Floris

[1] https://postgrespro.com/list/thread-id/2422376
[2] https://postgrespro.com/list/thread-id/2501800
[3] https://postgrespro.com/list/thread-id/2321949
Re: visibility map corruption
On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee wrote: > We recently ran into an issue where the visibility map of a relation was > corrupt, running Postgres 12.4. The error we'd get when running a SELECT * > from this table is: > > could not access status of transaction 3704450152 > DETAIL: Could not open file "pg_xact/0DCC": No such file or directory. Have you ever used pg_upgrade on this database? -- Peter Geoghegan
Re: "debug_invalidate_system_caches_always" is too long
On Sun, Jul 04, 2021 at 04:27:13PM -0400, Tom Lane wrote: > As I've been poking around in this area, I find myself growing > increasingly annoyed at the new GUC name > "debug_invalidate_system_caches_always". It is too d*mn long. > It's a serious pain to type in any context where you don't have > autocomplete to help you. I've kept referring to this type of > testing as CLOBBER_CACHE_ALWAYS testing, even though that name is > now obsolete, just because it's so much shorter. I think we need > to reconsider this name while we still can. > > I do agree with the "debug_" prefix given that it's now visible to > users. However, it doesn't seem that hard to save some space in > the rest of the name. The word "system" is adding nothing of value, > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". I agree with all that. The word "always" has been misinformation, given the multiple levels available. > However, I think we should also give serious consideration to > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > with past practice (though it still feels like "always" is a good > word to lose now). "debug_clobber_caches" is another reasonable > variant. https://en.wikipedia.org/wiki/Clobbering refers to cases where storage had no changes to its accessibility but now contains different data. That doesn't match InvalidateSystemCaches() especially well, so I think dropping that word has been a good step. Some other shorter terms could be debug_flush_caches, debug_rebuild_caches, or debug_expire_caches. (debug_caches is tempting, but that may ensnare folks looking for extra logging rather than a big slowdown.)
RE: visibility map corruption
> On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee > wrote: > > We recently ran into an issue where the visibility map of a relation was > corrupt, running Postgres 12.4. The error we'd get when running a SELECT * > from this table is: > > > > could not access status of transaction 3704450152 > > DETAIL: Could not open file "pg_xact/0DCC": No such file or directory. > > Have you ever used pg_upgrade on this database? > Yes. The last time (from v11 to v12) was in October 2020. The transaction id in the tuples (the one PG is trying to check in the tx log) dated from February 2021. I do believe (but am not 100% certain) that the affected table already existed at the time of the last pg_upgrade though.
Re: visibility map corruption
On Sun, Jul 4, 2021 at 2:26 PM Floris Van Nee wrote: > > Have you ever used pg_upgrade on this database? > > > > Yes. The last time (from v11 to v12) was in October 2020. The transaction id > in the tuples (the one PG is trying to check in the tx log) dated from > February 2021. I do believe (but am not 100% certain) that the affected table > already existed at the time of the last pg_upgrade though. I wonder if it's related to this issue: https://www.postgresql.org/message-id/20210423234256.hwopuftipdmp3...@alap3.anarazel.de Have you increased autovacuum_freeze_max_age from its default? This already sounds like the kind of database where that would make sense. -- Peter Geoghegan
Re: "debug_invalidate_system_caches_always" is too long
On Sun, Jul 04, 2021 at 04:27:13PM -0400, Tom Lane wrote: > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". +1 to remove "always" -- Justin
RE: visibility map corruption
> > I wonder if it's related to this issue: > > https://www.postgresql.org/message- > id/20210423234256.hwopuftipdmp3...@alap3.anarazel.de > > Have you increased autovacuum_freeze_max_age from its default? This > already sounds like the kind of database where that would make sense. > autovacuum_freeze_max_age is increased in our setup indeed (it is set to 500M). However, we do regularly run manual VACUUM (FREEZE) on individual tables in the database, including this one. A lot of tables in the database follow an INSERT-only pattern and since it's not running v13 yet, this meant that these tables would only rarely be touched by autovacuum. Autovacuum would sometimes kick in on some of these tables at the same time at unfortunate moments. Therefore we have some regular job running that VACUUM (FREEZE)s tables with a xact age higher than a (low, 10M) threshold ourselves.
Re: Disable WAL logging to speed up data loading
On Sun, Jul 04, 2021 at 11:02:01AM -0400, Stephen Frost wrote:
> Rather than RfC, the appropriate status seems like it should be
> Rejected, as otherwise it's just encouraging someone to ultimately waste
> their time rebasing and updating the patch when it isn't going to ever
> actually be committed.

Yes, agreed with that.
--
Michael
Re: Can a child process detect postmaster death when in pg_usleep?
On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote:
> My bad. I was talking about the cases when do_pg_stop_backup is called
> while the server is in recovery mode i.e. backup_started_in_recovery =
> RecoveryInProgress(); evaluates to true. I'm not sure in these cases
> whether we should replace pg_usleep with WaitLatch. If yes, whether we
> should use procLatch/MyLatch or recoveryWakeupLatch as they are
> currently serving different purposes.

It seems to me that you should re-read the description of recoveryWakeupLatch at the top
of xlog.c and check for which purpose it exists, which is, in this case, to wake up the
startup process to accelerate WAL replay. So do_pg_stop_backup() has no business with it.

Switching pg_stop_backup() to use a latch rather than pg_usleep() has benefits:
- It simplifies the wait event handling.
- The process waiting for the last WAL segment to be archived will be more responsive
  on signals like SIGHUP and on postmaster death.

These don't sound bad to me to apply here, so 0002 could be simplified as attached.
--
Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7890e13d7a..c7c928f50b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11638,9 +11638,11 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 				reported_waiting = true;
 			}
 
-			pgstat_report_wait_start(WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE);
-			pg_usleep(1000000L);
-			pgstat_report_wait_end();
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L,
+							 WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE);
+			ResetLatch(MyLatch);
 
 			if (++waits >= seconds_before_warning)
 			{
RE: Disable WAL logging to speed up data loading
On Monday, July 5, 2021 10:32 AM Michael Paquier wrote: > On Sun, Jul 04, 2021 at 11:02:01AM -0400, Stephen Frost wrote: > > Rather than RfC, the appropriate status seems like it should be > > Rejected, as otherwise it's just encouraging someone to ultimately > > waste their time rebasing and updating the patch when it isn't going > > to ever actually be committed. > > Yes, agreed with that. Thanks for paying attention to this thread. This cannot be helped but I've made the patch status as you suggested, because I judged the community would not accept this idea. Best Regards, Takamichi Osumi
Re: row filtering for logical replication
On Thu, Jul 1, 2021 at 10:43 AM Euler Taveira wrote:
>
> Amit, thanks for rebasing this patch. I already had a similar rebased patch in
> my local tree. A recent patch broke your version v15 so I rebased it.
>
> I like the idea of a simple create_estate_for_relation() function (I fixed an
> oversight regarding GetCurrentCommandId(false) because it is used only for
> read-only purposes). This patch also replaces all references to version 14.
>
> Commit ef948050 made some changes in the snapshot handling. Set the current
> active snapshot might not be required but future changes to allow functions
> will need it.
>
> As the previous patches, it includes commits (0002 and 0003) that are not
> intended to be committed. They are available for test-only purposes.

I have some review comments on the "Row filter for logical replication" patch:

(1) Suggested update to patch comment:
(There are some missing words and things which could be better expressed)

    This feature adds row filtering for publication tables. When a publication is defined
    or modified, rows that don't satisfy a WHERE clause may be optionally filtered out.
    This allows a database or set of tables to be partially replicated. The row filter is
    per table, which allows different row filters to be defined for different tables. A
    new row filter can be added simply by specifying a WHERE clause after the table name.
    The WHERE clause must be enclosed by parentheses.

    The WHERE clause should probably contain only columns that are part of the primary key
    or that are covered by REPLICA IDENTITY. Otherwise, any DELETEs won't be replicated.
    DELETE uses the old row version (that is limited to primary key or REPLICA IDENTITY)
    to evaluate the row filter. INSERT and UPDATE use the new row version to evaluate the
    row filter, hence, you can use any column. If the row filter evaluates to NULL, it
    returns false. For simplicity, functions are not allowed; that could possibly be
    addressed in a future patch.

    If you choose to do the initial table synchronization, only data that satisfies the
    row filters is sent. If the subscription has several publications in which a table has
    been published with different WHERE clauses, rows must satisfy all expressions to be
    copied. If subscriber is a pre-15 version, data synchronization won't use row filters
    if they are defined in the publisher. Previous versions cannot handle row filters.

    If your publication contains a partitioned table, the publication parameter
    publish_via_partition_root determines if it uses the partition row filter (if the
    parameter is false, the default) or the root partitioned table row filter.

(2) Some inconsistent error message wording:

Currently:
    err = _("cannot use subquery in publication WHERE expression");

Suggest changing it to:
    err = _("subqueries are not allowed in publication WHERE expressions");

Other examples from the patch:
    err = _("aggregate functions are not allowed in publication WHERE expressions");
    err = _("grouping operations are not allowed in publication WHERE expressions");
    err = _("window functions are not allowed in publication WHERE expressions");
    errmsg("functions are not allowed in publication WHERE expressions"),
    err = _("set-returning functions are not allowed in publication WHERE expressions");

(3) The current code still allows arbitrary code execution, e.g. via a user-defined
operator:

e.g.
publisher:

CREATE OR REPLACE FUNCTION myop(left_arg INTEGER, right_arg INTEGER)
RETURNS BOOL AS $$
BEGIN
  RAISE NOTICE 'I can do anything here!';
  RETURN left_arg > right_arg;
END;
$$ LANGUAGE PLPGSQL VOLATILE;

CREATE OPERATOR ( PROCEDURE = myop, LEFTARG = INTEGER, RIGHTARG = INTEGER );

CREATE PUBLICATION tap_pub FOR TABLE test_tab WHERE (a 5);

subscriber:

CREATE SUBSCRIPTION tap_sub
  CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub'
  PUBLICATION tap_pub;

Perhaps add the following after the existing shell error-check in make_op():

    /* User-defined operators are not allowed in publication WHERE clauses */
    if (pstate->p_expr_kind == EXPR_KIND_PUBLICATION_WHERE &&
        opform->oid >= FirstNormalObjectId)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("user-defined operators are not allowed in publication WHERE expressions"),
                 parser_errposition(pstate, location)));

Also, I believe it's also allowing user-defined CASTs (so could add a similar check to
the above in transformTypeCast()).

Ideally, it would be preferable to validate/check publication WHERE expressions in one
central place, rather than scattered all over the place, but that might be easier said
than done.

You need to update the patch comment accordingly.

(4) src/backend/replication/pgoutput/pgoutput.c
pgoutput_change()

The 3 added calls to pgoutput_row_filter() are returning from pgoutput_change() if false
is returned, but instead they should break from the switch, otherwise cleanup code is
missed. This is surely a bug.

e.g. (3 similar cases of this)

+ i
Re: "debug_invalidate_system_caches_always" is too long
At Sun, 4 Jul 2021 14:12:34 -0700, Noah Misch wrote in > > However, I think we should also give serious consideration to > > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > > with past practice (though it still feels like "always" is a good > > word to lose now). "debug_clobber_caches" is another reasonable > > variant. > > https://en.wikipedia.org/wiki/Clobbering refers to cases where storage had no > changes to its accessibility but now contains different data. That doesn't > match InvalidateSystemCaches() especially well, so I think dropping that word > has been a good step. Some other shorter terms could be debug_flush_caches, > debug_rebuild_caches, or debug_expire_caches. (debug_caches is tempting, but (I murmur that I think "drop" is also usable here.) > that may ensnare folks looking for extra logging rather than a big slowdown.) I agree to this. (And one more +1 to removing "always".) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: "debug_invalidate_system_caches_always" is too long
On Mon, Jul 5, 2021 at 1:57 AM Tom Lane wrote: > > As I've been poking around in this area, I find myself growing > increasingly annoyed at the new GUC name > "debug_invalidate_system_caches_always". It is too d*mn long. > It's a serious pain to type in any context where you don't have > autocomplete to help you. I've kept referring to this type of > testing as CLOBBER_CACHE_ALWAYS testing, even though that name is > now obsolete, just because it's so much shorter. I think we need > to reconsider this name while we still can. > > I do agree with the "debug_" prefix given that it's now visible to > users. However, it doesn't seem that hard to save some space in > the rest of the name. The word "system" is adding nothing of value, > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". > > However, I think we should also give serious consideration to > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > with past practice (though it still feels like "always" is a good > word to lose now). "debug_clobber_caches" is another reasonable > variant. > > Thoughts? +1 for the "debug_clobber_caches" variant, easy to remember. Regards, Amul
Re: Can a child process detect postmaster death when in pg_usleep?
At Fri, 2 Jul 2021 10:27:21 +0900, Michael Paquier wrote in
> On Mon, Jun 28, 2021 at 11:01:57AM -0400, Tom Lane wrote:
> > Dunno ... I cannot recall ever having had that as a debugging requirement
> > in a couple of decades worth of PG bug-chasing. If the postmaster is
> > dying, you generally want to deal with that before bothering with child
> > processes. Moreover, child processes that don't go awy when the
> > postmaster does are a very nasty problem, because they could screw up
> > subsequent debugging work.
>
> At the same time, nobody has really complained about this being an
> issue for developer options. I would tend to wait for more opinions
> before doing anything with the auth_delay GUCs.

I'm not sure the current behavior is especially useful for debugging. However, I also
don't think it is especially useful for children to respond immediately to postmaster
death during these debug delays, because while being debugged the child doesn't respond
anyway (until control reaches the point of checking for postmaster death), and the delays
must be very short even if someone abuses them on a production system.

On the other hand, there could be a discussion about adopting the convention that any
user-definable sleep must respond to signals, maybe as Thomas mentioned.

So, I don't object to whichever way we go. But if we don't change the behavior, we would
instead need a comment that explains the reason for the pg_usleep.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: "debug_invalidate_system_caches_always" is too long
On Mon, Jul 5, 2021 at 1:57 AM Tom Lane wrote: > > As I've been poking around in this area, I find myself growing > increasingly annoyed at the new GUC name > "debug_invalidate_system_caches_always". It is too d*mn long. > It's a serious pain to type in any context where you don't have > autocomplete to help you. I've kept referring to this type of > testing as CLOBBER_CACHE_ALWAYS testing, even though that name is > now obsolete, just because it's so much shorter. I think we need > to reconsider this name while we still can. > > I do agree with the "debug_" prefix given that it's now visible to > users. However, it doesn't seem that hard to save some space in > the rest of the name. The word "system" is adding nothing of value, > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". > > However, I think we should also give serious consideration to > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > with past practice (though it still feels like "always" is a good > word to lose now). "debug_clobber_caches" is another reasonable > variant. > > Thoughts? +1. IMO, debug_clobber_caches is better because it is simple. And also, since the invalidation happens on multiple system caches, debug_clobber_caches is preferable than debug_clobber_cache. Regards, Bharath Rupireddy.
Re: Yet another fast GiST build
I tried reviewing the remaining patches. They seem to work correctly and pass the tests on my laptop. > In this pattern I flipped PointerGetDatum(a) to PointerGetDatum(ra.lower), > because it seems to me correct. I've followed rule of thumb: every sort > function must extract and use "lower" somehow. Though I suspect numeric a > bit. Is it regular varlena? As far as I understand, we cannot use the sortsupport functions from the btree operator classes, because the btree_gist extension handles things differently. This is unfortunate and a source of bugs [1], but we cannot do anything about it here. Given that the lower and upper datums must be the same for leaf nodes, it makes sense to me to compare one of them. Using numeric_cmp() for numeric is in line with using bttextcmp() for text. > + /* > +* Numeric has abbreviation routines in numeric.c, but we don't try to use > +* them here. Maybe later. > +*/ This is also true for text. Perhaps we should add a similar comment there. > PFA patchset with v6 intact + two fixes of discovered issues. > + /* Use byteacmp(), like gbt_bitcmp() does */ We can improve this comment by incorporating Heikki's earlier explanation: > Ok, I think I understand that now. In btree_gist, the *_cmp() function > operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf > values. For all other datatypes, the leaf and non-leaf representation is > the same, but for bit/varbit, the non-leaf representation is different. > The leaf representation is VarBit, and non-leaf is just the bits without > the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to > just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len' > field separately. That's subtle, and 100% uncommented. I think patch number 3 should be squashed into patch number 1. I couldn't understand patch number 2, "Remove DEBUG1 verification"; that verification seems like something rather useful to keep. [1] https://www.postgresql.org/message-id/flat/201010112055.o9BKtZf7011251%40wwwmaster.postgresql.org
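To make the 'every sort function must extract and use "lower"' rule of thumb concrete for readers who haven't opened the patchset, here is a schematic sketch of a sortsupport comparator that orders keys by their lower bound. The key layout and function names are invented for illustration; this is not code from the patches under review.

```c
#include "postgres.h"

#include "fmgr.h"
#include "utils/sortsupport.h"

PG_MODULE_MAGIC;

/* Invented key layout standing in for a btree_gist lower/upper pair */
typedef struct DemoKey
{
	int32		lower;
	int32		upper;			/* equal to "lower" for leaf entries */
} DemoKey;

/* Order keys by their lower bound, per the rule of thumb above */
static int
demo_ssup_cmp(Datum a, Datum b, SortSupport ssup)
{
	DemoKey    *ka = (DemoKey *) DatumGetPointer(a);
	DemoKey    *kb = (DemoKey *) DatumGetPointer(b);

	if (ka->lower < kb->lower)
		return -1;
	if (ka->lower > kb->lower)
		return 1;
	return 0;
}

PG_FUNCTION_INFO_V1(demo_sortsupport);

/* SQL-callable sortsupport function: all it does is install the comparator */
Datum
demo_sortsupport(PG_FUNCTION_ARGS)
{
	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

	ssup->comparator = demo_ssup_cmp;
	PG_RETURN_VOID();
}
```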
Re: Transactions involving multiple postgres foreign servers, take 2
On 2021/06/30 10:05, Masahiko Sawada wrote: > On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda > wrote: >> >> Hi Jamison-san, sawada-san, >> >> Thanks for testing! >> >> FWIW, I tested using pgbench with "--rate=" option to know the server >> can execute transactions with stable throughput. As sawada-san said, >> the latest patch resolved second phase of 2PC asynchronously. So, >> it's difficult to control the stable throughput without "--rate=" option. >> >> I also worried what I should do when the error happened because to increase >> "max_prepared_foreign_transaction" doesn't work. Since too overloading may >> show the error, is it better to add the case to the HINT message? >> >> BTW, if sawada-san already develop to run the resolver processes in parallel, >> why don't you measure performance improvement? Although Robert-san, >> Tunakawa-san and so on are discussing what architecture is best, one >> discussion point is that there is a performance risk if adopting asynchronous >> approach. If we have promising solutions, I think we can make the discussion >> forward. > > Yeah, if we can asynchronously resolve the distributed transactions > without worrying about max_prepared_foreign_transaction error, it > would be good. But we will need synchronous resolution at some point. > I think we at least need to discuss it at this point. > > I've attached the new version patch that incorporates the comments > from Fujii-san and Ikeda-san I got so far. We launch a resolver > process per foreign server, committing prepared foreign transactions > on foreign servers in parallel. To get a better performance based on > the current architecture, we can have multiple resolver processes per > foreign server but it seems not easy to tune it in practice. Perhaps > is it better if we simply have a pool of resolver processes and we > assign a resolver process to the resolution of one distributed > transaction one by one? That way, we need to launch resolver processes > as many as the concurrent backends using 2PC.

Thanks for updating the patches. I have tested them on my laptop; the summary is as follows.

(1) The latest patch (v37) improves throughput by about 1.5x compared to v36. I had expected a 2.0x improvement, since the workload has each transaction access two remote servers, but I think the disk is the bottleneck: I couldn't give each PostgreSQL server its own disk. With separate disks, I expect the improvement could reach 2.0x.

(2) With v37, throughput with foreign_twophase_commit = required is about 36% of the foreign_twophase_commit = disabled case. Although the throughput is improved, the absolute performance is not good; that may simply be the fate of 2PC. I think the reason is that the number of WAL writes increases substantially, and disk writes are the bottleneck on my laptop. I would like to see results from richer environments if someone can test there.

(3) v37 has no overhead when foreign_twophase_commit = disabled. On the contrary, performance improved by 3%, which may be within the margin of error.

The test details follow.

# condition
* 1 coordinator and 3 foreign servers
* all 4 instances share one SSD
* each transaction accesses two different foreign servers
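As a reference for anyone trying to reproduce a similar environment, the part1..part3 tables touched by the pgbench script below could be defined on the coordinator roughly like this. This is a sketch only: postgres_fdw, the host/port values, and the table definitions are my assumptions, not taken from the test environment described above, and matching plain tables must exist on each remote server.

```sql
CREATE EXTENSION postgres_fdw;

-- one foreign server per remote instance; repeat for fs2/fs3 with their ports
CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '5433', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;

-- the pgbench script updates part1..part3 by id
CREATE FOREIGN TABLE part1 (id int, md5 text) SERVER fs1;
-- CREATE FOREIGN TABLE part2 (...) SERVER fs2;  and likewise part3 on fs3
```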
```
fxact_update.pgbench
\set id random(1, 100)
\set partnum 3
\set p1 random(1, :partnum)
\set p2 ((:p1 + 1) % :partnum) + 1
BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```

* pgbench generates the load. I increased ${RATE} little by little until the "maximum number of foreign transactions reached" error happened.

```
pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
```

* parameters

max_prepared_transactions = 100
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 4

# test source code patterns

1. 2PC patches (v36) based on 6d0eb385 (foreign_twophase_commit = required)
2. 2PC patches (v37) based on 2595e039 (foreign_twophase_commit = required)
3. 2PC patches (v37) based on 2595e039 (foreign_twophase_commit = disabled)
4. 2595e039 without the 2PC patches (v37)

# results

1. tps = 241.8000, latency average = 10.413 ms
2. tps = 359.017519 (about 1.5x compared to 1.; about 36% of 3.), latency average = 15.427 ms
3. tps = 987.372220 (about 1.03x compared to 4.), latency average = 8.102 ms
4. tps = 955.984574, latency average = 8.368 ms

The disk is the bottleneck in my environment, since disk utilization is almost 100% in every pattern. If each instance could be given its own disk, I think we could expect further performance improvements.

>> In my understanding, there are three improvement idea. First is that to make >> the resolver processes run in parallel. Second is that to send "COMMIT/ABORT >> PREPARED" remote servers in bulk. Third is to stop syncing the WAL remove_fdwxact() after resolving is done, whi
Re: rand48 replacement
Fabien COELHO wrote on 2021-07-04 23:29:

The important property of determinism is that if I set a seed, and then make an identical set of calls to the random API, the results will be identical every time, so that it's possible to write tests with predictable/repeatable results.

Hmmm… I like my stronger determinism definition more than this one:-)

I would work around that by deriving another 128 bit generator instead of splitmix 64 bit, but that is overkill.

Not really relevant now, but I'm pretty sure that's impossible to do. You might try it as an interesting academic exercise, but I believe it's a logical impossibility.

Hmmm… I was simply thinking of seeding a new pg_prng_state from the main pg_prng_state with some transformation, and then iterating over the second PRNG, pretty much like I did with splitmix, but with 128 bits so that your #states argument does not apply, i.e. something like:

/* select in a range with bitmask rejection */
uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
	/* always advance state once */
	uint64 next = xoroshiro128ss(state);
	uint64 val;

	if (range >= 2)
	{
		uint64 mask = mask_u64(range-1);

		val = next & mask;

		if (val >= range)
		{
			/* copy and update current prng state */
			pg_prng_state iterator = *state;

			iterator.s0 ^= next;
			iterator.s1 += UINT64CONST(0x9E3779B97f4A7C15);

			/* iterate till val in [0, range) */
			while ((val = xoroshiro128ss(&iterator) & mask) >= range)
				;
		}
	}
	else
		val = 0;

	return val;
}

The initial pseudo-random sequence is left to proceed, and a new PRNG is basically forked for iterating on the mask, if needed.

I believe most "range" values are small, much smaller than UINT32_MAX. In that case, according to [1], the fastest method is Lemire's (I'd take the original version from [2]). Therefore a combined pg_prng_u64_range() could branch on the range value:

uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
	uint64 val = xoroshiro128ss(state);
	uint64 m;

	/* handle all power-of-two cases (and range == 0) */
	if ((range & (range-1)) == 0)
		return range != 0 ? val & (range-1) : 0;

	if (likely(range < PG_UINT32_MAX/32))
	{
		/*
		 * Daniel Lemire's unbiased range random algorithm based on
		 * rejection sampling:
		 * https://lemire.me/blog/2016/06/30/fast-random-shuffling/
		 */
		m = (uint32) val * range;
		if ((uint32) m < range)
		{
			uint32 t = (-(uint32) range) % (uint32) range;

			while ((uint32) m < t)
				m = (uint32) xoroshiro128ss(state) * range;
		}
		return m >> 32;
	}

	/* Apple's mask method */
	m = mask_u64(range-1);
	val &= m;
	while (val >= range)
		val = xoroshiro128ss(state) & m;
	return val;
}

The mask method could also be faster when the range is close to the mask. For example, a fast check for "range is within 1/1024 of the mask" is

	range < (range/512 + (range&(range*2)))

and then the method choice could look like:

	if (likely(range < UINT32_MAX/32 && range > (range/512 + (range&(range*2)))))

But I don't know whether the additional condition is worth the difference or not.

[1] https://www.pcg-random.org/posts/bounded-rands.html
[2] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

regards, Sokolov Yura
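Both code sketches above rely on mask_u64(), which is not shown in this message. For readers following along, one plausible definition (my guess at the helper, not necessarily what the patch uses) is the usual bit-smearing trick that returns the smallest all-ones mask covering its argument:

```c
#include <stdint.h>

/*
 * Smallest mask of the form 2^k - 1 that covers u, e.g. mask_u64(5) == 7.
 * The callers above pass range-1, so "val & mask" lands in [0, 2^k) where
 * 2^k is the smallest power of two >= range.
 */
static inline uint64_t
mask_u64(uint64_t u)
{
	u |= u >> 1;
	u |= u >> 2;
	u |= u >> 4;
	u |= u >> 8;
	u |= u >> 16;
	u |= u >> 32;
	return u;
}
```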
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
On Friday, July 2, 2021 at 10:39:44 CEST, David Rowley wrote: > On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau wrote: > > I don't know if it's acceptable, but in the case where you add both an > > aggregate with an ORDER BY clause, and another aggregate without the > > clause, the output for the unordered one will change and use the same > > ordering, maybe surprising the unsuspecting user. Would that be acceptable > > ? > > That's a good question. There was an argument in [1] that mentions > that there might be a group of people who rely on aggregation being > done in a certain order where they're not specifying an ORDER BY > clause in the aggregate. If that group of people exists, then it's > possible they might be upset in the scenario that you describe. > > I also think that it's going to be pretty hard to make significant > gains in this area if we are too scared to make changes to undefined > behaviour. You wouldn't have to look too hard in the pgsql-general > mailing list to find someone complaining that their query output is in > the wrong order on some query that does not have an ORDER BY. We > pretty much always tell those people that the order is undefined > without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER > BY aggregate people any differently. We'll be stuck forever here and > in many other areas if we're too scared to change the order of > aggregation. You could argue that something like parallelism has > changed that for people already. I think the multi-batch Hash > Aggregate code could also change this. I would agree with that. > > > I was curious about the performance implication of that additional > > transition, and could not reproduce a significant difference. I may be > > doing something wrong: how did you highlight it ? > > It was pretty basic. I just created a table with two columns and no > index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP > BY 1; the new code will include a Sort due to lack of any index and > the old code would have done a sort inside nodeAgg.c. I imagine that > the overhead comes from the fact that in the patched version nodeAgg.c > must ask its subnode (nodeSort.c) for the next tuple each time, > whereas unpatched nodeAgg.c already has all the tuples in a tuplestore > and can fetch them very quickly in a tight loop. OK, I reproduced that case, just without using a GROUP BY: once the GROUP BY is added, a Sort node is added in both cases (master and your patch), except that with your patch we sort on both keys, and that doesn't really incur a performance penalty. I think the overhead occurs because in the ExecAgg case we use the tuplesort_*_datum API as an optimization when we have a single column as input, which the ExecSort code doesn't. Maybe it would be worth trying to use that API in Sort nodes too, when it can be done. > > David > > [1] https://www.postgresql.org/message-id/6538.1522096067%40sss.pgh.pa.us -- Ronan Dunklau
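For anyone who wants to try the comparison being discussed, the test case described above boils down to something like the following. The table name, row count, and column types are assumed; there is no index, so with the patch the ordering has to come from an explicit Sort node, while on master the aggregate sorts its input internally in nodeAgg.c.

```sql
CREATE TABLE ab (a int, b int);
INSERT INTO ab SELECT i % 10, i FROM generate_series(1, 1000000) AS i;
ANALYZE ab;

-- On master the aggregate sorts b itself; with the patch, per the discussion
-- above, the sort on both keys is done by the planner-provided Sort node.
EXPLAIN (COSTS OFF)
SELECT a, sum(b ORDER BY b) FROM ab GROUP BY 1;
```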