Re: Using COPY FREEZE in pgbench
Hi Fabien,

>> So overall gain by the patch is around 15%, whereas the last test
>> before the commit was 14%. It seems the patch is still beneficial
>> after the commit.
>
> Yes, that's good!

Yeah!

> I had a quick look again, and about the comment:
>
> /*
>  * If partitioning is not enabled and server version is 14.0 or later, we
>  * can take account of freeze option of copy.
>  */
>
> I'd suggest instead the shorter:
>
> /* use COPY with FREEZE on v14 and later without partitioning */
>
> Or maybe even to fully drop the comment, because the code is clear
> enough and the doc already says it.

I'd prefer to leave a comment. People (including me) tend to forget things in the future
that are obvious now :-)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 0c60077e1f..0f432767c2 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -220,6 +220,9 @@ pgbench options d
         data is generated in pgbench client and then sent to the server. This
         uses the client/server bandwidth extensively through a COPY.
+        pgbench uses the FREEZE option with 14 or later
+        versions of PostgreSQL to speed up
+        subsequent VACUUM, unless partitions are enabled.
         Using g causes logging to print one message
         every 100,000 rows while generating data for the
         pgbench_accounts table.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4aeccd93af..54d993075f 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -4119,6 +4119,7 @@ initGenerateDataClientSide(PGconn *con)
 	PGresult   *res;
 	int			i;
 	int64		k;
+	char	   *copy_statement;
 
 	/* used to track elapsed time and estimate of the remaining time */
 	pg_time_usec_t start;
@@ -4165,7 +4166,15 @@ initGenerateDataClientSide(PGconn *con)
 	/*
 	 * accounts is big enough to be worth using COPY and tracking runtime
 	 */
-	res = PQexec(con, "copy pgbench_accounts from stdin");
+
+	/* use COPY with FREEZE on v14 and later without partitioning */
+	if (partitions == 0 && PQserverVersion(con) >= 140000)
+		copy_statement = "copy pgbench_accounts from stdin with (freeze on)";
+	else
+		copy_statement = "copy pgbench_accounts from stdin";
+
+	res = PQexec(con, copy_statement);
+
 	if (PQresultStatus(res) != PGRES_COPY_IN)
 	{
 		pg_log_fatal("unexpected copy in result: %s", PQerrorMessage(con));
Re: Numeric multiplication overflow errors
On Sat, 3 Jul 2021 at 11:04, Dean Rasheed wrote: > Thinking about this more, I think it's best not to risk back-patching. > It *might* be safe, but it's difficult to really be sure of that. The > bug itself is pretty unlikely to ever happen in practice, hence the > lack of prior complaints, and in fact I only found it by an > examination of the code. So it doesn't seem to be worth the risk. That seems like good logic to me. Perhaps we can reconsider that decision if users complain about it. David
Re: rand48 replacement
On Sat, 3 Jul 2021 at 08:06, Fabien COELHO wrote:
>
> Here is a v4, which:
>
>  - moves the stuff to common and fully removes random/srandom (Tom)
>  - includes a range generation function based on the bitmask method (Dean)
>    but iterates with splitmix so that the state always advances once (Me)

At the risk of repeating myself: do *not* invent your own scheme.

The problem with iterating using splitmix is that splitmix is a simple shuffling function
that takes a single input and returns a mutated output depending only on that input. So
let's say for simplicity that you're generating numbers in the range [0,N) with N=2^64-n
and n<2^63. Each of the n values in [N,2^64) that lie outside the range wanted are just
mapped in a deterministic way back onto (at most) n values in the range [0,N), making
those n values twice as likely to be chosen as the other N-n values.

So what you've just invented is an algorithm with the complexity of the unbiased bitmask
method, but basically the same bias as the biased integer multiply method.

I don't understand why you object to advancing the state more than once. Doing so doesn't
make the resulting sequence of numbers any less deterministic. In fact, I'm pretty sure
you *have to* advance the state more than once in *any* unbiased scheme. That's a common
characteristic of all the unbiased methods I've seen, and intuitively, I think it has to
be so.

Otherwise, I'm happy with the use of the bitmask method, as long as it's implemented
correctly.

Regards,
Dean
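For concreteness, here is a minimal, self-contained C sketch of the "state advances only
once" variant being discussed, as read from this exchange; the names and structure are
illustrative only and are not the patch's actual code. On rejection it re-rolls purely
from the rejected value with a splitmix64 step, which is exactly what makes some in-range
values twice as likely.

#include <stdint.h>

/* splitmix64 step: a fixed, deterministic shuffle driven by one 64-bit input */
static uint64_t
splitmix64(uint64_t *x)
{
	uint64_t	z = (*x += UINT64_C(0x9E3779B97F4A7C15));

	z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
	z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
	return z ^ (z >> 31);
}

/*
 * Contested variant: "first_draw" is the single value taken from the main
 * generator (whose state therefore advances exactly once); "mask" is the
 * smallest all-ones bitmask covering range-1.  Any out-of-range value is
 * re-rolled from the rejected value alone, so each out-of-range draw lands
 * on a value fully determined by that draw rather than on a fresh random
 * draw -- the source of the bias described above.
 */
static uint64_t
range_state_advances_once(uint64_t first_draw, uint64_t range, uint64_t mask)
{
	uint64_t	val = first_draw & mask;
	uint64_t	seed = first_draw;

	while (val >= range)
		val = splitmix64(&seed) & mask;

	return val;
}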
Re: Remove useless int64 range checks on BIGINT sequence MINVALUE/MAXVALUE values
On Sat, 3 Jul 2021 at 22:44, Peter Eisentraut wrote: > I don't think this is a good change. > I think we should leave it as is. I'm inclined to agree. When I mentioned adding a comment I'd not imagined it would be quite so verbose. Plus, I struggle to imagine there's any compiler out there that someone would use that wouldn't just remove the check anyway. I had a quick click around on https://godbolt.org/z/PnKeq5bsT and didn't manage to find any compilers that didn't remove the check. David
Re: rand48 replacement
Hello Dean,

>> - moves the stuff to common and fully removes random/srandom (Tom)
>> - includes a range generation function based on the bitmask method (Dean)
>>   but iterates with splitmix so that the state always advances once (Me)
>
> At the risk of repeating myself: do *not* invent your own scheme.
>
> The problem with iterating using splitmix is that splitmix is a simple
> shuffling function that takes a single input and returns a mutated output
> depending only on that input.

It also iterates over its 64 bits state in a round robin fashion so that the cycle size
is 2^64 (it is a simple add).

> So let's say for simplicity that you're generating numbers in the range
> [0,N) with N=2^64-n and n<2^63. Each of the n values in [N,2^64) that lie
> outside the range wanted are just mapped in a deterministic way back onto
> (at most) n values in the range [0,N), making those n values twice as
> likely to be chosen as the other N-n values.

I do understand your point. If the value is outside the range, splitmix iterates over its
seed and the extraction function produces a new number which is tested again. I do not
understand the "mapped back onto" part, the out of range value is just discarded and the
loop starts over with a new derivation, and why it would imply that some values are more
likely to come out.

> So what you've just invented is an algorithm with the complexity of the
> unbiased bitmask method,

That is what I am trying to implement.

> but basically the same bias as the biased integer multiply method.

I did not understand.

> I don't understand why you object to advancing the state more than once.
> Doing so doesn't make the resulting sequence of numbers any less
> deterministic.

It does, somehow, hence my struggle to try to avoid it.

  call seed(0xdeadbeef);
  x1 = somepseudorand();
  x2 = somepseudorand();
  x3 = somepseudorand();

I think we should want x3 to be the same result whatever the previous calls to the API.

> In fact, I'm pretty sure you *have to* advance the state more than once
> in *any* unbiased scheme. That's a common characteristic of all the
> unbiased methods I've seen, and intuitively, I think it has to be so.

Yes and no. We can advance another state seeded by the root prng.

> Otherwise, I'm happy with the use of the bitmask method, as long as it's
> implemented correctly.

I did not understand why it is not correct.

--
Fabien.
Re: Update maintenance_work_mem/autovacuum_work_mem to reflect the 1GB limitation with VACUUM
On Sat, 3 Jul 2021 at 00:40, Laurenz Albe wrote:
>
> On Fri, 2021-07-02 at 23:31 +1200, David Rowley wrote:
> > I had a look at the patch in [1] and I find it a bit weird that we'd
> > write the following about autovacuum_work_mem in our docs:
> >
> > +
> > +    Note that VACUUM has a hard-coded limit of 1GB
> > +    for the amount of memory used, so setting
> > +    autovacuum_work_mem higher than that has no effect.
> > +
> >
> > Since that setting is *only* used for auto vacuum, why don't we just
> > limit the range of the GUC to 1GB?
> >
> > Of course, it wouldn't be wise to backpatch the reduced limit of
> > autovacuum_work_mem as it might upset people who have higher values in
> > their postgresql.conf when their database fails to restart after an
> > upgrade. I think what might be best is just to reduce the limit in
> > master and apply the doc patch for just maintenance_work_mem in all
> > supported versions. We could just ignore doing anything with
> > autovacuum_work_mem in the back branches and put it down to a
> > historical mistake that can't easily be fixed now.
>
> I think that is much better.
> I am fine with that patch.

Thanks for looking. I've pushed the doc fix patch for maintenance_work_mem and
back-patched it to 9.6.

I could do with a second opinion on whether we should just adjust the maximum value of
the autovacuum_work_mem GUC to 1GB in master. I'm also not sure whether, since we'd not
backpatch the GUC max value adjustment, we need to document the upper limit in the
manual.

David

Attachment: set_maxvalue_for_autovacuum_work_mem_to_1gb.patch
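For reference, capping the GUC would be a small change to its config_int entry in
src/backend/utils/misc/guc.c. The following is only a rough sketch of the shape of that
entry (reconstructed from memory, so details may differ from the real table and from the
attached patch), with the maximum dropped from MAX_KILOBYTES to 1GB expressed in kB:

/* sketch only: approximate shape of the ConfigureNamesInt[] entry in guc.c */
{
	{"autovacuum_work_mem", PGC_SIGHUP, RESOURCES_MEM,
		gettext_noop("Sets the maximum memory to be used by each autovacuum worker process."),
		NULL,
		GUC_UNIT_KB
	},
	&autovacuum_work_mem,
	-1, -1, 1024 * 1024,		/* was MAX_KILOBYTES; 1GB in kB, matching VACUUM's hard-coded cap */
	check_autovacuum_work_mem, NULL, NULL
},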
Re: Reducing the cycle time for CLOBBER_CACHE_ALWAYS buildfarm members
On 7/3/21 6:59 PM, Tom Lane wrote:
> I wrote:
>> Andrew Dunstan writes:
>>> Seems reasonable. I don't have a CCA animal any more, but I guess I
>>> could set up a test.
>> I can run a test here --- I'll commandeer sifaka for awhile,
>> since that's the fastest animal I have.
>
> Done, and here's the results:
>
> Traditional way (-DCLOBBER_CACHE_ALWAYS): 32:53:24
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2021-07-01%2018%3A06%3A09
>
> New way: 16:15:43
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sifaka&dt=2021-07-03%2004%3A02%3A16
>
> That's running sifaka's full test schedule including TAP tests,
> which ordinarily takes it a shade under 10 minutes. The savings
> on a non-TAP run would be a lot less, of course, thanks to
> fewer initdb invocations.
>
> Although I lacked the patience to run a full back-branch cycle
> with -DCLOBBER_CACHE_ALWAYS, I did verify that the patch
> correctly injects that #define when running an old branch.
> So I think it's ready to go into the buildfarm, modulo any
> cosmetic work you might want to do.

Yeah, I'm looking at it now. A couple of things: I think we should probably call the
setting 'use_clobber_cache_always', since that's what it does. And I think we should
probably put in a sanity check to make it incompatible with any -DCLOBBER_CACHE_* define
in CPPFLAGS. Thoughts?

There is one item I want to complete before putting out a new client release - making
provision for a change in the name of the default git branch - the aim is that with the
new release in place that will be completely seamless whenever it happens and whatever
name is chosen. I hope to have that done in a week or so, so the new release would be out
in about two weeks, if all goes well.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: rand48 replacement
On Sun, 4 Jul 2021 at 10:35, Fabien COELHO wrote:
>
> I did not understand why it is not correct.

Well, to make it easier to visualise, let's imagine our word size is just 3 bits instead
of 64 bits, and that the basic prng() function generates numbers in the range [0,8).
Similarly, imagine a splitmix3() that operates on 3-bit values. So it might do something
like this (state offset and return values made up):

splitmix3(state):
  state=0 -> 5, return 2
  state=1 -> 6, return 5
  state=2 -> 7, return 0
  state=3 -> 0, return 3
  state=4 -> 1, return 6
  state=5 -> 2, return 1
  state=6 -> 3, return 7
  state=7 -> 4, return 4

Now suppose we want a random number in the range [0,6). This is what happens with your
algorithm for each of the possible prng() return values:

prng() returns 0 -- OK
prng() returns 1 -- OK
prng() returns 2 -- OK
prng() returns 3 -- OK
prng() returns 4 -- OK
prng() returns 5 -- OK
prng() returns 6 -- out of range so use splitmix3() with initial state=6:
  state=6 -> 3, return 7 -- still out of range, so repeat
  state=3 -> 0, return 3 -- now OK
prng() returns 7 -- out of range so use splitmix3() with initial state=7:
  state=7 -> 4, return 4 -- now OK

So, assuming that prng() chooses each of the 8 possible values with equal probability
(1/8), the overall result is that the values 0,1,2 and 5 are returned with a probability
of 1/8, whereas 3 and 4 are returned with a probability of 2/8.

Using the correct implementation of the bitmask algorithm, each iteration calls prng()
again, so in the end no particular return value is ever more likely than any other
(hence it's unbiased).

As for determinism, the end result is still fully deterministic. For example, lets say
that prng() returns the following sequence, for some initial state:

  1,0,3,0,3,7,4,7,6,6,5,3,7,7,7,0,3,6,5,2,3,3,4,0,0,2,7,4,...

then the bitmask method just returns that sequence with all the 6's and 7's removed:

  1,0,3,0,3,4,5,3,0,3,5,2,3,3,4,0,0,2,4,...

and that same sequence will always be returned, when starting from that initial seed.

Regards,
Dean
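To make the contrast concrete, here is a minimal, self-contained C sketch of the unbiased
bitmask method described above, where every rejection goes back to the same generator
(and therefore may advance its state more than once). The toy generator and names are
illustrative only; this is not the code under review.

#include <stdint.h>

/* toy generator state; a stand-in for the real generator's larger state */
typedef struct toy_prng_state
{
	uint64_t	s;
} toy_prng_state;

/* splitmix64 used here only as a compact stand-in generator */
static uint64_t
toy_prng_next(toy_prng_state *state)
{
	uint64_t	z = (state->s += UINT64_C(0x9E3779B97F4A7C15));

	z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
	z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
	return z ^ (z >> 31);
}

/* smallest all-ones bitmask covering v (v > 0) */
static uint64_t
mask_u64(uint64_t v)
{
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	v |= v >> 32;
	return v;
}

/*
 * Unbiased bitmask rejection: mask the draw down to the smallest power of
 * two covering the range and, if it is still out of range, throw it away
 * and draw again from the *same* generator.  No out-of-range value is ever
 * remapped onto an in-range one, so all values in [0, range) stay equally
 * likely; the price is that the generator state may advance more than once
 * per call.
 */
static uint64_t
range_bitmask(toy_prng_state *state, uint64_t range)
{
	uint64_t	mask;
	uint64_t	val;

	if (range < 2)
		return 0;

	mask = mask_u64(range - 1);
	do
	{
		val = toy_prng_next(state) & mask;
	} while (val >= range);

	return val;
}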
Re: Mention --enable-tap-tests in the TAP section page
On Fri, Jul 02, 2021 at 09:52:10AM -0400, Andrew Dunstan wrote:
> Agreed.

Applied.
--
Michael
PostgreSQL-13.3 parser.y with positional references by named references
Here https://gist.github.com/mingodad/49291e0e9505522c66fcd3fcea4a939d I posted the
postgresql-13.3/src/backend/parser/gram.y with the positional references replaced by
named references, which bison has supported for some time now. It was done with a custom
script and some comments are missing; if there is any interest in accepting it, I could
try to work on it to include the missing comments and a different code layout. It
compiles on Ubuntu 18.04.

I did a similar contribution here https://github.com/facebookincubator/CG-SQL/pull/6

And here is a snippet of what it looks like:

opt_all_clause:
			ALL				{ $opt_all_clause = NIL; }
			| /*EMPTY*/		{ $opt_all_clause = NIL; }
		;

opt_sort_clause:
			sort_clause		{ $opt_sort_clause = $sort_clause; }
			| /*EMPTY*/		{ $opt_sort_clause = NIL; }
		;

sort_clause:
			ORDER BY sortby_list	{ $sort_clause = $sortby_list; }
		;

sortby_list:
			sortby							{ $sortby_list = list_make1($sortby); }
			| sortby_list[rhs_1] ',' sortby	{ $$ /* sortby_list */ = lappend($rhs_1, $sortby); }
		;

sortby:		a_expr USING qual_all_Op opt_nulls_order
				{
					$sortby = makeNode(SortBy);
					$sortby->node = $a_expr;
					$sortby->sortby_dir = SORTBY_USING;
					$sortby->sortby_nulls = $opt_nulls_order;
					$sortby->useOp = $qual_all_Op;
					$sortby->location = @qual_all_Op;
				}
			| a_expr opt_asc_desc opt_nulls_order
				{
					$sortby = makeNode(SortBy);
					$sortby->node = $a_expr;
					$sortby->sortby_dir = $opt_asc_desc;
					$sortby->sortby_nulls = $opt_nulls_order;
					$sortby->useOp = NIL;
					$sortby->location = -1;		/* no operator */
				}
		;

Cheers!
Re: ATTACH PARTITION locking documentation for DEFAULT partitions
On Sat, 17 Apr 2021 at 00:03, Matthias van de Meent wrote:
> PFA an updated patch. I've updated the wording of the previous patch,
> and also updated this section in alter_table.sgml, but with different
> wording, explictly explaining the process used to validate the altered
> default constraint.

I had to squint at this:

+ALTER TABLE measurement_default ADD CONSTRAINT excl_y2008m02
+    CHECK ( (logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01') IS FALSE );

I tried your example and it does not work.

set client_min_messages = 'debug1';
create table rp (dt date not null) partition by range(dt);
create table rp_default partition of rp default;
alter table rp_default add constraint rp_default_chk
  check ((dt >= '2022-01-01' and dt < '2023-01-01') is false);
create table rp_2022 partition of rp for values from ('2022-01-01') to ('2023-01-01');

There's no debug message to indicate that the constraint was used. Let's try again:

alter table rp_default drop constraint rp_default_chk;
drop table rp_2022;
alter table rp_default add constraint rp_default_chk
  check (not (dt >= '2022-01-01' and dt < '2023-01-01'));
create table rp_2022 partition of rp for values from ('2022-01-01') to ('2023-01-01');
DEBUG:  updated partition constraint for default partition "rp_default" is implied by existing constraints

The debug message indicates that it worked as expected that time.

But to be honest, I don't know why you've even added that. There's not even an example of
how to add a DEFAULT partition, so why should we include an example of how to add a CHECK
constraint on one?

I've spent a bit of time hacking at this and I've come up with the attached patch.

David

Attachment: partition_doc_fixes.patch
Re: Increase value of OUTER_VAR
On Sat, 3 Jul 2021 at 06:23, Tom Lane wrote: > So I'm inclined to propose pushing this and seeing what happens. Is this really sane? As much as I would like to see the 65k limit removed, I just have reservations about fixing it in this way. Even if we get all the cases fixed in core, there's likely a whole bunch of extensions that'll have bugs as a result of this for many years to come. "git grep \sIndex\s -- *.[ch] | wc -l" is showing me 77 matches in the Citus code. That's not the only extension that uses the planner hook. I'm really just not sure it's worth all the dev hours fixing the fallout. To me, it seems much safer to jump bump 65k up to 1m. It'll be a while before anyone complains about that. It's also not that great to see the number of locations that you needed to add run-time checks for negative varnos. That's not going to come for free. David
Re: Reducing the cycle time for CLOBBER_CACHE_ALWAYS buildfarm members
Andrew Dunstan writes: > On 7/3/21 6:59 PM, Tom Lane wrote: >> So I think it's ready to go into the buildfarm, modulo any >> cosmetic work you might want to do. > Yeah, I'm looking at it now. A couple of things: I think we should > probably call the setting 'use_clobber_cache_always' since that's what > it does. And I think we should probably put in a sanity check to make it > incompatible with any -DCLOBBER_CACHE_* define in CPPFLAGS. > Thoughts? No objections here. regards, tom lane
Re: Disable WAL logging to speed up data loading
Greetings,

* vignesh C (vignes...@gmail.com) wrote:
> On Wed, Apr 7, 2021 at 12:13 PM osumi.takami...@fujitsu.com wrote:
> > Mainly affected by a commit 9de9294,
> > I've fixed minor things to rebase the patch.
> > All modifications I did are cosmetic changes and
> > a little bit of documentation updates.
> > Please have a look at attached v09.
>
> Patch is not applying on Head, kindly post a rebased version. As this
> patch is in Ready for Committer state, it will help one of the
> committers to pick up this patch during commit fest.

Unless there's actually a committer who is interested and willing to take this on
(understanding that multiple committers have already said they don't agree with this
approach), I don't think it makes sense to spend additional time on this.

Rather than RfC, the appropriate status seems like it should be Rejected, as otherwise
it's just encouraging someone to ultimately waste their time rebasing and updating the
patch when it isn't going to ever actually be committed.

Thanks,

Stephen
Re: Increase value of OUTER_VAR
David Rowley writes: > Is this really sane? > As much as I would like to see the 65k limit removed, I just have > reservations about fixing it in this way. Even if we get all the > cases fixed in core, there's likely a whole bunch of extensions > that'll have bugs as a result of this for many years to come. Maybe. I'm not that concerned about planner hacking: almost all of the planner is only concerned with pre-setrefs.c representations and will never see these values. Still, the fact that we had to inject a couple of explicit IS_SPECIAL_VARNO tests is a bit worrisome. (I'm more surprised really that noplace in the executor needed it.) FWIW, experience with those places says that such bugs will be exposed immediately; it's not like they'd lurk undetected "for years". You might argue that the int-vs-Index declaration changes are something that would be much harder to get right, but in reality those are almost entirely cosmetic. We could make them completely so by changing the macro to #define IS_SPECIAL_VARNO(varno) ((int) (varno) < 0) so that it'd still do the right thing when applied to a variable declared as Index. (In the light of morning, I'm not sure why I didn't do that already.) But we've always been extremely cavalier about whether RT indexes should be declared as int or Index, so I felt that standardizing on the former was actually a good side-effect of the patch. Anyway, to address your point more directly: as I recall, the main objection to just increasing the values of these constants was the fear that it'd bloat bitmapsets containing these values. Now on the one hand, this patch has proven that noplace in the core code does that today. On the other hand, there's no certainty that someone might not try to do that tomorrow (if we don't fix it as per this patch); or extensions might be doing so. > I'm really just not sure it's worth all the dev hours fixing the > fallout. To me, it seems much safer to jump bump 65k up to 1m. It'll > be a while before anyone complains about that. TBH, if we're to approach it that way, I'd be inclined to go for broke and raise the values to ~2B. Then (a) we'll be shut of the problem pretty much permanently, and (b) if someone does try to make a bitmapset containing these values, hopefully they'll see performance bad enough to expose the issue immediately. > It's also not that great to see the number of locations that you > needed to add run-time checks for negative varnos. That's not going to > come for free. Since the test is just "< 0", I pretty much disbelieve that argument. There are only two such places in the patch, and neither of them are *that* performance-sensitive. Anyway, the raise-the-values solution does have the advantage of being a four-liner, so I can live with it if that's the consensus. But I do think this way is cleaner in the long run, and I doubt the argument that it'll create any hard-to-detect bugs. regards, tom lane
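For concreteness, here is a sketch (not an actual patch) of what the "raise the values"
alternative might look like for the special varno constants in primnodes.h; the
replacement numbers are purely illustrative, the point being only that they would sit
near INT_MAX rather than at ~65000:

/* sketch only: pushing the special varnos up near INT_MAX instead of making them negative */
#define    INNER_VAR	0x7FFFFFFD	/* was 65000 */
#define    OUTER_VAR	0x7FFFFFFE	/* was 65001 */
#define    INDEX_VAR	0x7FFFFFFF	/* was 65002 */

#define IS_SPECIAL_VARNO(varno)		((varno) >= INNER_VAR)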
Re: PostgreSQL-13.3 parser.y with positional references by named references
Domingo Alvarez Duarte writes: > Here https://gist.github.com/mingodad/49291e0e9505522c66fcd3fcea4a939d I > posted the postgresql-13.3/src/backend/parser/gram.y with positional > references by named references that is supported by bison for some time now. When is "some time now"? Currently, we support bison versions back to 1.875. While we'd be willing to raise that bar as soon as a good reason to do so comes along, I'm not sure that getting rid of $N notation is a sufficient reason. Indeed, I'd say getting rid of $$ is a strict loss; the changes you show make actions much more verbose but certainly not any more readable. Having a special notation for a rule's output seems to me like a good thing not a bad one. The examples of named notation in the Bison docs don't seem like unconditional wins either; they're not very concise, and the contortions you're forced into when the same nonterminal type is used more than once in a rule are just horrid. I do see the point about it being annoying to update $N references when a rule is changed. But this solution has enough downsides that I'm not sure it's a net win. Maybe if it were applied selectively, to just the longer DDL productions, it'd be worth doing? regards, tom lane
Re: rand48 replacement
> Now suppose we want a random number in the range [0,6). This is what
> happens with your algorithm for each of the possible prng() return values:
>
> prng() returns 0 -- OK
> prng() returns 1 -- OK
> prng() returns 2 -- OK
> prng() returns 3 -- OK
> prng() returns 4 -- OK
> prng() returns 5 -- OK
> prng() returns 6 -- out of range so use splitmix3() with initial state=6:
>   state=6 -> 3, return 7 -- still out of range, so repeat
>   state=3 -> 0, return 3 -- now OK
> prng() returns 7 -- out of range so use splitmix3() with initial state=7:
>   state=7 -> 4, return 4 -- now OK
>
> So, assuming that prng() chooses each of the 8 possible values with equal
> probability (1/8), the overall result is that the values 0,1,2 and 5 are
> returned with a probability of 1/8, whereas 3 and 4 are returned with a
> probability of 2/8.

Ok, I got that explanation.

> Using the correct implementation of the bitmask algorithm, each iteration
> calls prng() again, so in the end no particular return value is ever more
> likely than any other (hence it's unbiased).

Ok, you're taking into account the number of states of the PRNG, so this finite number
implies some bias on some values if you actually enumerate all possible cases, as you do
above. I was reasoning "as if" the splitmix PRNG was an actual random function, which is
obviously false, but is also somehow a usual (false) assumption with PRNGs, and with this
false assumption my implementation is perfect :-)

The defect of the modulo method for range extraction is that even with an actual (real)
random generator the results would be biased. The bias is in the method itself. Now you
are arguing for a bias linked to the internals of the PRNG. They are not the same in
nature, even if the effect is the same. Also the bias is significant for ranges close to
the limit, which is not the kind of use case I have in mind when looking at pgbench.

If you consider the PRNG internals, then the splitmix extraction function could also be
taken into account. If it is not invertible (I'm unsure), then assuming it is some kind
of hash function, about 1/e of output values would not be reachable, which is yet another
bias that you could argue against. Using the initial PRNG works better only because the
underlying 128 bit state is much larger than the output value. Which is the point of
having a larger state in the first place, anyway.

> As for determinism, the end result is still fully deterministic. For
> example, lets say that prng() returns the following sequence, for some
> initial state:
>
>   1,0,3,0,3,7,4,7,6,6,5,3,7,7,7,0,3,6,5,2,3,3,4,0,0,2,7,4,...
>
> then the bitmask method just returns that sequence with all the 6's and
> 7's removed:
>
>   1,0,3,0,3,4,5,3,0,3,5,2,3,3,4,0,0,2,4,...
>
> and that same sequence will always be returned, when starting from that
> initial seed.

Yes and no. The result is indeed deterministic if you call the function with the same
range. However, if you change the range value in one place then sometimes the state can
advance differently, so the determinism is lost, meaning that it depends on actual range
values.

Attached is a v7 which does as you wish, but also loses the deterministic
non-value-dependent property I was seeking. I would work around that by deriving another
128 bit generator instead of splitmix 64 bit, but that is overkill.
--
Fabien.

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 2c2f149fb0..146b524076 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1188,7 +1188,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
 			 * Found a suitable tuple, so save it, replacing one old tuple
 			 * at random
 			 */
-			int			k = (int) (targrows * sampler_random_fract(rstate.randstate));
+			int			k = (int) (targrows * sampler_random_fract(&rstate.randstate));
 
 			Assert(k >= 0 && k < targrows);
 			heap_freetuple(rows[k]);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fafbab6b02..3009861e45 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5152,7 +5152,7 @@ analyze_row_processor(PGresult *res, int row, PgFdwAnalyzeState *astate)
 		if (astate->rowstoskip <= 0)
 		{
 			/* Choose a random reservoir element to replace. */
-			pos = (int) (targrows * sampler_random_fract(astate->rstate.randstate));
+			pos = (int) (targrows * sampler_random_fract(&astate->rstate.randstate));
 			Assert(pos >= 0 && pos < targrows);
 			heap_freetuple(astate->rows[pos]);
 		}
diff --git a/contrib/tsm_system_rows/tsm_system_rows.c b/contrib/tsm_system_rows/tsm_system_rows.c
index 4996612902..1a46d4b143 100644
--- a/contrib/tsm_system_rows/tsm_system_rows.c
+++ b/contrib/tsm_system_rows/tsm_system_rows.c
@@ -69,7 +69,7 @@ static BlockNumber system_rows_nextsampleblock(SampleScanState *node, BlockNumbe
 static OffsetNumber system_rows_nextsampletuple(SampleScanState *node,
 												BlockNumber blockno, Offs
Re: rand48 replacement
On Sun, 4 Jul 2021 at 17:03, Fabien COELHO wrote: > > > As for determinism, the end result is still fully deterministic. > > The result is indeed deterministic of you call the function with the same > range. However, if you change the range value in one place then sometimes > the state can advance differently, so the determinism is lost, meaning > that it depends on actual range values. Ah yes, that's true. I can trivially reproduce that in other languages too. For example, in python, if I call random.seed(0) and then random.randrange() with the inputs 10,10,10 then the results are 6,6,0. But if the randrange() inputs are 10,1000,10 then the results are 6,776,6. So the result from the 3rd call changes as a result of changing the 2nd input. That's not entirely surprising. The important property of determinism is that if I set a seed, and then make an identical set of calls to the random API, the results will be identical every time, so that it's possible to write tests with predictable/repeatable results. > I would work around that by > deriving another 128 bit generator instead of splitmix 64 bit, but that is > overkill. Not really relevant now, but I'm pretty sure that's impossible to do. You might try it as an interesting academic exercise, but I believe it's a logical impossibility. Regards, Dean
Excessive cost of OpClassCache flushes in CLOBBER_CACHE_ALWAYS mode
Over in [1] it is demonstrated that with CLOBBER_CACHE_ALWAYS enabled, initdb accounts
for a full 50% of the runtime of "make check-world" (well, actually of the buildfarm
cycle, which is not quite exactly that but close). Since initdb certainly doesn't cost
that much normally, I wondered why it is so negatively affected by CCA. Some perf
measuring led me to LookupOpclassInfo, and specifically this bit:

    /*
     * When testing for cache-flush hazards, we intentionally disable the
     * operator class cache and force reloading of the info on each call. This
     * is helpful because we want to test the case where a cache flush occurs
     * while we are loading the info, and it's very hard to provoke that if
     * this happens only once per opclass per backend.
     */
#ifdef CLOBBER_CACHE_ENABLED
    if (debug_invalidate_system_caches_always > 0)
        opcentry->valid = false;
#endif

Diking that out halves initdb's CCA runtime. Turns out it also roughly halves the runtime
of the core regression tests under CCA, so this doesn't explain why initdb seems so
disproportionately affected by CCA. However, seeing that this single choice is accounting
for half the cost of CCA testing, we really have to ask whether it's worth that. This
code was added by my commit 03ffc4d6d of 2007-11-28, about which I wrote:

    Improve test coverage of CLOBBER_CACHE_ALWAYS by having it also force
    reloading of operator class information on each use of LookupOpclassInfo.
    Had this been in place a year ago, it would have helped me find a bug
    in the then-new 'operator family' code. Now that we have a build farm
    member testing CLOBBER_CACHE_ALWAYS on a regular basis, it seems worth
    expending a little bit of effort here.

I'm now a little dubious about my claim that this would have helped find any bugs.
Invalidating a finished OpClassCache entry does not model any real-world scenario,
because as noted elsewhere in LookupOpclassInfo, once such a cache entry is populated it
is kept for the rest of the session. Also, the claim in the comment that we need this to
test a cache flush during load seems like nonsense: if we have
debug_invalidate_system_caches_always turned on, then we'll test the effects of such
flushes throughout the initial population of a cache entry. Doing it over and over again
adds nothing.

So I'm now fairly strongly tempted to remove this code outright (effectively reverting
03ffc4d6d). Another possibility now that we have debug_invalidate_system_caches_always is
to increase the threshold at which this happens, making it more like
CLOBBER_CACHE_RECURSIVE.

Thoughts?

			regards, tom lane

[1] https://www.postgresql.org/message-id/1289102.1625353189%40sss.pgh.pa.us
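If the raise-the-threshold route were taken, the change might look something like the
following sketch (not a proposed patch; it assumes the recursive level of
debug_invalidate_system_caches_always is 3, corresponding to the old
CLOBBER_CACHE_RECURSIVE):

#ifdef CLOBBER_CACHE_ENABLED
	/* only force opclass info reload at recursive clobber levels */
	if (debug_invalidate_system_caches_always > 2)
		opcentry->valid = false;
#endif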
"debug_invalidate_system_caches_always" is too long
As I've been poking around in this area, I find myself growing increasingly annoyed at the new GUC name "debug_invalidate_system_caches_always". It is too d*mn long. It's a serious pain to type in any context where you don't have autocomplete to help you. I've kept referring to this type of testing as CLOBBER_CACHE_ALWAYS testing, even though that name is now obsolete, just because it's so much shorter. I think we need to reconsider this name while we still can. I do agree with the "debug_" prefix given that it's now visible to users. However, it doesn't seem that hard to save some space in the rest of the name. The word "system" is adding nothing of value, and the word "always" seems rather confusing --- if it does something "always", why is there more than one level? So a simple proposal is to rename it to "debug_invalidate_caches". However, I think we should also give serious consideration to "debug_clobber_cache" or "debug_clobber_cache_always" for continuity with past practice (though it still feels like "always" is a good word to lose now). "debug_clobber_caches" is another reasonable variant. Thoughts? regards, tom lane
Re: rand48 replacement
> The important property of determinism is that if I set a seed, and then
> make an identical set of calls to the random API, the results will be
> identical every time, so that it's possible to write tests with
> predictable/repeatable results.

Hmm… I like my stronger determinism definition more than this one :-)

>> I would work around that by deriving another 128 bit generator instead
>> of splitmix 64 bit, but that is overkill.
>
> Not really relevant now, but I'm pretty sure that's impossible to do. You
> might try it as an interesting academic exercise, but I believe it's a
> logical impossibility.

Hmm… I was simply thinking of seeding a new pg_prng_state from the main pg_prng_state
with some transformation, and then iterate over the second PRNG, pretty much like I did
with splitmix, but with 128 bits so that your #states argument does not apply, i.e.
something like:

/* select in a range with bitmask rejection */
uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
	/* always advance state once */
	uint64 next = xoroshiro128ss(state);
	uint64 val;

	if (range >= 2)
	{
		uint64 mask = mask_u64(range-1);

		val = next & mask;

		if (val >= range)
		{
			/* copy and update current prng state */
			pg_prng_state iterator = *state;

			iterator.s0 ^= next;
			iterator.s1 += UINT64CONST(0x9E3779B97f4A7C15);

			/* iterate till val in [0, range) */
			while ((val = xoroshiro128ss(&iterator) & mask) >= range)
				;
		}
	}
	else
		val = 0;

	return val;
}

The initial pseudo-random sequence is left to proceed, and a new PRNG is basically forked
for iterating on the mask, if needed.

--
Fabien.
visibility map corruption
Hi hackers,

We recently ran into an issue where the visibility map of a relation was corrupt, running
Postgres 12.4. The error we'd get when running a SELECT * from this table is:

could not access status of transaction 3704450152
DETAIL: Could not open file "pg_xact/0DCC": No such file or directory.

On the lists I could find several similar reports, but corruption like this could
obviously have a very wide range of root causes; see [1] [2] [3] for example - not all of
them have their root cause known. This particular case was similar to the reported cases
above, but also has some differences.

The following query returns ~21.000 rows, which indicates something inconsistent between
the visibility map and the pd_all_visible flag on the page:

select * from pg_check_frozen('tbl');

Looking at one of the affected pages with pageinspect:

=# SELECT lp,lp_off,lp_flags,lp_len,t_xmin,t_xmax,t_field3,t_ctid,t_infomask2,t_infomask,t_hoff,t_oid
   FROM heap_page_items(get_raw_page('tbl', 726127));

 lp | lp_off | lp_flags | lp_len |   t_xmin   |   t_xmax   | t_field3 |   t_ctid   | t_infomask2 | t_infomask | t_hoff | t_oid
----+--------+----------+--------+------------+------------+----------+------------+-------------+------------+--------+-------
  1 |   6328 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,1) |         249 |       8339 |     56 |     ∅
  2 |   4464 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,2) |         249 |       8339 |     56 |     ∅
  3 |   2600 |        1 |   1864 | 3704450155 | 3704450155 |        1 | (726127,3) |         249 |       8339 |     56 |     ∅
  4 |    680 |        1 |   1920 | 3704450155 | 3704450155 |        1 | (726127,4) |         249 |       8339 |     56 |     ∅

t_infomask shows that the HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID bits are both unset.
This pg_visibility() call shows the inconsistency between the VM and the page, with
PD_ALL_VISIBLE=false:

=# select * from pg_visibility('tbl', 726127);

 all_visible | all_frozen | pd_all_visible
-------------+------------+----------------
 t           | t          | f

Looking at other pages shows the same information.

What's interesting is that out of the affected tuples returned by pg_check_frozen, over
99% belong to one transaction (3704450155 as seen above) and the remaining few are from
one other transaction that occurred at roughly the same time. I find it hard to believe
that this is due to some random bit flipping, because many pages are affected, but the
"incorrect" ones are in total only from two specific transactions which occurred at
roughly the same time. There were also no server crashes or other known failures around
the time of this transaction. I'm not ruling out any other kind of failure still, but I
also cannot really explain how this could have happened.

The server has PG12.4 with full_page_writes=on, data_checksums=off. It's a large
analytics database. The VM inconsistencies also occur on the streaming replicas.

I realize these cases are pretty rare and hard to debug, but I wanted to share the
information I found so far here for reference. Maybe someone has an idea what occurred,
or maybe someone in the future finds it useful when they find something similar.

I have no idea how the inconsistency between the VM and PD_ALL_VISIBLE started - from
looking through the code I can't really find any way this could occur. However, for it to
lead to the problem described here, I believe there should be *no* SELECT that touches
that particular page after the insert/update transaction and before the transaction log
gets truncated. If the page is read before the transaction log gets truncated, then the
hint bit HEAP_XMIN_COMMITTED will get set and future reads will succeed regardless of tx
log truncation.
One of the replicas had this happen to it: the affected pages are identical to the
primary except that the HEAP_XMIN_COMMITTED flag is set (note that the VM inconsistency
is still there on the replica though: PD_ALL_VISIBLE=false even though the VM shows
all_frozen=all_visible=true). But I can query these rows on the replica without issues,
because it doesn't check the tx log when it sees that HEAP_XMIN_COMMITTED is set.

-Floris

[1] https://postgrespro.com/list/thread-id/2422376
[2] https://postgrespro.com/list/thread-id/2501800
[3] https://postgrespro.com/list/thread-id/2321949
Re: visibility map corruption
On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee wrote: > We recently ran into an issue where the visibility map of a relation was > corrupt, running Postgres 12.4. The error we'd get when running a SELECT * > from this table is: > > could not access status of transaction 3704450152 > DETAIL: Could not open file "pg_xact/0DCC": No such file or directory. Have you ever used pg_upgrade on this database? -- Peter Geoghegan
Re: "debug_invalidate_system_caches_always" is too long
On Sun, Jul 04, 2021 at 04:27:13PM -0400, Tom Lane wrote: > As I've been poking around in this area, I find myself growing > increasingly annoyed at the new GUC name > "debug_invalidate_system_caches_always". It is too d*mn long. > It's a serious pain to type in any context where you don't have > autocomplete to help you. I've kept referring to this type of > testing as CLOBBER_CACHE_ALWAYS testing, even though that name is > now obsolete, just because it's so much shorter. I think we need > to reconsider this name while we still can. > > I do agree with the "debug_" prefix given that it's now visible to > users. However, it doesn't seem that hard to save some space in > the rest of the name. The word "system" is adding nothing of value, > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". I agree with all that. The word "always" has been misinformation, given the multiple levels available. > However, I think we should also give serious consideration to > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > with past practice (though it still feels like "always" is a good > word to lose now). "debug_clobber_caches" is another reasonable > variant. https://en.wikipedia.org/wiki/Clobbering refers to cases where storage had no changes to its accessibility but now contains different data. That doesn't match InvalidateSystemCaches() especially well, so I think dropping that word has been a good step. Some other shorter terms could be debug_flush_caches, debug_rebuild_caches, or debug_expire_caches. (debug_caches is tempting, but that may ensnare folks looking for extra logging rather than a big slowdown.)
RE: visibility map corruption
> On Sun, Jul 4, 2021 at 1:44 PM Floris Van Nee > wrote: > > We recently ran into an issue where the visibility map of a relation was > corrupt, running Postgres 12.4. The error we'd get when running a SELECT * > from this table is: > > > > could not access status of transaction 3704450152 > > DETAIL: Could not open file "pg_xact/0DCC": No such file or directory. > > Have you ever used pg_upgrade on this database? > Yes. The last time (from v11 to v12) was in October 2020. The transaction id in the tuples (the one PG is trying to check in the tx log) dated from February 2021. I do believe (but am not 100% certain) that the affected table already existed at the time of the last pg_upgrade though.
Re: visibility map corruption
On Sun, Jul 4, 2021 at 2:26 PM Floris Van Nee wrote: > > Have you ever used pg_upgrade on this database? > > > > Yes. The last time (from v11 to v12) was in October 2020. The transaction id > in the tuples (the one PG is trying to check in the tx log) dated from > February 2021. I do believe (but am not 100% certain) that the affected table > already existed at the time of the last pg_upgrade though. I wonder if it's related to this issue: https://www.postgresql.org/message-id/20210423234256.hwopuftipdmp3...@alap3.anarazel.de Have you increased autovacuum_freeze_max_age from its default? This already sounds like the kind of database where that would make sense. -- Peter Geoghegan
Re: "debug_invalidate_system_caches_always" is too long
On Sun, Jul 04, 2021 at 04:27:13PM -0400, Tom Lane wrote: > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". +1 to remove "always" -- Justin
RE: visibility map corruption
> > I wonder if it's related to this issue: > > https://www.postgresql.org/message- > id/20210423234256.hwopuftipdmp3...@alap3.anarazel.de > > Have you increased autovacuum_freeze_max_age from its default? This > already sounds like the kind of database where that would make sense. > autovacuum_freeze_max_age is increased in our setup indeed (it is set to 500M). However, we do regularly run manual VACUUM (FREEZE) on individual tables in the database, including this one. A lot of tables in the database follow an INSERT-only pattern and since it's not running v13 yet, this meant that these tables would only rarely be touched by autovacuum. Autovacuum would sometimes kick in on some of these tables at the same time at unfortunate moments. Therefore we have some regular job running that VACUUM (FREEZE)s tables with a xact age higher than a (low, 10M) threshold ourselves.
Re: Disable WAL logging to speed up data loading
On Sun, Jul 04, 2021 at 11:02:01AM -0400, Stephen Frost wrote:
> Rather than RfC, the appropriate status seems like it should be
> Rejected, as otherwise it's just encouraging someone to ultimately waste
> their time rebasing and updating the patch when it isn't going to ever
> actually be committed.

Yes, agreed with that.
--
Michael
Re: Can a child process detect postmaster death when in pg_usleep?
On Fri, Jul 02, 2021 at 12:03:07PM +0530, Bharath Rupireddy wrote:
> My bad. I was talking about the cases when do_pg_stop_backup is called
> while the server is in recovery mode i.e. backup_started_in_recovery =
> RecoveryInProgress(); evaluates to true. I'm not sure in these cases
> whether we should replace pg_usleep with WaitLatch. If yes, whether we
> should use procLatch/MyLatch or recoveryWakeupLatch as they are
> currently serving different purposes.

It seems to me that you should re-read the description of recoveryWakeupLatch at the top
of xlog.c and check for which purpose it exists, which is, in this case, to wake up the
startup process to accelerate WAL replay. So do_pg_stop_backup() has no business with it.

Switching pg_stop_backup() to use a latch rather than pg_usleep() has benefits:
- It simplifies the wait event handling.
- The process waiting for the last WAL segment to be archived will be more responsive
  on signals like SIGHUP and on postmaster death.

These don't sound bad to me to apply here, so 0002 could be simplified as attached.
--
Michael

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7890e13d7a..c7c928f50b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11638,9 +11638,11 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
 				reported_waiting = true;
 			}
 
-			pgstat_report_wait_start(WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE);
-			pg_usleep(1000000L);
-			pgstat_report_wait_end();
+			(void) WaitLatch(MyLatch,
+							 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+							 1000L,
+							 WAIT_EVENT_BACKUP_WAIT_WAL_ARCHIVE);
+			ResetLatch(MyLatch);
 
 			if (++waits >= seconds_before_warning)
 			{
RE: Disable WAL logging to speed up data loading
On Monday, July 5, 2021 10:32 AM Michael Paquier wrote: > On Sun, Jul 04, 2021 at 11:02:01AM -0400, Stephen Frost wrote: > > Rather than RfC, the appropriate status seems like it should be > > Rejected, as otherwise it's just encouraging someone to ultimately > > waste their time rebasing and updating the patch when it isn't going > > to ever actually be committed. > > Yes, agreed with that. Thanks for paying attention to this thread. This cannot be helped but I've made the patch status as you suggested, because I judged the community would not accept this idea. Best Regards, Takamichi Osumi
Re: row filtering for logical replication
On Thu, Jul 1, 2021 at 10:43 AM Euler Taveira wrote:
>
> Amit, thanks for rebasing this patch. I already had a similar rebased patch in
> my local tree. A recent patch broke your version v15 so I rebased it.
>
> I like the idea of a simple create_estate_for_relation() function (I fixed an
> oversight regarding GetCurrentCommandId(false) because it is used only for
> read-only purposes). This patch also replaces all references to version 14.
>
> Commit ef948050 made some changes in the snapshot handling. Set the current
> active snapshot might not be required but future changes to allow functions
> will need it.
>
> As the previous patches, it includes commits (0002 and 0003) that are not
> intended to be committed. They are available for test-only purposes.

I have some review comments on the "Row filter for logical replication" patch:

(1) Suggested update to patch comment:
(There are some missing words and things which could be better expressed)

    This feature adds row filtering for publication tables. When a publication is defined
    or modified, rows that don't satisfy a WHERE clause may be optionally filtered out.
    This allows a database or set of tables to be partially replicated. The row filter is
    per table, which allows different row filters to be defined for different tables. A
    new row filter can be added simply by specifying a WHERE clause after the table name.
    The WHERE clause must be enclosed by parentheses.

    The WHERE clause should probably contain only columns that are part of the primary key
    or that are covered by REPLICA IDENTITY. Otherwise, any DELETEs won't be replicated.
    DELETE uses the old row version (that is limited to primary key or REPLICA IDENTITY)
    to evaluate the row filter. INSERT and UPDATE use the new row version to evaluate the
    row filter, hence, you can use any column. If the row filter evaluates to NULL, it
    returns false. For simplicity, functions are not allowed; that could possibly be
    addressed in a future patch.

    If you choose to do the initial table synchronization, only data that satisfies the
    row filters is sent. If the subscription has several publications in which a table has
    been published with different WHERE clauses, rows must satisfy all expressions to be
    copied. If subscriber is a pre-15 version, data synchronization won't use row filters
    if they are defined in the publisher. Previous versions cannot handle row filters.

    If your publication contains a partitioned table, the publication parameter
    publish_via_partition_root determines if it uses the partition row filter (if the
    parameter is false, the default) or the root partitioned table row filter.

(2) Some inconsistent error message wording:

Currently:
    err = _("cannot use subquery in publication WHERE expression");

Suggest changing it to:
    err = _("subqueries are not allowed in publication WHERE expressions");

Other examples from the patch:
    err = _("aggregate functions are not allowed in publication WHERE expressions");
    err = _("grouping operations are not allowed in publication WHERE expressions");
    err = _("window functions are not allowed in publication WHERE expressions");
    errmsg("functions are not allowed in publication WHERE expressions"),
    err = _("set-returning functions are not allowed in publication WHERE expressions");

(3) The current code still allows arbitrary code execution, e.g. via a user-defined
operator:

e.g.
publisher:

CREATE OR REPLACE FUNCTION myop(left_arg INTEGER, right_arg INTEGER)
RETURNS BOOL AS $$
BEGIN
  RAISE NOTICE 'I can do anything here!';
  RETURN left_arg > right_arg;
END;
$$ LANGUAGE PLPGSQL VOLATILE;

CREATE OPERATOR ( PROCEDURE = myop, LEFTARG = INTEGER, RIGHTARG = INTEGER );

CREATE PUBLICATION tap_pub FOR TABLE test_tab WHERE (a 5);

subscriber:

CREATE SUBSCRIPTION tap_sub
  CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub'
  PUBLICATION tap_pub;

Perhaps add the following after the existing shell error-check in make_op():

    /* User-defined operators are not allowed in publication WHERE clauses */
    if (pstate->p_expr_kind == EXPR_KIND_PUBLICATION_WHERE &&
        opform->oid >= FirstNormalObjectId)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("user-defined operators are not allowed in publication WHERE expressions"),
                 parser_errposition(pstate, location)));

Also, I believe it's also allowing user-defined CASTs (so could add a similar check to
the above in transformTypeCast()).

Ideally, it would be preferable to validate/check publication WHERE expressions in one
central place, rather than scattered all over the place, but that might be easier said
than done.

You need to update the patch comment accordingly.

(4) src/backend/replication/pgoutput/pgoutput.c
pgoutput_change()

The 3 added calls to pgoutput_row_filter() are returning from pgoutput_change() if false
is returned, but instead they should break from the switch, otherwise cleanup code is
missed. This is surely a bug.

e.g. (3 similar cases of this)

+ i
Re: "debug_invalidate_system_caches_always" is too long
At Sun, 4 Jul 2021 14:12:34 -0700, Noah Misch wrote in > > However, I think we should also give serious consideration to > > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > > with past practice (though it still feels like "always" is a good > > word to lose now). "debug_clobber_caches" is another reasonable > > variant. > > https://en.wikipedia.org/wiki/Clobbering refers to cases where storage had no > changes to its accessibility but now contains different data. That doesn't > match InvalidateSystemCaches() especially well, so I think dropping that word > has been a good step. Some other shorter terms could be debug_flush_caches, > debug_rebuild_caches, or debug_expire_caches. (debug_caches is tempting, but (I murmur that I think "drop" is also usable here.) > that may ensnare folks looking for extra logging rather than a big slowdown.) I agree to this. (And one more +1 to removing "always".) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: "debug_invalidate_system_caches_always" is too long
On Mon, Jul 5, 2021 at 1:57 AM Tom Lane wrote: > > As I've been poking around in this area, I find myself growing > increasingly annoyed at the new GUC name > "debug_invalidate_system_caches_always". It is too d*mn long. > It's a serious pain to type in any context where you don't have > autocomplete to help you. I've kept referring to this type of > testing as CLOBBER_CACHE_ALWAYS testing, even though that name is > now obsolete, just because it's so much shorter. I think we need > to reconsider this name while we still can. > > I do agree with the "debug_" prefix given that it's now visible to > users. However, it doesn't seem that hard to save some space in > the rest of the name. The word "system" is adding nothing of value, > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". > > However, I think we should also give serious consideration to > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > with past practice (though it still feels like "always" is a good > word to lose now). "debug_clobber_caches" is another reasonable > variant. > > Thoughts? +1 for the "debug_clobber_caches" variant, easy to remember. Regards, Amul
Re: Can a child process detect postmaster death when in pg_usleep?
At Fri, 2 Jul 2021 10:27:21 +0900, Michael Paquier wrote in
> On Mon, Jun 28, 2021 at 11:01:57AM -0400, Tom Lane wrote:
> > Dunno ... I cannot recall ever having had that as a debugging requirement
> > in a couple of decades worth of PG bug-chasing. If the postmaster is
> > dying, you generally want to deal with that before bothering with child
> > processes. Moreover, child processes that don't go awy when the
> > postmaster does are a very nasty problem, because they could screw up
> > subsequent debugging work.
>
> At the same time, nobody has really complained about this being an
> issue for developer options. I would tend to wait for more opinions
> before doing anything with the auth_delay GUCs.

I'm not sure the current behavior is especially useful for debugging. However, I also
don't think it is especially useful for children to respond immediately to postmaster
death during these debug delays, because while being debugged the child doesn't respond
anyway (until control reaches the point of checking for postmaster death), and the delays
must be very short even if someone abuses them on a production system.

On the other hand, there could be a discussion about adopting the convention that any
user-definable sleep must respond to signals, maybe as Thomas mentioned.

So, I don't object to whichever way we go. But if we don't change the behavior, we would
instead need a comment that explains the reason for the pg_usleep.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: "debug_invalidate_system_caches_always" is too long
On Mon, Jul 5, 2021 at 1:57 AM Tom Lane wrote: > > As I've been poking around in this area, I find myself growing > increasingly annoyed at the new GUC name > "debug_invalidate_system_caches_always". It is too d*mn long. > It's a serious pain to type in any context where you don't have > autocomplete to help you. I've kept referring to this type of > testing as CLOBBER_CACHE_ALWAYS testing, even though that name is > now obsolete, just because it's so much shorter. I think we need > to reconsider this name while we still can. > > I do agree with the "debug_" prefix given that it's now visible to > users. However, it doesn't seem that hard to save some space in > the rest of the name. The word "system" is adding nothing of value, > and the word "always" seems rather confusing --- if it does > something "always", why is there more than one level? So a simple > proposal is to rename it to "debug_invalidate_caches". > > However, I think we should also give serious consideration to > "debug_clobber_cache" or "debug_clobber_cache_always" for continuity > with past practice (though it still feels like "always" is a good > word to lose now). "debug_clobber_caches" is another reasonable > variant. > > Thoughts? +1. IMO, debug_clobber_caches is better because it is simple. And also, since the invalidation happens on multiple system caches, debug_clobber_caches is preferable than debug_clobber_cache. Regards, Bharath Rupireddy.
Re: Yet another fast GiST build
I tried reviewing the remaining patches. They seem to work correctly and pass the tests on my laptop. > In this pattern I flipped PointerGetDatum(a) to PointerGetDatum(ra.lower), > because it seems to me correct. I've followed rule of thumb: every sort > function must extract and use "lower" somehow. Though I suspect numeric a > bit. Is it regular varlena? As far as I understand, we cannot use the sortsupport functions from the btree operator classes, because the btree_gist extension handles things differently. This is unfortunate and a source of bugs [1], but we cannot do anything about it here. Given that the lower and upper datums must be the same for leaf nodes, it makes sense to me to compare one of them. Using numeric_cmp() for numeric is in line with using bttextcmp() for text. > + /* > +* Numeric has abbreviation routines in numeric.c, but we don't try to use > +* them here. Maybe later. > +*/ This is also true for text. Perhaps we should add a similar comment there. > PFA patchset with v6 intact + two fixes of discovered issues. > + /* Use byteacmp(), like gbt_bitcmp() does */ We can improve this comment by incorporating Heikki's earlier explanation: > Ok, I think I understand that now. In btree_gist, the *_cmp() function > operates on non-leaf values, and *_lt(), *_gt() et al operate on leaf > values. For all other datatypes, the leaf and non-leaf representation is > the same, but for bit/varbit, the non-leaf representation is different. > The leaf representation is VarBit, and non-leaf is just the bits without > the 'bit_len' field. That's why it is indeed correct for gbt_bitcmp() to > just use byteacmp(), whereas gbt_bitlt() et al compares the 'bit_len' > field separately. That's subtle, and 100% uncommented. I think patch number 3 should be squashed into patch number 1. I couldn't understand patch number 2, "Remove DEBUG1 verification"; that verification seems like something rather useful to keep. [1] https://www.postgresql.org/message-id/flat/201010112055.o9BKtZf7011251%40wwwmaster.postgresql.org
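To make the 'every sort function must extract and use "lower"' rule of thumb concrete for readers who haven't opened the patchset, here is a schematic sketch of a sortsupport comparator that orders keys by their lower bound. The key layout and function names are invented for illustration; this is not code from the patches under review.

```c
#include "postgres.h"

#include "fmgr.h"
#include "utils/sortsupport.h"

PG_MODULE_MAGIC;

/* Invented key layout standing in for a btree_gist lower/upper pair */
typedef struct DemoKey
{
	int32		lower;
	int32		upper;			/* equal to "lower" for leaf entries */
} DemoKey;

/* Order keys by their lower bound, per the rule of thumb above */
static int
demo_ssup_cmp(Datum a, Datum b, SortSupport ssup)
{
	DemoKey    *ka = (DemoKey *) DatumGetPointer(a);
	DemoKey    *kb = (DemoKey *) DatumGetPointer(b);

	if (ka->lower < kb->lower)
		return -1;
	if (ka->lower > kb->lower)
		return 1;
	return 0;
}

PG_FUNCTION_INFO_V1(demo_sortsupport);

/* SQL-callable sortsupport function: all it does is install the comparator */
Datum
demo_sortsupport(PG_FUNCTION_ARGS)
{
	SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

	ssup->comparator = demo_ssup_cmp;
	PG_RETURN_VOID();
}
```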
Re: Transactions involving multiple postgres foreign servers, take 2
On 2021/06/30 10:05, Masahiko Sawada wrote: > On Fri, Jun 25, 2021 at 9:53 AM Masahiro Ikeda > wrote: >> >> Hi Jamison-san, sawada-san, >> >> Thanks for testing! >> >> FWIW, I tested using pgbench with "--rate=" option to know the server >> can execute transactions with stable throughput. As sawada-san said, >> the latest patch resolved second phase of 2PC asynchronously. So, >> it's difficult to control the stable throughput without "--rate=" option. >> >> I also worried what I should do when the error happened because to increase >> "max_prepared_foreign_transaction" doesn't work. Since too overloading may >> show the error, is it better to add the case to the HINT message? >> >> BTW, if sawada-san already develop to run the resolver processes in parallel, >> why don't you measure performance improvement? Although Robert-san, >> Tunakawa-san and so on are discussing what architecture is best, one >> discussion point is that there is a performance risk if adopting asynchronous >> approach. If we have promising solutions, I think we can make the discussion >> forward. > > Yeah, if we can asynchronously resolve the distributed transactions > without worrying about max_prepared_foreign_transaction error, it > would be good. But we will need synchronous resolution at some point. > I think we at least need to discuss it at this point. > > I've attached the new version patch that incorporates the comments > from Fujii-san and Ikeda-san I got so far. We launch a resolver > process per foreign server, committing prepared foreign transactions > on foreign servers in parallel. To get a better performance based on > the current architecture, we can have multiple resolver processes per > foreign server but it seems not easy to tune it in practice. Perhaps > is it better if we simply have a pool of resolver processes and we > assign a resolver process to the resolution of one distributed > transaction one by one? That way, we need to launch resolver processes > as many as the concurrent backends using 2PC.

Thanks for updating the patches. I have tested them on my laptop; the summary is as follows.

(1) The latest patch (v37) improves throughput by about 1.5x compared to v36. I had expected a 2.0x improvement, since the workload has each transaction access two remote servers, but I think the disk is the bottleneck: I couldn't give each PostgreSQL server its own disk. With separate disks, I expect the improvement could reach 2.0x.

(2) With v37, throughput with foreign_twophase_commit = required is about 36% of the foreign_twophase_commit = disabled case. Although the throughput is improved, the absolute performance is not good; that may simply be the fate of 2PC. I think the reason is that the number of WAL writes increases substantially, and disk writes are the bottleneck on my laptop. I would like to see results from richer environments if someone can test there.

(3) v37 has no overhead when foreign_twophase_commit = disabled. On the contrary, performance improved by 3%, which may be within the margin of error.

The test details follow.

# condition
* 1 coordinator and 3 foreign servers
* all 4 instances share one SSD
* each transaction accesses two different foreign servers
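As a reference for anyone trying to reproduce a similar environment, the part1..part3 tables touched by the pgbench script below could be defined on the coordinator roughly like this. This is a sketch only: postgres_fdw, the host/port values, and the table definitions are my assumptions, not taken from the test environment described above, and matching plain tables must exist on each remote server.

```sql
CREATE EXTENSION postgres_fdw;

-- one foreign server per remote instance; repeat for fs2/fs3 with their ports
CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '5433', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;

-- the pgbench script updates part1..part3 by id
CREATE FOREIGN TABLE part1 (id int, md5 text) SERVER fs1;
-- CREATE FOREIGN TABLE part2 (...) SERVER fs2;  and likewise part3 on fs3
```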
```
fxact_update.pgbench
\set id random(1, 100)
\set partnum 3
\set p1 random(1, :partnum)
\set p2 ((:p1 + 1) % :partnum) + 1
BEGIN;
UPDATE part:p1 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
UPDATE part:p2 SET md5 = md5(clock_timestamp()::text) WHERE id = :id;
COMMIT;
```

* pgbench generates the load. I increased ${RATE} little by little until the "maximum number of foreign transactions reached" error happened.

```
pgbench -f fxact_update.pgbench -R ${RATE} -c 8 -j 8 -T 180
```

* parameters

max_prepared_transactions = 100
max_prepared_foreign_transactions = 200
max_foreign_transaction_resolvers = 4

# test source code patterns

1. 2PC patches (v36) based on 6d0eb385 (foreign_twophase_commit = required)
2. 2PC patches (v37) based on 2595e039 (foreign_twophase_commit = required)
3. 2PC patches (v37) based on 2595e039 (foreign_twophase_commit = disabled)
4. 2595e039 without the 2PC patches (v37)

# results

1. tps = 241.8000, latency average = 10.413 ms
2. tps = 359.017519 (about 1.5x compared to 1.; about 36% of 3.), latency average = 15.427 ms
3. tps = 987.372220 (about 1.03x compared to 4.), latency average = 8.102 ms
4. tps = 955.984574, latency average = 8.368 ms

The disk is the bottleneck in my environment, since disk utilization is almost 100% in every pattern. If each instance could be given its own disk, I think we could expect further performance improvements.

>> In my understanding, there are three improvement idea. First is that to make >> the resolver processes run in parallel. Second is that to send "COMMIT/ABORT >> PREPARED" remote servers in bulk. Third is to stop syncing the WAL remove_fdwxact() after resolving is done, whi
Re: rand48 replacement
Fabien COELHO wrote on 2021-07-04 23:29:

The important property of determinism is that if I set a seed, and then make an identical set of calls to the random API, the results will be identical every time, so that it's possible to write tests with predictable/repeatable results.

Hmmm… I like my stronger determinism definition more than this one:-)

I would work around that by deriving another 128 bit generator instead of splitmix 64 bit, but that is overkill.

Not really relevant now, but I'm pretty sure that's impossible to do. You might try it as an interesting academic exercise, but I believe it's a logical impossibility.

Hmmm… I was simply thinking of seeding a new pg_prng_state from the main pg_prng_state with some transformation, and then iterating over the second PRNG, pretty much like I did with splitmix, but with 128 bits so that your #states argument does not apply, i.e. something like:

/* select in a range with bitmask rejection */
uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
	/* always advance state once */
	uint64 next = xoroshiro128ss(state);
	uint64 val;

	if (range >= 2)
	{
		uint64 mask = mask_u64(range-1);

		val = next & mask;

		if (val >= range)
		{
			/* copy and update current prng state */
			pg_prng_state iterator = *state;

			iterator.s0 ^= next;
			iterator.s1 += UINT64CONST(0x9E3779B97f4A7C15);

			/* iterate till val in [0, range) */
			while ((val = xoroshiro128ss(&iterator) & mask) >= range)
				;
		}
	}
	else
		val = 0;

	return val;
}

The initial pseudo-random sequence is left to proceed, and a new PRNG is basically forked for iterating on the mask, if needed.

I believe most "range" values are small, much smaller than UINT32_MAX. In that case, according to [1], the fastest method is Lemire's (I'd take the original version from [2]). Therefore a combined pg_prng_u64_range() could branch on the range value:

uint64 pg_prng_u64_range(pg_prng_state *state, uint64 range)
{
	uint64 val = xoroshiro128ss(state);
	uint64 m;

	/* handle all power-of-two cases (and range == 0) */
	if ((range & (range-1)) == 0)
		return range != 0 ? val & (range-1) : 0;

	if (likely(range < PG_UINT32_MAX/32))
	{
		/*
		 * Daniel Lemire's unbiased range random algorithm based on
		 * rejection sampling:
		 * https://lemire.me/blog/2016/06/30/fast-random-shuffling/
		 */
		m = (uint32) val * range;
		if ((uint32) m < range)
		{
			uint32 t = (-(uint32) range) % (uint32) range;

			while ((uint32) m < t)
				m = (uint32) xoroshiro128ss(state) * range;
		}
		return m >> 32;
	}

	/* Apple's mask method */
	m = mask_u64(range-1);
	val &= m;
	while (val >= range)
		val = xoroshiro128ss(state) & m;
	return val;
}

The mask method could also be faster when the range is close to the mask. For example, a fast check for "range is within 1/1024 of the mask" is

	range < (range/512 + (range&(range*2)))

and then the method choice could look like:

	if (likely(range < UINT32_MAX/32 && range > (range/512 + (range&(range*2)))))

But I don't know whether the additional condition is worth the difference or not.

[1] https://www.pcg-random.org/posts/bounded-rands.html
[2] https://lemire.me/blog/2016/06/30/fast-random-shuffling/

regards, Sokolov Yura
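Both code sketches above rely on mask_u64(), which is not shown in this message. For readers following along, one plausible definition (my guess at the helper, not necessarily what the patch uses) is the usual bit-smearing trick that returns the smallest all-ones mask covering its argument:

```c
#include <stdint.h>

/*
 * Smallest mask of the form 2^k - 1 that covers u, e.g. mask_u64(5) == 7.
 * The callers above pass range-1, so "val & mask" lands in [0, 2^k) where
 * 2^k is the smallest power of two >= range.
 */
static inline uint64_t
mask_u64(uint64_t u)
{
	u |= u >> 1;
	u |= u >> 2;
	u |= u >> 4;
	u |= u >> 8;
	u |= u >> 16;
	u |= u >> 32;
	return u;
}
```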
Re: Add proper planner support for ORDER BY / DISTINCT aggregates
On Friday, July 2, 2021 at 10:39:44 CEST, David Rowley wrote: > On Fri, 2 Jul 2021 at 19:54, Ronan Dunklau wrote: > > I don't know if it's acceptable, but in the case where you add both an > > aggregate with an ORDER BY clause, and another aggregate without the > > clause, the output for the unordered one will change and use the same > > ordering, maybe surprising the unsuspecting user. Would that be acceptable > > ? > > That's a good question. There was an argument in [1] that mentions > that there might be a group of people who rely on aggregation being > done in a certain order where they're not specifying an ORDER BY > clause in the aggregate. If that group of people exists, then it's > possible they might be upset in the scenario that you describe. > > I also think that it's going to be pretty hard to make significant > gains in this area if we are too scared to make changes to undefined > behaviour. You wouldn't have to look too hard in the pgsql-general > mailing list to find someone complaining that their query output is in > the wrong order on some query that does not have an ORDER BY. We > pretty much always tell those people that the order is undefined > without an ORDER BY. I'm not too sure why Tom in [1] classes the ORDER > BY aggregate people any differently. We'll be stuck forever here and > in many other areas if we're too scared to change the order of > aggregation. You could argue that something like parallelism has > changed that for people already. I think the multi-batch Hash > Aggregate code could also change this. I would agree with that. > > > I was curious about the performance implication of that additional > > transition, and could not reproduce a significant difference. I may be > > doing something wrong: how did you highlight it ? > > It was pretty basic. I just created a table with two columns and no > index and did something like SELECT a,SUM(b ORDER BY b) from ab GROUP > BY 1; the new code will include a Sort due to lack of any index and > the old code would have done a sort inside nodeAgg.c. I imagine that > the overhead comes from the fact that in the patched version nodeAgg.c > must ask its subnode (nodeSort.c) for the next tuple each time, > whereas unpatched nodeAgg.c already has all the tuples in a tuplestore > and can fetch them very quickly in a tight loop. OK, I reproduced that case, just without using a GROUP BY: once the GROUP BY is added, a Sort node is added in both cases (master and your patch), except that with your patch we sort on both keys, and that doesn't really incur a performance penalty. I think the overhead occurs because in the ExecAgg case we use the tuplesort_*_datum API as an optimization when we have a single column as input, which the ExecSort code doesn't. Maybe it would be worth trying to use that API in Sort nodes too, when it can be done. > > David > > [1] https://www.postgresql.org/message-id/6538.1522096067%40sss.pgh.pa.us -- Ronan Dunklau
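For anyone who wants to try the comparison being discussed, the test case described above boils down to something like the following. The table name, row count, and column types are assumed; there is no index, so with the patch the ordering has to come from an explicit Sort node, while on master the aggregate sorts its input internally in nodeAgg.c.

```sql
CREATE TABLE ab (a int, b int);
INSERT INTO ab SELECT i % 10, i FROM generate_series(1, 1000000) AS i;
ANALYZE ab;

-- On master the aggregate sorts b itself; with the patch, per the discussion
-- above, the sort on both keys is done by the planner-provided Sort node.
EXPLAIN (COSTS OFF)
SELECT a, sum(b ORDER BY b) FROM ab GROUP BY 1;
```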