date:20221016

On 10/15/22 14:33, Tomas Vondra wrote:
> Hi,
> 
> ...
> 
> There's a bunch of issues with this initial version of the patch,
> usually described in XXX comments in the relevant places.6)
> 
> ...

I forgot to mention one important issue in my list yesterday, and that's
memory consumption. The way the patch is coded now, the new BRIN support
function (brin_minmax_ranges) produces information about *all* ranges in
one go, which may be an issue. The worst case is 32TB table, with 1-page
BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
structs, so this would require ~128GB of RAM. With the default 128-page
ranges, it's still be ~1GB, which is quite a lot.

We could have a discussion about what's the reasonable size of BRIN
ranges on such large tables (e.g. building a bitmap on 4 billion ranges
is going to be "not cheap" so this is likely pretty rare). But we should
not introduce new nodes that ignore work_mem, so we need a way to deal
with such cases somehow.

The easiest solution likely is to check this while planning - we can
check the table size, calculate the number of BRIN ranges, and check
that the range info fits into work_mem, and just not create the path
when it gets too large. That's what we did for HashAgg, although that
decision was unreliable because estimating GROUP BY cardinality is hard.

The wrinkle here is that counting just the range info (BrinRange struct)
does not include the values for by-reference types. We could use average
width - that's just an estimate, though.

A more comprehensive solution seems to be to allow requesting chunks of
the BRIN ranges. So that we'd get "slices" of ranges and we'd process
those. So for example if you have 1000 ranges, and you can only handle
100 at a time, we'd do 10 loops, each requesting 100 ranges.

This has another problem - we do care about "overlaps", and we can't
really know if the overlapping ranges will be in the same "slice"
easily. The chunks would be sorted (for example) by maxval. But there
can be a range with much higher maxval (thus in some future slice), but
very low minval (thus intersecting with ranges in the current slice).

Imagine ranges with these minval/maxval values, sorted by maxval:

[101,200]
[201,300]
[301,400]
[150,500]

and let's say we can only process 2-range slices. So we'll get the first
two, but both of them intersect with the very last range.

We could always include all the intersecting ranges into the slice, but
what if there are too many very "wide" ranges?

So I think this will need to switch to an iterative communication with
the BRIN index - instead of asking "give me info about all the ranges",
we'll need a way to

  - request the next range (sorted by maxval)
  - request the intersecting ranges one by one (sorted by minval)

Of course, the BRIN side will have some of the same challenges with
tracking the info without breaking the work_mem limit, but I suppose it
can store the info into a tuplestore/tuplesort, and use that instead of
plain in-memory array. Alternatively, it could just return those, and
BrinSort would use that. OTOH it seems cleaner to have some sort of API,
especially if we want to support e.g. minmax-multi opclasses, that have
a more complicated concept of "intersection".

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Zhihong Yu

On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra 
wrote:

> On 10/15/22 14:33, Tomas Vondra wrote:
> > Hi,
> >
> > ...
> >
> > There's a bunch of issues with this initial version of the patch,
> > usually described in XXX comments in the relevant places.6)
> >
> > ...
>
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption. The way the patch is coded now, the new BRIN support
> function (brin_minmax_ranges) produces information about *all* ranges in
> one go, which may be an issue. The worst case is 32TB table, with 1-page
> BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
> structs, so this would require ~128GB of RAM. With the default 128-page
> ranges, it's still be ~1GB, which is quite a lot.
>
> We could have a discussion about what's the reasonable size of BRIN
> ranges on such large tables (e.g. building a bitmap on 4 billion ranges
> is going to be "not cheap" so this is likely pretty rare). But we should
> not introduce new nodes that ignore work_mem, so we need a way to deal
> with such cases somehow.
>
> The easiest solution likely is to check this while planning - we can
> check the table size, calculate the number of BRIN ranges, and check
> that the range info fits into work_mem, and just not create the path
> when it gets too large. That's what we did for HashAgg, although that
> decision was unreliable because estimating GROUP BY cardinality is hard.
>
> The wrinkle here is that counting just the range info (BrinRange struct)
> does not include the values for by-reference types. We could use average
> width - that's just an estimate, though.
>
> A more comprehensive solution seems to be to allow requesting chunks of
> the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> those. So for example if you have 1000 ranges, and you can only handle
> 100 at a time, we'd do 10 loops, each requesting 100 ranges.
>
> This has another problem - we do care about "overlaps", and we can't
> really know if the overlapping ranges will be in the same "slice"
> easily. The chunks would be sorted (for example) by maxval. But there
> can be a range with much higher maxval (thus in some future slice), but
> very low minval (thus intersecting with ranges in the current slice).
>
> Imagine ranges with these minval/maxval values, sorted by maxval:
>
> [101,200]
> [201,300]
> [301,400]
> [150,500]
>
> and let's say we can only process 2-range slices. So we'll get the first
> two, but both of them intersect with the very last range.
>
> We could always include all the intersecting ranges into the slice, but
> what if there are too many very "wide" ranges?
>
> So I think this will need to switch to an iterative communication with
> the BRIN index - instead of asking "give me info about all the ranges",
> we'll need a way to
>
>   - request the next range (sorted by maxval)
>   - request the intersecting ranges one by one (sorted by minval)
>
> Of course, the BRIN side will have some of the same challenges with
> tracking the info without breaking the work_mem limit, but I suppose it
> can store the info into a tuplestore/tuplesort, and use that instead of
> plain in-memory array. Alternatively, it could just return those, and
> BrinSort would use that. OTOH it seems cleaner to have some sort of API,
> especially if we want to support e.g. minmax-multi opclasses, that have
> a more complicated concept of "intersection".
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
> Hi,
In your example involving [150,500], can this range be broken down into 4
ranges, ending in 200, 300, 400 and 500, respectively ?
That way, there is no intersection among the ranges.

bq. can store the info into a tuplestore/tuplesort

Wouldn't this involve disk accesses which may reduce the effectiveness of
BRIN sort ?

Cheers

Re: PATCH: Using BRIN indexes for sorted output




On 10/16/22 03:36, Zhihong Yu wrote:
> 
> 
> On Sat, Oct 15, 2022 at 8:23 AM Tomas Vondra
> mailto:tomas.von...@enterprisedb.com>>
> wrote:
> 
> On 10/15/22 15:46, Zhihong Yu wrote:
> >...
> >     8) Parallel version is not supported, but I think it shouldn't be
> >     possible. Just make the leader build the range info, and then
> let the
> >     workers to acquire/sort ranges and merge them by Gather Merge.
> > ...
> > Hi,
> > I am still going over the patch.
> >
> > Minor: for #8, I guess you meant `it should be possible` .
> >
> 
> Yes, I meant to say it should be possible. Sorry for the confusion.
> 
> 
> 
> regards
> 
> -- 
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com 
> The Enterprise PostgreSQL Company
> 
> Hi,
> 
> For brin_minmax_ranges, looking at the assignment to gottuple and
> reading gottuple, it seems variable gottuple can be omitted - we can
> check tup directly.
> 
> +   /* Maybe mark the range as processed. */
> +   range->processed |= mark_processed;
> 
> `Maybe` can be dropped.
> 

No, because the "mark_processed" may be false. So we may not mark it as
processed in some cases.

> For brinsort_load_tuples(), do we need to check for interrupts inside
> the loop ?
> Similar question for subsequent methods involving loops, such
> as brinsort_load_unsummarized_ranges.
> 

We could/should, although most of the loops should be very short.


regrds

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output




On 10/16/22 16:01, Zhihong Yu wrote:
> 
> 
> On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
> mailto:tomas.von...@enterprisedb.com>>
> wrote:
> 
> On 10/15/22 14:33, Tomas Vondra wrote:
> > Hi,
> >
> > ...
> >
> > There's a bunch of issues with this initial version of the patch,
> > usually described in XXX comments in the relevant places.6)
> >
> > ...
> 
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption. The way the patch is coded now, the new BRIN support
> function (brin_minmax_ranges) produces information about *all* ranges in
> one go, which may be an issue. The worst case is 32TB table, with 1-page
> BRIN ranges, which means ~4 billion ranges. The info is an array of ~32B
> structs, so this would require ~128GB of RAM. With the default 128-page
> ranges, it's still be ~1GB, which is quite a lot.
> 
> We could have a discussion about what's the reasonable size of BRIN
> ranges on such large tables (e.g. building a bitmap on 4 billion ranges
> is going to be "not cheap" so this is likely pretty rare). But we should
> not introduce new nodes that ignore work_mem, so we need a way to deal
> with such cases somehow.
> 
> The easiest solution likely is to check this while planning - we can
> check the table size, calculate the number of BRIN ranges, and check
> that the range info fits into work_mem, and just not create the path
> when it gets too large. That's what we did for HashAgg, although that
> decision was unreliable because estimating GROUP BY cardinality is hard.
> 
> The wrinkle here is that counting just the range info (BrinRange struct)
> does not include the values for by-reference types. We could use average
> width - that's just an estimate, though.
> 
> A more comprehensive solution seems to be to allow requesting chunks of
> the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> those. So for example if you have 1000 ranges, and you can only handle
> 100 at a time, we'd do 10 loops, each requesting 100 ranges.
> 
> This has another problem - we do care about "overlaps", and we can't
> really know if the overlapping ranges will be in the same "slice"
> easily. The chunks would be sorted (for example) by maxval. But there
> can be a range with much higher maxval (thus in some future slice), but
> very low minval (thus intersecting with ranges in the current slice).
> 
> Imagine ranges with these minval/maxval values, sorted by maxval:
> 
> [101,200]
> [201,300]
> [301,400]
> [150,500]
> 
> and let's say we can only process 2-range slices. So we'll get the first
> two, but both of them intersect with the very last range.
> 
> We could always include all the intersecting ranges into the slice, but
> what if there are too many very "wide" ranges?
> 
> So I think this will need to switch to an iterative communication with
> the BRIN index - instead of asking "give me info about all the ranges",
> we'll need a way to
> 
>   - request the next range (sorted by maxval)
>   - request the intersecting ranges one by one (sorted by minval)
> 
> Of course, the BRIN side will have some of the same challenges with
> tracking the info without breaking the work_mem limit, but I suppose it
> can store the info into a tuplestore/tuplesort, and use that instead of
> plain in-memory array. Alternatively, it could just return those, and
> BrinSort would use that. OTOH it seems cleaner to have some sort of API,
> especially if we want to support e.g. minmax-multi opclasses, that have
> a more complicated concept of "intersection".
> 
> 
> regards
> 
> -- 
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com 
> The Enterprise PostgreSQL Company
> 
> Hi,
> In your example involving [150,500], can this range be broken down into
> 4 ranges, ending in 200, 300, 400 and 500, respectively ?
> That way, there is no intersection among the ranges.
> 

Not really, I think. These "value ranges" map to "page ranges" and how
would you split those? I mean, you know values [150,500] map to blocks
[0,127]. You split the values into [150,200], [201,300], [301,400]. How
do you split the page range [0,127]?

Also, splitting a range into more ranges is likely making the issue
worse, because it increases the number of ranges, right? And I mean,
much worse, because imagine a "wide" range that overlaps with every
other range - the number of ranges would explode.

It's not clear to me at which point you'd make the split. At the
beginning, right after loading the ranges from BRIN index? A lot of that
may be unnecessary, in case the range is loaded as a "non-intersecting"
range.

Try to formulate the whole algorithm. Maybe I'm missing something.

The current algorithm is something like thi

Re: PATCH: Using BRIN indexes for sorted output

Tomas Vondra  writes:
> I forgot to mention one important issue in my list yesterday, and that's
> memory consumption.

TBH, this is all looking like vastly more complexity than benefit.
It's going to be impossible to produce a reliable cost estimate
given all the uncertainty, and I fear that will end in picking
BRIN-based sorting when it's not actually a good choice.

The examples you showed initially are cherry-picked to demonstrate
the best possible case, which I doubt has much to do with typical
real-world tables.  It would be good to see what happens with
not-perfectly-sequential data before even deciding this is worth
spending more effort on.  It also seems kind of unfair to decide
that the relevant comparison point is a seqscan rather than a
btree indexscan.

regards, tom lane

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Zhihong Yu

On Sun, Oct 16, 2022 at 7:33 AM Tomas Vondra 
wrote:

>
>
> On 10/16/22 16:01, Zhihong Yu wrote:
> >
> >
> > On Sun, Oct 16, 2022 at 6:51 AM Tomas Vondra
> > mailto:tomas.von...@enterprisedb.com>>
> > wrote:
> >
> > On 10/15/22 14:33, Tomas Vondra wrote:
> > > Hi,
> > >
> > > ...
> > >
> > > There's a bunch of issues with this initial version of the patch,
> > > usually described in XXX comments in the relevant places.6)
> > >
> > > ...
> >
> > I forgot to mention one important issue in my list yesterday, and
> that's
> > memory consumption. The way the patch is coded now, the new BRIN
> support
> > function (brin_minmax_ranges) produces information about *all*
> ranges in
> > one go, which may be an issue. The worst case is 32TB table, with
> 1-page
> > BRIN ranges, which means ~4 billion ranges. The info is an array of
> ~32B
> > structs, so this would require ~128GB of RAM. With the default
> 128-page
> > ranges, it's still be ~1GB, which is quite a lot.
> >
> > We could have a discussion about what's the reasonable size of BRIN
> > ranges on such large tables (e.g. building a bitmap on 4 billion
> ranges
> > is going to be "not cheap" so this is likely pretty rare). But we
> should
> > not introduce new nodes that ignore work_mem, so we need a way to
> deal
> > with such cases somehow.
> >
> > The easiest solution likely is to check this while planning - we can
> > check the table size, calculate the number of BRIN ranges, and check
> > that the range info fits into work_mem, and just not create the path
> > when it gets too large. That's what we did for HashAgg, although that
> > decision was unreliable because estimating GROUP BY cardinality is
> hard.
> >
> > The wrinkle here is that counting just the range info (BrinRange
> struct)
> > does not include the values for by-reference types. We could use
> average
> > width - that's just an estimate, though.
> >
> > A more comprehensive solution seems to be to allow requesting chunks
> of
> > the BRIN ranges. So that we'd get "slices" of ranges and we'd process
> > those. So for example if you have 1000 ranges, and you can only
> handle
> > 100 at a time, we'd do 10 loops, each requesting 100 ranges.
> >
> > This has another problem - we do care about "overlaps", and we can't
> > really know if the overlapping ranges will be in the same "slice"
> > easily. The chunks would be sorted (for example) by maxval. But there
> > can be a range with much higher maxval (thus in some future slice),
> but
> > very low minval (thus intersecting with ranges in the current slice).
> >
> > Imagine ranges with these minval/maxval values, sorted by maxval:
> >
> > [101,200]
> > [201,300]
> > [301,400]
> > [150,500]
> >
> > and let's say we can only process 2-range slices. So we'll get the
> first
> > two, but both of them intersect with the very last range.
> >
> > We could always include all the intersecting ranges into the slice,
> but
> > what if there are too many very "wide" ranges?
> >
> > So I think this will need to switch to an iterative communication
> with
> > the BRIN index - instead of asking "give me info about all the
> ranges",
> > we'll need a way to
> >
> >   - request the next range (sorted by maxval)
> >   - request the intersecting ranges one by one (sorted by minval)
> >
> > Of course, the BRIN side will have some of the same challenges with
> > tracking the info without breaking the work_mem limit, but I suppose
> it
> > can store the info into a tuplestore/tuplesort, and use that instead
> of
> > plain in-memory array. Alternatively, it could just return those, and
> > BrinSort would use that. OTOH it seems cleaner to have some sort of
> API,
> > especially if we want to support e.g. minmax-multi opclasses, that
> have
> > a more complicated concept of "intersection".
> >
> >
> > regards
> >
> > --
> > Tomas Vondra
> > EnterpriseDB: http://www.enterprisedb.com <
> http://www.enterprisedb.com>
> > The Enterprise PostgreSQL Company
> >
> > Hi,
> > In your example involving [150,500], can this range be broken down into
> > 4 ranges, ending in 200, 300, 400 and 500, respectively ?
> > That way, there is no intersection among the ranges.
> >
>
> Not really, I think. These "value ranges" map to "page ranges" and how
> would you split those? I mean, you know values [150,500] map to blocks
> [0,127]. You split the values into [150,200], [201,300], [301,400]. How
> do you split the page range [0,127]?
>
> Also, splitting a range into more ranges is likely making the issue
> worse, because it increases the number of ranges, right? And I mean,
> much worse, because imagine a "wide" range that overlaps with every
> other range - the number of ranges would explode.
>
> It's not clear to me at which point you'

Re: PATCH: Using BRIN indexes for sorted output

On 10/16/22 16:41, Tom Lane wrote:
> Tomas Vondra  writes:
>> I forgot to mention one important issue in my list yesterday, and that's
>> memory consumption.
> 
> TBH, this is all looking like vastly more complexity than benefit.
> It's going to be impossible to produce a reliable cost estimate
> given all the uncertainty, and I fear that will end in picking
> BRIN-based sorting when it's not actually a good choice.
> 

Maybe. If it turns out the estimates we have are insufficient to make
good planning decisions, that's life.

As I wrote in my message, I know the BRIN costing is a bit shaky in
general (not just for this new operation), and I intend to propose some
improvement in a separate patch.

I think the main issue with BRIN costing is that we have no stats about
the ranges, and we can't estimate how many ranges we'll really end up
accessing. If you have 100 rows, will that be 1 range or 100 ranges? Or
for the BRIN Sort, how many overlapping ranges will there be?

I intend to allow index AMs to collect custom statistics, and the BRIN
minmax opfamily would collect e.g. this:

1) number of non-summarized ranges
2) number of all-nulls ranges
3) number of has-nulls ranges
4) average number of overlaps (given a random range, how many other
   ranges intersect with it)
5) how likely is it for a row to hit multiple ranges (cross-check
   sample rows vs. ranges)

I believe this will allow much better / more reliable BRIN costing (the
number of overlaps is particularly useful for the this patch).

> The examples you showed initially are cherry-picked to demonstrate
> the best possible case, which I doubt has much to do with typical
> real-world tables.  It would be good to see what happens with
> not-perfectly-sequential data before even deciding this is worth
> spending more effort on.

Yes, the example was trivial "happy case" example. Obviously, the
performance degrades as the data become more random (with ranges wider),
forcing the BRIN Sort to read / sort more tuples.

But let's see an example with less correlated data, say, like this:

create table t (a int) with (fillfactor = 10);

insert into t select i + 1 * random()
from generate_series(1,1000) s(i);

With the fillfactor=10, there are ~2500 values per 1MB range, so this
means each range overlaps with ~4 more. The results then look like this:

1) select * from t order by a;

  seqscan+sort: 4437 ms
  brinsort: 4233 ms

2) select * from t order by a limit 10;

  seqscan+sort: 1859 ms
  brinsort: 4 ms

If you increase the random factor from 1 to 10 (so, 40 ranges),
the seqscan timings remain about the same, while brinsort gets to 5200
and 20 ms. And with 1M, it's ~6000 and 300 ms.

Only at 500, where we pretty much read 1/2 the table because the
ranges intersect, we get the same timing as the seqscan (for the LIMIT
query). The "full sort" query is more like 5000 vs. 6600 ms, so slower
but not by a huge amount.

Yes, this is a very simple example. I can do more tests with other
datasets (larger/smaller, different distribution, ...).

> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.
> 

I don't think it's all that unfair. How likely is it to have both a BRIN
and btree index on the same column? And even if you do have such indexes
(say, on different sets of keys), we kinda already have this costing
issue with index and bitmap index scans.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: PATCH: Using BRIN indexes for sorted output

On 10/16/22 16:42, Zhihong Yu wrote:
> ...
> 
> I don't have good answer w.r.t. splitting the page range [0,127] now.
> Let me think more about it.
> 

Sure, no problem.

> The 10 step flow (subject to changes down the road) should be either
> given in the description of the patch or, written as comment inside the
> code.
> This would help people grasp the concept much faster.

True. I'll add it to the next version of the pach.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: use has_privs_of_role() for pg_hba.conf

2022-10-16 Thread Stephen Frost

Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Sat, Oct 08, 2022 at 11:46:50AM -0400, Robert Haas wrote:
> > Now there may be some other scenario in which the patch is going in
> > exactly the right direction, and if I knew what it was, maybe I'd
> > agree that the patch was a great idea. But I haven't seen anything
> > like that on the thread. Basically, the argument is just that the
> > change would make things more consistent. However, it might be an
> > abuse of the term. If you go out and buy blue curtains because you
> > have a blue couch, that's consistent interior decor. If you go out and
> > buy a blue car because you have a blue couch, that's not really
> > consistent anything, it's just two fairly-unrelated things that are
> > both blue.
> 
> I believe I started this thread after reviewing the remaining uses of
> is_member_of_role() after 6198420 was committed and wondering whether this
> case was an oversight.  If upon closer inspection we think that mere
> membership is appropriate for pg_hba.conf, I'm fully prepared to go and
> mark this commitfest entry as Rejected.  It obviously does not seem as
> clear-cut as 6198420.  And I'll admit I don't have a concrete use-case in
> hand to justify the behavior change.

Looks like we've already ended up there, but my recollection of this is
that it was very much intentional to use is_member_of_role() here.
Perhaps it should have been better commented (as all uses of
is_member_of_role() instead of has_privs_of_role() really should have
lots of comments as to exactly why it makes sense in those cases).

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: allowing for control over SET ROLE

2022-10-16 Thread Stephen Frost

Greetings,

* Nathan Bossart (nathandboss...@gmail.com) wrote:
> On Fri, Sep 30, 2022 at 04:34:32PM -0400, Robert Haas wrote:
> > That thread has not reached an entirely satisfying conclusion.
> > However, the behavior that was deemed outright buggy over there has
> > been fixed. The remaining question is what to do about commands that
> > allow you to give objects to other users (like ALTER  ..
> > OWNER TO) or commands that allow you to create objects owned by other
> > users (like CREATE DATABASE ... OWNER). I have, in this version,
> > adopted the proposal by Wolfgang Walther on the other thread to make
> > this controlled by the new SET option. This essentially takes the view
> > that the ability to create objects owned by another user is not
> > precisely a privilege, and is thus not inherited just because the
> > INHERIT option is set on the GRANT, but it is something you can do if
> > you could SET ROLE to that role, so we make it dependent on the SET
> > option. This logic is certainly debatable, but it does have the
> > practical advantage of making INHERIT TRUE, SET FALSE a useful
> > combination of settings for predefined roles. It's also 100%
> > backward-compatible, whereas if we made the behavior dependent on the
> > INHERIT option, users could potentially notice behavior changes after
> > upgrading to v16.
> 
> I'm not sure about tying the ownership stuff to this new SET privilege.
> While you noted some practical advantages, I'd expect users to find it kind
> of surprising.  Also, for predefined roles, I think you need to be careful
> about distributing ADMIN, as anyone with ADMIN on a predefined role can
> just GRANT SET to work around the restrictions.  I don't have a better
> idea, though, so perhaps neither of these things is a deal-breaker.  I was
> tempted to suggest using ADMIN instead of SET for the ownership stuff, but
> that wouldn't be backward-compatible, and you'd still be able to work
> around it to some extent with SET (e.g., SET ROLE followed by CREATE
> DATABASE).

As we work through splitting up the privileges and managing them in a
more fine-grained way, it seems clear that we'll need to have a similar
split for ADMIN rights on roles- that is, we'll need to be able to
say "role X is allowed to GRANT INHERIT for role Y to other roles, but
not SET".

I'm still half-tempted to say that predefined roles should just be dealt
with as a special case.. but if we split ADMIN in the manner as
described above then maybe we could get away with not having to, but it
would depend a great deal of people actually reading the documentation
and I'm concerned that's a bit too much to ask in this case.

That is- the first person who is likely to GRANT out ADMIN rights in a
predefined role is going to be a superuser.  To avoid breaking backwards
compatibility, GRANT'ing of ADMIN needs to GRANT all the partial-ADMIN
rights that exist, or at least exist today, which includes both SET and
INHERIT.  Unless we put some kind of special case for predefined roles
where we throw an error or at least a warning when a superuser
(presumably) inadvertantly does a simple GRANT ADMIN for $predefined
role, we're going to end up in the situation where folks can SET ROLE to
a predefined role and do things that they really shouldn't be allowed
to.

We could, of course, very clearly document that the way to GRANT ADMIN
rights for a predefined role is to always make sure to *only* GRANT
ADMIN/INHERIT, but again I worry that it simply wouldn't be followed in
many cases.  Perhaps we could arrange for the bootstrap superuser to
only be GRANT'd ADMIN/INHERIT for predefined roles and then not have an
explicit cut-out for superuser doing a GRANT on predefined roles or
perhaps having such be protected under allow_system_table_mods under the
general consideration that modifying of predefined roles isn't something
that folks should be doing post-initdb.

Just a few thoughts on this, not sure any of these ideas are great but
perhaps this helps move us forward.

Thanks,

Stephen

signature.asc
Description: PGP signature

Re: macos ventura SDK spews warnings

Andres Freund  writes:
> On 2022-10-15 21:00:00 -0400, Tom Lane wrote:
>> After further thought, I think the best compromise is just that:
>> 
>> (1) apply s/sprintf/snprintf/ patch in branches back to v12, where
>> we began to require C99.
>> 
>> (2) in v11 and back to 9.2, enable -Wno-deprecated if available.

> Makes sense to me.

I remembered another reason why v12 should be a cutoff: it's where
we started to use snprintf.c everywhere.  In prior branches, there'd
be a lot of complaints about sprintf elsewhere in the tree.

So I pushed (1), but on the way to testing (2), I discovered a totally
independent problem with the 13.0 SDK in older branches:

In file included from ../../../src/include/postgres.h:46:
In file included from ../../../src/include/c.h:1387:
In file included from ../../../src/include/port.h:17:
In file included from 
/Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/netdb.h:91:
In file included from 
/Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/netinet/in.h:81:
/Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/sys/socket.h:471:1:
 error: expected ';' after top level declarator
__CCT_DECLARE_CONSTRAINED_PTR_TYPES(struct sockaddr_storage, sockaddr_storage);
^

This is apparently some sort of inclusion-order problem, which is probably
a bug in the SDK --- netdb.h and netinet/in.h are the same as they were in
SDK 12.3, but sys/socket.h has a few additions including this
__CCT_DECLARE_CONSTRAINED_PTR_TYPES macro, and evidently that's missing
something it needs.  I haven't traced down the cause of the problem yet.
It fails to manifest in v15 and HEAD, which I bisected to

98e93a1fc93e9b54eb477d870ec744e9e1669f34 is the first new commit
commit 98e93a1fc93e9b54eb477d870ec744e9e1669f34
Author: Tom Lane 
Date:   Tue Jan 11 13:46:12 2022 -0500

Clean up messy API for src/port/thread.c.

The point of this patch is to reduce inclusion spam by not needing
to #include  or  in port.h (which is read by every
compile in our tree).  To do that, we must remove port.h's
declarations of pqGetpwuid and pqGethostbyname.

I doubt we want to back-patch that, so what we'll probably end up with
is adding some #includes to port.h in the back branches.  Bleah.
Or we could file a bug with Apple and hope they fix it quickly.
(They might, actually, because this SDK is supposedly beta.)

regards, tom lane

Re: macos ventura SDK spews warnings

I wrote:
> So I pushed (1), but on the way to testing (2), I discovered a totally
> independent problem with the 13.0 SDK in older branches:

> In file included from ../../../src/include/postgres.h:46:
> In file included from ../../../src/include/c.h:1387:
> In file included from ../../../src/include/port.h:17:
> In file included from 
> /Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/netdb.h:91:
> In file included from 
> /Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/netinet/in.h:81:
> /Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/sys/socket.h:471:1:
>  error: expected ';' after top level declarator
> __CCT_DECLARE_CONSTRAINED_PTR_TYPES(struct sockaddr_storage, 
> sockaddr_storage);
> ^

Ah, I see it.  This is not failing everywhere, only in gram.y and
associated files, and it happens because those have a #define for REF,
which is breaking this constrained_ctypes stuff:

/Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/sys/socket.h:471:1:
 error: expected ';' after top level declarator
__CCT_DECLARE_CONSTRAINED_PTR_TYPES(struct sockaddr_storage, sockaddr_storage);
^
/Library/Developer/CommandLineTools/SDKs/MacOSX13.0.sdk/usr/include/sys/constrained_ctypes.h:588:101:
 note: expanded from macro '__CCT_DECLARE_CONSTRAINED_PTR_TYPES'
__CCT_DECLARE_CONSTRAINED_PTR_TYPE(basetype, basetag, REF); 
\

^

Now on the one hand Apple is pretty clearly violating user namespace
by using a name like "REF", and I'll go file a bug about that.
On the other hand, #defining something like "REF" isn't very bright
on our part either.  We usually write something like REF_P when
there is a danger of parser tokens colliding with other names.

I think the correct, future-proof fix is to s/REF/REF_P/ in the
grammar.  We'll have to back-patch that, too, unless we want to
change what port.h includes.  I found that an alternative possible
band-aid is to do this in port.h:

diff --git a/src/include/port.h b/src/include/port.h
index b405d0e740..416428a0d2 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -14,7 +14,6 @@
 #define PG_PORT_H

 #include 
-#include 
 #include 

 /*
@@ -491,6 +490,8 @@ extern int  pqGetpwuid(uid_t uid, struct passwd *resultbuf, 
char *buffer,
   size_t buflen, struct passwd **result);
 #endif

+struct hostent;/* avoid including  here */
+
 extern int pqGethostbyname(const char *name,
struct hostent *resultbuf,
char *buffer, size_t buflen,

but it seems like there's a nonzero risk that some third-party
code somewhere is depending on our having included  here.
So ceasing to do that in the back branches doesn't seem great.

Changing a parser token name in the back branches isn't ideal
either, but it seems less risky to me than removing a globally
visible #include.

regards, tom lane

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Matthias van de Meent

On Sun, 16 Oct 2022 at 16:42, Tom Lane  wrote:
>
> It also seems kind of unfair to decide
> that the relevant comparison point is a seqscan rather than a
> btree indexscan.

I think the comparison against full table scan seems appropriate, as
the benefit of BRIN is less space usage when compared to other
indexes, and better IO selectivity than full table scans.

A btree easily requires 10x the space of a normal BRIN index, and may
require a lot of random IO whilst scanning. This BRIN-sorted scan
would have a much lower random IO cost during its scan, and would help
bridge the performance gap between having index that supports ordered
retrieval, and no index at all, which is especially steep in large
tables.

I think that BRIN would be an alternative to btree as a provider of
sorted data, even when the table is not 100% clustered. This
BRIN-assisted table sort can help reduce the amount of data that is
accessed in top-N sorts significantly, both at the index and at the
relation level, without having the space overhead of "all sortable
columns get a btree index".

If BRIN gets its HOT optimization back, the benefits would be even
larger, as we would then have an index that can speed up top-N sorts
without bloating other indexes, and at very low disk footprint.
Columns that are only occasionally accessed in a sorted manner could
then get BRIN minmax indexes to support this sort, at minimal overhead
to the rest of the application.

Kind regards,

Matthias van de Meent

Re: PATCH: Using BRIN indexes for sorted output

2022-10-16 Thread Matthias van de Meent

First of all, it's really great to see that this is being worked on.

On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
 wrote:
> Try to formulate the whole algorithm. Maybe I'm missing something.
>
> The current algorithm is something like this:
>
> 1. request info about ranges from the BRIN opclass
> 2. sort them by maxval and minval

Why sort on maxval and minval? That seems wasteful for effectively all
sorts, where range sort on minval should suffice: If you find a range
that starts at 100 in a list of ranges sorted at minval, you've
processed all values <100. You can't make a similar comparison when
that range is sorted on maxvals.

> 3. NULLS FIRST: read all ranges that might have NULLs => output
> 4. read the next range (by maxval) into tuplesort
>(if no more ranges, go to (9))
> 5. load all tuples from "splill" tuplestore, compare to maxval

Instead of this, shouldn't an update to tuplesort that allows for
restarting the sort be better than this? Moving tuples that we've
accepted into BRINsort state but not yet returned around seems like a
waste of cycles, and I can't think of a reason why it can't work.

> 6. load all tuples from no-summarized ranges (first range only)
>(into tuplesort/tuplestore, depending on maxval comparison)
> 7. load all intersecting ranges (with minval < current maxval)
>(into tuplesort/tuplestore, depending on maxval comparison)
> 8. sort the tuplesort, output all tuples, then back to (4)
> 9. NULLS LAST: read all ranges that might have NULLs => output
> 10. done
>
> For "DESC" ordering the process is almost the same, except that we swap
> minval/maxval in most places.

When I was thinking about this feature at the PgCon unconference, I
was thinking about it more along the lines of the following system
(for ORDER BY col ASC NULLS FIRST):

1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
[has_nulls, min ASC]
2. scan info about ranges from BRIN, store them in Rs.
3. Finalize the sorting of Rs.
4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
column ordering.
5. load all tuples from no-summarized ranges into Ts'
6. while Rs has a block range Rs' with has_nulls:
   - Remove Rs' from Rs
   - store the tuples of Rs' range in Ts.
We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
7. Finalize the Ts sorted set.
8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
  - Remove Ts' from Ts
  - Yield Ts'
Now, all tuples up to and including max_sorted are yielded.
9. If there are no more ranges in Rs:
  - Yield all remaining tuples from Ts, then return.
10. "un-finalize" Ts, so that we can start adding tuples to that tuplesort.
 This is different from Tomas' implementation, as he loads the
tuples into a new tuplestore.
11. get the next item from Rs: Rs'
   - remove Rs' from Rs
   - assign Rs' min value to max_sorted
   - store the tuples of Rs' range in Ts
12. while the next item Rs' from Rs has a min value of max_sorted:
   - remove Rs' from Rs
   - store the tuples of Rs' range in Ts
13. The 'new' value from the next item from Rs is stored in
max_sorted. If no such item exists, max_sorted is assigned a sentinel
value (+INF)
14. Go to Step 7

This set of operations requires a restarting tuplesort for Ts, but I
don't think that would result in many API changes for tuplesort. It
reduces the overhead of large overlapping ranges, as it doesn't need
to copy all tuples that have been read from disk but have not yet been
returned.

The maximum cost of this tuplesort would be the cost of sorting a
seqscanned table, plus sorting the relevant BRIN ranges, plus the 1
extra compare per tuple and range that are needed to determine whether
the range or tuple should be extracted from the tuplesort. The minimum
cost would be the cost of sorting all BRIN ranges, plus sorting all
tuples in one of the index's ranges.

Kind regards,

Matthias van de Meent

PS. Are you still planning on giving the HOT optimization for BRIN a
second try? I'm fairly confident that my patch at [0] would fix the
issue that lead to the revert of that feature, but it introduced ABI
changes after the feature freeze and thus it didn't get in. The patch
might need some polishing, but I think it shouldn't take too much
extra effort to get into PG16.

[0] 
https://www.postgresql.org/message-id/CAEze2Wi9%3DBay_%3DrTf8Z6WPgZ5V0tDOayszQJJO%3DR_9aaHvr%2BTg%40mail.gmail.com

Re: New "single-call SRF" APIs are very confusingly named

2022-10-16 Thread Melanie Plageman

On Fri, Oct 14, 2022 at 7:41 PM Michael Paquier  wrote:

> On Fri, Oct 14, 2022 at 05:09:46PM -0400, Melanie Plageman wrote:
> > To summarize, I am in support of renaming SetSingleFuncCall() ->
> > InitMaterializedSRF() and SRF_SINGLE_USE_EXPECTED ->
> > MAT_SRF_USE_EXPECTED_TUPLE_DESC (or just DESC) as suggested elsewhere in
> > this thread. And I think we should eventually consider renaming the
> > multi* function names and consider if ExprSingleResult is a good name.
>
> As already mentioned, these have been around for years, so the impact
> would be bigger.


That makes sense.


> Attached is a patch for HEAD and REL_15_STABLE to
> switch this stuff with new names, with what's needed for ABI
> compatibility.  My plan would be to keep the compatibility parts only
> in 15, and drop them from HEAD.
>

- * SetSingleFuncCall
+ * Compatibility function for v15.
+ */
+void
+SetSingleFuncCall(FunctionCallInfo fcinfo, bits32 flags)
+{
+ InitMaterializedSRF(fcinfo, flags);
+}
+

 Any reason not to use a macro?

- Melanie

Re: macos ventura SDK spews warnings

I wrote:
> I think the correct, future-proof fix is to s/REF/REF_P/ in the
> grammar.

Done like that, after which I found that the pre-v12 branches are
compiling perfectly warning-free with the 13.0 SDK, despite nothing
having been done about sprintf.  This confused me mightily, but
after digging in Apple's headers I understand it.  What actually
gets provided by  is a declaration of sprintf(), now
with deprecation attribute attached, followed awhile later by

#if __has_builtin(__builtin___sprintf_chk) || defined(__GNUC__)
extern int __sprintf_chk (char * __restrict, int, size_t,
  const char * __restrict, ...);

#undef sprintf
#define sprintf(str, ...) \
  __builtin___sprintf_chk (str, 0, __darwin_obsz(str), __VA_ARGS__)
#endif

So in the ordinary course of events, calling sprintf() results in
calling this non-deprecated builtin.  Only if you "#undef sprintf"
will you see the deprecation message.  snprintf.c does that, so
we see the message when that's built.  But if we don't use snprintf.c,
as the older branches do not on macOS, we don't ever #undef sprintf.

So for now, there seems no need for -Wno-deprecated, and I'm not
going to install it.

What I *am* seeing, in the 9.5 and 9.6 branches, is a ton of

ld: warning: -undefined dynamic_lookup may not work with chained fixups

apparently because we are specifying -Wl,-undefined,dynamic_lookup
which the other branches don't do.  That's kind of annoying,
but it looks like preventing that would be way too invasive :-(.
We'd added it to un-break some cases in the contrib transform
modules, and we didn't have a better solution until v10 [1].

regards, tom lane

[1] https://www.postgresql.org/message-id/2652.1475512158%40sss.pgh.pa.us

Re: macos ventura SDK spews warnings

2022-10-16 Thread Andres Freund

Hi,

On 2022-10-16 16:45:24 -0400, Tom Lane wrote:
> I wrote:
> > I think the correct, future-proof fix is to s/REF/REF_P/ in the
> > grammar.
>
> Done like that, after which I found that the pre-v12 branches are
> compiling perfectly warning-free with the 13.0 SDK, despite nothing
> having been done about sprintf.  This confused me mightily, but
> after digging in Apple's headers I understand it.  What actually
> gets provided by  is a declaration of sprintf(), now
> with deprecation attribute attached, followed awhile later by
>
> #if __has_builtin(__builtin___sprintf_chk) || defined(__GNUC__)
> extern int __sprintf_chk (char * __restrict, int, size_t,
> const char * __restrict, ...);
>
> #undef sprintf
> #define sprintf(str, ...) \
>   __builtin___sprintf_chk (str, 0, __darwin_obsz(str), __VA_ARGS__)
> #endif
>
> So in the ordinary course of events, calling sprintf() results in
> calling this non-deprecated builtin.  Only if you "#undef sprintf"
> will you see the deprecation message.

Oh, huh. That's an odd setup... It's not like the the object size stuff
provides reliable protection.


> What I *am* seeing, in the 9.5 and 9.6 branches, is a ton of
>
> ld: warning: -undefined dynamic_lookup may not work with chained fixups
>
> apparently because we are specifying -Wl,-undefined,dynamic_lookup
> which the other branches don't do.  That's kind of annoying,
> but it looks like preventing that would be way too invasive :-(.
> We'd added it to un-break some cases in the contrib transform
> modules, and we didn't have a better solution until v10 [1].

Hm - I think it might actually mean that transforms won't work with the new
macos relocation format, which is what I understand "chained fixups" to be.

Unfortunately it looks like the chained fixup stuff is enabled even when
targetting Monterey, even though it was only introduced with the 13 sdk (I
think).  But it does look like using the macos 11 SDK sysroot does force the
use of the older relocation format. So we have at least some way out :/

Greetings,

Andres Freund

Re: New "single-call SRF" APIs are very confusingly named

2022-10-16 Thread Andres Freund

Hi,

On 2022-10-16 13:22:41 -0700, Melanie Plageman wrote:
> On Fri, Oct 14, 2022 at 7:41 PM Michael Paquier  wrote:
> - * SetSingleFuncCall
> + * Compatibility function for v15.
> + */
> +void
> +SetSingleFuncCall(FunctionCallInfo fcinfo, bits32 flags)
> +{
> + InitMaterializedSRF(fcinfo, flags);
> +}
> +
> 
>  Any reason not to use a macro?

Yes - it'd introduce an ABI break, i.e. an already compiled extension
referencing SetSingleFuncCall() wouldn't fail to load into an upgraded sever,
due to the reference to the SetSingleFuncCall, which wouldn't exist anymore.

Greetings,

Andres Freund

Re: New "single-call SRF" APIs are very confusingly named

2022-10-16 Thread Andres Freund

Hi,

On 2022-10-15 11:41:08 +0900, Michael Paquier wrote:
> Attached is a patch for HEAD and REL_15_STABLE to switch this stuff with new
> names, with what's needed for ABI compatibility.  My plan would be to keep
> the compatibility parts only in 15, and drop them from HEAD.  -- Michael

Looks reasonable to me. Thanks for working on this.


> -/* flag bits for SetSingleFuncCall() */
> -#define SRF_SINGLE_USE_EXPECTED  0x01/* use expectedDesc as tupdesc 
> */
> -#define SRF_SINGLE_BLESS 0x02/* validate tuple for SRF */
> +/* flag bits for InitMaterializedSRF() */
> +#define MAT_SRF_USE_EXPECTED_DESC0x01/* use expectedDesc as tupdesc 
> */
> +#define MAT_SRF_BLESS0x02/* complete 
> tuple descriptor, for
> + 
>  * a transient RECORD datatype */

I don't really know what "complete tuple descriptor" means. BlessTupleDesc()
does say "make a completed tuple descriptor useful for SRFs" - but I don't
think that means that Bless* makes them complete, but that they *have* to be
complete to be blessed.


> @@ -2164,8 +2164,8 @@ elements_worker_jsonb(FunctionCallInfo fcinfo, const 
> char *funcname,
>  
>   rsi = (ReturnSetInfo *) fcinfo->resultinfo;
>  
> - SetSingleFuncCall(fcinfo,
> -   SRF_SINGLE_USE_EXPECTED | 
> SRF_SINGLE_BLESS);
> + InitMaterializedSRF(fcinfo,
> +   MAT_SRF_USE_EXPECTED_DESC | 
> MAT_SRF_BLESS);

Already was the case, so maybe not worth mucking with: Why the newline here,
but not in other cases?

Greetings,

Andres Freund

Re: macos ventura SDK spews warnings

Andres Freund  writes:
> On 2022-10-16 16:45:24 -0400, Tom Lane wrote:
>> What I *am* seeing, in the 9.5 and 9.6 branches, is a ton of
>> 
>> ld: warning: -undefined dynamic_lookup may not work with chained fixups
>> 
>> apparently because we are specifying -Wl,-undefined,dynamic_lookup
>> which the other branches don't do.  That's kind of annoying,
>> but it looks like preventing that would be way too invasive :-(.
>> We'd added it to un-break some cases in the contrib transform
>> modules, and we didn't have a better solution until v10 [1].

> Hm - I think it might actually mean that transforms won't work with the new
> macos relocation format, which is what I understand "chained fixups" to be.

Hm ... hstore_plpython and ltree_plpython still pass regression check in
9.6, so it works for at least moderate-size values of "work", at least on
Monterey.  But in any case, if there's a problem there I can't see us
doing anything about that in dead branches.  The "keep it building" rule
doesn't extend to perl or python dependencies IMO.

regards, tom lane

Re: New "single-call SRF" APIs are very confusingly named

On Sun, Oct 16, 2022 at 03:04:43PM -0700, Andres Freund wrote:
> Yes - it'd introduce an ABI break, i.e. an already compiled extension
> referencing SetSingleFuncCall() wouldn't fail to load into an upgraded sever,
> due to the reference to the SetSingleFuncCall, which wouldn't exist anymore.

Note that this layer should just be removed on HEAD.  Once an
extension catches up with the new name, they would not even need to
play with PG_VERSION_NUM even for a new version compiled with
REL_15_STABLE.
--
Michael


signature.asc
Description: PGP signature

Re: New "single-call SRF" APIs are very confusingly named

On Sun, Oct 16, 2022 at 03:09:14PM -0700, Andres Freund wrote:
> On 2022-10-15 11:41:08 +0900, Michael Paquier wrote:
>> -/* flag bits for SetSingleFuncCall() */
>> -#define SRF_SINGLE_USE_EXPECTED 0x01/* use expectedDesc as tupdesc 
>> */
>> -#define SRF_SINGLE_BLESS0x02/* validate tuple for SRF */
>> +/* flag bits for InitMaterializedSRF() */
>> +#define MAT_SRF_USE_EXPECTED_DESC   0x01/* use expectedDesc as tupdesc 
>> */
>> +#define MAT_SRF_BLESS   0x02/* complete 
>> tuple descriptor, for
>> +
>>  * a transient RECORD datatype */
> 
> I don't really know what "complete tuple descriptor" means. BlessTupleDesc()
> does say "make a completed tuple descriptor useful for SRFs" - but I don't
> think that means that Bless* makes them complete, but that they *have* to be
> complete to be blessed.

That's just assign_record_type_typmod(), which would make sure to fill
the cache for a RECORD tupdesc.  How about "fill the cache with the
information of the tuple descriptor type, for a transient RECORD
datatype"?  If you have a better, somewhat less confusing, idea, I am
open to suggestions.

>> @@ -2164,8 +2164,8 @@ elements_worker_jsonb(FunctionCallInfo fcinfo, const 
>> char *funcname,
>>  
>>  rsi = (ReturnSetInfo *) fcinfo->resultinfo;
>>  
>> -SetSingleFuncCall(fcinfo,
>> -  SRF_SINGLE_USE_EXPECTED | 
>> SRF_SINGLE_BLESS);
>> +InitMaterializedSRF(fcinfo,
>> +  MAT_SRF_USE_EXPECTED_DESC | 
>> MAT_SRF_BLESS);
> 
> Already was the case, so maybe not worth mucking with: Why the newline here,
> but not in other cases?

Yeah, that's fine as well.
--
Michael


signature.asc
Description: PGP signature

Re: Improve description of XLOG_RUNNING_XACTS

At Sun, 16 Oct 2022 12:32:56 +0530, Bharath Rupireddy 
 wrote in 
> On Sat, Oct 15, 2022 at 4:58 PM Amit Kapila  wrote:
> >
> > > I spent some time today reading this. As others said upthread, the
> > > output can be more verbose if all the backends are running max
> > > subtransactions or subtransactions overflow occurred in all the
> > > backends.
> > >
> >
> > As far as I can understand, this contains subtransactions only when
> > they didn't overflow. The latest information provided by Sawada-San
> > for similar records (XLOG_STANDBY_LOCK and XLOG_INVALIDATIONS) made me
> > think that maybe we are just over-worried about the worst case.
> 
> Agreed. I see the below comment, which means when
> xlrec->subxid_overflow is set to true, there will not be any
> subtransaction ids logged in the WAL record.

Since I categorized this tool as semi-debugging purpose so I'm fine
that sometimes very long lines are seen. In the first place it is
already seen in, for example, transaction commit records. They can be
30k characters long by many relfile locators, stats locators,
invalidations and snapshots, when 100 relations are dropped.

> If my above understanding is correct, having something like below does
> no harm, like Masahiko-san's one of the initial patches, no? I'm also
> fine with the way it is in the v3 patch.

Yeah, v3 works exactly the same way with the initial patch, except
when something bad happens in that record. So *I* thought that it's
rather better that the tool describes records as-is (even if only for
this record..:p) rather than how the broken records are recognized by
the recovery code.

> The v3 patch posted upthread LGTM and I marked it as RfC. I'm just
> reattaching the v3 patch posted upthread herewith so that the cfbot
> can test the right patch - https://commitfest.postgresql.org/40/3779/.
> 
> > >
> >  This can blow-up the output.
> > >
> >
> > If we get some reports like that, then we can probably use Michael's
> > idea of displaying additional information with a separate flag.
> 
> Agreed.

Agreed, but maybe we need to recheck what should be hidden (or
abbreviated) in the concise (or terse?) mode.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: TRAP: FailedAssertion("prev_first_lsn < cur_txn->first_lsn", File: "reorderbuffer.c", Line: 927, PID: 568639)

2022-10-16 Thread Masahiko Sawada

On Thu, Oct 13, 2022 at 4:08 PM Amit Kapila  wrote:
>
> On Wed, Oct 12, 2022 at 11:18 AM Masahiko Sawada  
> wrote:
> >
> > Summarizing this issue, the assertion check in AssertTXNLsnOrder()
> > fails as reported because the current logical decoding cannot properly
> > handle the case where the decoding restarts from NEW_CID. Since we
> > don't make the association between top-level transaction and its
> > subtransaction while decoding NEW_CID (ie, in
> > SnapBuildProcessNewCid()), two transactions are created in
> > ReorderBuffer as top-txn and have the same LSN. This failure happens
> > on all supported versions.
> >
> > To fix the problem, one idea is that we make the association between
> > top-txn and sub-txn during that by calling ReorderBufferAssignChild(),
> > as Tomas proposed. On the other hand, since we don't guarantee to make
> > the association between the top-level transaction and its
> > sub-transactions until we try to decode the actual contents of the
> > transaction, it makes sense to me that instead of trying to solve by
> > making association, we need to change the code which are assuming that
> > it is associated.
> >
> > I've attached the patch for this idea. With the patch, we skip the
> > assertion checks in AssertTXNLsnOrder() until we reach the LSN at
> > which we start decoding the contents of transaction, ie.
> > start_decoding_at in SnapBuild. The minor concern is other way that
> > the assertion check could miss some faulty cases where two unrelated
> > top-transactions could have same LSN. With this patch, it will pass
> > for such a case. Therefore, for transactions that we skipped checking,
> > we do the check when we reach the LSN.
> >
>
> >
> --- a/src/backend/replication/logical/decode.c
> +++ b/src/backend/replication/logical/decode.c
> @@ -113,6 +113,15 @@
> LogicalDecodingProcessRecord(LogicalDecodingContext *ctx,
> XLogReaderState *recor
>   buf.origptr);
>   }
>
> +#ifdef USE_ASSERT_CHECKING
> + /*
> + * Check the order of transaction LSNs when we reached the start decoding
> + * LSN. See the comments in AssertTXNLsnOrder() for details.
> + */
> + if (SnapBuildGetStartDecodingAt(ctx->snapshot_builder) == buf.origptr)
> + AssertTXNLsnOrder(ctx->reorder);
> +#endif
> +
>   rmgr = GetRmgr(XLogRecGetRmid(record));
> >
>
> I am not able to think how/when this check will be useful. Because we
> skipped assert checking only for records that are prior to
> start_decoding_at point, I think for those records ordering should
> have been checked before the restart. start_decoding_at point will be
> either (a) confirmed_flush location, or (b) lsn sent by client, and
> any record prior to that must have been processed before restart.

Good point. I was considering the case where the client sets far ahead
LSN but it's not worth considering this case in this context. I've
updated the patch accoringly.

Regards,


--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-Fix-the-assertion-failure-while-processing-NEW_CI.patch
Description: Binary data

RE: [Proposal] Add foreign-server health checks infrastructure

2022-10-16 Thread osumi.takami...@fujitsu.com

Hi,

On Wednesday, October 5, 2022 6:27 PM kuroda.hay...@fujitsu.com
wrote:
> Thanks for giving many comments! Based on off and on discussions, I modified
> my patch.
Thank you for your patch set !

While reviewing and testing the new v16, I've met a possible issue
by slightly adjusting the scenario written in [1].

I mainly followed the steps there and
replaced the command "SELECT" for the remote table at 6-9 with "INSERT" command.
Then, after waiting for few seconds, the "COMMIT" succeeded like below output,
even after the server stop of the worker side.

After the transaction itself, any reference to the remote table fails.
Note that the local table has some data initially in my test.

postgres=# begin;
BEGIN
postgres=*# insert into remote values (-1000);
INSERT 0 1
postgres=*# select * from local;
number

101
102
103
104
105
(5 rows)

postgres=*# commit;
COMMIT
postgres=# select * from remote;
ERROR: could not connect to server "my_external_server"
DETAIL: connection to server on socket "/tmp/.s.PGSQL." failed: No such
file or directory
Is the server running locally and accepting connections on that socket?

Additionally, the last reference "SELECT" which failed above can succeed,
if I restart the worker server before the "SELECT" command to the remote table.
This means the transaction looks successful but the data isn't there ?
Could you please have a look at this issue ?

[1] -
https://www.postgresql.org/message-id/TYAPR01MB58662CD4FD98AA475B3D10F9F59B9%40TYAPR01MB5866.jpnprd01.prod.outlook.com

Best Regards,
Takamichi Osumi

Re: create subscription - improved warning message

2022-10-16 Thread Peter Smith

On Sun, Oct 16, 2022 at 12:14 AM Amit Kapila  wrote:
>
> On Fri, Oct 14, 2022 at 8:22 AM Peter Smith  wrote:
> >
> > On Thu, Oct 13, 2022 at 9:07 AM Peter Smith  wrote:
> > >
...
> > PSA a patch for adding examples of how to activate a subscription that
> > was originally created in a disconnected mode.
> >
> > The new docs are added as part of the "Logical Replication -
> > Subscription" section 31.2.
> >
> > The CREATE SUBSCRIPTION reference page was also updated to include
> > links to the new docs.
> >
>
> You have used 'pgoutput' as plugin name in the examples. Shall we
> mention in some way that this is a default plugin used for built-in
> logical replication and it is required to use only this one to enable
> logical replication?
>

Updated as sugggested.

PSA v5.

--
Kind Regards,
Peter Smith.
Fujitsu Australia


v5-0001-create-subscriptipon-pgdocs-for-deferred-slot-cre.patch
Description: Binary data

Re: Fix error message for MERGE foreign tables

2022-10-16 Thread Richard Guo

On Fri, Oct 14, 2022 at 10:43 PM Tom Lane  wrote:

> Richard Guo  writes:
> > Or maybe we can make it even earlier, when we expand an RTE for a
> > partitioned table and add result tables to leaf_result_relids.
>
> I'm not really on board with injecting command-type-specific logic into
> completely unrelated places just so that we can throw an error a bit
> earlier.  Alvaro's suggestion of make_modifytable seemed plausible,
> not least because it avoids spending any effort when the command
> couldn't be MERGE at all.

Yeah, that makes sense. Putting this check in inherit.c does look some
weird as there is no other commandType related code in that file.

Agree that Alvaro's suggestion is more reasonable.

Thanks
Richard

Re: Add regular expression testing for user name mapping in the peer authentication TAP test

On Sat, Oct 15, 2022 at 07:54:30AM +0200, Drouvot, Bertrand wrote:
> Right. Giving a second thought to the non matching case, I think I'd prefer
> to concatenate the system_user to the system_user instead. This is what v2
> does.

Fine by me, so applied v2.  Thanks!
--
Michael


signature.asc
Description: PGP signature

Re: [PATCH] comment fixes for delayChkptFlags

On Fri, Oct 14, 2022 at 04:36:36PM -0500, David Christensen wrote:
> Enclosed is a trivial fix for a typo and misnamed field I noted when doing
> some code review.

(Few days after the fact)

Thanks.  There was a second location where delayChkpt was still
mentioned, so fixed also that while on it.
--
Michael


signature.asc
Description: PGP signature

Re: thinko in basic_archive.c

On Sat, Oct 15, 2022 at 02:10:26PM -0700, Nathan Bossart wrote:
> On Sat, Oct 15, 2022 at 10:19:05AM +0530, Bharath Rupireddy wrote:
>> Can you please help me understand how name collisions can happen with
>> temp file names including WAL file name, timestamp to millisecond
>> scale, and PID? Having the timestamp is enough to provide a non-unique
>> temp file name when PID wraparound occurs, right? Am I missing
>> something here?
> 
> Outside of contrived cases involving multiple servers, inaccurate clocks,
> PID reuse, etc., it seems unlikely.

With a name based on a PID in a world where pid_max can be larger than
the default and a timestamp, I would say even more unlikely than what
you are implying with unlikely ;p

> I think the right way to do this would be to add handling for leftover
> files in the sigsetjmp() block and a shutdown callback (which just sets up
> a before_shmem_exit callback).  While this should ensure those files are
> cleaned up after an ERROR or FATAL, crashes and unlink() failures could
> still leave files behind.  We'd probably also need to avoid cleaning up the
> temp file if copy_file() fails because it already exists, as we won't know
> if it's actually ours.  Overall, I suspect it'd be more trouble than it's
> worth.

Agreed.  My opinion is that we should keep basic_archive as
minimalistic as we can: short still useful.  It does not have to be
perfect, just to fit with what we want it to show, as a reference.

Anyway, the maths were wrong, so I have applied the patch of upthread,
with an extra pair of parenthesis, a comment where epoch is declared
to tell that it is in milliseconds, and a comment in basic_archive's
Makefile to mention the reason why we have NO_INSTALLCHECK.
--
Michael


signature.asc
Description: PGP signature

Re: Unnecessary lateral dependencies implied by PHVs

2022-10-16 Thread Richard Guo

On Mon, Oct 10, 2022 at 10:35 AM Richard Guo  wrote:

> As we know when we pull up a simple subquery, if the subquery is within
> the nullable side of an outer join, lateral references to non-nullable
> items may have to be turned into PlaceHolderVars. I happened to wonder
> what should we do about the PHVs if the outer join is reduced to inner
> join afterwards. Should we unwrap the related PHVs? I'm asking because
> PHVs may imply lateral dependencies which may make us have to use
> nestloop join.
>

At first I considered about unwrapping the related PHVs after we've
successfully reduced outer joins to inner joins. But that requires a lot
of coding which seems not worth the trouble.

I think maybe the problem here is about the order we pull up subqueries
and we reduce outer joins. But simply flipping the order for them two is
definitely incorrect. I'm not sure how to make it right.

Any thoughts?

Thanks
Richard

RE: [Proposal] Add foreign-server health checks infrastructure

2022-10-16 Thread osumi.takami...@fujitsu.com

On Wednesday, October 5, 2022 6:27 PM kuroda.hay...@fujitsu.com 
 wrote:
> Thanks for giving many comments! Based on off and on discussions, I modified
> my patch.
Here are my other quick review comments on v16.

(1) v16-0001 : definition of a new structure

CheckingRemoteServersCallbackItem can be added to typedefs.list.

(2) v16-0001 : UnRegisterCheckingRemoteServersCallback's header comment

+void
+UnRegisterCheckingRemoteServersCallback(CheckingRemoteServersCallback callback,
+   
void *arg)
+{

Looks require a header comment for this function,
because in this file, all other functions have each one.

(3) v16-0002 : better place to declare new variables

New variables 'old' and 'server' in GetConnection() can be moved to
a scope of 'if' statement where those are used.

@@ -138,6 +149,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt, 
PgFdwConnState **state)
ConnCacheEntry *entry;
ConnCacheKey key;
MemoryContext ccxt = CurrentMemoryContext;
+   MemoryContext old;
+   ForeignServer *server = GetForeignServer(user->serverid);

(4) v16-0002 : typo in pgfdw_connection_check_internal()

+   /* return false is if the socket seems to be closed. */

It should be "return false if the socket seems to be closed" ?

(5) v16-0002 : pgfdw_connection_check's message

When I tested the new feature, I got a below message.

"ERROR:  Foreign Server my_external_server might be down."

I think we can wrap the server name with double quotes
so that the message is more aligned with existing error message
like connect_pg_server.

(6) v16-003 : typo

+  Authors needs to register the callback function via following API.

Should be "Authors need to register the ...".

(7) v16-0004 : unnecessary file

I think we can remove a new file which looks like a log file.

.../postgres_fdw/expected/postgres_fdw_1.out

Best Regards,
Takamichi Osumi

Improve errhint for ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2022-10-16 Thread houzj.f...@fujitsu.com

Hi,

While working on some logical replication related features.
I found the HINT message could be improved when I tried to add a publication to
a subscription which was disabled.

alter subscription sub add publication pub2;
--
ERROR: ALTER SUBSCRIPTION with refresh is not allowed for disabled subscriptions
HINT: Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).
--

Because I was executing the ADD PUBLICATION command, I feel the hint should
also mention it instead of SET PUBLICATION.

Best regards,
Hou zj




0001-Improve-errhint-for-ALTER-SUBSCRIPTION-ADD-DROP-PUBL.patch
Description:  0001-Improve-errhint-for-ALTER-SUBSCRIPTION-ADD-DROP-PUBL.patch

Re: Schema variables - new implementation for Postgres 15

2022-10-16 Thread Julien Rouhaud

Hi,

On Thu, Oct 13, 2022 at 07:41:32AM +0200, Pavel Stehule wrote:
>
> > I fixed the commit message of 0001 patch. Fixed shadowed variables too.

Thanks!

> >
> > There is a partially open issue, where I and Julien are not sure about a
> > solution, and we would like to ask for the  community's opinion. I'll send
> > this query in separate mail.
> >
>
>  rebased with simplified code related to usage of pfree function

If anyone is curious the discussion happend at [1].

I looked at the patchset, this time focusing on the LET command.  Here at the
comments I have for now:

- gram.y

@@ -11918,6 +11920,7 @@ ExplainableStmt:
| CreateMatViewStmt
| RefreshMatViewStmt
| ExecuteStmt   /* by 
default all are $$=$1 */
+   | LetStmt
;

(and other similar places) the comment should be kept to the last statement

Also, having LetStmt as an ExplainableStmt means it's allowed in a CTE:

cte_list:
common_table_expr   
{ $$ = list_make1($1); }
| cte_list ',' common_table_expr{ $$ = 
lappend($1, $3); }
;

common_table_expr:  name opt_name_list AS opt_materialized '(' PreparableStmt 
')' opt_search_clause opt_cycle_clause

And doing so hits this assert in transformWithClause:

if (!IsA(cte->ctequery, SelectStmt))
{
/* must be a data-modifying statement */
Assert(IsA(cte->ctequery, InsertStmt) ||
   IsA(cte->ctequery, UpdateStmt) ||
   IsA(cte->ctequery, DeleteStmt));

pstate->p_hasModifyingCTE = true;
}

and I'm assuming it would also fail on this in transformLetStmt:

+   /* There can't be any outer WITH to worry about */
+   Assert(pstate->p_ctenamespace == NIL);

I guess it makes sense to be able to explain a LetStmt (or using it in a
prepared statement), so it should be properly handled in transformSelectStmt.
Also, I don't see any test for a prepared LET statement, this should also be
covered.

- transformLetStmt:

+   varid = IdentifyVariable(names, &attrname, ¬_unique);

It would be nice to have a comment saying that the lock is acquired here

+   /* The grammar should have produced a SELECT */
+   if (!IsA(selectQuery, Query) ||
+   selectQuery->commandType != CMD_SELECT)
+   elog(ERROR, "unexpected non-SELECT command in LET command");

I'm wondering if this should be an Assert instead, as the grammar shouldn't
produce anything else no matter what how hard a user try.

+   /* don't allow multicolumn result */
+   if (list_length(exprList) != 1)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("expression is not scalar value"),
+parser_errposition(pstate,
+   
exprLocation((Node *) exprList;

This isn't covered by any regression test and it probably should.  It can be
reached with something like

LET myvar = (null::pg_class).*;

The error message could also use a bit of improvement.

I see that a_expr allows a select statement in parens, but this leads to a
sublink which already has all the required protection to gurantee a single
column, and a single row at most during execution.  This one returns for
non-scalar case:

subquery must return only one column

Maybe use something similar for it, like "expression must return only one
column"?  Similarly the error message in svariableStartupReceiver could be made
more consistent with the related errors:

+   if (++outcols > 1)
+   elog(ERROR, "svariable DestReceiver can take only one 
attribute");

While on svariableReceiver, I see that the current code assumes that the caller
did everything right.  That's the case right now, but it should still be made
more robust in case future code (or extensions) is added.  I'm thinking:

- svariableState.rows.  Currently not really used, should check that one and
  only one row is received in svariableReceiveSlot and
  svariableShutdownReceiver (if no row is received the variable won't be reset
  which should probably always happen once you setup an svariableReceiver)
- svariableState.typid, typmod and typlen should be double checked with the
  given varid in svariableStartupReceiver.
- svariableState.varid should be initialized with InvalidOid to avoid undefined
  behavior is caller forgets to set it.

I'm also wondering if SetVariableDestReceiverParams() should have an assert
like LockHeldByMe() for the given varid, and maybe an assert that the varid is
a session variable, to avoid running a possibly expensive execution that will
fail whe

Re: PATCH: Using BRIN indexes for sorted output

On 10/16/22 22:17, Matthias van de Meent wrote:
> First of all, it's really great to see that this is being worked on.
> 
> On Sun, 16 Oct 2022 at 16:34, Tomas Vondra
>  wrote:
>> Try to formulate the whole algorithm. Maybe I'm missing something.
>>
>> The current algorithm is something like this:
>>
>> 1. request info about ranges from the BRIN opclass
>> 2. sort them by maxval and minval
> 
> Why sort on maxval and minval? That seems wasteful for effectively all
> sorts, where range sort on minval should suffice: If you find a range
> that starts at 100 in a list of ranges sorted at minval, you've
> processed all values <100. You can't make a similar comparison when
> that range is sorted on maxvals.
> 

Because that allows to identify overlapping ranges quickly.

Imagine you have the ranges sorted by maxval, which allows you to add
tuples in small increments. But how do you know there's not a range
(possibly with arbitrarily high maxval), that however overlaps with the
range we're currently processing?

Consider these ranges sorted by maxval

  range #1  [0,100]
  range #2  [101,200]
  range #3  [150,250]
  ...
  range #100 [190,10]

processing the range #1 is simple, because there are no overlapping
ranges. When processing range #2, that's not the case - the following
range #3 is overlapping too, so we need to load the tuples too. But
there may be other ranges (in arbitrary distance) also overlapping.

So we either have to cross-check everything with everything - that's
O(N^2) so not great, or we can invent a way to eliminate ranges that
can't overlap.

The patch does that by having two arrays - one sorted by maxval, one
sorted by minval. After proceeding to the next range by maxval (using
the first array), the minval-sorted array is used to detect overlaps.
This can be done quickly, because we only care for new matches since the
previous range, so we can remember the index to the array and start from
it. And we can stop once the minval exceeds the maxval for the range in
the first step. Because we'll only sort tuples up to that point.

>> 3. NULLS FIRST: read all ranges that might have NULLs => output
>> 4. read the next range (by maxval) into tuplesort
>>(if no more ranges, go to (9))
>> 5. load all tuples from "splill" tuplestore, compare to maxval
> 
> Instead of this, shouldn't an update to tuplesort that allows for
> restarting the sort be better than this? Moving tuples that we've
> accepted into BRINsort state but not yet returned around seems like a
> waste of cycles, and I can't think of a reason why it can't work.
> 

I don't understand what you mean by "update to tuplesort". Can you
elaborate?

The point of spilling them into a tuplestore is to make the sort cheaper
by not sorting tuples that can't possibly be produced, because the value
exceeds the current maxval. Consider ranges sorted by maxval

   [0,1000]
   [500,1500]
   [1001,2000]
   ...

We load tuples from [0,1000] and use 1000 as "threshold" up to which we
can sort. But we have to load tuples from the overlapping range(s) too,
e.g. from [500,1500] except that all tuples with values > 1000 can't be
produced (because there might be yet more ranges intersecting with that
part).

So why sort these tuples at all? Imagine imperfectly correlated table
where each range overlaps with ~10 other ranges. If we feed all of that
into the tuplestore, we're now sorting 11x the amount of data.

Or maybe I just don't understand what you mean.

>> 6. load all tuples from no-summarized ranges (first range only)
>>(into tuplesort/tuplestore, depending on maxval comparison)
>> 7. load all intersecting ranges (with minval < current maxval)
>>(into tuplesort/tuplestore, depending on maxval comparison)
>> 8. sort the tuplesort, output all tuples, then back to (4)
>> 9. NULLS LAST: read all ranges that might have NULLs => output
>> 10. done
>>
>> For "DESC" ordering the process is almost the same, except that we swap
>> minval/maxval in most places.
> 
> When I was thinking about this feature at the PgCon unconference, I
> was thinking about it more along the lines of the following system
> (for ORDER BY col ASC NULLS FIRST):
> 
> 1. prepare tuplesort Rs (for Rangesort) for BRIN tuples, ordered by
> [has_nulls, min ASC]
> 2. scan info about ranges from BRIN, store them in Rs.
> 3. Finalize the sorting of Rs.
> 4. prepare tuplesort Ts (for Tuplesort) for sorting on the specified
> column ordering.
> 5. load all tuples from no-summarized ranges into Ts'
> 6. while Rs has a block range Rs' with has_nulls:
>- Remove Rs' from Rs
>- store the tuples of Rs' range in Ts.
> We now have all tuples with NULL in our sorted set; max_sorted = (NULL)
> 7. Finalize the Ts sorted set.
> 8. While the next tuple Ts' in the Ts tuplesort <= max_sorted
>   - Remove Ts' from Ts
>   - Yield Ts'
> Now, all tuples up to and including max_sorted are yielded.
> 9. If there are no more ranges in Rs:
>   - Yield all remaining tuples from Ts, then retur

Re: Improve errhint for ALTER SUBSCRIPTION ADD/DROP PUBLICATION

2022-10-16 Thread Amit Kapila

On Mon, Oct 17, 2022 at 8:39 AM houzj.f...@fujitsu.com
 wrote:
>
> While working on some logical replication related features.
> I found the HINT message could be improved when I tried to add a publication 
> to
> a subscription which was disabled.
>
> alter subscription sub add publication pub2;
> --
> ERROR: ALTER SUBSCRIPTION with refresh is not allowed for disabled 
> subscriptions
> HINT: Use ALTER SUBSCRIPTION ... SET PUBLICATION ... WITH (refresh = false).
> --
>
> Because I was executing the ADD PUBLICATION command, I feel the hint should
> also mention it instead of SET PUBLICATION.
>

+1. I haven't tested it yet but the changes look sane.

-- 
With Regards,
Amit Kapila.

Re: Improve description of XLOG_RUNNING_XACTS

2022-10-16 Thread Amit Kapila

On Mon, Oct 17, 2022 at 6:46 AM Kyotaro Horiguchi
 wrote:
>
> At Sun, 16 Oct 2022 12:32:56 +0530, Bharath Rupireddy 
>  wrote in
> > On Sat, Oct 15, 2022 at 4:58 PM Amit Kapila  wrote:
> > >
> > > > I spent some time today reading this. As others said upthread, the
> > > > output can be more verbose if all the backends are running max
> > > > subtransactions or subtransactions overflow occurred in all the
> > > > backends.
> > > >
> > >
> > > As far as I can understand, this contains subtransactions only when
> > > they didn't overflow. The latest information provided by Sawada-San
> > > for similar records (XLOG_STANDBY_LOCK and XLOG_INVALIDATIONS) made me
> > > think that maybe we are just over-worried about the worst case.
> >
> > Agreed. I see the below comment, which means when
> > xlrec->subxid_overflow is set to true, there will not be any
> > subtransaction ids logged in the WAL record.
>
> Since I categorized this tool as semi-debugging purpose so I'm fine
> that sometimes very long lines are seen. In the first place it is
> already seen in, for example, transaction commit records. They can be
> 30k characters long by many relfile locators, stats locators,
> invalidations and snapshots, when 100 relations are dropped.
>
> > If my above understanding is correct, having something like below does
> > no harm, like Masahiko-san's one of the initial patches, no? I'm also
> > fine with the way it is in the v3 patch.
>
> Yeah, v3 works exactly the same way with the initial patch, except
> when something bad happens in that record. So *I* thought that it's
> rather better that the tool describes records as-is (even if only for
> this record..:p) rather than how the broken records are recognized by
> the recovery code.
>

Okay, let's wait for two or three days and see if anyone thinks
differently, otherwise, I'll push v3 after a bit more testing.


-- 
With Regards,
Amit Kapila.

Re: archive modules

At Fri, 14 Oct 2022 14:42:56 -0700, Nathan Bossart  
wrote in 
> As promised...

As the code written, when archive library is being added while archive
command is already set, archiver first emits seemingly positive
message "restarting archive process because of..", then errors out
after the resatart and keep restarting with complaining for the wrong
setting. I think we don't need the first message.

The ERROR always turns into FATAL, so FATAL would less confusing here,
maybe.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: fix archive module shutdown callback

On Sun, Oct 16, 2022 at 01:39:14PM +0530, Bharath Rupireddy wrote:
> Is the shutdown callback meant to be called only after the archive
> library is loaded? The documentation [1] says that it just gets called
> before the archiver process exits. If this is true, can we just place
> before_shmem_exit(call_archive_module_shutdown_callback, 0); in
> PgArchiverMain() after on_shmem_exit(pgarch_die, 0);?

I am not sure to understand what you mean here.  The shutdown callback
is available once the archiver process has loaded the library defined
in archive_library (assuming it is itself in shared_preload_libraries)
and you cannot call something that does not exist yet.  So, yes, you
could define the call to before_shmem_exit() a bit earlier because
that would be a no-op until the library is loaded, but at the end that
would be just registering a callback that would do nothing useful in a
larger window, aka until the library is loaded.

FWIW, I think that the documentation should clarify that the shutdown
callback is called before shmem exit.  That's important.

Another thing is that the shutdown callback is used by neither
shell_archive.c nor basic_archive.  We could do something about that,
actually, say by plugging an elog(DEBUG1) in shutdown_cb for
shell_archive.c to inform that the archiver is going down?
basic_archive could do that, but we already use shell_archive.c in a 
bunch of tests, and this would need just log_min_messages=debug1 or
lower, so..

> Also, I've noticed other 3 threads and CF entries all related to
> 'archive modules' feature. IMO, it could be better to have all of them
> under a single thread and single CF entry to reduce
> reviewers/committers' efforts and seek more thoughts about all of the
> fixes.

I don't mind discussing each point separately as the first thread
dealing with archive modules is already very long, so the current way
of doing things makes sure to attract the correct type of attention
for each problem, IMO.
--
Michael

signature.asc
Description: PGP signature

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-10-16 Thread Maciek Sakrejda

On Thu, Oct 13, 2022 at 10:29 AM Melanie Plageman
 wrote:
> I think that it makes sense to count both the initial buffers added to
> the ring and subsequent shared buffers added to the ring (either when
> the current strategy buffer is pinned or in use or when a bulkread
> rejects dirty strategy buffers in favor of new shared buffers) as
> strategy clocksweeps because of how the statistic would be used.
>
> Clocksweeps give you an idea of how much of your working set is cached
> (setting aside initially reading data into shared buffers when you are
> warming up the db). You may use clocksweeps to determine if you need to
> make shared buffers larger.
>
> Distinguishing strategy buffer clocksweeps from shared buffer
> clocksweeps allows us to avoid enlarging shared buffers if most of the
> clocksweeps are to bring in blocks for the strategy operation.
>
> However, I could see an argument that discounting strategy clocksweeps
> done because the current strategy buffer is pinned makes the number of
> shared buffer clocksweeps artificially low since those other queries
> using the buffer would have suffered a cache miss were it not for the
> strategy. And, in this case, you would take strategy clocksweeps
> together with shared clocksweeps to make your decision. And if we
> include buffers initially added to the strategy ring in the strategy
> clocksweep statistic, this number may be off because those blocks may
> not be needed in the main shared working set. But you won't know that
> until you try to reuse the buffer and it is pinned. So, I think we don't
> have a better option than counting initial buffers added to the ring as
> strategy clocksweeps (as opposed to as reuses).
>
> So, in answer to your question, no, I cannot think of a scenario like
> that.

That analysis makes sense to me; thanks.

> It also made me remember that I am incorrectly counting rejected buffers
> as reused. I'm not sure if it is a good idea to subtract from reuses
> when a buffer is rejected. Waiting until after it is rejected to count
> the reuse will take some other code changes. Perhaps we could also count
> rejections in the stats?

I'm not sure what makes sense here.

> > Not critical, but is there a list of backend types we could
> > cross-reference elsewhere in the docs?
>
> The most I could find was this longer explanation (with exhaustive list
> of types) in pg_stat_activity docs [1]. I could duplicate what it says
> or I could link to the view and say "see pg_stat_activity" for a
> description of backend_type" or something like that (to keep them from
> getting out of sync as new backend_types are added. I suppose I could
> also add docs on backend_types, but I'm not sure where something like
> that would go.

I think linking pg_stat_activity is reasonable for now. A separate
section for this might be nice at some point, but that seems out of
scope.

> > From the io_context column description:
> >
> > +   The autovacuum daemon, explicit VACUUM,
> > explicit
> > +   ANALYZE, many bulk reads, and many bulk
> > writes use a
> > +   fixed amount of memory, acquiring the equivalent number of
> > shared
> > +   buffers and reusing them circularly to avoid occupying an
> > undue portion
> > +   of the main shared buffer pool.
> > +  
> >
> > I don't understand how this is relevant to the io_context column.
> > Could you expand on that, or am I just missing something obvious?
> >
>
> I'm trying to explain why those other IO Contexts exist (bulkread,
> bulkwrite, vacuum) and why they are separate from shared buffers.
> Should I cut it altogether or preface it with something like: these are
> counted separate from shared buffers because...?

Oh I see. That makes sense; it just wasn't obvious to me this was
talking about the last three values of io_context. I think a brief
preface like that would be helpful (maybe explicitly with "these last
three values", and I think "counted separately").

> > + 
> > +   > role="column_definition">
> > +   extended bigint
> > +  
> > +  
> > +   Extends of relations done by this
> > backend_type in
> > +   order to write data in this io_context.
> > +  
> > + 
> >
> > I understand what this is, but not why this is something I might want
> > to know about.
>
> Unlike writes, backends largely have to do their own extends, so
> separating this from writes lets us determine whether or not we need to
> change checkpointer/bgwriter to be more aggressive using the writes
> without the distraction of the extends. Should I mention this in the
> docs? The other stats views don't seems to editorialize at all, and I
> wasn't sure if this was an objective enough point to include in docs.

Thanks for the clarification. Just to make sure I understand, you mean
that if I see a high extended count, that may be interesting in terms
of write activity, but I can't fix that by tuning--it's just the
nature of my workload?

I think you're right that this is no

Re: fix archive module shutdown callback

At Mon, 17 Oct 2022 13:51:52 +0900, Michael Paquier  wrote 
in 
> On Sun, Oct 16, 2022 at 01:39:14PM +0530, Bharath Rupireddy wrote:
> > Is the shutdown callback meant to be called only after the archive
> > library is loaded? The documentation [1] says that it just gets called
> > before the archiver process exits. If this is true, can we just place
> > before_shmem_exit(call_archive_module_shutdown_callback, 0); in
> > PgArchiverMain() after on_shmem_exit(pgarch_die, 0);?
> 
> I am not sure to understand what you mean here.  The shutdown callback
> is available once the archiver process has loaded the library defined
> in archive_library (assuming it is itself in shared_preload_libraries)
> and you cannot call something that does not exist yet.  So, yes, you

I guess that the "callback" there means the callback-caller function
(call_archive_module_shutdown_callback), which in turn is set as a
callback...

> could define the call to before_shmem_exit() a bit earlier because
> that would be a no-op until the library is loaded, but at the end that
> would be just registering a callback that would do nothing useful in a
> larger window, aka until the library is loaded.

I thought that Bharath's point is to use before_shmem_exit() instead
of PG_ENSURE_ERROR_CLEANUP(). The place doesn't seem significant but
if we use before_shmem_exit(), it would be cleaner to place it
adjecent to on_sheme_exit() call.

> FWIW, I think that the documentation should clarify that the shutdown
> callback is called before shmem exit.  That's important.

Sure. What the shutdown callback can do differs by shared memory
access.

> Another thing is that the shutdown callback is used by neither
> shell_archive.c nor basic_archive.  We could do something about that,
> actually, say by plugging an elog(DEBUG1) in shutdown_cb for
> shell_archive.c to inform that the archiver is going down?
> basic_archive could do that, but we already use shell_archive.c in a 
> bunch of tests, and this would need just log_min_messages=debug1 or
> lower, so..

+1 for inserting DEBUG1.

> > Also, I've noticed other 3 threads and CF entries all related to
> > 'archive modules' feature. IMO, it could be better to have all of them
> > under a single thread and single CF entry to reduce
> > reviewers/committers' efforts and seek more thoughts about all of the
> > fixes.
> 
> I don't mind discussing each point separately as the first thread
> dealing with archive modules is already very long, so the current way
> of doing things makes sure to attract the correct type of attention
> for each problem, IMO.

I tend to agree to this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: fix archive module shutdown callback

At Sun, 16 Oct 2022 13:39:14 +0530, Bharath Rupireddy 
 wrote in 
> On Sun, Oct 16, 2022 at 3:43 AM Nathan Bossart  
> wrote:
> >
> > Hi hackers,
> >
> > Presently, when an archive module sets up a shutdown callback, it will be
> > called upon ERROR/FATAL (via PG_ENSURE_ERROR_CLEANUP), when the archive
> > library changes (via HandlePgArchInterrupts()), and upon normal shutdown.
> > There are a couple of problems with this:
> 
> Yeah.
> 
> > * HandlePgArchInterrupts() calls the shutdown callback directly before
> > proc_exit().  However, the PG_ENSURE_ERROR_CLEANUP surrounding the call to
> > pgarch_MainLoop() sets up a before_shmem_exit callback that also calls the
> > shutdown callback.  This means that the shutdown callback will be called
> > twice whenever archive_library is changed via SIGHUP.
> >
> > * PG_ENSURE_ERROR_CLEANUP is intended for both ERROR and FATAL.  However,
> > the archiver operates at the bottom of the exception stack, so ERRORs are
> > treated as FATALs.  This means that PG_ENSURE_ERROR_CLEANUP is excessive.
> > We only need to set up the before_shmem_exit callback.
> >
> > To fix, the attached patch removes the use of PG_ENSURE_ERROR_CLEANUP and
> > the call to the shutdown callback in HandlePgArchInterrupts() in favor of
> > just setting up a before_shmem_exit callback in LoadArchiveLibrary().
> 
> We could have used a flag in call_archive_module_shutdown_callback()
> to avoid it being called multiple times, but having it as
> before_shmem_exit () callback without direct calls to it is the right
> approach IMO.
>
> +1 to remove PG_ENSURE_ERROR_CLEANUP and PG_END_ENSURE_ERROR_CLEANUP.

That prevents archiver process from cleanly shut down when something
wrong happnes outside the interuppt handler.  In the frist place, why
do we need to call the cleanup callback directly in the handler?  We
can let the handler return something instead to tell the
pgarch_MainLoop to exit immediately on the normal path.

> Is the shutdown callback meant to be called only after the archive
> library is loaded? The documentation [1] says that it just gets called
> before the archiver process exits. If this is true, can we just place
> before_shmem_exit(call_archive_module_shutdown_callback, 0); in
> PgArchiverMain() after on_shmem_exit(pgarch_die, 0);?

+1 for using before_shmem_exit.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: fix archive module shutdown callback

On Mon, Oct 17, 2022 at 02:30:52PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 17 Oct 2022 13:51:52 +0900, Michael Paquier  
> wrote in 
>> I am not sure to understand what you mean here.  The shutdown callback
>> is available once the archiver process has loaded the library defined
>> in archive_library (assuming it is itself in shared_preload_libraries)
>> and you cannot call something that does not exist yet.  So, yes, you
> 
> I guess that the "callback" there means the callback-caller function
> (call_archive_module_shutdown_callback), which in turn is set as a
> callback...

A callback in a callback in a callback.

>> could define the call to before_shmem_exit() a bit earlier because
>> that would be a no-op until the library is loaded, but at the end that
>> would be just registering a callback that would do nothing useful in a
>> larger window, aka until the library is loaded.
> 
> I thought that Bharath's point is to use before_shmem_exit() instead
> of PG_ENSURE_ERROR_CLEANUP(). The place doesn't seem significant but
> if we use before_shmem_exit(), it would be cleaner to place it
> adjecent to on_sheme_exit() call.

Removing PG_ENSURE_ERROR_CLEANUP() and relying on before_shmem_exit()
is fine by me, that's what I imply upthread.
--
Michael


signature.asc
Description: PGP signature

RE: Data is copied twice when specifying both child and parent table in publication

2022-10-16 Thread wangw.f...@fujitsu.com

On Wed, Oct 5, 2022 at 11:08 AM Peter Smith  wrote:
> Hi Wang-san. Here are my review comments for HEAD_v12-0001 patch.

Thanks for your comments.

> ==
> 
> 1. Missing documentation.
> 
> In [1] you wrote:
> > I think the behaviour of multiple publications with parameter
> publish_via_partition_root could be added to the pg-doc later in a separate
> patch.
> 
> ~
> 
> That doesn't seem right to me. IMO the related documentation updates
> cannot really be separated from this patch. Otherwise, what's the
> alternative? Push this change, and then (while waiting for the
> documentation patch) users will just have to use trial and error to
> guess how it works...?

I tried to add related documentation in a separate patch (HEAD_v13-0002*).

> --
> 
> 2. src/backend/catalog/pg_publication.c
> 
> + typedef struct
> + {
> + Oid relid; /* OID of published table */
> + Oid pubid; /* OID of publication that publishes this
> + * table. */
> + } published_rel;
> 
> 2a.
> I think that should be added to typedefs.list

Added.

> ~
> 
> 2b.
> Maybe this also needs some comment to clarify that there will be
> *multiple* of these structures in scenarios where the same table is
> published by different publications in the array passed.

Added the comments.

> --
> 
> 3. QUESTION - pg_get_publication_tables / fetch_table_list
> 
> When the same table is published by different publications (but there
> are other differences like row-filters/column-lists in each
> publication) the result tuple of this function does not include the
> pubid. Maybe the SQL of pg_publication_tables/fetch_table_list() is OK
> as-is but how does it manage to associate each table with the correct
> tuple?
> 
> I know it apparently all seems to work but I’m not how does that
> happen? Can you explain why a puboid is not needed for the result
> tuple of this function?

Sorry, I am not sure I understand your question.
I try to answer your question by explaining the two functions you mentioned:

First, the function pg_get_publication_tables gets the list (see table_infos)
that included published table and the corresponding publication. Then based
on this list, the function pg_get_publication_tables returns information
(scheme, relname, row filter and column list) about the published tables in the
publications list. It just doesn't return pubid.

Then, the SQL in the function fetch_table_list will get the columns in the
column list from pg_attribute. (This is to return all columns when the column
list is not specified)

> ~~
> 
> test_pub=# create table t1(a int, b int, c int);
> CREATE TABLE
> test_pub=# create publication pub1 for table t1(a) where (a > 99);
> CREATE PUBLICATION
> test_pub=# create publication pub2 for table t1(a,b) where (b < 33);
> CREATE PUBLICATION
> 
> Following seems OK when I swap orders of publication names...
> 
> test_pub=# SELECT gpt.relid, gpt.attrs, pg_get_expr(gpt.qual,
> gpt.relid) AS rowfilter from pg_get_publication_tables(VARIADIC
> ARRAY['pub2','pub1']) gpt(relid, attrs, qual);
>  relid | attrs | rowfilter
> ---+---+---
>  16385 | 1 2   | (b < 33)
>  16385 | 1 | (a > 99)
> (2 rows)
> 
> test_pub=# SELECT gpt.relid, gpt.attrs, pg_get_expr(gpt.qual,
> gpt.relid) AS rowfilter from pg_get_publication_tables(VARIADIC
> ARRAY['pub1','pub2']) gpt(relid, attrs, qual);
>  relid | attrs | rowfilter
> ---+---+---
>  16385 | 1 | (a > 99)
>  16385 | 1 2   | (b < 33)
> (2 rows)
> 
> But what about this (this is similar to the SQL fragment from
> fetch_table_list); I swapped the pub names but the results are the
> same...
> 
> test_pub=# SELECT pg_get_publication_tables(VARIADIC
> array_agg(p.pubname)) from pg_publication p where pubname
> IN('pub2','pub1');
> 
>  pg_get_publication_tables
> 
> -
> -
> -
> -
> ---
>  (16385,1,"{OPEXPR :opno 521 :opfuncid 147 :opresulttype 16 :opretset
> false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1 :varattno 1
> :vartype 23 :vartypmod -1 :var
> collid 0 :varlevelsup 0 :varnosyn 1 :varattnosyn 1 :location 47}
> {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4
> :constbyval true :constisnull false :
> location 51 :constvalue 4 [ 99 0 0 0 0 0 0 0 ]}) :location 49}")
>  (16385,"1 2","{OPEXPR :opno 97 :opfuncid 66 :opresulttype 16
> :opretset false :opcollid 0 :inputcollid 0 :args ({VAR :varno 1
> :varattno 2 :vartype 23 :vartypmod -1 :v
> arcollid 0 :varlevelsup 0 :varnosyn 1 :varattnosyn 2 :location 49}
> {CONST :consttype 23 :consttypmod -1 :constcollid 0 :constlen 4
> :constbyval true :constisnull false
>  :location 53 :con

Re: archive modules

On Mon, Oct 17, 2022 at 01:46:39PM +0900, Kyotaro Horiguchi wrote:
> As the code written, when archive library is being added while archive
> command is already set, archiver first emits seemingly positive
> message "restarting archive process because of..", then errors out
> after the resatart and keep restarting with complaining for the wrong
> setting. I think we don't need the first message.
> 
> The ERROR always turns into FATAL, so FATAL would less confusing here,
> maybe.

You mean the second message in HandlePgArchInterrupts() when
archiveLibChanged is false?  An ERROR or a FATAL would not change much
as there is a proc_exit() anyway down the road.

+   if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("both archive_command and archive_library specified"),
+errdetail("Only one of archive_command, archive_library may be 
set.")));

So, backpedalling from upthread where Peter mentions that we should
complain if both archive_command and archive_library are set (creating
a parallel with recovery parameters), I'd like to think that pgarch.c
should have zero knowledge of what an archive_command is and should
just handle the library part.  This makes the whole reasoning around
what pgarch.c should be much simpler, aka it just needs to know about
archive *libraries*, not *commands*.  That's the kind of business that
check_configured_cb() is designed for, actually, as far as I
understand, or this callback could just be removed entirely for the
same effect, as there would be no point in having pgarch.c do its
thing without archive_library or archive_command where a WARNING is
issued in the default case (shell_archive with no archive_command).

And, by the way, this patch would prevent the existence of archive
modules that need to be loaded but *want* an archive_command with
what they want to achieve.  That does not strike me as a good idea if
we want to have a maximum of flexibility with this facility.  I think
that for all that, we should put the responsability of what should be
set or not set directly to the modules, aka basic_archive could
complain if archive_command is set, but that does not strike me as a
mandatory requirement, either.  It is true that archive_library has
been introduced as a way to avoid using archive_command, but the point
of creating a stronger dependency between both would be IMO annoying
in the long-term.
--
Michael

signature.asc
Description: PGP signature

RE: Data is copied twice when specifying both child and parent table in publication

2022-10-16 Thread wangw.f...@fujitsu.com

On Wed, Oct 5, 2022 at 23:05 PM Osumi, Takamichi/大墨 昂道 
 wrote:
> Hi, thank you for the updated patches!
> 
> 
> Here are my minor review comments for HEAD v12.

Thanks for your comments.

> (1) typo & suggestion to reword one comment
> 
> 
> +* Publications support partitioned tables. If
> +* publish_via_partition_root is false, all changes 
> are replicated
> +* using leaf partition identity and schema, so we 
> only need
> +* those. Otherwise, If publish_via_partition_root is 
> true, get
> +* the partitioned table itself.
> 
> 
> The last sentence has "If" in the middle of the sentence.
> We can use the lower letter for it. Or, I think "Otherwise" by itself means
> "If publish_via_partition_root is true". So, I'll suggest a below change.
> 
> 
> FROM:
> Otherwise, If publish_via_partition_root is true, get the partitioned table 
> itself.
> TO:
> Otherwise, get the partitioned table itself.

Improved.

> (2) Do we need to get "attnames" column from the publisher in the
> fetch_table_list() ?
> 
> When I was looking at v16 path, I didn't see any codes that utilize
> the "attnames" column information returned from the publisher.
> If we don't need it, could we remove it ?
> I can miss something greatly, but this might be affected by HEAD codes ?

Yes, it is affected by HEAD. I think we need this column to check whether the
same table has multiple column lists. (see commit fd0b9dc)

The new patch set were attached in [1].

[1] - 
https://www.postgresql.org/message-id/OS3PR01MB6275843B2BBE92870F7881C19E299%40OS3PR01MB6275.jpnprd01.prod.outlook.com

Regards,
Wang wei

Re: Improve description of XLOG_RUNNING_XACTS

On Mon, Oct 17, 2022 at 09:58:31AM +0530, Amit Kapila wrote:
> Okay, let's wait for two or three days and see if anyone thinks
> differently, otherwise, I'll push v3 after a bit more testing.

No objections from here if you want to go ahead with v3 and print the
full set of subxids on top of the information about these
overflowing.
--
Michael


signature.asc
Description: PGP signature

Re: Suppressing useless wakeups in walreceiver