date:20241211

jian he  writes:
> add parser_errposition to some places in
> transformTableConstraint, transformColumnDefinition
> where v8 didn't.

I'm not loving the idea of cons'ing up ParseStates in random places in
tablecmds.c.  I think we ought to fix things so that the one made in
standard_ProcessUtility is passed down to all these places, replacing
ad-hoc queryString and queryEnv parameters.

Eventually we might want to make ProcessUtility's callers pass down
a pstate, but that would be a whole other area of invasive changes,
so I think we should leave that idea for later.  Right now though,
it seems reasonable to change AlterTableUtilityContext to replace
the queryString and queryEnv fields with a ParseState carrying that
info --- and, perhaps, someday saving us from adding more ad-hoc
fields there.

regards, tom lane

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

On Thu, 12 Dec 2024 at 00:16, David Rowley  wrote:
> There were a few places that were missing a BUFFERS OFF as indicated
> by the buildfarm. I've fixed a few of these, but there's at least one
> more in pg_stat_statements's level_tracking.sql. I'm currently running
> a -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE enabled make
> check-world.  It's going to take a while, so I'll hold off doing yet
> another commit to fix these until I see if the level_tracking.sql one
> is the last issue to fix.

That run didn't show any other failures, so I pushed the patch.

David

Re: [BUG] Possible occurrence of segfault in ecpg test

Daniil Davydov <3daniss...@gmail.com> writes:
> Thus, if the program behaves in an unexpected way and the transaction
> is aborted before it executes the
> "select data1 into :recv_vlen_buf from test" query, dump_binary will
> refer to a null pointer. So, instead of an error
> message, the user will see a segfault.

> I think that in all such cases it is worth adding some checks into
> .pgc and .c files (like in attached patch)

I cannot get excited about changing this.  In the first place,
this is hardly the only conceivable failure here --- for example,
if the query succeeds but returns only one row, the code still
segfaults.  In the second place, this is just test code and any
sort of failure is as good as any other in terms of calling our
attention to a problem.  In the third place, if we did want to
improve our standard of robustness for the ECPG tests, there are
probably many other places in them with similar issues.  Making
that happen seems like a lot of work for little return.  So I'm
inclined to just leave it alone.

regards, tom lane

Re: VACUUM cleaning of CF 2024-11

> On 11 Dec 2024, at 06:17, Michael Paquier  wrote:

> I've done a pass over the CF app, and did some routine vacuuming of
> the whole.

FWIW, thank you very much for doing that!

--
Daniel Gustafsson

Re: Assert failure on running a completed portal again

2024-12-11 Thread Robert Haas

On Tue, Dec 10, 2024 at 4:45 PM Tom Lane  wrote:
> Anyway, if you feel like rewriting that stuff, step right up.
> My feeling about it is that the law of conservation of cruft
> will prevent a replacement from being all that much cleaner,
> but maybe I'm wrong.

Thanks for the thoughts. It's always good to get your perspective on
stuff like this, or at least I find it so.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Add support to TLS 1.3 cipher suites and curves lists

Jacob Champion  writes:
> On Wed, Dec 11, 2024 at 9:11 AM Nathan Bossart  
> wrote:
>> Sorry for chiming in so late here, but I was a little surprised to see the
>> TLS version in the GUC name.  ISTM this would require us to create a new
>> GUC for every new TLS version, or explain that ssl_tls13_ciphers isn't just
>> for 1.3.

> I agree it's not ideal. But part of the problem IMO is that we might
> actually _have_ to introduce a new GUC for a future TLS 1.4, because
> we have no idea if the ciphersuites will change incompatibly again. (I
> hope not, but they did it once and they could do it again.)
> If 1.4, or 2.0, or... 4? [1] comes out later, and it turns out to be
> compatible, we could probably add a more appropriate alias then. (For
> now, just as some additional data points, both Apache and Curl use
> "1.3" or "13" in the configuration as a differentiator.) Do you have a
> different naming scheme in mind?

Oh yay, another naming problem :-(.  I think that neither "ciphers"
vs. "cipher suites" nor "ssl_ciphers" vs. "ssl_ciphers_tlsv13" is
going to convey a lot to the average person who's not steeped in
TLS minutiae.  However, following the precedent of Apache and Curl
seems like a good answer --- that will ensure that at least some
part of the internet-using world has seen this before.  So I guess
I'm +0.5 for the ssl_ciphers_tlsv13 answer, at least out of the
choices suggested so far.

regards, tom lane

Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)

2024-12-11 Thread Peter Geoghegan

On Sat, Nov 9, 2024 at 1:46 PM Peter Geoghegan  wrote:
> On Sat, Nov 9, 2024 at 12:37 PM Alena Rybakina
>  wrote:
> > I noticed that the "Index Searches" cases shown in the regression tests are 
> > only for partitioned tables, maybe something you should add some tests for 
> > regular tables like tenk1.
>
> I allowed the patch on this thread to bitrot, but I've been
> maintaining this same patch as part of the skip scan patchset.
> Attached is the latest version of this patch (technically this is the
> first patch in the skip scan patch series). Just to keep things
> passing on the CFTester app.

Attached revision just fixes bitrot.

The patch stopped applying against HEAD cleanly due to recent work
that made EXPLAIN ANALYZE show buffers output by default.

--
Peter Geoghegan


v19-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
Description: Binary data

Re: Add Postgres module info

Andres Freund  writes:
> On 2024-12-11 13:21:03 -0500, Tom Lane wrote:
>> There are a couple of ways that we could deal with the API
>> seen by module authors:

> To be future proof, I think it'd be good to declare the arguments as
> designated initializers. E.g. like

> PG_MODULE_MAGIC_EXT(
>   .version = 1,
>   .threadsafe = 1
> );

Yeah, I'd come to pretty much the same conclusion after sending
my email.  That looks like it should work and be convenient
to extend further.

The other possibly-non-obvious bit is that we should probably
invent a sub-structure holding the ABI-related fields, so as to
minimize the amount of rewriting needed in dfmgr.c.

regards, tom lane

Re: Make use of pg_memory_is_all_zeros() in more places

Committed.

-- 
nathan

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

2024-12-11 Thread Guillaume Lelarge

Le mer. 11 déc. 2024 à 15:00, jian he  a
écrit :

> do we also need to update doc/src/sgml/rules.sgml?
>
> https://www.postgresql.org/docs/current/rules-materializedviews.html
>

Yes, you're right. Here is a fix. (Not a patch with a commit message as I
don't know if David will want to commit everything (the CI issue, this, and
probably more) in one patch.)


-- 
Guillaume.
diff --git a/doc/src/sgml/rules.sgml b/doc/src/sgml/rules.sgml
index 7a928bd7b9..6cebed5c37 100644
--- a/doc/src/sgml/rules.sgml
+++ b/doc/src/sgml/rules.sgml
@@ -1042,12 +1042,16 @@ SELECT count(*) FROM words WHERE word = 'caterpiler';
 If the materialized view is used instead, the query is much faster:
 
 
- Aggregate  (cost=4.44..4.45 rows=1 width=0) (actual time=0.042..0.042 rows=1 loops=1)
-   ->  Index Only Scan using wrd_word on wrd  (cost=0.42..4.44 rows=1 width=0) (actual time=0.039..0.039 rows=0 loops=1)
+ Aggregate  (cost=4.44..4.45 rows=1 width=8) (actual time=0.138..0.139 rows=1 loops=1)
+   Buffers: shared read=3
+   ->  Index Only Scan using wrd_word on wrd  (cost=0.42..4.44 rows=1 width=0) (actual time=0.132..0.133 rows=0 loops=1)
  Index Cond: (word = 'caterpiler'::text)
  Heap Fetches: 0
- Planning time: 0.164 ms
- Execution time: 0.117 ms
+ Buffers: shared read=3
+ Planning:
+   Buffers: shared hit=5
+ Planning Time: 0.143 ms
+ Execution Time: 0.200 ms
 
 
 Either way, the word is spelled wrong, so let's look for what we might
@@ -1087,11 +1091,15 @@ SELECT word FROM words ORDER BY word <-> 'caterpiler' LIMIT 10;
 Using the materialized view:
 
 
- Limit  (cost=0.29..1.06 rows=10 width=10) (actual time=187.222..188.257 rows=10 loops=1)
-   ->  Index Scan using wrd_trgm on wrd  (cost=0.29..37020.87 rows=479829 width=10) (actual time=187.219..188.252 rows=10 loops=1)
+ Limit  (cost=0.29..1.06 rows=10 width=14) (actual time=135.027..135.475 rows=10 loops=1)
+   Buffers: shared hit=4465
+   ->  Index Scan using wrd_trgm on wrd  (cost=0.29..37224.81 rows=479826 width=14) (actual time=135.023..135.470 rows=10 loops=1)
  Order By: (word <-> 'caterpiler'::text)
- Planning time: 0.196 ms
- Execution time: 198.640 ms
+ Buffers: shared hit=4465
+ Planning:
+   Buffers: shared hit=80
+ Planning Time: 0.767 ms
+ Execution Time: 140.836 ms
 
 
 If you can tolerate periodic update of the remote data to the local

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

2024-12-11 Thread Jelte Fennema-Nio

On Wed, 11 Dec 2024 at 10:38, David Rowley  wrote:
> I've pushed the main patch. I'm not planning on pushing the
> auto_explain.log_buffers default change unless there's a bit more
> discussion about it.

FreeBSD CFBot seems broken on a pg_stat_statements test because of
this change: 
https://api.cirrus-ci.com/v1/artifact/task/4845908522696704/testrun/build/testrun/pg_stat_statements/regress/regression.diffs

connection establishment versus parallel workers

My team recently received a report about connection establishment times
increasing substantially from v16 onwards.  Upon further investigation,
this seems to have something to do with commit 7389aad (which moved a lot
of postmaster code out of signal handlers) in conjunction with workloads
that generate many parallel workers.  I've attached a set of reproduction
steps.  The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
r5.24xlarge) when max_parallel_workers/max_worker_process is set very high
(>= 48).

Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
parallel worker processing much more responsive to the point of contending
with incoming connections, and that before this change, the kernel balanced
the execution of the signal handlers and ServerLoop() to prevent this.  I
don't have a concrete proposal yet, but I thought it was still worth
starting a discussion.  TBH I'm not sure we really need to do anything
since this arguably comes down to a trade-off between connection and worker
responsiveness.

-- 
nathan
setup:

psql -h "$host" -U "$user" postgres <

Re: on_error table, saving error info to a table

2024-12-11 Thread Nishant Sharma

Thanks for the v3 patch!

Please find review comments on v3:-

1) I think no need to change the below if condition, we can keep
it the way it was before i.e with
"cstate->opts.on_error != COPY_ON_ERROR_STOP" and we
add a new error ereport the way v3 has. Because for
cstate->opts.on_error as COPY_ON_ERROR_STOP cases we
can avoid two if conditions inside upper if.

+if (cstate->num_errors > 0 &&
 cstate->opts.log_verbosity >= COPY_LOG_VERBOSITY_DEFAULT)

2) No need for the below "if" check for maxattnum. We can simply
increment it with "++maxattnum" and later check if we have exactly
10 attributes for the error table. Because even if we drop any
attribute and maxattnum is 10 in pg_attribute for that rel, we should
still error out. Maybe we can rename it to "totalatts"?

+   if (maxattnum <= attForm->attnum)
+   maxattnum = attForm->attnum;

3) #define would be better, also as mentioned by Kirill switch
condition with proper #define would be better.

+   if (maxattnum != 10)
+   on_error_tbl_ok = false;

4)
>
> that would be more work.
> so i stick to if there is a table can use to
> error saving then use it, otherwise error out.
>
YES. but that would lead to a better design with an error table.
Also, I think Krill mentions the same. That is to auto create, if it
does not exist.


Regards,
Nishant Sharma (EDB).

On Tue, Dec 3, 2024 at 3:58 PM Kirill Reshke  wrote:

> On Tue, 3 Dec 2024 at 09:29, jian he  wrote:
> >
> > On Tue, Nov 5, 2024 at 6:30 PM Nishant Sharma
> >  wrote:
> > >
> > > Thanks for the v2 patch!
> > >
> > > I see v1 review comments got addressed in v2 along with some
> > > further improvements.
> > >
> > > 1) v2 Patch again needs re-base.
> > >
> > > 2) I think we need not worry whether table name is unique or not,
> > > table name can be provided by user and we can check if it does
> > > not exists then simply we can create it with appropriate columns,
> > > if it exists we use it to check if its correct on_error table and
> > > proceed.
> >
> > "simply we can create it with appropriate columns,"
> > that would be more work.
> > so i stick to if there is a table can use to
> > error saving then use it, otherwise error out.
> >
> >
> > >
> > > 3) Using #define in between the code? I don't see that style in
> > > copyfromparse.c file. I do see such style in other src file. So, not
> > > sure if committer would allow it or not.
> > > #define ERROR_TBL_COLUMNS   10
> > >
> > let's wait and see.
> >
> > > 4) Below appears redundant to me, it was not the case in v1 patch
> > > set, where it had only one return and one increment of error as new
> > > added code was at the end of the block:-
> > > +   cstate->num_errors++;
> > > +   return true;
> > > +   }
> > > cstate->num_errors++;
> > >
> > changed per your advice.
> >
> > > I was not able to test the v2 due to conflicts in v2, I did attempt to
> > > resolve but I saw many failures in make world.
> > >
> > I get rid of all the SPI code.
> >
> > Instead, now I iterate through AttributeRelationId to check if the
> > error saving table is ok or not,
> > using DirectFunctionCall3 to do the privilege check.
> > removed gram.y change, turns out it is not necessary.
> > and other kinds of refactoring.
> >
> > please check attached.
>
>
> Hi!
>
> 1)
> > + switch (attForm->attnum)
> > + {
> > + case 1:
> > + (.)
> > + case 2:
>
> case 1,2,3 ... Is too random. Other parts of core tend to use `#define
> Anum__ `. Can we follow this style?
>
> 2)
> >+ /*
> > + * similar to commit a9cf48a
> > + * (
> https://postgr.es/m/7bcfc39d4176faf85ab317d0c26786953646a411.ca...@cybertec.at
> )
> > + * in COPY FROM keep error saving table locks until the transaction end.
> > + */
>
> I can rarely see other comments referencing commits, and even few
> referencing a mail archive thread.
> Can we just write proper comment explaining the reasons?
>
>
> = overall
>
> Patch design is a little dubious for me. We give users some really
> incomprehensible API. To use on_error *relation* feature user must
> create tables with proper schema.
> Maybe a better design will be to auto-create on_error table if this
> table does not exist.
>
>
> Thoughts?
>
> --
> Best regards,
> Kirill Reshke
>

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

On Wed, 11 Dec 2024 at 22:38, David Rowley  wrote:
> I've pushed the main patch. I'm not planning on pushing the
> auto_explain.log_buffers default change unless there's a bit more
> discussion about it.

There were a few places that were missing a BUFFERS OFF as indicated
by the buildfarm. I've fixed a few of these, but there's at least one
more in pg_stat_statements's level_tracking.sql. I'm currently running
a -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE enabled make
check-world.  It's going to take a while, so I'll hold off doing yet
another commit to fix these until I see if the level_tracking.sql one
is the last issue to fix.

David

Re: FileFallocate misbehaving on XFS

2024-12-11 Thread Jakub Wartak

On Wed, Dec 11, 2024 at 4:00 AM Michael Harris  wrote:

> Hi Jakub
>
> On Tue, 10 Dec 2024 at 22:36, Jakub Wartak
>  wrote:

 [..]

>
> > 3. Maybe somehow there is a bigger interaction between posix_fallocate()
> and delayed XFS's dynamic speculative preallocation from many processes all
> writing into different partitions ? Maybe try "allocsize=1m" mount option
> for that /fs and see if that helps.  I'm going to speculate about XFS
> speculative :) pre allocations, but if we have fdcache and are *not*
> closing fds, how XFS might know to abort its own speculation about
> streaming write ? (multiply that up to potentially the number of opened fds
> to get an avalanche of "preallocations").
>
> I will try to organize that. They are production systems so it might
> take some time.
>

Cool.

> 4. You can also try compiling with patch from Alvaro from [2]
> "0001-Add-some-debugging-around-mdzeroextend.patch", so we might end up
> having more clarity in offsets involved. If not then you could use 'strace
> -e fallocate -p ' to get the exact syscall.
>
> I'll take a look at Alvaro's patch. strace sounds good, but how to
> arrange to start it on the correct PG backends? There will be a
> large-ish number of PG backends going at a time, only some of which
> are performing imports, and they will be coming and going every so
> often as the ETL application scales up and down with the load.
>

Yes, it sounds like mission impossible. Is there any chance you can get
reproduced down to one or a small number of postgres backends doing the
writes?


>
> > 5. Another idea could be catching the kernel side stacktrace of
> fallocate() when it is hitting ENOSPC. E.g. with XFS fs and attached
> bpftrace eBPF tracer I could get the source of the problem in my artificial
> reproducer, e.g
>
> OK, I will look into that also.
>
>
Hopefully that reveals some more. Somehow ENOSPC UNIX error reporting got
one big pile of errors into 1 error category and that's not helpful at all
(inode/extent/block allocation problems are all squeezed into 1 error)

Anyway, if that helps others here are my notes so far on this thread
including that useful file from subthread, hopefully I've did not
misinterpreted something:

- works in = PG16 due to posix_fallocate() rather
than multiple separate(but adjacent) iovectors to pg_writev. It launched
only in case of mdzeroextend() with numblocks > 8
- 179k or 414k files in single directory (0.3s - 0.5s just to list those)
- OS/FS upgraded from earlier release
- one AG with extreme low extent sizes compared to the others AGs (I bet
that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are
no large extents in that AG)
   from  to extents  blockspct
  1   149494949   0.65
  2   3   86113  173452  22.73
  4   7   19399   94558  12.39
  8  15   23233  248602  32.58
 16  31   12425  241421  31.64
   total free extents 146119
   total free blocks 762982
   average free extent size 5.22165 (!)
- note that the max extent size above (31) is very low when compared to the
others AG which have 1024-8192. Therefore it looks there are no contiguous
blocks for request sizes above 31*4096 = 126976 bytes within that AG (??).
- we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
up to 64 pg blocks maximum (and that's higher than the above)
- but the fails where observed also using pg_upgrade --link -j/pg_restore
-j (also concurrent posix_fallocate() to many independent files sharing the
same AG, but that's 1 backend:1 file so no contention for waitcount in
RelationAddBlocks())
- so maybe it's lots of backends doing independent concurrent
posix_fallocate() that end up somehow coalesced ? Or hypothetically let's
say 16-32 fallocates() hit the same AG initially, maybe it's some form of
concurrency semi race-condition inside XFS where one of fallocate calls
fails to find space in that one AG, but according to [1] it should fallback
to another AGs.
- and there's also additional XFS dynamic speculative preallocation that
might cause space pressure during our normal writes..

Another workaround idea/test: create tablespace on the same XFS fs (but in
a somewhat different directory if possible) and see if it still fails.

-J.

[1] - https://blogs.oracle.com/linux/post/extent-allocation-in-xfs

Re: CREATE SCHEMA ... CREATE DOMAIN support

Kirill Reshke  writes:
> On Wed, 4 Dec 2024 at 01:07, Tom Lane  wrote:
>> I'm totally willing to throw that possibility overboard
>> permanently in order to expand the set of creatable object types
>> without introducing a ton of restrictions and weird behaviors.
>> What do you think?

> Im +1 on this, but can you please elaborate, which exact objects
> cannot be created now? What will be expanded after
> v2-0002-Dont_try-to-reoder?

The problem is not too awful right now, because of the very limited
set of object types that CREATE SCHEMA supports.  The only case
I can think of offhand is a table referencing a view's rowtype,
for example

   create schema s1
create view v1 as select ...
create table t1 (compositecol v1, ...);

Since transformCreateSchemaStmtElements re-orders views after
tables, this'll fail, and there is no way to fix that except
by giving up use of the elements-in-CREATE-SCHEMA feature.
Admittedly it's a strange usage, and probably no one has tried it.

However, once we start adding in data types and functions,
the hazard grows substantially, because there are more usage
patterns and they can't all be satisfied by a simple object-type
ordering.  For example, domains are already enough to cause
trouble, because we allow domains over composites:

   create schema s1
create table t1 (...)
create domain d1 as t1 check(...);

Re-ordering domains before tables would break this case, but
the other order has other problems.  Looking a bit further
down the road, how would you handle creation of a base type
within CREATE SCHEMA?

   create schema s1
create type myscalar
create function myscalar_in(cstring) returns myscalar ...
create function myscalar_out(myscalar) returns cstring ...
create type myscalar (input = myscalar_in, ...);

This cannot possibly work if an object-type-based re-ordering
is done to it.

So IMV, we have three possibilities:

1. CREATE SCHEMA's schema-element feature remains forevermore
a sad joke that (a) doesn't cover nearly enough to be useful and
(b) doesn't come close to doing what the spec says it should.

2. We invest an enormous amount of engineering effort on trying
to extract dependencies from not-yet-analyzed parse trees, after
which we invest a bunch more effort figuring out heuristics for
ordering the subcommands in the face of circular dependencies.
(Some of that could be stolen from pg_dump, but not all: pg_dump
only has to resolve a limited set of cases.)

3. We bypass the need for #2 by decreeing that we'll execute
the subcommands in order.

>> PS: if we were really excited about allowing circular FKs to be
>> made within CREATE SCHEMA, a possible though non-standard answer
>> would be to allow ALTER TABLE ADD CONSTRAINT as a .

> That's a nice feature to have by itself?

Not unless we abandon the idea of subcommand reordering, because
where are you going to put the ALTER TABLE subcommands?

regards, tom lane

Re: Add support to TLS 1.3 cipher suites and curves lists

> On 11 Dec 2024, at 18:47, Tom Lane  wrote:

> Oh yay, another naming problem :-(.  I think that neither "ciphers"
> vs. "cipher suites" nor "ssl_ciphers" vs. "ssl_ciphers_tlsv13" is
> going to convey a lot to the average person who's not steeped in
> TLS minutiae.  However, following the precedent of Apache and Curl
> seems like a good answer --- that will ensure that at least some
> part of the internet-using world has seen this before.  So I guess
> I'm +0.5 for the ssl_ciphers_tlsv13 answer, at least out of the
> choices suggested so far.

The subset of users who are likely to be interested in this setting would
probably be more confused if we didn't follow the precedent from other
well-known projects.

--
Daniel Gustafsson

Re: Add Postgres module info

Andrei Lepikhov  writes:
> I would like to propose the module_info structure, which aims to let 
> extension maintainers sew some data into the binary file. Being included 
> in the module code, this information remains unchanged and is available 
> for reading by a backend.

I don't have much of an opinion one way or the other about the
usefulness of these additional info fields.  But I would like to
object to the way you've gone about it, namely to copy-and-paste
the magic-block mechanism.  That doesn't scale: the next time
somebody else wants some more fields, will we have three such
structs?

The approach we foresaw using was that we could simply add more
fields to Pg_magic_struct (obviously, only in a major version).
That's happened at least once already - abi_extra was not there
to begin with.

There are a couple of ways that we could deal with the API
seen by module authors:

1. The PG_MODULE_MAGIC macro keeps the same API and leaves the
additional field(s) empty.  Authors who want to fill the
extra field(s) use a new macro, say PG_MODULE_MAGIC_V2.

2. PG_MODULE_MAGIC gains some arguments, forcing everybody
to change their code.  While this would be annoying, it'd be
within our compatibility rules for a major version update.
I wouldn't do it though unless there were a compelling reason
why everybody should fill these fields.

3. Maybe we could do something with making PG_MODULE_MAGIC
variadic, but I've not thought hard about what that could
look like.  In any case it'd only be a cosmetic improvement
over the above ways.

4. The extra fields are filled indirectly by macros that
extension authors can optionally provide (a variant on the
FMGR_ABI_EXTRA mechanism).  This would be code-order-sensitive
so I'm not sure it's really a great idea.

5. Something I didn't think of?

With any of these except #4, authors who want their source code to
support multiple PG major versions would be forced into using #if
tests on CATALOG_VERSION_NO to decide what to write.  That's a
bit annoying but hardly unusual.

regards, tom lane

Re: Add support to TLS 1.3 cipher suites and curves lists

On Wed, Dec 11, 2024 at 12:47:01PM -0500, Tom Lane wrote:
> Jacob Champion  writes:
>> On Wed, Dec 11, 2024 at 9:11 AM Nathan Bossart  
>> wrote:
>>> Sorry for chiming in so late here, but I was a little surprised to see the
>>> TLS version in the GUC name.  ISTM this would require us to create a new
>>> GUC for every new TLS version, or explain that ssl_tls13_ciphers isn't just
>>> for 1.3.
> 
>> I agree it's not ideal. But part of the problem IMO is that we might
>> actually _have_ to introduce a new GUC for a future TLS 1.4, because
>> we have no idea if the ciphersuites will change incompatibly again. (I
>> hope not, but they did it once and they could do it again.)
>> If 1.4, or 2.0, or... 4? [1] comes out later, and it turns out to be
>> compatible, we could probably add a more appropriate alias then. (For
>> now, just as some additional data points, both Apache and Curl use
>> "1.3" or "13" in the configuration as a differentiator.) Do you have a
>> different naming scheme in mind?

In a vacuum, I would've probably voted for ssl_cipher_suites since it is
reasonably descriptive and version-independent.  It's true that we'd need
lots of documentation to explain which parameter is used for which TLS
version, but I think we need that regardless of the parameter name.

> Oh yay, another naming problem :-(.  I think that neither "ciphers"
> vs. "cipher suites" nor "ssl_ciphers" vs. "ssl_ciphers_tlsv13" is
> going to convey a lot to the average person who's not steeped in
> TLS minutiae.  However, following the precedent of Apache and Curl
> seems like a good answer --- that will ensure that at least some
> part of the internet-using world has seen this before.  So I guess
> I'm +0.5 for the ssl_ciphers_tlsv13 answer, at least out of the
> choices suggested so far.

I wasn't aware that other projects were including the version, too.  IMHO
that's a fair argument for the current name, if for no other reason than
we'll be in good company if/when things change.  All things considered, I'd
probably still vote for something like ssl_cipher_suites, but I'm happy to
consider the matter resolved if we've given it some thought and decided to
stick with ssl_tls13_ciphers.

-- 
nathan

Re: Count and log pages set all-frozen by vacuum

2024-12-11 Thread Masahiko Sawada

On Thu, Dec 5, 2024 at 4:32 PM Melanie Plageman
 wrote:
>
> On Mon, Dec 2, 2024 at 9:28 AM Robert Haas  wrote:
> >
> > On Fri, Nov 29, 2024 at 2:46 PM Masahiko Sawada  
> > wrote:
> > > Finally, in case where we have:
> > >
> > > visibility map: 1 pages set all-visible, 1 pages set all-frozen.
> > >
> > > We can understand that 1 pages newly became all-frozen, but have
> > > no idea how many pages became all-visible but not all-frozen. It could
> > > be even 0. Users might want to know it to understand how (auto)vacuum
> > > and freezing are working well.
> >
> > I agree that this is possible, but I am not clear why the user should
> > care. In scenario #1, the same pages were set all-visible and also
> > all-frozen. In scenario #2, one set of pages were set all-visible and
> > a different set of pages were set all-frozen. Scenario #2 is a little
> > worse, for a couple of reasons. First, in scenario #2, more pages were
> > dirtied to achieve the same result. However, if the user is concerned
> > about the amount of I/O that vacuum is doing, they will more likely
> > look at the "buffer usage" and "WAL usage" lines in the VACUUM verbose
> > output rather than at the "visibility map" line. Second, in scenario
> > #2, we have deferred some work to the future and there is a risk that
> > the pages will remain all-visible but not all-frozen until the next
> > aggressive vacuum. I think it is possible someone could want to see
> > more detailed information for this reason.
> >
> > However, I expect that it will be unimportant in a practical sense of
> > Melanie's work in this area gets committed, because her goal is to set
> > things up so that we'll revisit all-visible but not all-frozen pages
> > sooner, so that the work doesn't build up so much prior to the next
> > aggressive vacuum. And I think that work will have its own logging
> > that will make it clear what it is doing, hence I don't foresee the
> > need for more detail here.
> >
> > All that said, if you really want this broken out into three
> > categories rather than two, I'm not overwhelmingly opposed to that. It
> > is always possible that more detail could be useful; it is just
> > difficult to decide where the likelihood is high enough to justify
> > adding more output.

I agree that it will be unimportant from Melanie's work in this area.
Also, I agree that if semi-aggressive vacuum has its own new logging
message about what it's done, this new message doesn't necessarily
need to be detailed. But looking at the proposed patches, there seems
to be no such new logging message so far. Showing three categories
makes sense to me independent of semi-aggressive vacuum work. If we
figure out it's better to merge some parts of this new message to
semi-aggressive vacuum's logs, we can adjust them later.

> Hmm. So at the risk of producing two loaves that are worse than none,
> I've decided I like the "everything" approach:
>
> visibility map: 5 pages newly set all-visible, of which 2 set
> all-frozen. 2 all-visible pages newly set all-frozen.
>
> With this I can clearly get any of the information I might want: the
> number of pages that changed state, the total number of pages set
> all-visible or all-frozen, and the total number of vmbits set. It
> makes the all-visible but not all-frozen debt clear. Without
> specifying how many of the pages it set all-frozen were all-visible,
> you don't really get that. I originally envisioned this log message as
> a way to know which vmbits were set. But it is kind of nice to know
> which pages changed state too.
>
> With three counters, the code and comment repetition is a bit trying,
> but I don't think it is quite enough to justify a "log_vmbits_set()"
> function.
>
> Here's an example to exercise the new log message:
>
> create table foo (a int, b int) with (autovacuum_enabled = false);
> insert into foo select generate_series(1,1000), 1;
> delete from foo where a > 500;
> vacuum (verbose) foo;
> -- visibility map: 5 pages newly set all-visible, of which 2 set
> all-frozen. 0 all-visible pages newly set all-frozen.
> insert into foo select generate_series(1,500), 1;
> vacuum (verbose, freeze) foo;
> --visibility map: 3 pages newly set all-visible, of which 3 set
> all-frozen. 2 all-visible pages newly set all-frozen.

The output makes sense to me. The patch mostly looks good to me. Here
is one minor comment:

if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0 &&
(old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0)

Maybe it would be better to rewrite it to:

if ((old_vmbits & (VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN)) == 0)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Unmark gen_random_uuid() function leakproof

2024-12-11 Thread Masahiko Sawada

On Tue, Dec 10, 2024 at 1:52 PM Masahiko Sawada  wrote:
>
> On Mon, Dec 9, 2024 at 2:48 PM Masahiko Sawada  wrote:
> >
> > On Mon, Dec 9, 2024 at 2:23 PM Andres Freund  wrote:
> > >
> > > Hi,
> > >
> > > On 2024-12-09 14:10:30 -0800, Masahiko Sawada wrote:
> > > > While reviewing UUIDv7 patch[1], I realized gen_random_uuid() is
> > > > marked leakproof even though it doesn't take arguments. The functions
> > > > without arguments don't need to be marked leakproof in principle. This
> > > > is the sole function that has no arguments and is listed in the "List
> > > > of built-in leakproof functions" in opr_sanity.sql. I've attached the
> > > > patch for fixing it and for better consistency with new UUID
> > > > generation functions discussed on that thread.
> > >
> > > Seems like it'd make sense to add a test to opr_sanity.sql so we don't
> > > reintroduce such cases?
> > >
> >
> > Thank you for the comment. It's a good idea. I've updated the patch.
> >
>
> I'm going to push the updated patch tomorrow, barring objections and
> further comments.

Pushed (398d3e3b5b).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: Add Postgres module info

2024-12-11 Thread Andres Freund

Hi,


On 2024-12-11 13:21:03 -0500, Tom Lane wrote:
> Andrei Lepikhov  writes:
> > I would like to propose the module_info structure, which aims to let
> > extension maintainers sew some data into the binary file. Being included
> > in the module code, this information remains unchanged and is available
> > for reading by a backend.
>
> I don't have much of an opinion one way or the other about the
> usefulness of these additional info fields.

FWIW, Id like to have some more information in there, without commenting on
the specifics.


> But I would like to object to the way you've gone about it, namely to
> copy-and-paste the magic-block mechanism.  That doesn't scale: the next time
> somebody else wants some more fields, will we have three such structs?

I agree with that.


> The approach we foresaw using was that we could simply add more
> fields to Pg_magic_struct (obviously, only in a major version).
> That's happened at least once already - abi_extra was not there
> to begin with.
>
> There are a couple of ways that we could deal with the API
> seen by module authors:
>
> 1. The PG_MODULE_MAGIC macro keeps the same API and leaves the
> additional field(s) empty.  Authors who want to fill the
> extra field(s) use a new macro, say PG_MODULE_MAGIC_V2.
>
> 2. PG_MODULE_MAGIC gains some arguments, forcing everybody
> to change their code.  While this would be annoying, it'd be
> within our compatibility rules for a major version update.
> I wouldn't do it though unless there were a compelling reason
> why everybody should fill these fields.

I'd like to avoid needing to do this again if / when we invent the next set of
optional arguments. So just having a different macro with a hardcoded set of
arguments or changing PG_MODULE_MAGIC to have a hardcoded set of arguments
doesn't seem great.

To be future proof, I think it'd be good to declare the arguments as
designated initializers. E.g. like

PG_MODULE_MAGIC_EXT(
  .version = 1,
  .threadsafe = 1
);

where the macro would turn the arguments into a struct initializer inside
Pg_magic_struct.

That way we can add/remove arguments and only extensions that use
removed arguments need to change.


> 3. Maybe we could do something with making PG_MODULE_MAGIC
> variadic, but I've not thought hard about what that could
> look like.  In any case it'd only be a cosmetic improvement
> over the above ways.

Yea, it'd be nice to avoid needing an _EXT or _V2. But I can't immediately
think of a way that allows a macro with no arguments and and an argument.


> 4. The extra fields are filled indirectly by macros that
> extension authors can optionally provide (a variant on the
> FMGR_ABI_EXTRA mechanism).  This would be code-order-sensitive
> so I'm not sure it's really a great idea.

Agreed.


> With any of these except #4, authors who want their source code to
> support multiple PG major versions would be forced into using #if
> tests on CATALOG_VERSION_NO to decide what to write.  That's a
> bit annoying but hardly unusual.

#2 would be bit more annoying than #1, I'd say, because it'd affect every
single extension, even ones not interested in any of this.

Greetings,

Andres Freund

RE: COPY performance on Windows

2024-12-11 Thread Ryohei Takahashi (Fujitsu)

Hi,


I continuously investigate the performance problem of COPY on Windows.

I noticed that not only PG17.0 but also PG16.6 have performance problem 
compared to PG16.4.
The performance is 2.5%-5.8% worse, especially when the number of clients is 1 
or 2.

I modified the performance measurement script of the thread in [1].
* Enabled to run on Windows git bash
* Enabled to compare PG16.4, PG16.6 and PG17.0
* Increase the row number to 10 times (about 10GB)

I measured on Windows Server 2022 machine with 44 core CPU and 512GB memory.
The results are following.

* PG16.4
PG164: nclients = 1, time  = 432
PG164: nclients = 2, time  = 238
PG164: nclients = 4, time  = 157
PG164: nclients = 8, time  = 135
PG164: nclients = 16, time  = 163
PG164: nclients = 32, time  = 261
PG164: nclients = 64, time  = 458
PG164: nclients = 128, time  = 611
PG164: nclients = 256, time  = 622

* PG16.6
PG166: nclients = 1, time  = 444 (2.7% worse than PG16.4)
PG166: nclients = 2, time  = 252 (5.8% worse than PG16.4)
PG166: nclients = 4, time  = 156
PG166: nclients = 8, time  = 135
PG166: nclients = 16, time  = 163
PG166: nclients = 32, time  = 261
PG166: nclients = 64, time  = 458
PG166: nclients = 128, time  = 612
PG166: nclients = 256, time  = 621

* PG17.0
PG170: nclients = 1, time  = 448 (3.7% worse than PG16.4)
PG170: nclients = 2, time  = 244 (2.5% worse than PG16.4)
PG170: nclients = 4, time  = 159
PG170: nclients = 8, time  = 137
PG170: nclients = 16, time  = 165
PG170: nclients = 32, time  = 262
PG170: nclients = 64, time  = 458
PG170: nclients = 128, time  = 611
PG170: nclients = 256, time  = 621



(1)
I attach the performance measurement script.
If you have a Windows environment, could you please reproduce the same 
performance problem?

(2)
The performance of PG16.6 and PG17.0 are worse than PG16.4.
So, I think the commits between August and September affects the performance.
I will analyze these commits.


[1] 
https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=kp6mszngk3bq9yr...@mail.gmail.com


Regards,
Ryohei Takahashi



test.sh
Description: test.sh

Re: NOT ENFORCED constraint feature

On Tue, Dec 10, 2024 at 7:48 PM Amul Sul  wrote:
>
> >
> > static bool
> >  MergeWithExistingConstraint(Relation rel, const char *ccname, Node *expr,
> >   bool allow_merge, bool is_local,
> > + bool is_enforced,
> >   bool is_initially_valid,
> >   bool is_no_inherit)
> >  {
> > @@ -2729,12 +2738,24 @@ MergeWithExistingConstraint(Relation rel,
> > const char *ccname, Node *expr,
> >   * If the child constraint is "not valid" then cannot merge with a
> >   * valid parent constraint.
> >   */
> > - if (is_initially_valid && !con->convalidated)
> > + if (is_initially_valid && con->conenforced && !con->convalidated)
> >   ereport(ERROR,
> >   (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> >   errmsg("constraint \"%s\" conflicts with NOT VALID constraint on
> > relation \"%s\"",
> >   ccname, RelationGetRelationName(rel;
> >
> > There are no tests for this change. I think this change is not necessary.
> >
>
> It is necessary; otherwise, it would raise an error for a NOT ENFORCED
> constraint, which is NOT VALID by default.
>
got it.
overall v8-0001 looks good to me!

do you have a patch for
alter check constraint set [not] enforced?
If not, I will probably try to work on it.

I am playing around with the remaining patch.

ATExecAlterConstrRecurse
ATExecAlterConstrEnforceability
ATExecAlterChildConstr
AlterConstrTriggerDeferrability
These four functions take a lot of arguments.
more comments about these arguments would be helpful.
we only need to mention it at ATExecAlterConstrRecurse.

for example:
ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel,
 const Oid fkrelid, const Oid pkrelid,
 HeapTuple contuple, List **otherrelids,
 LOCKMODE lockmode, Oid ReferencedParentDelTrigger,
 Oid ReferencedParentUpdTrigger,
 Oid ReferencingParentInsTrigger,
 Oid ReferencingParentUpdTrigger)
the comments only explained otherrelids.

LOCKMODE lockmode,
Oid ReferencedParentDelTrigger,
Oid ReferencedParentUpdTrigger,
Oid ReferencingParentInsTrigger,
Oid ReferencingParentUpdTrigger

The above arguments are pretty intuitive.

Constraint *cmdcon
Relation conrel
Relation tgrel
HeapTuple contuple

but these arguments are not that very intuitive,
especially these arguments passing to another function.

Re: NOT ENFORCED constraint feature

2024-12-11 Thread Amul Sul

On Wed, Dec 11, 2024 at 6:12 PM jian he  wrote:
>
> On Tue, Dec 10, 2024 at 7:48 PM Amul Sul  wrote:
> >
> > >
> > > static bool
> > >  MergeWithExistingConstraint(Relation rel, const char *ccname, Node *expr,
> > >   bool allow_merge, bool is_local,
> > > + bool is_enforced,
> > >   bool is_initially_valid,
> > >   bool is_no_inherit)
> > >  {
> > > @@ -2729,12 +2738,24 @@ MergeWithExistingConstraint(Relation rel,
> > > const char *ccname, Node *expr,
> > >   * If the child constraint is "not valid" then cannot merge with a
> > >   * valid parent constraint.
> > >   */
> > > - if (is_initially_valid && !con->convalidated)
> > > + if (is_initially_valid && con->conenforced && !con->convalidated)
> > >   ereport(ERROR,
> > >   (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> > >   errmsg("constraint \"%s\" conflicts with NOT VALID constraint on
> > > relation \"%s\"",
> > >   ccname, RelationGetRelationName(rel;
> > >
> > > There are no tests for this change. I think this change is not necessary.
> > >
> >
> > It is necessary; otherwise, it would raise an error for a NOT ENFORCED
> > constraint, which is NOT VALID by default.
> >
> got it.
> overall v8-0001 looks good to me!
>

Thank you.

> do you have a patch for
> alter check constraint set [not] enforced?
> If not, I will probably try to work on it.
>

Not yet; I believe I need to first look into allowing NOT VALID
foreign key constraints on partitioned tables.

>
> I am playing around with the remaining patch.
>
> ATExecAlterConstrRecurse
> ATExecAlterConstrEnforceability
> ATExecAlterChildConstr
> AlterConstrTriggerDeferrability
> These four functions take a lot of arguments.
> more comments about these arguments would be helpful.
> we only need to mention it at ATExecAlterConstrRecurse.
>
> for example:
> ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel,
>  const Oid fkrelid, const Oid pkrelid,
>  HeapTuple contuple, List **otherrelids,
>  LOCKMODE lockmode, Oid ReferencedParentDelTrigger,
>  Oid ReferencedParentUpdTrigger,
>  Oid ReferencingParentInsTrigger,
>  Oid ReferencingParentUpdTrigger)
> the comments only explained otherrelids.
>
> LOCKMODE lockmode,
> Oid ReferencedParentDelTrigger,
> Oid ReferencedParentUpdTrigger,
> Oid ReferencingParentInsTrigger,
> Oid ReferencingParentUpdTrigger
>
> The above arguments are pretty intuitive.
>
> Constraint *cmdcon
> Relation conrel
> Relation tgrel
> HeapTuple contuple
>
> but these arguments are not that very intuitive,
> especially these arguments passing to another function.

Those are the existing ones; let me think what can be done with them.

Regards,
Amul

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

2024-12-11 Thread Guillaume Lelarge

Le mer. 11 déc. 2024 à 12:16, David Rowley  a écrit :

> On Wed, 11 Dec 2024 at 22:38, David Rowley  wrote:
> > I've pushed the main patch. I'm not planning on pushing the
> > auto_explain.log_buffers default change unless there's a bit more
> > discussion about it.
>
> There were a few places that were missing a BUFFERS OFF as indicated
> by the buildfarm. I've fixed a few of these, but there's at least one
> more in pg_stat_statements's level_tracking.sql. I'm currently running
> a -DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE enabled make
> check-world.  It's going to take a while, so I'll hold off doing yet
> another commit to fix these until I see if the level_tracking.sql one
> is the last issue to fix.
>
>
Sorry for all those late fixes. I'll refrain from saying "Sure looks easy
enough to do" from now on :) Guess this is a great example of what Robert
said in
http://rhaas.blogspot.com/2024/05/hacking-on-postgresql-is-really-hard.html
.

Anyway, thanks again for your work on this.


-- 
Guillaume.

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

do we also need to update doc/src/sgml/rules.sgml?

https://www.postgresql.org/docs/current/rules-materializedviews.html

Re: on_error table, saving error info to a table

2024-12-11 Thread Andrew Dunstan



On 2024-12-02 Mo 11:28 PM, jian he wrote:

On Tue, Nov 5, 2024 at 6:30 PM Nishant Sharma
  wrote:

Thanks for the v2 patch!

I see v1 review comments got addressed in v2 along with some
further improvements.

1) v2 Patch again needs re-base.

2) I think we need not worry whether table name is unique or not,
table name can be provided by user and we can check if it does
not exists then simply we can create it with appropriate columns,
if it exists we use it to check if its correct on_error table and
proceed.

"simply we can create it with appropriate columns,"
that would be more work.
so i stick to if there is a table can use to
error saving then use it, otherwise error out.



3) Using #define in between the code? I don't see that style in
copyfromparse.c file. I do see such style in other src file. So, not
sure if committer would allow it or not.
#define ERROR_TBL_COLUMNS   10


let's wait and see.


4) Below appears redundant to me, it was not the case in v1 patch
set, where it had only one return and one increment of error as new
added code was at the end of the block:-
+   cstate->num_errors++;
+   return true;
+   }
 cstate->num_errors++;


changed per your advice.


I was not able to test the v2 due to conflicts in v2, I did attempt to
resolve but I saw many failures in make world.


I get rid of all the SPI code.

Instead, now I iterate through AttributeRelationId to check if the
error saving table is ok or not,
using DirectFunctionCall3 to do the privilege check.
removed gram.y change, turns out it is not necessary.
and other kinds of refactoring.

please check attached.




I don't have a comment on the patch in general, but wouldn't it be 
better to use a typed table? Then all you would need to check is the 
reloftype to make sure it's the right type, instead of checking all the 
column names and types. Creating the table would be simpler, too. Just


CREATE TABLE my_copy_errors OF TYPE pg_copy_error;

cheers


andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

Re: Document NULL

2024-12-11 Thread Marcos Pegoraro

Em ter., 10 de dez. de 2024 às 20:00, David G. Johnston <
david.g.johns...@gmail.com> escreveu:

> I got rid of the row counts on the examples.
>
Cool, I would love to get rid all of them, like I proposed on [1].


> When I finalize the examples I'm probably going to \pset null .
>
Yes, much better than an empty space in the examples, but you need to show
what PSET you did, maybe
\pset null meta-command to specify the
textual output of null values
it encounters in query results. To get same results as you are seeing
on this page, do "\pset null "

typo in func.sgml
   three-valued typed
   results (true, false, or null).
should remove that comma, right ?
   results (true, false or null).

Would be good to mention on nullvalues-json section that nulls on JSON
values are case sensitive, so NULL or Null won't work

[1] -
https://www.postgresql.org/message-id/CAB-JLwYaauh-QRjnqN-DLfgZ8xFmy_uyGe545tiuejtiX0-T%2BQ%40mail.gmail.com

regards
Marcos

Re: Suggestion to standardize comment format in pg_dump

2024-12-11 Thread Euler Taveira

On Wed, Dec 11, 2024, at 3:37 AM, Nohez Poonawala wrote:
> Currently, the pg_dump tool outputs comments in different formats for
> primary/foreign keys and indexes.  Below is the comment format.
> 
> - for Primary key:
>   -- Name: TABLENAME CONSTRAINTNAME; Type: CONSTRAINT; Schema: SCHEMA;
> Owner: OWNER
> - for Foreign key:
>   -- Name: TABLENAME CONSTRAINTNAME; Type: FK CONSTRAINT; Schema:
> SCHEMA; Owner: OWNER
> - for Index:
>   -- Name: INDEXNAME; Type: INDEX; Schema: SCHEMA; Owner: OWNER
> 
> To maintain consistency, I suggest modifying the comment format for
> indexes to include the associated TABLENAME, similar to constraints.
> 
> For example:
> - for Index:
>   -- Name: TABLENAME INDEXNAME; Type: INDEX; Schema: SCHEMA; Owner: OWNER
> 
> This small change would improve clarity and make the output format more 
> uniform.

Constraints and indexes are different classes of objects. Per SQL standard,
different tables can have same constraint names. However, different tables
cannot have same index name (SQL standard says nothing about indexes. That's a
Postgres implementation detail.) Having said that, you don't need the table
name to associate it with an index (because it is unique in a schema) but a
constraint requires a table (because there might be multiple constraints with
the same name in a schema).  An argument against this inclusion is that it will
increase the output file size without adding a crucial information. You mention
consistency but since it is a different class of objects I don't think this
argument holds much water.

--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: Suggestion to standardize comment format in pg_dump

"Euler Taveira"  writes:
> On Wed, Dec 11, 2024, at 3:37 AM, Nohez Poonawala wrote:
>> To maintain consistency, I suggest modifying the comment format for
>> indexes to include the associated TABLENAME, similar to constraints.
>> For example:
>> - for Index:
>> -- Name: TABLENAME INDEXNAME; Type: INDEX; Schema: SCHEMA; Owner: OWNER

> ... An argument against this inclusion is that it will
> increase the output file size without adding a crucial information. You 
> mention
> consistency but since it is a different class of objects I don't think this
> argument holds much water.

I think a bigger problem is compatibility.  It seems likely that there
are tools out there that would be broken by such a change.  These
comments aren't just comments: they directly reflect what is in the
"tag" fields of the per-object entries in the custom dump format.
So for example this would also affect the output of "pg_restore -l".
Even if you doubt that anything is scanning actual dump files looking
for these comments, it seems certain that people have built tools
and scripts that examine -l output.  There might well be places in
pg_restore itself that depend on the tag being just the index name
and no more, too.

I'll concede that this proposal would have been a good idea if it'd
been done that way early on.  But I don't think it's such a good idea
as to be worth breaking things for.

regards, tom lane

Re: Proposal to add a new URL data type.

2024-12-11 Thread Alexander Borisov


10.12.2024 13:59, Victor Yegorov пишет:
чт, 5 дек. 2024 г. в 17:02, Alexander Borisov >:




[..]



Hey, I had a look at this patch and found its functionality mature and 
performant.


As Peter mentioned pguri, I used it to compare with the proposed 
extension. This brought up

the following differences:
- pguri (uriparser 0.9.8) doesn't support Chinese symbols in the host 
part of URI (uri_test1.sh):


     ERROR:  invalid input syntax for type uri at or near "事 
例.com#comments "


   Therefore, I avoided Chinese or Cyrillic symbols in the pguri test 
script.
- There are no SET functions in the pguri, changing specific portions of 
URI is troublesome. I used

   replace() in the test, but this is an error prone approach.
- It's even more troublesome to set parts of the URI that are not 
initially there. Probably, a full decomposition

   into parts and the following wrap up is needed
Suggested extension has no such limitations. Additionally, pguri 
extracts userinfo as a whole,

while suggested extension can get/set user and password individually.


Running tests (attached) I got the following numbers:
$ ./url_test.sh
NOTICE:  extension "url" already exists, skipping
tps = 13068.287423 (without initial connection time)
tps = 12888.937747 (without initial connection time)
tps = 12830.642558 (without initial connection time)
tps = 12846.341411 (without initial connection time)
tps = 13187.955601 (without initial connection time)

$ ./uri_test2.sh
NOTICE:  extension "uri" already exists, skipping
tps = 2441.934308 (without initial connection time)
tps = 2513.277660 (without initial connection time)
tps = 2484.641673 (without initial connection time)
tps = 2519.312395 (without initial connection time)
tps = 2512.364492 (without initial connection time)

So it's 12.9k vs 2.5k, or 6x faster for a case where we replace 5 parts 
of the original URL.


Given its performance and functionality, I find the suggested URL 
extension better than pguri.




Thanks for the constructive comments and the testing you have done.




Now to the review part.

1. Applying patch causes indentation warning, please, bring spacing to 
the project's policy


$ git apply ~/0001-Add-url-data-type-to-contrib.patch
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:837: indent with 
spaces.

     return lexbor_calloc(1, sizeof(lexbor_array_t));
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:843: indent with 
spaces.

     if (array == NULL) {
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:844: indent with 
spaces.

         return LXB_STATUS_ERROR_OBJECT_IS_NULL;
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:845: indent with 
spaces.

     }
/home/vyegorov/0001-Add-url-data-type-to-contrib.patch:847: indent with 
spaces.

     if (size == 0) {
warning: squelched 6350 whitespace errors
warning: 6355 lines add whitespace errors.


This will be fixed when the main URL parser code is rewritten to fit
the Postgres style/infrastructure.


2. There's a lexbor/ library that contains core and url parts. Feels 
like some commentary about what's

   inside is required.


Yeah, that's a good point.


3. Do you think it's possible to adjust your code to use existing 
postgres infrastructure instead? I don't
   think having its own infrastructure for a single extension is a good 
thing. Also, this might be a source

   for performance improvements in the core.


To implement (rewrite/modify) a URL parser using exclusively Postgres
infrastructure is one of the main goals.



4. There's no user visible documentation, please, add one.


That's a good point.


I've created a commitfest entry for the patch: https:// 
commitfest.postgresql.org/51/5432/ commitfest.postgresql.org/51/5432/>
I was not able to find you, please, register a community account and set 
yourself as an author for the patch.


Done.



--
Victor Yegorov



--
Alexander Borisov

Re: FileFallocate misbehaving on XFS

2024-12-11 Thread Andres Freund

Hi,

On 2024-12-10 16:33:06 -0500, Andres Freund wrote:
> Maybe. I think we would have gotten a lot more reports if it were common. I
> know of quite a few very busy installs using xfs.
>
> I think there must be some as-of-yet-unknown condition gating it. E.g. that
> the filesystem has been created a while ago and has some now-on-by-default
> options disabled.
>
>
> > > I think the source of this needs to be debugged further before we try to 
> > > apply
> > > workarounds in postgres.
> >
> > Why? It seems to me that this has to be a filesystem bug,
>
> Adding workarounds for half-understood problems tends to lead to code that we
> can't evolve in the future, as we a) don't understand b) can't reproduce the
> problem.
>
> Workarounds could also mask some bigger / worse issues.  We e.g. have blamed
> ext4 for a bunch of bugs that then turned out to be ours in the past. But we
> didn't look for a long time, because it was convenient to just blame ext4.

>
> > and we should almost certainly adopt one of these ideas from Michael Harris:
> >
> >  - Providing a way to configure PG not to use posix_fallocate at runtime
>
> I'm not strongly opposed to that. That's testable without access to an
> affected system.  I wouldn't want to automatically do that when detecting an
> affected system though, that'll make behaviour way less predictable.
>
>
> >  - In the case of posix_fallocate failing with ENOSPC, fall back to
> > FileZero (worst case that will fail as well, in which case we will
> > know that we really are out of space)
>
> I doubt that that's a good idea. What if fallocate failing is an indicator of
> a problem? What if you turn on AIO + DIO and suddenly get a much more
> fragmented file?

One thing that I think we should definitely do is to include more detail in
the error message. mdzeroextend()'s error messages don't include how many
blocks the relation was to be extended by. Neither mdextend() nor
mdzeroextend() include the offset at which the extension failed.

I'm not entirely sure about the phrasing though, we have a somewhat confusing
mix of blocks and bytes in messages.

Perhaps some of information should be in an errdetail, but I admit I'm a bit
hesitant about doing so for crucial details. I find that often only the
primary error message is available when debugging problems encountered by
others.

Maybe something like
  /* translator: second %s is a function name like FileAllocate() */
  could not extend file \"%s\" by %u blocks, from %llu to %llu bytes, using %s: 
%m
or
  could not extend file \"%s\" using %s by %u blocks, from its current size of 
%u blocks: %m
or
  could not extend file \"%s\" using %s by %u blocks/%llu bytes from its 
current size of %llu bytes: %m

If we want to use errdetail() judiciously, we could go for something like
  errmsg("could not extend file \"%s\" by %u blocks, using %s: %m", ...
  errdetail("Failed to extend file from %u blocks/%llu bytes to %u blocks / 
%llu bytes.", ...)

I think it might also be good - this is a slightly more complicated project -
to report the amount of free space the filesystem reports when we hit
ENOSPC. I have dealt with cases of the FS transiently filling up way too many
times, and it's always a pain to figure that out.

Greetings,

Andres Freund

Re: Proposal to add a new URL data type.

2024-12-11 Thread Robert Haas

On Fri, Dec 6, 2024 at 8:46 AM Daniel Gustafsson  wrote:
> A URL datatype is a *good idea* but one which I personally believe is best
> handled as an external extension.

+1. I don't think that it's an intrinsically bad idea to have a URL
data type in the core distribution, but no compelling reason for
putting it there has been proposed. Alexander's effort can equally
well live in github or pgxn or whatever and people who want to use it
still can. Also, it's pretty clear from reading this thread that
there's more than one way to do this and not everybody agrees with or
likes the particular thing Alexander has done. For an out-of-core
extension, that's perfectly fine, and even good: different people want
different things, and that's OK. For something that's part of
PostgreSQL itself, it's a big problem. There's no reason that we
should privilege one implementation over others, and we certainly do
not want the core project to have to maintain and ship multiple
versions of things like this.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: Track the amount of time waiting due to cost_delay

On Wed, Dec 11, 2024 at 07:00:50AM +, Bertrand Drouvot wrote:
> On Tue, Dec 10, 2024 at 11:55:41AM -0600, Nathan Bossart wrote:
>> I wonder if it makes sense to provide this value as an interval instead of
>> the number of milliseconds to make it more human-readable. 
> 
> Yeah we could do so, but that would mean:
> 
> 1. Write a dedicated "pg_stat_get_progress_info()" function for VACUUM. 
> Indeed,
> the current pg_stat_get_progress_info() is shared across multiple "commands" 
> and
> then we wouldn't be able to change it's output types in pg_proc.dat.
> 
> Or
> 
> 2. Make use of make_interval() in the pg_stat_progress_vacuum view creation.
> 
> I don't like 1. that much and given that that would be as simple as: 
> 
> "
> select make_interval(secs => time_delayed / 1000) from 
> pg_stat_progress_vacuum; 
> "
> 
> for an end user to display an interval, I'm not sure we should provide an 
> interval
> by default.
> 
> That said, I agree that milliseconds is not really human-readable and
> does not provide that much added value (except flexibility), so I'd vote for 
> 2.
> if you feel we should provide an interval by default.

That's roughly what I had in mind.

> Yeah I like it, thanks! Now, I'm wondering if we should not also add something
> like this:
> 
> "
> Since multiple workers can sleep simultaneously, the total sleep time can 
> exceed
> the actual duration of the vacuum operation.
> "
> 
> As that could be surprising to see this behavior in action.

I'd vote for leaving that out, if for no other reason than it can be
deduced from the rest of the description.

-- 
nathan

Re: CRC32C Parallel Computation Optimization on ARM

On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:
> I added a port to x86 and poked at it, with the intent to have an easy
> on-ramp to that at least accelerates computation of CRCs on FPIs.
> 
> The 0008 patch only worked on chunks of 1024 at a time. At that size,
> the presence of hardware carryless multiplication is not that
> important. I removed the hard-coded constants in favor of a lookup
> table, so now it can handle anything up to 8400 bytes in a single
> pass.
> 
> There are still some "taste" issues, but I like the overall shape here
> and how light it was. With more hardware support, we can go much lower
> than 1024 bytes, but that can be left for future work.

Nice.  I'm curious how this compares to both the existing implementations
and the proposed ones that require new intrinsics.  I like the idea of
avoiding new runtime and config checks, especially if the performance is
somewhat comparable for the most popular cases (i.e., dozens of bytes to a
few thousand bytes).

If we still want to add new intrinsics, would it be easy enough to add them
on top of this patch?  Or would it require further restructuring?

-- 
nathan

Re: Add support to TLS 1.3 cipher suites and curves lists

First of all, thank you all for working on this feature.

On Wed, Sep 25, 2024 at 10:51:05AM +0200, Peter Eisentraut wrote:
> On 18.09.24 22:48, Jacob Champion wrote:
>> > +#ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'  # allowed TLSv1.2 ciphers
>> > +#ssl_cipher_suites = ''# allowed TLSv1.3 cipher suites, blank for 
>> > default
>> After marinating on this a bit... I think the naming may result in
>> some "who's on first" miscommunications in forums and on the list. "I
>> set the SSL ciphers to , but it says there are no valid
>> ciphers available!" Should we put TLS 1.3 into the new GUC name
>> somehow?
> 
> Yeah, I think just
> 
> ssl_ciphers =
> ssl_ciphers_tlsv13 =
> 
> would be clear enough.  Just using "ciphers" vs. "cipher suites" would not
> be.

Sorry for chiming in so late here, but I was a little surprised to see the
TLS version in the GUC name.  ISTM this would require us to create a new
GUC for every new TLS version, or explain that ssl_tls13_ciphers isn't just
for 1.3.  Perhaps neither of those things are too terrible, but I felt it
was worth bringing up.

-- 
nathan

Re: connection establishment versus parallel workers

2024-12-11 Thread Thomas Munro

On Thu, Dec 12, 2024 at 9:43 AM Nathan Bossart  wrote:
> My team recently received a report about connection establishment times
> increasing substantially from v16 onwards.  Upon further investigation,
> this seems to have something to do with commit 7389aad (which moved a lot
> of postmaster code out of signal handlers) in conjunction with workloads
> that generate many parallel workers.  I've attached a set of reproduction
> steps.  The issue seems to be worst on larger machines (e.g., r8g.48xlarge,
> r5.24xlarge) when max_parallel_workers/max_worker_process is set very high
> (>= 48).

Interesting.

> Our theory is that commit 7389aad (and follow-ups like commit 239b175) made
> parallel worker processing much more responsive to the point of contending
> with incoming connections, and that before this change, the kernel balanced
> the execution of the signal handlers and ServerLoop() to prevent this.  I
> don't have a concrete proposal yet, but I thought it was still worth
> starting a discussion.  TBH I'm not sure we really need to do anything
> since this arguably comes down to a trade-off between connection and worker
> responsiveness.

One factor is:

 * Check if the latch is set already. If so, leave the loop
 * immediately, avoid blocking again. We don't attempt to report any
 * other events that might also be satisfied.

If we had a way to say "no really, gimme everything you have", I guess
that'd help.  Which reminds me a bit of commit 04a09ee9 (Windows-only
problem, making sure that we handle multiple sockets fairly instead of
reporting only the lowest priority one); I think it'd work the same
way: if you already saw a latch, you'd use a zero timeout for the
system call.

Re: connection establishment versus parallel workers

2024-12-11 Thread Thomas Munro

On Thu, Dec 12, 2024 at 11:36 AM Thomas Munro  wrote:
> ... instead of
> reporting only the lowest priority one)

s/priority/position/

Re: Serverside SNI support in libpq

> On 11 Dec 2024, at 01:34, Michael Paquier  wrote:
> 
> On Wed, Dec 04, 2024 at 02:44:18PM +0100, Daniel Gustafsson wrote:
>> No worries, I know you have a big path on your plate right now.  The attached
>> v3 fixes a small buglet in the tests and adds silly reload testing to try and
>> stress out any issues.
> 
> Looks like this still fails quite heavily in the CI..  You may want to
> look at that.

Interestingly enough the CFBot hasn't picked up that there are new version
posted and the buildfailure is from the initial patch in the thread, which no
longer applies (as the CFBot righly points out).  I'll try posting another
version later today to see if that gets it unstuck.

--
Daniel Gustafsson

Re: Parallel heap vacuum

2024-12-11 Thread Masahiko Sawada

On Mon, Dec 9, 2024 at 2:11 PM Tomas Vondra  wrote:
>
> Hi,
>
> Thanks for working on this. I took a quick look at this today, to do
> some basic review. I plan to do a bunch of testing, but that's mostly to
> get a better idea of what kind of improvements to expect - the initial
> results look quite nice and sensible.

Thank you for reviewing the patch!

> A couple basic comments:
>
> 1) I really like the idea of introducing "compute_workers" callback to
> the heap AM interface. I faced a similar issue with calculating workers
> for index builds, because right now plan_create_index_workers is doing
> that the logic works for btree, but really not brin etc. It didn't occur
> to me we might make this part of the index AM ...

Thanks.

>
> 2) I find it a bit weird vacuumlazy.c needs to include optimizer/paths.h
> because it really has nothing to do with planning / paths. I realize it
> needs the min_parallel_table_scan_size, but it doesn't seem right. I
> guess it's a sign this bit of code (calculating parallel workers based
> on log of relation size) should in some "shared" location.

True. The same is actually true also for vacuumparallel.c. It includes
optimizer/paths.h to use min_parallel_index_scan_size.

>
> 3) The difference in naming ParallelVacuumState vs. PHVState is a bit
> weird. I suggest ParallelIndexVacuumState and ParallelHeapVacuumState to
> make it consistent and clear.

With the patch, since ParallelVacuumState is no longer dedicated for
parallel index vacuuming we cannot rename them in this way. Both
parallel table scanning/vacuuming and parallel index vacuuming can use
the same ParallelVacuumState instance. The heap-specific necessary
data for parallel heap scanning and vacuuming are stored in PHVState.

>
> 4) I think it would be good to have some sort of README explaining how
> the parallel heap vacuum works, i.e. how it's driven by FSM. Took me a
> while to realize how the workers coordinate which blocks to scan.

+1. I will add README in the next version patch.

>
> 5) Wouldn't it be better to introduce the scan_stats (grouping some of
> the fields in a separate patch)? Seems entirely independent from the
> parallel part, so doing it separately would make it easier to review.
> Also, maybe reference the fields through scan_stats->x, instead of
> through vacrel->scan_stats->x, when there's the pointer.

Agreed.

> 6) Is it a good idea to move NewRelfrozenXID/... to the scan_stats?
> AFAIK it's not a statistic, it's actually a parameter affecting the
> decisions, right?

Right. It would be better to move them to a separate struct or somewhere.

>
> 7) I find it a bit strange that heap_vac_scan_next_block() needs to
> check if it's a parallel scan, and redirect to the parallel callback. I
> mean, shouldn't the caller know which callback to invoke? Why should the
> serial callback care about this?

do_lazy_scan_heap(), the sole caller of heap_vac_scan_next_block(), is
called in serial vacuum and parallel vacuum cases. I wanted to make
heap_vac_scan_next_block() workable in both cases. I think it also
makes sense to have do_lazy_scan_heap() calls either function
depending on parallel scan being enabled.

>
> 8) It's not clear to me why do_lazy_scan_heap() needs to "advertise" the
> current block. Can you explain?

The workers' current block numbers are used to calculate the minimum
block number where we've scanned so far. In serial scan case, we
vacuum FSM of the particular block range for every
VACUUM_FSM_EVERY_PAGES pages . On the other hand, in parallel scan
case, it doesn't make sense to vacuum FSM in that way because we might
not have processed some blocks in the block range. So the idea is to
calculate the minimum block number where we've scanned so far and
vacuum FSM of the range of consecutive already-scanned blocks.

>
> 9) I'm a bit confused why the code checks IsParallelWorker() in so many
> places. Doesn't that mean the leader can't participate in the vacuum
> like a regular worker?

I used '!isParallelWorker()' for some jobs that should be done only by
the leader process. For example, checking failsafe mode, vacuuming FSM
etc.

>
> 10) I'm not quite sure I understand the comments at the end of
> do_lazy_scan_heap - it says "do a cycle of vacuuming" but I guess that
> means "index vacuuming", right?

It means both index vacuuming and heap vacuuming.

> And then it says "pause without invoking
> index and heap vacuuming" but isn't the whole point of this block to do
> that cleanup so that the TidStore can be discarded? Maybe I just don't
> understand how the work is divided between the leader and workers ...

The comment needs to be updated. But what the patch does is that when
the memory usage of the shared TidStore reaches the limit, worker
processes exit after updating the shared statistics, and then the
leader invokes (parallel) index vacuuming and parallel heap vacuuming.
Since the different number workers could be used for parallel heap
scan, parallel index vacuuming, and

Re: Logical replication timeout

2024-12-11 Thread RECHTÉ Marc

This how  to reproduce the problem.

Session 1:

psql -c "CREATE TABLE test (i int)" -c "INSERT INTO test SELECT 
generate_series(1, 2_000_000)"

Session 2:

pg_recvlogical  -d postgres --slot=test --create-slot
pg_recvlogical -d postgres --slot=test --start -f -


Session 3:

cd data/pg_repslots
watch 'ls test | wc -l'


Session 1:

date
time psql -c "BEGIN" -c "
DO LANGUAGE plpgsql
\$\$
DECLARE
  cur CURSOR FOR SELECT * FROM test FOR UPDATE;
  rec record;
BEGIN
  FOR rec IN cur LOOP
BEGIN  
   UPDATE test SET i = i + 1 WHERE CURRENT OF cur;
EXCEPTION
  WHEN no_data_found THEN
RAISE NOTICE 'no data found exception';
END;
  END LOOP;
END;
\$\$
" -c "ROLLBACK"

date
mer. 11 déc. 2024 08:59:03 CET
BEGIN
DO
ROLLBACK

real0m17,071s
user0m0,003s
sys 0m0,000s
mer. 11 déc. 2024 08:59:21 CET


Session 3: Watch session

Count increases up to

Wed Dec 11 09:00:02 2024
1434930

Then decreases down to 1

Wed Dec 11 09:03:17 2024
1

Session 2:

Appears last (after spill files deleted)

BEGIN 12874409
COMMIT 12874409


Conclusion:

- The exception block is responsible for generating subtransactions
- Although the transaction lasted 17s, one can see that the decoding was a bit 
late (40 seconds), but
- spent an extra 200s to delete the spill files !


On Wed, 6 Nov 2024 at 13:07, RECHTÉ Marc  wrote:
>
> Hello,
>
> For some unknown reason (probably a very big transaction at the source), we 
> experienced a logical decoding breakdown,
> due to a timeout from the subscriber side (either wal_receiver_timeout or 
> connexion drop by network equipment due to inactivity).
>
> The problem is, that due to that failure, the wal_receiver process stops. 
> When the wal_sender is ready to send some data, it finds the connexion broken 
> and exits.
> A new wal_sender process is created that restarts from the beginning (restart 
> LSN). This is an endless loop.
>
> Checking the network connexion between wal_sender and wal_receiver, we found 
> that no traffic occurs for hours.
>
> We first increased wal_receiver_timeout up to 12h and still got a 
> disconnection on the receiver party:
>
> 2024-10-17 16:31:58.645 GMT [1356203:2] user=,db=,app=,client= ERROR:  
> terminating logical replication worker due to timeout
> 2024-10-17 16:31:58.648 GMT [849296:212] user=,db=,app=,client= LOG:  
> background worker "logical replication worker" (PID 1356203) exited with exit 
> code 1
>
> Then put this parameter to 0, but got then disconnected by the network (note 
> the slight difference in message):
>
> 2024-10-21 11:45:42.867 GMT [1697787:2] user=,db=,app=,client= ERROR:  could 
> not receive data from WAL stream: could not receive data from server: 
> Connection timed out
> 2024-10-21 11:45:42.869 GMT [849296:40860] user=,db=,app=,client= LOG:  
> background worker "logical replication worker" (PID 1697787) exited with exit 
> code 1
>
> The message is generated in libpqrcv_receive function 
> (replication/libpqwalreceiver/libpqwalreceiver.c) which calls 
> pqsecure_raw_read (interfaces/libpq/fe-secure.c)
>
> The last message "Connection timed out" is the errno translation from the 
> recv system function:
>
> ETIMEDOUT   Connection timed out (POSIX.1-2001)
>
> When those timeout occurred, the sender was still busy deleting files from 
> data/pg_replslot/bdcpb21_sene, accumulating more than 6 millions small 
> ".spill" files.
> It seems this very long pause is at cleanup stage were PG is blindly trying 
> to delete those files.
>
> strace on wal sender show tons of calls like:
>
> unlink("pg_replslot/bdcpb21_sene/xid-2 721 821 917-lsn-439C-0.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-100.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-200.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-300.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-400.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
> unlink("pg_replslot/bdcpb21_sene/xid-2721821917-lsn-439C-500.spill") = -1 
> ENOENT (Aucun fichier ou dossier de ce type)
>
> This occurs in ReorderBufferRestoreCleanup 
> (backend/replication/logical/reorderbuffer.c).
> The call stack presumes this may probably occur in DecodeCommit or 
> DecodeAbort (backend/replication/logical/decode.c):
>
> unlink("pg_replslot/bdcpb21_sene/xid-2730444214-lsn-43A6-8800.spill") = 
> -1 ENOENT (Aucun fichier ou dossier de ce type)
>  > /usr/lib64/libc-2.17.so(unlink+0x7) [0xf12e7]
>  > /usr/pgsql-15/bin/postgres(ReorderBufferRestoreCleanup.isra.17+0x5d) 
> [0x769e3d]
>  > /usr/pgsql-15/bin/postgres(ReorderBufferCleanupTXN+0x166) [0x76aec6] <=== 
> replication/logical/reorderbuff.c:1480 (mais cette fonction (static) n'est 
> utiliée qu'au sein de ce module

Re: Removing unneeded self joins


On 9/12/2024 13:03, Alexander Korotkov wrote:

remove_self_join_rel().  Should we better add comments to PlannerInfo
and other relevant structures saying: if you're going to modify this,
consider how that affects remove_self_join_rel()?

Any thoughts?
As I see, it is quite typical to keep two parts of the code in sync by 
mentioning them in comments (see reparameterisation stuff, for example). 
This would reduce the code needed to implement the feature.


--
regards, Andrei Lepikhov

Re: Memory leak in pg_logical_slot_{get,peek}_changes

2024-12-11 Thread vignesh C

On Wed, 11 Dec 2024 at 11:39, Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, December 11, 2024 12:28 PM vignesh C  
> wrote:
> > Hi,
> >
> > I'm starting a new thread for one of the issues reported by Sawada-san at 
> > [1].
> >
> > This is a memory leak on CacheMemoryContext when using pgoutput via SQL
> > APIs:
> > /* Map must live as long as the session does. */
> > oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> >
> > entry->attrmap = build_attrmap_by_name_if_req(indesc, outdesc, false);
> >
> > MemoryContextSwitchTo(oldctx);
> > RelationClose(ancestor);
> >
> > entry->attrmap is pfree'd only when validating the RelationSyncEntry
> > so remains even after logical decoding API calls.
> >
> > This issue can be reproduced with the following test:
> > -- Create table
> > create table t1( c1 int, c2 int, c3 int) PARTITION BY LIST (c1);
> > create table t1_1( c2 int, c1 int, c3 int);
> > ALTER TABLE t1 ATTACH PARTITION t1_1 FOR VALUES IN (0, 1, 2, 3);
> >
> > -- Create publication
> > create publication pub1 FOR TABLE t1, t1_1 WITH
> > (publish_via_partition_root = true);
> >
> > -- Create replication slot
> > SELECT 'init' FROM pg_create_logical_replication_slot('test', 'pgoutput');
> >
> > -- Run the below steps between Test start and Test end many times to
> > see the memory usage getting increased
> > -- Test start
> > insert into t1 values( 1);
> > SELECT count(*) FROM pg_logical_slot_get_binary_changes('test', NULL,
> > NULL, 'proto_version', '4', 'publication_names', 'pub1');
> >
> > -- Memory increases after each invalidation and insert
> > SELECT * FROM pg_backend_memory_contexts where name =
> > 'CacheMemoryContext';
> > -- Test end
> >
> > The attached patch resolves a memory leak by ensuring that the
> > attribute map is properly freed during plugin shutdown. This process
> > is triggered by the SQL API when the decoding context is being
> > released. Additionally, I am conducting a review to identify and
> > address any similar memory leaks that may exist elsewhere in the code.
>
> Thanks for reporting the issue and share the fix.
>
> I am not sure if freeing them in shutdown callback is safe, because shutdown
> callback Is not invoked in case of ERRORs. I think we'd better allocate them
> under cachectx in the beginning to avoid freeing them explicitly.

I initially considered addressing this issue in a way similar to your
suggestion while fixing it, but later decided to make the change in
pgoutput_shutdown, following the approach used for RelationSyncCache.
This was because RelationSyncCache relies on CacheMemoryContext, and
attrmap is a member of RelationSyncCache entry. Now that we're
considering attrmap in the context of cachectx, do you think we should
apply cachectx to RelationSyncCache as well to solve the similar issue
that can occur with RelationSyncCache.

Regards,
Vignesh

Re: [PATCH] SVE popcount support

2024-12-11 Thread chiranmoy.bhattacha...@fujitsu.com

Thank you for the suggestion; we have removed the `xsave` flag.

We have used the following command for benchmarking:
time ./build_fj/bin/psql pop_db -c "select drive_popcount(1000, 16);"

We ran it 20 times and took the average to flatten any CPU fluctuations. The 
results observed on `m7g.4xlarge`are in the attached Excel file.

We have also updated the condition for buffer alignment, skipping the alignment 
process if the buffer is already aligned. This seems to have improved the 
performance by a few milliseconds because the input buffer provided by 
`drive_popcount` is already aligned. PFA for the updated patch file.

Thanks,
Chiranmoy

From: Malladi, Rama 
Sent: Monday, December 9, 2024 10:06 PM
To: Susmitha, Devanga ; Nathan Bossart 

Cc: Kirill Reshke ; pgsql-hackers 
; Bhattacharya, Chiranmoy 
; M A, Rajat ; 
Hajela, Ragesh 
Subject: Re: [PATCH] SVE popcount support

On 12/9/24 12:21 AM, 
devanga.susmi...@fujitsu.com wrote:
Hello,

We are sharing our patch for pg_popcount with SVE support as a contribution 
from our side in this thread. We hope this contribution will help in exploring 
and refining the popcount implementation further.
Our patch uses the existing infrastructure, i.e. the 
"choose_popcount_functions" method, to determine the correct popcount 
implementation based on the architecture, thereby requiring fewer code changes. 
The patch also includes implementations for popcount and popcount masked.
We can reference both solutions and work together toward achieving the most 
efficient and effective implementation for PostgreSQL.

Thanks for the patch and it looks good. I will review the full patch in the 
next couple of days. One observation was that the patch has `xsave` flags 
added. This isn't needed.

`pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt + 
cflags_popcnt_arm, 'xsave': cflags_xsave}`

Algorithm Overview:
1. For larger inputs, align the buffers to avoid double loads. For smaller 
inputs alignment is not necessary and might even degrade the performance.
2. Process the aligned buffer chunk by chunk till the last incomplete chunk.
3. Process the last incomplete chunk.
Our setup:
Machine: AWS EC2 c7g.8xlarge - 32vcpu, 64gb RAM
OS : Ubuntu 22.04.5 LTS
GCC: 11.4

Benchmark and Result:
We have used PostgreSQL community recommended popcount-test-module[0] for 
benchmarking and observed a speed-up of more than 4x for larger buffers. Even 
for smaller inputs of size 8 and 16 bytes there aren't any performance 
degradations observed.
Looking forward to your thoughts!

I tested the patch and here attached is the performance I see on a 
`c7g.xlarge`. The perf data doesn't quite match to what you observe (especially 
for 256B). In the chart, I have comparison of baseline, AWS SVE (what I had 
implemented) and Fujitsu SVE popcount implementations. Can you confirm the 
command-line you had used for the benchmark run?

I had used the below command-line:

`sudo su postgres -c "/usr/local/pgsql/bin/psql -c 'EXPLAIN ANALYZE SELECT 
drive_popcount(10, 16);'"`

From: Nathan Bossart 
Sent: Wednesday, December 4, 2024 21:37
To: Malladi, Rama 
Cc: Kirill Reshke ; 
pgsql-hackers 

Subject: Re: [PATCH] SVE popcount support

On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:
> Thank you, Kirill, for the review and the feedback. Please find inline my
> reply and an updated patch.

Thanks for the updated patch.  I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

> +# Check for ARMv8 SVE popcount intrinsics
> +#
> +CFLAGS_POPCNT=""
> +PG_POPCNT_OBJS=""
> +PGAC_SVE_POPCNT_INTRINSICS([])
> +if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
> +  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
> +fi
> +if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
> +  PG_POPCNT_OBJS="pg_popcount_sve.o"
> +  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE 
> popcount instructions with a runtime check.])
> +fi
> +AC_SUBST(CFLAGS_POPCNT)
> +AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27).  The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code.  IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version of
Postgres, which will take some time for significant adoption,

RE: Conflict detection for update_deleted in logical replication

2024-12-11 Thread Hayato Kuroda (Fujitsu)

Dear hackers,

I did some benchmarks with the patch. More detail, a pub-sub replication system
was built and TPS was measured on the subscriber. Results were shown that the
performance can be degraded if the wal_receiver_status_interval is long.
This is expected behavior because the patch retains more dead tuples on the
subscriber side.
Also, we've considered a new mechanism which dynamically tunes the period of 
status
request, and confirmed that it reduced the regression. This is described the 
latter
part.

Below part contained the detailed report.

## Motivation - why the benchmark is needed

V15 patch set introduces a new replication slot on the subscriber side to retain
needed tuples for the update_deleted detection.
However, this may affect the performance of query executions on the subscriber 
because
1) tuples to be scaned will be increased and 2) HOT update cannot be worked.
The second issue comes from the fact that HOT update can work only when both 
tuples
can be located on the same page.

Based on above reasons I ran benchmark tests for the subscriber. The variable 
of the
measurement is the wal_receiver_status_interval, which controls the duration of 
status request.

## Used source code

HEAD was 962da900, and applied v15 patch set atop it.

## Environment

RHEL 7 machine which has 755GB memory, 4 physical CPUs and 120 logical 
processors.

## Workload

1. Constructed a pub-sub replication system.
   Parameters for both instances were:

   share_buffers = 30GB
   min_wal_size = 10GB
   max_wal_size = 20GB
   autovacuum = false
   track_commit_timestamp = on (only for subscriber)
  
2. Ran pgbench with initialize mode. The scale factor was set to 100.
3. Ran pgbench with 60 clients for the subscriber. The duration was 120s,
   and the measurement was repeated 5 times.

Attached script can automate above steps. You can specify the source type in the
measure.sh and run it.

## Comparison

The performance testing was done for HEAD and patched source code.
In case of patched, "detect_update_deleted" parameter was set to on. Also, the
parameter "wal_receiver_status_interval" was varied to 1s, 10s, and 100s to
check the effect.

Appendix table shows results [1]. The regression becomes larger based on the 
wal_receiver_status_interval.
TPS regressions are almost 5%(interval=1s) -> 25%(intervals=10s) -> 55% 
(intervals=55%).

Attached png file visualize the result: each bar shows median.

## Analysis

I attached to the backend via perf and found that heapam_index_fetch_tuple()
consumed much CPU time ony in case of patched [2]. Also, I checked 
pg_stat_all_tables
view and found that HOT update rarely happened only in the patched case [3].
This means that whether backend could do HOT update is the dominant.

When the detect_update_deleted = on, the additional slot is defined on the 
subscriber 
ide and it is updated based on the activity; The frequency is determined by the
wal_receiver_status_intervals. In intervals=100s case, it is relatively larger
for the workload so that some dead tuples remained, this makes query processing 
slower.

This result means that users may have to tune consider the interval period based
on their workload. However, it is difficult to predict the appropriate value.

## Experiment - dynamic period tuning

Based on above, I and Hou discussed off-list and implemented new mechanism which
tunes the duration between status request dynamically. The basic idea is similar
with what slotsync worker does. The interval of requests is initially 100ms,
and becomes twice when if there are no XID assignment since the last 
advancement.
The maxium value is either of wal_receiver_status_interval or 3min.

Benchmark results with this are shown in [4]. Here wal_receiver_status_interval
is not changed, so we can compare with the HEAD and interval=10s case in [1] - 
59536 vs 59597.
The regression is less than 1%.

The idea has already been included in v16-0002, please refer it.

## Experiment - shorter interval

Just in case - I did an extreme case that wal_receiver_status_interval is quite 
short - 10ms.
To make interval shorter I implemented an attached patch for both cases. 
Results are shown [5].
The regression is not observed or even better (I think this is caused by the 
randomness).

This experiment also shows the result that the regression is happened due to 
the dead tuple.

## Appendix [1] - result table

Each cells show transaction per seconds of the run.

patched
# run   interval=1s intervals=10s   intervals=100s
1   55876   45288   26956
2   56086   45336   26799
3   56121   45129   26753
4   56310   45169   26542
5   55389   45071   26735
median  56086   45169   26753

HEAD
# run   interval=1s intervals=10s   intervals=100s
1   59096   59343   59341
2   59671   59413   59281
3   59131   59597   58966
4   59239   59693   59518
5   59165   59631   59487
median  59165   59597   59341


## Appendix [2] - perf analysis

patched:

Re: Proposals for EXPLAIN: rename ANALYZE to EXECUTE and extend VERBOSE

On Wed, 11 Dec 2024 at 01:22, Guillaume Lelarge  wrote:
> Le mar. 10 déc. 2024 à 03:57, David Rowley  a écrit :
>> I was very close to pushing 0001 today, but will hold off until
>> tomorrow to see if anyone has final comments.
>>
> No more comments. I'm fine with both patches.

Thanks for looking.

I've pushed the main patch. I'm not planning on pushing the
auto_explain.log_buffers default change unless there's a bit more
discussion about it.

David

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-11 Thread Shubham Khanna

On Wed, Dec 11, 2024 at 2:08 PM vignesh C  wrote:
>
> On Wed, 11 Dec 2024 at 11:21, Shubham Khanna
>  wrote:
> >
> > On Wed, Dec 11, 2024 at 4:21 AM Peter Smith  wrote:
> >
> > I have fixed the given comments. The attached patch contains the
> > suggested changes.
>
> Since all the subscriptions are created based on the two_phase option
> provided, there is no need to store this for each database:
> @@ -53,6 +54,7 @@ struct LogicalRepInfo
> char   *pubname;/* publication name */
> char   *subname;/* subscription name */
> char   *replslotname;   /* replication slot name */
> +   booltwo_phase;  /* two-phase enabled
> for the subscription */
>
> dbinfo[i].dbname = cell->val;
> +   dbinfo[i].two_phase = opt->two_phase;
> if (num_pubs > 0)
>
> How about we handle something like in the attached changes.
>

Thank you for pointing this out and for suggesting the changes. I
agree with your approach.
Also, I found a mistake in getopt_long and fixed it in this version of
the patch.
The attached patch contains the suggested changes.

Thanks and regards,
Shubham Khanna.


v4-0001-Add-support-for-two-phase-commit-in-pg_createsubs.patch
Description: Binary data

Re: pg_dump memory leak of 400 bytes

> On 11 Dec 2024, at 10:59, Oleg Tselebrovskiy  
> wrote:

> When I was looking through static analyzer output I've found a memory leak in 
> pg_dump

Thanks for the report, although 400 bytes won't break the bank in the grand
scheme of things considering how much memory pg_dump use there is little point
in not freeing it.

> Patches for both variants are attached. I also propose changing palloc to 
> pg_malloc
> since it is the only place in pg_dump where palloc is used instead of 
> pg_malloc

Freeing it when not needed anymore seems appropriate, I'll go make that happen.

--
Daniel Gustafsson

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-11 Thread vignesh C

On Wed, 11 Dec 2024 at 11:21, Shubham Khanna
 wrote:
>
> On Wed, Dec 11, 2024 at 4:21 AM Peter Smith  wrote:
>
> I have fixed the given comments. The attached patch contains the
> suggested changes.

Since all the subscriptions are created based on the two_phase option
provided, there is no need to store this for each database:
@@ -53,6 +54,7 @@ struct LogicalRepInfo
char   *pubname;/* publication name */
char   *subname;/* subscription name */
char   *replslotname;   /* replication slot name */
+   booltwo_phase;  /* two-phase enabled
for the subscription */

dbinfo[i].dbname = cell->val;
+   dbinfo[i].two_phase = opt->two_phase;
if (num_pubs > 0)

How about we handle something like in the attached changes.

Regards,
Vignesh
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c
index 01a203983d..d141b13bd6 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -54,7 +54,6 @@ struct LogicalRepInfo
 	char	   *pubname;		/* publication name */
 	char	   *subname;		/* subscription name */
 	char	   *replslotname;	/* replication slot name */
-	bool		two_phase;		/* two-phase enabled for the subscription */
 
 	bool		made_replslot;	/* replication slot was created */
 	bool		made_publication;	/* publication was created */
@@ -81,7 +80,7 @@ static void check_publisher(const struct LogicalRepInfo *dbinfo);
 static char *setup_publisher(struct LogicalRepInfo *dbinfo);
 static void check_subscriber(const struct LogicalRepInfo *dbinfo);
 static void setup_subscriber(struct LogicalRepInfo *dbinfo,
-			 const char *consistent_lsn);
+			 const char *consistent_lsn, bool two_phase);
 static void setup_recovery(const struct LogicalRepInfo *dbinfo, const char *datadir,
 		   const char *lsn);
 static void drop_primary_replication_slot(struct LogicalRepInfo *dbinfo,
@@ -100,7 +99,9 @@ static void wait_for_end_recovery(const char *conninfo,
   const struct CreateSubscriberOptions *opt);
 static void create_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
 static void drop_publication(PGconn *conn, struct LogicalRepInfo *dbinfo);
-static void create_subscription(PGconn *conn, const struct LogicalRepInfo *dbinfo);
+static void create_subscription(PGconn *conn,
+const struct LogicalRepInfo *dbinfo,
+bool two_phase);
 static void set_replication_progress(PGconn *conn, const struct LogicalRepInfo *dbinfo,
 	 const char *lsn);
 static void enable_subscription(PGconn *conn, const struct LogicalRepInfo *dbinfo);
@@ -459,7 +460,6 @@ store_pub_sub_info(const struct CreateSubscriberOptions *opt,
 		conninfo = concat_conninfo_dbname(pub_base_conninfo, cell->val);
 		dbinfo[i].pubconninfo = conninfo;
 		dbinfo[i].dbname = cell->val;
-		dbinfo[i].two_phase = opt->two_phase;
 		if (num_pubs > 0)
 			dbinfo[i].pubname = pubcell->val;
 		else
@@ -486,7 +486,7 @@ store_pub_sub_info(const struct CreateSubscriberOptions *opt,
 		pg_log_debug("subscriber(%d): subscription: %s ; connection string: %s, two_phase: %s", i,
 	 dbinfo[i].subname ? dbinfo[i].subname : "(auto)",
 	 dbinfo[i].subconninfo,
-	 dbinfo[i].two_phase ? "true" : "false");
+	 opt->two_phase ? "true" : "false");
 
 		if (num_pubs > 0)
 			pubcell = pubcell->next;
@@ -1143,7 +1143,8 @@ check_and_drop_existing_subscriptions(PGconn *conn,
  * replication setup.
  */
 static void
-setup_subscriber(struct LogicalRepInfo *dbinfo, const char *consistent_lsn)
+setup_subscriber(struct LogicalRepInfo *dbinfo, const char *consistent_lsn,
+ bool two_phase)
 {
 	for (int i = 0; i < num_dbs; i++)
 	{
@@ -1167,7 +1168,7 @@ setup_subscriber(struct LogicalRepInfo *dbinfo, const char *consistent_lsn)
 		 */
 		drop_publication(conn, &dbinfo[i]);
 
-		create_subscription(conn, &dbinfo[i]);
+		create_subscription(conn, &dbinfo[i], two_phase);
 
 		/* Set the replication progress to the correct LSN */
 		set_replication_progress(conn, &dbinfo[i], consistent_lsn);
@@ -1682,7 +1683,8 @@ drop_publication(PGconn *conn, struct LogicalRepInfo *dbinfo)
  * initial location.
  */
 static void
-create_subscription(PGconn *conn, const struct LogicalRepInfo *dbinfo)
+create_subscription(PGconn *conn, const struct LogicalRepInfo *dbinfo,
+	bool two_phase)
 {
 	PQExpBuffer str = createPQExpBuffer();
 	PGresult   *res;
@@ -1706,7 +1708,7 @@ create_subscription(PGconn *conn, const struct LogicalRepInfo *dbinfo)
 	  "WITH (create_slot = false, enabled = false, "
 	  "slot_name = %s, copy_data = false, two_phase = %s)",
 	  subname_esc, pubconninfo_esc, pubname_esc, replslotname_esc,
-	  dbinfo->two_phase ? "true" : "false");
+	  two_phase ? "true" : "false");
 
 	pg_free(pubname_esc);
 	pg_free(subname_esc);
@@ -2240,7 +2242,7 @@ main(int argc, char **argv)
 	 * point to the LSN reported

Add Postgres module info


Hi,

I would like to propose the module_info structure, which aims to let 
extension maintainers sew some data into the binary file. Being included 
in the module code, this information remains unchanged and is available 
for reading by a backend.
As I see, this question was already debated during the introduction of 
PG_MODULE_MAGIC [1], and developers didn't add such data because the 
practical case wasn't obvious. Right now, when multiple extensions are 
maintained and used in installations constantly, we have enough 
information to continue this discussion.
The idea was initially born with the support of the extension, which had 
a stable UI and frequently changing library code. To answer customer 
requests, it was necessary to know the specific version of the code to 
reproduce the situation in a test environment. That required introducing 
a file naming rule and a specific exported variable into the module 
code. However, this is not a sufficient guarantee of version 
determination and complicates the technical process when supporting many 
extensions obtained from different sources.
Another example is a library without an installation script at all - see 
auto_explain. Without a 'CREATE EXTENSION' call, we don't have an option 
to find out the library's version.
It would be much easier if the Postgres catalogue contained a function, 
for example, module_info(module_name), which would allow you to 
determine the file's full path and name containing the desired module in 
the psql console and its version.
On the other hand, the Omnigres project (author Yurii Rashkovski) also 
came up with the idea of module versioning, although it does this 
externally out of the Postgres core. When designing this code, I also 
adopted ideas from this repository.
So, let me propose a patch that introduces this tiny feature: the 
maintainer can add the PG_MODULE_INFO macro to the library code, and 
Postgres reads it on the module's load.


There is a question of how much information makes sense to add to the 
module. For now, each time I prepare extensions to release, I have to 
add the extension name (to avoid issues with file naming/renaming) and 
the version. Format of the version storage? Do we need a separate minor 
version number? It is a subject to debate.


[1] 
https://www.postgresql.org/message-id/flat/20060507211705.GB3808%40svana.org


--
regards, Andrei Lepikhov
From 56d4a0dadd32a601d52ed59b38935a36a175f635 Mon Sep 17 00:00:00 2001
From: "Andrei V. Lepikhov" 
Date: Tue, 19 Nov 2024 18:45:36 +0700
Subject: [PATCH v0] Introduce MODULE_INFO macro.

This optional macro allows dynamically loaded shared libraries (modules)
a standard way to incorporate version and name data. The introduced catalogue
routine module_info can be used to find this module by name and check
the version. It makes users independent from file naming conventions.

With a growing number of Postgres core hooks and the introduction of named DSM
segments, the number of modules that don't need to be loaded on startup may
grow fast. Moreover, in many cases related to query tree transformation or
extra path recommendation, such modules might not need database objects except
GUCs - see auto_explain as an example. That means they don't need to execute
the 'CREATE EXTENSION' statement at all and don't have a record in
the pg_extension table. Such a trick provides much flexibility, including
an online upgrade and may become widespread.

In addition, it is also convenient in support to be sure that the installation
(or at least the backend) includes a specific version of the module. Even if
a module has an installation script, it is not rare that it provides
an implementation for a range of UI routine versions. It makes sense to ensure
which specific version of the code is used.

Discussions [1,2] already mentioned module-info stuff, but at that time,
extensibility techniques and extension popularity were low, and it wasn't
necessary to provide that data.

[1] https://www.postgresql.org/message-id/flat/20060507211705.GB3808%40svana.org
[2] 
https://www.postgresql.org/message-id/flat/20051106162658.34c31d57%40thunder.logicalchaos.org
---
 contrib/auto_explain/auto_explain.c   |  2 +
 contrib/auto_explain/t/001_auto_explain.pl|  9 +++
 contrib/pg_prewarm/t/001_basic.pl | 10 ++-
 .../pg_stat_statements/pg_stat_statements.c   |  1 +
 src/backend/utils/fmgr/dfmgr.c| 73 ++-
 src/include/catalog/pg_proc.dat   |  6 ++
 src/include/fmgr.h| 18 +
 7 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c 
b/contrib/auto_explain/auto_explain.c
index 623a674f99..f4110eb1aa 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -22,6 +22,8 @@
 
 PG_MODULE_MAGIC;
 
+PG_MODULE_INFO("auto_explain", 100);
+
 /* GUC variables */
 static int auto_explain_log_min_duration = -1; /

Re: Expand applicability of aggregate's sortop optimization


On 11/12/2024 07:22, Michael Paquier wrote:

On Fri, Nov 08, 2024 at 01:05:04PM +0700, Andrei Lepikhov wrote:

Rebase onto current master and slight improvement.


Your patch is failing in the CI, please rebase.  I have it to the next
CF for now, waiting on author.
Thanks! I have observed such issues before, but their complexity is out 
of my league yet. Why can a prosupport function cause deadlock in a 
simple query? That's the question to discover.


--
regards, Andrei Lepikhov

RE: Memory leak in pg_logical_slot_{get,peek}_changes

2024-12-11 Thread Zhijie Hou (Fujitsu)

On Wednesday, December 11, 2024 5:11 PM vignesh C  wrote:
> 
> On Wed, 11 Dec 2024 at 11:39, Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Wednesday, December 11, 2024 12:28 PM vignesh C
>  wrote:
> > > The attached patch resolves a memory leak by ensuring that the
> > > attribute map is properly freed during plugin shutdown. This process
> > > is triggered by the SQL API when the decoding context is being
> > > released. Additionally, I am conducting a review to identify and
> > > address any similar memory leaks that may exist elsewhere in the code.
> >
> > Thanks for reporting the issue and share the fix.
> >
> > I am not sure if freeing them in shutdown callback is safe, because
> > shutdown callback Is not invoked in case of ERRORs. I think we'd
> > better allocate them under cachectx in the beginning to avoid freeing them
> explicitly.
> 
> I initially considered addressing this issue in a way similar to your 
> suggestion
> while fixing it, but later decided to make the change in pgoutput_shutdown,
> following the approach used for RelationSyncCache.
> This was because RelationSyncCache relies on CacheMemoryContext, and
> attrmap is a member of RelationSyncCache entry.

I think we have tended to allocate the member of RelationSyncEntry under
logical decoding context since 52e4f0c. I think that makes more sense because 
these
members ideally should live as long as the decoding context. In addition, it was
suggested[1] that allocating all the thing under CacheMemoryContext is hard to
debug. And people in another thread[2] also seems agree to remove the dependency
of CacheMemoryContext in the long term.

> Now that we're considering
> attrmap in the context of cachectx, do you think we should apply cachectx to
> RelationSyncCache as well to solve the similar issue that can occur with
> RelationSyncCache.

I personally think it could be considered as a separate project because
RelationSyncCache is accessed even after shutting down the output plugin due to
the registered cache invalidation callbacks. We probably need
MemoryContextRegisterResetCallback() to reset the static pointer
(RelationSyncCache).

[1] 
https://www.postgresql.org/message-id/20220129003110.6ndrrpanem5sb4ee%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/Z1d-uR20pt6wtQIS%40paquier.xyz

Best Regards,
Hou zj

pg_dump memory leak of 400 bytes

2024-12-11 Thread Oleg Tselebrovskiy


Greetings, everyone!

When I was looking through static analyzer output I've found a memory 
leak in pg_dump


In function dumpFunc, when dumping a function with TRANSFORMs, we 
allocate
memory for typeids array that contains OIDs of types that need to be 
transformed.

The memory is allocated with palloc and is never freed.

Valgrind also confirms this:

1) Create TRANSFORM and FUNCTION using it:
CREATE TRANSFORM FOR int LANGUAGE SQL (FROM SQL WITH FUNCTION 
prsd_lextype(internal),


 TO SQL WITH FUNCTION int4recv(internal));
CREATE FUNCTION add(int, int) RETURNS int
AS 'select $1 + $2;'
LANGUAGE SQL TRANSFORM FOR TYPE int;

2) Use valgrind to observe the memory leak:
valgrind --leak-check=yes --time-stamp=yes 
--error-markers=VALGRINDERROR-BEGIN,VALGRINDERROR-END ./pg_dump -f 
somedump.sql

...
==00:00:00:00.764 50282== VALGRINDERROR-BEGIN
==00:00:00:00.764 50282== 400 bytes in 1 blocks are definitely lost in 
loss record 92 of 134
==00:00:00:00.764 50282==at 0x4848899: malloc (in 
/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)

==00:00:00:00.764 50282==by 0x14B758: palloc (fe_memutils.c:107)
==00:00:00:00.764 50282==by 0x120D99: dumpFunc (pg_dump.c:12534)
==00:00:00:00.764 50282==by 0x10F764: main (pg_dump.c:1085)
==00:00:00:00.764 50282==
==00:00:00:00.764 50282== VALGRINDERROR-END
{
   
   Memcheck:Leak
   match-leak-kinds: definite
   fun:malloc
   fun:palloc
   fun:dumpFunc
   fun:main
}
...

Also, we always allocate the same ammount of memory (FUNC_MAX_ARGS * 
sizeof(Oid))


I propose two solutions for this problem:
1) Just free the memory in the end of this code block
2) Use static allocation since it is always the same amount

Patches for both variants are attached. I also propose changing palloc 
to pg_malloc
since it is the only place in pg_dump where palloc is used instead of 
pg_malloc


Oleg Tselebrovskiy, Postgres Prodiff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ec0cdf4ed74..9a24ebfca24 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -12531,7 +12531,7 @@ dumpFunc(Archive *fout, const FuncInfo *finfo)
 
 	if (*protrftypes)
 	{
-		Oid		   *typeids = palloc(FUNC_MAX_ARGS * sizeof(Oid));
+		Oid		   *typeids = pg_malloc(FUNC_MAX_ARGS * sizeof(Oid));
 		int			i;
 
 		appendPQExpBufferStr(q, " TRANSFORM ");
@@ -12543,6 +12543,7 @@ dumpFunc(Archive *fout, const FuncInfo *finfo)
 			appendPQExpBuffer(q, "FOR TYPE %s",
 			  getFormattedTypeName(fout, typeids[i], zeroAsNone));
 		}
+		free(typeids);
 	}
 
 	if (prokind[0] == PROKIND_WINDOW)
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ec0cdf4ed74..c63b76aa382 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -12531,7 +12531,7 @@ dumpFunc(Archive *fout, const FuncInfo *finfo)
 
 	if (*protrftypes)
 	{
-		Oid		   *typeids = palloc(FUNC_MAX_ARGS * sizeof(Oid));
+		Oid		typeids[FUNC_MAX_ARGS];
 		int			i;
 
 		appendPQExpBufferStr(q, " TRANSFORM ");

Re: advanced patch feedback session at FOSDEM, responses needed

2024-12-11 Thread Tomas Vondra

On 12/11/24 08:05, Peter Eisentraut wrote:
> Hi all,
> 
> I'm trying to organize an advanced patch feedback session adjacent to
> FOSDEM.  Right now, I'm looking to gauge interest so that concrete
> plans can be made.  Specifically, if you would like to participate and
> have your patches reviewed, please respond to this email.
> 

Thanks for organizing this. Count me in ;-)

I don't have any "logistical caveats" except that I'm not a morning
person, so would prefer a session at a bearable time.

thanks

-- 
Tomas Vondra

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

Hi Shubham,

Here are some review comments for the patch v4-0001.

==
GENERAL.

1.
After reading Vignesh's last review and then the pg_createsubscriber
documentation I see there can be multiple databases simultaneously
specified (by writing multiple -d switches) and in that case each one
gets its own:
--publication
--replication-slot
--subscription

OTOH, this new '--enable-two-phase' switch is just a blanket two_phase
enablement across all subscriptions. There is no way for the user to
say if they want it enabled for some subscriptions (on some databases)
but not for others. I suppose this was intentional and OK (right?),
but this point needs to be clarified in the docs.

==
doc/src/sgml/ref/pg_createsubscriber.sgml

2.
+  
+   Enables two_phase
+   commit for the subscription. The default is false.
+  

Following on from the previous review comment. Since this switch is
applied to all database subscriptions I think the help needs to say
that. Something like.

"If there are multiple subscriptions specified, this option applies to
all of them."

~~~

3.
In the "Prerequisites" sections of the docs, it gives requirements for
various GUC settings on the source server and the target server. So,
should there be another note here advising to configure the
'max_prepared_transactions' GUC when the '--enable-two-phase' is
specified?

~~~

4. "Warnings" section includes the following:

pg_createsubscriber sets up logical replication with two-phase commit
disabled. This means that any prepared transactions will be replicated
at the time of COMMIT PREPARED, without advance preparation. Once
setup is complete, you can manually drop and re-create the
subscription(s) with the two_phase option enabled.

~

The above text is from the "Warnings" section, but it seems stale
information that needs to be updated due to the introduction of this
new '--enable-two-phase' option.

==
src/bin/pg_basebackup/pg_createsubscriber.c

usage:
5.
  printf(_("  -t, --recovery-timeout=SECS seconds to wait for
recovery to end\n"));
+ printf(_("  -T, --enable-two-phase  enable two-phase commit
for the subscription\n"));

Given the previous review comments (#1/#2 etc), I was wondering if it
might be better to say more like "enable two-phase commit for all
subscriptions".

==
.../t/040_pg_createsubscriber.pl

6.
+is($count_prepared_s, qq(1), 'Prepared transaction replicated to subscriber');

Should there also have been an earlier check (way back before the
PREPARE) just to make sure this count was initially 0?

==
Kind Regards,
Peter Smith.
Fujitsu Australia

pg_createsubscriber TAP test wrapping makes command options hard to read.

Hi,

While reviewing a patch for another pg_createsubscriber thread I found
that multiple TAP tests have a terrible wrapping where the command
options and their associated oparg are separated on different lines
instead of paired together nicely. This makes it unnecessarily
difficult to see what the test is doing.

Please find the attached patch that changes the wrapping to improve
the command option readability. For example,

BEFORE
command_ok(
[
  'pg_createsubscriber', '--verbose',
  '--recovery-timeout', "$PostgreSQL::Test::Utils::timeout_default",
  '--dry-run', '--pgdata',
  $node_s->data_dir, '--publisher-server',
  $node_p->connstr($db1), '--socketdir',
  $node_s->host, '--subscriber-port',
  $node_s->port, '--publication',
  'pub1', '--publication',
  'pub2', '--subscription',
  'sub1', '--subscription',
  'sub2', '--database',
  $db1, '--database',
  $db2
],
'run pg_createsubscriber --dry-run on node S');
AFTER
command_ok(
[
  'pg_createsubscriber', '--verbose',
  '--recovery-timeout', "$PostgreSQL::Test::Utils::timeout_default",
  '--dry-run',
  '--pgdata', $node_s->data_dir,
  '--publisher-server', $node_p->connstr($db1),
  '--socketdir', $node_s->host,
  '--subscriber-port', $node_s->port,
  '--publication', 'pub1',
  '--publication', 'pub2',
  '--subscription', 'sub1',
  '--subscription', 'sub2',
  '--database', $db1,
  '--database', $db2
],
'run pg_createsubscriber --dry-run on node S');

~~~

BTW, now that they are more readable I can see that there are some anomalies:

1. There is a test (the last one in the patch) that has multiple
'--verbose' switches. There is no comment to say it was deliberate, so
it looks suspiciously accidental.

2. The same test names a publication as 'Pub2' instead of 'pub2'.
Again, there is no comment to say it was deliberate so was that case
change also accidental?

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Modify-wrapping-to-make-command-options-easier-to.patch
Description: Binary data

Re: Pre-proposal: unicode normalized text

2024-12-11 Thread Jeff Davis

On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote:
> Attached is an implementation of a per-database option STRICT_UNICODE
> which enforces the use of assigned code points only.

I'm withdrawing this patch due to lack of interest.

Regards,
Jeff Davis

Re: FileFallocate misbehaving on XFS

2024-12-11 Thread Andres Freund

Hi,

FWIW, I tried fairly hard to reproduce this.

An extended cycle of 80 backends copying into relations and occasionally
truncating them (to simulate the partitions being dropped and new ones
created). For this I ran a 4TB filesystem very close to fully filled (peaking
at 99.998 % full).

I did not see any ENOSPC errors unless the filesystem really was full at that
time. To check that, I made mdzeroextend() do a statfs() when encountering
ENOSPC, printed statfs.f_blocks and made that case PANIC.

What I do see is that after - intentionally - hitting an out-of-disk-space
error, the available disk space would occasionally increase a small amount
after a few seconds. Regardless of whether using the fallocate and
non-fallocate path.

>From what I can tell this small increase in free space has a few reasons:

- Checkpointer might not have gotten around to unlinking files, keeping the
  inode alive.

- Occasionally bgwriter or a backend would have relation segments that were
  unlinked open, so the inode (not the actual file space, because the segment
  to prevent that) could not yet be removed from the filesystem

- It looks like xfs does some small amount of work to reclaim space in the
  background. Which makes sense, otherwise each unlink would have to be a
  flush to disk.

But that's not in any way enough amount of space to explain what you're
seeing. The most I've were 6MB, when ramping up the truncation frequency a
lot.

Of course this was on a newer kernel, not on RHEL / RL 8/9.

Just to make sure - you're absolutely certain that you actually have space at
the time of the errors?  E.g. a checkpoint could free up a lot of WAL after a
checkpoint that's soon after an ENOSPC, due to removing now-unneeded WAL
files. That can be 100s of gigabytes.

If I were to provide you with a patch that showed the amount of free disk
space at the time of an error, the size of the relation etc, could you
reproduce the issue with it applied? Or is that unrealistic?

On 2024-12-11 13:05:21 +0100, Jakub Wartak wrote:
> - one AG with extreme low extent sizes compared to the others AGs (I bet
> that 2->3 22.73% below means small 8192b pg files in $PGDATA, but there are
> no large extents in that AG)
>from  to extents  blockspct
>   1   149494949   0.65
>   2   3   86113  173452  22.73
>   4   7   19399   94558  12.39
>   8  15   23233  248602  32.58
>  16  31   12425  241421  31.64
>total free extents 146119
>total free blocks 762982
>average free extent size 5.22165 (!)

Note that this does not mean that all extents in the AG are that small, just
that the *free* extents are of that size.

I think this might primarily be because this AG has the smallest amount of
free blocks (2.9GB). However, the fact that it *does* have less, could be
interesting. It might be the AG associated with the directory for the busiest
database or such.

The next least-space AG is:

   from  to extents  blockspct
  1   110211021   0.10
  2   3   48748   98255  10.06
  4   79840   47038   4.81
  8  15   13648  146779  15.02
 16  31   15818  323022  33.06
 32  63 584   27932   2.86
 64 127 147   14286   1.46
128 255 253   49047   5.02
256 511 229   87173   8.92
5121023 139  102456  10.49
   10242047  51   72506   7.42
   20484095   37422   0.76
total free extents 90481
total free blocks 976937

It seems plausible it'd would look similar if more of the free blocks were
used.

> - we have logic of `extend_by_pages += extend_by_pages * waitcount;` capped
> up to 64 pg blocks maximum (and that's higher than the above)
> - but the fails where observed also using pg_upgrade --link -j/pg_restore
> -j (also concurrent posix_fallocate() to many independent files sharing the
> same AG, but that's 1 backend:1 file so no contention for waitcount in
> RelationAddBlocks())

We also extend by more than one page, even without concurrency, if
bulk-insertion is used, and i think we do use that for
e.g. pg_attribute. Which is actually the table where pg_restore encountered
the issue:

pg_restore: error: could not execute query: ERROR:  could not extend
file "pg_tblspc/16401/PG_16_202307071/17643/1249.1" with
FileFallocate(): No space left on device

1249 is the initial relfilenode for pg_attribute.

There could also be some parallelism leading to bulk extension, due to the
parallel restore. I don't remember which commands pg_restore actually executes
in parallel.

Greetings,

Andres Freund

Re: Add Postgres module info

Andrei Lepikhov  writes:
> It makes sense. But I want to clarify that I avoided changing 
> PG_MODULE_MAGIC because the newly introduced structure has a totally 
> different purpose and usage logic: the struct is designed to check 
> compatibility, but module info isn't connected to the core version at 
> all: a single version of the code may be built for multiple PG versions. 
> At the same time, various versions of the same library may be usable 
> with the same core.

Surely.  But I don't see a need for two separately-looked-up
physical structures.  Seems to me it's sufficient to put the
ABI-checking fields into a sub-struct within the magic block.

regards, tom lane

Re: Add support to TLS 1.3 cipher suites and curves lists

2024-12-11 Thread Jonathan S. Katz


On 12/11/24 10:14 AM, Daniel Gustafsson wrote:

On 11 Dec 2024, at 18:47, Tom Lane  wrote:



Oh yay, another naming problem :-(.  I think that neither "ciphers"
vs. "cipher suites" nor "ssl_ciphers" vs. "ssl_ciphers_tlsv13" is
going to convey a lot to the average person who's not steeped in
TLS minutiae.  However, following the precedent of Apache and Curl
seems like a good answer --- that will ensure that at least some
part of the internet-using world has seen this before.  So I guess
I'm +0.5 for the ssl_ciphers_tlsv13 answer, at least out of the
choices suggested so far.


The subset of users who are likely to be interested in this setting would
probably be more confused if we didn't follow the precedent from other
well-known projects.


+1 to this point. The people I talk to who are interested in the 
`cipher_suites` setting, are also the folks who are actually paying 
attention to when and how ciphers/ciphersuites are used, and have strong 
opinions on such. It also seems that OpenSSL is pushing in the direction 
of making everything a "ciphersuite", albeit the -ciphersuites flag is 
just for TLS v1.3+[1].


I think the `ssl_cipher_suites` proposal is fine; OK with bikeshedding 
to `ssl_ciphersuites`.


Thanks,

Jonathan

[1] https://docs.openssl.org/3.3/man1/openssl-ciphers/#options

Re: Add Postgres module info

2024-12-11 Thread Euler Taveira

On Wed, Dec 11, 2024, at 4:26 PM, Andres Freund wrote:
> On 2024-12-11 13:21:03 -0500, Tom Lane wrote:
> > Andrei Lepikhov  writes:
> > > I would like to propose the module_info structure, which aims to let
> > > extension maintainers sew some data into the binary file. Being included
> > > in the module code, this information remains unchanged and is available
> > > for reading by a backend.
> >
> > I don't have much of an opinion one way or the other about the
> > usefulness of these additional info fields.
> 
> FWIW, Id like to have some more information in there, without commenting on
> the specifics.

+1 for the general idea. I received some reports like [1] related to wal2json
that people wants to obtain the output plugin version. Since it is not installed
via CREATE EXTENSION, it is not possible to detect what version is installed,
hence, some tools cannot have some logic to probe the module version. In a
managed environment, it is hard to figure out the exact version for
non-CREATE-EXTENSION modules, unless it is explicitly informed by the vendor.

[1] https://github.com/eulerto/wal2json/issues/181


--
Euler Taveira
EDB   https://www.enterprisedb.com/

Re: Add Postgres module info


On 12/12/2024 01:21, Tom Lane wrote:

Andrei Lepikhov  writes:

I would like to propose the module_info structure, which aims to let
extension maintainers sew some data into the binary file. Being included
in the module code, this information remains unchanged and is available
for reading by a backend.


I don't have much of an opinion one way or the other about the
usefulness of these additional info fields.  But I would like to
object to the way you've gone about it, namely to copy-and-paste
the magic-block mechanism.  That doesn't scale: the next time
somebody else wants some more fields, will we have three such
structs?
It makes sense. But I want to clarify that I avoided changing 
PG_MODULE_MAGIC because the newly introduced structure has a totally 
different purpose and usage logic: the struct is designed to check 
compatibility, but module info isn't connected to the core version at 
all: a single version of the code may be built for multiple PG versions. 
At the same time, various versions of the same library may be usable 
with the same core.


From the coding point of view, I agree that your approach is more 
laconic and reasonable. I will rewrite the code using this approach.


--
regards, Andrei Lepikhov

Re: Pass ParseState as down to utility functions.

On Tue, Dec 10, 2024 at 10:38:41PM +0800, jian he wrote:
> add parser_errposition to some places in
> transformTableConstraint, transformColumnDefinition
> where v8 didn't.

I've looked at the new tests in 0001.  Here are some notes.  And I've
found some mistakes and simplifications on the way.

 CREATE TYPE test_type2 AS (a int, b text);
+CREATE TABLE test_tbl2 OF xx;
+ERROR:  type "xx" does not exist
[...]
+ALTER TABLE tt0 OF tt_t_noexist;
+ERROR:  type "tt_t_noexist" does not exist

typed_table.out checks that already, so these additions bring nothing
new:
CREATE TABLE ttable1 OF nothing;
ERROR:  type "nothing" does not exist

The three tests for ALTER TABLE .. SET DATA TYPE are new patterns, so
these are OK.  The COLLATE case was kind of covered with CREATE
DOMAIN, but the command is different.

The ALTER TABLE .. ALTER COLUMN case for a generated column is new, so
that's OK.

CREATE TYPE (like=no_such_type) also makes sense, that's new coverage.
This query pattern is only used in expressions, float8 and float4.

+--test error report position

As of 0001, this comment is incorrect.  We're not testing an error
position yet.  With 0002, it would be correct.  Let's just use a more
generic wording that applies to both patches.

+create domain d_fail as int4 constraint cc generated always as (2) stored;
+ERROR:  specifying GENERATED not supported for domains
+create domain d_fail as int4 constraint cc check(values > 1) no inherit;
+ERROR:  check constraints for domains cannot be marked NO INHERIT

Can be reduced to one rather than two.

+create domain d_fail as int4 constraint cc check(values > 1) deferrable;
+ERROR:  specifying constraint deferrability not supported for domains
+create domain d_fail as int4 constraint cc check(values > 1) not deferrable;
+ERROR:  specifying constraint deferrability not supported for domains
+create domain d_fail as int4 constraint cc check (value > 1) initially 
deferred;
+ERROR:  specifying constraint deferrability not supported for domains
+create domain d_fail as int4 constraint cc check(values > 1) initially 
immediate;
+ERROR:  specifying constraint deferrability not supported for domains
+create domain d_fail as int4 constraint cc check(values > 1) deferrable not 
deferrable ;
+ERROR:  specifying constraint deferrability not supported for domains

Testing the full set of keywords is not really interesting.  So let's
just use one to make the script cheaper.

I was wondering for a few seconds about exclusion constraints, but it
requires a named constraint to trigger the error, so I've left it out
for now.

+create domain d_fail as int constraint cc REFERENCES this_table_not_exists(i);

Hmm.  Funny.  We don't document REFERENCES in the docs of CREATE
DOMAIN.  Neither do we document GENERATED.  Looks like a doc issue to
me, independent of this thread.  ALTER DOMAIN uses a different parsing
clause than CREATE DOMAIN, meaning that generated columns or
references cannot be altered..  It looks like there's quite a bit more
going on here.  The fact that we don't have tests for these patterns
authorized by the parser should be tracked anyway, so let's add them
for now.  This should be looked at on a separate thread.

For now, I've applied the new tests.  Let's move on with the additions
of 0002, and see if these are good to have or not (noticed Tom's
comments about the type paths, of course).
--
Michael

signature.asc
Description: PGP signature

Re: Add Pipelining support in psql

2024-12-11 Thread Jelte Fennema-Nio

On Tue, 10 Dec 2024 at 11:44, Anthonin Bonnefoy
 wrote:
> num_queries (2nd element in the pipeline status prompt) is now used to
> track queued queries that were not flushed (with a flush request or
> sync) to the server. It used to count both unflushed queries and
> flushed queries.

I skimmed the code a bit, but haven't looked closely at it yet. I did
try the patch out though. My thoughts below:

I think that new prompt is super useful, so useful in fact that I'd
suggest linking to it from the \startpipeline docs. I do think that
the wording in the docs could be a bit more precise:
1. The columns are not necessarily queries, they are messages or
commands. i.e. \parse and \bind_named both count as 1, even though
they form one query together.
2. messages not followed by \sync and \flushrequest, could very well
already "be sent to the server" (if the client buffer got full, or in
case of manual \flush). The main thing that \sync and \flushrequest do
is make sure that the server actually sends its own result back, even
if its buffer is not full yet.

The main feedback I have after playing around with this version is
that I'd like to have a \getresult (without the s), to only get a
single result. So that you can get results one by one, possibly
interleaved with some other queries again.

One thing I'm wondering is how useful the num_syncs count is in the
pipeline prompt, since you never really wait for a sync.

Regarding the usefulness of \flush. I agree it's not as useful as I
thought, because indeed \getresults already flushes everything. But
it's not completely useless either. The main way I was able to use it
interactively in a somewhat interesting way was to send it after a
long running query, and then while that's processing type up the next
query after it. Something like the following:

localhost jelte@postgres:5432-26274=
#> \startpipeline
Time: 0.000 ms
localhost jelte@postgres:5432-26274=
#|0,0,0|> select pg_sleep(5) \bind \g
Time: 0.000 ms
localhost jelte@postgres:5432-26274=
#|0,1,0|*> \flush
Time: 0.000 ms
localhost jelte@postgres:5432-26274=
#|0,1,0|*> select 1 \bind \g
Time: 0.000 ms
localhost jelte@postgres:5432-26274=
#|0,2,0|*> \syncpipeline
Time: 0.000 ms
localhost jelte@postgres:5432-26274=
#|1,0,2|*> \getresults
 pg_sleep
──

(1 row)

 ?column?
──
1
(1 row)

Time: 0.348 ms

Re: Fix comments related to pending statistics

On Wed, Dec 11, 2024 at 07:32:38AM +, Bertrand Drouvot wrote:
> On Wed, Dec 11, 2024 at 02:56:08PM +0900, Michael Paquier wrote:
>> Your suggestion does not look completely right to me.  There is
>> nothing preventing us from using something else than event counters
>> since we don't use memcpy() and there is no comparison work, no?  It
>> seems to me that we could remove the entire sentence instead.
> 
> Do you mean also remove the comments in pgstat_function_flush_cb() and
> pgstat_subscription_flush_cb()? Those comments look fine to me given
> the places where those pending entries are created meaning in
> pgstat_init_function_usage() for the functions and 
> pgstat_report_subscription_error()
> and pgstat_report_subscription_conflict() for the subscriptions.

My apologies for the confusion.  I see no problem with the existing
comments in pgstat_subscription_flush_cb() and
pgstat_function_flush_cb() because they still apply.  The comment in
pgstat.h 

The only thing we should do here is to remove the comment for
PgStat_FunctionCounts because we could add pointers or something else
than plain counters in this structure, and fix the comment of
PgStat_TableCounts in the lines of what you are suggesting.
--
Michael


signature.asc
Description: PGP signature

Buffering in tuplesort.c for in-sort deduplication; nbtree edition

2024-12-11 Thread Matthias van de Meent

Hi,

In the GIN parallel build thread [0] I proposed adding a 'flush'
callback to the tuplesort API [1], to be called by the tuplesort
machinery after writing each run, so that users of tuplesort can
buffer write calls within each sorted run and use that buffer state to
merge/deduplicate sorttuples before writing them to disk (assuming
some invariants are held).
By implementing deduplication the user can thus reduce the number of
sortable tuples and the total working set size significantly, thus
later reducing time spent reading and processing those runs and
speeding up the build time for tuplesorts with many duplicate sort
keys, but with mergable output.

As the GIN parallel index patch was already large and complex enough,
it was suggested to split it into a separate patch. I'd said that I'd
try to build one, so here is a patch that adds the required API to the
tuplesort internals, and implements the 'buffer, deduplicate, flush'
idea for nbtree index builds that can benefit from deduplication.

If deduplication is enabled for the btree index, we now merge tuples
up to the tuplesort buffer size while we're still extracting tuples
from the heap, thereby reducing the temporary storage requirement of
the index build.

One complication that this patch needs to deal with is that parallel
scans (and syncscan wraparounds) can cause some higher TIDs to appear
before lower tids in two posting lists (e.g. lists [1, 4] and [2, 3]).
Therefore, we must take some special care while loading the tree to
make sure we only write out TIDs which are smaller than the latest
tuple's TID; which is implemented with a growing TID buffer.

As part of these changes, there are some changes to the btree build
behaviour. Where previously we wouldn't ever create posting lists
larger than the target posting list size (812 bytes), we can now emit
some posting tuples with up to 10 [^3] TIDs and up to BTMaxItemSize
large when the non-deduplicated base tuple would otherwise be large
enough to consume that 812-byte posting list size limit.

Local testing shows that index builds can see 60%+ reduced temporary
disk usage when indexing values with high duplication factors -storage
comparable to the resulting index- and I've seen 50% index reduced
build times for some text-based deduplication workloads.

Note: this isn't yet very polished, but it works. I'm reasonably happy
with happy-path performance so far, but I've seen bad cases (unique
dataset without UNIQUE specifier) regress by as much as 30%, so this
is definitely not (yet?) a panacea.

It should be noted that UNIQUE btree builds still make use of the old
non-duplicating tuplesort code, meaning they can be used to detect
regressions in this patch (vs a non-UNIQUE index build with otherwise
the same definition).

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0] 
https://postgr.es/m/flat/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
[1] 
https://postgr.es/m/CAEze2WjB1vpxtvKuWVEThSaB-v4%2B8H0EXsOB%3DyLAv8pLcrQuKw%40mail.gmail.com
[2] 
https://postgr.es/m/CAEze2WiTAeZe4t5wAeRN834xFBqROPmjeK2XTstNko6bbVPX%3DA%40mail.gmail.com

[^3] Chosen by fair dice roll
From bf0f1b64932dfc206bb6f94a875abf92b44c7da0 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent 
Date: Wed, 28 Aug 2024 15:28:37 +0200
Subject: [PATCH v0 1/2] Allow tuplesort implementations to buffer writes

Before, all writes to the sort tapes would have to be completed during
the call to writetup().  That's sufficient when the user of tuplesort
isn't interested in merging sorted tuples, but btree (and in the future,
GIN) sorts tuples to later merge them during insertion into the index.
If it'd merge the tuples before writing them to disk, we can save
significant disk space and IO.

As such, we allow WRITETUP to do whatever it wants when we're filling a
tape with tuples, and call FLUSHWRITES() at the end to mark the end of
that tape so that the tuplesort can flush any remaining buffers to disk.

By design, this does _not_ allow deduplication while the dataset is still
in memory. Writing data to disk is inherently expensive, so we're likely
to win time by spending some additional cycles on buffering the data in
the hopes of not writing as much data. However, in memory the additional
cycles may cause too much of an overhead to be useful.

Note that any implementation of tuple merging using the buffering
strategy that is enabled by this commit must also make sure that the
merged tuples are definitely not larger than the sum of the sizes of the
merged tuples.
---
 src/include/utils/tuplesort.h  | 9 +
 src/backend/utils/sort/tuplesort.c | 5 +
 src/backend/utils/sort/tuplesortvariants.c | 7 +++
 3 files changed, 21 insertions(+)

diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f6201..ed17ca00b6 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -194,6 +194,15 @@ typedef struct
 	void		(*writetup) (Tuplesortstat

Re: CRC32C Parallel Computation Optimization on ARM

2024-12-11 Thread John Naylor

On Wed, Dec 11, 2024 at 11:54 PM Nathan Bossart
 wrote:
>
> On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:

> > and how light it was. With more hardware support, we can go much lower
> > than 1024 bytes, but that can be left for future work.
>
> Nice.  I'm curious how this compares to both the existing implementations
> and the proposed ones that require new intrinsics.  I like the idea of
> avoiding new runtime and config checks, especially if the performance is
> somewhat comparable for the most popular cases (i.e., dozens of bytes to a
> few thousand bytes).

With 8k inputs on x86 its fairly close to 3x faster than master.

I wasn't very clear, but v9 still has a cutoff of 1008 bytes just to
copy from 0008, but on a slightly old machine the crossover point is
about 400-600 bytes. Doing microbenchmarks that hammer on single
instructions is very finicky, so I don't trust these numbers much.

With hardware CLMUL, I'm guessing cutoff would be between 120 and 192
bytes (must be a multiple of 24 -- 3 words), and would depend on
architecture. Arm has an advantage that vmull_p64() operates on
scalars, but on x86 the corresponding operation is
_mm_clmulepi64_si128() , and there's a bit of shuffling in and out of
vector registers.

> If we still want to add new intrinsics, would it be easy enough to add them
> on top of this patch?  Or would it require further restructuring?

I'm still trying to wrap my head around how function selection works
after commit 4b03a27fafc , but it could be something like this on x86:

#if defined(__has_attribute) && __has_attribute (target)

pg_attribute_target("sse4.2,pclmul")
pg_comp_crc32c_sse42
{

}

#endif

pg_attribute_target("sse4.2")
pg_comp_crc32c_sse42
{

}

...where we have the tail part in a separate function for readability.

On Arm it might have to be as complex as in 0008, since as you've
mentioned, compiler support for the needed attributes is still pretty
new.

--
John Naylor
Amazon Web Services

Re: Pass ParseState as down to utility functions.

On Thu, Dec 12, 2024 at 10:08:04AM +0800, jian he wrote:
> On Thu, Dec 12, 2024 at 4:48 AM Tom Lane  wrote:
>> I'm not loving the idea of cons'ing up ParseStates in random places in
>> tablecmds.c.  I think we ought to fix things so that the one made in
>> standard_ProcessUtility is passed down to all these places, replacing
>> ad-hoc queryString and queryEnv parameters.

> the main code change is within DefineDomain.

[ .. more comments from Jian .. ]

Yeah, I'm not much a fan of some of the changes of tablecmds.c, that
makes the whole stack more complicated than it should.

ProcessUtilitySlow() passing down a ParseState to DefineDomain() and
AlterType() is more consistent and a good idea IMO.

I think that the patch should be split a bit more.  Even if different
areas of the code are touched, there's more than one idea of how to
create this information for the error position, so each piece could be
committed separately with changes showing up as diffs in the
regression tests.
--
Michael


signature.asc
Description: PGP signature

Re: pg_createsubscriber TAP test wrapping makes command options hard to read.

On Thu, Dec 12, 2024 at 12:41 PM Tom Lane  wrote:
>
> Peter Smith  writes:
> > While reviewing a patch for another pg_createsubscriber thread I found
> > that multiple TAP tests have a terrible wrapping where the command
> > options and their associated oparg are separated on different lines
> > instead of paired together nicely. This makes it unnecessarily
> > difficult to see what the test is doing.
>
> I think that is mostly the fault of pgperltidy.  We did change
> the options we use with it awhile back, so maybe now it will honor
> your manual changes to its line-wrapping choices.  But I wouldn't
> bet on that.  Did you check what happens if you run the modified
> code through pgperltidy?
>
> (If the answer is bad, we could look into making further changes so
> that pgperltidy won't override these decisions.  But there's no point
> in manually patching this if it'll just get undone.)
>

Thanks for your suggestion. As you probably suspected, the answer is bad.

I ran pgperltidy on the "fixed" file:
[postgres@CentOS7-x64 oss_postgres_misc]$
src/tools/pgindent/pgperltidy
src/bin/pg_basebackup/t/040_pg_createsubscriber.pl

This reverted it to how it currently looks on master.

The strange thing is there are other commands in that file very
similar to the ones I had changed but those already looked good, yet
they remained unaffected by the pgperltidy. Why?

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Pass ParseState as down to utility functions.

On Thu, Dec 12, 2024 at 4:48 AM Tom Lane  wrote:
>
> jian he  writes:
> > add parser_errposition to some places in
> > transformTableConstraint, transformColumnDefinition
> > where v8 didn't.
>
> I'm not loving the idea of cons'ing up ParseStates in random places in
> tablecmds.c.  I think we ought to fix things so that the one made in
> standard_ProcessUtility is passed down to all these places, replacing
> ad-hoc queryString and queryEnv parameters.
>

the main code change is within DefineDomain.

AlterTableUtilityContext comments says:
/* Info needed when recursing from ALTER TABLE */
so we cannot pass DefineDomain with AlterTableUtilityContext.

-DefineDomain(CreateDomainStmt *stmt)
+DefineDomain(ParseState *pstate, CreateDomainStmt *stmt)
we have to pass either ParseState or queryString to DefineDomain.

-extern ObjectAddress AlterType(AlterTypeStmt *stmt);
+extern ObjectAddress AlterType(ParseState *pstate, AlterTypeStmt *stmt);
this change not necessary, we can remove it.


but other places (listed in below),
we are passing (AlterTableUtilityContext *context) which seems ok?

-ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode)
+ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode,
+ AlterTableUtilityContext *context)

 static ObjectAddress ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
-   AlterTableCmd *cmd, LOCKMODE lockmode);
+   AlterTableCmd *cmd, LOCKMODE lockmode,
+   AlterTableUtilityContext *context);

 static ObjectAddress
 ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
-  AlterTableCmd *cmd, LOCKMODE lockmode)
+  AlterTableCmd *cmd, LOCKMODE lockmode,
+  AlterTableUtilityContext *context)

RE: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-11 Thread Hayato Kuroda (Fujitsu)

Dear Shubham,

> Thank you for pointing this out and for suggesting the changes. I
> agree with your approach.
> Also, I found a mistake in getopt_long and fixed it in this version of
> the patch.
> The attached patch contains the suggested changes.

Thanks for updating the patch. I think the patch looks mostly OK.

Regarding the test code - I think we should directly refer the pg_subscription 
catalog,
and confirm that subtwophase is 'p'. IIUC, we can wait until
all subscriptions. Subtwophasestate are 'e', by using poll_query_until() and 
[1].

[1]: SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN 
('e');

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Re: Wrong results with right-semi-joins

2024-12-11 Thread Richard Guo

On Wed, Dec 11, 2024 at 11:27 AM Richard Guo  wrote:
> I spent some time on this and came up with a simpler query to
> reproduce the issue.
>
> explain (costs off)
> select * from tbl_rs t1 join
>   lateral (select * from tbl_rs t2 where t2.a in
> (select t1.a+t3.a from tbl_rs t3) and t2.a < 5)
>   on true;
> QUERY PLAN
> ---
>  Nested Loop
>->  Seq Scan on tbl_rs t1
>->  Hash Right Semi Join
>  Hash Cond: ((t1.a + t3.a) = t2.a)
>  ->  Seq Scan on tbl_rs t3
>  ->  Hash
>->  Seq Scan on tbl_rs t2
>  Filter: (a < 5)
> (8 rows)
>
> Without the fix, this query returns 3 rows rather than the expected 6.
>
> Maybe I should update the test case introduced in 5668a857d to this
> one.

Done.

Thanks
Richard

Re: pg_createsubscriber TAP test wrapping makes command options hard to read.

Peter Smith  writes:
> The strange thing is there are other commands in that file very
> similar to the ones I had changed but those already looked good, yet
> they remained unaffected by the pgperltidy. Why?

You sure it's not just luck-of-the-draw?  I think that perltidy
is just splitting the lines based on length, so sometimes related
options would be kept together and sometimes not.

regards, tom lane

Re: Add Postgres module info

"Euler Taveira"  writes:
> +1 for the general idea. I received some reports like [1] related to wal2json
> that people wants to obtain the output plugin version. Since it is not 
> installed
> via CREATE EXTENSION, it is not possible to detect what version is installed,
> hence, some tools cannot have some logic to probe the module version. In a
> managed environment, it is hard to figure out the exact version for
> non-CREATE-EXTENSION modules, unless it is explicitly informed by the vendor.

What would you foresee as the SQL API for inspecting a module that's
not tied to an extension?

regards, tom lane

Re: pg_createsubscriber TAP test wrapping makes command options hard to read.

Peter Smith  writes:
> While reviewing a patch for another pg_createsubscriber thread I found
> that multiple TAP tests have a terrible wrapping where the command
> options and their associated oparg are separated on different lines
> instead of paired together nicely. This makes it unnecessarily
> difficult to see what the test is doing.

I think that is mostly the fault of pgperltidy.  We did change
the options we use with it awhile back, so maybe now it will honor
your manual changes to its line-wrapping choices.  But I wouldn't
bet on that.  Did you check what happens if you run the modified
code through pgperltidy?

(If the answer is bad, we could look into making further changes so
that pgperltidy won't override these decisions.  But there's no point
in manually patching this if it'll just get undone.)

regards, tom lane

Re: not null constraints, again

On Wed, Dec 4, 2024 at 10:52 AM jian he  wrote:
>
> hi.
>
> heap_create_with_catalog argument (cooked_constraints):
> passed as NIL in function {create_toast_table, make_new_heap}
> passed as list_concat(cookedDefaults,old_constraints) in DefineRelation
>
> in DefineRelation we have function call:
> MergeAttributes
> heap_create_with_catalog
> StoreConstraints
>
> StoreConstraints second argument: cooked_constraints, some is comes from
> DefineRelation->MergeAttributes old_constraints:
> {
> stmt->tableElts = MergeAttributes(stmt->tableElts, inheritOids,
> stmt->relation->relpersistence, stmt->partbound != NULL, &old_constraints,
> &old_notnulls);
> }
>
> My understanding from DefineRelation->MergeAttributes is that old_constraints
> will only have CHECK constraints.
> that means heap_create_with_catalog->StoreConstraints
> StoreConstraints didn't actually handle CONSTR_NOTNULL.
>
> heap_create_with_catalog comments also says:
> * cooked_constraints: list of precooked check constraints and defaults
>
> coverage https://coverage.postgresql.org/src/backend/catalog/heap.c.gcov.html
> also shows StoreConstraints, CONSTR_NOTNULL never being called,
> which is added by this thread.
>
>
> my question is can we remove StoreConstraints, CONSTR_NOTNULL handling.
> we have 3 functions {StoreConstraints, AddRelationNotNullConstraints,
> AddRelationNewConstraints} that will call StoreRelNotNull to store the 
> not-null
> constraint. That means if we want to bullet proof that something is 
> conflicting
> with not-null, we need to add code to check all these 3 places.
> removing StoreConstraints handling not-null seems helpful.
>
>
> also comments in MergeAttributes:
>  * Output arguments:
>  * 'supconstr' receives a list of constraints belonging to the parents,
>  *updated as necessary to be valid for the child.
>  * 'supnotnulls' receives a list of CookedConstraints that corresponds to
>  *constraints coming from inheritance parents.
>
> can we be explicit that "supconstr" is only about CHECK constraint,
> "supnotnulls" is
> only about NOT-NULL constraint.

patch attached.

also change comments of heap_create_with_catalog,
StoreConstraints, MergeAttributes.
so we can clear idea what's kind of constraints we are dealing with
in these functions.
From 1a2d75b6d107eb372edfca8e9a2e7df19ba08a6e Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 12 Dec 2024 10:45:58 +0800
Subject: [PATCH v1 1/1] remove StoreConstraints dealing with CONSTR_NOTNULL

StoreConstraints never need to deal with CONSTR_NOTNULL.
so remove that part.
because of this, change comments for functions: heap_create_with_catalog,
StoreConstraints, MergeAttributes.
so we can clear idea what's kind of constraints we are dealing with
in these functions.

discussion: https://postgr.es/m/CACJufxFxzqrCiUNfjJ0tQU+=nKQkQCGtGzUBude=smowj5v...@mail.gmail.com
---
 src/backend/catalog/heap.c   | 11 ++-
 src/backend/commands/tablecmds.c |  8 
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index d7b88b61dc..6a6c328a27 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1489,7 +1489,7 @@ heap_create_with_catalog(const char *relname,
 	InvokeObjectPostCreateHookArg(RelationRelationId, relid, 0, is_internal);
 
 	/*
-	 * Store any supplied constraints and defaults.
+	 * Store any supplied CHECK constraints and defaults.
 	 *
 	 * NB: this may do a CommandCounterIncrement and rebuild the relcache
 	 * entry, so the relation must be valid and self-consistent at this point.
@@ -2224,7 +2224,7 @@ StoreRelNotNull(Relation rel, const char *nnname, AttrNumber attnum,
 }
 
 /*
- * Store defaults and constraints (passed as a list of CookedConstraint).
+ * Store defaults and CHECK constraints (passed as a list of CookedConstraint).
  *
  * Each CookedConstraint struct is modified to store the new catalog tuple OID.
  *
@@ -2268,13 +2268,6 @@ StoreConstraints(Relation rel, List *cooked_constraints, bool is_internal)
 numchecks++;
 break;
 
-			case CONSTR_NOTNULL:
-con->conoid =
-	StoreRelNotNull(rel, con->name, con->attnum,
-	!con->skip_validation, con->is_local,
-	con->inhcount, con->is_no_inherit);
-break;
-
 			default:
 elog(ERROR, "unrecognized constraint type: %d",
 	 (int) con->contype);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6ccae4cb4a..9142f8e4ad 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2433,10 +2433,10 @@ storage_name(char c)
  * 'is_partition' tells if the table is a partition.
  *
  * Output arguments:
- * 'supconstr' receives a list of constraints belonging to the parents,
- *		updated as necessary to be valid for the child.
- * 'supnotnulls' receives a list of CookedConstraints that corresponds to
- *		constraints coming from inheritance parents.
+ * 'supconstr' receives a list of CookedConstraints CH

Re: Remove useless GROUP BY columns considering unique index

On Mon, 2 Dec 2024 at 17:22, jian he  wrote:
> regarding v10.
> you placed remove_useless_groupby_columns right after add_base_rels_to_query
> makes so much sense.
> so we can be safely use cached RelOptInfo->indexlist, 
> RelOptInfo->notnullattnums
> overall it didn't find any issue.

Thanks for looking.

After a bit more adjustment, I've pushed both patches.

I felt it was best to have the commit that moved
remove_useless_groupby_columns also change the call site of that
function.

David

Re: Remove useless GROUP BY columns considering unique index

On Mon, 2 Dec 2024 at 17:18, Andrei Lepikhov  wrote:
> Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
> to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
> been found.

I did end up adding another fast path there, but I felt like checking
relid > 0 wasn't as good as it could be as that would have only
short-circuited when we don't see any Vars of level 0 in the GROUP BY.
It seemed cheap enough to short-circuit when none of the relations
mentioned in the GROUP BY have multiple columns mentioned.

> NOTES:
> 1. Uniqueness is proved by a UNIQUE index. It might be a bit more
> sophisticated to probe not only grouping qual but also employ
> equivalence classes. In that case, we could process something like that:
>
> CREATE TABLE test (
>x int NOT NULL, y int NOT NULL, z int NOT NULL, w int);
> CREATE UNIQUE INDEX ON test(y,z);
> SELECT t2.z FROM test t2, test t1 WHERE t1.y=t2.y
> GROUP BY t1.y,t2.z,t2.w;

It might be worth doing something like that. It looks like we could
delay remove_useless_groupby_columns() until standard_qp_callback
anyway. Further modifications to the GROUP BY can occur there. It
might make sense to replace the call to
make_pathkeys_for_sortclauses_extended() in standard_qp_callback()
with a version of remove_useless_groupby_columns() which does both
tasks plus the one you mention.

However, I don't know what the logic would be exactly for the case you
mention as it seems possible if we start swapping columns out for
another EquivalenceMember that we might cause some regression. For
example, if you had:

create table t1(a int, b int, primary key(a,b);
create table t2(x int, y int);

select ... from t1 inner join t2 on t1.a=t2.x and t1.b = t2.y group by
t2.x,t1.b;

when how do you decide if the GROUP BY should become t1.a,t1.b or
t2.x,t2.y? It's not clear to me that using t1's columns is always
better than using t2's. I imagine using a mix is never better, but I'm
unsure how you'd decide which ones to use.

David

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-11 Thread vignesh C

On Thu, 12 Dec 2024 at 08:14, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shubham,
>
> > Thank you for pointing this out and for suggesting the changes. I
> > agree with your approach.
> > Also, I found a mistake in getopt_long and fixed it in this version of
> > the patch.
> > The attached patch contains the suggested changes.
>
> Thanks for updating the patch. I think the patch looks mostly OK.
>
> Regarding the test code - I think we should directly refer the 
> pg_subscription catalog,
> and confirm that subtwophase is 'p'. IIUC, we can wait until
> all subscriptions. Subtwophasestate are 'e', by using poll_query_until() and 
> [1].
>
> [1]: SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN 
> ('e');

Yes, that approach is better. Also this is not required after checking
using pg_subscription:
+# Prepare a transaction on the publisher
+$node_p->safe_psql(
+ $db1, qq[
+BEGIN;
+INSERT INTO tbl1 SELECT generate_series(1, 10);
+PREPARE TRANSACTION 'test_prepare';
+]);
+
 # Start subscriber
 $node_s->start;

+# Verify that the prepared transaction is replicated to the subscriber
+my $count_prepared_s =
+  $node_s->safe_psql($db1, "SELECT count(*) FROM pg_prepared_xacts;");
+
+is($count_prepared_s, qq(1), 'Prepared transaction replicated to subscriber');

Regards,
Vignesh

Re: FileFallocate misbehaving on XFS

2024-12-11 Thread Michael Harris

Hi Andres

On Thu, 12 Dec 2024 at 10:50, Andres Freund  wrote:
> Just to make sure - you're absolutely certain that you actually have space at
> the time of the errors?

As sure as I can be. The RHEL8 system that I took prints from
yesterday has > 1.5TB free. I can't see it varying by that much.

It does look as though the system needs to be quite full to provoke
this problem. The systems I have looked at so far have >90% full
filesystems.

Another interesting snippet: the application has a number of ETL
workers going at once. The actual number varies depending on a number
of factors but might be somewhere from 10 - 150. Each worker will have
a single postgres backend that they are feeding data to.

At the time of the error, it is not the case that all ETL workers
strike it at once - it looks like a lot of the time only a single
worker is affected, or at most a handful of workers. I can't see for
sure what the other workers were doing at the time, but I would expect
they were all importing data as well.

> If I were to provide you with a patch that showed the amount of free disk
> space at the time of an error, the size of the relation etc, could you
> reproduce the issue with it applied? Or is that unrealistic?

I have not been able to reproduce it on demand, and so far it has only
happened in production systems.

As long as the patch doesn't degrade normal performance it should be
possible to deploy it to one of the systems that is regularly
reporting the error, although it might take a while to get approval to
do that.

Cheers
Mike

Re: Add Postgres module info

Michael Paquier  writes:
> On Wed, Dec 11, 2024 at 08:34:28PM -0500, Tom Lane wrote:
>> What would you foresee as the SQL API for inspecting a module that's
>> not tied to an extension?

> Rather than a function that can be called with a specific module name
> in input, invent a new system SRF function that would report back for
> a process all the libraries that have been loaded in it?

Yeah, that could work.

> Presumably,
> the extra tracking can be done in dfmgr.c with more fields added to
> DynamicFileList to track the information involved.

I wouldn't add any overhead to the normal case for this.  Couldn't
we walk the list and re-fetch each module's magic block inside
this new function?

regards, tom lane

Re: Remove useless GROUP BY columns considering unique index

On 12/12/24 10:09, David Rowley wrote:

On Mon, 2 Dec 2024 at 17:18, Andrei Lepikhov wrote:

Patch 0002 looks helpful and performant. I propose to check 'relid > 0'
to avoid diving into 'foreach(lc, parse->rtable)' at all if nothing has
been found.

I did end up adding another fast path there, but I felt like checking
relid > 0 wasn't as good as it could be as that would have only
short-circuited when we don't see any Vars of level 0 in the GROUP BY.
It seemed cheap enough to short-circuit when none of the relations
mentioned in the GROUP BY have multiple columns mentioned.

Your solution seems much better my proposal. Thanks to apply it!

when how do you decide if the GROUP BY should become t1.a,t1.b or
t2.x,t2.y? It's not clear to me that using t1's columns is always
better than using t2's. I imagine using a mix is never better, but I'm
unsure how you'd decide which ones to use.
Depends on how to calculate that 'better'. Right now, GROUP-BY employs
two strategies to reduce path cost: 1) ORDER-BY statement (avoid final
sorting); 2) To fit incoming subtree pathkeys (avoid grouping presorting).
My idea comes close with [1], where the cost depends on the estimated
number of groups in the first grouping column because cost_sort predicts
the number of comparison operator calls based on statistics. In this
case, the choice between (x,y) and (a,b) will depend on the ndistinct of
'x' and 'a'.
In general, it was the idea to debate, more for further development than
for the patch in this thread.

[1] Consider the number of columns in the sort cost model
https://www.postgresql.org/message-id/flat/8742aaa8-9519-4a1f-91bd-364aec65f5cf%40gmail.com

--
regards, Andrei Lepikhov

Re: Add Postgres module info


On 12/12/24 10:44, Michael Paquier wrote:

On Wed, Dec 11, 2024 at 10:39:38PM -0500, Tom Lane wrote:

Michael Paquier  writes:

Presumably,
the extra tracking can be done in dfmgr.c with more fields added to
DynamicFileList to track the information involved.


I wouldn't add any overhead to the normal case for this.  Couldn't
we walk the list and re-fetch each module's magic block inside
this new function?


Depends on how much we should try to cache to make that less expensive
on repeated calls because we cannot unload libraries, but sure, I
don't see why we could not that for each SQL function call to simplify
the logic and the structures in place.
I want to say that 'cannot unload libraries' is a negative outcome of 
the architecture. It would be better to invent something like 
PG_unregister, allowing libraries to at least return a hook routine call 
back to the system.
So, maybe it makes sense to design this feature with re-fetching 
libraries, supposing it is already implemented somehow and elements of 
the DynamicFileList may be removed.


--
regards, Andrei Lepikhov

Re: Track the amount of time waiting due to cost_delay

2024-12-11 Thread Bertrand Drouvot

Hi,

On Wed, Dec 11, 2024 at 10:40:04AM -0600, Nathan Bossart wrote:
> That's roughly what I had in mind.

Thanks for confirming, done that way in v11 attached.

> I'd vote for leaving that out, if for no other reason than it can be
> deduced from the rest of the description.

Yeah, fair enough.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 35e6075791000498ed05a7eb62fd34616957c4ce Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Mon, 24 Jun 2024 08:43:26 +
Subject: [PATCH v11] Report the total amount of time that vacuum has been
 delayed due to cost delay

This commit adds one column: total_delay to the pg_stat_progress_vacuum system
view to show the total accumulated time spent sleeping due to the cost-based
vacuum delay settings.

This uses the new parallel message type for progress reporting added
by f1889729dd.

In case of parallel worker, to avoid the leader to be interrupted too frequently
(while it might be sleeping for cost delay), the report is done only if the last
report has been done more than 1 second ago.

Having a time based only approach to throttle the reporting of the parallel
workers sounds reasonable.

Indeed when deciding about the throttling:

1. The number of parallel workers should not come into play:

 1.1) the more parallel workers is used, the less the impact of the leader on
 the vacuum index phase duration/workload is (because the repartition is done
 on more processes).

 1.2) the less parallel workers is, the less the leader will be interrupted (
 less parallel workers would report their delayed time).

2. The cost limit should not come into play as that value is distributed
proportionally among the parallel workers (so we're back to the previous point).

3. The cost delay does not come into play as the leader could be interrupted at
the beginning, the midle or whatever part of the wait and we are more interested
about the frequency of the interrupts.

3. A 1 second reporting "throttling" looks a reasonable threshold as:

 3.1 the idea is to have a significant impact when the leader could have been
interrupted say hundred/thousand times per second.

 3.2 it does not make that much sense for any tools to sample pg_stat_progress_vacuum
multiple times per second (so a one second reporting granularity seems ok).

 3.3 the total_display column is an interval data type

XXX: Would need to bump catversion because this changes the definition of
pg_stat_progress_vacuum.
---
 doc/src/sgml/monitoring.sgml  | 14 
 src/backend/catalog/system_views.sql  |  3 +-
 src/backend/commands/vacuum.c | 49 +++
 src/backend/commands/vacuumparallel.c |  7 
 src/include/commands/progress.h   |  1 +
 src/include/commands/vacuum.h |  1 +
 src/test/regress/expected/rules.out   |  3 +-
 7 files changed, 76 insertions(+), 2 deletions(-)
  22.6% doc/src/sgml/
   5.0% src/backend/catalog/
  63.5% src/backend/commands/
   3.3% src/include/commands/
   5.3% src/test/regress/expected/

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 840d7f8161..995a35618d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6428,6 +6428,20 @@ FROM pg_stat_get_backend_idset() AS backendid;
cleaning up indexes.

+
+ 
+  
+   total_delay interval
+  
+  
+   Total accumulated time spent sleeping due to the cost-based vacuum
+   delay settings (e.g., ,
+   ).  This includes the time that
+   any associated parallel workers have slept, too. However, parallel workers
+   report their sleep time no more frequently than once per second, so the
+   reported value may be slightly stale.
+  
+ 

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index da9a8fe99f..494b2e348d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1222,7 +1222,8 @@ CREATE VIEW pg_stat_progress_vacuum AS
 S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
 S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes,
 S.param8 AS num_dead_item_ids, S.param9 AS indexes_total,
-S.param10 AS indexes_processed
+S.param10 AS indexes_processed,
+make_interval(secs => S.param11 / 1000) AS total_delay
 FROM pg_stat_get_progress_info('VACUUM') AS S
 LEFT JOIN pg_database D ON S.datid = D.oid;

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index bb639ef51f..e01444b417 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -39,6 +39,7 @@
 #include "catalog/pg_inherits.h"
 #include "commands/cluster.h"
 #include "commands/defrem.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -59,6 +60

Re: Pass ParseState as down to utility functions.

On Thu, Dec 12, 2024 at 10:29 AM Michael Paquier  wrote:
>
> I think that the patch should be split a bit more.  Even if different
> areas of the code are touched, there's more than one idea of how to
> create this information for the error position, so each piece could be
> committed separately with changes showing up as diffs in the
> regression tests.
> --

I've split it into two patches, one for CREATE DOMAIN only. one for
another DDL command.

0001:
I am using DefineDomain(ParseState *pstate, CreateDomainStmt *stmt) for now.
we can also pass querystring or another struct.

0002:
passing AlterTableUtilityContext for some ALTER TABLE subroutine.
add parser_errposition to some existing `ereport(ERROR` places.
---
you mentioned ALTER DOMAIN, I have further simplified it at
https://postgr.es/m/cacjufxg0n_wlfk-nc_k5w6vv26qlvxupbhvnkktc2npftjq...@mail.gmail.com


create domain d_fail as int constraint cc REFERENCES this_table_not_exists(i);
like this command will fail, so we don't need to change
create_domain.sgml synopsis section
?
From 2f28cc9b4f605b981801a307e44de6cbc9f504e5 Mon Sep 17 00:00:00 2001
From: jian he 
Date: Thu, 12 Dec 2024 12:15:16 +0800
Subject: [PATCH v11 2/2] print out the error position for some DDL commands.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This can be particularly helpful when working with a sequence of DML commands,
such as `create schema create schema_element`.
It also makes it easier to quickly identify the relevant error area in a single DDL command

these DDL commands will printout error position:
ALTER TABLE ALTER COLUMN SET DATA TYPE;
CREATE TABLE OF type;
ALTER TABLE ALTER COLUMN ADD GENERATED ALWAYS AS IDENTITY;
CREATE TYPE LIKE type.
CREATE TABLE while specifying conflict constraint.

Author: Kirill Reshke 
Author: Jian He 
Reviewed-By: Michaël Paquier 
Reviewed-By: Álvaro Herrera 

discussion: https://postgr.es/m/CALdSSPhqfvKbDwqJaY=yeepi_aq61gmmpw88i6zh7cmg_2z...@mail.gmail.com
---
 src/backend/commands/tablecmds.c  | 43 +++
 src/backend/commands/typecmds.c   |  2 +-
 src/backend/parser/parse_utilcmd.c| 21 +++
 src/test/regress/expected/alter_table.out |  6 
 src/test/regress/expected/constraints.out | 14 
 src/test/regress/expected/float8.out  |  2 ++
 src/test/regress/expected/identity.out|  2 ++
 src/test/regress/expected/typed_table.out |  4 +++
 8 files changed, 72 insertions(+), 22 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6ccae4cb4a..efa38b1470 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -593,7 +593,8 @@ static void ATPrepAlterColumnType(List **wqueue,
   AlterTableUtilityContext *context);
 static bool ATColumnChangeRequiresRewrite(Node *expr, AttrNumber varattno);
 static ObjectAddress ATExecAlterColumnType(AlteredTableInfo *tab, Relation rel,
-		   AlterTableCmd *cmd, LOCKMODE lockmode);
+		   AlterTableCmd *cmd, LOCKMODE lockmode,
+		   AlterTableUtilityContext *context);
 static void RememberAllDependentForRebuilding(AlteredTableInfo *tab, AlterTableType subtype,
 			  Relation rel, AttrNumber attnum, const char *colName);
 static void RememberConstraintForRebuilding(Oid conoid, AlteredTableInfo *tab);
@@ -639,7 +640,9 @@ static ObjectAddress ATExecAddInherit(Relation child_rel, RangeVar *parent, LOCK
 static ObjectAddress ATExecDropInherit(Relation rel, RangeVar *parent, LOCKMODE lockmode);
 static void drop_parent_dependency(Oid relid, Oid refclassid, Oid refobjid,
    DependencyType deptype);
-static ObjectAddress ATExecAddOf(Relation rel, const TypeName *ofTypename, LOCKMODE lockmode);
+static ObjectAddress ATExecAddOf(Relation rel, const TypeName *ofTypename,
+ LOCKMODE lockmode,
+ AlterTableUtilityContext *context);
 static void ATExecDropOf(Relation rel, LOCKMODE lockmode);
 static void ATExecReplicaIdentity(Relation rel, ReplicaIdentityStmt *stmt, LOCKMODE lockmode);
 static void ATExecGenericOptions(Relation rel, List *options);
@@ -5413,7 +5416,7 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			break;
 		case AT_AlterColumnType:	/* ALTER COLUMN TYPE */
 			/* parse transformation was done earlier */
-			address = ATExecAlterColumnType(tab, rel, cmd, lockmode);
+			address = ATExecAlterColumnType(tab, rel, cmd, lockmode, context);
 			break;
 		case AT_AlterColumnGenericOptions:	/* ALTER COLUMN OPTIONS */
 			address =
@@ -5537,7 +5540,7 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 			address = ATExecDropInherit(rel, (RangeVar *) cmd->def, lockmode);
 			break;
 		case AT_AddOf:
-			address = ATExecAddOf(rel, (TypeName *) cmd->def, lockmode);
+			address = ATExecAddOf(rel, (TypeName *) cmd->def, lockmode, context);
 			break;
 		case AT_DropOf:
 			ATExecDropOf(rel, lockmode);
@@ -13218,10 +13221,12 @@ ATPrepAlterColumnType(List **wq

Re: Skip collecting decoded changes of already-aborted transactions

2024-12-11 Thread Dilip Kumar

On Thu, Dec 12, 2024 at 11:08 AM Amit Kapila  wrote:
>
> On Wed, Dec 11, 2024 at 8:21 AM Dilip Kumar  wrote:
> >
> > On Wed, Dec 11, 2024 at 3:18 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar  wrote:
> >
> > > > >
> > > > > If the largest transaction is non-streamable, won't the transaction
> > > > > returned by ReorderBufferLargestTXN() in the other case already
> > > > > suffice the need?
> > > >
> > > > I see your point, but I don’t think it’s quite the same. When
> > > > ReorderBufferCanStartStreaming() is true, the function
> > > > ReorderBufferLargestStreamableTopTXN() looks for the largest
> > > > transaction among those that have a base_snapshot. So, if the largest
> > > > transaction is aborted but hasn’t yet received a base_snapshot, it
> > > > will instead select the largest transaction that does have a
> > > > base_snapshot, which could be significantly smaller than the largest
> > > > aborted transaction.
> > >
> > > IIUC the transaction entries in reorderbuffer have the base snapshot
> > > before decoding the first change (see SnapBuildProcessChange()). In
> > > which case the transaction doesn't have the base snapshot and has the
> > > largest amount of changes? Subtransaction entries could transfer its
> > > base snapshot to its parent transaction entry but such subtransactions
> > > will be picked by ReorderBufferLargestTXN().
> > >
> > IIRC, there could be cases where reorder buffers of transactions can
> > grow in size without having a base snapshot, I think transactions
> > doing DDLs and generating a lot of INVALIDATION messages could fall in
> > such a category.
> >
>
> Are we recording such changes in the reorder buffer? If so, can you
> please share how?

xact_decode, do add the XLOG_XACT_INVALIDATIONS in the reorder buffer
and for such changes we don't call SnapBuildProcessChange() that means
it is possible to collect such changes in reorder buffer without
setting the base_snapshot

  AFAICU, the main idea behind skipping aborts is to
> avoid sending a lot of data to the client that later needs to be
> discarded or cases where we spent resources/time spilling the changes
> that later need to be discarded. In that vein, the current idea of the
> patch where it truncates and skips aborted xacts before streaming or
> spilling them sounds reasonable.

I believe in one of my previous responses (a few emails above), I
agreed that it's a reasonable goal to check for aborted transactions
just before spilling or streaming, and if we detect an aborted
transaction, we can avoid streaming/spilling and simply discard the
changes. However, I wanted to make a point that if we have a large
aborted transaction without a base snapshot (assuming that's
possible), we might end up streaming many small transactions to stay
under the memory limit. Even though we try to stay within the limit,
we still might not succeed because the main issue is the large aborted
transaction, which doesn't have a base snapshot.

So, instead of streaming many small transactions, if we had selected
the largest transaction first and checked if it was aborted, we could
have avoided streaming all those smaller transactions. I agree this is
a hypothetical scenario and may not be worth optimizing, and that's
completely fair. I just wanted to clarify the point I raised when I
first started reviewing this patch.

I haven't tried it myself, but I believe this scenario could be
created by starting a transaction that performs multiple DDLs and then
ultimately gets aborted.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Difference in dump from original and restored database due to NOT NULL constraints on children

2024-12-11 Thread Ashutosh Bapat

Hi Alvaro,

On Thu, Nov 28, 2024 at 4:47 PM Ashutosh Bapat
 wrote:
>
> On Thu, Nov 28, 2024 at 4:44 PM Ashutosh Bapat
>  wrote:
> >
> > On Wed, Nov 27, 2024 at 7:04 PM Alvaro Herrera  
> > wrote:
> > >
> > > On 2024-Nov-27, Ashutosh Bapat wrote:
> > >
> > > > I noticed that. But two reasons why I chose the backend changes
> > > > 1. The comment where we add explicit ADD CONSTRAINT is
> > > > /*
> > > > * Dump additional per-column properties that we can't handle in the
> > > > * main CREATE TABLE command.
> > > > */
> > > > ... snip
> > > >
> > > > /*
> > > > * If we didn't dump the column definition explicitly above, and
> > > > * it is not-null and did not inherit that property from a parent,
> > > > * we have to mark it separately.
> > > > */
> > > > if (!shouldPrintColumn(dopt, tbinfo, j) &&
> > > > tbinfo->notnull_constrs[j] != NULL &&
> > > > (tbinfo->notnull_islocal[j] && !tbinfo->ispartition && 
> > > > !dopt->binary_upgrade))
> > > > ... snip
> > > >
> > > > The comment seems to say that we can not handle the NOT NULL
> > > > constraint property in the CREATE TABLE command. Don't know why. We
> > > > add CHECK constraints separately in CREATE TABLE even if we didn't add
> > > > corresponding columns in CREATE TABLE. So there must be a reason not
> > > > to dump NOT NULL constraints that way and hence we required separate
> > > > code like above. I am afraid going that direction will show us some
> > > > other problems.
> > >
> > > I don't think this is an important restriction.  We can change that, as
> > > long as all cases work correctly.  We previously didn't try to use
> > > "CONSTRAINT foobar NOT NULL a" because 1) we didn't support the
> > > table-constraint syntax for not-null constraint and 2) not-null
> > > constraint didn't support names anyway.  We now support that syntax, so
> > > we can use it.
> > >
> >
> > Ok. Here's the patch implementing the same. As you said, it's a much
> > simpler patch. The test being developed in [1] passes with this
> > change. pg_dump and pg_upgrade test suites also pass.
> >
> > [1] 
> > https://www.postgresql.org/message-id/flat/CAExHW5uvx2LEyrUBdctV5gS25Zeb%2B-eXESkK93siQxWSjYFy6A%40mail.gmail.com#c8ed57b77d2f6132d5b8e1ecb2a8c47b
> >
> > Adding this to CF for CI run.
>
> CF entry: https://commitfest.postgresql.org/51/5408/
>
> --
> Best Wishes,
> Ashutosh Bapat

I looked at the patch again. Here are notes from self-review
1. The code to properly format non-first attribute is related in both
if and else blocks
if (shouldPrintColumn(..))
{
 ...
}
else if ()
{

}
However, the code needs to be executed only when we are printing
something, so it can not be removed outside the if () else ()
structure to avoid duplication. Separating this small piece of code
into a function would add more lines of code than it would save. So I
have left it as is.

2. Improved the comment to make the purpose and context of the code clear.

3. With the code changes in the patch, we either print a local NOT
NULL constraint along with the attribute definition or as a separate
constraint in CREATE TABLE command. Non-local NOT NULL constraints
will be inherited from the parent. So there's no case where a NOT NULL
constraint would be lost. So it looks safe to remove the code to add
constraints through the ALTER TABLE command.

Attached updated patch. Once we commit this patch, I will be able to
proceed with the dump/restore test at [1].

[1] 
https://www.postgresql.org/message-id/flat/CAExHW5uvx2LEyrUBdctV5gS25Zeb%2B-eXESkK93siQxWSjYFy6A%40mail.gmail.com#c8ed57b77d2f6132d5b8e1ecb2a8c47b

-- 
Best Wishes,
Ashutosh Bapat
From 796c726ec44f75fb02637aa2488c078f6df5149e Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 28 Nov 2024 16:21:42 +0530
Subject: [PATCH] Dumping local NOT NULL constraints on non-local columns

A child table is dumped as CREATE TABLE ... INHERITS command. A local NOT NULL
constraint on a non-local column is printed separately as a subsequent ALTER
TABLE ...  command. When restored, this constraint inherits the name of the
corresponding parent constraint since the constraint name, if mentioned, in the
ALTER TABLE command is ignored. We end up losing the name of child constraint in
the restored database. Instead dump them as part of the CREATE TABLE command
itself so that their given or default name is preserved.

Author: Ashutosh Bapat
Reviewed-by: Alvaro Herrera
Discussion: https://www.postgresql.org/message-id/flat/CAExHW5tbdgAKDfqjDJ-7Fk6PJtHg8D4zUF6FQ4H2Pq8zK38Nyw%40mail.gmail.com
---
 src/bin/pg_dump/pg_dump.c | 50 +--
 1 file changed, 27 insertions(+), 23 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 89276524ae0..a422421b6e7 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -16220,6 +16220,33 @@ dumpTableSchema(Archive *fout, const TableInfo *tbinfo)
 			  fmtQualifiedDumpable(coll));
 	}
 }
+
+/*
+ * Add local NOT NULL constraint

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-11 Thread Shubham Khanna

On Thu, Dec 12, 2024 at 6:04 AM Peter Smith  wrote:
>
> Hi Shubham,
>
> Here are some review comments for the patch v4-0001.
>
> ==
> GENERAL.
>
> 1.
> After reading Vignesh's last review and then the pg_createsubscriber
> documentation I see there can be multiple databases simultaneously
> specified (by writing multiple -d switches) and in that case each one
> gets its own:
> --publication
> --replication-slot
> --subscription
>
> OTOH, this new '--enable-two-phase' switch is just a blanket two_phase
> enablement across all subscriptions. There is no way for the user to
> say if they want it enabled for some subscriptions (on some databases)
> but not for others. I suppose this was intentional and OK (right?),
> but this point needs to be clarified in the docs.
>

Fixed.

> ==
> doc/src/sgml/ref/pg_createsubscriber.sgml
>
> 2.
> +  
> +   Enables  linkend="sql-createsubscription-params-with-two-phase">two_phase
> +   commit for the subscription. The default is false.
> +  
>
> Following on from the previous review comment. Since this switch is
> applied to all database subscriptions I think the help needs to say
> that. Something like.
>
> "If there are multiple subscriptions specified, this option applies to
> all of them."
>
> ~~~
>

Fixed.

> 3.
> In the "Prerequisites" sections of the docs, it gives requirements for
> various GUC settings on the source server and the target server. So,
> should there be another note here advising to configure the
> 'max_prepared_transactions' GUC when the '--enable-two-phase' is
> specified?
>
> ~~~
>

Fixed.

> 4. "Warnings" section includes the following:
>
> pg_createsubscriber sets up logical replication with two-phase commit
> disabled. This means that any prepared transactions will be replicated
> at the time of COMMIT PREPARED, without advance preparation. Once
> setup is complete, you can manually drop and re-create the
> subscription(s) with the two_phase option enabled.
>
> ~
>
> The above text is from the "Warnings" section, but it seems stale
> information that needs to be updated due to the introduction of this
> new '--enable-two-phase' option.
>

Fixed.

> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> usage:
> 5.
>   printf(_("  -t, --recovery-timeout=SECS seconds to wait for
> recovery to end\n"));
> + printf(_("  -T, --enable-two-phase  enable two-phase commit
> for the subscription\n"));
>
> Given the previous review comments (#1/#2 etc), I was wondering if it
> might be better to say more like "enable two-phase commit for all
> subscriptions".
>

Fixed.

> ==
> .../t/040_pg_createsubscriber.pl
>
> 6.
> +is($count_prepared_s, qq(1), 'Prepared transaction replicated to 
> subscriber');
>
> Should there also have been an earlier check (way back before the
> PREPARE) just to make sure this count was initially 0?
>

Removed this and added a new test case instead of this.

The attached patch contains the suggested changes.

Thanks and regards,
Shubham Khanna.


v5-0001-Add-support-for-two-phase-commit-in-pg_createsubs.patch
Description: Binary data

Re: Adding a '--two-phase' option to 'pg_createsubscriber' utility.

2024-12-11 Thread Shubham Khanna

On Thu, Dec 12, 2024 at 8:14 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Shubham,
>
> > Thank you for pointing this out and for suggesting the changes. I
> > agree with your approach.
> > Also, I found a mistake in getopt_long and fixed it in this version of
> > the patch.
> > The attached patch contains the suggested changes.
>
> Thanks for updating the patch. I think the patch looks mostly OK.
>
> Regarding the test code - I think we should directly refer the 
> pg_subscription catalog,
> and confirm that subtwophase is 'p'. IIUC, we can wait until
> all subscriptions. Subtwophasestate are 'e', by using poll_query_until() and 
> [1].
>
> [1]: SELECT count(1) = 0 FROM pg_subscription WHERE subtwophasestate NOT IN 
> ('e');
>

I have fixed the given comment. The v5 version patch attached at [1]
has the changes for the same.
[1] - 
https://www.postgresql.org/message-id/CAHv8Rj%2Bhd2MTNRs4AsK6%3DhRqvV6JC9g2tTAJwGjrNfXg6vhD8g%40mail.gmail.com

Thanks and Regards,
Shubham Khanna.

Re: Skip collecting decoded changes of already-aborted transactions

2024-12-11 Thread Amit Kapila

On Wed, Dec 11, 2024 at 8:21 AM Dilip Kumar  wrote:
>
> On Wed, Dec 11, 2024 at 3:18 AM Masahiko Sawada  wrote:
> >
> > On Mon, Dec 9, 2024 at 10:19 PM Dilip Kumar  wrote:
>
> > > >
> > > > If the largest transaction is non-streamable, won't the transaction
> > > > returned by ReorderBufferLargestTXN() in the other case already
> > > > suffice the need?
> > >
> > > I see your point, but I don’t think it’s quite the same. When
> > > ReorderBufferCanStartStreaming() is true, the function
> > > ReorderBufferLargestStreamableTopTXN() looks for the largest
> > > transaction among those that have a base_snapshot. So, if the largest
> > > transaction is aborted but hasn’t yet received a base_snapshot, it
> > > will instead select the largest transaction that does have a
> > > base_snapshot, which could be significantly smaller than the largest
> > > aborted transaction.
> >
> > IIUC the transaction entries in reorderbuffer have the base snapshot
> > before decoding the first change (see SnapBuildProcessChange()). In
> > which case the transaction doesn't have the base snapshot and has the
> > largest amount of changes? Subtransaction entries could transfer its
> > base snapshot to its parent transaction entry but such subtransactions
> > will be picked by ReorderBufferLargestTXN().
> >
> IIRC, there could be cases where reorder buffers of transactions can
> grow in size without having a base snapshot, I think transactions
> doing DDLs and generating a lot of INVALIDATION messages could fall in
> such a category.
>

Are we recording such changes in the reorder buffer? If so, can you
please share how? AFAICU, the main idea behind skipping aborts is to
avoid sending a lot of data to the client that later needs to be
discarded or cases where we spent resources/time spilling the changes
that later need to be discarded. In that vein, the current idea of the
patch where it truncates and skips aborted xacts before streaming or
spilling them sounds reasonable.

-- 
With Regards,
Amit Kapila.

Re: Memory leak in WAL sender with pgoutput (v10~)

2024-12-11 Thread Amit Kapila

On Wed, Dec 11, 2024 at 11:09 AM Masahiko Sawada  wrote:
>
> On Tue, Dec 10, 2024 at 6:13 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Wednesday, December 11, 2024 2:14 AM Masahiko Sawada 
> >  wrote:
> > >
> > > On Tue, Dec 10, 2024 at 1:16 AM Amit Kapila 
> > > wrote:
> > > >
> > > > On Tue, Dec 10, 2024 at 11:24 AM Amit Kapila 
> > > wrote:
> > > > >
> > > > > On Tue, Dec 10, 2024 at 8:54 AM vignesh C 
> > > wrote:
> > > > > >
> > > > > > On Tue, 10 Dec 2024 at 04:56, Michael Paquier 
> > > wrote:
> > > > > > >
> > > > > > > On Mon, Dec 09, 2024 at 03:36:15PM +0530, Amit Kapila wrote:
> > > > > > > > It couldn't solve the problem completely even in back-branches. 
> > > > > > > > The
> > > > > > > > SQL API case I mentioned and tested by Hou-San in the email [1]
> > > won't
> > > > > > > > be solved.
> > > > > > > >
> > > > > > > > [1] -
> > > https://www.postgresql.org/message-id/OS0PR01MB57166A4DA0ABBB94F
> > > 2FBB28694362%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> > > > > > >
> > > > > > > Yeah, exactly (wanted to reply exactly that yesterday but lacked 
> > > > > > > time,
> > > > > > > thanks!).
> > > > > >
> > > > > > Yes, that makes sense. How about something like the attached patch.
> > > > > >
> > > > >
> > > > > - oldctx = MemoryContextSwitchTo(CacheMemoryContext);
> > > > > - if (data->publications)
> > > > > - {
> > > > > - list_free_deep(data->publications);
> > > > > - data->publications = NIL;
> > > > > - }
> > > > > + static MemoryContext pubctx = NULL;
> > > > > +
> > > > > + if (pubctx == NULL)
> > > > > + pubctx = AllocSetContextCreate(CacheMemoryContext,
> > > > > +"logical replication publication list context",
> > > > > +ALLOCSET_SMALL_SIZES);
> > > > > + else
> > > > > + MemoryContextReset(pubctx);
> > > > > +
> > > > > + oldctx = MemoryContextSwitchTo(pubctx);
> > > > >
> > > > > Considering the SQL API case, why is it okay to allocate this context
> > > > > under CacheMemoryContext?
> > > > >
> > > >
> > > > On further thinking, we can't allocate it under
> > > > LogicalDecodingContext->context because once that is freed at the end
> > > > of SQL API pg_logical_slot_get_changes(), pubctx will be pointing to a
> > > > dangling memory. One idea is that we use
> > > > MemoryContextRegisterResetCallback() to invoke a reset callback
> > > > function where we can reset pubctx but not sure if we want to go there
> > > > in back branches. OTOH, the currently proposed fix won't leak memory
> > > > on repeated calls to pg_logical_slot_get_changes(), so that might be
> > > > okay as well.
> > > >
> > > > Thoughts?
> > >
> > > Alternative idea is to declare pubctx as a file static variable. And
> > > we create the memory context under LogicalDecodingContext->context in
> > > the startup callback and free it in the shutdown callback.
> >
> > I think when an ERROR occurs during the execution of the 
> > pg_logical_slot_xx()
> > API, the shutdown callback function is not invoked. This would result in the
> > static variable not being reset, which, I think, is why Amit mentioned the 
> > use
> > of MemoryContextRegisterResetCallback().
>
> My idea is that since that new context is cleaned up together with its
> parent context (LogicalDecodingContext->context), we unconditionally
> set that new context to the static variable at the startup callback.
> That being said, Amit's idea would be cleaner.
>

Your preference is not completely clear. Are you okay with the idea of
Vignesh's currently proposed patch for back-branches, or do you prefer
to use a memory context reset callback, or do you have a different
idea that should be adopted for back-branches?

-- 
With Regards,
Amit Kapila.

Re: Add Postgres module info

On Wed, Dec 11, 2024 at 08:34:28PM -0500, Tom Lane wrote:
> "Euler Taveira"  writes:
>> +1 for the general idea. I received some reports like [1] related to wal2json
>> that people wants to obtain the output plugin version. Since it is not 
>> installed
>> via CREATE EXTENSION, it is not possible to detect what version is installed,
>> hence, some tools cannot have some logic to probe the module version. In a
>> managed environment, it is hard to figure out the exact version for
>> non-CREATE-EXTENSION modules, unless it is explicitly informed by the vendor.
> 
> What would you foresee as the SQL API for inspecting a module that's
> not tied to an extension?

Rather than a function that can be called with a specific module name
in input, invent a new system SRF function that would report back for
a process all the libraries that have been loaded in it?  Presumably,
the extra tracking can be done in dfmgr.c with more fields added to
DynamicFileList to track the information involved.

Being able to print the information of DynamicFileList can be argued
as useful on its own as long as its execution can be granted, with
superuser right by default.
--
Michael


signature.asc
Description: PGP signature

Re: pg_createsubscriber TAP test wrapping makes command options hard to read.