Re: Reduce "Var IS [NOT] NULL" quals during constant folding
On Sat, Mar 22, 2025 at 2:21 AM Tom Lane wrote:
> Ugh, no, that is *completely* unworkable. Suppose that the user
> does CREATE VIEW, and the parse tree recorded for that claims that
> column X is not-nullable. Then the user drops the not-null
> constraint, and then asks to execute the view. We'll optimize on
> the basis of stale information.

Thanks for pointing this out.

> The way to make this work is what I said before: move the planner's
> collection of relation information to somewhere a bit earlier in
> the planner. But not to outside the planner.

I'm considering moving the collection of attnotnull information to before pull_up_sublinks, in hopes of leveraging this info to pull up NOT IN in the future, something like attached. Maybe we could also collect the attgenerated information in the same routine, making life easier for expand_virtual_generated_columns.

Another issue I found is that in convert_EXISTS_to_ANY, we pass the parent's root to eval_const_expressions, which can cause problems when reducing "Var IS [NOT] NULL" quals. To fix this, the v2 patch constructs a dummy PlannerInfo with its "glob" link set to the parent's and its "parse" link set to the subquery. I believe these are the only fields used.

Thanks
Richard

v2-0001-Reduce-Var-IS-NOT-NULL-quals-during-constant-folding.patch
Description: Binary data
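[For reference, a minimal sketch of the dummy-PlannerInfo approach described above, assuming, as the message does, that eval_const_expressions() touches nothing besides "glob" and "parse":]

    /*
     * Sketch only: a throwaway PlannerInfo for folding the subquery's
     * expressions in convert_EXISTS_to_ANY.
     */
    PlannerInfo subroot;

    MemSet(&subroot, 0, sizeof(subroot));
    subroot.type = T_PlannerInfo;
    subroot.glob = root->glob;      /* share the parent's global state */
    subroot.parse = subselect;      /* the subquery under conversion */

    whereClause = eval_const_expressions(&subroot, whereClause);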
Re: Add Postgres module info
On 3/22/25 23:49, Tom Lane wrote:
> I spent awhile reviewing the v5 patch, and here's a proposed v6. Some notes:
>
> * I didn't like depending on offsetof(Pg_magic_struct, module_extra) to
> determine which parts of the struct are checked for compatibility. It just
> seems way too easy to break that with careless insertion of new fields, and
> such breakage might not cause obvious failures. I think the right thing is
> to break out the ABI-checking fields as their own sub-struct, rather than
> breaking out the new fields as a sub-struct.

Agree. It is a clear approach. I like it.

> * I renamed the inquiry function to pg_get_loaded_modules, since it only
> works on loaded modules but that's hardly clear from the previous name.

+1

> * It is not clear to me what permission restrictions we should put on
> pg_get_loaded_modules, ...

I vote for the idea of stripping the full path to just a filename. My initial use cases were:
1. A user reports an issue and needs to provide me with all the modules loaded at the moment of query execution. Requiring higher privileges means administrative procedures, which take a long time and are not always possible.
2. A module needs to detect another loaded module. It is not a frequent case so far, but concurrency on queryId with pg_stat_statements is at least one of my examples that happens sometimes.

Also, permissions here should be in agreement with permissions on pg_available_extensions(), right?

> * I didn't like anything about the test setup. ...

Ok, thanks. I just played with alternatives.

> Still TBD:
>
> * I'm not happy with putting pg_get_loaded_modules into dfmgr.c. It feels
> like the wrong layer to have a SQL-callable function, and the large
> expansion in its #include list is evidence that we're adding functionality
> that doesn't belong there. But I'm not quite sure where to put it instead.
> Also, the naive way to do that would require exporting DynamicFileList
> which doesn't feel nice either. Maybe we could make dfmgr.c export some
> sort of iterator function?

I just attempted to reduce the number of exported objects here. If it is ok to introduce an iterator, pg_get_loaded_modules() may live in extension.c.

> * Should we convert our existing modules to use PG_MODULE_MAGIC_EXT? I'm
> mildly in favor of that, but I think we'd need some automated way to manage
> their version strings, and I don't know what that ought to look like.
> Maybe it'd be enough to make all the in-core modules use PG_VERSION as
> their version string, but I think that might put a dent in the idea of the
> version strings following semantic versioning rules.

Yes, the additional burden of bumping a version string was a stopper for me to propose such a brave idea.

--
regards, Andrei Lepikhov
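[A sketch of what adopting the extended magic block could look like; the field names follow the patch discussed in this thread, so treat them as illustrative rather than final:]

    #include "postgres.h"
    #include "fmgr.h"

    /* Replaces a plain PG_MODULE_MAGIC; name/version are this module's own. */
    PG_MODULE_MAGIC_EXT(
        .name = "my_module",
        .version = "1.2.3"
    );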
Fix infinite loop from setting scram_iterations to INT_MAX
Hi,

I stumbled upon a problem with the scram_iterations GUC where setting scram_iterations to INT_MAX and then creating a user causes the command to hang indefinitely.

postgres=# SET scram_iterations=2147483647;
SET
postgres=# CREATE ROLE maxscram WITH PASSWORD 'forever';

I looked into the relevant code and found the issue. Each SCRAM iteration after the first is done in a loop with the following condition:

    int i;
    ...
    for (i = 2; i <= iterations; i++)
    {
        ...
    }

For iterations = INT_MAX, the loop will never terminate since the condition is <= and adding 1 to INT_MAX will lead to i wrapping around to INT_MIN. I've fixed this by modifying the loop condition to be i < iterations. I've attached a patch with the fix.

I considered adding a test as well, but since generating a password with a high number of iterations is very time-consuming, I'm not sure if that would be practical. I also tried adding this to the current CommitFest, but my account hasn't passed the cooldown period yet.

Thanks,
Kevin

fix_max_scram_iterations.patch
Description: Binary data
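[A minimal sketch of an overflow-safe shape for that loop, assuming, as in the quoted code, that the first round is computed before the loop; counting from 1 keeps the total number of rounds equal to iterations while the counter never has to exceed INT_MAX - 1:]

    /*
     * Round 1 was computed above; perform the remaining iterations - 1
     * rounds. i never needs to be incremented past 'iterations', so no
     * wraparound is possible even for iterations == INT_MAX.
     */
    for (int i = 1; i < iterations; i++)
    {
        /* XOR-fold the next HMAC round into the result (details elided) */
    }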
Regression test postgres_fdw might fail due to autovacuum
Hello hackers,

A recent buildfarm failure [1] came with the following diagnostics:

 72/72 postgresql:postgres_fdw-running / postgres_fdw-running/regress   ERROR   19.04s   exit status 1

postgres_fdw-running/regress/results/postgres_fdw.out
--- /home/bf/bf-build/culicidae/HEAD/pgsql/contrib/postgres_fdw/expected/postgres_fdw.out 2025-03-11 15:21:27.681846597 +
+++ /home/bf/bf-build/culicidae/HEAD/pgsql.build/testrun/postgres_fdw-running/regress/results/postgres_fdw.out 2025-03-14 04:02:32.573999799 +
@@ -6392,6 +6392,7 @@
 UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
   c1  | c2 | c3  | c4 | c5 | c6 | c7  | c8
 ------+----+-----+----+----+----+-----+----
+ 2010 |  0 | bar |    |    |    | ft2 |
  2001 |  1 | bar |    |    |    | ft2 |
  2002 |  2 | bar |    |    |    | ft2 |
  2003 |  3 | bar |    |    |    | ft2 |
@@ -6401,7 +6402,6 @@
  2007 |  7 | bar |    |    |    | ft2 |
  2008 |  8 | bar |    |    |    | ft2 |
  2009 |  9 | bar |    |    |    | ft2 |
- 2010 |  0 | bar |    |    |    | ft2 |
 (10 rows)

EXPLAIN (verbose, costs off) shows that the UPDATE ... RETURNING query might be unstable due to lack of ORDER BY.

I managed to reproduce this failure locally with the attached patch applied to make the test repeatable and with:
sed 's/REGRESS = postgres_fdw.*/REGRESS = $(shell printf "postgres_fdw %.0s" `seq 50`)/' -i contrib/postgres_fdw/Makefile
(Running 10 instances of the test in parallel eases reproducing as well.)

I also added SELECT ctid, * FROM ft2; just after the query in question and found out that the results of the SELECT are more unstable, but for the UPDATE to produce the difference, the tuple with c1 == 2010 must reside on a new page. For example:

--- /home/vagrant/postgresql/contrib/postgres_fdw_2/expected/postgres_fdw.out 2025-03-22 05:21:39.414773284 +
+++ /home/vagrant/postgresql/contrib/postgres_fdw_2/results/postgres_fdw.out 2025-03-23 04:43:34.608281935 +
@@ -6399,6 +6399,7 @@
 UPDATE ft2 SET c3 = 'bar' WHERE postgres_fdw_abs(c1) > 2000 RETURNING *;
   c1  | c2 | c3  | c4 | c5 | c6 | c7  | c8
 ------+----+-----+----+----+----+-----+----
+ 2010 |  0 | bar |    |    |    | ft2 |
  2001 |  1 | bar |    |    |    | ft2 |
  2002 |  2 | bar |    |    |    | ft2 |
  2003 |  3 | bar |    |    |    | ft2 |
@@ -6408,7 +6409,6 @@
  2007 |  7 | bar |    |    |    | ft2 |
  2008 |  8 | bar |    |    |    | ft2 |
  2009 |  9 | bar |    |    |    | ft2 |
- 2010 |  0 | bar |    |    |    | ft2 |
 SELECT ctid, * FROM ft2;
@@ -6470,15 +6470,6 @@
  (0,103) | 57 | 407 | 00057_update7 | Fri Feb 27 00:00:00 1970 PST | Fri Feb 27 00:00:00 1970 | 7 | 7 | foo
  (0,104) | 67 | 407 | 00067_update7 | Mon Mar 09 00:00:00 1970 PST | Mon Mar 09 00:00:00 1970 | 7 | 7 | foo
  (0,105) | 77 | 407 | 00077_update7 | Thu Mar 19 00:00:00 1970 PST | Thu Mar 19 00:00:00 1970 | 7 | 7 | foo
- (0,106) |  9 | 509 | 9_update9     | Sat Jan 10 00:00:00 1970 PST | Sat Jan 10 00:00:00 1970 | 9 | ft2 | foo
- (0,107) | 19 | 509 | 00019_update9 | Tue Jan 20 00:00:00 1970 PST | Tue Jan 20 00:00:00 1970 | 9 | ft2 | foo
- (0,108) | 29 | 509 | 00029_update9 | Fri Jan 30 00:00:00 1970 PST | Fri Jan 30 00:00:00 1970 | 9 | ft2 | foo
- (0,109) | 39 | 509 | 00039_update9 | Mon Feb 09 00:00:00 1970 PST | Mon Feb 09 00:00:00 1970 | 9 | ft2 | foo
- (0,110) | 49 | 509 | 00049_update9 | Thu Feb 19 00:00:00 1970 PST | Thu Feb 19 00:00:00 1970 | 9 | ft2 | foo
- (0,111) | 59 | 509 | 00059_update9 | Sun Mar 01 00:00:00 1970 PST | Sun Mar 01 00:00:00 1970 | 9 | ft2 | foo
- (0,112) | 69 | 509 | 00069_update9 | Wed Mar 11 00:00:00 1970 PST | Wed Mar 11 00:00:00 1970 | 9 | ft2 | foo
- (0,113) | 79 | 509 | 00079_update9 | Sat Mar 21 00:00:00 1970 PST | Sat Mar 21 00:00:00 1970 | 9 | ft2 | foo
- (0,114) | 89 | 509 | 00089_update9 | Tue Mar 31 00:00:00 1970 PST | Tue Mar 31 00:00:00 1970 | 9 | ft2 | foo
  (1,1)   | 98 |   8 | 00098         | Thu Apr 09 00:00:00 1970 PST | Thu Apr 09 00:00:00 1970 | 8 | 8 | foo
  (1,3)   | 100 |  0 | 00100         | Thu Jan 01 00:00:00 1970 PST | Thu Jan 01 00:00:00 1970 | 0 | 0 | foo
  (1,4)   | 101 |  1 | 00101         | Fri Jan 02 00:00:00 1970 PST | Fri Jan 02 00:00:00 1970 | 1 | 1 | foo
...
- (12,137) | 2007 | 7 | bar |  |  |  | ft2 |
- (12,138) | 2008 | 8 | bar |  |  |  | ft2 |
- (12,139) | 2009 | 9 | bar |  |  |  | ft2 |
- (12,140) | 2010 | 0 | bar |
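[Since RETURNING makes no ordering promise, one conventional way to stabilize such a test, sketched here rather than taken from the attached patch, is to reorder the returned rows explicitly:]

    WITH updated AS (
        UPDATE ft2 SET c3 = 'bar'
        WHERE postgres_fdw_abs(c1) > 2000
        RETURNING *
    )
    SELECT * FROM updated ORDER BY c1;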
Re: Change log level for notifying hot standby is waiting non-overflowed snapshot
On 2025/03/21 21:29, torikoshia wrote:
> Hi,
>
> On 2025-03-21 02:15, Fujii Masao wrote:
>
> Thanks for your review!
>
>>> Personally, I feel the 1st patch may be sufficient, but I would appreciate any feedback.
>>
>> Agreed.
>
>>> -             errdetail("Consistent recovery state has not been yet reached.")));
>>> +             errdetail("Consistent recovery state has not been yet reached, or snapshot is pending because a subtransaction is overflowed."),
>>> +             errhint("In the latter case, find and close the transaction with more than %d subtransactions", PGPROC_MAX_CACHED_SUBXIDS)));
>>
>> This message might be too detailed. Instead, how about simplifying it to something like:
>> "Consistent recovery state has not been reached, or snapshot is not ready for hot standby."
>
> Agreed.
> Do you also think the errhint message is unnecessary? I agree with your idea to add a description of the overflowed subtransaction in the manual, but I'm not sure all users will be able to find it. Some people may not understand what needs to be done to make the snapshot ready for hot standby. I think adding an errhint may help those users.

I see your concern that users might overlook the documentation and struggle to find a solution. However, I still believe it's better to include this information in the documentation rather than logging it as a hint. Since the scenario where the hint would be useful is relatively rare, logging it every time might be more confusing than helpful.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
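[For concreteness, a sketch of how the simplified wording would read in the ereport call; the error code and errmsg are assumed from the existing connection-rejection path, not from the patch:]

    ereport(FATAL,
            (errcode(ERRCODE_CANNOT_CONNECT_NOW),
             errmsg("the database system is not yet accepting connections"),
             errdetail("Consistent recovery state has not been reached, or snapshot is not ready for hot standby.")));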
Re: Proposal - Allow extensions to set a Plan Identifier
>> but the opposite may be
>> not necessarily true: a query ID could be associated with multiple
>> plan patterns (aka multiple plan IDs). What this offers is that we
>> know about which plan one query is currently using in live, for a
>> given query ID.
>
> Okay, as I've said before, it seems practical. I just worry because I
> see that people don't understand that queryId is a more or less
> arbitrary value that may or may not group queries that differ only by
> constants to a single "Query Class". I think the same attitude will be
> paid to planId. It is okay for statistics. However, non-statistical
> extensions need more guarantees and want to put different logic into
> these values.
> For now, we have examples of only statistical extensions in open-source
> and may live with a proposed solution.

Statistical/monitoring reasons are an important use case: detecting plan changes, load attributed to a specific plan, etc. However, I also do see other extensions that could implement a plan_id that has meaning beyond statistical/monitoring purposes. Plan management/enforcement is another use case.

>>> So, it seems that extensions may conflict with the same field. Are we sure
>>> that won't happen if they are loaded in different orders in neighbouring
>>> backends?
>>
>> Depends on what kind of computation one wants to do, and surely
>> without some coordination across the extensions you are using these
>> cannot be shared, but it's no different from the concept of a query
>> ID.
>
> Hmm, queryId generation code lies in the core and we already came to
> terms that this field has only a statistical purpose. Do you want to
> commit planId generation code?

But extensions don't necessarily need to rely on the core queryId. They can generate their own queryId. We have it documented for compute_query_id as such [0]:

"Note that an external module can alternatively be used if the in-core query identifier computation method is not acceptable"

[0] https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-COMPUTE-QUERY-ID

--
Sami Imseih
Amazon Web Services (AWS)
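[As a sketch of what "generate their own queryId" can look like in practice; the hook is the standard post-parse-analysis one, while my_hash_sql() is a hypothetical stand-in for whatever hash the module prefers:]

    #include "postgres.h"
    #include "fmgr.h"
    #include "parser/analyze.h"

    PG_MODULE_MAGIC;

    static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

    /* hypothetical: any stable 64-bit hash of the query text would do */
    extern uint64 my_hash_sql(const char *sql);

    static void
    my_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
    {
        if (prev_post_parse_analyze_hook)
            prev_post_parse_analyze_hook(pstate, query, jstate);

        /* Overwrite the in-core query ID with this module's own computation. */
        query->queryId = my_hash_sql(pstate->p_sourcetext);
    }

    void
    _PG_init(void)
    {
        prev_post_parse_analyze_hook = post_parse_analyze_hook;
        post_parse_analyze_hook = my_post_parse_analyze;
    }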
Re: doc patch: wrong descriptions for dropping replication slots
On 2025/03/21 10:07, Hayato Kuroda (Fujitsu) wrote:
> Dear Fujii-san,
>
>> Unless there are any objections, I plan to push your patch with the following commit message and back-patch it to all supported versions. ...
>
> Thanks for updating the commit message. LGTM.

I've committed the patch with that message. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Re: AIO v2.5
Hi,

On 2025-03-22 17:20:56 -0700, Noah Misch wrote:
> On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote:
>> Not sure yet how to best disable testing io_uring in this case. We can't
>> just query EXEC_BACKEND from pg_config.h unfortunately. I guess making the
>> initdb not fail and checking the error log would work, but that doesn't
>> work nicely with Cluster.pm.
>
> How about "postgres -c io_method=io_uring -C <guc>":
>
> --- a/src/test/modules/test_aio/t/001_aio.pl
> +++ b/src/test/modules/test_aio/t/001_aio.pl
> @@ -29,7 +29,13 @@ $node_worker->stop();
>  # Test io_method=io_uring
>  ###
>
> -if ($ENV{with_liburing} eq 'yes')
> +sub have_io_uring
> +{
> +	local %ENV = $node_worker->_get_env();    # any node works
> +	return run_log [qw(postgres -c io_method=io_uring -C io_method)];
> +}
> +
> +if (have_io_uring())
>  {
>  	my $node_uring = create_node('io_uring');
>  	$node_uring->start();

Yea, that's a good idea. One thing that doesn't seem great is that it requires a prior node - what if we do -c io_method=invalid? That would report the list of valid GUC options, so we could just grep for io_uring.

It's too bad that postgres --describe-config a) doesn't report the possible enum values b) doesn't apply/validate -c options.

>> Subject: [PATCH v2.11 01/27] aio, bufmgr: Comment fixes
>
> Ready to commit, though other comment fixes might come up in later reviews.

I'll reorder it to a bit later in the series, to accumulate a few more.

> One idea so far is to comment on valid states after some IoMethodOps
> callbacks:
>
> --- a/src/include/storage/aio_internal.h
> +++ b/src/include/storage/aio_internal.h
> @@ -310,6 +310,9 @@ typedef struct IoMethodOps
>  	/*
>  	 * Start executing passed in IOs.
>  	 *
> +	 * Shall advance state to PGAIO_HS_SUBMITTED. (By the time this returns,
> +	 * other backends might have advanced the state further.)
> +	 *
>  	 * Will not be called if ->needs_synchronous_execution() returned true.
>  	 *
>  	 * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE.
> @@ -321,6 +324,12 @@ typedef struct IoMethodOps
>  	/*
>  	 * Wait for the IO to complete. Optional.
>  	 *
> +	 * On return, state shall be PGAIO_HS_COMPLETED_IO,
> +	 * PGAIO_HS_COMPLETED_SHARED or PGAIO_HS_COMPLETED_LOCAL. (The callback
> +	 * need not change the state if it's already one of those.) If state is
> +	 * PGAIO_HS_COMPLETED_IO, state will reach PGAIO_HS_COMPLETED_SHARED
> +	 * without further intervention.
> +	 *
>  	 * If not provided, it needs to be guaranteed that the IO method calls
>  	 * pgaio_io_process_completion() without further interaction by the
>  	 * issuing backend.

I think these are a good idea. I added those to the copy-edit patch, with a few more tweaks:

@@ -315,6 +315,9 @@ typedef struct IoMethodOps
 	/*
 	 * Start executing passed in IOs.
 	 *
+	 * Shall advance state to at least PGAIO_HS_SUBMITTED. (By the time this
+	 * returns, other backends might have advanced the state further.)
+	 *
 	 * Will not be called if ->needs_synchronous_execution() returned true.
 	 *
 	 * num_staged_ios is <= PGAIO_SUBMIT_BATCH_SIZE.
@@ -323,12 +326,24 @@ typedef struct IoMethodOps
 	 */
 	int			(*submit) (uint16 num_staged_ios, PgAioHandle **staged_ios);

-	/*
+	/* ---
 	 * Wait for the IO to complete. Optional.
 	 *
+	 * On return, state shall be one of
+	 *  - PGAIO_HS_COMPLETED_IO
+	 *  - PGAIO_HS_COMPLETED_SHARED
+	 *  - PGAIO_HS_COMPLETED_LOCAL
+	 *
+	 * The callback must not block if the handle is already in one of those
+	 * states, or has been reused (see pgaio_io_was_recycled()). If, on
+	 * return, the state is PGAIO_HS_COMPLETED_IO, state will reach
+	 * PGAIO_HS_COMPLETED_SHARED without further intervention by the IO
+	 * method.
+	 *
 	 * If not provided, it needs to be guaranteed that the IO method calls
 	 * pgaio_io_process_completion() without further interaction by the
 	 * issuing backend.
+	 * ---
 	 */
 	void		(*wait_one) (PgAioHandle *ioh, uint64 ref_generation);

>> Subject: [PATCH v2.11 03/27] Redefine max_files_per_process to control
>> additionally opened files
>
> Ready to commit

Cool!

>> Subject: [PATCH v2.11 04/27] aio: Add liburing dependency
>
>> --- a/meson.build
>> +++ b/meson.build
>> @@ -944,6 +944,18 @@ endif
>>
>> +###
>> +# Library: liburing
>> +###
>> +
>> +liburingopt = get_option('liburing')
>> +liburing = dependency('liburing', required: liburingopt)
>> +if liburing.found()
>> +  cdata.set('USE_LIBURING', 1)
>> +endif
>
> This is a different style from other deps; is it equivalent to our standard style?

Yes - the only reason to
Re: Proposal - Allow extensions to set a Plan Identifier
On 3/23/25 15:56, Sami Imseih wrote:
>> Hmm, queryId generation code lies in the core and we already came to
>> terms that this field has only a statistical purpose. Do you want to
>> commit planId generation code?
>
> But extensions don't necessarily need to rely on the core queryId. They
> can generate their own queryId. We have it documented for
> compute_query_id as such [0]:
> "Note that an external module can alternatively be used if the in-core
> query identifier computation method is not acceptable"

In theory they don't, but in practice they must. This mess is especially annoying because one of the problematic extensions is at the same time the most popular one - pg_stat_statements. People forget to follow strict instructions about the load order of extensions and blame the extension for instability when one of their query plans isn't captured but should be. So, it may not be an issue in a cloud provider's predefined installations, but it is a headache for custom configurations.

--
regards, Andrei Lepikhov
Re: BitmapHeapScan streaming read user and prelim refactoring
On Sat, Mar 22, 2025 at 5:04 PM Andres Freund wrote:
>
> The problem is that sometimes recheck is performed for pending empty
> tuples. The comment about empty tuples says:
>
>     /*
>      * Bitmap is exhausted. Time to emit empty tuples if relevant. We emit
>      * all empty tuples at the end instead of emitting them per block we
>      * skip fetching. This is necessary because the streaming read API
>      * will only return TBMIterateResults for blocks actually fetched.
>      * When we skip fetching a block, we keep track of how many empty
>      * tuples to emit at the end of the BitmapHeapScan. We do not recheck
>      * all NULL tuples.
>      */
>     *recheck = false;
>     return bscan->rs_empty_tuples_pending > 0;
>
> But we actually emit empty tuples at each page boundary:
>
>     /*
>      * Out of range? If so, nothing more to look at on this page
>      */
>     while (hscan->rs_cindex >= hscan->rs_ntuples)
>     {
>         /*
>          * Emit empty tuples before advancing to the next block
>          */
>         if (bscan->rs_empty_tuples_pending > 0)
>         {
>             /*
>              * If we don't have to fetch the tuple, just return nulls.
>              */
>             ExecStoreAllNullTuple(slot);
>             bscan->rs_empty_tuples_pending--;
>             return true;
>         }
>
>         /*
>          * Returns false if the bitmap is exhausted and there are no further
>          * blocks we need to scan.
>          */
>         if (!BitmapHeapScanNextBlock(scan, recheck, lossy_pages, exact_pages))
>             return false;
>     }
>
> So we don't just process tuples at the end of the scan, but at each page
> boundary.

Whoops! I think an earlier version of the patch did emit all the empty tuples at the end, but I changed the control flow without moving the recheck = false and removing/modifying that comment. Quite embarrassing.

> This leads to wrong results whenever there is a rs_empty_tuples_pending > 0
> after a previous page that needed rechecks, because nothing will reset
> *recheck = false.
>
> And indeed, if I add a *recheck = false inside the
>
>     /*
>      * Emit empty tuples before advancing to the next block
>      */
>     if (bscan->rs_empty_tuples_pending > 0)
>     {
>
> the problem goes away.
>
> A fix should do slightly more, given that the "at the end" comment is wrong.
>
> But I'm wondering if it's worth fixing it, or if we should just rip the
> logic out altogether, due to the whole VM checking being unsound as we
> learned in the thread I referenced earlier, without even bothering to fix
> the bug here.

Perhaps it is better I just fix it since ripping out the skip fetch optimization has to be backported, and even though that will look very different on master than on backbranches, I wonder if people will look for a "clean" commit on master which removes the feature.

- Melanie
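[A minimal sketch of the fix described above, resetting *recheck where the all-NULL tuples are emitted; the code is as quoted in the message, with the new comment wording mine:]

    /*
     * Emit empty tuples before advancing to the next block
     */
    if (bscan->rs_empty_tuples_pending > 0)
    {
        /*
         * Skipped-fetch tuples are known all-NULL and were never fetched,
         * so they must not inherit a previous page's recheck flag.
         */
        *recheck = false;
        ExecStoreAllNullTuple(slot);
        bscan->rs_empty_tuples_pending--;
        return true;
    }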
Re: Fwd: [BUG]: the walsender does not update its IO statistics until it exits
On Wed, Mar 19, 2025 at 04:00:49PM +0800, Xuneng Zhou wrote:
> Hi,
> Moving the other two provides a more complete view of the settings. For
> newcomers (like me) to the codebase, seeing all three related values in one
> place helps avoid a narrow view of the settings.
>
> But I am not sure that I understand the cons of this well.

While I don't disagree with the use of a hardcoded interval of time to control the timing of the flush of the WAL sender stats, do we really need to rely on the timing defined by pgstat.c? Wouldn't it be simpler to assign one in walsender.c and pick a different, perhaps higher, value? In the end, the timestamp calculations are free because we can rely on the existing call of GetCurrentTimestamp() for the physical WAL senders to get an idea of the current time.

For the logical WAL senders, perhaps we'd better do the reports in WalSndWaitForWal(), actually. There is also a call to GetCurrentTimestamp() that we can rely on in this path.

--
Michael

signature.asc
Description: PGP signature
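[A sketch of the walsender-local approach suggested above; the interval name and value are placeholders, and pgstat_flush_io() is assumed to be the relevant flush entry point:]

    /* Hypothetical: flush WAL sender IO stats at most once per interval. */
    #define WALSENDER_STATS_FLUSH_INTERVAL_MS 1000

    static TimestampTz stats_last_flush = 0;

    /*
     * Sketch: call with the timestamp already obtained by the existing
     * GetCurrentTimestamp() call in the walsender loop, so no extra
     * clock reads are needed.
     */
    static void
    MaybeFlushWalSenderStats(TimestampTz now)
    {
        if (!TimestampDifferenceExceeds(stats_last_flush, now,
                                        WALSENDER_STATS_FLUSH_INTERVAL_MS))
            return;

        pgstat_flush_io(false);     /* assumed IO-stats flush entry point */
        stats_last_flush = now;
    }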
Re: Proposal - Allow extensions to set a Plan Identifier
On 3/23/25 01:01, Michael Paquier wrote:
> On Sat, Mar 22, 2025 at 11:50:06PM +0100, Andrei Lepikhov wrote:
>> planId actually looks less controversial than queryId or plan_node_id. At
>> the same time, it is not free from the different logic that extensions may
>> incorporate into this value: I can imagine, for example, the attempt of
>> uniquely numbering plans related to the same queryId, plain hashing of the
>> plan tree, hashing without constants, etc.
>
> One plan ID should have one single query ID,

I'm not sure I understand what you mean by that. QueryId reflects the syntactic structure of the query, but planId - the semantics. For example:

SELECT relname FROM pg_class JOIN pg_am USING (oid);
SELECT relname FROM pg_am JOIN pg_class USING (oid);

have different queryIds: -6493293269785547447 and 4243743071053231833, but the same plan:

Hash Join
  Hash Cond: (pg_class.oid = pg_am.oid)
  ->  Seq Scan on pg_class
  ->  Hash
        ->  Seq Scan on pg_am

You can invent multiple other cases here - remember the query tree transformations we made before optimisation.

> but the opposite may be
> not necessarily true: a query ID could be associated with multiple
> plan patterns (aka multiple plan IDs). What this offers is that we
> know about which plan one query is currently using in live, for a
> given query ID.

Okay, as I've said before, it seems practical. I just worry because I see that people don't understand that queryId is a more or less arbitrary value that may or may not group queries that differ only by constants to a single "Query Class". I think the same attitude will be paid to planId. It is okay for statistics. However, non-statistical extensions need more guarantees and want to put different logic into these values.
For now, we have examples of only statistical extensions in open-source and may live with a proposed solution.

>> So, it seems that extensions may conflict with the same field. Are we sure
>> that won't happen if they are loaded in different orders in neighbouring
>> backends?
>
> Depends on what kind of computation one wants to do, and surely
> without some coordination across the extensions you are using these
> cannot be shared, but it's no different from the concept of a query
> ID.

Hmm, queryId generation code lies in the core and we already came to terms that this field has only a statistical purpose. Do you want to commit planId generation code?

--
regards, Andrei Lepikhov
Re: Proposal - Allow extensions to set a Plan Identifier
On 3/23/25 04:22, Sami Imseih wrote:
> On Sat, Mar 22, 2025 at 11:50:06PM +0100, Andrei Lepikhov wrote:
>> planId actually looks less controversial than queryId or plan_node_id. At
>> the same time, it is not free from the different logic that extensions may
>> incorporate into this value: I can imagine, for example, the attempt of
>> uniquely numbering plans related to the same queryId, plain hashing of the
>> plan tree, hashing without constants, etc.
>
> I think plan_node_id is probably the least controversial because that
> value comes straight from core, and different extensions cannot have
> their own interpretation of what that value could be.

I meant the proposal to let extensions calculate this field. Someone may want node_id to be unique inside the plan; for someone else, it should reflect the nature of the node and may be the same even inside one plan, etc. In this case, it is highly controversial.

> Computing a plan_id is even more open for interpretation than it is for
> query_id perhaps, which is why giving this ability to extensions will be
> useful. Different extensions may choose to compute a plan_id different
> ways, but they all should be able to push this value into core, and this
> is what this patch will allow for.

That's why I worry about allowing extensions to compute it. Remember: for now, an extension can't be sure that a conflicting one is not already loaded in the backend. The code [1] may relieve this problem.

[1] https://www.postgresql.org/message-id/flat/441661.1742683...@sss.pgh.pa.us#7640020deb6497fe049ac4839eaeb33d

--
regards, Andrei Lepikhov
Re: AIO v2.5
Hi,

On 2025-03-19 18:11:18 -0700, Noah Misch wrote:
> On Wed, Mar 19, 2025 at 06:17:37PM -0400, Andres Freund wrote:
>> On 2025-03-19 14:25:30 -0700, Noah Misch wrote:
>> Hm, we retry more frequently than that if there are new connections... Maybe
>> just "try again next time"?
>
> Works for me.
>
>>> And these individual lines from "git grep BM_IO_IN_PROGRESS":
>>>
>>>      *	i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
>>>
>>> The last especially.
>>
>> Huh - yea. This isn't a "new" issue, I think I missed this comment in 16's
>> 12f3867f5534. I think the comment can just be deleted?
>
> Hmm, yes, it's orthogonal to $SUBJECT and deletion works fine.
>
>>>      * I/O already in progress. We already hold BM_IO_IN_PROGRESS for the
>>>      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
>>>      * only one process at a time can set the BM_IO_IN_PROGRESS bit.
>>>
>>> For the other three lines and the paragraph, the notion
>>> of a process "holding" BM_IO_IN_PROGRESS or being the process to "set" it or
>>> being the process "doing a read" becomes less significant when one process
>>> starts the IO and another completes it.
>>
>> Hm. I think they'd be ok as-is, but we can probably improve them. Maybe
>
> Looking again, I agree they're okay.
>
>>      * Now it's safe to write buffer to disk. Note that no one else should
>>      * have been able to write it while we were busy with log flushing because
>>      * we got the exclusive right to perform I/O by setting the
>>      * BM_IO_IN_PROGRESS bit.
>
> That's fine too. Maybe s/perform/stage/ or s/perform/start/.

I put these comment changes into their own patch, as it seemed confusing to change them as part of one of the already queued commits.

>>> I see this relies on md_readv_complete having converted "result" to blocks.
>>> Was there some win from doing that as opposed to doing the division here?
>>> Division here ("blocks_read = prior_result.result / BLCKSZ") would feel
>>> easier to follow, to me.
>>
>> It seemed like that would be wrong layering - what if we had an smgr that
>> could store data in a compressed format? The raw read would be of a smaller
>> size. The smgr API deals in BlockNumbers, only the md.c layer should know
>> about bytes.
>
> I hadn't thought of that. That's a good reason.

I thought that was better documented, but alas, it wasn't. How about updating the documentation of smgrstartreadv to the following:

/*
 * smgrstartreadv() -- asynchronous version of smgrreadv()
 *
 * This starts an asynchronous readv IO using the IO handle `ioh`. Other than
 * `ioh` all parameters are the same as smgrreadv().
 *
 * Completion callbacks above smgr will be passed the result as the number of
 * successfully read blocks if the read [partially] succeeds. This maintains
 * the abstraction that smgr operates on the level of blocks, rather than
 * bytes.
 */

I briefly had a bug in test_aio's injection point that led to *increasing* the number of bytes successfully read. That triggered an assertion failure in bufmgr.c, but not closer to the problem. Is it worth adding an assert against that to md_readv_complete? Can't quite decide.

Greetings,

Andres Freund
Re: wrong error message related to unsupported feature
čt 20. 3. 2025 v 22:30 odesílatel Pavel Stehule napsal:

> Hi
>
> út 18. 3. 2025 v 21:33 odesílatel Álvaro Herrera napsal:
>
>> On 2025-Mar-18, Pavel Stehule wrote:
>>
>>> Maybe I found a bug
>>>
>>> (2025-03-18 19:28:06) postgres=# create table foo(a int constraint gzero
>>> check(a > 10) NOT ENFORCED);
>>> CREATE TABLE
>>> (2025-03-18 19:29:37) postgres=# insert into foo values(0);
>>> INSERT 0 1
>>> (2025-03-18 19:29:49) postgres=# insert into foo values(6);
>>> INSERT 0 1
>>> (2025-03-18 19:29:55) postgres=# alter table foo alter constraint gzero
>>> enforced;
>>> ERROR:  FOREIGN KEY constraints cannot be marked ENFORCED
>>> LINE 1: alter table foo alter constraint gzero enforced;
>>>
>>> I know altering constraint enforcement is not supported yet, but the error
>>> message is surely wrong
>>
>> Yep, this is a bug all right -- I reported this and related problems a
>> few days ago [1]. There's a proposal in that thread for how to fix this
>> (see Amul's email [2] and my followup), but I haven't had time to fully
>> implement it. If you want to give it a couple of hours to complete it,
>> that'd be great. I have a couple of patches that I need to handle
>> before coming back to that.
>>
>> [1] https://postgr.es/m/202503041822.aobpqke3igvb@alvherre.pgsql
>> [2] https://postgr.es/m/caaj_b97hd-jmts7ajgu6tdbczdx_kyukxg+k-dtymoieg+g...@mail.gmail.com
>>
>> --
>> Álvaro Herrera    PostgreSQL Developer — https://www.EnterpriseDB.com/
>
> I am looking into this issue, and I am not sure the proposed way is the best.
>
> Cannot we change processCASbits just like this?
>
>     if (cas_bits & (CAS_DEFERRABLE | CAS_INITIALLY_DEFERRED))
>     {
>         if (deferrable)
>             *deferrable = true;
>         else if (constrType)
>             ereport(ERROR,
>                     (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                      /* translator: %s is CHECK, UNIQUE, or similar */
>                      constrType ?
>                      errmsg("%s constraints cannot be marked DEFERRABLE",
>                             constrType) :
>                      errmsg("constraint cannot be marked DEFERRABLE"),
>                      parser_errposition(location)));
>     }
>     ...
>
> Probably it is better to not try to read from the catalog at this moment, and
> then we can accept that we don't know the constraint type.
>
> Regards
>
> Pavel

Something like the attached patch.

Regards

Pavel

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 271ae26cbaf..7b0e66a8940 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -2663,7 +2663,7 @@ alter_table_cmd:
 					n->def = (Node *) c;
 					c->conname = $3;
 					c->alterDeferrability = true;
-					processCASbits($4, @4, "FOREIGN KEY",
+					processCASbits($4, @4, NULL,
 								   &c->deferrable, &c->initdeferred,
 								   NULL, NULL, NULL, yyscanner);
@@ -19531,9 +19531,11 @@ processCASbits(int cas_bits, int location, const char *constrType,
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			/* translator: %s is CHECK, UNIQUE, or similar */
+					 constrType ?
+			/* translator: %s is CHECK, UNIQUE, TRIGGER, or similar */
 					 errmsg("%s constraints cannot be marked DEFERRABLE",
-							constrType),
+							constrType) :
+					 errmsg("constraint cannot be marked DEFERRABLE"),
 					 parser_errposition(location)));
 	}
@@ -19544,9 +19546,11 @@ processCASbits(int cas_bits, int location, const char *constrType,
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			/* translator: %s is CHECK, UNIQUE, or similar */
+					 constrType ?
+			/* translator: %s is CHECK, UNIQUE, TRIGGER, or similar */
 					 errmsg("%s constraints cannot be marked DEFERRABLE",
-							constrType),
+							constrType) :
+					 errmsg("constraint cannot be marked DEFERRABLE"),
 					 parser_errposition(location)));
 	}
@@ -19557,9 +19561,11 @@ processCASbits(int cas_bits, int location, const char *constrType,
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			/* translator: %s is CHECK, UNIQUE, or similar */
+					 constrType ?
+			/* translator: %s is CHECK, UNIQUE, TRIGGER, or similar */
 					 errmsg("%s constraints cannot be marked NOT VALID",
-							constrType),
+							constrType) :
+					 errmsg("constraint cannot be marked NOT VALID"),
 					 parser_errposition(location)));
 	}
@@ -19570,9 +19576,11 @@ processCASbits(int cas_bits, int location, const char *constrType,
 		else
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			/* translator: %s is CHECK, UNIQUE, or similar */
+					 constrType ?
+			/* translator: %s is CHECK, UNIQUE, TRIGGER, or similar */
 					 errmsg("%s constraints cannot be marked NO INHERIT",
-							constrType),
+							constrType) :
+					 errmsg("constraint cannot be marked NO INHERIT"),
 					 parser_errposition(location)));
 	}
@@ -19583,9 +19591,1
Re: Parallel heap vacuum
On Sat, Mar 22, 2025 at 7:16 AM Melanie Plageman wrote: > > On Thu, Mar 20, 2025 at 4:36 AM Masahiko Sawada wrote: > > > > When testing the multi passes of table vacuuming, I found an issue. > > With the current patch, both leader and parallel workers process stop > > the phase 1 as soon as the shared TidStore size reaches to the limit, > > and then the leader resumes the parallel heap scan after heap > > vacuuming and index vacuuming. Therefore, as I described in the patch, > > one tricky part of this patch is that it's possible that we launch > > fewer workers than the previous time when resuming phase 1 after phase > > 2 and 3. In this case, since the previous parallel workers might have > > already allocated some blocks in their chunk, newly launched workers > > need to take over their parallel scan state. That's why in the patch > > we store workers' ParallelBlockTableScanWorkerData in shared memory. > > However, I found my assumption is wrong; in order to take over the > > previous scan state properly we need to take over not only > > ParallelBlockTableScanWorkerData but also ReadStream data as parallel > > workers might have already queued some blocks for look-ahead in their > > ReadStream. Looking at ReadStream codes, I find that it's not > > realistic to store it into the shared memory. > > It seems like one way to solve this would be to add functionality to > the read stream to unpin the buffers it has in the buffers queue > without trying to continue calling the callback until the stream is > exhausted. > > We have read_stream_reset(), but that is to restart streams that have > already been exhausted. Exhausted streams are where the callback has > returned InvalidBlockNumber. In the read_stream_reset() cases, the > read stream user knows there are more blocks it would like to scan or > that it would like to restart the scan from the beginning. > > Your case is you want to stop trying to exhaust the read stream and > just unpin the remaining buffers. As long as the worker which paused > phase I knows exactly the last block it processed and can communicate > this to whatever worker resumes phase I later, it can initialize > vacrel->current_block to the last block processed. If we use ParallelBlockTableScanDesc with streaming read like the patch did, we would also need to somehow rewind the number of blocks allocated to workers. The problem I had with such usage was that a parallel vacuum worker allocated a new chunk of blocks when doing look-ahead reading and therefore advanced ParallelBlockTableScanDescData.phs_nallocated. In this case, even if we unpin the remaining buffers in the queue by a new functionality and a parallel worker resumes the phase 1 from the last processed block, we would lose some blocks in already allocated chunks unless we rewind ParallelBlockTableScanDescData and ParallelBlockTableScanWorkerData data. However, since a worker might have already allocated multiple chunks it would not be easy to rewind these scan state data. Another idea is that parallel workers don't exit phase 1 until it consumes all pinned buffers in the queue, even if the memory usage of TidStore exceeds the limit. It would need to add new functionality to the read stream to disable the look-ahead reading. Since we could use much memory while processing these buffers, exceeding the memory limit, we can trigger this mode when the memory usage of TidStore reaches 70% of the limit or so. 
On the other hand, it means that we would not use the streaming read for the blocks in this mode, which is not efficient. > > > One plausible solution would be that we don't use ReadStream in > > parallel heap vacuum cases but directly use > > table_block_parallelscan_xxx() instead. It works but we end up having > > two different scan methods for parallel and non-parallel lazy heap > > scan. I've implemented this idea in the attached v12 patches. > > One question is which scenarios will parallel vacuum phase I without > AIO be faster than read AIO-ified vacuum phase I. Without AIO writes, > I suppose it would be trivial for phase I parallel vacuum to be faster > without using read AIO. But it's worth thinking about the tradeoff. As Andres pointed out, there are major downsides. So we would need to invent a way to stop and resume the read stream in the middle during parallel scan. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Have postgres.bki self-identify
On Thu, Mar 20, 2025 at 3:47 PM David G. Johnston wrote: > While trying to find postgres.bki in my build directory searching for the > file name didn't work because there is no comment in the file containing the > file name; like there is in every other file we write or generate, including > the related *_d.h files. Add it. I'm not a fan of making it a policy that everyone has to do this. I'd rather see us remove filenames from some places where they cause maintenance overhead for little benefit. If somebody wants to find postgres.bki, I guess you can just "find . -name postgres.bki -print" -- Robert Haas EDB: http://www.enterprisedb.com
Re: Proposal - Allow extensions to set a Plan Identifier
On Sun, Mar 23, 2025 at 04:30:04PM +0100, Andrei Lepikhov wrote:
> So, it may be not an issue in a cloud provider predefined installations, but
> a headache for custom configurations.

Sure, I mean, but it's not really related to the issue discussed on this thread, so.. It sounds like we could improve the documentation about the GUCs that hold a list of library names and load them in the order specified, so as we could point somewhere rather than just saying "don't do that". Another thing is extensions that have the idea of not tracking a hook previously registered, and failing to call it. Another one is that the previous hook can be called before a module's own hook logic, or after it. Depending on how an out-of-core extension is maintained, there are a lot of things that can be done.

--
Michael

signature.asc
Description: PGP signature
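[To illustrate the hook-tracking point above, a minimal sketch of the conventional chaining pattern, using planner_hook as an arbitrary example; my_planner is hypothetical:]

    #include "postgres.h"
    #include "fmgr.h"
    #include "optimizer/planner.h"

    PG_MODULE_MAGIC;

    static planner_hook_type prev_planner_hook = NULL;

    static PlannedStmt *
    my_planner(Query *parse, const char *query_string, int cursorOptions,
               ParamListInfo boundParams)
    {
        /* ... this module's work goes here ... */

        /* Always chain to the previously registered hook, if any. */
        if (prev_planner_hook)
            return prev_planner_hook(parse, query_string, cursorOptions,
                                     boundParams);
        return standard_planner(parse, query_string, cursorOptions, boundParams);
    }

    void
    _PG_init(void)
    {
        prev_planner_hook = planner_hook;   /* remember, don't clobber */
        planner_hook = my_planner;
    }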
Re: Improve monitoring of shared memory allocations
Hi Bilal,

> I have a couple of comments, I have only reviewed 0001 so far.

Thank you for reviewing!

> You may need to run pgindent, it makes some changes.

The attached v4 patch has been updated after running pgindent.

> +     * If table is shared, calculate the offset at which to find the
> +     * first partition of elements
> +     */
> +
> +    nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
>
> Blank line between the comment and the code.

Removed this.

> /*
>  * allocate some new elements and link them into the indicated free list
>  */
> -static bool
> -element_alloc(HTAB *hashp, int nelem, int freelist_idx)
> +static HASHELEMENT *
> +element_alloc(HTAB *hashp, int nelem)
>
> Comment needs an update. This function no longer links elements into
> the free list.

Updated this and a few other comments in the attached v4 patch.

> +static int
> +compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
> +{
> ...
> +    /*
> +     * In a partitioned table, nbuckets must be at least equal to
> +     * num_partitions; were it less, keys with apparently different partition
> +     * numbers would map to the same bucket, breaking partition independence.
> +     * (Normally nbuckets will be much bigger; this is just a safety check.)
> +     */
> +    while ((*nbuckets) < num_partitions)
> +        (*nbuckets) <<= 1;
>
> I have some worries about this function, I am not sure what I said
> below has real life implications as you already said 'Normally
> nbuckets will be much bigger; this is just a safety check.'.
>
> 1- num_partitions is long and nbuckets is int, so could there be any
> case where num_partitions is bigger than MAX_INT and causes an infinite
> loop?
> 2- Although we assume both nbuckets and num_partitions are initialized as
> the same type, (*nbuckets) <<= 1 will cause an infinite loop if
> num_partitions is bigger than MAX_TYPE / 2.
>
> So I think that the solution is to confirm that num_partitions <
> MAX_NBUCKETS_TYPE / 2, what do you think?

Your concern is valid. This has been addressed in the existing code by calling next_pow2_int() on num_partitions before running the function. Additionally, I am not adding any new code to the compute_buckets_and_segs function. I am simply moving part of the init_htab() code into a separate function for reuse.

Please find attached the updated and rebased patches.

Thank you,
Rahila Syed

v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch
Description: Binary data

v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch
Description: Binary data
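[A standalone illustration of the doubling-loop hazard raised in points 1 and 2 above; the function name is hypothetical and this is not code from the patch:]

    #include <limits.h>

    /*
     * If num_partitions exceeds what an int can double up to, the loop can
     * never terminate: once nbuckets would have to pass INT_MAX / 2, the
     * left shift overflows (undefined behavior for signed int) instead of
     * reaching num_partitions.
     */
    static int
    round_up_buckets(int nbuckets, long num_partitions)
    {
        while (nbuckets < num_partitions)
            nbuckets <<= 1;     /* overflows once nbuckets > INT_MAX / 2 */
        return nbuckets;
    }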
Re: Reduce "Var IS [NOT] NULL" quals during constant folding
On Sat, Mar 22, 2025 at 1:12 AM Robert Haas wrote: > However, I'm a bit concerned about the overall premise of the patch > set. It feels like it is moving something that really ought to happen > at optimization time back to parse time. I have a feeling that's going > to break something, although I am not sure right now exactly what. > Wouldn't it be better to have this still happen in the planner, but > sooner than it does now? You're right. It's just flat wrong to collect catalog information in the parser and use it in the planner. As Tom pointed out, the catalog information could change in between, which would cause us to use stale data. Yeah, this should still happen in the planner, perhaps before pull_up_sublinks, if we plan to leverage that info to convert NOT IN to anti-join. Thanks Richard
Re: Make COPY format extendable: Extract COPY TO format implementations
On Wed, Mar 19, 2025 at 6:25 PM Sutou Kouhei wrote:
>
> Hi,
>
> In "Re: Make COPY format extendable: Extract COPY TO format implementations"
> on Wed, 19 Mar 2025 17:49:49 -0700, "David G. Johnston" wrote:
>
>>> And could someone help (take over if possible) writing a
>>> document for this feature? I'm not good at writing a
>>> document in English... 0009 in the attached v37 patch set
>>> has a draft of it. It's based on existing documents in
>>> doc/src/sgml/ and *.h.
>>
>> I haven't touched the innards of the structs aside from changing
>> programlisting to synopsis. And redoing the two section opening paragraphs
>> to better integrate with the content in the chapter opening.
>>
>> The rest I kinda went to town on...
>
> Thanks!!! It's very helpful!!!
>
> I've applied your patch. 0009 is only changed.

Thank you for updating the patches. I've reviewed the main part of supporting the custom COPY format. Here are some random comments:

---
+/*
+ * Process the "format" option.
+ *
+ * This function checks whether the option value is a built-in format such as
+ * "text" and "csv" or not. If the option value isn't a built-in format, this
+ * function finds a COPY format handler that returns a CopyToRoutine (for
+ * is_from == false) or CopyFromRoutine (for is_from == true). If no COPY
+ * format handler is found, this function reports an error.
+ */

I think this comment needs to be updated as the part "If the option value isn't ..." is no longer true. I think we don't necessarily need to create a separate function ProcessCopyOptionFormat for processing the format option.

We need more regression tests for handling the given format name. For example:
- more various input patterns.
- a function with the specified format name exists but it returns an unexpected Node.
- looking for a handler function in a different namespace.
etc.

---
I think that we should accept qualified names too as the format name, like tablesample does. That way, different extensions implementing the same format can be used.

---
+        if (routine == NULL || !IsA(routine, CopyFromRoutine))
+            ereport(
+                    ERROR,
+                    (errcode(
+                             ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("COPY handler function "
+                            "%u did not return "
+                            "CopyFromRoutine struct",
+                            opts->handler)));

It's not conventional to put a new line between 'ereport(' and 'ERROR' (similarly between 'errcode(' and 'ERRCODE_...'). Also, we don't need to split the error message into multiple lines as it's not long.

---
+        if (routine == NULL || !IsA(routine, CopyToRoutine))
+            ereport(
+                    ERROR,
+                    (errcode(
+                             ERRCODE_INVALID_PARAMETER_VALUE),
+                     errmsg("COPY handler function "
+                            "%u did not return "
+                            "CopyToRoutine struct",
+                            opts->handler)));

Same as the above comment.

---
+  descr => 'pseudo-type for the result of a copy to/from method function',

s/method function/format function/

---
+    Oid        handler;    /* handler function for custom format routine */

'handler' is used also for built-in formats.

---
+static void
+CopyFromInFunc(CopyFromState cstate, Oid atttypid,
+               FmgrInfo *finfo, Oid *typioparam)
+{
+    ereport(NOTICE, (errmsg("CopyFromInFunc: atttypid=%d", atttypid)));
+}

OIDs could be changed across major versions even for built-in types. I think it's better to avoid using them for tests.

---
+static void
+CopyToOneRow(CopyToState cstate, TupleTableSlot *slot)
+{
+    ereport(NOTICE, (errmsg("CopyToOneRow: tts_nvalid=%u", slot->tts_nvalid)));
+}

Similar to the above comment, the field name 'tts_nvalid' might also be changed in the future; let's use another name.

---
+static const CopyFromRoutine CopyFromRoutineTestCopyFormat = {
+    .type = T_CopyFromRoutine,
+    .CopyFromInFunc = CopyFromInFunc,
+    .CopyFromStart = CopyFromStart,
+    .CopyFromOneRow = CopyFromOneRow,
+    .CopyFromEnd = CopyFromEnd,
+};

I'd suggest not using the same function names as the fields.

---
+/*
+ * Export CopySendEndOfRow() for extensions. We want to keep
+ * CopySendEndOfRow() as a static function for
+ * optimization. CopySendEndOfRow() calls in this file may be optimized by a
+ * compiler.
+ */
+void
+CopyToStateFlush(CopyToState cstate)
+{
+    CopySendEndOfRow(cstate);
+}

Is there any reason to use a different name for public functions?

---
+/*
+ * Export CopyGetData() for extensions. We want to keep CopyGetData() as a
+ * static function for optimization. CopyGet
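[Based on the handler contract quoted at the top of this review, where a handler returns a CopyFromRoutine or CopyToRoutine depending on is_from, a minimal sketch of such a function might look as follows; the argument convention is my assumption from the quoted comment, not settled API:]

    /* Sketch of a custom COPY format handler; argument convention assumed. */
    Datum
    test_copy_format_handler(PG_FUNCTION_ARGS)
    {
        bool        is_from = PG_GETARG_BOOL(0);

        if (is_from)
            PG_RETURN_POINTER(&CopyFromRoutineTestCopyFormat);
        PG_RETURN_POINTER(&CopyToRoutineTestCopyFormat);
    }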
Re: Support NOT VALID / VALIDATE constraint options for named NOT NULL constraints
hi.
you may like the attached. it's based on your idea: attnotnullvalid.

I came across a case, not sure if it's a bug.

CREATE TABLE ttchk (a INTEGER);
ALTER TABLE ttchk ADD CONSTRAINT cc check (a is NOT NULL) NOT VALID;
CREATE TABLE ttchk_child(a INTEGER) INHERITS(ttchk);

ttchk_child's constraint cc will default to valid, but pg_dump && pg_restore will make ttchk_child's constraint invalid. since it's an existing behavior, the not-null constraint will align with it.

-the following text is copied from the commit message

NOT NULL NOT VALID

* TODO: In doc/src/sgml/ref/alter_table.sgml, under the Compatibility section,
  clarify how the "NOT NULL NOT VALID" syntax conforms with the standard.

* TODO: Should CREATE TABLE LIKE copy an existing invalid not-null constraint
  to the new table, and if so, the new table's not-null will be marked as valid.

description entry of pg_attribute.attnotnullvalid:

+      attnotnullvalid bool
+
+      The not-null constraint validity status of the column.
+      If true, it means this column has a valid not-null constraint,
+      false means this column doesn't have a not-null constraint or has an
+      unvalidated one. If attnotnull is false, this must be false.

* attnotnull means that a not-null constraint exists; it doesn't imply anything
  regarding the constraint being valid or not. attnotnullvalid will indicate
  whether the constraint is valid; this column can only be true if attnotnull
  is already true.
  attnotnullvalid is only added to FormData_pg_attribute, not to
  CompactAttribute, mainly because invalid not-null constraints are not
  commonly used. TupleDesc->TupleConstr->has_not_null now also represents
  invalid not-null constraints.

* For a table in the pg_catalog schema, if a column's attnotnull attribute is
  true, then its attnotnullvalid attribute is also true. Similarly, if
  attnotnull is false, then attnotnullvalid is false. I added an SQL check at
  the end of src/test/regress/sql/constraints.sql (not sure it's necessary).

* CREATE TABLE specifying a not valid not-null constraint will set it to
  valid; a warning is issued within function transformCreateStmt. That means
  InsertPgAttributeTuples cannot insert an attribute that is (attnotnull &&
  !attnotnullvalid). I added an Assert in InsertPgAttributeTuples. (also added
  to other places, to demo i didn't mess something, maybe it's necessary).

* table rewrite won't validate an invalid not-null constraint; that is aligned
  with check constraints.

* attnotnullvalid is mainly changed in these two places:
  1. ATAddCheckNNConstraint: if you specified "NOT NULL NOT VALID", it will
     change it from false to false, but will set attnotnull to true.
  2. QueueNNConstraintValidation, subroutine of ATExecValidateConstraint:
     when validating a not valid not-null constraint, toggle it from false to
     true, also setting attnotnull to true.

* A partitioned table can have an invalid NOT NULL constraint while its
  partitions have a valid one, but not the other way around. pg_dump/pg_restore
  may not preserve the constraint name properly, but that's fine for not-null
  constraints, i think.

* regular table invalid not-null constraint pg_dump also works fine.

From fc4bf954772d25dfbf60774429d875f78e4fd69e Mon Sep 17 00:00:00 2001
From: jian he
Date: Mon, 24 Mar 2025 09:21:10 +0800
Subject: [PATCH v5 1/1] NOT NULL NOT VALID
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* TODO: In doc/src/sgml/ref/alter_table.sgml, under the Compatibility section,
  clarify how the "NOT NULL NOT VALID" syntax conforms with the standard.
* TODO: Should CREATE TABLE LIKE copy an existing invalid not-null constraint
  to the new table, and if so, the new table's not-null will be marked as valid.
* attnotnull means that a not-null constraint exists; it doesn't imply anything
  regarding the constraint being valid or not. attnotnullvalid will indicate
  whether the constraint is valid; this column can only be true if attnotnull
  is already true. attnotnullvalid is only added to FormData_pg_attribute, not
  to CompactAttribute, mainly because invalid not-null constraints are not
  commonly used. TupleDesc->TupleConstr->has_not_null now also represents
  invalid not-null constraints.
* For a table in the pg_catalog schema, if a column's attnotnull attribute is
  true, then its attnotnullvalid attribute is also true. Similarly, if
  attnotnull is false, then attnotnullvalid is false. I added an SQL check at
  the end of src/test/regress/sql/constraints.sql (not sure it's necessary)
* CREATE TABLE specifying a not valid not-null constraint will set it to
  valid; a warning is issued within function transformCreateStmt. That means
  InsertPgAttributeTuples cannot insert an attribute that is (attnotnull &&
  !attnotnullvalid). I added an Assert in InsertPgAttributeTuples.
* table rewrite won't
Re: Add missing tab completion for VACUUM and ANALYZE with ONLY option
vignesh C writes: > On Wed, 19 Mar 2025 at 18:12, David Rowley wrote: >> While VACUUM ONLY on a partitioned table has no effect, the same isn't >> true for inheritance parents. 62ddf7ee9 did change the behaviour of >> VACUUM for these so that vacuuming the inheritance parent now vacuums >> all of its children, unless ONLY is used. So, I'd say if the tab >> completion is only added to ANALYZE, then it'd be incomplete still. >> (I've not looked at the v3 patch to see which of those it handles.) > I also felt it is necessary for both ANALYZE and VACUUM, and the v3 > patch includes changes for both. Agreed. Pushed with some cosmetic adjustments. I made the order of the options match the syntax diagrams in the comments, which is just neatnik-ism (it changes no behavior) but seemed to read better. I also realized that we could simplify the match patterns for the various VACUUM cases by using MatchAnyN --- the existing code was getting rather contorted there. That wasn't something your patch introduced, but might as well fix it while we're here. regards, tom lane
Re: AIO v2.5
commit 247ce06b wrote: > + pgaio_io_reopen(ioh); > + > + /* > + * To be able to exercise the reopen-fails path, allow > injection > + * points to trigger a failure at this point. > + */ > + pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN"); > + > + error_errno = 0; > + error_ioh = NULL; > + > + /* > + * We don't expect this to ever fail with ERROR or > FATAL, no need > + * to keep error_ioh set to the IO. > + * pgaio_io_perform_synchronously() contains a critical > section to > + * ensure we don't accidentally fail. > + */ > + pgaio_io_perform_synchronously(ioh); A CHECK_FOR_INTERRUPTS() could close() the FD that pgaio_io_reopen() callee smgr_aio_reopen() stores. Hence, I think smgrfd() should assert that interrupts are held instead of doing its own HOLD_INTERRUPTS(), and a HOLD_INTERRUPTS() should surround the above region of code. It's likely hard to reproduce a problem, because pgaio_io_call_inj() does nothing in many builds, and pgaio_io_perform_synchronously() starts by entering a critical section. On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > Attached v2.11 > Subject: [PATCH v2.11 06/27] aio: Implement support for reads in smgr/md/fd > +int > +FileStartReadV(PgAioHandle *ioh, File file, > +int iovcnt, off_t offset, > +uint32 wait_event_info) > +{ > + int returnCode; > + Vfd*vfdP; > + > + Assert(FileIsValid(file)); > + > + DO_DB(elog(LOG, "FileStartReadV: %d (%s) " INT64_FORMAT " %d", > +file, VfdCache[file].fileName, > +(int64) offset, > +iovcnt)); > + > + returnCode = FileAccess(file); > + if (returnCode < 0) > + return returnCode; > + > + vfdP = &VfdCache[file]; > + > + pgaio_io_prep_readv(ioh, vfdP->fd, iovcnt, offset); FileStartReadV() and pgaio_io_prep_readv() advance the IO to PGAIO_HS_STAGED w/ batch mode, PGAIO_HS_SUBMITTED w/o batch mode. I didn't expect that from functions so named. The "start" verb sounds to me like unconditional PGAIO_HS_SUBMITTED, and the "prep" verb sounds like PGAIO_HS_DEFINED. I like the "stage" verb, because it matches PGAIO_HS_STAGED, and the comment at PGAIO_HS_STAGED succinctly covers what to expect. Hence, I recommend names FileStageReadV, pgaio_io_stage_readv, mdstagereadv, and smgrstageread. How do you see it? > +/* > + * AIO error reporting callback for mdstartreadv(). > + * > + * Errors are encoded as follows: > + * - PgAioResult.error_data != 0 encodes IO that failed with errno != 0 I recommend replacing "errno != 0" with either "that errno" or "errno == error_data". > Subject: [PATCH v2.11 07/27] aio: Add README.md explaining higher level design Ready for commit apart from some trivia: > +if (ioret.result.status == PGAIO_RS_ERROR) > +pgaio_result_report(aio_ret.result, &aio_ret.target_data, ERROR); I think ioret and aio_ret are supposed to be the same object. If that's right, change one of the names. Likewise elsewhere in this file. > +The central API piece for postgres' AIO abstraction are AIO handles. To > +execute an IO one first has to acquire an IO handle (`pgaio_io_acquire()`) > and > +then "defined", i.e. associate an IO operation with the handle. s/"defined"/"define" it/ or similar > +The "solution" to this the ability to associate multiple completion callbacks s/this the/this is the/ > Subject: [PATCH v2.11 08/27] localbuf: Track pincount in BufferDesc as well > @@ -5350,6 +5350,18 @@ ConditionalLockBufferForCleanup(Buffer buffer) > Assert(refcount > 0); > if (refcount != 1) > return false; > + > + /* > + * Check that the AIO subsystem doesn't have a pin. 
Likely not > + * possible today, but better safe than sorry. > + */ > + bufHdr = GetLocalBufferDescriptor(-buffer - 1); > + buf_state = pg_atomic_read_u32(&bufHdr->state); > + refcount = BUF_STATE_GET_REFCOUNT(buf_state); > + Assert(refcount > 0); > + if (refcount != 1) > + return false; > + LockBufferForCleanup() should get code like this ConditionalLockBufferForCleanup() code, either now or when "not possible today" ends. Currently, it just assumes all local buffers are cleanup-lockable: /* Nobody else to wait for */ if (BufferIsLocal(buffer)) return; > @@ -570,7 +577,13 @@ InvalidateLocalBuffer(BufferDesc *bufHdr, bool > check_unreferenced) > > buf_state = pg_atomic_read_u32(&bufHdr->state); > > - if (check_unrefer
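To make the interrupt-holdoff suggestion above concrete, here is a minimal sketch of the proposed shape (the function names come from the quoted patch; the exact placement is an assumption, not committed code):

```c
/*
 * Hold interrupts so no CHECK_FOR_INTERRUPTS() can close() the FD
 * that smgr_aio_reopen() just stored.
 */
HOLD_INTERRUPTS();

pgaio_io_reopen(ioh);
pgaio_io_call_inj(ioh, "AIO_WORKER_AFTER_REOPEN");
pgaio_io_perform_synchronously(ioh);

RESUME_INTERRUPTS();
```

And in smgrfd(), instead of doing its own HOLD_INTERRUPTS(), assert that the caller already holds interrupts:

```c
Assert(InterruptHoldoffCount > 0);
```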
Re: AIO v2.5
On Mon, Mar 24, 2025 at 5:59 AM Andres Freund wrote: > On 2025-03-23 08:55:29 -0700, Noah Misch wrote: > > An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in > > PGAIO_HS_COMPLETED_IO clearly doesn't block that close. For > > io_method=worker, > > closing in PGAIO_HS_SUBMITTED is okay. For io_method=io_uring, is there a > > reference about it being okay to close during PGAIO_HS_SUBMITTED? I looked > > awhile for an authoritative view on that, but I didn't find one. If we can > > rely on io_uring_submit() returning only after the kernel has given the > > io_uring its own reference to all applicable file descriptors, I expect it's > > okay to close the process's FD. If the io_uring acquires its reference > > later > > than that, I expect we shouldn't close before that later time. > > I'm fairly sure io_uring has its own reference for the file descriptor by the > time io_uring_enter() returns [1]. What io_uring does *not* reliably tolerate > is the issuing process *exiting* before the IO completes, even if there are > other processes attached to the same io_uring instance. It is a bit strange that the documentation doesn't say that explicitly. You can sorta-maybe-kinda infer it from the fact that io_uring didn't originally support cancelling requests at all, maybe a small clue that it also didn't cancel them when you closed the fd :-) The only sane alternative would seem to be that they keep running and have their own reference to the *file* (not the fd), which is the actual case, and might also be inferrable at a stretch from the io_uring_register() documentation that says it reduces overheads with a "long term reference" reducing "per-I/O overhead". (The distant third option/non-option is a sort of late/async binding fd as seen in the Glibc user space POSIX AIO implementation, but that sort of madness doesn't seem to be the sort of thing anyone working in the kernel would entertain for a nanosecond...) Anyway, there are also public discussions involving Mr Axboe that discuss the fact that async operations continue to run when the associated fd is closed, eg from people who were surprised by that when porting stuff from other systems, which might help fill in the documentation gap a teensy bit if people want to see something outside the source code: https://github.com/axboe/liburing/issues/568 > AIO v1 had a posix_aio backend, which, on several platforms, did *not* > tolerate the FD being closed before the IO completes. Because of that > IoMethodOps had a closing_fd callback, which posix_aio used to wait for the > IO's completion [2]. Just for the record while remembering this stuff: Windows is another system that took the cancel-on-close approach, so the Windows IOCP proof-of-concept patches also used that AIO v1 callback and we'll have to think about that again if/when we want to get that stuff going on AIO v2. I recall also speculating that it might be better to teach the vfd system to pick another victim to close instead if an fd was currently tied up with an asynchronous I/O for the benefit of those cancel-on-close systems, hopefully without any happy-path book-keeping. But just submitting staged I/O is a nice and cheap solution for now, without them in the picture.
Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
On Mon, 24 Mar 2025 at 15:23, Michael Paquier wrote: > And here we get 18564.06 (head) vs 16667.92 (patch) so a 10.7% > difference in this run. Hence the automatic addition of NULL to the > jumbling is disappointing here, even if this is what I'd see this as a > worst case scenario, unlikely what one would see for real if they care > about monitoring. Can you share which patch you tested here? There are many patches on this thread and I don't want to make assumptions about which one these are the results for. It sounds like it wasn't your latest patch as you mentioned "automatic addition of NULL to the jumbling". That doesn't sound like the correct description for your latest patch. David
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
On Fri, 21 Mar 2025 at 22:02, Andrei Lepikhov wrote: > In cases > I have seen before, the main question is how effective was (or may be) a > Memoize node == how often the incoming key fits the cache. In that case, > the hit ratio fraction is more understandable for a broad audience. > That's why, according to my experience, in case of a good cache > reusability factor, users are usually okay with increasing the cache > size to the necessary numbers and avoiding evictions at all costs. So, > the predicted evict_ratio also tells us about incrementing work_mem to > enhance the chances of Memoisation. Can you explain why "Estimated Capacity" and "Estimated Distinct Lookup Keys" don't answer that? If there are more distinct lookup keys than there is capacity to store them, then some will be evicted. Once again, I'm not necessarily objecting to hit and evict ratios being shown, I just want to know they're actually useful enough to show and don't just bloat the EXPLAIN output needlessly. So far your arguments aren't convincing me that they are. > Having written the last sentence I came back to the point why work_mem > is so universal and is used at each node as a criterion of memory > allocation size? But it is a different story, I think. We have to set the limit somehow. We could have done this by having a GUC per node type that uses memory, but it looks like something more universal was decided, perhaps to save on GUCs. I don't know the exact history, but once upon a time, sort_mem existed. Perhaps that disappeared because we grew more node types that needed to allocate large, otherwise unbounded amounts of memory. We did more recently grow a hash_mem_multiplier GUC, so it's not true to say that work_mem solely controls the limits of each node's memory allocation sizes. David
Re: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
On Mon, Mar 24, 2025 at 07:42:21PM +1300, David Rowley wrote: > Can you share which patch you tested here? There are many patches on > this thread and I don't want to make assumptions about which one these > are the results for. The patch tested is the latest one that has been posted on this thread, for a comparison of HEAD at 8a3e4011f02d and this 0001 (no 0002 which was forcing a loop to make the jumbling heavier): https://www.postgresql.org/message-id/z9z85ui5lpkpn...@paquier.xyz The message holding the patch tested is the one I've replied to when posting the results shared here: https://www.postgresql.org/message-id/z-db83o5azmgn...@paquier.xyz -- Michael signature.asc Description: PGP signature
Re: Add mention in docs about locking all partitions for generic plans
David Rowley wrote on Mon, Mar 24, 2025 at 05:28: > Over in [1], there was some uncertainty about whether locking an > unrelated partition was expected behaviour or not for the particular > use-case which used a generic plan to scan a partitioned table and all > of the partitions. > > I noticed that we don't mention this in the docs and thought that > perhaps we should. I thought a good place might be in [2] at the end > of the following paragraph: > > "During initialization of the query plan. Partition pruning can be > performed here for parameter values which are known during the > initialization phase of execution. Partitions which are pruned during > this stage will not show up in the query's EXPLAIN or EXPLAIN ANALYZE. > It is possible to determine the number of partitions which were > removed during this phase by observing the “Subplans Removed” property > in the EXPLAIN output." > > Perhaps adding something like "Because all relations which are part of > the plan are locked at the beginning of execution, any partitions > pruned at this stage are already locked and will remain so until the > end of the transaction.". > > or: > > "It's important to note that any partitions removed by the partition > pruning done at this stage are still locked at the beginning of > execution". I prefer this. > This is no longer true in master, so if we do something here it's only > v17 and earlier. In the case of [1], we still have AccessShareLock on entity_2, even though it is pruned during initial partition pruning. This seems to contradict what you said: "This is no longer true in master". [1] https://postgr.es/m/01020195b987abd3-a008b77d-8c63-4931-80a4-be36a351c8b2-000...@eu-west-1.amazonses.com -- Thanks, Tender Wang
Re: Commit fest 2025-03
On Mon, 17 Mar 2025 at 09:26, vignesh C wrote:
> Here's a quick commitfest status report as of today:

status               | start | 10th | 17th | 24th
---------------------+-------+------+------+------
Needs review:        |   198 |  182 |  134 |  120
Waiting on Author:   |    37 |   35 |   69 |   62
Ready for Committer: |    33 |   33 |   34 |   34
Committed:           |    52 |   67 |   75 |   90
Moved to next CF:    |     7 |    7 |   11 |   14
Withdrawn:           |    12 |   15 |   15 |   18
Rejected:            |     1 |    1 |    3 |    3
RWF:                 |     2 |    2 |    2 |    2
Total:               |   343 |  343 |  343 |  343

There are currently 34 patches in the "Ready for Committer" status at [1]. If any committers have bandwidth, please consider reviewing and advancing them. If you have submitted a patch that is in the "Waiting on Author" state, please update it to "Needs review" as soon as possible. This is where reviewers are most likely to focus their efforts. There are 62 patches in the "Waiting on Author" state at [2]. Additionally, 13 "Needs review" patches require a rebase due to recent commits, see [3]. Patch owners are requested to rebase and submit updated versions. 15 patches have been committed in the last week. [1] - https://commitfest.postgresql.org/52/?status=3 [2] - https://commitfest.postgresql.org/52/?status=2 [3] - https://commitfest.postgresql.org/52/?status=1 Regards, Vignesh
Re: Orphaned users in PG16 and above can only be managed by Superusers
Thank you, Robert and Tom, for sharing your valuable insights, and apologies for the slight delay in my response. From the discussion, what I understand is that we aim to extend the current DROP ROLE syntax to include the CASCADE/RESTRICT option, which has been introduced in the latest SQL standard, as mentioned by Tom. However, as Robert pointed out, implementing the CASCADE option for handling dependent objects that span multiple databases is not feasible at the moment. The RESTRICT option, as I understand it, is already the default behavior. Therefore, we will proceed with implementing the CASCADE/RESTRICT syntax specifically for handling dependent roles, rather than the dependent database objects like tables, views, etc., which can span multiple databases. Please correct me if I’m mistaken or if there’s anything I’ve missed in my understanding. thanks. -- With Regards, Ashutosh Sharma.
Re: Patch: Cover POSITION(bytea,bytea) with tests
On Tue, 18 Mar 2025 at 03:14, Aleksander Alekseev wrote: > Here is the corrected patch. I had a look at this and it all seems good to me. Pushed. David
Re: Parallel safety for extern params
On Sat, Mar 22, 2025 at 1:25 AM Tom Lane wrote: > > Robert Haas writes: > > I'm happy to have you tidy up here in whatever way seems best to you. > > Cool, done. > Thanks for taking care of this, and sorry for not digging deeper to find the appropriate fix. -- With Regards, Amit Kapila.
RE: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.
Dear Shubham, > I have reviewed and merged the proposed changes into the patch. The > attached patches contain the suggested changes. Thanks for updating the patch! A few comments:

01.
```
+	/*
+	 * Fetch all databases from the source (publisher) if --all is specified.
+	 */
+	if (opt.all_dbs)
+		fetch_source_databases(&opt);
+	if (opt.database_names.head == NULL)
```
I feel we can convert "if" -> "else if" for the opt.database_names.head case, because fetch_source_databases() ensures databases are listed.

02.
```
+# Set up node U as standby linking to node
```
To confirm: why can we use "U" here?

03.
```
+$node_u->append_conf(
+	'postgresql.conf', qq[
+primary_conninfo = '$pconnstr dbname=postgres'
+]);
```
I think this part is not required. init_from_backup() with has_streaming sets the primary_conninfo.

04.
```
+# Stop $node_s
+$node_s->stop;
```
The comment does not describe anything. I feel you can say "Stop node S because it won't be used anymore", or just remove it.

05.
```
+# Drop one of the databases (e.g., $db2)
+$node_p->safe_psql('postgres', "DROP DATABASE \"$db2\"");
```
"e.g." means "for example", so it is not suitable here. Please explain why the database must be removed.

06.
```
+# Get subscription names
+$result = $node_u->safe_psql(
+	'postgres', qq(
+	SELECT subname FROM pg_subscription WHERE subname ~ '^pg_createsubscriber_'
+));
+my @subnames1 = split("\n", $result);
+
+# Wait subscriber to catch up
+$node_u->wait_for_subscription_sync($node_p, $subnames1[0]);
+$node_u->wait_for_subscription_sync($node_p, $subnames1[1]);
```
wait_for_subscription_sync() is used to wait until the initial synchronization is done, so it is not suitable here. wait_for_catchup() is more appropriate.

07. Also, a similar part exists in the pre-existing code (wait_for_subscription_sync() was used there, but I feel that may not be correct). How about unifying them as attached?

08.
```
+# Verify logical replication works for all databases
+# Insert rows on node P
+$node_p->safe_psql('postgres', "INSERT INTO tbl1 VALUES('first row')");
```
This is confusing because 'fourth row' is inserted into $db1 just after it. How about 'row in database postgres' or something?

Best regards, Hayato Kuroda FUJITSU LIMITED kuroda.diffs Description: kuroda.diffs
Re: Doc: fix the rewrite condition when executing ALTER TABLE ADD COLUMN
hi. https://git.postgresql.org/cgit/postgresql.git/commit/?id=11bd8318602fc2282a6201f714c15461dc2009c6 + Adding a column with a volatile DEFAULT + (e.g., clock_timestamp()), a generated column + (e.g., GENERATED BY DEFAULT AS IDENTITY), or a domain + data type with constraints will require the entire table and its + indexes to be rewritten, as will changing the type of an existing + column. As an exception, when changing the type of an existing column, + if the USING clause does not change the column + contents and the old type is either binary coercible to the new type + or an unconstrained domain over the new type, a table rewrite is not + needed. In the current development branch, virtual generated columns still do not support domains, and you cannot change the generation expression if there is a check constraint on the column, so virtual generated columns never need a rewrite. IMHO, the committed doc description doesn't mention this exception. Do we need some text to cover it?
Re: [PoC] Reducing planning time when tables have many partitions
Thank you for addressing those comments. On Mon, 24 Mar 2025 at 12:24, Yuya Watari wrote: > It is true that currently, indexes for EquivalenceMembers do not store > information about base rels. However, the subsequent commit (v35-0004) > introduces indexes for base rels to enable faster RestrictInfo > lookups. Therefore, if we commit the later patch as well, the comment > will remain accurate. What do you think about this? I understand Ashutosh would like to handle the RestrictInfo speedup another way, so there's additional review work to do there to determine the merits of each method and figure out the best method. I'm worried that means we don't get to fix this part for v18, and if that happens and 0002 goes in alone, then we'd be left with a struct with a single field. Maybe you should adjust the patch series and only introduce the new struct in 0004 where it's required. > I agree it's challenging to redesign the union planner at this stage > of the release cycle. For now, I have proposed a workaround solution > (v35-0003, previously v34-0002). Would this workaround be acceptable > for the current release cycle? I think something like that is probably ok. You have a problem with your implementation as you're trying to add the AppendRelInfo once for each child_tlist element rather than once per union child. Can you fix this and incorporate it into the 0002 patch, please? Here are some more review comments for v35-0002: 1. I don't think the header comment for eclass_member_iterator_next() needs to mention setup_eclass_member_iterator_with_children(). The renaming you did in v35 is meant to make it so the eclass_member_iterator_next and dispose_eclass_member_iterator() functions don't care about what set up the iterator. We might end up with new ones in the future and this seems like a comment that might not get updated when that happens. 2. You should use list_free() in the following:

/*
 * XXX Should we use list_free()? I decided to use this style to take
 * advantage of speculative execution.
 */
if (unlikely(it->list_is_copy))
	pfree(it->ec_members);

The reason is that you're wrongly assuming that calling pfree on the List pointer is enough to get rid of all memory used by the list. The List may have a separately allocated elements[] array (this happens when there's > 5 elements) which you're leaking with the current code. I assume the speculative execution comment is there because you want to omit the "list == NULL" check in list_free_private. Is this measurable, performance-wise? 3. Maybe I'm missing something, but I'm confused about the need for the eclass_indexes_array field in PlannerInfo. This array is indexed by the relid, so why can't we get rid of the array and add a field to RelOptInfo to store the EquivalenceClassIndexes? 4. Could you also please run another set of benchmarks against current master with the v36 patches: master, master + v36-0001 + 0002, master + v36-0001 + 0002 + 0003 (0003 will be the v34-0004 patch), and then also with v36-0004 (which is the same as v35-0005). The main thing I'd like to understand here is if there's not enough time to get the entire patch set committed, is there much benefit to just having the EquivalenceMember index stuff in by itself without the RestrictInfo changes. David
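On point 2, a minimal sketch of the suggested fix (the field names come from the quoted iterator code):

```c
/*
 * A List that has grown past its initial allocation keeps its
 * elements[] array in a separate palloc chunk; pfree(list) releases
 * only the header and leaks that array.  list_free() releases both.
 */
if (unlikely(it->list_is_copy))
	list_free(it->ec_members);
```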
RE: Fix 035_standby_logical_decoding.pl race conditions
Dear Bertrand, > > SIGSTOP signal for pg_recvlogical may do the idea, > > Yeah, but would we be "really" testing an "active" slot? Yeah, this is also a debatable point. > At the end we want to produce an invalidation that may or not happen on a real > environment. The corner case is in the test, not an issue of the feature to > fix. I also think this is a test issue, not a codebase one. > So, I'm not sure I like the idea that much, but thinking out loud: I wonder if > we could bypass the "active" slot checks in 16 and 17 and use injection > points as > proposed as of 18 (as we need the injection points changes proposed in 0001 > up-thread). Thoughts? I do not have any other ideas either. I checked that your patch set solves the issue. Comments on the patch: I'm not sure whether a new API is really needed. Isn't it enough to use both injection_points_wakeup() and injection_points_detach()? This approach does not require bumping the version, and it can be backported to PG17. Also, another check is required for whether the extension can be installed on the node. Please see 041_checkpoint_at_promote.pl. Best regards, Hayato Kuroda FUJITSU LIMITED
Re: AIO v2.5
On Sun, Mar 23, 2025 at 11:11:53AM -0400, Andres Freund wrote: > On 2025-03-22 17:20:56 -0700, Noah Misch wrote: > > On Thu, Mar 20, 2025 at 09:58:37PM -0400, Andres Freund wrote: > > > Not sure yet how to best disable testing io_uring in this case. We can't > > > just query EXEC_BACKEND from pg_config.h unfortunately. I guess making > > > the > > > initdb not fail and checking the error log would work, but that doesn't > > > work > > > nicely with Cluster.pm. > > > > How about "postgres -c io_method=io_uring -C ": > > > > --- a/src/test/modules/test_aio/t/001_aio.pl > > +++ b/src/test/modules/test_aio/t/001_aio.pl > > @@ -29,7 +29,13 @@ $node_worker->stop(); > > # Test io_method=io_uring > > ### > > > > -if ($ENV{with_liburing} eq 'yes') > > +sub have_io_uring > > +{ > > + local %ENV = $node_worker->_get_env(); # any node works > > + return run_log [qw(postgres -c io_method=io_uring -C io_method)]; > > +} > > + > > +if (have_io_uring()) > > { > > my $node_uring = create_node('io_uring'); > > $node_uring->start(); > > Yea, that's a good idea. > > One thing that doesn't seem great is that it requires a prior node - what if > we do '-c io_method=invalid', which would report the list of valid GUC options, > so we could just grep for io_uring? Works for me. > > One idea so far is to comment on valid states after some IoMethodOps > > callbacks: > I think these are a good idea. I added those to the copy-edit patch, with a > few more tweaks: The tweaks made it better. > > > Subject: [PATCH v2.11 04/27] aio: Add liburing dependency > > > + [AC_DEFINE([USE_LIBURING], 1, [Define to build with > > io_uring support. (--with-liburing)])]) > > > +AC_MSG_RESULT([$with_liburing]) > > > +AC_SUBST(with_liburing) > > > > > > # > > > # UUID library > > > @@ -1463,6 +1471,9 @@ elif test "$with_uuid" = ossp ; then > > > fi > > > AC_SUBST(UUID_LIBS) > > > > > > +if test "$with_liburing" = yes; then > > > + PKG_CHECK_MODULES(LIBURING, liburing) > > > +fi > > We usually put this right after the AC_MSG_CHECKING ... AC_SUBST block. > We don't really seem to do that for "dependency checks" in general, e.g. > PGAC_CHECK_PERL_CONFIGS, PGAC_CHECK_PYTHON_EMBED_SETUP, PGAC_CHECK_READLINE, > dependency dependent AC_CHECK_LIB calls, .. later in configure.ac than the > definition of the option. AC_CHECK_LIB stays far away, yes. > But you're right that the PKG_CHECK_MODULES calls are closer-by. And I'm happy > to move towards having the code for each dep all in one place, so moved. > > A related thing: We seem to have no order of the $with_ checks that I can > discern. Should the liburing check be at a different place? No opinion on that one. It's fine. > > This currently has unrelated stuff separating them. Also, with the > > exception of icu, we follow PKG_CHECK_MODULES uses by absorbing flags from > > pkg-config and use AC_CHECK_LIB to add the actual "-l". > I think for liburing I was trying to follow ICU's example - injecting CFLAGS > and LIBS just in the parts of the build dir that needs them. > > For LIBS I think I did so: > > diff --git a/src/backend/Makefile b/src/backend/Makefile > ... > +# The backend conditionally needs libraries that most executables don't need. > +LIBS += $(LDAP_LIBS_BE) $(ICU_LIBS) $(LIBURING_LIBS) > > But ugh, for some reason I didn't do that for LIBURING_CFLAGS. 
In the v1.x > version of aio I had > aio:src/backend/storage/aio/Makefile:override CPPFLAGS += $(LIBURING_CFLAGS) > > but somehow lost that somewhere along the way to v2.x > > > I think I like targetting where ${LIB}_LIBS and ${LIB}_CFLAGS are applied more > narrowly better than just adding to the global CFLAGS, CPPFLAGS, LDFLAGS. Agreed. > somewhat inclined to add it LIBURING_CFLAGS in src/backend rather than > src/backend/storage/aio/ though. > > But I'm also willing to do it entirely differently. The CPPFLAGS addition, located wherever makes sense, resolves that point. > > > --- a/doc/src/sgml/installation.sgml > > > +++ b/doc/src/sgml/installation.sgml > > > > lz4 and other deps have a mention in , in > > addition to sections edited here. > > Good point. > > Although once more I feel defeated by the ordering used :) > > Hm, that list is rather incomplete. At least libxml, libxslt, selinux, curl, > uuid, systemd, selinux and bonjour aren't listed. > > Not sure if it makes sense to add liburing, given that? That's a lot of preexisting incompleteness. I withdraw the point about . Unrelated to the above, another question about io_uring: commit da722699 wrote: > +/* > + * Need to submit staged but not yet submitted IOs using the fd, otherwise > + * the IO would end up targeting something bogus. > + */ > +void > +pgaio_closing_fd(int fd) An IO in PGAIO_HS_STAGED clearly blocks closing the IO's FD, and an IO in PGAIO_HS_COMPLETED_IO clearly doesn't block that close. For io_method=worker, closing in PGAIO_HS_SUBMITTED is okay. For io_method=io_uring, is
Re: Fix infinite loop from setting scram_iterations to INT_MAX
On Mon, Mar 24, 2025 at 09:50:36AM +0900, Richard Guo wrote: > Nice catch. The fix looks good to me. It seems to me that it's fine > to go without a test case, since the fix is quite straightforward. One could argue for using an injection point to trick the iteration loop into being cheaper, but I could not really get excited about the coverage vs. the cycles spent on it.. -- Michael signature.asc Description: PGP signature
Re: Fix infinite loop from setting scram_iterations to INT_MAX
On Sun, Mar 23, 2025 at 10:41 PM Kevin K Biju wrote:
> int i;
> ...
> for (i = 2; i <= iterations; i++)
> {
>     ...
> }
>
> For iterations = INT_MAX, the loop will never terminate since the condition is <= and adding 1 to INT_MAX will lead to i wrapping around to INT_MIN.
>
> I've fixed this by modifying the loop condition to be i < iterations. I've attached a patch with the fix. I considered adding a test as well, but since generating a password with a high number of iterations is very time-consuming, I'm not sure if that would be practical.

Nice catch. The fix looks good to me. It seems to me that it's fine to go without a test case, since the fix is quite straightforward. Thanks Richard
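For reference, a minimal standalone sketch of the wraparound and of why a "<" comparison is safe (the variable and helper names here are illustrative, not the scram code itself):

```c
int	i;

/*
 * With bound == INT_MAX, "i <= bound" never becomes false: after the
 * body runs with i == INT_MAX, i++ overflows (undefined behavior, in
 * practice wrapping to INT_MIN) and the loop keeps going.
 */
for (i = 2; i <= bound; i++)	/* never terminates for bound == INT_MAX */
	do_round();

/*
 * With "<", i never needs to take a value above bound, so no overflow
 * can occur; the start index must be adjusted so the total number of
 * rounds stays the same.
 */
for (i = 1; i < bound; i++)		/* terminates */
	do_round();
```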
Re: Fix infinite loop from setting scram_iterations to INT_MAX
On Mon, Mar 24, 2025 at 9:54 AM Michael Paquier wrote: > On Mon, Mar 24, 2025 at 09:50:36AM +0900, Richard Guo wrote: > > Nice catch. The fix looks good to me. It seems to me that it's fine > > to go without a test case, since the fix is quite straightforward. > > One could argue about using an injection point to force trick the > iteration loop to be cheaper, but I could not really get excited about > the coverage vs the cycles spent for it.. Exactly. Thanks Richard
Re: Add semi-join pushdown to postgres_fdw
Hi, Alexander! On Tue, Mar 18, 2025 at 6:04 PM Alexander Pyhalov wrote: > This shouldn't. When semi-join is found below left/right join, it's > deparsed as subquery. > Interestingly enough, this mechanism (deparsing as subquery) is used > 1) for semi-joins under left/right join, > 2) for full outer joins when inner or outer part has some remote_conds. > > The issue here is that after the subquery is deparsed, we don't consider if > its target attributes are available to the upper-level > join. As for the semi-join itself, all conditions are still deparsed at the > left/right join boundary, they are just not propagated further. > This shouldn't be a problem, as they are evaluated in the subquery. As for > a left/right join without a semi-join beneath it - its behavior is not > affected > (as hidden_subquery_rels is empty). Thank you for the explanation. But I have another question. Aren't the checks you've proposed too strict? hidden_subquery_rels are propagated all the way up the join tree, so pulling conditions would be disabled all the way up the join tree too. Is it enough to just disable pulling conditions directly from semi-joins, so that their further pulls are disabled automatically? See the attached patch. It also contains other (mostly cosmetic) improvements. -- Regards, Alexander Korotkov Supabase v2-0001-Avoid-pulling-up-restrict-infos-from-subqueries.patch Description: Binary data
Re: [PATCH] SVE popcount support
Looks good, the code is more readable now. > For both Neon and SVE, I do see improvements with looping over 4 > registers at a time, so IMHO it's worth doing so even if it performs the > same as 2-register blocks on some hardware. There was no regression on Graviton 3 when using the 4-register version, so we can keep it. -Chiranmoy
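For anyone curious what "looping over 4 registers at a time" means here, a hedged sketch using the ACLE SVE intrinsics (an illustration of the unrolling idea, not the patch's actual code):

```c
#include <stddef.h>
#include <stdint.h>
#include <arm_sve.h>

static uint64_t
popcount_u64_sve_sketch(const uint64_t *p, size_t n)
{
	svbool_t	pred = svptrue_b64();
	svuint64_t	acc0 = svdup_u64(0);
	svuint64_t	acc1 = svdup_u64(0);
	svuint64_t	acc2 = svdup_u64(0);
	svuint64_t	acc3 = svdup_u64(0);
	size_t		lanes = svcntd();	/* 64-bit lanes per SVE register */
	size_t		i = 0;
	uint64_t	total;

	/* Main loop: four independent accumulators hide instruction latency. */
	for (; i + 4 * lanes <= n; i += 4 * lanes)
	{
		acc0 = svadd_u64_x(pred, acc0,
						   svcnt_u64_x(pred, svld1_u64(pred, p + i)));
		acc1 = svadd_u64_x(pred, acc1,
						   svcnt_u64_x(pred, svld1_u64(pred, p + i + lanes)));
		acc2 = svadd_u64_x(pred, acc2,
						   svcnt_u64_x(pred, svld1_u64(pred, p + i + 2 * lanes)));
		acc3 = svadd_u64_x(pred, acc3,
						   svcnt_u64_x(pred, svld1_u64(pred, p + i + 3 * lanes)));
	}

	/* Reduce the four per-lane accumulators to a scalar. */
	total = svaddv_u64(pred,
					   svadd_u64_x(pred,
								   svadd_u64_x(pred, acc0, acc1),
								   svadd_u64_x(pred, acc2, acc3)));

	/* Scalar tail for any remainder. */
	for (; i < n; i++)
		total += __builtin_popcountll(p[i]);

	return total;
}
```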
Re: Proposal - Allow extensions to set a Plan Identifier
On Sat, Mar 22, 2025 at 10:22:37PM -0500, Sami Imseih wrote: > I think plan_node_id is probably the least controversial because that value > comes straight from core, and different extensions cannot have their own > interpretation of what that value could be. Depends. An extension can plug in what they want. The point is that the key used to identify a single plan is up to what extensions think is relevant in a plan. That's heavily subject to interpretation. What's not really subject to interpretation is that an extension cannot know what it should set and/or use as the key identifier without something that some portion of the code structure knows about, or unless these extensions have an inter-dependency. Anyway, there are also the arguments about the set timing, reset timing, the extended protocol argument, etc. So I've applied the patch for now, to start with something. > FWIW, Lukas did start a Wiki [0] to open the discussion for what parts > of the plan should be used to compute a plan_id, and maybe we can > in the future compute a plan_id in core by default. Let's see where this leads.. I suspect that this is going to take some time, assuming that we're ever able to settle on a clear definition. Perhaps we will, or perhaps we will not. -- Michael signature.asc Description: PGP signature