Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-11 Thread Nisha Moond
On Tue, Feb 11, 2025 at 11:42 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Monday, February 10, 2025 8:03 PM Nisha Moond  
> wrote:
> >
> > On Sat, Feb 8, 2025 at 12:28 PM Zhijie Hou (Fujitsu) 
> > 
> > wrote:
> > >
> >
> > > 3.
> > >
> > > +   if (cause & RS_INVAL_HORIZON)
> > > +   {
> > > +   if (!SlotIsLogical(s))
> > > +   goto invalidation_marked;
> > >
> > > I am not sure if this logic is correct. Even if the slot would not be
> > > invalidated due to RS_INVAL_HORIZON, we should continue to check other
> > causes.
> > >
> >
> > Used goto here since we do not expect RS_INVAL_HORIZON to be combined
> > with any other "cause" and to keep the pgHead behavior.
> > However, with the bitflag approach, the code should be future-safe, so
> > replacing goto in v73 should handle this now.
>
> I think the following logic needs some adjustments.
>
> +   if (invalidation_cause == RS_INVAL_NONE &&
> +       (possible_causes & RS_INVAL_HORIZON))
> +   {
> +       if (SlotIsLogical(s) &&
> +           /* invalid DB oid signals a shared relation */
> +           (dboid == InvalidOid || dboid == s->data.database) &&
> +           TransactionIdIsValid(initial_effective_xmin) &&
> +           TransactionIdPrecedesOrEquals(initial_effective_xmin,
> +                                         snapshotConflictHorizon))
> +           invalidation_cause = RS_INVAL_HORIZON;
> +       else if (TransactionIdIsValid(initial_catalog_effective_xmin) &&
> +                TransactionIdPrecedesOrEquals(initial_catalog_effective_xmin,
> +                                              snapshotConflictHorizon))
> +           invalidation_cause = RS_INVAL_HORIZON;
> +   }
>
> I think we should assign RS_INVAL_HORIZON to invalidation_cause only when the
> slot is logical and the database check passes, but that is not guaranteed in
> the second branch ("else if (TransactionIdIsValid(initial_catalog_effective_xmin)").
>
> Here is a top-up patch to fix this.

Thank you for reviewing and providing the fix! v75 addresses this bug
with a slightly different approach after introducing the new function
EvaluateSlotInvalidationCause().
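
For readers following along, a rough sketch of the intended structure is below,
with the logical-slot and database checks hoisted so that they guard both xmin
branches (identifiers are taken from the snippet above; the actual v75 code
lives in EvaluateSlotInvalidationCause() and may differ):

    if (invalidation_cause == RS_INVAL_NONE &&
        (possible_causes & RS_INVAL_HORIZON) &&
        SlotIsLogical(s) &&
        /* invalid DB oid signals a shared relation */
        (dboid == InvalidOid || dboid == s->data.database))
    {
        if (TransactionIdIsValid(initial_effective_xmin) &&
            TransactionIdPrecedesOrEquals(initial_effective_xmin,
                                          snapshotConflictHorizon))
            invalidation_cause = RS_INVAL_HORIZON;
        else if (TransactionIdIsValid(initial_catalog_effective_xmin) &&
                 TransactionIdPrecedesOrEquals(initial_catalog_effective_xmin,
                                               snapshotConflictHorizon))
            invalidation_cause = RS_INVAL_HORIZON;
    }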

--
Thanks,
Nisha




Re: proposal - plpgsql - support standard syntax for named arguments for cursors

2025-02-11 Thread Pavel Stehule
Hi

so 8. 2. 2025 v 22:25 odesílatel Tom Lane  napsal:

> Pavel Stehule  writes:
> > so 8. 2. 2025 v 20:25 odesílatel Tom Lane  napsal:
> >> Is there any reason to think that that's actually in the standard?
>
> > I think the possibility to use named arguments in OPEN statements is a
> > PostgreSQL proprietary feature.
> > And usage of cursors in PL/pgSQL is based on PL/SQL (not on SQL/PSM from
> > standard), but named
> > arguments for cursor is PostgreSQL proprietary feature and the syntax
> based
> > on usage `:=` is our
> > proprietary too.
>
> Hmm ... yeah, it's not in SQL/PSM, but looking at PL/SQL:
>
>
> https://docs.oracle.com/en/database/oracle/oracle-database/19/lnpls/OPEN-statement.html
>
> I see
>
> You can specify actual cursor parameters with either
> positional notation or named notation. For information about
> these notations, see "Positional, Named, and Mixed Notation
> for Actual Parameters".
>
> and that link blesses the use of "name => value" (and not ":=").
> So agreed, we should adjust this.
>
> Is there a reason we need a whole new test case instead of
> tweaking one of the existing ones?
>
>
I changed the regression tests as you proposed

Regards

Pavel


> regards, tom lane
>
From 938bd5007b62c854e423b696ac42c680fa28178e Mon Sep 17 00:00:00 2001
From: "ok...@github.com" 
Date: Sat, 8 Feb 2025 07:45:58 +0100
Subject: [PATCH] allow to use standard syntax for named arguments for plpgsql
 cursor arguments

---
 doc/src/sgml/plpgsql.sgml |  3 ++-
 src/pl/plpgsql/src/pl_gram.y  | 13 -
 src/test/regress/expected/plpgsql.out |  3 ++-
 src/test/regress/sql/plpgsql.sql  |  3 ++-
 4 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml
index 78e4983139b..60af57712b7 100644
--- a/doc/src/sgml/plpgsql.sgml
+++ b/doc/src/sgml/plpgsql.sgml
@@ -3317,7 +3317,7 @@ OPEN curs1 FOR EXECUTE format('SELECT * FROM %I WHERE col1 = $1',tabname) USING
  Opening a Bound Cursor
 
 
-OPEN bound_cursorvar  (  argument_name :=  argument_value , ... ) ;
+OPEN bound_cursorvar  (  argument_name { := | => }  argument_value , ... ) ;
 
 
  
@@ -3352,6 +3352,7 @@ OPEN bound_cursorvar  (   42);
 
  
 
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 64d2c362bf9..618a97b64f2 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -3911,9 +3911,12 @@ read_cursor_args(PLpgSQL_var *cursor, int until, YYSTYPE *yylvalp, YYLTYPE *yyll
 	tok2;
 		int			arglocation;
 
-		/* Check if it's a named parameter: "param := value" */
+		/*
+		 * Check if it's a named parameter: "param := value"
+		 * or "param => value"
+		 */
 		plpgsql_peek2(&tok1, &tok2, &arglocation, NULL, yyscanner);
-		if (tok1 == IDENT && tok2 == COLON_EQUALS)
+		if (tok1 == IDENT && (tok2 == COLON_EQUALS || tok2 == EQUALS_GREATER))
 		{
 			char	   *argname;
 			IdentifierLookup save_IdentifierLookup;
@@ -3939,11 +3942,11 @@ read_cursor_args(PLpgSQL_var *cursor, int until, YYSTYPE *yylvalp, YYLTYPE *yyll
 		 parser_errposition(*yyllocp)));
 
 			/*
-			 * Eat the ":=". We already peeked, so the error should never
-			 * happen.
+			 * Eat the ":=" and the "=>". We already peeked, so the error should
+			 * never happen.
 			 */
 			tok2 = yylex(yylvalp, yyllocp, yyscanner);
-			if (tok2 != COLON_EQUALS)
+			if (tok2 != COLON_EQUALS && tok2 != EQUALS_GREATER)
 yyerror(yyllocp, NULL, yyscanner, "syntax error");
 
 			any_named = true;
diff --git a/src/test/regress/expected/plpgsql.out b/src/test/regress/expected/plpgsql.out
index 0a6945581bd..31aa806e491 100644
--- a/src/test/regress/expected/plpgsql.out
+++ b/src/test/regress/expected/plpgsql.out
@@ -2419,7 +2419,8 @@ declare
   p2 int4 := 1006;
   n int4;
 begin
-  open c1 (p1 := p1, p2 := p2, debug := 2);
+  -- use both supported syntaxes for named arguments
+  open c1 (p1 := p1, p2 => p2, debug => 2);
   fetch c1 into n;
   return n;
 end $$ language plpgsql;
diff --git a/src/test/regress/sql/plpgsql.sql b/src/test/regress/sql/plpgsql.sql
index 18c91572ae1..aaae1e44c6f 100644
--- a/src/test/regress/sql/plpgsql.sql
+++ b/src/test/regress/sql/plpgsql.sql
@@ -2072,7 +2072,8 @@ declare
   p2 int4 := 1006;
   n int4;
 begin
-  open c1 (p1 := p1, p2 := p2, debug := 2);
+  -- use both supported syntaxes for named arguments
+  open c1 (p1 := p1, p2 => p2, debug => 2);
   fetch c1 into n;
   return n;
 end $$ language plpgsql;
-- 
2.48.1



Fix possible resource leak (src/bin/pg_basebackup/pg_receivewal.c)

2025-02-11 Thread Ranier Vilela
Hi.

Per Coverity.

CID 1591288: (#1 of 1): Resource leak (RESOURCE_LEAK)
10. leaked_storage: Variable sysidentifier going out of scope leaks the
storage it points to.

Trivial patch attached.

best regards,
Ranier Vilela


fix-resource-leak-pg_receivewal.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-11 Thread Nisha Moond
On Tue, Feb 11, 2025 at 8:49 AM Peter Smith  wrote:
>
> Hi Nisha.
>
> Some review comments about v74-0001
>
> ==
> src/backend/replication/slot.c
>
> 1.
>  /* Maximum number of invalidation causes */
> -#define RS_INVAL_MAX_CAUSES RS_INVAL_WAL_LEVEL
> -
> -StaticAssertDecl(lengthof(SlotInvalidationCauses) == (RS_INVAL_MAX_CAUSES + 
> 1),
> - "array length mismatch");
> +#define RS_INVAL_MAX_CAUSES (lengthof(InvalidationCauses)-1)
>
> The static assert was here to protect against dev mistakes in keeping
> the lookup table up-to-date with the enum of slot.h. So it's not a
> good idea to remove it...
>
> IMO the RS_INVAL_MAX_CAUSES should be relocated to slot.h where the
> enum is defined and where the devs know exactly how many invalidation
> types there are. Then this static assert can be put back in to do its
> job of ensuring the integrity properly again for this lookup table.
>

How about keeping RS_INVAL_MAX_CAUSES dynamic in slot.c (as it was)
and updating the static assert to ensure the lookup table stays
up-to-date with the enums?
The change has been implemented in v75.
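
As a sketch of that arrangement (identifiers are assumptions; in particular the
last enum member depends on what the patch series adds):

    #define RS_INVAL_MAX_CAUSES (lengthof(SlotInvalidationCauses) - 1)

    /* keep the lookup table in sync with ReplicationSlotInvalidationCause */
    StaticAssertDecl(RS_INVAL_MAX_CAUSES == RS_INVAL_IDLE_TIMEOUT,
                     "array length mismatch");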

> ~~~
>
> ==

Here are v75 patches addressing comments in [1], [2] and [3].
 - A new function, "EvaluateSlotInvalidationCause()", has been
introduced to separate the invalidation_cause evaluation logic from
InvalidatePossiblyObsoleteSlot().
 - Also, another new inline function "CalculateTimeDuration()" added
as suggested in [3].

[1] 
https://www.postgresql.org/message-id/CAHut%2BPvsvHWoiEkGTP4NfVNsADsy-Jan3Dvp%2B_GW3gmPDHf5Qw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/OS0PR01MB57163889BE5F9F30DD3318A394FD2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[3] 
https://www.postgresql.org/message-id/CAA4eK1LuvXa6sVj3xuLoe2X%3D0xjbJXrnJePbpXQZaTMws8pZqg%40mail.gmail.com

--
Thanks,
Nisha


v75-0001-Introduce-inactive_timeout-based-replication-slo.patch
Description: Binary data


v75-0002-Add-TAP-test-for-slot-invalidation-based-on-inac.patch
Description: Binary data


Re: NOT ENFORCED constraint feature

2025-02-11 Thread Álvaro Herrera
On 2025-Feb-10, Isaac Morland wrote:

> I'm having a lot of trouble understanding the operational distinction
> between your 'u' and 'U'. If it's not enforced, it cannot be assumed to be
> valid, regardless of whether it was valid in the past. I'm not sure what I
> think of a single character vs. 2 booleans, but there are only 3 sensible
> states either way: valid enforced, invalid enforced, and invalid unenforced.

I kinda agree with you and would prefer that things were that way as
well.  But look at the discussion starting at
https://postgr.es/m/CAExHW5tV23Sw+Nznv0KpdNg_t7LrXY1WM9atiC=ekksskhs...@mail.gmail.com
whereby it was apparently established that if you have a 
NOT VALID NOT ENFORCED
constraint, and you make it enforced, then you should somehow end up
with a NOT VALID ENFORCED constraint, which says to me that we need to
store the fact that the constraint was NOT VALID to start with; and
correspondingly if it's VALID NOT ENFORCED and you enforce it, then it
ends up VALID ENFORCED.  If we take this view of the world (with which,
I repeat, I disagree) then we must keep track of whether the constraint
was valid or not valid to start with.  And this means that we need to
keep convalidated=true _regardless_ of whether conenforced is false.
So in this view of the world there aren't three states but four.

I would prefer there to be three states as well, but apparently I'm
outvoted on this.

> Additionally, if there are officially 4 status possibilities then all code
> that looks for unenforced constraints has to look for both valid and
> invalid unenforced constraints if we use a char; it's not as bad with 2
> booleans because one can just check the "enforced" boolean.

Well, yes.  You have kinda the same issue with any other system catalog
column that's a 'char', I guess.  Maybe this is more of a problem here
because it's more user-visible than most other catalogs, not sure.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: pull-up subquery if JOIN-ON contains refs to upper-query

2025-02-11 Thread Alena Rybakina

On 10.02.2025 23:51, Ilia Evdokimov wrote:


On 09.02.2025 18:14, Alena Rybakina wrote:
Hi! I found another example where the transformation worked 
incorrectly and reconsidered the idea.


As for the conversion in exists_sublink_to_ANY, we need to get the flattened
implicit-AND list of clauses and pull out the chunks of the WHERE clause that
belong to the parent query, since we are called halfway through the parent's
preprocess_expression() and the earlier steps of preprocess_expression()
wouldn't get applied to the pulled-up stuff unless we do them here.
We also do some processing of Vars depending on which side the Var is on: if
it is in the subquery, we only need to lower its level (varlevel) because the
subquery will be flattened, while for Vars that belong to the parent query we
need to prepare them so the sub-select can be pulled up into the top range
table.


For those expressions that we couldn't assign to either list, we define
newWhere and apply both cases.




When I run 'make -C contrib/ check', the postgres_fdw extension tests failed.
I might be wrong, but you should be careful with LIMIT.



Thank you for the review, I'm working on it.

--
Regards,
Alena Rybakina
Postgres Professional





Re: RFC: Allow EXPLAIN to Output Page Fault Information

2025-02-11 Thread Jelte Fennema-Nio
On Tue, 11 Feb 2025 at 16:36, Andres Freund  wrote:
> Shrug. It means that it'll not work in what I hope will be the default
> mechanism before long.  I just can't get excited for that. In all likelihood
> it'll result in bug reports that I'll then be on the hook to fix.

My assumption was that io_uring would become the default mechanism on
Linux. And that bgworker based AIO would generally only be used by
platforms without similar functionality. Is that assumption incorrect?




Re: Eagerly scan all-visible pages to amortize aggressive vacuum

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-10 14:30:15 -0500, Robert Haas wrote:
> On Wed, Feb 5, 2025 at 12:23 PM Melanie Plageman
>  wrote:
> > I started getting worried thinking about this. If you have a cursor
> > for select * from a table and fetch forward far enough, couldn't
> > vacuum fail to get cleanup locks on a whole range of blocks?
> 
> I don't think so. A given scan holds at most 1 heap pin at a time, so
> I don't see how a cursor doing a FETCH FORWARD would make anything
> happen that wouldn't otherwise. I do think it's theoretically possible
> that a running query and a running VACUUM could be going at exactly
> the same speed so that the VACUUM always tries to pin every page at
> the exact same moment the SELECT has it pinned, but in practice I
> think it is very unlikely that the timing will work out like that.
> Like win-the-lottery kind of unlikely.

I think the most common way to get a range of pages pinned is to have a bunch
of backends doing index nested loop joins to a fairly narrow value range in a
table. You can sometimes see (via pg_buffercache) that there are some pages
that are rather heavily pinned, but just about no other page in the table is.


> > Yea, I guess that means the freeze limit caps wasted read I/O -- but
> > you could end up doing no read I/O and still hitting the freeze fail
> > limit because it is trying to detect when data is too new to be worth
> > eager scanning. But that's probably the behavior we want.
> 
> The too-new case is very scary, I think. It's already a problem that
> when there are long-running transactions in play, autovacuum commands
> a VACUUM which finds nothing it can usefully clean up. It would be
> very valuable for someone to figure out a way to prevent that, but the
> relevance to this patch is that we need to try hard to avoid making
> that problem substantially worse than it is already. Even if the whole
> database is memory-resident, the too-new case allows for wasting a
> bunch of effort that we would rather not waste; the cap here tries to
> make sure that isn't any worse than it needs to be.

IME the most common way this issue arises is anti-wraparound vacuums that can't
actually clean up anything, due to some old transaction. We'll just fire those
off over and over again, without taking into account that we still can't make
progress.

That scenario doesn't apply the same way to this patch, because by the time
you get to an anti-wrap vacuum, this patch doesn't do anything.

Of course it's possible that the dead tuple trigger causes repeated vacuums
that each can't clean up anything, but that's much less common IME.


> > > > Interestingly, we call heap_tuple_should_freeze() in
> > > > lazy_scan_noprune(), so we actually could tell whether or not there
> > > > are some tuples on the page old enough to trigger freezing if we had
> > > > gotten the cleanup lock. One option would be to add a new output
> > > > parameter to lazy_scan_noprune() that indicates whether or not there
> > > > were tuples with xids older than the FreezeLimit, and only if there
> > > > were not, count it as a failed eager scan.
> > >
> > > Yeah, that could be done, and it doesn't sound like a bad idea.
> > > However, I'm also unsure whether we need to add additional logic for
> > > this case. My intuition is that it just won't happen very often. I'm
> > > not quite confident that I'm correct about that, but if I had to
> > > guess, my guess would be "rare scenario".
> >
> > I've not done this in the attached v16. I have added a comment about it.
> > I think not doing it is a judgment call and not a bug, right?
> 
> As of this writing, I do not believe that this is an essential part of
> this patch, but it is always possible that with the benefit of
> hindsight it will seem otherwise. That's just life, though. If I knew
> in advance which of my decisions would later seem like mistakes, I
> would make very few mistakes.

I agree that we can disregard this for now.

Greetings,

Andres Freund




Re: Proposal to CREATE FOREIGN TABLE LIKE

2025-02-11 Thread Sami Imseih
> + Foreign tables have no real storage in PostgreSQL.
> + Inapplicable options: INCLUDING INDEXES,
> INCLUDING STORAGE,
>
> Oh, I corrected another one in the code comments, but I forgot about this one.
> Done in patch v3.

I attached v4 with some slight modifications to the wording, otherwise
this looks good.

>> I think the test coverage to check for the negative conditions only is
>> enough.
>>
>>
> Hmm... I copied from the cases in the same file for each option.
> There's no harm in having more tests, how about we keep them?

I agree. I was just saying the test cases you provided are
enough. No changes needed for the tests.

I have no further comments.

Regards,

Sami


v4-0001-CREATE-FOREIGN-TABLE-LIKE.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-11 Thread Nathan Bossart
On Tue, Feb 11, 2025 at 03:22:49PM +0100, Álvaro Herrera wrote:
> I find this proposed patch a bit strange and I feel it needs more
> explanation.
> 
> When this thread started, Bharath justified his patches saying that a
> slot that's inactive for a very long time could be problematic because
> of XID wraparound.  Fine, that sounds a reasonable feature.  If you
> wanted to invalidate slots whose xmins were too old, I would support
> that.  He submitted that as his 0004 patch then.
> 
> However, he also chose to submit 0003 with invalidation based on a
> timeout.  This is far less convincing a feature to me.  The
> justification for the time out seems to be that ... it's difficult to
> have a one-size-fits-all value because size of disks vary. (???)
> Or something like that.  Really?  I mean -- yes, this will prevent
> problems in toy databases when run in developer's laptops.  It will not
> prevent any problems in production databases.  Do we really want a
> setting that is only useful for toy situations rather than production?
> 
> 
> Anyway, the thread is way too long, but after some initial pieces were
> committed, Nisha took over and has been submitting patches derived from
> Bharath's 0003, and at some point the initial 0004 was dropped.  But 0004 was
> the more useful one, I thought, so what's going on?
> 
> I'm baffled.

I agree, and I am also baffled because I think this discussion has happened
at least once already on this thread.  I still feel like the XID-based
parameter makes more sense.  For replication slots, two primary concerns
are 1) storage, for which we have max_slot_wal_keep_size and 2) XID
wraparound, for which we don't really have anything today.  A timeout might
be useful in some contexts, but if the goal is to prevent wraparound, why
not target that directly?

-- 
nathan




Re: Introduce XID age and inactive timeout based replication slot invalidation

2025-02-11 Thread Álvaro Herrera
Hello,

I find this proposed patch a bit strange and I feel it needs more
explanation.

When this thread started, Bharath justified his patches saying that a
slot that's inactive for a very long time could be problematic because
of XID wraparound.  Fine, that sounds a reasonable feature.  If you
wanted to invalidate slots whose xmins were too old, I would support
that.  He submitted that as his 0004 patch then.

However, he also chose to submit 0003 with invalidation based on a
timeout.  This is far less convincing a feature to me.  The
justification for the time out seems to be that ... it's difficult to
have a one-size-fits-all value because size of disks vary. (???)
Or something like that.  Really?  I mean -- yes, this will prevent
problems in toy databases when run in developer's laptops.  It will not
prevent any problems in production databases.  Do we really want a
setting that is only useful for toy situations rather than production?


Anyway, the thread is way too long, but after some initial pieces were
committed, Nisha took over and has been submitting patches derived from
Bharath's 0003, and at some point the initial 0004 was dropped.  But 0004 was
the more useful one, I thought, so what's going on?

I'm baffled.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")




Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)

2025-02-11 Thread Ilia Evdokimov

Hi,

Thank you for your work!

1. Perhaps in EXPLAIN you forgot to also check that usage->wal_buffers_full > 0:

    if ((usage->wal_records > 0) || (usage->wal_fpi > 0) ||
        (usage->wal_bytes > 0))
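
Presumably the intended condition would then look something like this (a sketch
of the suggestion, not the patch):

    if ((usage->wal_records > 0) || (usage->wal_fpi > 0) ||
        (usage->wal_bytes > 0) || (usage->wal_buffers_full > 0))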



2. I have a small suggestion for pg_stat_statements: would it make sense 
to move wal_buffers_full next to wal_records, wal_fpi and wal_bytes? 
This way, all WAL-related information would be grouped together.


--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.





Re: RFC: Allow EXPLAIN to Output Page Fault Information

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-11 17:00:36 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 16:36, Andres Freund  wrote:
> > Shrug. It means that it'll not work in what I hope will be the default
> > mechanism before long.  I just can't get excited for that. In all likelihood
> > it'll result in bug reports that I'll then be on the hook to fix.
> 
> My assumption was that io_uring would become the default mechanism on
> Linux. And that bgworker based AIO would generally only be used by
> platforms without similar functionality. Is that assumption incorrect?

Yes, at least initially:
1) it's not enabled on the kernel level everywhere
2) it requires an optional build dependency
3) it requires tuning the file descriptor ulimit, unless we can convince Tom
   that it's ok to do that ourselves

Greetings,

Andres Freund




Re: Orphaned users in PG16 and above can only be managed by Superusers

2025-02-11 Thread Ashutosh Sharma
Hi Robert,

On Tue, Feb 4, 2025 at 10:54 PM Robert Haas  wrote:
>
> On Thu, Jan 30, 2025 at 8:45 AM Ashutosh Sharma  wrote:
> > Imagine a superuser creates role u1. Since the superuser is creating
> > u1, it won't have membership in any role. Now, suppose u1 creates a
> > new role, u2. In this case, u1 automatically becomes a member of u2
> > with the admin option. However, at this point, there is no dependency
> > between u1 and u2, meaning that dropping u2 shouldn't impact u1. Now,
> > if u2 creates yet another role, u3, that's when u1 will start
> > depending on u2. This is because if u2 were dropped, u1 would lose the
> > ability to administer u3. At this stage, a dependency between u1 and
> > u2 is recorded.
>
> This seems to me to be assuming that who can administer which roles
> won't change, but it can. The superuser is free to grant the ability
> to administer u3 to some new user u4 or an existing user such as u1,
> and is also free to remove that ability from u2.
>

You're right, I'll fix that in the next patch.

> I think if you want to legislate that attempting to drop u2 fails, you
> have to do that by testing what the situation is at the time of the
> drop command, not by adding dependencies in advance.

I agree, and I will give this some thought to see if we can find a way
forward along these lines.

--
Thanks & Regards,
Ashutosh Sharma.




Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.

2025-02-11 Thread Shlok Kyal
On Tue, 11 Feb 2025 at 09:51, Shubham Khanna
 wrote:
>
> On Fri, Feb 7, 2025 at 7:46 AM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Shubham,
> >
> > Thanks for updating the patch.
> >
> > Previously you said that you had a plan to extend the patch to drop other
> > replication objects [1], but I think it is not needed. pg_createsubscriber
> > is already able to drop the existing subscriptions in
> > check_and_drop_existing_subscriptions().
> > As for the replication slot, as I said in [2], it would have been created
> > intentionally, so I feel it should not be dropped.
> > Thus I don't think the patch has a concrete plan for extension.
> >
> > Below part contains my review comment.
> >
> > 01. Option name
> >
> > Based on the above discussion, "--cleanup-publisher-objects" is not 
> > suitable because
> > it won't drop replication slots. How about "--cleanup-publications"?
> >
>
> I have changed the name of the option  to "--cleanup-existing-publications"
>
> > 02. usage()
> > ```
> > +   printf(_("  -C  --cleanup-publisher-objects drop all publications on the logical replica\n"));
> > ```
>
> Fixed.
>
> > s/logical replica/subscriber
> >
> > 03. drop_all_publications
> > ```
> > +/* Drops all existing logical replication publications from all subscriber
> > + * databases
> > + */
> > +static void
> > ```
> >
> > Initial line of the comment must be blank [3].
> >
>
> Removed this function.
>
> > 04. main
> > ```
> > +   {"cleanup-publisher-objects", no_argument, NULL, 'C'},
> > ```
> >
> > Is there a reason why upper case is used? I feel lower one is enough.
> >
>
> Fixed.
>
> > 05. main
> > ```
> > +   /* Drop publications from the subscriber if requested */
> > +   if (opt.cleanup_publisher_objects)
> > +   drop_all_publications(dbinfo);
> > ```
> >
> > After considering more, I noticed that we have already called 
> > drop_publication()
> > in the setup_subscriber(). Can we call drop_all_publications() there 
> > instead when
> > -C is specified?
> >
>
> I agree with you on this. I have removed drop_all_publications() and added
> the "--cleanup-existing-publications" handling to drop_publication().
>
> > 06. 040_pg_createsubscriber.pl
> >
> > ```
> > +$node_s->start;
> > +# Create publications to test it's removal
> > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL TABLES;");
> > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL TABLES;");
> > +
> > +# Verify the existing publications
> > +my $pub_count_before =
> > +  $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;");
> > +is($pub_count_before, '2',
> > +   'two publications created before --cleanup-publisher-objects is 
> > run');
> > +
> > +$node_s->stop;
> > ```
> >
> > I feel it requires an unnecessary startup and shutdown. IIUC, creating the
> > publications and checking the counts can be done before stopping node_s,
> > around line 331.
> >
>
> Fixed.
>
> > 07. 040_pg_createsubscriber.pl
> >
> > ```
> > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub FOR ALL TABLES;");
> > +$node_p->safe_psql($db1, "CREATE PUBLICATION test_pub2 FOR ALL TABLES;");
> > +
> > +# Verify the existing publications
> > +my $pub_count_before =
> > +  $node_s->safe_psql($db1, "SELECT COUNT(*) FROM pg_publication;");
> > +is($pub_count_before, '2',
> > +   'two publications created before --cleanup-publisher-objects is 
> > run');
> > +
> > ```
> >
> > Also, there is a possibility that CREATE PUBLICATION on node_p is not 
> > replicated yet
> > when SELECT COUNT(*) is executed. Please wait for the replay.
> >
> > [1]: 
> > https://www.postgresql.org/message-id/CAHv8RjL4OvoYafofTb_U_JD5HuyoNowBoGpMfnEbhDSENA74Kg%40mail.gmail.com
> > [2]: 
> > https://www.postgresql.org/message-id/OSCPR01MB1496664FDC38DA40A441F449FF5EE2%40OSCPR01MB14966.jpnprd01.prod.outlook.com
> > [3]: https://www.postgresql.org/docs/devel/source-format.html
> >
>
> Fixed.
>
> The attached Patch contains the suggested changes.
>

Hi Shubham,

I have some comments for v4 patch:
1. I think we should update the comment for the function
'drop_publication'. As its usage is changed with this patch
Currently it states:
/*
 * Remove publication if it couldn't finish all steps.
 */

2. When --cleanup-existing-publications is not specified, the info message has
doubled quotes:

pg_createsubscriber: dropping publication
""pg_createsubscriber_5_aa3c31f2"" in database "postgres"

The code:
+   appendPQExpBufferStr(targets,
+                        PQescapeIdentifier(conn, dbinfo->pubname,
+                                           strlen(dbinfo->pubname)));

It is appending the value along with its surrounding double quotes. I think we
should assign the result of PQescapeIdentifier(conn, dbinfo->pubname,
strlen(dbinfo->pubname)) to a variable and then use that.

Thanks and Regards,
Shlok Kyal




Re: Eagerly scan all-visible pages to amortize aggressive vacuum

2025-02-11 Thread Andres Freund
On 2025-02-05 12:23:29 -0500, Melanie Plageman wrote:
> Attached v16 implements the logic to not count pages we failed to
> freeze because of cleanup lock contention as eager freeze failures.

That looks good to me.




Re: Small memory fixes for pg_createsubcriber

2025-02-11 Thread Euler Taveira
On Mon, Feb 10, 2025, at 1:31 PM, Ranier Vilela wrote:
> Coverity has some reports about pg_createsubcriber.
> 
> CID 1591322: (#1 of 1): Resource leak (RESOURCE_LEAK)
> 10. leaked_storage: Variable dbname going out of scope leaks the storage it 
> points to.
> Additionally there are several calls that are out of the ordinary, according 
> to the Postgres API.
> 
> According to the documentation:
> libpq-exec 
> 
> 
> The correct function to free memory when using PQescapeLiteral and 
> PQescapeIdentifier would be PQfreemem.
> 

Hi,

From src/common/fe_memutils.c:

void
pg_free(void *ptr)
{
free(ptr);
}

From src/interfaces/libpq/fe-exec.c:

void
PQfreemem(void *ptr)
{
free(ptr);
}

There is no bug. They are the same behind the scenes. I'm fine with changing it.
It is new code, so it wouldn't cause a lot of pain to backpatch fixes in the
future.


@@ -1130,6 +1130,7 @@ check_and_drop_existing_subscriptions(PGconn *conn,
 
PQclear(res);
destroyPQExpBuffer(query);
+ PQfreemem(dbname);
}

Even though pg_createsubscriber aims to run for only a short time, and hence it
would be fine to leak memory, the initial commit cleaned up all variables but a
subsequent commit 4867f8a555c apparently didn't. Although it is just a
low-impact improvement, it is better to be strict and quiet the SASTs.


--
Euler Taveira
EDB   https://www.enterprisedb.com/


[PATCH] snowball: fix potential NULL dereference

2025-02-11 Thread Коротков Максим
Hi all,
I found a case of a potential NULL pointer dereference.
In snowball/libstemmer/api.c, if control reaches the error label in
SN_create_env() because a memory allocation for z->p or z->S failed, we can
then dereference the NULL pointer z->S in the function SN_close_env().
Added a pointer check to avoid the potential problem.
---
Best regards, Korotkov Maksim
PostgresPro
m.korot...@postgrespro.ru
From c4596f6e23c4f5e9ae8294d51783afb733eccc9d Mon Sep 17 00:00:00 2001
From: Maksim Korotkov 
Date: Tue, 24 Dec 2024 17:07:12 +0300
Subject: [PATCH] snowball: fix potential NULL dereference

If we transfer control to the SN_create_env() function by using the error
label when there is a memory allocation error of z->p or z->S,
we can then dereference the NULL pointer z->S in the function SN_close_env().
Added the pointer check for avoiding potential problem.

Fixes: 140d4ebcb4 ("Tsearch2 functionality migrates to core.  The bulk of this work is by Oleg Bartunov and Teodor Sigaev, but I did a lot of editorializing, so anything that's broken is probably my fault.")

Found by Postgres Pro with Svace static analyzer
Signed-off-by: Maksim Korotkov 
---
 src/backend/snowball/libstemmer/api.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/snowball/libstemmer/api.c b/src/backend/snowball/libstemmer/api.c
index 375938e6d1..3a7169abd8 100644
--- a/src/backend/snowball/libstemmer/api.c
+++ b/src/backend/snowball/libstemmer/api.c
@@ -34,7 +34,7 @@ error:
 extern void SN_close_env(struct SN_env * z, int S_size)
 {
 if (z == NULL) return;
-if (S_size)
+if (S_size && z->S)
 {
 int i;
 for (i = 0; i < S_size; i++)
-- 
2.34.1



Re: pg_stat_statements and "IN" conditions

2025-02-11 Thread Sami Imseih
I have only looked at 0001, but I am wondering why
query_id_const_merge is a pg_stat_statements GUC
rather than a core GUC?

The dependency of pg_stat_statements to take advantage
of this useful feature does not seem right.

For example if the user does not have pg_stat_statements enabled,
but are sampling top queryId from pg_stat_activity, they will
likely want this merge behavior to build meaningful database
load graphs.

Other extensions that consume queryIds may also want this
behavior without needing to enable pg_stat_statements.

Also, we have compute_query_id as a core parameter, this
new guc will become an option for how to compute a queryId.
In the future we may want to introduce other controls for how a
queryId is generated.

Regards,

Sami




Re: [PATCH] snowball: fix potential NULL dereference

2025-02-11 Thread Tom Lane
Коротков Максим  writes:
> I found the case of potential NULL pointer dereference.
> In snowball/libstemmer/api.c if we transfer control to the SN_create_env() 
> function
> by using the error label when there is a memory allocation error of z->p or 
> z->S,
> we can then dereference the NULL pointer z->S in the function SN_close_env().
> Added the pointer check for avoiding a potential problem.

I believe you are right: OOM partway through SN_create_env would fail.
However, backend/snowball is not our code so applying our own local
patch is not the way to fix it.  You should report this upstream;
see src/backend/snowball/README.

(Whenever they apply the patch, we should then re-sync...)

regards, tom lane




Re: DOCS - inactive_since field readability

2025-02-11 Thread Amit Kapila
On Fri, Feb 7, 2025 at 12:05 PM Peter Smith  wrote:
>
>
> I've had a go at rewording this paragraph in the 0002 patch.
>

The change in 0001 looks odd after seeing it in HTML format. We should add an
empty line between the two paragraphs; otherwise it doesn't look good. Did you
see multiple paragraphs in any other column definitions?

For the 0002 patch, I find the following changes as improvements,
*
  
-Note that for slots on the standby
to
+For standby slots

*
On standby, this is useful for slots
-that are being synced from a primary server (whose
-synced field is true)
-so they know when the slot stopped being synchronized.
to
This helps standby slots
+track when synchronization was interrupted.

Other than that, I find the current wording okay.

-- 
With Regards,
Amit Kapila.




Re: Test to dump and restore objects left behind by regression

2025-02-11 Thread Ashutosh Bapat
Hi Michael,


On Sun, Feb 9, 2025 at 1:25 PM Michael Paquier  wrote:
>
> On Fri, Feb 07, 2025 at 07:11:25AM +0900, Michael Paquier wrote:
> > Okay, thanks for the feedback.  We have been relying on diff -u for
> > the parts of the tests touched by 0001 for some time now, so if there
> > are no objections I would like to apply 0001 in a couple of days.
>
> This part has been applied as 169208092f5c.

Thanks. PFA rebased patches.

I have added another diff adjustment to adjust_regress_dumpfile().
It's introduced by 83ea6c54025bea67bcd4949a6d58d3fc11c3e21b.

On Thu, Feb 6, 2025 at 11:32 AM Michael Paquier  wrote:
>
> On Wed, Feb 05, 2025 at 03:28:04PM +0900, Michael Paquier wrote:
> > Hmm.  I was reading through the patch and there is something that
> > clearly stands out IMO: the new compare_dumps().  It is in Utils.pm,
> > and it acts as a wrapper of `diff` with its formalized output format.
> > It is not really about dumps, but about file comparisons.  This should
> > be renamed compare_files(), with internals adjusted as such, and
> > reused in all the existing tests.  Good idea to use that in
> > 027_stream_regress.pl, actually.  I'll go extract that first, to
> > reduce the presence of `diff` in the whole set of TAP tests.
>
> The result of this part is pretty neat, resulting in 0001 where it is
> possible to use the refactored routine as well in pg_combinebackup
> where there is a piece comparing dumps.  There are three more tests
> with diff commands and assumptions of their own, that I've left out.
> This has the merit of unifying the output generated should any diffs
> show up, while removing a nice chunk from the main patch.
>
> > AdjustDump.pm looks like a fine concept as it stands.  I still need to
> > think more about it.  It feels like we don't have the most optimal
> > interface, though, but perhaps that will be clearer once
> > compare_dumps() is moved out of the way.

Without knowing what makes the interface suboptimal, it's hard to make it
optimal. I did think about getting rid of the adjust_child_columns flag. But
that would mean either adjusting the CREATE TABLE ... INHERIT statements in
the dump output from both the original and the restored database to a
canonical form, or getting rid of the tests in that function which make sure
that the adjustment is required. The first seems like more work (coding and
run time). The tests look useful for detecting when the adjustment isn't
required.

I also looked at the routines which adjust the dumps from upgrade
tests. They seem to be specific to the older versions and lack the
extensibility you mentioned earlier.

The third thing I looked at was the possibility of applying the
adjustments to only the dump from the original database where it is
required by passing the newline adjustments to compare_files().
However 0002 in the attached set of patches adds more logic,
applicable to both the original and restored dump outputs, to
AdjustDump.pm. So we can't do that either.

I am clueless as to what could be improved here.

>
> +my %dump_formats = ('plain' => 'p', 'tar' => 't', 'directory' => 'd', 
> 'custom' => 'c');
>
> No need for this mapping, let's just use the long options.

Hmm, I didn't realize -F accepts the whole format name as well. pg_dump
--help doesn't give that impression, but the user-facing documentation
mentions it. Done.

>
> +# restore data as well to catch any errors while doing so.
> +command_ok(
> +[
> +'pg_dump', "-F$format_spec", '--no-sync',
> +'-d', $src_node->connstr('regression'),
> +'-f', $dump_file
> +],
> +"pg_dump on source instance in $format format");
>
> The use of command_ok() looks incorrect here.  Shouldn't we use
> $src_node->command_ok() here to ensure a correct PATH?  That would be
> more consistent with the other dump commands.  Same remark about
> @restore_command.

Cluster::command_ok's prologue doesn't mention PATH but mentions
PGHOST and PGPORT.
```
Runs a shell command like PostgreSQL::Test::Utils::command_ok, but
with PGHOST and PGPORT set
so that the command will default to connecting to this PostgreSQL::Test::Cluster
```
According to sub _get_env(), PATH is set only when custom install path
is provided. In the absence of that, build path is used. In this case,
the source and destination nodes are created from the build itself, so
no separate path is required. PGHOST and PGPORT are anyway overridden
by the connection string fetched from the node. So I don't think
there's any correctness issue here, but it's better to use
Cluster::command_ok() just for better readability. Done

>
> +# The order of columns in COPY statements dumped from the original 
> database
> +# and that from the restored database differs. These differences are 
> hard to
>
> What are the relations we are talking about here?

These are the child tables whose parent has added a column after being
inherited. Now that I have more expertise with perl regex, I have
a

Re: Restrict publishing of partitioned table with a foreign table as partition

2025-02-11 Thread Shlok Kyal
On Wed, 5 Feb 2025 at 14:14, Álvaro Herrera  wrote:
>
> On 2025-Feb-05, vignesh C wrote:
>
> > We can maintain the behavior you suggested when the
> > PUBLISH_VIA_PARTITION_ROOT option is set to false. However, when
> > PUBLISH_VIA_PARTITION_ROOT is true, the table data is copied from the
> > root table (as intended by the user), which will also include the
> > foreign table data. In this case, wouldn’t it be better to throw an
> > error?
>
> It sounds to me a reasonable restriction that you can only add a
> partitioned table to a publication if publish_via_partition_root=false.
> Then the case of hybrid partitioned tables is supported, but it requires
> that changes are published directly by partitions, which is sensible
> anyway.
>
I have created a patch with the above approach.
We can create a publication on a partitioned table with a foreign partition
when publish_via_partition_root=false; in this case the foreign partitions
will not be published. If publish_via_partition_root=true, we throw an error.
Please find the v5 patch attached.

> In this case, during CREATE FOREIGN TABLE of a partition with this
> condition, we must check whether any publications include the schema
> that the table is being created on (or attached, for ALTER TABLE ATTACH
> PARTITION), and whether there are any publications that are FOR ALL
> TABLES that would be affected.
>

I have handled the above cases and added tests for the same.

> (If we later figure out a way to allow publish_via_partition_root and
> skip the tuples in foreign partitions, it's easy to remove the
> restriction.)
>

Thanks and Regards,
Shlok Kyal


v5-0001-Restrict-publishing-of-partitioned-table-with-for.patch
Description: Binary data


Re: Restrict publishing of partitioned table with a foreign table as partition

2025-02-11 Thread Shlok Kyal
On Mon, 10 Feb 2025 at 16:11, vignesh C  wrote:
>
> On Tue, 4 Feb 2025 at 18:31, vignesh C  wrote:
> >
> > On Thu, 30 Jan 2025 at 17:32, Shlok Kyal  wrote:
> > >
> > > @@ -1428,6 +1427,12 @@ check_foreign_tables_in_schema(Oid schemaid)
> > >  errdetail("foreign table \"%s\" is a partition of partitioned table \"%s\"",
> > >            get_rel_name(foreign_tbl_relid), parent_name)));
> > > }
> > > +   else if (relForm->relkind == RELKIND_FOREIGN_TABLE)
> > > +   ereport(ERROR,
> > > +           (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > +            errmsg("cannot add relation \"%s\" to publication",
> > > +                   get_rel_name(relForm->oid)),
> > > +            errdetail_relkind_not_supported(RELKIND_FOREIGN_TABLE)));
> > > }
> > >
> > > We should only throw an error when a foreign table is a partition of a
> > > partitioned table in the 'FOR TABLES IN SCHEMA' case. We should not throw
> > > an error otherwise, because in the 'FOR TABLES IN SCHEMA' case foreign
> > > tables are not published by default.
> > >
> > > I have added the changes in v3-0001.
> >
> > In case of all tables publication you have retrieved all the foreign
> > tables and then checked if any of the foreign tables is a partition of
> > a partitioned table. In case of all tables in schema publication you
> > have retrieved all the partitioned tables and then check if it
> > includes foreign tables. I felt you can keep the all tables in schema
> > publication also similar to all tables publication(as generally the
> > number of foreign tables will be lesser than the tables) i.e. retrieve
> > all the foreign tables and then check if any of the foreign tables is
> > a partition of a partitioned table.
>
> I believe you chose this approach because partitioned tables can have
> their partitions as foreign tables in a different schema. As a result,
> the current approach works well for the 'TABLES IN SCHEMA' case. It
> would be helpful to include a brief comment explaining this reasoning
> where you're handling tables in the schema publication.
>

Correct, I used this approach because partitioned tables can have
their foreign partitions in a different schema. I have added a comment
for the same in the v5 patch [1].

[1]: 
https://www.postgresql.org/message-id/CANhcyEXxjq9U7BXxCba3Njz%2BeHpNzAsSVY2GzbV%3D8iy5j%3DUAUA%40mail.gmail.com


Thanks and Regards,
Shlok Kyal




Re: Proposal to CREATE FOREIGN TABLE LIKE

2025-02-11 Thread Zhang Mingli
On Feb 11, 2025 at 08:14 +0800, Sami Imseih , wrote:
> > Patch V2 addressed the comments.
> >
> > Overall this LGTM.
> >
> > I still see a "no real storage" in v2 that should be removed
> > from the documentation.
> >
> > + Foreign tables have no real storage in PostgreSQL.
> > + Inapplicable options: INCLUDING INDEXES,
> > INCLUDING STORAGE,
Oh, I corrected another one in the code comments, but I forgot about this one.
Done in patch v3.
>
>
> I think the test coverage to check for the negative conditions only is
> enough.

Hmm... I copied from the cases in the same file for each option.
There's no harm in having more tests, how about we keep them?

--
Zhang Mingli
HashData



v3-0001-CREATE-FOREIGN-TABLE-LIKE.patch
Description: Binary data


Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

2025-02-11 Thread Jelte Fennema-Nio
The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason that this hasn't changed
is that doing so would break legacy programs that use the select(2)
system call in hard-to-debug ways. So instead, programs that want to
opt in to a higher open file limit are expected to bump their soft
limit to their hard limit on startup. The details are very well
explained in a blog post by the systemd author[1]. A similar change
was also made by the Go language[2].

So this patch starts bumping the soft open file limit of postmaster and
pgbench to the hard open file limit. Doing so is especially useful for the
AIO work that Andres is doing, because io_uring consumes a lot of file
descriptors. But even without looking at AIO, there are a large number of
reports from people who need to change their soft file limit before
starting Postgres, sometimes falling back to lowering
max_files_per_process when they fail to do so[3-8]. It's also not all
that strange to fail at setting the soft open file limit, because there
are multiple places where one can configure such limits and usually only
one of them is effective (which one depends on how Postgres is started).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: 
https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: 
https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: 
https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: 
https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: 
https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
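
For readers unfamiliar with the mechanism, the core of the approach is just a
getrlimit()/setrlimit() pair at startup, roughly like the sketch below (the
attached patch, with its ConfigureFileLimit() function, is the authoritative
version; this is only an illustration):

    #include <sys/resource.h>

    static void
    ConfigureFileLimit(void)
    {
        struct rlimit rlim;

        if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
            return;             /* leave the limits untouched on failure */

        if (rlim.rlim_cur < rlim.rlim_max)
        {
            /* raise the soft limit up to the hard limit */
            rlim.rlim_cur = rlim.rlim_max;
            (void) setrlimit(RLIMIT_NOFILE, &rlim);
        }
    }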
From ecf7fc5f34b4e025eaa4dda809874f18dfdb24b4 Mon Sep 17 00:00:00 2001
From: Jelte Fennema-Nio 
Date: Tue, 11 Feb 2025 19:15:36 +0100
Subject: [PATCH v1] Bump soft open file limit (RLIMIT_NOFILE) to hard limit on
 startup

The default open file limit of 1024 is extremely low, given modern
resources and kernel architectures. The reason that this hasn't changed
change is because doing so would break legacy programs that use the
select(2) system call in hard to debug ways. So instead programs that
want to opt-in to a higher open file limit are expected to bump their
soft limit to their hard limit on startup. Details on this are very well
explained in a blogpost by the systemd author[1]. There's also a similar
change done by the Go language[2].

So this starts bumping postmaster and pgbench its soft open file limit
to the hard open file limit. Doing so is especially useful for the AIO
work that Andres is doing, because io_uring consumes a lot of file
descriptors. But even without looking at AIO there is a large number of
reports from people that require changing their soft file limit before
starting postgres, sometimes falling back to lowering
max_files_per_process when they fail to do so[3-8]. It's also not all
that strange to fail at setting the soft open file limit because there
are multiple places where one can configure such limits and usually only
one of them is effective (which one depends on how Postgres is started).

[1]: https://0pointer.net/blog/file-descriptor-limits.html
[2]: https://github.com/golang/go/issues/46279
[3]: https://serverfault.com/questions/785330/getting-too-many-open-files-error-for-postgres
[4]: https://serverfault.com/questions/716982/how-to-raise-max-no-of-file-descriptors-for-daemons-running-on-debian-jessie
[5]: https://www.postgresql.org/message-id/flat/CAKtc8vXh7NvP_qWj8EqqorPY97bvxSaX3h5u7a9PptRFHW5x7g%40mail.gmail.com
[6]: https://www.postgresql.org/message-id/flat/113ce31b0908120955w77029099i7ececc053084095a%40mail.gmail.com
[7]: https://github.com/abiosoft/colima/discussions/836
[8]: https://www.postgresql.org/message-id/flat/29663.1007738957%40sss.pgh.pa.us#2079ec9e2d8b251593812a3711bfe9e9
---
 src/backend/postmaster/postmaster.c | 22 ++
 src/bin/pgbench/pgbench.c   |  5 +
 2 files changed, 27 insertions(+)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bb22b13adef..3c2fbebfa62 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -72,6 +72,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -406,6 +407,8 @@ static DNSServiceRef bonjour_sdref = NULL;
 /*
  * postmaster.c - function prototypes
  */
+
+static void ConfigureFileLimit(void);
 static void CloseServerPorts(int status, Datum arg);
 static void unlink_external_pid_file(int status, Datum arg);
 static void getInstallationPaths(const char *argv0);
@@ -573,6 +576,8 @@ PostmasterMain(int argc, char *argv[])
 	/* Begin accepting signals. */
 	sigprocmask(SI

Re: pg_stat_statements and "IN" conditions

2025-02-11 Thread Álvaro Herrera
On 2025-Feb-11, Sami Imseih wrote:

> I have only looked at 0001, but I am wondering why
> query_id_const_merge is a pg_stat_statements GUC
> rather than a core GUC?

I was wondering the same thing and found the explanation 
here:
https://postgr.es/m/ztmuctymis3n3...@paquier.xyz

> Other extensions that consume queryIds may also want this
> behavior without needing to enable pg_stat_statements.

I agree.  In fact, pg_stat_activity will behave differently (using
merged query_ids) if this value is turned on, for which you need the
contrib module.  This makes no sense to me.

Besides, the patch cheats in this regard: what Dmitry did was
create a function SetQueryIdConstMerge() which the extension with the
GUC can call to set the value of the variable.  I really don't see that
this is better.  I think we should put the GUC back where it was in v15
of the patch.  (I didn't check what other changes there were
afterwards.)

About the GUC name -- query_id_const_merge -- I think this is too much a
hacker's name.  How about
query_id_merge_values
query_id_merge_value_lists
query_id_squash_constant_lists

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No es bueno caminar con un hombre muerto"




Re: pg_stat_statements and "IN" conditions

2025-02-11 Thread Álvaro Herrera
On 2025-Feb-11, Dmitry Dolgov wrote:

> > On Tue, Feb 11, 2025 at 10:49:59AM GMT, Sami Imseih wrote:
> > I have only looked at 0001, but I am wondering why
> > query_id_const_merge is a pg_stat_statements GUC
> > rather than a core GUC?
> 
> It was moved from being a core GUC into a pg_stat_statements GUC at the
> request of the reviewers. The community tries to avoid adding more and more
> core GUCs to PostgreSQL.

I understand, but I tend to disagree with the argument because it's okay
to simplify things that can be simplified, but if simplifying makes them
more complex, then it goes against the original desire for simplicity
anyway.  My impression is that it would be better to put it back.

(Otherwise, how do you document the behavior that pg_stat_activity
suddenly emits different query_ids than before because you installed
pg_stat_statement and enabled this feature?  It just doesn't make much
sense.)

> > I suppose this is telling us that detecting the case with consts wrapped
> > in casts is not really optional.  (Dmitry said this was supported at
> > early stages of the patch, and now I'm really curious about that
> > implementation because what IsMergeableConstList sees is a FuncExpr that
> > invokes the cast function for float8 to int4.)
> 
> Yes, those cases in question are usually FuncExpr. The original patch
> implementation used to handle that via simplifying the node with
> eval_const_expressions to figure out if the value we work with is a constant.
> This approach was marked as too risky by reviewers, as this could reach a lot
> of unexpected functionality in the mutator part.

Hmm, what about doing something much simpler, such as testing whether
there's just a CoerceViaIO/RelabelType around a Const or a one-parameter
function call of an immutable boostrap-OID function that has a Const as
argument, and trivial cases like that?  Something very very simple
that's going to catch the majority of cases without anything as complex
as a node walker.

Maybe something like statext_is_compatible_clause_internal() can be an
inspiration (and even that is far more complex than we need here.)
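
A very rough sketch of the kind of check being described (all names and
placement are assumptions, not the actual patch; the immutability test uses
func_volatile() merely for illustration):

    /* sketch only; headers such as nodes/nodeFuncs.h, utils/lsyscache.h assumed */
    static bool
    IsTrivialConstWrapper(Node *node)
    {
        if (IsA(node, RelabelType))
            node = (Node *) ((RelabelType *) node)->arg;
        else if (IsA(node, CoerceViaIO))
            node = (Node *) ((CoerceViaIO *) node)->arg;
        else if (IsA(node, FuncExpr))
        {
            FuncExpr   *fexpr = (FuncExpr *) node;

            /* only a one-argument, immutable, bootstrap-OID function */
            if (list_length(fexpr->args) != 1 ||
                fexpr->funcid >= FirstGenbkiObjectId ||
                func_volatile(fexpr->funcid) != PROVOLATILE_IMMUTABLE)
                return false;
            node = (Node *) linitial(fexpr->args);
        }

        return IsA(node, Const);
    }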

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"If it is not right, do not do it.
If it is not true, do not say it." (Marcus Aurelius, Meditations)




Re: RFC: Allow EXPLAIN to Output Page Fault Information

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-11 18:45:13 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 17:19, Andres Freund  wrote:
> > Yes, at least initially:
> 
> Ah, then I understand your point of view much better. Still I think we
> could easily frame it as: If you enable io_uring, you also get these
> additional fancy stats.

> Also afaict the items don't have to mean that
> 
> > 1) it's not enabled on the kernel level everywhere
> 
> Is that really a common thing to do? From a quick internet search, I
> can "only" find that Google does this. (Google is definitely a big
> cloud player, so I don't want to suggest that that is not important,
> but if that's really the only one still the bulk of systems would have
> io_uring support)

RHEL had it disabled for quite a while, not sure if that's still the case.


> > 2) it requires an optional build dependency
> 
> What build dependency is this?

Liburing.


> In any case, can't we choose the default at build time based on the
> available build dependencies? And if we cannot, I think we could always add
> an "auto" default that would mean the best available AIO implementation
> (where io_uring is better than bgworkers).

We could, but because of 3) I don't want to do that right now.


> > 3) it requires tuning the file descriptor ulimit, unless we can convince Tom
> >that it's ok to do that ourselves
> 
> I think we should just do this, given the reasoning in the blog[1]
> from the systemd author I linked in the AIO thread. I agree that a
> response/explicit approval from Tom would be nice though.

I think it's the right path, but that's a fight to fight after AIO has been
merged, not before.

Greetings,

Andres Freund




Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

2025-02-11 Thread Tom Lane
Jelte Fennema-Nio  writes:
> The default open file limit of 1024 is extremely low, given modern
> resources and kernel architectures. The reason that this hasn't changed
> change is because doing so would break legacy programs that use the
> select(2) system call in hard to debug ways. So instead programs that
> want to opt-in to a higher open file limit are expected to bump their
> soft limit to their hard limit on startup. Details on this are very well
> explained in a blogpost by the systemd author[1].

On a handy Linux machine (running RHEL9):

$ ulimit -n
1024
$ ulimit -n -H
524288

I'm okay with believing that 1024 is unreasonably small, but that
doesn't mean I think half a million is a safe value.  (Remember that
that's *per backend*.)  Postgres has run OSes out of FDs in the past
and I don't believe we couldn't do it again.

Also, the argument you cite is completely recent-Linux-centric
and does not consider the likely effects on other platforms.
To take one example, on current macOS:

$ ulimit -n
4864
$ ulimit -n -H
unlimited

(Hm, so Apple wasn't impressed by the "let's not break select(2)"
argument.  But I digress.)

I'm afraid this patch will replace "you need to tune ulimit -n
to get best performance" with "you need to tune ulimit -n to
avoid crashing your system".  Does not sound like an improvement.

Maybe a sanity limit on how high we'll try to raise the ulimit
would help.
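
Just to make the shape of that concrete, something along these lines (a
sketch only; the cap value is arbitrary and purely illustrative):

#include <sys/resource.h>

#define FD_RAISE_CAP 65536          /* illustrative sanity limit */

static void
raise_soft_fd_limit(void)
{
    struct rlimit rlim;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
        return;                     /* leave things alone on failure */

    if (rlim.rlim_cur < rlim.rlim_max)
    {
        rlim_t      target = rlim.rlim_max;

        if (target > FD_RAISE_CAP)
            target = FD_RAISE_CAP;
        if (target > rlim.rlim_cur)
        {
            rlim.rlim_cur = target;
            (void) setrlimit(RLIMIT_NOFILE, &rlim);
        }
    }
}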

regards, tom lane




Re: pgbench with partitioned tables

2025-02-11 Thread Melanie Plageman
On Tue, Feb 11, 2025 at 3:10 AM Sergey Tatarintsev
 wrote:
>
> 08.02.2025 17:32, Álvaro Herrera пишет:
> > On 2025-Feb-07, Melanie Plageman wrote:
> >
> >> Okay, I've stared at this a bit, and it seems basically fine the way
> >> it is (I might add a bit more whitespace, clean up the commit message,
> >> etc). So I'm interested in committing it. I will admit that having
> >> never committed anything to pgbench, I'm a bit nervous. So, give me a
> >> couple days to work up the nerve.
> > Sounds good.  I only wanted to say that I would use PQexecParams()
> > instead of PQexec() in the new function, avoiding the psprintf and
> > pfree, and also that initializing relkind to RELKIND_RELATION is strange
> > and probably unnecessary.
> >
> Let it be so. I made the suggested changes (patch v4 attached)

I was testing with this with the intent to commit it and noticed that
it does change behavior in one way -- previously if you created an
unlogged table with the same schema as one of the pgbench tables and
then used client-side data generation, it would COPY FREEZE data into
that table. With the patch, it would no longer do that. I can live
with that, but I thought it was worth mentioning.

- Melanie




Re: explain analyze rows=%.0f

2025-02-11 Thread Tom Lane
Robert Haas  writes:
> On Tue, Feb 11, 2025 at 12:14 PM Andrei Lepikhov  wrote:
>> I support the idea in general, but I believe it should be expanded to
>> cover all cases of parameterised plan nodes. Each rescan iteration may
>> produce a different number of tuples, and rounding can obscure important
>> data.

> I agree strongly with all of this. I believe we should just implement
> what was agreed here:
> https://www.postgresql.org/message-id/21013.1243618236%40sss.pgh.pa.us
> Let's just display 2 fractional digits when nloops>1, else 0, and call it 
> good.

Note that that formulation has nothing especially to do with
parameterized plan nodes.  Any nestloop inner side would end up
getting shown with fractional rowcounts.  Maybe that's fine.

I suggest that when thinking about what to change here,
you start by considering how you'd adjust the docs at
https://www.postgresql.org/docs/devel/using-explain.html
to explain the new behavior.  If you can't explain it
clearly for users, then maybe it's not such a great idea.

regards, tom lane




Re: PATCH: Disallow a netmask of zero unless the IP is also all zeroes

2025-02-11 Thread Daniel Gustafsson
> On 11 Feb 2025, at 21:25, Tom Lane  wrote:

> I'm a bit distressed to realize that hba.c isn't using cidr_in.
> Maybe we should try to share code instead of duplicating yet more.

+1.  I have a note along these lines on my never-shrinking TODO, I think it
would be great if we took a stab at that.

--
Daniel Gustafsson





Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-11 21:04:25 +0100, Tomas Vondra wrote:
> I agree the defaults may be pretty low for current systems, but do we
> want to get into the business of picking a value and overriding whatever
> value is set by the sysadmin? I don't think a high hard limit should be
> seen as an implicit permission to just set it as the soft limit.

As documented in the links sent by Jelte, that's *explicitly* the reasoning
for the difference between default soft and hard limits. For safety, programs
should opt into using higher FD limits, rather than being opted into them.


> Imagine you're a sysadmin / DBA who picks a low soft limit (for whatever
> reason - there may be other stuff running on the system, ...). And then
> postgres starts and just feels like bumping the soft limit. Sure, the
> sysadmin can lower the hard limit and then we'll respect that, but I don't
> recall any other tool requiring this approach, and it would definitely be
> quite surprising to me.

https://codesearch.debian.net/search?q=setrlimit&literal=1

In a quick skim I found that at least gimp, openjdk, libreoffice, gcc, samba,
dbus increase the soft limit to something closer to the hard limit. And that
was at page 62 out of 1269.



> I did run into bottlenecks due to "too few file descriptors" during
> recent experiments with partitioning, which made it pretty trivial to
> get into a situation when we start thrashing the VfdCache. I have a
> half-written draft of a blog post about that somewhere.
>
> But my conclusion was that it's damn difficult to even realize that's
> happening, especially if you don't have access to the OS / perf, etc. So my
> takeaway was we should improve that first, so that people have a chance to
> realize they have this issue, and can do the tuning. The improvements I
> thought about were:

Hm, that seems something orthogonal to me. I'm on board with that suggestion,
but I don't see why that should stop us from having code to adjust the rlimit.


My suggestion would be to redefine max_files_per_process as the number of
files we try to be able to open in backends. I.e. set_max_safe_fds() would
first count the number of already open fds (since those will largely be
inherited by child processes) and then check if we can open up to
max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
necessary.
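
Roughly something like this sketch (count_open_fds() is a hypothetical
stand-in for however fd.c would count already-open descriptors; none of this
is from an actual patch):

#include <sys/resource.h>

static void
reserve_backend_fds(int max_files_per_process)
{
    struct rlimit rlim;
    long        already_open = count_open_fds();    /* hypothetical helper */
    rlim_t      needed = (rlim_t) already_open + max_files_per_process;

    if (getrlimit(RLIMIT_NOFILE, &rlim) != 0)
        return;

    /* raise the soft limit only as far as actually needed and allowed */
    if (rlim.rlim_cur < needed && needed <= rlim.rlim_max)
    {
        rlim.rlim_cur = needed;
        (void) setrlimit(RLIMIT_NOFILE, &rlim);
    }
}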

That way we don't increase the number of FDs we use in the default
configuration drastically, but increasing the number only requires a config
change, it doesn't require also figuring out how to increase the settings in
whatever starts postgres. Which, e.g. in cloud environments, typically won't
be possible.

And when using something like io_uring for AIO, it'd allow opening
max_files_per_process files in addition to the files required for the
io_uring instances.

Greetings,

Andres Freund




Re: Expanding HOT updates for expression and partial indexes

2025-02-11 Thread Matthias van de Meent
On Mon, 10 Feb 2025 at 20:11, Burd, Greg  wrote:
> > On Feb 10, 2025, at 12:17 PM, Matthias van de Meent 
> >  wrote:
> >
> >>
> >> I have a few concerns with the patch, things I’d greatly appreciate your 
> >> thoughts on:
> >>
> >> First, I pass an EState along the update path to enable running the checks 
> >> in heapam, this works but leaves me feeling as if I violated separation of 
> >> concerns. If there is a better way to do this let me know or if you think 
> >> the cost of creating one in the execIndexing.c 
> >> ExecIndexesRequiringUpdates() is okay that’s another possibility.
> >
> > I think that doesn't have to be bad.
>
> Meaning that the approach I’ve taken is okay with you?

If you mean "passing EState down through the AM so we can check if we
really need HOT updates", then Yes, that's OK. I don't think the basic
idea (executing the projections to check for differences) is bad per
se, however I do think we may need more design work on the exact shape
of how ExecIndexesRequiringUpdates() receives its information (which
includes the EState).

E.g. we could pass an opaquely typed pointer that's passed from the
executor state through the table_update method into this
ExecIndexesRequiringUpdates(). That opaque struct would then contain
the important information for index update state checking, so that the
AM can't realistically break things without bypassing the separation
of concerns, and doesn't have to know about any executor nodes.
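
For example, something shaped like this (ExecIndexesRequiringUpdates() is the
function from the patch, but the signatures here are invented purely for
illustration):

struct IndexUpdateCheckState;   /* opaque to the table AM */
typedef struct IndexUpdateCheckState IndexUpdateCheckState;

/* executor side: build the state from what it already knows */
extern IndexUpdateCheckState *MakeIndexUpdateCheckState(EState *estate,
                                                        ResultRelInfo *rri);

/* called from the AM's update path, which just forwards the pointer */
extern Bitmapset *ExecIndexesRequiringUpdates(IndexUpdateCheckState *state,
                                              TupleTableSlot *oldslot,
                                              TupleTableSlot *newslot);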

>>> Third, there is overhead to this patch, it is no longer a single simple 
>>> bitmap test to choose HOT or not in heap_update().
>>
>> Why can't it mostly be that simple in simple cases?
>
> It can remain that simple in the cases you mention.  In relcache the hot 
> blocking attributes are nearly the same, only the summarizing attributes are 
> removed.  The first test then in heap_update() is for overlap with the 
> modified set.  When there is none, the update will proceed on the HOT path.
>
> The presence of a summarizing index is determined in 
> ExecIndexesRequiringUpdates() in execIndexing.c, so a slightly longer code 
> path but not much new overhead.

Yes, but that's only determined at an index-by-index level, rather
than determined all at once, and that's real bad when you have
hundreds of indexes to go through (however unlikely it might be, I've
heard of cases where there are 1000s of indexes on one table). So, I
prefer limiting any O(n_indexes) operations to only the most critical
cases.

> > I mean, it's clear that "updated indexed column's value == non-HOT
> > update". And that to determine whether an updated *projected* column's
> > value (i.e., expression index column's value) was actually updated we
> > need to calculate the previous and current index value, thus execute
> > the projection twice. But why would we have significant additional
> > overhead if there are no expression indexes, or when we can know by
> > bitmap overlap that the only interesting cases are summarizing
> > indexes?
>
> You’re right, there’s not a lot of new overhead in that case except what 
> happens in ExecIndexesRequiringUpdates() to scan over the list of IndexInfo.  
> It is really only when there are many expressions/predicates requiring 
> examination that there is any significant cost to this approach AFAICT (but 
> if you see something please point it out).

See the attached approach. Evaluation of the expressions only has to
happen if there are any HOT-blocking attributes which are exclusively
hot-blockingly indexed through expressions, so if the updated
attribute numbers are a subset of hotblockingexprattrs. (substitute
hotblocking with summarizing for the summarizing approach)
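
In other words, something like (sketch; variable names are illustrative):

    if (bms_is_subset(modified_attrs, hotblockingexprattrs))
    {
        /*
         * Every modified column is HOT-blocking only via expression
         * indexes, so it is worth evaluating the expressions to see
         * whether the update can still be HOT.
         */
        recheck_index_expressions = true;
    }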

> > I would've implemented this with (1) two new bitmaps, one each for
> > normal and summarizing indexes, each containing which columns are
> > exclusively used in expression indexes (and which should thus be used
> > to trigger the (comparatively) expensive recalculation).
>
> That was where I started; over time that became harder to make work, as the
> bitmaps contain the union of index attributes for the table, not per column.

I think it's fairly easy to create, though.

> Now there is one bitmap to cover the broadest case and then a function to 
> find the modified set of indexes where each is examined against bitmaps that 
> contain only attributes specific to the index in question.  This helped in 
> cases where there were both expression and non-expression indexes on the same 
> attribute.

Fair, but do we care about one expression index on (attr1->>'data')'s
value *not* changing when an index on (attr1) exists and attr1 has
changed? That index on attr1 would block HOT updates regardless of the
(lack of) changes to the (attr1->>'data') index, so doing those
expensive calculations seems quite wasteful.

So, in my opinion, we should also keep track of those attributes only
included in expressions of indexes, and that's fairly easy: see
attached prototype.diff.txt (might need some wo

Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

2025-02-11 Thread Tom Lane
Andres Freund  writes:
> My suggestion would be to redefine max_files_per_process as the number of
> files we try to be able to open in backends. I.e. set_max_safe_fds() would
> first count the number of already open fds (since those will largely be
> inherited by child processes) and then check if we can open up to
> max_files_per_process files in addition. Adjusting the RLIMIT_NOFILE if
> necessary.

Seems plausible.  IIRC we also want 10 or so FDs available as "slop"
for code that doesn't go through fd.c.

> And when using something like io_uring for AIO, it'd allow opening
> max_files_per_process files in addition to the files required for the
> io_uring instances.

Not following?  Surely we'd not be configuring that so early in
postmaster start?

regards, tom lane




Re: POC: enable logical decoding when wal_level = 'replica' without a server restart

2025-02-11 Thread Masahiko Sawada
On Fri, Jan 24, 2025 at 11:16 AM Masahiko Sawada  wrote:
>
> On Thu, Jan 23, 2025 at 3:24 AM Ashutosh Bapat
>  wrote:
> >
> > On Thu, Jan 23, 2025 at 6:16 AM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Jan 10, 2025 at 12:33 AM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Thu, Jan 9, 2025 at 3:29 AM Ashutosh Bapat
> > > >  wrote:
> > > > >
> > > > > On Tue, Dec 31, 2024 at 10:15 AM Masahiko Sawada 
> > > > >  wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Logical decoding (and logical replication) are available only when
> > > > > > wal_level = logical. As the documentation says[1], Using the 
> > > > > > 'logical'
> > > > > > level increases the WAL volume which could negatively affect the
> > > > > > performance. For that reason, users might want to start with using
> > > > > > 'replica', but when they want to use logical decoding they need a
> > > > > > server restart to increase wal_level to 'logical'. My goal is to 
> > > > > > allow
> > > > > > users who are using 'replica' level to use logical decoding without 
> > > > > > a
> > > > > > server restart. There are other GUC parameters related to logical
> > > > > > decoding and logical replication such as max_wal_senders,
> > > > > > max_logical_replication_workers, and max_replication_slots, but even
> > > > > > if users set these parameters >0, there would not be a noticeable
> > > > > > performance impact. And their default values are already >0. So I'd
> > > > > > like to focus on making only the wal_level dynamic GUC parameter.
> > > > > > There are several earlier discussions[2][3] but no one has submitted
> > > > > > patches unless I'm missing something.
> > > > > >
> > > > > > The first idea I came up with is to make the wal_level a PGC_SIGHUP
> > > > > > parameter. However, it affects not only setting 'replica' to 
> > > > > > 'logical'
> > > > > > but also setting 'minimal' to 'replica' or higher. I'm not sure the
> > > > > > latter case is common and it might require a checkpoint. I don't 
> > > > > > want
> > > > > > to make the patch complex for uncommon cases.
> > > > > >
> > > > > > The second idea is to somehow allow both WAL-logging logical info 
> > > > > > and
> > > > > > logical decoding even when wal_level is 'replica'. I've attached a 
> > > > > > PoC
> > > > > > patch for that. The patch introduces new SQL functions such as
> > > > > > pg_activate_logical_decoding() and pg_deactivate_logical_decoding().
> > > > > > These functions are available only when wal_level is 'replica' (or
> > > > > > higher). In pg_activate_logical_decoding(), we set the status of
> > > > > > logical decoding stored on the shared memory from 'disabled' to
> > > > > > 'xlog-logical-info', allowing all processes to write logical
> > > > > > information to WAL records for logical decoding. But the logical
> > > > > > decoding is still not allowed. Once we confirm all in-progress
> > > > > > transactions completed, we switch the status to
> > > > > > 'logical-decoding-ready', meaning that users can create logical
> > > > > > replication slots and use logical decoding.
> > > > > >
> > > > > > Overall, with the patch, there are two ways to enable logical
> > > > > > decoding: setting wal_level to 'logical' and calling
> > > > > > pg_activate_logical_decoding() when wal_level is 'replica'. I left 
> > > > > > the
> > > > > > 'logical' level for backward compatibility and for users who want to
> > > > > > enable the logical decoding without calling that SQL function. If we
> > > > > > can automatically enable the logical decoding when creating the 
> > > > > > first
> > > > > > logical replication slot, probably we no longer need the 'logical'
> > > > > > level. There is room to discuss the user interface. Feedback is very
> > > > > > welcome.
> > > > > >
> > > > >
> > > > > If a server is running at minimal wal_level and they want to enable
> > > > > logical replication, they would still need a server restart. That
> > > > > would be rare but not completely absent.
> > > >
> > > > Currently we don't allow the server to start with the 'minimal' level
> > > > and max_wal_senders > 0. Even if we support changing 'minimal' to
> > > > 'logical' without a server restart, we still need a server restart to
> > > > increase max_wal_senders for users who want to use logical
> > > > replication. Or we need to eliminate this restriction too. I guess it
> > > > would be too complex for such uncommon use cases.
> > > >
> > > > >
> > > > > Our documentation says "wal_level determines how much information is
> > > > > written to the WAL.". Users may not expect that the WAL amount
> > > > > changes while wal_level = replica depending upon whether logical
> > > > > decoding is possible. It may be possible to set the expectations right
> > > > > by changing the documentation. It's not in the patch, so I am not sure
> > > > > whether this is considered.
> > > >
> > > > We should mention that in the doc. The WAL amount changes depending on
> > > > not

Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

2025-02-11 Thread Jeff Davis
On Tue, 2025-02-11 at 10:39 -0800, James Hunter wrote:
> * The Path would store "nbytes" (= the optimizer's estimate of how
> much working memory a given Path will use), to allow for future
> optimizer logic to consider memory usage when choosing the best Path.
> 
> * The Plan would store a copy of "nbytes," along with "work_mem," and
> the executor would enforce work_mem. A "(work_mem on)" option to the
> "EXPLAIN" command would display both "nbytes" and "work_mem", per
> Plan
> node.

Storing work_mem in each Plan node, and using that to enforce the
memory limit (rather than using the GUC directly), seems
uncontroversial to me. I'd suggest a standalone patch.

Storing the optimizer's estimate of the memory wanted also sounds like
a good idea. Let's pick a better name than "nbytes" though; maybe
"requested_mem" or something? This change would make it a lot easier
for an extension to adjust the per-node-work_mem, and also seems like
good infrastructure for anything we build into the planner later. I
suggest a standalone patch for this, as well.
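
(Purely to illustrate the two per-node fields being discussed, using the name
suggested above instead of "nbytes"; this is not from a posted patch, and the
units are illustrative:)

typedef struct PlanMemInfo
{
    double      requested_mem;  /* planner's estimate of memory wanted, in kB */
    int         work_mem;       /* limit the executor will actually enforce, in kB */
} PlanMemInfo;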

Can you write a useful extension with just the above two core patches?

Regards,
Jeff Davis





Re: Skip collecting decoded changes of already-aborted transactions

2025-02-11 Thread Masahiko Sawada
On Mon, Feb 3, 2025 at 10:41 AM Masahiko Sawada  wrote:
>
> On Wed, Jan 29, 2025 at 11:12 PM Peter Smith  wrote:
> >
> > On Tue, Jan 28, 2025 at 9:26 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Jan 27, 2025 at 7:01 PM Peter Smith  wrote:
> > > >
> > ...
> >
> > > > To be honest, I didn't understand the "CLEAR" part of that name. It
> > > > seems more like it should've been called something like
> > > > RBTXN_IS_SERIALIZED_ALREADY or RBTXN_IS_SERIALIZED_PREVIOUSLY or
> > > > whatever  instead of something that appears to be saying "has the
> > > > RBTXN_IS_SERIALIZED bitflag been cleared?" I understand the reluctance
> > > > to over-comment everything but OTOH currently there is no way really
> > > > to understand what these flags mean without looking through all the
> > > > code to try to figure them out from the usage.
> > > >
> > > > My recurring gripe about these flags is simply that their meanings and
> > > > how to use them should be apparent just by looking at reorderbuffer.h
> > > > and not having to guess anything or look at how they get used in the
> > > > code. It doesn't matter if that is achieved by better constant names,
> > > > by more comments or by enhanced macros/functions with asserts but
> > > > currently just looking at that file still leaves the reader with lots
> > > > of unanswered questions.
> > >
> > > I see your point. IIUC we have the comments about what the checks with
> > > the flags means but not have the description about the relationship
> > > among the flags. I think we can start a new thread for clarifying
> > > these flags and their usage. We can also discuss renaming
> > > RBTXN_IS_SERIALIZED[_CLEARE] there too.
> > >
> >
> > OK.
> >
> > ==
> >
> > Some comments for patch v17-0002.
>
> Thank you for reviewing the patch.
>
> >
> > ==
> > .../replication/logical/reorderbuffer.c
> >
> > ReorderBufferSkipPrepare:
> >
> > 1.
> > + /* txn must have been marked as a prepared transaction */
> > + Assert((txn->txn_flags & RBTXN_IS_PREPARED) != 0);
> > +
> >   txn->txn_flags |= RBTXN_SKIPPED_PREPARE;
> >
> > Should this also be asserting that the _SENT_PREPARE flag is false,
> > because we cannot be skipping it if we already sent the prepare.
> >
> > ~~~
> >
> > ReorderBufferFinishPrepared:
> >
> > 2.
> >
> > - txn->txn_flags |= RBTXN_PREPARE;
> > -
> >   /*
> > - * The prepare info must have been updated in txn even if we skip
> > - * prepare.
> > + * txn must have been marked as a prepared transaction and skipped but
> > + * not sent a prepare. Also, the prepare info must have been updated
> > + * in txn even if we skip prepare.
> >   */
> > + Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)) != 
> > 0);
> > + Assert((txn->txn_flags & RBTXN_SENT_PREPARE) == 0);
> >   Assert(txn->final_lsn != InvalidXLogRecPtr);
> >
> > 2a.
> > If it must have been prepared *and* skipped (as the comment says) then
> > the first assert should be written as:
> > Assert((txn->txn_flags & (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE))
> > == (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));
> >
> > or easier to just have 2 asserts:
> > Assert(txn->txn_flags & RBTXN_IS_PREPARED);
> > Assert(txn->txn_flags & RBTXN_SKIPPED_PREPARE);
> >
>
> Agreed with all the above comments. Since checking
> prepared-transaction-related-flags is getting complicated I've
> introduced RBTXN_PREPARE_STATUS_FLAGS so that we can check the desired
> prepared transaction status easily.
>
> > ~
> >
> > 2b.
> > later in the same function there is code:
> >
> > if (is_commit)
> >   rb->commit_prepared(rb, txn, commit_lsn);
> > else
> >   rb->rollback_prepared(rb, txn, prepare_end_lsn, prepare_time);
> >
> > So it is OK to do a commit_prepared/rollback_prepared even though no
> > prepare() has been sent?
>
> IIUC ReorderBufferReplay() is responsible for sending a prepare
> message in this case. See the comment around there:
>
> /*
>  * By this time the txn has the prepare record information and it is
>  * important to use that so that downstream gets the accurate
>  * information. If instead, we have passed commit information here
>  * then downstream can behave as it has already replayed commit
>  * prepared after the restart.
>  */
>
> I've attached the updated patches.
>

If there are no further comments on v18 patches, I'm going to push
them tomorrow.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Proposal: allow non-masked IPs inside of pg_hba.conf

2025-02-11 Thread Greg Sabino Mullane
>
> This too would work better if hba.c were sharing cidr_in's logic:


+1, a two-for-one solution.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: PATCH: Disallow a netmask of zero unless the IP is also all zeroes

2025-02-11 Thread Greg Sabino Mullane
On Tue, Feb 11, 2025 at 3:25 PM Tom Lane  wrote:

> More generally, should we reject if the netmask causes *any* nonzero
> IP bits to be ignored?  Our CIDR type already imposes that rule:
>

Yeah, I like that idea a lot. That's a great DETAIL message.

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-11 16:18:37 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > And when using something like io_uring for AIO, it'd allow opening
> > max_files_per_process files in addition to the files required for the
> > io_uring instances.
>
> Not following?  Surely we'd not be configuring that so early in
> postmaster start?

The issue is that, with io_uring, we need to create one FD for each possible
child process, so that one backend can wait for completions for IO issued by
another backend [1]. Those io_uring instances need to be created in
postmaster, so they're visible to each backend. Obviously that makes it much
easier to run into an unadjusted soft RLIMIT_NOFILE, particularly if
max_connections is set to a higher value.

In the current version of the AIO patchset, the creation of those io_uring
instances does happen as part of a shmem init callback, as the io_uring
creation also sets up queues visible in shmem. And shmem init callbacks are
currently happening *before* postmaster's set_max_safe_fds() call:


/*
 * Set up shared memory and semaphores.
 *
 * Note: if using SysV shmem and/or semas, each postmaster startup will
 * normally choose the same IPC keys.  This helps ensure that we will
 * clean up dead IPC objects if the postmaster crashes and is restarted.
 */
CreateSharedMemoryAndSemaphores();

/*
 * Estimate number of openable files.  This must happen after setting up
 * semaphores, because on some platforms semaphores count as open files.
 */
set_max_safe_fds();


So the issue would actually be that we're currently doing set_max_safe_fds()
too late, not too early :/

Greetings,

Andres Freund

[1] Initially I tried to avoid that, by sharing a smaller number of io_uring
instances across backends. Making that work was a fair bit of code *and*
was considerably slower, due to now needing a lock around submission of
IOs. Moving to one io_uring instance per backend fairly dramatically
simplified the code while also speeding it up.




Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

2025-02-11 Thread Tomas Vondra



On 2/11/25 21:18, Tom Lane wrote:
> Tomas Vondra  writes:
>> I did run into bottlenecks due to "too few file descriptors" during
>> recent experiments with partitioning, which made it pretty trivial to
>> get into a situation when we start thrashing the VfdCache. I have a
>> half-written draft of a blog post about that somewhere.
> 
>> But my conclusion was that it's damn difficult to even realize that's
>> happening, especially if you don't have access to the OS / perf, etc.
> 
> Yeah.  fd.c does its level best to keep going even with only a few FDs
> available, and it's hard to tell that you have a performance problem
> arising from that.  (Although I recall old war stories about Postgres
> continuing to chug along just fine after it'd run the kernel out of
> FDs, although every other service on the system was crashing left and
> right, making it difficult e.g. even to log in.  That scenario is why
> I'm resistant to pushing our allowed number of FDs to the moon...)
> 
>> So
>> my takeaway was we should improve that first, so that people have a
>> chance to realize they have this issue, and can do the tuning. The
>> improvements I thought about were:
> 
>> - track hits/misses for the VfdCache (and add a system view for that)
> 
> I think what we actually would like to know is how often we have to
> close an open FD in order to make room to open a different file.
> Maybe that's the same thing you mean by "cache miss", but it doesn't
> seem like quite the right terminology.  Anyway, +1 for adding some way
> to discover how often that's happening.
> 

We can count the evictions (i.e. closing a file so that we can open a
new one) too, but AFAICS that's about the same as counting "misses"
(opening a file after not finding it in the cache). After the cache
warms up, those counts should be about the same, I think.

Or am I missing something?

>> - maybe have wait event for opening/closing file descriptors
> 
> Not clear that that helps, at least for this specific issue.
> 

I don't think Jelte described any specific issue, but the symptoms I've
observed were that a query was accessing a table with ~1000 relations
(partitions + indexes), thrashing the vfd cache, getting ~0% cache hits.
And the open/close calls were taking a lot of time (~25% CPU time).
That'd be very visible as a wait event, I believe.

>> - show max_safe_fds value somewhere, not just max_files_per_process
>>   (which we may silently override and use a lower value)
> 
> Maybe we should just assign max_safe_fds back to max_files_per_process
> after running set_max_safe_fds?  The existence of two variables is a
> bit confusing anyhow.  I vaguely recall that we had a reason for
> keeping them separate, but I can't think of the reasoning now.
> 

That might work. I don't know what the reasons were for not doing it that
way; I suppose there were some.


regards

-- 
Tomas Vondra





Re: describe special values in GUC descriptions more consistently

2025-02-11 Thread Peter Smith
I don't have an opinion about the ssl_crl stuff. Everything else looks
good to me.

==
Kind Regards,
Peter Smith.
Fujitsu Australia.




Re: should we have a fast-path planning for OLTP starjoins?

2025-02-11 Thread Richard Guo
On Mon, Feb 10, 2025 at 5:35 PM Tomas Vondra  wrote:
> On 2/10/25 08:29, Richard Guo wrote:
> > Hmm, I'm still a little concerned about whether the resulting joins
> > are legal.  Suppose we have a join pattern like the one below.
> >
> >  F left join
> >   (D1 inner join T on true) on F.b = D1.b
> >   left join D2 on F.c = D2.c;
> >
> > For this query, the original joinlist is [F, D1, T, D2].  If we
> > reorder it to [[F, T], D1, D2], the sub-joinlist [F, T] would fail to
> > produce any joins, as the F/T join is not legal.
> >
> > This may not be the pattern we are targeting.  But if we intend to
> > support it, I think we need a way to ensure that the resulting joins
> > are legal.

> It's quite possible the PoC patch I posted fails to ensure this, but I
> think the assumption is we'd not reorder joins for dimensions that have
> any join order restrictions (except for the FK join).

Then, we'll need a way to determine if a given relation has join-order
restrictions, which doesn't seem like a trivial task.  We do have the
has_join_restriction() function, but it considers any relations
involved in an outer join as having join restrictions, and that makes
it unsuitable for our needs here.

Thanks
Richard




Re: Convert macros to static inline functions

2025-02-11 Thread Peter Eisentraut

On 31.01.25 14:29, Maxim Orlov wrote:
> Great job! I've been working on the 64 XIDs patch for years, and I've
> never liked this place.  On the other hand, as we know, inlining does not
> always work since it only suggests to the compiler to do it. After all,
> many of these calls are used in pretty "hot" places and every instruction
> is important, in my opinion.  Wouldn't it be better to use
> pg_attribute_always_inline in this particular module?
>
> PFA patch.  I don't use pg_attribute_always_inline for fastgetattr and
> heap_getattr because they are relatively large. I think it's worth
> leaving the possibility for debugging here.


I've done some analysis with -Winline.  The reasons for inlining to fail are:

1) The function is too large.
2) The function call is unlikely.  (Usually when used in elog() arguments.)
3) The function can never be inlined because it uses setjmp().  (This is
kind of a bug on our end, I think.)

The existing uses of pg_attribute_always_inline all appear to address
reason 1.

I think my/your patch does not touch any functions near the limit size,
so it does not seem necessary.

I think if we use pg_attribute_always_inline without any evidence, then
"inline" by itself might become meaningless.






Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)

2025-02-11 Thread Bertrand Drouvot
Hi,

On Thu, Feb 06, 2025 at 10:27:17AM +, Bertrand Drouvot wrote:
> === Patch's structure
> 
> The patch series is made of 3 "small" sub-patches:
> 
> 0001: to move wal_buffers_full to WalUsage
> 0002: to report wal_buffers_full in pg_stat_statements
> 0003: to report wal_buffers_full in explain/auto_explain

While at it, adding 0004 to report wal_buffers_full in VACUUM/ANALYZE (VERBOSE).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 3f7ba865f2d3fe0ab1bd6497fb275d249a486067 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Wed, 5 Feb 2025 13:17:43 +
Subject: [PATCH v2 1/4] Move wal_buffers_full to pgWalUsage

Moving wal_buffers_full to pgWalUsage. That simplifies the refactoring needed
in an upcoming commit (to display per-backend WAL statistics). Last but not
least that also opens the door to adding wal_buffers_full in EXPLAIN and in
pg_stat_statements (all existing pgWalUsage members are reported in the
above).
---
 src/backend/access/transam/xlog.c   | 2 +-
 src/backend/executor/instrument.c   | 2 ++
 src/backend/utils/activity/pgstat_wal.c | 2 +-
 src/include/executor/instrument.h   | 1 +
 src/include/pgstat.h| 1 -
 5 files changed, 5 insertions(+), 3 deletions(-)
  19.2% src/backend/access/transam/
  30.4% src/backend/executor/
  24.1% src/backend/utils/activity/
  17.7% src/include/executor/
   8.4% src/include/

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9c270e7d466..b334f0dd9e6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2057,7 +2057,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 	WriteRqst.Flush = 0;
 	XLogWrite(WriteRqst, tli, false);
 	LWLockRelease(WALWriteLock);
-	PendingWalStats.wal_buffers_full++;
+	pgWalUsage.wal_buffers_full++;
 	TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 }
 /* Re-acquire WALBufMappingLock and retry */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2d3569b3748..56e635f4700 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -280,6 +280,7 @@ WalUsageAdd(WalUsage *dst, WalUsage *add)
 	dst->wal_bytes += add->wal_bytes;
 	dst->wal_records += add->wal_records;
 	dst->wal_fpi += add->wal_fpi;
+	dst->wal_buffers_full += add->wal_buffers_full;
 }
 
 void
@@ -288,4 +289,5 @@ WalUsageAccumDiff(WalUsage *dst, const WalUsage *add, const WalUsage *sub)
 	dst->wal_bytes += add->wal_bytes - sub->wal_bytes;
 	dst->wal_records += add->wal_records - sub->wal_records;
 	dst->wal_fpi += add->wal_fpi - sub->wal_fpi;
+	dst->wal_buffers_full += add->wal_buffers_full - sub->wal_buffers_full;
 }
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 18fa6b2936a..c1ca65eb991 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -123,7 +123,7 @@ pgstat_wal_flush_cb(bool nowait)
 	WALSTAT_ACC(wal_records, wal_usage_diff);
 	WALSTAT_ACC(wal_fpi, wal_usage_diff);
 	WALSTAT_ACC(wal_bytes, wal_usage_diff);
-	WALSTAT_ACC(wal_buffers_full, PendingWalStats);
+	WALSTAT_ACC(wal_buffers_full, wal_usage_diff);
 	WALSTAT_ACC(wal_write, PendingWalStats);
 	WALSTAT_ACC(wal_sync, PendingWalStats);
 	WALSTAT_ACC_INSTR_TIME(wal_write_time);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 5a6eff75c69..03653ab6c6c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -53,6 +53,7 @@ typedef struct WalUsage
 	int64		wal_records;	/* # of WAL records produced */
 	int64		wal_fpi;		/* # of WAL full page images produced */
 	uint64		wal_bytes;		/* size of WAL records produced */
+	int64		wal_buffers_full;	/* # of times the WAL buffers became full */
 } WalUsage;
 
 /* Flag bits included in InstrAlloc's instrument_options bitmask */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index dd823d3f56e..53f2a8458e6 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -495,7 +495,6 @@ typedef struct PgStat_WalStats
  */
 typedef struct PgStat_PendingWalStats
 {
-	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
 	instr_time	wal_write_time;
-- 
2.34.1

>From 5d0345554afc326e6dc5e657694ba3f832b97d7b Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Wed, 5 Feb 2025 15:55:36 +
Subject: [PATCH v2 2/4] Allow pg_stat_statements to track WAL buffers full

Now that wal_buffers_full is part of the WalUsage struct, let's report it in
pg_stat_statements.

This commit doesn't bump the version of pg_stat_statements as the same is done
for this release in commit cf54a2c00254.
---
 contrib/pg_stat_statements/expected/oldextversions.out   | 1 +
 .../pg_stat_statements--1.11--1.12.sql   | 1 +
 contrib/pg_stat_statements/pg_stat

Re: Track the amount of time waiting due to cost_delay

2025-02-11 Thread Bertrand Drouvot
Hi,

On Mon, Feb 10, 2025 at 02:52:46PM -0600, Nathan Bossart wrote:
> Here is what I have prepared for commit.  Other than expanding the commit
> messages, I've modified 0001 to just add a parameter to
> vacuum_delay_point() to indicate whether this is a vacuum or analyze.  I
> was worried that adding an analyze_delay_point() could cause third-party
> code to miss this change.  We want such code to correctly indicate the type
> of operation so that the progress views work for them, too.

Good point, that makes fully sense. v17 LGTM.

> Off-list, I've asked Bertrand to gauge the feasibility of adding this
> information to the autovacuum logs and to VACUUM/ANALYZE (VERBOSE).  IMHO
> those are natural places to surface this information, and I want to ensure
> that we're not painting ourselves into a corner with the approach we're
> using for the progress views.

Yeah, I looked at it and that looks as simple as 0003 attached (as that's the
leader that is doing the report in case of parallel workers being used).

0001 and 0002 remain unchanged.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From f4640855bbeb0b2fb63e8ba5afff61c04302fa40 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Mon, 10 Feb 2025 11:41:22 -0600
Subject: [PATCH v17 1/3] Add is_analyze parameter to vacuum_delay_point().

This function is used in both vacuum and analyze code paths, and a
follow-up commit will require distinguishing between the two.  This
commit forces callers to declare whether they are being used for
vacuum or analyze, but it does not use that information for
anything yet.

Author: Nathan Bossart 
Co-authored-by: Bertrand Drouvot 
Discussion: https://postgr.es/m/ZmaXmWDL829fzAVX%40ip-10-97-1-34.eu-west-3.compute.internal
---
 contrib/bloom/blvacuum.c  |  4 ++--
 contrib/file_fdw/file_fdw.c   |  2 +-
 src/backend/access/gin/ginfast.c  |  6 +++---
 src/backend/access/gin/ginvacuum.c|  6 +++---
 src/backend/access/gist/gistvacuum.c  |  2 +-
 src/backend/access/hash/hash.c|  2 +-
 src/backend/access/heap/vacuumlazy.c  |  4 ++--
 src/backend/access/nbtree/nbtree.c|  2 +-
 src/backend/access/spgist/spgvacuum.c |  4 ++--
 src/backend/commands/analyze.c| 10 +-
 src/backend/commands/vacuum.c |  2 +-
 src/backend/tsearch/ts_typanalyze.c   |  2 +-
 src/backend/utils/adt/array_typanalyze.c  |  2 +-
 src/backend/utils/adt/rangetypes_typanalyze.c |  2 +-
 src/include/commands/vacuum.h |  2 +-
 15 files changed, 26 insertions(+), 26 deletions(-)
   7.5% contrib/bloom/
   3.6% contrib/file_fdw/
  22.9% src/backend/access/gin/
   3.6% src/backend/access/gist/
   3.7% src/backend/access/hash/
   7.5% src/backend/access/heap/
   3.6% src/backend/access/nbtree/
   7.3% src/backend/access/spgist/
  22.8% src/backend/commands/
   3.6% src/backend/tsearch/
   7.3% src/backend/utils/adt/
   6.1% src/include/commands/

diff --git a/contrib/bloom/blvacuum.c b/contrib/bloom/blvacuum.c
index 7e1db0b52fc..86b15a75f6f 100644
--- a/contrib/bloom/blvacuum.c
+++ b/contrib/bloom/blvacuum.c
@@ -57,7 +57,7 @@ blbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
    *itupPtr,
    *itupEnd;
 
-		vacuum_delay_point();
+		vacuum_delay_point(false);
 
 		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
 	RBM_NORMAL, info->strategy);
@@ -187,7 +187,7 @@ blvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		Buffer		buffer;
 		Page		page;
 
-		vacuum_delay_point();
+		vacuum_delay_point(false);
 
 		buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno,
 	RBM_NORMAL, info->strategy);
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 678e754b2b9..0655bf532a0 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -1237,7 +1237,7 @@ file_acquire_sample_rows(Relation onerel, int elevel,
 	for (;;)
 	{
 		/* Check for user-requested abort or sleep */
-		vacuum_delay_point();
+		vacuum_delay_point(true);
 
 		/* Fetch next row */
 		MemoryContextReset(tupcontext);
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 4ab815fefe0..cc5d046c4b0 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -892,7 +892,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
 		 */
 		processPendingPage(&accum, &datums, page, FirstOffsetNumber);
 
-		vacuum_delay_point();
+		vacuum_delay_point(false);
 
 		/*
 		 * Is it time to flush memory to disk?	Flush if we are at the end of
@@ -929,7 +929,7 @@ ginInsertCleanup(GinState *ginstate, bool full_clean,
 			{
 ginEntryInsert(ginstate, attnum, key, category,
 			   list, nlist, NULL);
-vacuum_delay_point();
+vacuum_delay_point(false);
 			}
 
 			/*
@@ -1002,7 +1002,7 @@ ginInsertClea

Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-02-11 Thread Shubham Khanna
On Tue, Feb 11, 2025 at 6:30 AM Peter Smith  wrote:
>
> Hi Shubham.
>
> Responding with a blanket statement "Fixed the given comments" makes
> it difficult to be sure what was done when I come to confirm the
> changes. Often embedded questions go unanswered, and if changes are
> *not* made, then I can't tell if there was a good reason for rejection
> or was the comment just overlooked. Please can you treat the itemised
> feedback as a checklist and reply individually to each item. Even just
> saying "Fixed" is useful because then I can trust the item was seen
> and addressed.
>

I appreciate the feedback. I will ensure that each item is addressed
individually.

> e.g. From previous review [1]
> #7. Not fixed. The same docs issue still exists for --publication and
> for --subscription.

Fixed this issue in the attached patch.

> #13. Unanswered question "How are tests expecting this even passing?".
> Was a reason identified? IOW, how can we be sure the latest tests
> don't have a similar problem?
>

In the v4-0001 patch [1], the tests were mistakenly using
'command_fails' instead of 'command_fails_like' to verify failed test
cases. Since 'command_fails' only checks for a non-zero exit code and
not specific error messages, the tests were passing even when the
expected errors were not being triggered correctly.
To address this, I have updated the patch to use 'command_fails_like',
ensuring that the test cases now explicitly verify the correct failure
messages.

> //
>
> Some more review comments for the patch v5-0001.
>
> ==
> doc/src/sgml/ref/pg_createsubscriber.sgml
>
> Synopsis
>
> 1.
> pg_createsubscriber [option...] { -a | --all-databases } { -d |
> --database }dbname { -D | --pgdata }datadir { -P | --publisher-server
> }connstr
>
> This looks misleading because '-all-database' and '--database' cannot
> be combined.
>
> I think it will be better to have 2 completely different synopsis like
> I had originally suggested [2, comment #5]. E.g. just add a whole
> separate  block adjacent to the other one in the SGML:
>
> --
>   
>pg_createsubscriber
>option
>
> 
>  -a
>  --all-databases
> 
> 
>  -D 
>  --pgdata
> 
> datadir
> 
>  -P
>  --publisher-server
> 
> connstr
>
>   
> --
>
> ~~~
>

Fixed.

> 2. --all-databases
>
> +  
> +   --all-databases cannot be used with
> +   --database, --publication,
> +   --replication-slot or 
> --subscription.
> +  
>
> If you want consistent wording with the other places in this patch,
> then this should just be the last sentence of the previous paragraph
> and be worded like: "Cannot be used together with..."
>
> ~~~
>

Fixed.

> 3. --publication
>
> -   a generated name is assigned to the publication name.
> +   a generated name is assigned to the publication name. Cannot be used
> +   together with -a.
>
> Previously reported. Claimed fixed -a to --all-databases. Not fixed.
>
> ~~~
>

Fixed.

> 4. --subscription
>
> -   a generated name is assigned to the subscription name.
> +   a generated name is assigned to the subscription name. Cannot be used
> +   together with -a.
>
>
> (same as the previous review comment).
>
> Previously reported. Claimed fixed -a to --all-databases. Not fixed.
>

Fixed.

> ==
> src/bin/pg_basebackup/pg_createsubscriber.c
>
> 5.
> + bool all_databases; /* fetch and specify all databases */
>
> Comment wording. "and specified" (??). Maybe just "--all-databases
> option was specified"
>
> ==

Fixed.

The attached Patch contains the suggested changes.

[1] - 
https://www.postgresql.org/message-id/CAHv8Rj%2BDOQ9vvkKCsmDeFKyM073yRfs47Tn_qywQOdx15pmOMA%40mail.gmail.com

Thanks and regards,
Shubham Khanna.


v6-0001-Enhance-pg_createsubscriber-to-fetch-and-append-a.patch
Description: Binary data


Re: Virtual generated columns

2025-02-11 Thread Richard Guo
On Tue, Feb 11, 2025 at 10:34 AM Richard Guo  wrote:
> Yeah, I also feel that the virtual generated columns should adhere to
> outer join semantics, rather than being unconditionally replaced by
> the generation expressions.  But maybe I'm wrong.
>
> If that's the case, this incorrect-result issue isn't limited to
> constant expressions; it could also occur with non-strict ones.

It seems that outer-join removal does not work well with virtual
generated columns.

create table t (a int, b int);
create table vt (a int primary key, b int generated always as (a * 2));

explain (costs off)
select t.a from t left join vt on t.a = vt.a where coalesce(vt.b, 1) = 1;
  QUERY PLAN
---
 Seq Scan on t
(1 row)

This plan does not seem correct to me.  The inner-rel attribute 'vt.b'
is used above the join, which means the join should not be removed.

explain (costs off)
select t.a from t left join vt on t.a = vt.a where coalesce(vt.b, 1) =
1 or t.a is null;
server closed the connection unexpectedly

For this query, an Assert in remove_rel_from_query() is hit.

I haven't looked into the details yet, but I suspect that both of
these issues are caused by our failure to mark the correct nullingrel
bits for the virtual generated columns.

Thanks
Richard




Re: pgbench with partitioned tables

2025-02-11 Thread Sergey Tatarintsev

08.02.2025 17:32, Álvaro Herrera writes:
> On 2025-Feb-07, Melanie Plageman wrote:
>
>> Okay, I've stared at this a bit, and it seems basically fine the way
>> it is (I might add a bit more whitespace, clean up the commit message,
>> etc). So I'm interested in committing it. I will admit that having
>> never committed anything to pgbench, I'm a bit nervous. So, give me a
>> couple days to work up the nerve.
>
> Sounds good.  I only wanted to say that I would use PQexecParams()
> instead of PQexec() in the new function, avoiding the psprintf and
> pfree, and also that initializing relkind to RELKIND_RELATION is strange
> and probably unnecessary.

Let it be so. I made the suggested changes (patch v4 attached)


--
With best regards,
Sergey Tatarintsev,
PostgresPro
From cb363a01cdff72ccde303a1a67180e96bc9e74a8 Mon Sep 17 00:00:00 2001
From: Sergey Tatarintsev 
Date: Mon, 3 Feb 2025 19:18:06 +0700
Subject: [PATCH] Fix pgbench client-side data generation for partitioned
 tables

Client-side data generation cannot perform COPY WITH (FREEZE = ON) on partitioned
tables except pgbench_accounts.
Patch disables FREEZE for any non-regular tables
---
 src/bin/pgbench/pgbench.c | 37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 40592e6260..8b867b45a7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -53,6 +53,7 @@
 #include 
 #endif
 
+#include "catalog/pg_class.h"
 #include "common/int.h"
 #include "common/logging.h"
 #include "common/pg_prng.h"
@@ -848,6 +849,30 @@ static const PsqlScanCallbacks pgbench_callbacks = {
 	NULL,		/* don't need get_variable functionality */
 };
 
+static char
+get_table_relkind(PGconn *con, const char *table)
+{
+	PGresult	*res;
+	char		*val;
+	char		relkind;
+	const char	*sql = "SELECT relkind FROM pg_catalog.pg_class WHERE oid=$1::pg_catalog.regclass";
+	const char	*params[1] = {table};
+
+	res = PQexecParams(con, sql, 1, NULL, params, NULL, NULL, 0);
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("query failed: %s", PQerrorMessage(con));
+		pg_log_error_detail("Query was: %s", sql);
+		exit(1);
+	}
+	val = PQgetvalue(res, 0, 0);
+	Assert(strlen(val) == 1);
+	relkind = val[0];
+	PQclear(res);
+
+	return relkind;
+}
+
 static inline pg_time_usec_t
 pg_time_now(void)
 {
@@ -4962,16 +4987,10 @@ initPopulateTable(PGconn *con, const char *table, int64 base,
 
 	initPQExpBuffer(&sql);
 
-	/*
-	 * Use COPY with FREEZE on v14 and later for all the tables except
-	 * pgbench_accounts when it is partitioned.
-	 */
-	if (PQserverVersion(con) >= 14)
-	{
-		if (strcmp(table, "pgbench_accounts") != 0 ||
-			partitions == 0)
+	/* Use COPY with FREEZE on v14 and later for all regular tables */
+	if ((PQserverVersion(con) >= 14) && get_table_relkind(con, table) == RELKIND_RELATION)
 			copy_statement_fmt = "copy %s from stdin with (freeze on)";
-	}
+
 
 	n = pg_snprintf(copy_statement, sizeof(copy_statement), copy_statement_fmt, table);
 	if (n >= sizeof(copy_statement))
-- 
2.43.0



Re: RFC: Allow EXPLAIN to Output Page Fault Information

2025-02-11 Thread Jelte Fennema-Nio
On Tue, 11 Feb 2025 at 00:53, Andres Freund  wrote:
> > I mostly meant worker based AIO, yes. I haven't checked how accurately these
> > are kept for io_uring. I would hope they are...
>
> It does look like it is tracked.

nice!

> > The thing is that you'd often get completely misleading stats. Some of the 
> > IO
> > will still be done by the backend itself, so there will be a non-zero
> > value. But it will be a significant undercount, because the asynchronously
> > executed IO won't be tracked (if worker mode is used).

Yeah, makes sense. Like I said, I would be completely fine with not
showing these numbers at all/setting them to 0 for setups where we
cannot easily get useful numbers (and this bgworker AIO would be one
of those setups).

> Independent to of this, it's probably not good that we're tracking shared
> buffer hits after io combining, if I interpret this correctly... That looks to
> be an issue in master, not just the AIO branch.

You mean that e.g. a combined IO for 20 blocks still counts only as 1
"shared read"? Yeah, that sounds like a bug.




Re: Non-text mode for pg_dumpall

2025-02-11 Thread Mahendra Singh Thalor
On Tue, 11 Feb 2025 at 20:40, jian he  wrote:
>
> hi.
> review based on v16.
>
> because of
>
https://postgr.es/m/cafc+b6pwqisl+3rvlxn9vhc8aonp4ov9c6u+bvd6kmwmdbd...@mail.gmail.com
>
> in copy_global_file_to_out_file, now it is:
> if (strcmp(outfile, "-") == 0)
> OPF = stdout;
> I am confused, why "-" means stdout.
> ``touch ./- `` command works fine.
> I think dash is not a special character, you may see
> https://stackoverflow.com/a/40650391/15603477

"-" is used for stdout. This is mentioned in the doc.
pg_restore link 

> -f *filename*
> --file=*filename*
>
> Specify output file for generated script, or for the listing when used
> with -l. Use - for stdout.
>

>
>
> + /* Create a subdirectory with 'databases' name under main directory. */
> + if (mkdir(db_subdir, 0755) != 0)
> + pg_log_error("could not create subdirectory \"%s\": %m", db_subdir);
> here we should use pg_fatal?

Yes, we should use pg_fatal.

>
>
> pg_log_info("executing %s", sqlstatement.data);
> change to
> pg_log_info("executing query: %s", sqlstatement.data);
> message would be more similar to the next pg_log_error(...) message.

Okay.

>
>
> + /*
> + * User is suggested to use single database dump for --list option.
> + */
> + if (opts->tocSummary)
> + pg_fatal("option -l/--list cannot be used when using dump of
pg_dumpall");
> maybe change to
> + pg_fatal("option -l/--list cannot be used when restoring multiple
databases");

okay.

>
> $BIN10/pg_restore --format=directory --list dir10_x
> if the directory only has one database, then we can actually print out
> the tocSummary.
> if the directory has more than one database then pg_fatal.
> To tolerate this corner case (only one database) means that pg_restore
> --list requires a DB connection,
> but I am not sure that is fine.
> anyway, the attached patch allows this corner case.

No, we don't need this corner case. If a user wants to restore a
single database with --list option, then the user should give a particular
dump file with pg_restore.

>
>
> PrintTOCSummary can only print out summary for a single database.
> so we don't need to change PrintTOCSummary.
>
>
> + /*
> + * To restore multiple databases, -C (create database) option should
> be specified
> + * or all databases should be created before pg_restore.
> + */
> + if (opts->createDB != 1)
> + pg_log_info("restoring dump of pg_dumpall without -C option, there
> might be multiple databases in directory.");
>
> we can change it to
> + if (opts->createDB != 1 && num_db_restore > 0)
> + pg_log_info("restoring multiple databases without -C option.");

okay.

>
>
> Bug.
> when pg_restore --globals-only can be applied when we are restoring a
> single database (can be an output of pg_dump).

As of now, we are ignoring this option. We can add an error in the "else"
part of the global.dat file.
Ex: option --globals-only is only supported with dump of pg_dumpall.
Similarly --exclude-database also.

>
>
> There are some tests per https://commitfest.postgresql.org/52/5495, I
> will check it later.
> The attached patch is the change for the above reviews.


-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com


Re: AIO v2.3

2025-02-11 Thread Robert Haas
On Tue, Feb 11, 2025 at 12:11 PM James Hunter  wrote:
> I like this idea. If we want to submit a batch, then just submit a batch.

Sounds good to me, too.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: AIO v2.3

2025-02-11 Thread James Hunter
On Mon, Feb 10, 2025 at 2:40 PM Thomas Munro  wrote:
>
...
> Problem statement: You want to be able to batch I/O submission, ie
> make a single call to ioring_enter() (and other mechanisms) to start
> several I/Os, but the code that submits is inside StartReadBuffers()
> and the code that knows how many I/Os it wants to start now is at a
> higher level, read_stream.c and in future elsewhere.  So you invented
> this flag to tell StartReadBuffers() not to call
> pgaio_submit_staged(), because you promise to do it later, via this
> staging list.  Additionally, there is a kind of programming rule here
> that you *must* submit I/Os that you stage, you aren't allowed to (for
> example) stage I/Os and then sleep, so it has to be a fairly tight
> piece of code.
>
> Would the API be better like this?:  When you want to create a batch
> of I/Os submitted together, you wrap the work in pgaio_begin_batch()
> and pgaio_submit_batch(), eg the loop in read_stream_lookahead().
> Then bufmgr wouldn't need this flag: when it (or anything else) calls
> smgrstartreadv(), if there is not currently an explicit batch then it
> would be submitted immediately, and otherwise it would only be staged.
> This way, batch construction (or whatever word you prefer for batch)
> is in a clearly and explicitly demarcated stretch of code in one
> lexical scope (though its effect is dynamically scoped just like the
> staging list itself because we don't want to pass explicit I/O
> contexts through the layers), but code that doesn't call those and
> reaches AsyncReadBuffer() or whatever gets an implicit batch of size
> one and that's also OK.  Not sure what semantics nesting would have
> but I doubt it matters much.

I like this idea. If we want to submit a batch, then just submit a batch.
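
For what it's worth, the shape would presumably be roughly like this
(pgaio_begin_batch()/pgaio_submit_batch() are the names proposed above; the
rest is illustrative, not real read_stream.c code):

static void
issue_lookahead_batch(ReadStream *stream, int lookahead_count)
{
    pgaio_begin_batch();
    for (int i = 0; i < lookahead_count; i++)
    {
        /* inside a batch, each start call only stages its I/O */
        start_one_read(stream, i);  /* hypothetical helper */
    }
    pgaio_submit_batch();       /* one submission syscall for everything staged */
}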

James




Re: explain analyze rows=%.0f

2025-02-11 Thread Andrei Lepikhov

On 8/2/2025 04:28, Ilia Evdokimov wrote:

On 08.02.2025 00:01, Matheus Alcantara wrote:

Just for reference I'm trying to apply based on commit fb056564ec5.
You are right: two commits touching partition_prune.out appeared after the
v6 patch was created, so v6 must not have applied on master.
Then I created the v7 patch rebased on fb056564ec5. Thanks for your remark!
I support the idea in general, but I believe it should be expanded to 
cover all cases of parameterised plan nodes. Each rescan iteration may 
produce a different number of tuples, and rounding can obscure important 
data.


For example, consider five loops of a scan node: the first loop returns
nine tuples, and each of the others returns zero. When we calculate the
average, 9 divided by 5 equals 1.8. This results in an explanation that
indicates "rows = 1", masking almost 40% of the data.


Now, if we keep the same per-loop average but scale up to a total of
900,000 tuples, then the 400,000 masked tuples represent a significant
portion of the data.


Moreover, switching to a floating-point type for row explanations in 
each parameterised node would provide a more comprehensive view and add 
valuable information about the parameterisation of the node, which may 
not be immediately apparent.


--
regards, Andrei Lepikhov




Re: explain analyze rows=%.0f

2025-02-11 Thread Robert Haas
On Tue, Feb 11, 2025 at 12:14 PM Andrei Lepikhov  wrote:
> I support the idea in general, but I believe it should be expanded to
> cover all cases of parameterised plan nodes. Each rescan iteration may
> produce a different number of tuples, and rounding can obscure important
> data.
>
> For example, consider five loops of a scan node: the first loop returns
> nine tuples, and each other - zero tuples. When we calculate the
> average, 9 divided by 5 equals 1.8. This results in an explanation that
> indicates "rows = 1," masking almost 40% of the data.
>
> Now, if we apply the same two loops but have a total of 900,000 tuples,
> then 400,000 masked tuples represent a significant portion of the data.
>
> Moreover, switching to a floating-point type for row explanations in
> each parameterised node would provide a more comprehensive view and add
> valuable information about the parameterisation of the node, which may
> not be immediately apparent.

I agree strongly with all of this. I believe we should just implement
what was agreed here:

https://www.postgresql.org/message-id/21013.1243618236%40sss.pgh.pa.us

Let's just display 2 fractional digits when nloops>1, else 0, and call it good.
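
A minimal sketch of that rule, just to pin down the formatting being
proposed (the helper name and its caller are assumptions; only
appendStringInfo() and the rows/loops values come from the existing
EXPLAIN code):

static void
show_actual_rows(StringInfo buf, double rows, double nloops)
{
	/* two fractional digits only when the node was executed more than once */
	int			digits = (nloops > 1) ? 2 : 0;

	appendStringInfo(buf, " rows=%.*f loops=%.0f", digits, rows, nloops);
}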

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Using Expanded Objects other than Arrays from plpgsql

2025-02-11 Thread Tom Lane
Andrey Borodin  writes:
> On 7 Feb 2025, at 02:05, Tom Lane  wrote:
>> Do you have any further comments on this patch?

> No, all steps of the patch set look good to me.

Pushed then.  Thanks for reviewing!

regards, tom lane




Re: RFC: Allow EXPLAIN to Output Page Fault Information

2025-02-11 Thread Jelte Fennema-Nio
On Tue, 11 Feb 2025 at 17:19, Andres Freund  wrote:
> Yes, at least initially:

Ah, then I understand your point of view much better. Still, I think we
could easily frame it as: if you enable io_uring, you also get these
additional fancy stats.

Also, afaict the items you list don't have to mean that:

> 1) it's not enabled on the kernel level everywhere

Is that really a common thing to do? From a quick internet search, I
can "only" find that Google does this. (Google is definitely a big
cloud player, so I don't want to suggest that that is not important,
but if it's really the only one, the bulk of systems would still have
io_uring support.)

> 2) it requires an optional build dependency

What build dependency is this? In any case, can't we choose the
default at build time based on the available build dependencies? And
if we cannot, I think we could always add an "auto" default that would
mean the best available AIO implementation (where io_uring is better
than bgworkers).

> 3) it requires tuning the file descriptor ulimit, unless we can convince Tom
>that it's ok to do that ourselves

I think we should just do this, given the reasoning in the blog[1]
from the systemd author I linked in the AIO thread. I agree that a
response/explicit approval from Tom would be nice though.

[1]: https://0pointer.net/blog/file-descriptor-limits.html
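
For reference, the mechanism being debated is tiny; a self-contained
sketch of raising the soft limit to the hard limit looks like this
(whether postgres should do this on its own is exactly the open question
above):

#include <sys/resource.h>

static void
raise_nofile_soft_limit(void)
{
	struct rlimit rl;

	/* lift the soft RLIMIT_NOFILE up to whatever the hard limit allows */
	if (getrlimit(RLIMIT_NOFILE, &rl) == 0 && rl.rlim_cur < rl.rlim_max)
	{
		rl.rlim_cur = rl.rlim_max;
		(void) setrlimit(RLIMIT_NOFILE, &rl);
	}
}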




Re: postgresql.conf.sample ordering for IO, worker related GUCs

2025-02-11 Thread Andres Freund
Hi,

On 2025-01-31 18:21:54 -0500, Andres Freund wrote:
> On January 31, 2025 5:22:43 PM EST, Robert Treat  wrote:
> >On Fri, Jan 31, 2025 at 10:25 AM Andres Freund  wrote:
> >>
> >> Hi,
> >>
> >> On 2025-01-30 21:24:05 -0500, Andres Freund wrote:
> >> > On January 30, 2025 8:55:36 PM EST, Tom Lane  wrote:
> >> > >Andres Freund  writes:
> >> > >> While working on polishing the AIO patchset, I was trying to figure 
> >> > >> out where
> >> > >> to fit the new GUCs. So far I had a new "top-level" #--- style 
> >> > >> section named
> >> > >> "WIP AIO GUC docs" which I suspect you all won't let me get away with.
> >> > >> There is an existing (sub-)section which already has a few related 
> >> > >> GUCs and
> >> > >> could fit AIO related ones.
> >> > >
> >> > >I think the normal theory for postgresql.conf.sample is that it should
> >> > >match the organization of config.sgml.  What are you planning there?
> >> >
> >> > Pretty much the same. I.e. I'm thinking that the worker stuff should be 
> >> > it's
> >> > own subsection and that the existing IO parameters should be moved to 
> >> > either
> >> > a new subsection or a new top level section.  But I'm wondering how 
> >> > others
> >> > think it should be structured...
> >>
> >> Here are draft changes for the minimal thing I think we should do.
> >>
> >> I don't really know what to do about the "IO" abbreviation. The other 
> >> sections
> >> un-abbreviate abbreviations, but I suspect that Input/Output will be less
> >> informative than IO to most...
> >>
> >> I still wonder if we instead ought to create a top-level "IO" section 
> >> instead
> >> of leaving it under "Resource Usage".  How many IOs we combine, how
> >> aggressively we flush unflushed data, etc only kinda fits into the resource
> >> usage category.
> >>
> >
> >+1 from me, though I did pause on whether it should be called
> >"background workers" rather than "worker processes", but I think this
> >is the right direction.
>
> +1 for the patch as-is, or my suggestion to make it a top level section?

I pushed the patch as I had it, save for replacing IO with I/O where
appropriate. If we want to go for a larger restructuring in the future, this
won't make it harder, and will perhaps even make it a little simpler.

Greetings,

Andres Freund




Re: pg_stat_statements and "IN" conditions

2025-02-11 Thread Sami Imseih
I do not have an explanation from the patch yet, but I have a test
that appears to show unexpected results. I only tested a few datatypes,
but from what I observe, some merge as expected and others do not;
i.e. int columns merge correctly but bigint do not.

"""
show pg_stat_statements.query_id_const_merge ;
create table foo (col_int int, col_smallint smallint, col_bigint
bigint, col_float float, col_text text, col_varchar varchar);
select from foo where col_int in (1, 2, 3);
select from foo where col_int in (1, 2, 3, 4);
select from foo where col_int in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
select from foo where col_smallint in (1, 2, 3);
select from foo where col_smallint in (1, 2, 3, 4);
select from foo where col_bigint in (1, 2, 3);
select from foo where col_bigint in (1, 2, 3, 4);
select from foo where col_bigint in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
select from foo where col_float in (1, 2, 3);
select from foo where col_float in (1, 2, 3, 4);
select query, queryid, calls from pg_stat_statements where query like
'select from foo where%' order by 1 desc ;
"""

postgres=# show pg_stat_statements.query_id_const_merge ;
 pg_stat_statements.query_id_const_merge
-
 on
(1 row)


...
..

postgres=# select query, queryid, calls from pg_stat_statements
           where query like 'select from foo where%' order by 1 desc;
                                     query                                      |       queryid        | calls
--------------------------------------------------------------------------------+----------------------+-------
 select from foo where col_smallint in (...)                                    | -2065640271713949220 |     2
 select from foo where col_int in (...)                                         |      291124129257715 |     3
 select from foo where col_float in ($1, $2, $3, $4)                            | -6847088148705359339 |     1
 select from foo where col_float in ($1, $2, $3)                                |  1631437678183488606 |     1
 select from foo where col_bigint in ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)  |  3174053975478689499 |     1
 select from foo where col_bigint in ($1, $2, $3, $4)                           | -5236067031911646410 |     1
 select from foo where col_bigint in ($1, $2, $3)                               | -5529594240898645457 |     1
(7 rows)

---

Sami




Re: Expanding HOT updates for expression and partial indexes

2025-02-11 Thread Matthias van de Meent
On Tue, 11 Feb 2025 at 00:20, Nathan Bossart  wrote:
>
> On Mon, Feb 10, 2025 at 06:17:42PM +0100, Matthias van de Meent wrote:
> > I have serious doubts about the viability of any proposal working to
> > implement PHOT/WARM in PostgreSQL, as they seem to have an inherent
> > nature of fundamentally breaking the TID lifecycle:
> > [... concerns]
>
> I share your concerns, but I don't think things are as dire as you suggest.
> For example, perhaps we put a limit on how long a PHOT chain can be, or
> maybe we try to detect update patterns that don't work well with PHOT.
> Another option could be to limit PHOT updates to only when the same set of
> indexed columns are updated or when <50% of the indexed columns are
> updated.  These aren't fully fleshed-out ideas, of course, but I am at
> least somewhat optimistic we could find appropriate trade-offs.

Yes, there are methods which could limit the overhead. But I'm not
sure there are cheap-enough designs which would make PHOT a
universally good choice (i.e. not tunable with guc/table option),
considering its significantly larger un-reclaimable storage overhead
vs HOT.

Kind regards,

Matthias van de Meent.




Re: pg_stat_statements and "IN" conditions

2025-02-11 Thread Álvaro Herrera
Hello

I noticed something that surprised me at first, but on further looking
it should have been obvious: setting pg_stat_statements.query_id_const_merge
affects the query ID for all readers of it, not just pg_stat_statements.
This is good because it preserves the property that pg_stat_activity
entries can be matched to pg_stat_statements entries by query_id.

Looking to commit 0001 soon.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"On the other flipper, one wrong move and we're Fatal Exceptions"
(T.U.X.: Term Unit X  - http://www.thelinuxreview.com/TUX/)




Re: Move wal_buffers_full to WalUsage (and report it in pgss/explain)

2025-02-11 Thread Bertrand Drouvot
Hi,

On Tue, Feb 11, 2025 at 03:30:09PM +0300, Ilia Evdokimov wrote:
> Hi,
> 
> Thank you for your work!

Thanks for the review!

> 1. Perhaps In EXPLAIN you forget to check that usage->wal_buffers_full > 0:
> 
> if ((usage->wal_records > 0) || (usage->wal_fpi > 0) || (usage->wal_bytes >
> 0))

I don't think it's possible to have wal_buffers_full > 0 if the above returns
false. A check is done at appendStringInfo() time, so I think that's OK as it is.

> 2. I have a small suggestion for pg_stat_statements: would it make sense to
> move wal_buffers_full next to wal_records, wal_fpi and wal_bytes? This way,
> all WAL-related information would be grouped together.

I think I prefer to add it in "append" order. That way, it does not break
queries that rely on ordinal numbers.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: RFC: Allow EXPLAIN to Output Page Fault Information

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-11 09:59:43 +0100, Jelte Fennema-Nio wrote:
> On Tue, 11 Feb 2025 at 00:53, Andres Freund  wrote:
> > > The thing is that you'd often get completely misleading stats. Some of 
> > > the IO
> > > will still be done by the backend itself, so there will be a non-zero
> > > value. But it will be a significant undercount, because the asynchronously
> > > executed IO won't be tracked (if worker mode is used).
> 
> Yeah, makes sense. Like I said, I would be completely fine with not
> showing these numbers at all/setting them to 0 for setups where we
> cannot easily get useful numbers (and this bgworker AIO would be one
> of those setups).

Shrug. It means that it'll not work in what I hope will be the default
mechanism before long.  I just can't get excited for that. In all likelihood
it'll result in bug reports that I'll then be on the hook to fix.


> > Independent to of this, it's probably not good that we're tracking shared
> > buffer hits after io combining, if I interpret this correctly... That looks 
> > to
> > be an issue in master, not just the AIO branch.
> 
> You mean that e.g. a combined IO for 20 blocks still counts only as 1
> "shared read"? Yeah, that sounds like a bug.

Yep.

Greetings,

Andres Freund




Re: NOT ENFORCED constraint feature

2025-02-11 Thread Isaac Morland
On Tue, 11 Feb 2025 at 08:36, Álvaro Herrera 
wrote:

> On 2025-Feb-10, Isaac Morland wrote:
>
> > I'm having a lot of trouble understanding the operational distinction
> > between your 'u' and 'U'. If it's not enforced, it cannot be assumed to
> be
> > valid, regardless of whether it was valid in the past. I'm not sure what
> I
> > think of a single character vs. 2 booleans, but there are only 3 sensible
> > states either way: valid enforced, invalid enforced, and invalid
> unenforced.
>
> I kinda agree with you and would prefer that things were that way as
> well.  But look at the discussion starting at
>
> https://postgr.es/m/CAExHW5tV23Sw+Nznv0KpdNg_t7LrXY1WM9atiC=ekksskhs...@mail.gmail.com
> whereby it was apparently established that if you have a
> NOT VALID NOT ENFORCED
> constraint, and you make it enforced, then you should somehow end up
> with a NOT VALID ENFORCED constraint, which says to me that we need to
> store the fact that the constraint was NOT VALID to start with; and
> correspondingly if it's VALID NOT ENFORCED and you enforce it, then it
> ends up VALID ENFORCED.  If we take this view of the world (with which,
> I repeat, I disagree) then we must keep track of whether the constraint
> was valid or not valid to start with.  And this means that we need to
> keep convalidated=true _regardless_ of whether conenforced is false.
> So in this view of the world there aren't three states but four.
>
> I would prefer there to be three states as well, but apparently I'm
> outvoted on this.
>

Sounds like we agree. I think the problem is with the statement in the
linked discussion that "If the constraint is VALID and later marked as NOT
ENFORCED, changing it to ENFORCED should also keep it VALID." This ignores
that changing it to NOT ENFORCED should immediately change it to
NOT VALID, if it is not already so.

Has anybody argued for how it makes any sense at all to have a constraint
that is VALID (and therefore will be assumed to be true by the planner),
yet NOT ENFORCED (and therefore may well not be true)? What next, a patch
to the planner so that it only treats as true constraints that are both
VALID and ENFORCED?

Re: the 3 or 4 values for the single character status, there is a similar
issue with relkind, where one can imagine writing "relkind IN ('r')" when
one meant "relkind IN ('r', 'v')" or something else; but on the other hand,
one can easily imagine actually wanting the first one of those. But here,
it's not at all clear to me when you would ever want to distinguish between
'u' and 'U', but it is clear to me that it would be natural to write "… =
'U'" when one actually needs to write "… IN ('u', 'U')", or perhaps "…
ILIKE 'u'" (not what I would want to see).


Re: should we have a fast-path planning for OLTP starjoins?

2025-02-11 Thread Tomas Vondra
On 2/10/25 22:36, Robert Haas wrote:
> On Fri, Feb 7, 2025 at 3:09 PM Tomas Vondra  wrote:
>> I don't think that's quite true. The order of dimension joins does not
>> matter because the joins do not affect the join size at all. The size of
>> |F| has nothing to do with that, I think. We'll do the same number of
>> lookups against the dimensions no matter in what order we join them. And
>> we know it's best to join them as late as possible, after all the joins
>> that reduce the size (and before joins that "add" rows, I think).
> 
> This is often not quite true, because there are often restriction
> clauses on the fact tables that result in some rows being eliminated.
> 
> e.g. SELECT * FROM hackers h JOIN languages l ON h.language_id = l.id
> JOIN countries c ON h.country_id = c.id WHERE c.name = 'Czechia';
> 

True. I think this was discussed earlier in this thread - dimensions
with additional restrictions may affect the row count, and thus would be
exempt from this (and would instead go through the "regular" join order
search algorithm). So I assumed the "dimensions" don't have any such
restrictions in my message, I should have mentioned that.

> However, I think that trying to somehow leverage the existence of
> either FK or LJ+UNIQUE is still a pretty good idea. In a lot of cases,
> many of the joins don't change the row count, so we don't really need
> to explore all possible orderings of those joins. We might be able to
> define some concept of "join that does't change the row count at all"
> or maybe better "join that doesn't change the row count very much".
> And then if we have a lot of such joins, we can consider them as a
> group. Say we have 2 joins that do change the row count significantly,
> and then 10 more than don't. The 10 that don't can be done before,
> between, or after the two that do, but it doesn't seem necessary to
> consider doing some of them at one point and some at another.
> 
> Maybe that's not the right way to think about this problem; I haven't
> read the academic literature on star-join optimization. But it has
> always felt stupid to me that we spend a bunch of time considering
> join orders that are not meaningfully different, and I think what
> makes two join orders not meaningfully different is that we're
> commuting joins that are not changing the row count.
> 
> (Also worth noting: even joins of this general form change the row
> count, they can only reduce it.)
> 

I searched for papers on star-joins, but pretty much everything I found
focuses on the OLAP case. Which is interesting: the OLTP star-join I
described seems quite different, and I'm not sure the trade-offs are
necessarily the same.

My gut feeling is that in the first "phase" we should focus on the case
with no restrictions - that makes it obvious what the optimal order is,
and it will help a significant fraction of cases.

And then in the next step we can try doing something for cases with
restrictions - be it some sort of greedy algorithm, something that
leverages knowledge of the selectivity to prune join orders early
(instead of actually exploring all N! join orders like today). Or
something else, not sure.


regards

-- 
Tomas Vondra





Re: Non-text mode for pg_dumpall

2025-02-11 Thread jian he
hi.
review based on v16.

because of
https://postgr.es/m/cafc+b6pwqisl+3rvlxn9vhc8aonp4ov9c6u+bvd6kmwmdbd...@mail.gmail.com

in copy_global_file_to_out_file, now it is:
if (strcmp(outfile, "-") == 0)
OPF = stdout;
I am confused why "-" means stdout.
The command ``touch ./-`` works fine.
I think dash is not a special character; see
https://stackoverflow.com/a/40650391/15603477


+ /* Create a subdirectory with 'databases' name under main directory. */
+ if (mkdir(db_subdir, 0755) != 0)
+ pg_log_error("could not create subdirectory \"%s\": %m", db_subdir);
Shouldn't we use pg_fatal() here?


pg_log_info("executing %s", sqlstatement.data);
change to
pg_log_info("executing query: %s", sqlstatement.data);
message would be more similar to the next pg_log_error(...) message.


+ /*
+ * User is suggested to use single database dump for --list option.
+ */
+ if (opts->tocSummary)
+ pg_fatal("option -l/--list cannot be used when using dump of pg_dumpall");
maybe change to
+ pg_fatal("option -l/--list cannot be used when restoring multiple databases");

$BIN10/pg_restore --format=directory --list dir10_x
If the directory only has one database, then we can actually print out
the tocSummary; if the directory has more than one database, then pg_fatal.
Tolerating this corner case (only one database) means that pg_restore
--list requires a DB connection, but I am not sure that is fine.
Anyway, the attached patch allows this corner case.


PrintTOCSummary can only print out summary for a single database.
so we don't need to change PrintTOCSummary.


+ /*
+ * To restore multiple databases, -C (create database) option should
be specified
+ * or all databases should be created before pg_restore.
+ */
+ if (opts->createDB != 1)
+ pg_log_info("restoring dump of pg_dumpall without -C option, there
might be multiple databases in directory.");

we can change it to
+ if (opts->createDB != 1 && num_db_restore > 0)
+ pg_log_info("restoring multiple databases without -C option.");


Bug:
pg_restore --globals-only can be applied even when we are restoring a
single database (which can be an output of pg_dump).


There are some tests per https://commitfest.postgresql.org/52/5495, I
will check it later.
The attached patch is the change for the above reviews.


v16_misc_changes.nocfbot
Description: Binary data


Re: explain analyze rows=%.0f

2025-02-11 Thread Matheus Alcantara
On Mon, Feb 10, 2025 at 18:14, Ilia Evdokimov wrote:
> Sorry for missing your question earlier. If you notice in the code above, the 
> variable(average) 'rows' is defined as:
>
> double rows = planstate->instrument->ntuples / nloops;
>
> This represents the total rows divided by the number of loops. The condition
> means that the variable 'rows' will always be between zero and one. Therefore, the
> average rows under such conditions cannot be greater than or even equal to
> one. I wrote this condition specifically to avoid the verbose expression
> 'rows > 0 && rows < 1'. However, since this might not be obvious to everyone,
> perhaps it'd be better to write is using 'rows' directly or add a comment
> explaining this logic.
>
Thanks for the details! It makes sense to me now. I think that adding a comment
could be a good idea.

> I agree with the last two points. As for the first one—maybe we could
> simply state that the average rows value can be decimal, especially for
> very small values?
>
> I'm just not sure about the "small values"; the 'rows' in decimal will only
> happen with small values? What would be a "small value" in this context? My 
> main
> point here is more that I think that it would be good to mention *why* the
> 'rows' can be decimal, not just describe that it could be decimal.
>
>
> As for 'small values', it means that the average rows is between zero and
> one, to avoid rounding errors and misunderstanding. I think this would be
> ideal.
>
Got it, sounds reasonable to me.

-- 
Matheus Alcantara




Re: should we have a fast-path planning for OLTP starjoins?

2025-02-11 Thread Tomas Vondra
On 2/11/25 10:28, Richard Guo wrote:
> On Mon, Feb 10, 2025 at 5:35 PM Tomas Vondra  wrote:
>> On 2/10/25 08:29, Richard Guo wrote:
>>> Hmm, I'm still a little concerned about whether the resulting joins
>>> are legal.  Suppose we have a join pattern like the one below.
>>>
>>>  F left join
>>>   (D1 inner join T on true) on F.b = D1.b
>>>   left join D2 on F.c = D2.c;
>>>
>>> For this query, the original joinlist is [F, D1, T, D2].  If we
>>> reorder it to [[F, T], D1, D2], the sub-joinlist [F, T] would fail to
>>> produce any joins, as the F/T join is not legal.
>>>
>>> This may not be the pattern we are targeting.  But if we intend to
>>> support it, I think we need a way to ensure that the resulting joins
>>> are legal.
> 
>> It's quite possible the PoC patch I posted fails to ensure this, but I
>> think the assumption is we'd not reorder joins for dimensions that any
>> any join order restrictions (except for the FK join).
> 
> Then, we'll need a way to determine if a given relation has join-order
> restrictions, which doesn't seem like a trivial task.  We do have the
> has_join_restriction() function, but it considers any relations
> involved in an outer join as having join restrictions, and that makes
> it unsuitable for our needs here.
> 

I admit knowing next to nothing about join order planning :-( Could you
maybe explain why it would be non-trivial to determine if a relation has
join-order restrictions? Surely we already determine that, no? So what
would we need to do differently?

Or are you saying that because has_join_restriction() treats each
relation with an outer join as having a restriction, that makes it
unusable for the purpose of this optimization/patch? And we'd need to
invent something more elaborate?

I'm not sure that's quite true. The problem with joining the dimensions
(with inner joins) is *exactly* the lack of restrictions, which means
we explore all possible orders of those dimensions (all N! of them). With
the restrictions (e.g. from LEFT JOIN), that's no longer true - in a
way, this is similar to what the patch does. And in fact, replacing the
inner joins with LEFT JOINs makes the queries much faster. I've seen
this used as a workaround to cut down on planning time ...

So I don't think treating outer joins as "having a restriction" is a
problem. It doesn't regress any queries, although it might lead to the
somewhat strange situation that "less restricted" joins are faster to plan.


regards

-- 
Tomas Vondra





Re: Removing unneeded self joins

2025-02-11 Thread Alena Rybakina
Hi! Thank you for the work on this subject; I think it is really
important.


On 10.02.2025 22:58, Alexander Korotkov wrote:

Hi!

On Mon, Feb 10, 2025 at 7:19 AM Andrei Lepikhov  wrote:

On 9/2/2025 18:41, Alexander Korotkov wrote:

Regarding adjust_relid_set() and replace_relid().  I think they are
now strictly equivalent, except for the case then old relid is given
and not found.  In this case adjust_relid_set() returns the original
relids while replace_relid() returns a copy.  The behavior of
adjust_relid_set() appears more desirable as we don't need extra
copying when no modification is done.  So, I've replaced all
replace_relid() with adjust_relid_set().

Ok, I glanced into it, and it makes sense to merge these routines.
I think the comment to adjust_relid_set() should be arranged, too. See
the attachment for a short variant of such modification.

Also, I did some grammar correction to your new comment in tests.

Thanks!

I've further revised adjust_relid_set() header comment.

Looking back to the work done since previous attempt to commit this to
pg17, I can highlight following.
1) We're now using more of existing infrastructure including
adjust_relid_set() and ChangeVarNodes().  The most of complexity is
still there though.
2) We've checked few ways to further simplify this patch.  But yet the
current way still feels to be best possible.
3) For sure, several bugs were fixed.

I think we could give it another chance for pg18 after some further
polishing (at least commit message still needs to be revised).  Any
thoughts on this?  Tom?

I didn't find any mistakes; I just have a refactoring remark. I think
the part where we add non-redundant expressions from binfo_candidates
and jinfo_candidates can be moved to a separate function; otherwise the
code is very repetitive in this place. I did it and attached a diff file.



--
Regards,
Alena Rybakina
Postgres Professional
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index d0d9b80be9c..d31427693da 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -1578,6 +1578,48 @@ restrict_infos_logically_equal(RestrictInfo *a, RestrictInfo *b)
 	return result;
 }
 
+/*
+ * The functions adds all non-redundant clauses to the keeping relation.
+ * Contradictory operation. On the one side, we reduce the length of
+ * restrict lists that can impact planning or executing time.
+ * Additionally, we improve the accuracy of cardinality estimation. On the
+ * other side, it is one more place that can make planning time much
+ * longer in specific cases. It would have been better to avoid calling
+ * the equal() function here, but it's the only way to detect duplicated
+ * inequality expressions.
+ */
+static void
+add_non_redundant_clauses(PlannerInfo *root,
+		  List *rinfo_candidates,
+		  List *keep_rinfo_list,
+		  Index removed_relid)
+{
+	foreach_node(RestrictInfo, rinfo, rinfo_candidates)
+	{
+		bool		is_redundant = false;
+
+		Assert(!bms_is_member(removed_relid, rinfo->required_relids));
+
+		foreach_node(RestrictInfo, src, keep_rinfo_list)
+		{
+			if (!bms_equal(src->clause_relids, rinfo->clause_relids))
+/* Can't compare trivially different clauses */
+continue;
+
+			if (src == rinfo ||
+(rinfo->parent_ec != NULL &&
+ src->parent_ec == rinfo->parent_ec) ||
+ restrict_infos_logically_equal(rinfo, src))
+			{
+is_redundant = true;
+break;
+			}
+		}
+		if (!is_redundant)
+			distribute_restrictinfo_to_rels(root, rinfo);
+	}
+}
+
 /*
  * Remove a relation after we have proven that it participates only in an
  * unneeded unique self join.
@@ -1654,62 +1696,10 @@ remove_self_join_rel(PlannerInfo *root, PlanRowMark *kmark, PlanRowMark *rmark,
 
 	/*
 	 * Now, add all non-redundant clauses to the keeping relation.
-	 * Contradictory operation. On the one side, we reduce the length of
-	 * restrict lists that can impact planning or executing time.
-	 * Additionally, we improve the accuracy of cardinality estimation. On the
-	 * other side, it is one more place that can make planning time much
-	 * longer in specific cases. It would have been better to avoid calling
-	 * the equal() function here, but it's the only way to detect duplicated
-	 * inequality expressions.
 	 */
-	foreach_node(RestrictInfo, rinfo, binfo_candidates)
-	{
-		bool		is_redundant = false;
-
-		Assert(!bms_is_member(toRemove->relid, rinfo->required_relids));
-
-		foreach_node(RestrictInfo, src, toKeep->baserestrictinfo)
-		{
-			if (!bms_equal(src->clause_relids, rinfo->clause_relids))
-/* Const and non-const expressions can't be equal */
-continue;
-
-			if (src == rinfo ||
-(rinfo->parent_ec != NULL
- && src->parent_ec == rinfo->parent_ec)
-|| restrict_infos_logically_equal(rinfo, src))
-			{
-is_redundant = true;
-break;
-			}
-		}
-		if (!is_redundant)
-			distribute_restrictinfo_to_rels(root, rinfo);
-	}
-	foreach

Re: Eagerly scan all-visible pages to amortize aggressive vacuum

2025-02-11 Thread Melanie Plageman
On Mon, Feb 10, 2025 at 8:06 PM Andres Freund  wrote:
>
> On 2025-02-04 12:44:22 -0500, Melanie Plageman wrote:
> > On Mon, Feb 3, 2025 at 9:09 PM Andres Freund  wrote:
> > > > + /*
> > > > +  * Now calculate the eager scan start block. Start at a random 
> > > > spot
> > > > +  * somewhere within the first eager scan region. This avoids eager
> > > > +  * scanning and failing to freeze the exact same blocks each 
> > > > vacuum of the
> > > > +  * relation.
> > > > +  */
> > >
> > > If I understand correctly, we're not really choosing a spot inside the 
> > > first
> > > eager scan region, we determine the bounds of the first region?
> >
> > I'm not sure I understand how those are different, but I updated the
> > comment a bit. Maybe you can elaborate what you mean?
>
> Let's assume that we use regions of 512 pages. Without randomness we'd do:
>
> [0 .. 512) [512 .. 1024) ...
>
>
> IMO, if we were to choose a spot inside the first region, we'd do:
>
> [random(0, 512) .. 512) [512 .. 1024)
>
>
> If we choose the bounds of the first region we'd do:
>
> [0, random(0, 512)) [$prior bound .. $prior_bound + 512)
>
> or something like that.

Ah, that's true. I guess I meant the start block of the second region.
Anyway, I've updated the comment to be more correct I think.

- Melanie
From f713f7b3ca33e3cf5825706b86740c5f8f1e5ad6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Mon, 27 Jan 2025 12:23:00 -0500
Subject: [PATCH v17] Eagerly scan all-visible pages to amortize aggressive
 vacuum

Aggressive vacuums must scan every unfrozen tuple in order to advance
the relfrozenxid/relminmxid. Because data is often vacuumed before it is
old enough to require freezing, relations may build up a large backlog
of pages that are set all-visible but not all-frozen in the visibility
map. When an aggressive vacuum is triggered, all of these pages must be
scanned. These pages have often been evicted from shared buffers and
even from the kernel buffer cache. Thus, aggressive vacuums often incur
large amounts of extra I/O at the expense of foreground workloads.

To amortize the cost of aggressive vacuums, eagerly scan some
all-visible but not all-frozen pages during normal vacuums.

All-visible pages that are eagerly scanned and set all-frozen in the
visibility map are counted as successful eager freezes and those not
frozen are counted as failed eager freezes.

If too many eager scans fail in a row, eager scanning is temporarily
suspended until a later portion of the relation. The number of failures
tolerated is configurable globally and per table. To effectively
amortize aggressive vacuums, we cap the number of successes as well.
Once we reach the maximum number of blocks successfully eager frozen,
eager scanning is disabled for the remainder of the vacuum of the
relation.

Original design idea from Robert Haas, with enhancements from
Andres Freund, Tomas Vondra, and me

Reviewed-by: Robert Haas 
Reviewed-by: Masahiko Sawada 
Reviewed-by: Andres Freund 
Reviewed-by: Robert Treat 
Reviewed-by: Bilal Yavuz 
Discussion: https://postgr.es/m/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com
---
 doc/src/sgml/config.sgml  |  39 ++
 doc/src/sgml/maintenance.sgml |  33 +-
 doc/src/sgml/ref/create_table.sgml|  15 +
 src/backend/access/common/reloptions.c|  14 +-
 src/backend/access/heap/vacuumlazy.c  | 435 --
 src/backend/commands/vacuum.c |  15 +
 src/backend/postmaster/autovacuum.c   |   6 +
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/bin/psql/tab-complete.in.c|   2 +
 src/include/commands/vacuum.h |  17 +
 src/include/utils/rel.h   |   6 +
 12 files changed, 552 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 38244409e3c..804b8d0cd7f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9141,6 +9141,45 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;

   
 
+ 
+  vacuum_max_eager_freeze_failure_rate (floating point)
+  
+   vacuum_max_eager_freeze_failure_rate configuration parameter
+  
+  
+  
+   
+Specifies the maximum number of pages (as a fraction of total pages in
+the relation) that VACUUM may scan and
+fail to set all-frozen in the visibility map
+before disabling eager scanning. A value of 0
+disables eager scanning altogether. The default is
+0.03 (3%).
+   
+
+   
+Note that when eager scanning is enabled, successful page freezes do
+not count against the cap on eager freeze failures. Successful page
+freezes are capped internally at 20% of the all-visible but not
+all-frozen pages in the relation. Capping successful page freeze

Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators

2025-02-11 Thread Jeff Davis
On Mon, 2025-02-10 at 19:09 -0800, James Hunter wrote:
> I think it makes sense to split the work into two parts: one part
> that
> improves SQL execution, and a second part that improves the
> optimizer,
> to reflect the improvements to execution.

I like the idea to store the value of work_mem in the
path/plan/executor nodes, and use that at execution time rather than
the GUC directly.

IIUC, that would allow an extension to do what you want, right? A
planner hook could just walk the tree and edit those values for
individual nodes, and the executor would enforce them.
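
For illustration, a rough sketch of such an extension, assuming the plan
nodes carry an executor-enforced field (the field name node_work_mem and
the helper choose_mem_for() are made up; planner_hook, standard_planner,
PlannedStmt and Plan are real, and hook registration in _PG_init() is
omitted):

#include "postgres.h"
#include "optimizer/planner.h"

static planner_hook_type prev_planner_hook = NULL;

static void
assign_node_work_mem(Plan *plan)
{
	if (plan == NULL)
		return;

	plan->node_work_mem = choose_mem_for(plan);	/* hypothetical field and helper */

	assign_node_work_mem(plan->lefttree);
	assign_node_work_mem(plan->righttree);
}

static PlannedStmt *
workmem_planner(Query *parse, const char *query_string,
				int cursorOptions, ParamListInfo boundParams)
{
	PlannedStmt *result;

	if (prev_planner_hook)
		result = prev_planner_hook(parse, query_string,
								   cursorOptions, boundParams);
	else
		result = standard_planner(parse, query_string,
								  cursorOptions, boundParams);

	assign_node_work_mem(result->planTree);
	return result;
}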

Regards,
Jeff Davis





Re: describe special values in GUC descriptions more consistently

2025-02-11 Thread Nathan Bossart
Thank you for the reviews.

On Mon, Feb 10, 2025 at 05:25:42PM -0700, David G. Johnston wrote:
> On Mon, Feb 10, 2025 at 4:53 PM Peter Smith  wrote:
>>   {"shared_memory_size_in_huge_pages", PGC_INTERNAL, PRESET_OPTIONS,
>>   gettext_noop("Shows the number of huge pages needed for the main
>> shared memory area."),
>> - gettext_noop("-1 indicates that the value could not be determined."),
>> + gettext_noop("-1 means the value could not be determined."),
>>
>> I didn't know what that meant. And the docs seem worded differently. More
>> like:
>> "-1 means huge pages are not supported."
>>
> 
> Agreed

Fixed.

>> - gettext_noop("An empty string indicates that \"archive_command\"
>> should be used.")
>> + gettext_noop("An empty string means \"archive_command\" should be used.")
>>
>> Should that be worded more like other ones that delegate to other GUCs:
>>
>> "An empty string means use \"archive_command\"."
>>
> 
> Agreed

Fixed.

>> What is the difference between "system setting" and "system default",
>> or should those be made the same?
>>
>>
> Looking at the documentation it seems to be communicating the difference
> between a PostgreSQL default (system default) and a value taken from the
> operating environment (system setting).  Maybe making that more clear by
> saying "operating system setting" is warranted.

I've used "operating system setting" in v3.

> Minor typo, trailing whitespace in message:
> 
> "lost before a connection is considered dead. "

This is intentional so that there is a space between the sentences in the
long description.

> And then these two don't seem worded symmetrically enough:
> 
> "An empty string means don't load CRL file unless \"ssl_crl_dir\" is set."
> "An empty string means don't use CRLs unless \"ssl_crl_file\" is set."
> 
> The first one also needs to be "the CRL file".
> 
> But I'm thinking something more like:
> 
> "An empty string disables the directory-based CRL auto-load mechanism."
> (ssl_crl_dir)
> "An empty string disables the fixed-file CRL reload mechanism."
> (ssl_crl_file)
> 
> I don't see much benefit trying to shoe-horn a cross-reference to the other
> setting in these brief usage messages.  Though if we wanted to a simple
> (see also ssr_crl_) would suffice.  It seems unworthy of the
> limited space to try and communicate the fact that if both are empty
> strings no CRL files will be loaded (or that both can be used at the same
> time).  Even trying to fit in the "auto-load versus reload" dynamic of
> these two choices seems awkward.

I thought about this one a bit, and I actually came to the opposite
conclusion.  IMHO it's reasonably obvious that an empty string means that
the file isn't loaded, so there's not much point in stating it in the GUC
description.  Instead, I think we should follow the
archive_command/archive_library example and use this space _only_ as a
cross-reference to each other.  There are certainly some nuances missed with
this strategy, but that's not unique to this GUC.

-- 
nathan
>From f2a6b6078f8d788df4818f10ad5b54667bef5dd2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 7 Feb 2025 16:06:18 -0600
Subject: [PATCH v3 1/1] Describe special values in GUC descriptions more
 consistently.

---
 src/backend/utils/misc/guc_tables.c | 141 ++--
 1 file changed, 71 insertions(+), 70 deletions(-)

diff --git a/src/backend/utils/misc/guc_tables.c 
b/src/backend/utils/misc/guc_tables.c
index ce7534d4d23..1dc5ccd5264 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2109,7 +2109,7 @@ struct config_int ConfigureNamesInt[] =
{"archive_timeout", PGC_SIGHUP, WAL_ARCHIVING,
gettext_noop("Sets the amount of time to wait before 
forcing a "
 "switch to the next WAL 
file."),
-   NULL,
+   gettext_noop("0 disables the timeout."),
GUC_UNIT_S
},
&XLogArchiveTimeout,
@@ -2186,7 +2186,7 @@ struct config_int ConfigureNamesInt[] =
{
{"geqo_pool_size", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("GEQO: number of individuals in the 
population."),
-   gettext_noop("Zero selects a suitable default value."),
+   gettext_noop("0 means use a suitable default value."),
GUC_EXPLAIN
},
&Geqo_pool_size,
@@ -2196,7 +2196,7 @@ struct config_int ConfigureNamesInt[] =
{
{"geqo_generations", PGC_USERSET, QUERY_TUNING_GEQO,
gettext_noop("GEQO: number of iterations of the 
algorithm."),
-   gettext_noop("Zero selects a suitable default value."),
+   gettext_noop("0 means use a suitable default value."),
GUC_EXPLAIN
},
&Geqo_generations

Re: pg_stat_statements and "IN" conditions

2025-02-11 Thread Álvaro Herrera
On 2025-Feb-11, Sami Imseih wrote:

> I do not have an explanation from the patch yet, but I have a test
> that appears to show unexpected results. I only tested a few datatypes,
> but from what I observe, some merge as expected and others do not;
> i.e. int columns merge correctly but bigint do not.

Yep, I noticed this too, and realized that this is because these values
are wrapped in casts of some sort, while the others are not.

> select from foo where col_bigint in (1, 2, 3);
> select from foo where col_bigint in (1, 2, 3, 4);
> select from foo where col_bigint in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
> select from foo where col_float in (1, 2, 3);
> select from foo where col_float in (1, 2, 3, 4);

You can see that it works correctly if you use quotes around the values,
e.g.
select from foo where col_float in ('1', '2', '3');
select from foo where col_float in ('1', '2', '3', '4');
and so on.  There are no casts here because these literals are of type
unknown.

I suppose this is telling us that detecting the case with consts wrapped
in casts is not really optional.  (Dmitry said this was supported at
early stages of the patch, and now I'm really curious about that
implementation because what IsMergeableConstList sees is a FuncExpr that
invokes the cast function for float8 to int4.)
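
For illustration, one way that detection could look is to peel the cast
off before testing for a constant (a sketch only, not what the patch
does; FuncExpr, RelabelType, Const and the COERCE_* values are real node
machinery, the helper name is made up):

#include "postgres.h"
#include "nodes/primnodes.h"
#include "nodes/pg_list.h"

static Node *
strip_cast_around_const(Node *node)
{
	/* binary-compatible casts appear as RelabelType */
	if (IsA(node, RelabelType))
	{
		RelabelType *r = (RelabelType *) node;

		if (IsA(r->arg, Const))
			return (Node *) r->arg;
	}

	/* casts needing a conversion function appear as a one-argument FuncExpr */
	if (IsA(node, FuncExpr))
	{
		FuncExpr   *func = (FuncExpr *) node;

		if ((func->funcformat == COERCE_IMPLICIT_CAST ||
			 func->funcformat == COERCE_EXPLICIT_CAST) &&
			list_length(func->args) == 1 &&
			IsA(linitial(func->args), Const))
			return (Node *) linitial(func->args);
	}

	return node;
}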

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)




Re: Confine vacuum skip logic to lazy_scan_skip

2025-02-11 Thread Melanie Plageman
On Thu, Feb 6, 2025 at 1:06 PM Melanie Plageman
 wrote:
>
> On Wed, Feb 5, 2025 at 5:26 PM Melanie Plageman
>  wrote:
> >
> > Yes, looking at these results, I also feel good about it. I've updated
> > the commit metadata in attached v14, but I could use a round of review
> > before pushing it.
>
> I've done a bit of self-review and updated these patches.

This needed a rebase in light of 052026c9b90.
v16 attached has an additional commit which converts the block
information parameters to heap_vac_scan_next_block() into flags
because we can only get one piece of information per block from the
read stream API. This seemed nicer than a cumbersome struct.

- Melanie
From 920b567a0afa77f1478f804dc934936a9e9e9ae8 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Wed, 5 Feb 2025 17:23:05 -0500
Subject: [PATCH v16 3/3] Use streaming read I/O in VACUUM's third phase

Make vacuum's third phase (its second pass over the heap), which reaps
dead items collected in the first phase and marks them as reusable, use
the read stream API. This commit adds a new read stream callback,
vacuum_reap_lp_read_stream_next(), that looks ahead in the TidStore and
returns the next block number to read for vacuum.

Author: Melanie Plageman 
Co-authored-by: Thomas Munro 
Discussion: https://postgr.es/m/CA%2BhUKGKN3oy0bN_3yv8hd78a4%2BM1tJC9z7mD8%2Bf%2ByA%2BGeoFUwQ%40mail.gmail.com
---
 src/backend/access/heap/vacuumlazy.c | 49 +---
 1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f705f31f5e6..5b2d9c535c1 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2637,6 +2637,29 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 	return allindexes;
 }
 
+/*
+ * Read stream callback for vacuum's third phase (second pass over the heap).
+ * Gets the next block from the TID store and returns it or InvalidBlockNumber
+ * if there are no further blocks to vacuum.
+ */
+static BlockNumber
+vacuum_reap_lp_read_stream_next(ReadStream *stream,
+void *callback_private_data,
+void *per_buffer_data)
+{
+	TidStoreIter *iter = callback_private_data;
+	TidStoreIterResult *iter_result;
+
+	iter_result = TidStoreIterateNext(iter);
+	if (iter_result == NULL)
+		return InvalidBlockNumber;
+
+	/* Save the TidStoreIterResult for later, so we can extract the offsets. */
+	memcpy(per_buffer_data, iter_result, sizeof(*iter_result));
+
+	return iter_result->blkno;
+}
+
 /*
  *	lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
  *
@@ -2657,6 +2680,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
 static void
 lazy_vacuum_heap_rel(LVRelState *vacrel)
 {
+	ReadStream *stream;
 	BlockNumber vacuumed_pages = 0;
 	Buffer		vmbuffer = InvalidBuffer;
 	LVSavedErrInfo saved_err_info;
@@ -2677,7 +2701,17 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 			 InvalidBlockNumber, InvalidOffsetNumber);
 
 	iter = TidStoreBeginIterate(vacrel->dead_items);
-	while ((iter_result = TidStoreIterateNext(iter)) != NULL)
+
+	/* Set up the read stream */
+	stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+		vacrel->bstrategy,
+		vacrel->rel,
+		MAIN_FORKNUM,
+		vacuum_reap_lp_read_stream_next,
+		iter,
+		sizeof(TidStoreIterResult));
+
+	while (true)
 	{
 		BlockNumber blkno;
 		Buffer		buf;
@@ -2688,9 +2722,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 
 		vacuum_delay_point();
 
-		blkno = iter_result->blkno;
-		vacrel->blkno = blkno;
+		buf = read_stream_next_buffer(stream, (void **) &iter_result);
 
+		if (!BufferIsValid(buf))
+			break;
+
+		vacrel->blkno = blkno = BufferGetBlockNumber(buf);
+
+		Assert(iter_result);
 		num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets));
 		Assert(num_offsets <= lengthof(offsets));
 
@@ -2702,8 +2741,6 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
 
 		/* We need a non-cleanup exclusive lock to mark dead_items unused */
-		buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
- vacrel->bstrategy);
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		lazy_vacuum_heap_page(vacrel, blkno, buf, offsets,
 			  num_offsets, vmbuffer);
@@ -2716,6 +2753,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
 		RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
 		vacuumed_pages++;
 	}
+
+	read_stream_end(stream);
 	TidStoreEndIterate(iter);
 
 	vacrel->blkno = InvalidBlockNumber;
-- 
2.34.1

From 4d6e90bd73dc185415d0278a19b22c691b6fa598 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Tue, 11 Feb 2025 19:02:13 -0500
Subject: [PATCH v16 1/3] Convert heap_vac_scan_next_block() boolean parameters
 to flags

The read stream API only allows one piece of extra per block state to be
passed back to the API user. Heap vacuum needs to know whether or not a
given block was all-visible in the visibility map and 

RE: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-02-11 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for updating the patch! I feel the patch is in good shape. Here is a small
comment.

```
+   /* Error if no databases were found on the source server */
+   if (num_rows == 0)
+   {
+   pg_log_error("no databases found on the source server");
+   pg_log_error_hint("Ensure that there are user-created databases 
on the source server.");
+   PQclear(res);
+   disconnect_database(conn, false);
+   exit(1);
+   }
```

I think the error message is not accurate. This error can happen even when there
are user-created databases, if they are created as templates. How about the below:

```
-   pg_log_error("no databases found on the source server");
-   pg_log_error_hint("Ensure that there are user-created databases 
on the source server.");
+   pg_log_error("no convertable databases found on the source 
server");
+   pg_log_error_hint("Ensure that there are non-template and 
connectable databases on the source server.");
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [PATCH] Optionally record Plan IDs to track plan changes for a query

2025-02-11 Thread Julien Rouhaud
On Wed, Feb 12, 2025 at 09:08:00AM +0900, Michael Paquier wrote:
> Wikipedia seems to agree with you that "fingerprint" would fit for
> this purpose, though:
> https://en.wikipedia.org/wiki/Fingerprint_(computing)
>
> Has anybody any comments about that?  That would be a large renaming,
> but in the long term it makes sense if we want to apply that to more
> than just parse nodes and query strings.  If you do that, it impacts
> the file names and the properties, that are hidden in the backend for
> most of it, except the entry API and JumbleState.  This last part
> impacts some extensions and I have been maintaining one a bit
> (pg_hint_plan).

I agree that fingerprint is a good improvement.

>
> Also adding Julien in CC,
> as he has some out-of-core extension code that depends on the jumbling
> structures if I recall correctly.

I do have an extension to support custom fingerprinting logic, but the
introduction of the pg_node_attr based jumbling kind of broke it.

FTR my main motivation was to be able to deal with queries referencing
temporary relations, since if your application creates a lot of those it basically
means that you cannot use pg_stat_statements anymore.




Re: [PATCH] Optionally record Plan IDs to track plan changes for a query

2025-02-11 Thread Greg Sabino Mullane
On Tue, Feb 11, 2025 at 7:08 PM Michael Paquier  wrote:

> On Mon, Feb 10, 2025 at 02:02:10PM -0600, Sami Imseih wrote:
> > I am OK with moving away from "jumble" in-lieu of something else, but my
> thoughts are we should actually call this process "fingerprint"
>

I agree fingerprint is the right final word. But "jumble" conveys the
*process* better than "fingerprinting". I view it as: jumbling produces an
object that can be fingerprinted.

> For node attributes we can specify "fingerprint_ignore" or
> "no_fingerprint". What do you think?
>

Still should be jumble_ignore.

Cheers,
Greg


--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support


Re: [PATCH] Optionally record Plan IDs to track plan changes for a query

2025-02-11 Thread Sami Imseih
> Of course some people may want to keep the current behavior, if they have
> limited number of temp tables or similar, so I had a GUC for that.  I don't
> think that the community would really welcome such GUC for core-postgres,
> especially since it wouldn't be pg_stat_statements specific.

FWIW, I think options to tweak queryId computation are something
that should be in core. It was discussed earlier in the context
of IN list merging; the patch for this currently has the GUC
for the feature in pg_stat_statements, but there was a discussion
about actually moving this to core [1]. Giving the user a way
to control certain behavior of the queryId computation
is a good thing to do in core, especially since queryId is no longer
just consumed by pg_stat_statements. Maybe the right answer
is an enum GUC; not sure yet.

Specifically for the use case you mention, using names vs. OIDs in
queryId computation is valid for more than temporary tables;
I can also think of upgrade, dump/restore, and logical replication cases,
which would then allow for a consistent queryId.

[1] 
https://www.postgresql.org/message-id/202502111852.btskmr7nhien%40alvherre.pgsql

-- 
Sami




Re: Skip collecting decoded changes of already-aborted transactions

2025-02-11 Thread Peter Smith
Hi. Here are some minor comments for the v18* patch set.

//

Patch v18-0001

1.1. Commit message

A previously reported typo still exists:

/noticeble/noticeable/

//

Patch v18-0002

2.1
+#define RBTXN_PREPARE_STATUS_FLAGS (RBTXN_IS_PREPARED |
RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)
+

AFAICT bitmasks like this are more commonly named with a _MASK suffix.

How about something like:
- RBTXN_PREPARE_MASK
- RBTXN_PREPARE_STATUS_MASK
- RBTXN_PREPARE_FLAGS_MASK
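
With the first of those names, for example, the definition would read as
below (the clearing use is just an assumption about how the mask might
typically be applied):

#define RBTXN_PREPARE_STATUS_MASK \
	(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE | RBTXN_SENT_PREPARE)

	/* e.g. clear (or test) all prepare-status bits in one go */
	txn->txn_flags &= ~RBTXN_PREPARE_STATUS_MASK;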

==
Kind Regards,
Peter Smith.
Fujitsu Australia




RE: pg_rewind with --write-recovery-conf option doesn't write dbname to primary_conninfo value.

2025-02-11 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thanks for updating the patch!

> I've attached the updated patch. I address all comments I got so far
> and added a small regression test.
> 
> It makes sense to me that we move GetDbnameFromConnectionOptions() to
> recovery_gen.c since this function is currently used only with
> GenerateRecoveryConfig() which is defined in the same file. If we find
> a more appropriate place, we can move it later. Feedback is very
> welcome.

I considered your idea of adding a new API, but it seemed to me to have little
benefit. Also, I do not know a better place for the declaration now. Overall, the
patch looks good to me.
Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: EvictUnpinnedBuffer and buffer free list

2025-02-11 Thread Ashutosh Bapat
Thanks a lot Melanie for a very detailed response, a good reference to pin.

On Fri, Jan 31, 2025 at 8:20 PM Melanie Plageman
 wrote:

>
> I don't have an explicit issue with EvictUnpinnedBuffer() putting
> buffers on the freelist -- it seems like that could be fine. But since
> it is for testing/development, I don't see what benefits it will have
> to users. It sounds like you saw issues when developing -- what kinds
> of issues?

I can't say it was an issue, maybe an expectation mismatch. In my
experiment where the entire buffer pool was full, I was expecting the
evicted buffer to be available immediately for the next page request.
I didn't expect another eviction. I kinda thought that the buffer was
lost, but it was returned. The next thing I tried was to evict many
buffers together using EvictUnpinnedBuffer(), and those buffers took a
long time to return to the pool because the clock sweep wasn't as fast as
the eviction. But that's not a regular scenario, so maybe the current
behaviour is okay to avoid the lock contention.

>
> > The prologue of function InvalidateVictimBuffer() says "/* Helper
> > routine for GetVictimBuffer() ". I believe that it's expected that the
> > buffer will be allocated to some other page, and that's why it doesn't
> > return the buffer to the free list. But in the case of
> > EvictUnpinnedBuffer() we are not using that buffer for any page, so it
> > must be returned to the free list. InvalidateBuffer() does that but
> > its prologue mentions that it's supposed to be used when freeing
> > buffers for relations and databases.
> >
> > I think there are two solutions
> > 1. Use InvalidBuffer() instead of InvalidateVictimBuffer(). But I am
> > not sure whether that's safe or what other safety measures we have to
> > put in EvictUnpinnedBuffer()
>
> I don't really think we can do this. InvalidateBuffer() waits forever
> to be able to put the buffer on the freelist. That's because it is
> only used when dropping a relation or database. So it can assume (as
> it says in the comments above WaitIO()) that the only reason the
> buffer will be pinned is if someone else is flushing out the page. It
> will always retry -- since the relation is being dropped, no one else
> could be trying to concurrently access it to read it. You can't make
> this assumption in EvictUnpinnedBuffer().

Thanks for the explanation. This option is ruled out then.

>
> > 2. Call StrategyFreeBuffer() after InvalidateVictimBuffer()
>
> I don't know exactly what would be required to make this work, but it
> seems reasonable to try. The only places StrategyFreeBuffer() is used
> is 1) InvalidateBuffer() and 2) when doing relation extension. In the
> first case, we know no one can know about the buffer because we waited
> until all pins were released and the buffer is part of a relation that
> is being dropped. In the second case, I think the buffers we add to
> the freelist are also ones that no one can know about yet because the
> extension hasn't completed. I'm fuzzy on the details here, so I would
> defer to Andres.
>
> Anyway, my gut feeling is that we have to do something to make calling
> StrategyFreeBuffer() safe to do in EvictUnpinnedBuffer(), but I don't
> know what it is.

I think we could enhance the pg_buffercache_evict() function to put the buffer
back on the freelist, with the behaviour controlled by an argument flag. I
haven't explored the feasibility yet. That would leave EvictUnpinnedBuffer()
as is.

-- 
Best Wishes,
Ashutosh Bapat




Re: Proposal: Filter irrelevant change before reassemble transactions during logical decoding

2025-02-11 Thread Ajin Cherian
On Wed, Jan 29, 2025 at 9:31 AM Peter Smith  wrote:

Hi Ajin,

Some review comments for patch v12-0001.

==
Commit message

1.
Track transactions which have snapshot changes with a new flag
RBTXN_HAS_SNAPSHOT_CHANGES

~

The commit message only says *what* it does, but not *why* this patch
even exists.  TBH, I don't understand why this patch needs to be
separated from your patch 0002, because 0001 makes no independent use
of the flag, nor is it separately tested.

Anyway, if it is going to remain separated then IMO at least the
message should explain the intended purpose, e.g. why the subsequent
patches require this flagged info and how they will use it.


Fixed.


==
src/include/replication/reorderbuffer.h

2.
+/* Does this transaction have snapshot changes? */
+#define rbtxn_has_snapshot_changes(txn) \
+( \
+ ((txn)->txn_flags & RBTXN_HAS_SNAPSHOT_CHANGES) != 0 \
+)
+

Is the below wording maybe a more plain way to say that:

/* Does this transaction make changes to the current snapshot? */


Fixed as suggested.


On Sat, Feb 1, 2025 at 12:43 AM Hayato Kuroda (Fujitsu)
 wrote:
Dear Ajin, Hou,

> - Snapshot construction

I understand the approach: we do not try to filter for all the workloads, just
do it best-effort.

I played with your PoC and here are my comments.

1.
Can you add tests for better understanding? I've tried for
test_decoding as attached,
but it couldn't pass the regression test. The cases "stream.sql" and
"twophase_stream.sql" failed.


I've added tests. I've analysed your patch and the regression test
failure. The tests fail to detect concurrent aborts. I think the reason
is that the filter check logic accesses the catalog and the relation
details get cached. As a result, when the actual decoding happens, the
catalog is not accessed because the relation details are already in the
cache. Without catalog access, concurrent aborts cannot be detected,
since they are detected only when the catalog is accessed. There is a
new thread by Sawada-san on a more efficient detection of concurrent
aborts; I don't know if that will solve this issue, and otherwise I
don't know how to fix this in a meaningful way. Caching improves
performance, and at the same time it prevents detection of concurrent
aborts.

2.
I think we can extend the skip mechanism to
UPDATE/DELETE/MultiInsert/SpecConfirm.
Regarding TRUNCATE, I'm not sure we can handle that case because we
can't track the RelFileLocator anymore.


Updated.


On Tue, Feb 4, 2025 at 2:19 PM vignesh C  wrote:

On Tue, 28 Jan 2025 at 08:40, Ajin Cherian  wrote:
>
> Here's a patch-set created based on the design changes proposed
by Hou-san.

Few comments:
1) Shouldn't we do the same thing for other DecodeXXX functions?
@@ -915,6 +915,11 @@ DecodeInsert(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
return;

+   if (ctx->reorder->can_filter_change &&
+   ReorderBufferFilterByRelFileLocator(ctx->reorder,
XLogRecGetXid(r),
+
 buf->origptr, &target_locator, true))
+   return;
+


Updated.


2) Let's add some debug logs so that it will be easy to verify the
changes that are getting filtered, or else we will have to debug and
verify them:
@@ -915,6 +915,11 @@ DecodeInsert(LogicalDecodingContext *ctx,
XLogRecordBuffer *buf)
if (FilterByOrigin(ctx, XLogRecGetOrigin(r)))
return;

+   if (ctx->reorder->can_filter_change &&
+   ReorderBufferFilterByRelFileLocator(ctx->reorder,
XLogRecGetXid(r),
+
 buf->origptr, &target_locator, true))
+   return;

3) Also there are no tests currently, probably if we add the above
mentioned debug logs we could add few tests and verify them based on
the logs.


Updated.


4) Can you elaborate a bit in the commit message why we need to
capture if a transaction has snapshot changes:
Subject: [PATCH v12 1/3] Track transactions with internal snapshot changes

Track transactions which have snapshot changes with a new flag
RBTXN_HAS_SNAPSHOT_CHANGES


Updated.

regards,
Ajin Cherian
Fujitsu Australia


v13-0003-Introduce-a-output-plugin-callback-to-filter-cha.patch
Description: Binary data


v13-0001-Track-transactions-with-internal-snapshot-change.patch
Description: Binary data


v13-0002-Filter-transactions-that-need-not-be-published.patch
Description: Binary data


EquivalenceClass and custom_read_write

2025-02-11 Thread Ashutosh Bapat
Hi All,
In pathnodes.h
typedef struct EquivalenceClass
{
pg_node_attr(custom_read_write, no_copy_equal, no_read, no_query_jumble)

Because of the custom_read_write attribute, I expected _outEquivalenceClass
to be present in outfuncs.c and _readEquivalenceClass to be present in
readfuncs.c. I find the first but not the second. This could be
because the _out function is only for debugging and what it writes is
never read back. Should we add a comment in the EquivalenceClass
prologue mentioning this?
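
For example, something along these lines in the prologue (just a rough
wording suggestion):

    /*
     * NB: there is deliberately no _readEquivalenceClass() (hence the
     * no_read attribute above).  The custom _outEquivalenceClass() exists
     * only to support debugging output; what it writes is never read
     * back in.
     */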

-- 
Best Wishes,
Ashutosh Bapat




Re: Track the amount of time waiting due to cost_delay

2025-02-11 Thread Bertrand Drouvot
Hi,

On Tue, Feb 11, 2025 at 04:42:26PM -0600, Nathan Bossart wrote:
> On Tue, Feb 11, 2025 at 08:51:15AM +, Bertrand Drouvot wrote:
> > On Mon, Feb 10, 2025 at 02:52:46PM -0600, Nathan Bossart wrote:
> >> Off-list, I've asked Bertrand to gauge the feasibility of adding this
> >> information to the autovacuum logs and to VACUUM/ANALYZE (VERBOSE).  IMHO
> >> those are natural places to surface this information, and I want to ensure
> >> that we're not painting ourselves into a corner with the approach we're
> >> using for the progress views.
> > 
> > Yeah, I looked at it and that looks as simple as 0003 attached (as that's 
> > the
> > leader that is doing the report in case of parallel workers being used).
> > 
> > 0001 and 0002 remain unchanged.
> 
> Thanks.  I've committed 0001 and 0002.

Thanks! Regarding 0003, I think it's OK to keep it in this thread (and not
create a dedicated one), as it still fits well with $SUBJECT (and the folks
interested are probably already part of this thread). Does that sound good to you?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Fix punctuation errors in PostgreSQL documentation

2025-02-11 Thread John Naylor
On Mon, Feb 10, 2025 at 6:34 PM John Naylor  wrote:
> Thanks for the patch! I will push this after the upcoming minor releases.

This is done.

-- 
John Naylor
Amazon Web Services




Re: pg_rewind with --write-recovery-conf option doesn't write dbname to primary_conninfo value.

2025-02-11 Thread Masahiko Sawada
On Mon, Feb 3, 2025 at 12:36 PM Masahiko Sawada  wrote:
>
> On Sun, Feb 2, 2025 at 9:50 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Sawada-san,
> >
> > > I think it's a good idea to support it at least on HEAD. I've attached
> > > a patch for that.
> >
> > +1. I've confirmed that pg_rewind and -R can't output dbname for now,
> > and your patch allows to do it.
> >
> > Few comments for your patch.
>
> Thank you for reviewing the patch!
>
> >
> > 1.
> >
> > pg_basebackup.sgml has below description. I feel this can be ported to
> > pg_rewind.sgml as well.
> >
> > ```
> > The dbname will be recorded only if the dbname was
> > specified explicitly in the connection string or  > linkend="libpq-envars">
> > environment variable.
> > ```
>
> Agreed, will fix.
>
> >
> > 2.
> > I'm not sure whether recovery_gen.h/c is a correct place for the exported 
> > function
> > GetDbnameFromConnectionOptions(). The function itself does not handle
> > postgresql.cuto.conf file.
> > I seeked other header files and felt that connect_utils.h might be.
> >
> > ```
> > /*-
> >  *
> >  * Facilities for frontend code to connect to and disconnect from databases.
> > ```
>
> But this function neither connects to nor disconnects from databases, either.
>
> >
> > Another idea is to change the third API to accept the whole connection 
> > string, and
> > it extracts dbname from it. In this approach we can make 
> > GetDbnameFromConnectionOptions()
> > to static function - which does not feel strange for me.
>
> I'm concerned that it reduces the usability; users (or existing
> extensions) would need to construct the whole connection string just
> to pass the database name. If we want to avoid exposing
> GetDbnameFromConnectionOptions(), I'd introduce another exposed
> function for that, say GenerateRecoveryConfigWithConnStr().

I've attached the updated patch. It addresses all the comments I got so
far and adds a small regression test.

It makes sense to me to move GetDbnameFromConnectionOptions() to
recovery_gen.c, since this function is currently used only with
GenerateRecoveryConfig(), which is defined in the same file. If we find
a more appropriate place, we can move it later. Feedback is very
welcome.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v2-0001-pg_rewind-Add-dbname-to-primary_conninfo-when-usi.patch
Description: Binary data


Re: Test to dump and restore objects left behind by regression

2025-02-11 Thread Michael Paquier
On Tue, Feb 11, 2025 at 12:19:33PM +0530, Ashutosh Bapat wrote:
> Sorry for replying late here. The refactored code in
> 002_compare_backups.pl has a potential to cause confusion even without
> this refactoring. The differences in tablespace paths are adjusted in
> compare_files() and not in the actual dump outputs. In case there's a
> difference other than paths, diff between the dump outputs is reported
> which will also show the differences in paths. That might mislead
> developers in thinking that the differences in paths are also not
> expected. Am I right?

Logically, 002_compare_backups.pl is still the same, isn't it?  We're
still passing the file paths to compare_text(), except that the
comparison routine is given as an argument one level higher.

You are right that there could be an argument for changing the files
as they are on disk, and doing the diff based on what's on disk after
the adjustments, so that the filtered parts are left out of the report.
However, there is also an argument for not changing them, as it is more
useful to know the original state of the dump for debugging.  This one
involves only a small change, which is OK as-is, IMHO.
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] Optionally record Plan IDs to track plan changes for a query

2025-02-11 Thread Michael Paquier
On Mon, Feb 10, 2025 at 02:14:09PM -0600, Sami Imseih wrote:
> Another thought that I have is that If we mention that extensions can use
> these jumbling ( or whatever the final name is ) functions outside of
> core, it makes
> sense to actually show an example of this. What do you think?

Not sure.  Do you have anything specific in mind that pgss is not able
to achieve with its jumbling based on the query strings?
--
Michael


signature.asc
Description: PGP signature


Re: describe special values in GUC descriptions more consistently

2025-02-11 Thread Daniel Gustafsson
> On 11 Feb 2025, at 19:11, Nathan Bossart  wrote:

> I thought about this one a bit, and I actually came to the opposite
> conclusion.  IMHO it's reasonably obvious that an empty string means that
> the file isn't loaded, so there's not much point in stating it in the GUC
> description.  Instead, I think we should follow the
> archive_command/archive_library example and use this space _only_ as a
> cross-reference to each other.  There's certainly some nuances missed with
> this strategy, but that's not unique to this GUC.

I don't necessarily disagree with this, but the proposed wording makes it sound
sort of like users have to select one or the other.  Could it be softened a
little like perhaps "An empty string disables, \"ssl_crl_foo\" is still used"?

--
Daniel Gustafsson





Re: Track the amount of time waiting due to cost_delay

2025-02-11 Thread Nathan Bossart
On Tue, Feb 11, 2025 at 08:51:15AM +, Bertrand Drouvot wrote:
> On Mon, Feb 10, 2025 at 02:52:46PM -0600, Nathan Bossart wrote:
>> Off-list, I've asked Bertrand to gauge the feasibility of adding this
>> information to the autovacuum logs and to VACUUM/ANALYZE (VERBOSE).  IMHO
>> those are natural places to surface this information, and I want to ensure
>> that we're not painting ourselves into a corner with the approach we're
>> using for the progress views.
> 
> Yeah, I looked at it and that looks as simple as 0003 attached (as that's the
> leader that is doing the report in case of parallel workers being used).
> 
> 0001 and 0002 remain unchanged.

Thanks.  I've committed 0001 and 0002.

-- 
nathan




Re: Bump soft open file limit (RLIMIT_NOFILE) to hard limit on startup

2025-02-11 Thread Tom Lane
Tomas Vondra  writes:
> On 2/11/25 21:18, Tom Lane wrote:
>> I think what we actually would like to know is how often we have to
>> close an open FD in order to make room to open a different file.
>> Maybe that's the same thing you mean by "cache miss", but it doesn't
>> seem like quite the right terminology.  Anyway, +1 for adding some way
>> to discover how often that's happening.

> We can count the evictions (i.e. closing a file so that we can open a
> new one) too, but AFAICS that's about the same as counting "misses"
> (opening a file after not finding it in the cache). After the cache
> warms up, those counts should be about the same, I think.

Umm ... only if the set of files you want access to is quite static,
which doesn't seem like a great bet in the presence of temporary
tables and such.  I think if we don't explicitly count evictions
then we'll be presenting misleading results.

regards, tom lane




Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.

2025-02-11 Thread Peter Smith
On Tue, Feb 11, 2025 at 9:16 PM Shubham Khanna
 wrote:
>
> > #13. Unanswered question "How are tests expecting this even passing?".
> > Was a reason identified? IOW, how can we be sure the latest tests
> > don't have a similar problem?
> >
>
> In the v4-0001 patch [1], the tests were mistakenly using
> 'command_fails' instead of 'command_fails_like' to verify failed test
> cases. Since 'command_fails' only checks for a non-zero exit code and
> not specific error messages, the tests were passing even when the
> expected errors were not being triggered correctly.
> To address this, I have updated the patch to use 'command_fails_like',
> ensuring that the test cases now explicitly verify the correct failure
> messages.
>

Ah, that makes sense. Thanks for sharing the reason. So in fact, it
was a valid concern because the v5 was still carrying over the same
flaw from v4... Anyway, it is good to know it is fixed now in v6.

=

Some general comments for the patch v6-0001:

Do you need to test every possible bad option combination? It may be
fine, because the error will be raised immediately, so I expect the
execution overhead to be ~zero.

BTW, your bad option combination tests only use --all-databases
*after* the other options. Maybe you should mix it up a bit, sometimes
putting it *before* the others as well, because rearranging will cause
different errors.

Everything else now looks good to me.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Allow io_combine_limit up to 1MB

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-11 13:12:17 +1300, Thomas Munro wrote:
> Tomas queried[1] the limit of 256kB (or really 32 blocks) for
> io_combine_limit.  Yeah, I think we should increase it and allow
> experimentation with larger numbers.  Note that real hardware and
> protocols have segment and size limits that can force the kernel to
> split your I/Os, so it's not at all a given that it'll help much or at
> all to use very large sizes, but YMMV.

FWIW, I see substantial performance *regressions* with *big* IO sizes using
fio. Just looking at cached buffered IO.

for s in 4 8 16 32 64 128 256 512 1024 2048 4096 8192; do
    echo -ne "$s\t\t"
    numactl --physcpubind 3 fio --directory /srv/dev/fio/ --size=32GiB \
        --overwrite 1 --time_based=0 --runtime=10 --name test --rw read \
        --buffered 0 --ioengine psync --buffered 1 --invalidate 0 \
        --output-format json --bs=$((1024*${s})) | jq '.jobs[] | .read.bw_mean'
done

io size kB      throughput in MB/s
4               6752
8               9297
16              11082
32              14392
64              15967
128             16658
256             16864
512             19114
1024            12874
2048            11770
4096            11781
8192            11744


I.e. throughput peaks at 19GB/s and drops off fairly noticeably after that.

I've measured this on a number of different AMD and Intel systems, with
similar results, albeit with different inflection points. On the Intel systems
I have access to, the point where things slow down typically seems to be
earlier than on AMD.


It's worth noting that if I boot with mitigations=off clearcpuid=smap I get
*vastly* better performance:

io size kB      throughput in MB/s
4               12054
8               13872
16              16709
32              20564
64              22559
128             23133
256             23317
512             25829
1024            15912
2048            15213
4096            14129
8192            13795

Most of the gain isn't due to mitigations=off but to clearcpuid=smap. Apparently
SMAP, which requires explicit code to allow kernel space to access userspace
memory (to make exploitation harder), reacts badly to copying lots of memory.

This seems absolutely bonkers to me.



> I was originally cautious because I didn't want to make a few stack buffers
> too big, but arrays of BlockNumber, struct iovec, and pointer don't seem too
> excessive at say 128 (cf whole blocks on the stack, a thing we do, which
> would still be many times larger than the relevant arrays).  I was also
> anticipating future code that would need to multiply that number by other
> terms to allocate shared memory, but after some off-list discussion, that
> seems OK: such code should be able to deal with that using GUCs instead of
> maximally pessimal allocation.  128 gives a nice round number of 1M as a
> maximum transfer size, and comparable systems seem to have upper limits
> around that mark.  Patch attached.

To make that possible we'd need two different io_combine_limit GUCs, one
PGC_POSTMASTER that defines a hard max, and one that can be changed at
runtime, up to the PGC_POSTMASTER one.

It's somewhat painful to have such GUCs, because we don't have real
infrastructure for interdependent GUCs.  Typically the easiest way is to just
do a Min() at runtime between the two GUCs.  But looking at the number of
references to io_combine_limit in read_stream.c, that doesn't look like fun.

Do you have a good idea how to keep read_stream.c readable?

Greetings,

Andres Freund




Re: [PATCH] Optionally record Plan IDs to track plan changes for a query

2025-02-11 Thread Julien Rouhaud
On Wed, Feb 12, 2025 at 10:59:04AM +0900, Michael Paquier wrote:
> On Wed, Feb 12, 2025 at 09:20:53AM +0800, Julien Rouhaud wrote:
> >
> > FTR my main motivation was to be able to deal with queries referencing
> > temporary relations, as if your application creates a lot of those it 
> > basically
> > means that you cannot use pg_stat_statements anymore.
>
> Do you have an issue more details about your problem?  If we can
> improve the situation in core without impacting the existing cases
> that we need to support in pgss, that may be worth looking at.

I thought this was a well-known limitation.  The basic issue is that if you
rely on temp tables, you usually end up with a virtually infinite number of
queryids, since every temp table gets a different OID and that OID is used in
the queryid computation.  And in that case the overhead of pg_stat_statements
is insanely high.  The last figures I saw were from Andres many years ago,
mentioning a 40% overhead, and I don't think it's hard to get way worse
overhead than that if you have lengthier query texts.

As a prototype in my extension I think I just entirely ignored such queries,
but another (and probably friendlier for the actual pg_stat_statements
statistics) approach would be to use the relation name to compute the queryid
rather than its OID.  This would add some overhead, but I think it would have
very limited impact, especially compared to the current situation.
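
To give a rough idea of what I mean (treat the following as pseudocode: the
GUC name is invented, and the real jumbling code is generated from the node
definitions these days):

    if (queryid_temp_tables_by_name &&
        isTempNamespace(get_rel_namespace(rte->relid)))
    {
        /* fold in the (stable) relation name instead of the throwaway OID */
        char       *relname = get_rel_name(rte->relid);

        AppendJumble(jstate, (const unsigned char *) relname, strlen(relname));
    }
    else
        AppendJumble(jstate, (const unsigned char *) &rte->relid, sizeof(Oid));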

Of course some people may want to keep the current behavior, if they have a
limited number of temp tables or similar, so I had a GUC for that.  I don't
think that the community would really welcome such a GUC for core postgres,
especially since it wouldn't be pg_stat_statements specific.




Re: Allow io_combine_limit up to 1MB

2025-02-11 Thread Andres Freund
Hi,

On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:
> How about just maintaining it in a new variable
> effective_io_combine_limit, whenever either of them is assigned?

Yea, that's probably the least bad way.

I wonder if we should just name that variable io_combine_limit and have the
GUC be _raw or _guc or such? There's gonna be a fair number of references to
the variable in code...
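
Roughly what I'm imagining, as an untested sketch (the hook and variable
names here are invented):

    /* GUC storage: the user-settable value and the PGC_POSTMASTER cap */
    int         io_combine_limit_guc;
    int         io_max_combine_limit;

    /* what the rest of the code actually reads */
    int         io_combine_limit;

    void
    assign_io_combine_limit(int newval, void *extra)
    {
        io_combine_limit_guc = newval;
        io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
    }

    void
    assign_io_max_combine_limit(int newval, void *extra)
    {
        io_max_combine_limit = newval;
        io_combine_limit = Min(io_max_combine_limit, io_combine_limit_guc);
    }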

Greetings,

Andres Freund




Re: Allow io_combine_limit up to 1MB

2025-02-11 Thread Thomas Munro
On Wed, Feb 12, 2025 at 3:22 PM Andres Freund  wrote:
> On 2025-02-12 13:59:21 +1300, Thomas Munro wrote:
> > How about just maintaining it in a new variable
> > effective_io_combine_limit, whenever either of them is assigned?
>
> Yea, that's probably the least bad way.
>
> I wonder if we should just name that variable io_combine_limit and have the
> GUC be _raw or _guc or such? There's gonna be a fair number of references to
> the variable in code...

Alternatively, we could compute that as stream->io_combine_limit and
use that.  That has the advantage that it's fixed for the life of the
stream, even if you change it (eg between fetches from a CURSOR that
has streams).  Pretty sure it won't break anything today, but it might
just run out of queue space limiting concurrency arbitrarily if you
increase it, which is a bit weird now that I focus on that.  Capturing
the value we'll use up front seems better on that front.  On the other
hand, other future code might also have to remember to compute that
too (write streams, ...), a tiny bit of duplication.  Something like
the attached.  Or ... I guess we could do both things?
From 8cfa23a370a4564a0369991e2b0068b48983a0f6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 12 Feb 2025 13:52:22 +1300
Subject: [PATCH v2] Introduce max_io_combine_limit.

The existing io_combine_limit parameter can be set by users.  The new
max_io_combine_limit parameter can be set only at server startup time.
Code that combines I/Os should respect both of them by taking the
smaller value.

This allows the administrator to cap the total I/O size system-wide, but
also provides a way for proposed patches to know what the maximum could
possibly be in cases where it is multiplied by other GUCs to allocate
shared memory, without having to assume that it's as high as the
compile-time MAX_IO_COMBINE_LIMIT value.

The read_stream.c code is changed to compute the minimum value up front
as stream->io_combine_limit instead of using io_combine_limit directly.
That has the extra benefit of remaining stable throughout the lifetime
of the stream even if the user changes it (eg while consuming from a
CURSOR).  As previously coded, a mid-stream increase could limit
concurrency artificially just because we run out of queue space too
soon.
---
 doc/src/sgml/config.sgml  | 23 ++-
 src/backend/commands/variable.c   |  1 -
 src/backend/storage/aio/read_stream.c | 29 ---
 src/backend/storage/buffer/bufmgr.c   |  5 ++--
 src/backend/utils/misc/postgresql.conf.sample |  2 ++
 src/include/storage/bufmgr.h  |  1 +
 6 files changed, 47 insertions(+), 14 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 3b557ecabfb..c6de8b9e236 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2605,6 +2605,24 @@ include_dir 'conf.d'

   
 
+  
+   max_io_combine_limit (integer)
+   
+max_io_combine_limit configuration parameter
+   
+   
+   
+
+ Controls the largest I/O size in operations that combine I/O, and silently
+ limits the user-settable parameter io_combine_limit.
+ This parameter can only be set in
+ the postgresql.conf file or on the server
+ command line.
+ The default is 128kB.
+
+   
+  
+
   
io_combine_limit (integer)

@@ -2613,7 +2631,10 @@ include_dir 'conf.d'


 
- Controls the largest I/O size in operations that combine I/O.
+ Controls the largest I/O size in operations that combine I/O.  If set
+ higher than the max_io_combine_limit parameter, the
+ smaller value will silently be used instead, so both may need to be raised
+ to increase the I/O size.
  The default is 128kB.
 

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..8f77eb6ee79 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1156,7 +1156,6 @@ assign_maintenance_io_concurrency(int newval, void *extra)
 #endif
 }
 
-
 /*
  * These show hooks just exist because we want to show the values in octal.
  */
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index e4414b2e915..ad3da5470c4 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -108,6 +108,7 @@ typedef struct InProgressIO
  */
 struct ReadStream
 {
+	int16		io_combine_limit;
 	int16		max_ios;
 	int16		ios_in_progress;
 	int16		queue_size;
@@ -225,7 +226,7 @@ read_stream_start_pending_read(ReadStream *stream, bool suppress_advice)
 
 	/* This should only be called with a pending read. */
 	Assert(stream->pending_read_nblocks > 0);
-	Assert(stream->pending_read_nblocks <= io_combine_limit);
+	Assert(stream->pending_read_nblocks <= stream->io_combine_limit

Re: [PATCH] Optionally record Plan IDs to track plan changes for a query

2025-02-11 Thread Julien Rouhaud
On Tue, Feb 11, 2025 at 08:57:46PM -0600, Sami Imseih wrote:
> > Of course some people may want to keep the current behavior, if they have
> > limited number of temp tables or similar, so I had a GUC for that.  I don't
> > think that the community would really welcome such GUC for core-postgres,
> > especially since it wouldn't be pg_stat_statements specific.
>
> FWIW, I think options to tweak queryId computation is something
> that should be in core. It was discussed earlier in the context
> of IN list merging; the patch for this currently has the guc
> for the feature in pg_stat_statements, but there was a discussion
> about actually moving this to core [1] Giving the user a way
> to control certain behavior about the queryId computation
> is a good thing to do in core; especially queryId is no longer
> just consumed in pg_stat_statements. Maybe the right answer
> is an enum GUC, not sure yet.
>
> Specifically for the use-case you mention, using names vs OIDs in
> queryId computation is a valid use case for more than temporary tables,
> I can also think of upgrade, dump/restore, logical replication cases which
> can then allow for a consistent queryId.

Well, the ability for extensions to override the actual queryid calculation was
the result of more than half a decade of strong disagreements about it.   And
I'm definitely not volunteering to reopen that topic :)




Re: Small memory fixes for pg_createsubcriber

2025-02-11 Thread Michael Paquier
On Tue, Feb 11, 2025 at 01:32:32PM -0300, Euler Taveira wrote:
> There is no bug. They are the same behind the scenes. I'm fine changing it. It
> is a new code and it wouldn't cause a lot of pain to backpatch patches in the
> future.

On consistency grounds, and as this is documented in fe-exec.c at the
top of PQfreemem(), I can get behind the switch.

> Even if the pg_createsubscriber aims to run in a small amount of time, hence,
> it is fine to leak memory, the initial commit cleaned up all variables but a
> subsequent commit 4867f8a555c apparently didn't. Although it is just a low
> impact improvement, it is better to be strict and shut up SASTs.

check_and_drop_existing_subscriptions() is called once per database in
setup_subscriber(), and we are not going to have millions of them in
this list.  We don't usually care for such short-lived things, but as
the original commit did the effort in d44032d01463, I don't see why we
cannot do it here, either.
--
Michael


signature.asc
Description: PGP signature


  1   2   >