Re: thread-safety: getpwuid_r()

2024-09-02 Thread Peter Eisentraut

On 26.08.24 19:54, Heikki Linnakangas wrote:

On 26/08/2024 20:38, Peter Eisentraut wrote:

On 24.08.24 15:55, Heikki Linnakangas wrote:
Come to think of it, the pg_get_user_name() function is just a thin 
wrapper around getpwuid_r(). It doesn't provide a lot of value. 
Perhaps we should remove pg_get_user_name() and 
pg_get_user_home_dir() altogether and call getpwuid_r() directly.


Yeah, that seems better.  These functions are somewhat strangely 
designed and as you described have faulty error handling.  By calling 
getpwuid_r() directly, we can handle the errors better and the code 
becomes more transparent.  (There used to be a lot more interesting 
portability complications in that file, but those are long gone.)


New patch looks good to me, thanks!


committed





Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-09-02 Thread Daniel Gustafsson
> On 23 Aug 2024, at 01:56, Michael Paquier  wrote:
> 
> On Thu, Aug 22, 2024 at 11:13:15PM +0200, Daniel Gustafsson wrote:
>> On 22 Aug 2024, at 02:31, Michael Paquier  wrote:
>>> Just do it :)
>> 
>> That's my plan, I wanted to wait a bit to see if anyone else chimed in with
>> concerns.
> 
> Cool, thanks!

Attached is a rebased v15 (only changes are commit-message changes noted by
Peter upthread) for the sake of archives, and for a green-check run in the
CFBot.  Assuming this builds green I intend to push this.

--
Daniel Gustafsson



v15-0002-Only-perform-pg_strong_random-init-when-required.patch
Description: Binary data


v15-0001-Remove-support-for-OpenSSL-older-than-1.1.0.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-02 Thread Peter Smith
Hi. Thanks for addressing my previous review comments.

Here are some review comments for v44-0001.

==
Commit message.

1.
Because such synced slots are typically considered not
active (for them to be later considered as inactive) as they don't
perform logical decoding to produce the changes.

~

This sentence is bad grammar. The docs have the same wording, so
please see my doc review comment #4 suggestion below.

==
doc/src/sgml/config.sgml

2.
+   
+Invalidates replication slots that are inactive for longer than
+specified amount of time. If this value is specified without units,
+it is taken as seconds. A value of zero (which is default) disables
+the timeout mechanism. This parameter can only be set in
+the postgresql.conf file or on the server
+command line.
+   
+

nit - This is OK as-is, but OTOH why not make the wording consistent
with the previous GUC description? (e.g. see my v43 [1] #2 review
comment)

~~~

3.
+   
+This invalidation check happens either when the slot is acquired
+for use or during checkpoint. The time since the slot has become
+inactive is known from its
+inactive_since value using which the
+timeout is measured.
+   
+

I felt this is slightly misleading because slot acquiring has nothing
to do with setting the slot invalidation anymore. Furthermore, the 2nd
sentence is bad grammar.

nit - IMO something simple like the following rewording can address
both of those points:

Slot invalidation due to inactivity timeout occurs during checkpoint.
The duration of slot inactivity is calculated using the slot's
inactive_since field value.

~

4.
+Because such synced slots are typically considered not active
+(for them to be later considered as inactive) as they don't perform
+logical decoding to produce the changes.

That sentence has bad grammar.

nit – suggest a much simpler replacement:
Synced slots are always considered to be inactive because they don't
perform logical decoding to produce changes.

==
src/backend/replication/slot.c

5.
+#define IsInactiveTimeoutSlotInvalidationApplicable(s) \
+ (replication_slot_inactive_timeout > 0 && \
+ s->inactive_since > 0 && \
+ !RecoveryInProgress() && \
+ !s->data.synced)
+

5a.
I felt this would be better implemented as an inline function. Then it
can be commented on properly to explain the parts of the condition.
e.g. the large comment currently in InvalidatePossiblyObsoleteSlot()
would be more appropriate in this function.

~

5b.
The name is very long. Can't it be something shorter/simpler like:
'IsSlotATimeoutCandidate()'
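
For illustration, here is a rough sketch of how 5a and 5b might combine
(just a sketch, not code from the patch; it assumes the same GUC and
ReplicationSlot fields that the v44-0001 macro relies on):

static inline bool
IsSlotATimeoutCandidate(ReplicationSlot *s)
{
	/* Inactivity-timeout invalidation is disabled when the GUC is zero. */
	if (replication_slot_inactive_timeout <= 0)
		return false;

	/* The slot must have been marked inactive at some point. */
	if (s->inactive_since <= 0)
		return false;

	/* The mechanism is not applicable during recovery... */
	if (RecoveryInProgress())
		return false;

	/* ...nor for slots on a standby that are synced from the primary. */
	if (s->data.synced)
		return false;

	return true;
}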

~~~

6. ReplicationSlotAcquire

-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait,
+bool check_for_invalidation)

nit - Previously this new parameter really did mean to "check" for
[and set the slot] invalidation. But now I suggest renaming it to
'error_if_invalid' to properly reflect the new usage. And also in the
slot.h.

~

7.
+ /*
+ * Error out if the slot has been invalidated previously. Because there's
+ * no use in acquiring the invalidated slot.
+ */

nit - The comment is contrary to the code. If there was no reason to
skip this error, then you would not have the new parameter allowing
you to skip this error. I suggest just repeating the same comment as
in the function header.

~~~

8. ReportSlotInvalidation

nit - Added some blank lines for consistency.

~~~

9. InvalidatePossiblyObsoleteSlot

+ /*
+ * Quick exit if inactive timeout invalidation mechanism
+ * is disabled or slot is currently being used or the
+ * server is in recovery mode or the slot on standby is
+ * currently being synced from the primary.
+ *
+ * Note that the inactive timeout invalidation mechanism
+ * is not applicable for slots on the standby server that
+ * are being synced from primary server. Because such
+ * synced slots are typically considered not active (for
+ * them to be later considered as inactive) as they don't
+ * perform logical decoding to produce the changes.
+ */
+ if (!IsInactiveTimeoutSlotInvalidationApplicable(s))
+ break;

9a.
Consistency is good (commit message, docs and code comments for this),
but the added sentence has bad grammar. Please see the docs review
comment #4 above for some alternate phrasing.

~

9b.
Now that this logic is moved into a macro (I suggested it should be an
inline function) IMO this comment does not belong here anymore because
it is commenting code that you cannot see. Instead, this comment (or
something like it) should be as comments within the new function.

==
src/include/replication/slot.h

10.
+extern void ReplicationSlotAcquire(const char *name, bool nowait,
+bool check_for_invalidation);

Change the new param name as described in the earlier review comment.

==
src/test/recovery/t/050_invalidate_slots.pl

~~~

Please refer to the attached file which implements some o

Re: pgsql: Add more SQL/JSON constructor functions

2024-09-02 Thread Amit Langote
On Fri, Aug 30, 2024 at 4:32 PM Amit Langote  wrote:
> On Thu, Aug 22, 2024 at 12:44 PM Amit Langote  wrote:
> > On Thu, Aug 22, 2024 at 11:02 jian he  wrote:
> >> On Tue, Jul 30, 2024 at 12:59 PM Amit Langote  
> >> wrote:
> >> > On Fri, Jul 26, 2024 at 11:19 PM jian he  
> >> > wrote:
> >> > > {
> >> > > ...
> >> > > /*
> >> > >  * For expression nodes that support soft errors.  Should be set 
> >> > > to NULL
> >> > >  * before calling ExecInitExprRec() if the caller wants errors 
> >> > > thrown.
> >> > >  */
> >> > > ErrorSaveContext *escontext;
> >> > > } ExprState;
> >> > >
> >> > > i believe by default makeNode will set escontext to NULL.
> >> > > So the comment should be, if you want to catch the soft errors, make
> >> > > sure the escontext pointing to an allocated ErrorSaveContext.
> >> > > or maybe just say, default is NULL.
> >> > >
> >> > > Otherwise, the original comment's meaning feels like: we need to
> >> > > explicitly set it to NULL
> >> > > for certain operations, which I believe is false?
> >> >
> >> > OK, I'll look into updating this.
>
> See 0001.
>
> >> > > json_behavior_type:
> >> > > ERROR_P{ $$ = JSON_BEHAVIOR_ERROR; }
> >> > > | NULL_P{ $$ = JSON_BEHAVIOR_NULL; }
> >> > > | TRUE_P{ $$ = JSON_BEHAVIOR_TRUE; }
> >> > > | FALSE_P{ $$ = JSON_BEHAVIOR_FALSE; }
> >> > > | UNKNOWN{ $$ = JSON_BEHAVIOR_UNKNOWN; }
> >> > > | EMPTY_P ARRAY{ $$ = JSON_BEHAVIOR_EMPTY_ARRAY; }
> >> > > | EMPTY_P OBJECT_P{ $$ = JSON_BEHAVIOR_EMPTY_OBJECT; }
> >> > > /* non-standard, for Oracle compatibility only */
> >> > > | EMPTY_P{ $$ = JSON_BEHAVIOR_EMPTY_ARRAY; }
> >> > > ;
> >> > >
> >> > > EMPTY_P behaves the same as EMPTY_P ARRAY
> >> > > so for function GetJsonBehaviorConst, the following "case
> >> > > JSON_BEHAVIOR_EMPTY:" is wrong?
> >> > >
> >> > > case JSON_BEHAVIOR_NULL:
> >> > > case JSON_BEHAVIOR_UNKNOWN:
> >> > > case JSON_BEHAVIOR_EMPTY:
> >> > > val = (Datum) 0;
> >> > > isnull = true;
> >> > > typid = INT4OID;
> >> > > len = sizeof(int32);
> >> > > isbyval = true;
> >> > > break;
> >> > >
> >> > > also src/backend/utils/adt/ruleutils.c
> >> > > if (jexpr->on_error->btype != JSON_BEHAVIOR_EMPTY)
> >> > > get_json_behavior(jexpr->on_error, context, "ERROR");
> >> >
> >> > Something like the attached makes sense?  While this meaningfully
> >> > changes the deparsing output, there is no behavior change for
> >> > JsonTable top-level path execution.  That's because the behavior when
> >> > there's an error in the execution of the top-level path is to throw it
> >> > or return an empty set, which is handled in jsonpath_exec.c, not
> >> > execExprInterp.c.
>
> See 0002.
>
> I'm also attaching 0003 to fix a minor annoyance that JSON_TABLE()
> columns' default ON ERROR, ON EMPTY behaviors are unnecessarily
> emitted in the deparsed output when the top-level ON ERROR behavior is
> ERROR.
>
> Will push these on Monday.

Didn't push, as there's a release freeze in effect for the v17 branch.  Will
push to both master and v17 once the freeze is over.

> I haven't had a chance to take a closer look at your patch to optimize
> the code in ExecInitJsonExpr() yet.

I've simplified your patch a bit and attached it as 0004.

-- 
Thanks, Amit Langote


v2-0002-SQL-JSON-Fix-default-ON-ERROR-behavior-for-JSON_T.patch
Description: Binary data


v2-0004-SQL-JSON-Avoid-initializing-unnecessary-ON-ERROR-.patch
Description: Binary data


v2-0003-SQL-JSON-Fix-JSON_TABLE-column-deparsing.patch
Description: Binary data


v2-0001-Update-comment-about-ExprState.escontext.patch
Description: Binary data


Re: generic plans and "initial" pruning

2024-09-02 Thread Amit Langote
On Sat, Aug 31, 2024 at 9:30 PM Junwang Zhao  wrote:
> @@ -1241,7 +1244,7 @@ GetCachedPlan(CachedPlanSource *plansource,
> ParamListInfo boundParams,
>   if (customplan)
>   {
>   /* Build a custom plan */
> - plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv);
> + plan = BuildCachedPlan(plansource, qlist, boundParams, queryEnv, true);
>
> Is the *true* here a typo? Seems it should be *false* for custom plan?

That's correct, thanks for catching that.  Will fix.

-- 
Thanks, Amit Langote




Re: More performance improvements for pg_dump in binary upgrade mode

2024-09-02 Thread Daniel Gustafsson
> On 5 Jun 2024, at 04:39, Nathan Bossart  wrote:
> 
> On Wed, May 15, 2024 at 03:21:36PM -0500, Nathan Bossart wrote:
>> Nice!  I'll plan on taking a closer look at this one.
> 
> LGTM.  I've marked the commitfest entry as ready-for-committer.

Thanks for review, committed.

--
Daniel Gustafsson





Windows socket problems, interesting connection to AIO

2024-09-02 Thread Thomas Munro
There's a category[1] of random build farm/CI failures where Windows
behaves differently and our stuff breaks, which also affects end
users.  A recent BF failure[2] that looks like one of those jangled my
nerves when I pushed a commit, so I looked into a new theory on how to
fix it.  First, let me restate my understanding of the two categories
of known message loss on Windows, since the information is scattered
far and wide across many threads:

1.  When a backend exits without closing the socket gracefully, which
was briefly fixed[3] but later reverted because it broke something
else, a Windows server's network stack might fail to send data that it
had buffered but not yet physically sent[4].

The reason we reverted that and went back to abortive socket shutdown
(ie just exit()) is that our WL_SOCKET_READABLE was buggy, and could
miss FD_CLOSE events from graceful but not abortive shutdowns (which
keep reporting themselves repeatedly, something to do with being an
error state (?)).  Sometimes a libpq socket we're waiting for with
WaitLatchOrSocket() on the client end of the socket could hang
forever.  Concretely: a replication connection or postgres_fdw running
inside another PostgreSQL server.  We fixed that event loss, albeit in
a gross kludgy way[5], because other ideas seemed too complicated (to
wit, various ways to manage extra state associated with each socket,
really hard to retro-fit in a satisfying way).  Graceful shutdown
should fix the race cases where the next thing the client calls is
recv(), as far as I know.

2.  If a Windows client tries to send() and gets an ECONNRESET/EPIPE
error, then the network stack seems to drop already received data, so
a following recv() will never see it.  In other words, it depends on
whether the application-level protocol is strictly request/response
based, or has sequence points at which both ends might send().  AFAIK
the main consequence for real users is that FATAL recovery conflict,
idle termination, etc messages are not delivered to clients, leaving
just "server closed the connection unexpectedly".

I have wondered whether it might help to kludgify the Windows TCP code
even more by doing an extra poll() for POLLRD before every single
send().  "Hey network stack, before I try to send this message, is
there anything the server wanted to tell me?", but I guess that must
be racy because the goodbye message could arrive between poll() and
send().  Annoyingly, I suspect it would *mostly* work.

The new thought I had about the second category of problem is: if you
use asynchronous networking APIs, then the kernel *can't* throw your
data out, because it doesn't even have it.  If the server's FATAL
message arrives before the client calls send(), then the data is
already written to user space memory and the I/O is marked as
complete.  If it arrives after, then there's no issue, because
computers can't see into the future yet.  That's my hypothesis,
anyway.  To try that, I started with a very simple program[6] on my
local FreeBSD system that does a failing send, and tries synchronous
and asynchronous recv():

=== synchronous ===
send -> -1, error = 32
recv -> "FATAL: flux capacitor failed", error = 0
=== posix aio ===
send -> -1, error = 32
async recv -> "FATAL: flux capacitor failed", error = 0

... and then googled enough Windows-fu to translate it and run it on
CI, and saw the known category 2 failure with the plain old
synchronous version.  The good news is that the async version sees the
goodbye message:

=== synchronous ===
send -> 14, error = 0
recv -> "", error = 10054
=== windows overlapped ===
send -> 14, error = 0
async recv -> "FATAL: flux capacitor failed", error = 0

That's not the same as a torture test for weird timings, and I have
zero knowledge of the implementation of this stuff, but I currently
can't imagine how it could possibly be implemented in any way that
could give a different answer.
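
For reference, here is a rough sketch of the POSIX AIO variant described
above (this is not the actual program from [6]; the socket setup, the
failing send(), and most error handling are elided, and the exact ordering
is an assumption):

#include <aio.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void
async_recv_demo(int sock)
{
	char		buf[128];
	struct aiocb cb;

	memset(buf, 0, sizeof(buf));
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = sock;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof(buf) - 1;

	/* Queue the read; completed data lands directly in user memory. */
	if (aio_read(&cb) != 0)
	{
		perror("aio_read");
		return;
	}

	/* ... the send() that fails with EPIPE/ECONNRESET would go here ... */

	/* Wait for the asynchronous read to complete. */
	while (aio_error(&cb) == EINPROGRESS)
		usleep(1000);

	if (aio_error(&cb) == 0)
		printf("async recv -> \"%s\", error = 0\n", buf);
	else
		printf("async recv -> \"\", error = %d\n", aio_error(&cb));
}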

Perhaps we could figure out a way to use that API to simulate
synchronous recv() built on top of that stuff, but I think a more
satisfying use of our time and energy would be to redesign all our
networking code to do cross-platform AIO.  I think that will mostly
come down to a bunch of network buffer management redesign work.
Anyway, I don't have anything concrete there, I just wanted to share
this observation.

[1] 
https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Miscellaneous_tests_fail_on_Windows_due_to_a_connection_closed_before_receiving_a_final_error_message
[2] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-08-31%2007%3A54%3A58
[3] 
https://github.com/postgres/postgres/commit/6051857fc953a62db318329c4ceec5f9668fd42a
[4] 
https://learn.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
[5] 
https://github.com/postgres/postgres/commit/a8458f508a7a441242e148f008293128676df003
[6] https://github.com/macdice/hello-windows/blob/socket-hacking/test.c




pg_stats_subscription_stats order of the '*_count' columns

2024-09-02 Thread Peter Smith
Hi,

While reviewing another thread I was looking at the
'pg_stats_subscription_stats' view. In particular, I was looking at
the order of the "*_count" columns of that view.

IMO there is an intuitive/natural ordering for the logical replication (LR)
operations being counted. For example, LR "initial tablesync" always comes
before LR "apply".

I propose that the columns of the view should also be in this same
intuitive order: Specifically, "sync_error_count" should come before
"apply_error_count" (left-to-right in view, top-to-bottom in docs).

Currently, they are not arranged that way.

The view today has only 2 count columns in HEAD, so this proposal
seems trivial, but there is another patch [2] soon to be pushed, which
will add more conflict count columns. As the number of columns
increases IMO it becomes more important that each column is where you
would intuitively expect to find it.

~

Changes would be needed in several places:
- docs (doc/src/sgml/monitoring.sgml)
- function pg_stat_get_subscription_stats (pg_proc.dat)
- view pg_stat_subscription_stats (src/backend/catalog/system_views.sql)
- TAP test SELECTs (test/subscription/t/026_stats.pl)

~

Thoughts?

==
[1] docs - 
https://www.postgresql.org/docs/devel/monitoring-stats.html#MONITORING-PG-STAT-SUBSCRIPTION-STATS
[2] stats for conflicts -
https://www.postgresql.org/message-id/flat/OS0PR01MB57160A07BD575773045FC214948F2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Collect statistics about conflicts in logical replication

2024-09-02 Thread Peter Smith
On Mon, Sep 2, 2024 at 1:28 PM shveta malik  wrote:
>
> On Mon, Sep 2, 2024 at 4:20 AM Peter Smith  wrote:
> >
> > On Fri, Aug 30, 2024 at 4:24 PM shveta malik  wrote:
> > >
> > > On Fri, Aug 30, 2024 at 10:53 AM Peter Smith  
> > > wrote:
> > > >
> > ...
> > > > 2. Arrange all the counts into an intuitive/natural order
> > > >
> > > > There is an intuitive/natural ordering for these counts. For example,
> > > > the 'confl_*' count fields are in the order insert -> update ->
> > > > delete, which LGTM.
> > > >
> > > > Meanwhile, the 'apply_error_count' and the 'sync_error_count' are not
> > > > in a good order.
> > > >
> > > > IMO it makes more sense if everything is ordered as:
> > > > 'sync_error_count', then 'apply_error_count', then all the 'confl_*'
> > > > counts.
> > > >
> > > > This comment applies to lots of places, e.g.:
> > > > - docs (doc/src/sgml/monitoring.sgml)
> > > > - function pg_stat_get_subscription_stats (pg_proc.dat)
> > > > - view pg_stat_subscription_stats (src/backend/catalog/system_views.sql)
> > > > - TAP test SELECTs (test/subscription/t/026_stats.pl)
> > > >
> > > > As all those places are already impacted by this patch, I think it
> > > > would be good if (in passing) we (if possible) also swapped the
> > > > sync/apply counts so everything  is ordered intuitively top-to-bottom
> > > > or left-to-right.
> > >
> > > Not sure about this though. It does not seem to belong to the current 
> > > patch.
> > >
> >
> > Fair enough. But, besides being inappropriate to include in the
> > current patch, do you think the suggestion to reorder them made sense?
> > If it has some merit, then I will propose it again as a separate
> > thread.
> >
>
>  Yes, I think it makes sense. With respect to internal code, it might
> still be okay as is, but when it comes to pg_stat_subscription_stats,
> I think it is better if the user finds it in the below order:
>  subid | subname | sync_error_count | apply_error_count | confl_*
>
>  rather than the existing one:
>  subid | subname | apply_error_count | sync_error_count | confl_*
>

Hi Shveta, Thanks. FYI - I created a new thread for this here [1].

==
[1] 
https://www.postgresql.org/message-id/CAHut+PvbOw90wgGF4aV1HyYtX=6pjWc+pn8_fep7L=alxwx...@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-02 Thread Amit Kapila
On Sat, Aug 31, 2024 at 1:45 PM Bharath Rupireddy
 wrote:
>
> Please find the attached v44 patch with the above changes. I will
> include the 0002 xid_age based invalidation patch later.
>

It is better to get the 0001 reviewed and committed first. We can
discuss about 0002 afterwards as 0001 is in itself a complete and
separate patch that can be committed.

-- 
With Regards,
Amit Kapila.




Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-02 Thread Amit Kapila
On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar  wrote:
>
> While working on some other code I noticed that in
> FindReplTupleInLocalRel() there is an assert [1] that seems to be
> passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> be passing normal relation.
>

Agreed. But this should lead to assertion failure. Did you try testing it?

-- 
With Regards,
Amit Kapila.




Re: AIO v2.0

2024-09-02 Thread Heikki Linnakangas

On 01/09/2024 09:27, Andres Freund wrote:

The main reason I had previously implemented WAL AIO etc was to know the
design implications - but now that they're somewhat understood, I'm planning
to keep the patchset much smaller, with the goal of making it upstreamable.


+1 on that approach.


To solve the issue with an unbounded number of AIO references there are few
changes compared to the prior approach:

1) Only one AIO handle can be "handed out" to a backend, without being
defined. Previously the process of getting an AIO handle wasn't super
lightweight, which made it appealing to cache AIO handles - which was one
part of the problem for running out of AIO handles.

2) Nothing in a backend can force a "defined" AIO handle (i.e. one that is a
valid operation) to stay around, it's always possible to execute the AIO
operation and then reuse the handle.  This provides a forward guarantee, by
ensuring that completing AIOs can free up handles (previously they couldn't
be reused until the backend local reference was released).

3) Callbacks on AIOs are not allowed to error out anymore, unless it's ok to
take the server down.

4) Obviously some code needs to know the result of AIO operation and be able
to error out. To allow for that the issuer of an AIO can provide a pointer
to local memory that'll receive the result of an AIO, including details
about what kind of errors occurred (possible errors are e.g. a read failing
or a buffer's checksum validation failing).


In the next few days I'll add a bunch more documentation and comments as well
as some better perf numbers (assuming my workstation survived...).


Yeah, a high-level README would be nice. Without that, it's hard to
follow what "handed out" and "defined" above mean, for example.


A few quick comments on the patches:

v2.0-0001-bufmgr-Return-early-in-ScheduleBufferTagForWrit.patch

+1, this seems ready to be committed right away.

v2.0-0002-Allow-lwlocks-to-be-unowned.patch

With LOCK_DEBUG, LWLock->owner will point to the backend that acquired 
the lock, but it doesn't own it anymore. That's reasonable, but maybe 
add a boolean to the LWLock to mark whether the lock is currently owned 
or not.


The LWLockReleaseOwnership() name is a bit confusing together with 
LWLockReleaseUnowned() and LWLockRelease(). From the names, you might 
think that they all release the lock, but LWLockReleaseOwnership() just 
disassociates it from the current process. Rename it to LWLockDisown() 
perhaps.


v2.0-0003-Use-aux-process-resource-owner-in-walsender.patch

+1. The old comment "We don't currently need any ResourceOwner in a 
walsender process" was a bit misleading, because the walsender did 
create the short-lived "base backup" resource owner, so it's nice to get 
that fixed.


v2.0-0008-aio-Skeleton-IO-worker-infrastructure.patch

My refactoring around postmaster.c child process handling will conflict 
with this [1]. Not in any fundamental way, but can I ask you to review 
those patch, please? After those patches, AIO workers should also have 
PMChild slots (formerly known as Backend structs).


[1] 
https://www.postgresql.org/message-id/a102f15f-eac4-4ff2-af02-f9ff209ec...@iki.fi


--
Heikki Linnakangas
Neon (https://neon.tech)





Re: not null constraints, again

2024-09-02 Thread Tender Wang
Alvaro Herrera  wrote on Saturday, 31 Aug 2024 at 11:59:

> Hello
>
> Here I present another attempt at making not-null constraints be
> catalogued.  This is largely based on the code reverted at 9ce04b50e120,
> except that we now have a not-null constraint automatically created for
> every column of a primary key, and such constraint cannot be removed
> while the PK exists.  Thanks to this, a lot of rather ugly code is gone,
> both in pg_dump and in backend -- in particular the handling of NO
> INHERIT, which was needed for pg_dump.
>
> Noteworthy psql difference: because there are now even more not-null
> constraints than before, the \d+ display would be far too noisy if we
> just let it grow.  So here I've made it omit any constraints that
> underlie the primary key.  This should be OK since you can't do much
> with those constraints while the PK is still there.  If you drop the PK,
> the next \d+ will show those constraints.
>
> One thing that regretfully I haven't yet had time for, is paring down
> the original test code: a lot of it is verifying the old semantics,
> particularly for NO INHERIT constraints, which had grown ugly special
> cases.  It now mostly raises errors; or the tests are simply redundant.
> I'm going to remove that stuff as soon as I'm back on my normal work
> timezone.
>
> sepgsql is untested.
>
> I'm adding this to the September commitfest.
>

The attached patch adds List *nnconstraints, which stores the not-null
definitions, to struct CreateStmt.
This makes me a little confused about List *constraints in struct
CreateStmt: that list actually stores check constraints, and it would be
better if the comments reflected that. Renaming it to List *ckconstraints
seems more reasonable, but a lot of code that uses stmt->constraints would
have to change.

Since AddRelationNewConstraints() can now add not-null column constraints,
the comments for AddRelationNewConstraints() should be tweaked a little:
"All entries in newColDefaults will be processed.  Entries in newConstraints
will be processed only if they are CONSTR_CHECK type."
Now the type of a new constraint may also be a not-null constraint.

If the column already has a not-null constraint and we add the same
not-null constraint again, the code will call AdjustNotNullInheritance1()
in AddRelationNewConstraints(). The comments before that call to
AdjustNotNullInheritance1() look confusing to me, because the constraint
is not inherited.

-- 
Tender Wang


Re: Jargon and acronyms on this mailing list

2024-09-02 Thread Dagfinn Ilmari Mannsåker
Greg Sabino Mullane  writes:

> I normally wouldn't mention my blog entries here, but this one was about
> the hackers mailing list, so wanted to let people know about it in case you
> don't follow Planet Postgres. I scanned the last year's worth of posts and
> gathered the most used acronyms and jargon. The most commonly used acronym
> was IMO (in my opinion), followed by FWIW (for what it's worth), and IIUC
> (if I understand correctly). The complete list can be found in the post
> below, I'll refrain from copying everything here.
>
> https://www.crunchydata.com/blog/understanding-the-postgres-hackers-mailing-list

Nice write-up! Might it also be worth linking to the acronyms and
glossary sections of the docs?

https://www.postgresql.org/docs/current/acronyms.html
https://www.postgresql.org/docs/current/glossary.html

> Cheers,
> Greg

- ilmari




Re: Avoiding superfluous buffer locking during nbtree backwards scans

2024-09-02 Thread Matthias van de Meent
On Fri, 30 Aug 2024 at 21:43, Matthias van de Meent
 wrote:
>
> On Mon, 19 Aug 2024 at 13:43, Matthias van de Meent
>  wrote:
> >
> > On Sun, 11 Aug 2024 at 21:44, Peter Geoghegan  wrote:
> > >
> > > On Tue, Aug 6, 2024 at 6:31 PM Matthias van de Meent
> > >  wrote:
> > > > +1, LGTM.
> > > >
> > > > This changes the backward scan code in _bt_readpage to have an
> > > > approximately equivalent handling as the forward scan case for
> > > > end-of-scan cases, which is an improvement IMO.
> >
> > Here's a new patch that further improves the situation, so that we
> > don't try to re-lock the buffer we just accessed when we're stepping
> > backward in index scans, reducing buffer lock operations in the common
> > case by 1/2.
>
> Attached is an updated version of the patch, now v2, which fixes some
> assertion failures for parallel plans by passing the correct
> parameters to _bt_parallel_release for forward scans.

I noticed I attached an older version of the patch which still had 1
assertion failure case remaining (thanks cfbot), so here's v3 which
solves that problem.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch
Description: Binary data


Re: Jargon and acronyms on this mailing list

2024-09-02 Thread Daniel Gustafsson
> On 2 Sep 2024, at 13:06, Dagfinn Ilmari Mannsåker  wrote:
> 
> Greg Sabino Mullane  writes:
> 
>> I normally wouldn't mention my blog entries here, but this one was about
>> the hackers mailing list, so wanted to let people know about it in case you
>> don't follow Planet Postgres. I scanned the last year's worth of posts and
>> gathered the most used acronyms and jargon. The most commonly used acronym
>> was IMO (in my opinion), followed by FWIW (for what it's worth), and IIUC
>> (if I understand correctly). The complete list can be found in the post
>> below, I'll refrain from copying everything here.
>> 
>> https://www.crunchydata.com/blog/understanding-the-postgres-hackers-mailing-list
> 
> Nice write-up!

+1

> Might it also be worth linking to the acronyms and
> glossary sections of the docs?

Or maybe on the site under https://www.postgresql.org/list/ in some way?

--
Daniel Gustafsson





Re: Use read streams in pg_visibility

2024-09-02 Thread Nazir Bilal Yavuz
Hi,

On Sat, 31 Aug 2024 at 02:51, Noah Misch  wrote:
>
> To read blocks 10 and 11, I would expect to initialize the struct with one of:
>
> { .first=10, .nblocks=2 }
> { .first=10, .last_inclusive=11 }
> { .first=10, .last_exclusive=12 }
>
> With the patch's API, I would need {.first=10,.nblocks=12}.  The struct field
> named "nblocks" behaves like a last_block_exclusive.  Please either make the
> behavior an "nblocks" behavior or change the field name to replace the term
> "nblocks" with something matching the behavior.  (I used longer field names in
> my examples here, to disambiguate those examples.  It's okay if the final
> field names aren't those, as long as the field names and the behavior align.)

I decided to use 'current_blocknum' and 'last_exclusive'. I think
these are easier to understand and use.

> > Thanks for the information, I will check these. What I still do not
> > understand is how to make sure that only the second block is processed
> > and the first one is skipped. pg_check_visible() and pg_check_frozen()
> > returns TIDs that cause corruption in the visibility map, there is no
> > information about block numbers.
>
> I see what you're saying.  collect_corrupt_items() needs a corrupt table to
> report anything; all corruption-free tables get the same output.  Testing this
> would need extra C code or techniques like corrupt_page_checksum() to create
> the corrupt state.  That wouldn't be a bad thing to have, but it's big enough
> that I'll consider it out of scope for $SUBJECT.  With the callback change
> above, I'll be ready to push all this.

Thanks, updated patches are attached.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 1dbdeacc54c3575a2cb82e95aab5b335b0fa1d2f Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 26 Aug 2024 12:12:52 +0300
Subject: [PATCH v4 1/5] Add general-use struct and callback to read stream

The number of callbacks that just iterate over a block range in a read stream
is increasing, so add a general-use struct and callback to the read stream code.
---
 src/include/storage/read_stream.h | 13 +
 src/backend/storage/aio/read_stream.c | 21 +++--
 src/tools/pgindent/typedefs.list  |  1 +
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/src/include/storage/read_stream.h b/src/include/storage/read_stream.h
index 4e599904f26..0a2398fd7df 100644
--- a/src/include/storage/read_stream.h
+++ b/src/include/storage/read_stream.h
@@ -45,11 +45,24 @@
 struct ReadStream;
 typedef struct ReadStream ReadStream;
 
+/*
+ * General-use struct to use in callback functions for block range scans.
+ * Callback loops between current_blocknum (inclusive) and last_exclusive.
+ */
+typedef struct BlockRangeReadStreamPrivate
+{
+	BlockNumber current_blocknum;
+	BlockNumber last_exclusive;
+} BlockRangeReadStreamPrivate;
+
 /* Callback that returns the next block number to read. */
 typedef BlockNumber (*ReadStreamBlockNumberCB) (ReadStream *stream,
 void *callback_private_data,
 void *per_buffer_data);
 
+extern BlockNumber block_range_read_stream_cb(ReadStream *stream,
+			  void *callback_private_data,
+			  void *per_buffer_data);
 extern ReadStream *read_stream_begin_relation(int flags,
 			  BufferAccessStrategy strategy,
 			  Relation rel,
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 93cdd35fea0..8a449bab8a0 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -164,8 +164,25 @@ get_per_buffer_data(ReadStream *stream, int16 buffer_index)
 }
 
 /*
- * Ask the callback which block it would like us to read next, with a one block
- * buffer in front to allow read_stream_unget_block() to work.
+ * General-use callback function for block range scans.
+ */
+BlockNumber
+block_range_read_stream_cb(ReadStream *stream,
+		   void *callback_private_data,
+		   void *per_buffer_data)
+{
+	BlockRangeReadStreamPrivate *p = callback_private_data;
+
+	if (p->current_blocknum < p->last_exclusive)
+		return p->current_blocknum++;
+
+	return InvalidBlockNumber;
+}
+
+/*
+ * Ask the callback which block it would like us to read next, with a small
+ * buffer in front to allow read_stream_unget_block() to work and to allow the
+ * fast path to skip this function and work directly from the array.
  */
 static inline BlockNumber
 read_stream_get_block(ReadStream *stream, void *per_buffer_data)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9e951a9e6f3..df3f336bec0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -275,6 +275,7 @@ BlockId
 BlockIdData
 BlockInfoRecord
 BlockNumber
+BlockRangeReadStreamPrivate
 BlockRefTable
 BlockRefTableBuffer
 BlockRefTableChunk
-- 
2.45.2
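
For context, here is a hypothetical usage sketch of the new callback (this is
not part of the attached patches; "rel", "forknum", "start" and "end" stand in
for whatever the caller already has at hand):

	BlockRangeReadStreamPrivate p;
	ReadStream *stream;
	Buffer		buf;

	p.current_blocknum = start;
	p.last_exclusive = end;

	stream = read_stream_begin_relation(READ_STREAM_FULL, NULL, rel, forknum,
										block_range_read_stream_cb, &p, 0);
	while ((buf = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
	{
		/* ... inspect the page held in "buf" ... */
		ReleaseBuffer(buf);
	}
	read_stream_end(stream);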

From 84bc78cc36814309634c3686827b8a222bdfbfef Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 26 Aug 2024 12:16:06 +0300
Subject: [PATCH v4

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-09-02 Thread Daniel Gustafsson
> On 2 Sep 2024, at 10:03, Daniel Gustafsson  wrote:
> 
>> On 23 Aug 2024, at 01:56, Michael Paquier  wrote:
>> 
>> On Thu, Aug 22, 2024 at 11:13:15PM +0200, Daniel Gustafsson wrote:
>>> On 22 Aug 2024, at 02:31, Michael Paquier  wrote:
 Just do it :)
>>> 
>>> That's my plan, I wanted to wait a bit to see if anyone else chimed in with
>>> concerns.
>> 
>> Cool, thanks!
> 
> Attached is a rebased v15 (only changes are commit-message changes noted by
> Peter upthread) for the sake of archives, and for a green-check run in the
> CFBot.  Assuming this builds green I intend to push this.

And pushed.  All BF owners with animals using 1.0.2 have been notified but not
all have been updated (or modified to skip SSL) so there will be some failing.

--
Daniel Gustafsson





Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-02 Thread Dilip Kumar
On Mon, Sep 2, 2024 at 3:32 PM Amit Kapila  wrote:
>
> On Mon, Sep 2, 2024 at 11:21 AM Dilip Kumar  wrote:
> >
> > While working on some other code I noticed that in
> > FindReplTupleInLocalRel() there is an assert [1] that seems to be
> > passing IndexRelation to GetRelationIdentityOrPK() whereas it should
> > be passing normal relation.
> >
>
> Agreed. But this should lead to assertion failure. Did you try testing it?

No, I did not test this particular case.  It affected me in some other code
I was adding, where an index relation was passed as input to the
RelationGetIndexList() function and my local changes were impacted by
that.  I will write a test for this stand-alone case so that it hits
the assert.  Thanks for looking into this.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Proposal for implementing OCSP Stapling in PostgreSQL

2024-09-02 Thread Daniel Gustafsson
> On 15 Aug 2024, at 00:42, Jacob Champion  
> wrote:

> It's pretty frustrating to hear about a "transition" when there is
> nothing to transition to.

I guess they prefer that orgs transition back to just using CRLs.

> Anyways, I look forward to seeing how broken my crystal ball is this
> time. The timing is awful for this patchset in particular.

It is indeed.  If it ends up deprecated server-side among the big providers,
then support for it risks being very hard to use.  Not sure what the best
course of action here is.

--
Daniel Gustafsson





Re: Virtual generated columns

2024-09-02 Thread jian he
On Wed, Aug 21, 2024 at 6:52 PM Dean Rasheed  wrote:
>
> On Wed, 21 Aug 2024 at 08:00, Peter Eisentraut  wrote:
> >
> > On 08.08.24 20:22, Dean Rasheed wrote:
> > > Looking at the rewriter changes, it occurred to me that it could
> > > perhaps be done more simply using ReplaceVarsFromTargetList() for each
> > > RTE with virtual generated columns. That function already has the
> > > required wholerow handling code, so there'd be less code duplication.
> >
> > Hmm, I don't quite see how ReplaceVarsFromTargetList() could be used
> > here.  It does have the wholerow logic that we need somehow, but other
> > than that it seems to target something different?
> >
>


> Well what I was thinking was that (in fireRIRrules()'s final loop over
> relations in the rtable), if the relation had any virtual generated
> columns, you'd build a targetlist containing a TLE for each one,
> containing the generated expression. Then you could just call
> ReplaceVarsFromTargetList() to replace any Vars in the query with the
> corresponding generated expressions. That takes care of descending
> into subqueries, adjusting varlevelsup, and expanding wholerow Vars
> that might refer to the generated expression.
>
> I also have half an eye on how this patch will interact with my patch
> to support RETURNING OLD/NEW values. If you use
> ReplaceVarsFromTargetList(), it should just do the right thing for
> RETURNING OLD/NEW generated expressions.
>
> > > I think it might be better to do this from within fireRIRrules(), just
> > > after RLS policies are applied, so it wouldn't need to worry about
> > > CTEs and sublink subqueries. That would also make the
> > > hasGeneratedVirtual flags unnecessary, since we'd already only be
> > > doing the extra work for tables with virtual generated columns. That
> > > would eliminate possible bugs caused by failing to set those flags.
> >
> > Yes, ideally, we'd piggy-back this into fireRIRrules().  One thing I'm
> > missing is that if you're descending into subqueries, there is no link
> > to the upper levels' range tables, which we need to lookup the
> > pg_attribute entries of column referencing Vars.  That's why there is
> > this whole custom walk with its own context data.  Maybe there is a way
> > to do this already that I missed?
> >
>
> That link to the upper levels' range tables wouldn't be needed because
> essentially using ReplaceVarsFromTargetList() flips the whole thing
> round: instead of traversing the tree looking for Var nodes that need
> to be replaced (possibly from upper query levels), you build a list of
> replacement expressions to be applied and apply them from the top,
> descending into subqueries as needed.
>

CREATE TABLE gtest1 (a int, b int GENERATED ALWAYS AS (a * 2) VIRTUAL);
INSERT INTO gtest1 VALUES (1,default), (2, DEFAULT);

select b from  (SELECT b FROM gtest1) sub;
Here we only need to translate the second "b" to (a * 2), not the first one,
but the query tree representations of these two "b" Vars are almost the same
(varno, varattno, varlevelsup).

I am not sure how ReplaceVarsFromTargetList() can disambiguate this.
Currently, v4-0001-Virtual-generated-columns.patch works, because v4 properly
sets the main query's hasGeneratedVirtual to false and the subquery's
hasGeneratedVirtual to true.




Re: Virtual generated columns

2024-09-02 Thread Nazir Bilal Yavuz
Hi,

On Thu, 29 Aug 2024 at 15:16, Peter Eisentraut  wrote:
>
> I also committed the two patches that renamed the existing test files,
> so those are not included here anymore.
>
> The new patch does some rebasing and contains various fixes to the
> issues you presented.  As I mentioned, I'll look into improving the
> rewriting.

The xid_wraparound test started to fail after edee0c621d. It seems the
error message used in xid_wraparound/002_limits was updated. A patch
that applies the same update to the test file is attached.

-- 
Regards,
Nazir Bilal Yavuz
Microsoft
From 748721898e8171d35d54ffe2b6edb38b9f5b020d Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Mon, 2 Sep 2024 16:18:57 +0300
Subject: [PATCH v1] Fix xid_wraparound/002_limits test

Error message used in xid_wraparound/002_limits is updated in edee0c621d.
Apply the same update in the test file as well.
---
 src/test/modules/xid_wraparound/t/002_limits.pl | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/test/modules/xid_wraparound/t/002_limits.pl b/src/test/modules/xid_wraparound/t/002_limits.pl
index aca3fa15149..889689d3bde 100644
--- a/src/test/modules/xid_wraparound/t/002_limits.pl
+++ b/src/test/modules/xid_wraparound/t/002_limits.pl
@@ -103,7 +103,7 @@ $ret = $node->psql(
 	stderr => \$stderr);
 like(
 	$stderr,
-	qr/ERROR:  database is not accepting commands that assign new XIDs to avoid wraparound data loss in database "postgres"/,
+	qr/ERROR:  database is not accepting commands that assign new transaction IDs to avoid wraparound data loss in database "postgres"/,
 	"stop-limit");
 
 # Finish the old transaction, to allow vacuum freezing to advance
-- 
2.45.2



Re: Use XLOG_CONTROL_FILE macro everywhere?

2024-09-02 Thread Daniel Gustafsson
> On 27 Apr 2024, at 11:12, Peter Eisentraut  wrote:
> 
> On 26.04.24 22:51, Tom Lane wrote:
>> Robert Haas  writes:
>>> On Wed, Apr 24, 2024 at 8:04 PM Michael Paquier  wrote:
 Not sure that I would bother with a second one.  But, well, why not if
 people want to rename it, as long as you keep compatibility.
>>> I vote for just standardizing on XLOG_CONTROL_FILE. That name seems
>>> sufficiently intuitive to me, and I'd rather have one identifier for
>>> this than two. It's simpler that way.
>> +1.  Back when we did the great xlog-to-wal renaming, we explicitly
>> agreed that we wouldn't change internal symbols referring to xlog.
>> It might or might not be appropriate to revisit that decision,
>> but I sure don't want to do it piecemeal, one symbol at a time.
>> Also, if we did rename this one, the logical choice would be
>> WAL_CONTROL_FILE not PG_CONTROL_FILE.
> 
> My reasoning was mainly that I don't see pg_control as controlling just the 
> WAL.  But I don't feel strongly about instigating a great renaming here or 
> something.

Summarizing the thread, the consensus seems to be to use XLOG_CONTROL_FILE
consistently, as per the original patch.

A few comments on the patch though:

- * reads the data from $PGDATA/global/pg_control
+ * reads the data from $PGDATA/

I don't think this is an improvement, I'd leave that one as the filename
spelled out.

- "the \".old\" suffix from %s/global/pg_control.old.\n"
+ "the \".old\" suffix from %s/%s.old.\n"

Same with that change; I'm not sure it makes reading the error message
code any easier.

--
Daniel Gustafsson





Re: proposal: schema variables

2024-09-02 Thread Laurenz Albe
On Thu, 2024-08-29 at 19:33 +0200, Pavel Stehule wrote:
> > > > >   > +   /*
> > > > >   > +    * Although svar is freshly validated in this point, the 
> > > > > svar->is_valid can
> > > > >   > +    * be false, due possible accepting invalidation message 
> > > > > inside domain
> > > > >   > +    * check. Now, the validation is done after lock, that can 
> > > > > also accept
> > > > >   > +    * invalidation message, so validation should be trustful.
> > > > >   > +    *
> > > > >   > +    * For now, we don't need to repeat validation. Only svar 
> > > > > should be valid
> > > > >   > +    * pointer.
> > > > >   > +    */
> > > > 
> > > > This comment is related to assertions. Before I had there 
> > > > `Assert(svar->is_valid)`,
> > > > because I expected it. But it was not always true. And although it is 
> > > > true,
> > > > we don't need to validate a variable, because at this moment, the 
> > > > variable
> > > > should be locked, and then we can return content safely.
> > > 
> > > I guess my main problem is the word "trustful".  I don't recognize that 
> > > word.
> > > Perhaps you can reword the comment along the lines of your above 
> > > explanation.
> > 
> > 
> > I'll try to change it
> 
> is this better
> 
>     /*
>      * Although svar is freshly validated in this point, the svar->is_valid can
>      * be false, due possible accepting invalidation message inside domain
>      * check. But now, the variable, and all dependencies are locked, so we
>      * don't need to repeat validation.
>      */

Much better.

Here is an improved version:

  Although "svar" is freshly validated in this point, svar->is_valid can
  be false, if an invalidation message ws processed during the domain check.
  But the variable and all its dependencies are locked now, so we don't need
  to repeat the validation.

Yours,
Laurenz Albe




Re: Flush pgstats file during checkpoints

2024-09-02 Thread Heikki Linnakangas

It'd not be such an issue if we updated stats during recovery, but I
don't think we're doing that. Perhaps we should, which might also help
on replicas - no idea if it's feasible, though.


Stats on replicas are considered an independent thing AFAIU (scans are
counted for example in read-only queries).  If we were to do that we
may want to split stats handling between nodes in standby state and
crash recovery.  Not sure if that's worth the complication.  First,
the stats exist at node-level.


Hmm, I'm a bit disappointed this doesn't address replication. It makes 
sense that scans are counted separately on a standby, but it would be 
nice if stats like last_vacuum were propagated from primary to standbys. 
 I guess that can be handled separately later.



Reviewing v7-0001-Flush-pgstats-file-during-checkpoints.patch:

There are various race conditions where a stats entry can be leaked in
the pgstats file, i.e. a relation is dropped, but its stats entry is
retained in the stats file after a crash. In the worst case, such leaked
entries can accumulate until the stats file is manually removed, which
resets all stats again. Perhaps that's acceptable - it's still possible to
leak the actual relation file for a new relation on crash, after all,
which is much worse (I'm glad Horiguchi-san is working on that [1]).


For example:
1. BEGIN; CREATE TABLE foo (); ANALYZE foo;
2. CHECKPOINT;
3. pg_ctl restart -m immediate

This is the same scenario where we leak the relfile, but now you can 
have it with e.g. function statistics too.


Until 5891c7a8ed, there was a mechanism to garbage collect such orphaned 
entries (pgstat_vacuum()). How bad would it be to re-introduce that? Or 
can we make it more watertight so that there are no leaks?



If you do this:

pg_ctl start -D data
pg_ctl stop -D data -m immediate
pg_ctl start -D data
pg_ctl stop -D data -m immediate

You get this in the log:

2024-09-02 16:28:37.874 EEST [1397281] WARNING:  found incorrect redo 
LSN 0/160A3C8 (expected 0/160A440)


I think it's failing to flush the stats file at the end of recovery 
checkpoint.



[1] 
https://www.postgresql.org/message-id/20240901.010925.656452225144636594.horikyota.ntt%40gmail.com


--
Heikki Linnakangas
Neon (https://neon.tech)





per backend I/O statistics

2024-09-02 Thread Bertrand Drouvot
Hi hackers,

Please find attached a patch to implement $SUBJECT.

While pg_stat_io provides cluster-wide I/O statistics, this patch adds a new
pg_my_stat_io view to display "my" backend I/O statistics and a new
pg_stat_get_backend_io() function to retrieve the I/O statistics for a given
backend pid.

With this per-backend level of granularity, one could for example identify
which running backend is responsible for most of the reads, most of the extends,
and so on. The pg_my_stat_io view could also be useful to check the I/O impact
of some operations, queries, etc. in the current session.

Some remarks:

- it is split into 2 sub-patches: 0001 introduces the necessary changes to provide
the pg_my_stat_io view and 0002 adds the pg_stat_get_backend_io() function.
- the idea of having per backend I/O statistics has already been mentioned in
[1] by Andres.

Some implementation choices:

- The KIND_IO stats are still "fixed amount" ones as the maximum number of
backends is fixed.
- The statistics snapshot is made for the global stats (the aggregated ones) and
for my backend stats. The snapshot is not built for all the backend stats (that
could be memory expensive depending on the number of max connections and given
the fact that PgStat_IO is 16KB long).
- The above point means that pg_stat_get_backend_io() behaves as if
stats_fetch_consistency is set to none (each execution re-fetches counters
from shared memory).
- The above 2 points are also the reasons why the pg_my_stat_io view has been
added (its results take care of the stats_fetch_consistency setting). I think
it makes sense to rely on it in that case, while I'm not sure it would make
a lot of sense to retrieve another backend's I/O stats while honoring
stats_fetch_consistency.


[1]: 
https://www.postgresql.org/message-id/20230309003438.rectf7xo7pw5t5cj%40awork3.anarazel.de

Looking forward to your feedback,

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
>From 04c0a9ed462850a620df61e5d0dc5053717f2b26 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Thu, 22 Aug 2024 15:16:50 +
Subject: [PATCH v1 1/2] per backend I/O statistics

While pg_stat_io provides cluster-wide I/O statistics, this commit adds a new
pg_my_stat_io view to display "my" backend I/O statistics. The KIND_IO stats
are still "fixed amount" ones as the maximum number of backend is fixed.

The statistics snapshot is made for the global stats (the aggregated ones) and
for my backend stats. The snapshot is not build for all the backend stats (that
could be memory expensive depending of the number of max connections and given
the fact that PgStat_IO is 16K bytes long).

A subsequent commit will add a new pg_stat_get_backend_io() function to be
able to retrieve the I/O statistics for a given backend pid.

Bump catalog version.
---
 doc/src/sgml/config.sgml  |  4 +-
 doc/src/sgml/monitoring.sgml  | 28 +
 src/backend/catalog/system_views.sql  | 24 +++-
 src/backend/utils/activity/pgstat.c   | 10 ++-
 src/backend/utils/activity/pgstat_io.c| 75 +++
 src/backend/utils/activity/pgstat_shmem.c |  4 +-
 src/backend/utils/adt/pgstatfuncs.c   | 18 +-
 src/include/catalog/catversion.h  |  2 +-
 src/include/catalog/pg_proc.dat   |  8 +--
 src/include/pgstat.h  |  8 ++-
 src/include/utils/pgstat_internal.h   | 13 ++--
 src/test/regress/expected/rules.out   | 21 ++-
 src/test/regress/expected/stats.out   | 44 -
 src/test/regress/sql/stats.sql| 25 +++-
 14 files changed, 245 insertions(+), 39 deletions(-)
  10.7% doc/src/sgml/
   4.1% src/backend/catalog/
  30.9% src/backend/utils/activity/
   5.2% src/backend/utils/adt/
   7.9% src/include/catalog/
   3.4% src/include/utils/
  22.0% src/test/regress/expected/
  12.7% src/test/regress/sql/

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0aec11f443..2a59b97093 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8331,7 +8331,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 displayed in 
 pg_stat_database,
 
-pg_stat_io, in the output of
+pg_stat_io,
+
+pg_my_stat_io, in the output of
  when the BUFFERS option
 is used, in the output of  when
 the VERBOSE option is used, by autovacuum
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 55417a6fa9..27d2548d61 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -488,6 +488,16 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  
  
 
+ 
+  pg_my_stat_iopg_my_stat_io
+  
+   One row for each combination of context and target object containing
+   my backend I/O statistics.
+   See 
+ 

Re: Track IO times in pg_stat_io

2024-09-02 Thread Bertrand Drouvot
Hi,

On Fri, Aug 23, 2024 at 07:32:16AM +, Bertrand Drouvot wrote:
> Hi,
> 
> On Wed, Mar 08, 2023 at 04:34:38PM -0800, Andres Freund wrote:
> > On 2023-03-08 12:55:34 +0100, Drouvot, Bertrand wrote:
> > > - pg_stat_io is "global" across all sessions. So, even if one session is 
> > > doing some "testing" and needs to turn track_io_timing on, then it
> > > is even not sure it's only reflecting its own testing (as other sessions 
> > > may have turned it on too).
> > 
> > I think for 17 we should provide access to per-existing-connection 
> > pg_stat_io
> > stats, and also provide a database aggregated version. Neither should be
> > particularly hard.
> 
> FWIW, I think that would be great and plan to have a look at this (unless 
> someone
> beats me to it).

FWIW, here is the patch proposal for per backend I/O statistics [1].

[1]: 
https://www.postgresql.org/message-id/ZtXR%2BCtkEVVE/LHF%40ip-10-97-1-34.eu-west-3.compute.internal

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: PG_TEST_EXTRA and meson

2024-09-02 Thread Nazir Bilal Yavuz
Hi,

On Fri, 30 Aug 2024 at 21:36, Jacob Champion
 wrote:
>
> On Wed, Aug 28, 2024 at 8:21 AM Nazir Bilal Yavuz  wrote:
> > I do not exactly remember the reason but I think I copied the same
> > behavior as before, PG_TEST_EXTRA variable was checked in the
> > src/test/Makefile so I exported it there.
>
> Okay, give v3 a try then. This exports directly from Makefile.global.
> Since that gets pulled into a bunch of places, the scope is a lot
> wider than it used to be; I've disabled it for PGXS so it doesn't end
> up poisoning other extensions.

Patch looks good and it passes all the test cases in Ashutosh's test script.

--
Regards,
Nazir Bilal Yavuz
Microsoft




Re: long-standing data loss bug in initial sync of logical replication

2024-09-02 Thread Nitin Motiani
On Tue, Aug 20, 2024 at 4:10 PM Amit Kapila  wrote:
>
> On Thu, Aug 15, 2024 at 9:31 PM vignesh C  wrote:
> > Since we are applying invalidations to all in-progress transactions,
> > the publisher will only replicate half of the transaction data up to
> > the point of invalidation, while the remaining half will not be
> > replicated.
> > Ex:
> > Session1:
> > BEGIN;
> > INSERT INTO tab_conc VALUES (1);
> >
> > Session2:
> > ALTER PUBLICATION regress_pub1 DROP TABLE tab_conc;
> >
> > Session1:
> > INSERT INTO tab_conc VALUES (2);
> > INSERT INTO tab_conc VALUES (3);
> > COMMIT;
> >
> > After the above the subscriber data looks like:
> > postgres=# select * from tab_conc ;
> >  a
> > ---
> >  1
> > (1 row)
> >
> > You can reproduce the issue using the attached test.
> > I'm not sure if this behavior is ok. At present, we’ve replicated the
> > first record within the same transaction, but the second and third
> > records are being skipped.
> >
>
> This can happen even without a concurrent DDL if some of the tables in
> the database are part of the publication and others are not. In such a
> case inserts for publicized tables will be replicated but other
> inserts won't. Sending the partial data of the transaction isn't a
> problem to me. Do you have any other concerns that I am missing?
>

Hi,

I think that partial data replication for one table is a bigger
issue than data being sent for only a subset of the tables in
the transaction. It can lead to inconsistent data if the same row is
updated multiple times or deleted in the same transaction: if only
some of the transaction's updates are sent to the subscriber, it
might end up with data that was never visible on the publisher side.

Here is an example I tried with the patch v8-001 :

I created following 2 tables on the publisher and the subscriber :

CREATE TABLE delete_test(id int primary key, name varchar(100));
CREATE TABLE update_test(id int primary key, name varchar(100));

I added both the tables to the publication p on the publisher and
created a subscription s on the subscriber.

I run 2 sessions on the publisher and do the following :

Session 1 :
BEGIN;
INSERT INTO delete_test VALUES(0, 'Nitin');

Session 2 :
ALTER PUBLICATION p DROP TABLE delete_test;

Session 1 :
DELETE FROM delete_test WHERE id=0;
COMMIT;

After the commit there should be no new row created on the publisher.
But because the partial data was replicated, this is what the select
on the subscriber shows :

SELECT * FROM delete_test;
 id |   name
+---
  0 | Nitin
(1 row)

I don't think the above is a common use case. But it is still an
issue because the subscriber ends up with data that never existed on
the publisher.

Similar issue can be seen with an update command.

Session 1 :
BEGIN;
INSERT INTO update_test VALUES(1, 'Chiranjiv');

Session 2 :
ALTER PUBLICATION p DROP TABLE update_test;

Session 1:
UPDATE update_test SET name='Eeshan' where id=1;
COMMIT;

After the commit, this is the state on the publisher :
SELECT * FROM update_test;
  1 | Eeshan
(1 row)

While this is the state on the subscriber :
SELECT * FROM update_test;
  1 | Chiranjiv
(1 row)

I think the update during a transaction scenario might be more common
than deletion right after insertion. But both of these seem like real
issues to consider. Please let me know if I'm missing something.

Thanks & Regards
Nitin Motiani
Google




Re: Improving tracking/processing of buildfarm test failures

2024-09-02 Thread Andrew Dunstan


On 2024-09-01 Su 2:46 PM, sia kc wrote:

Hello everyone

I am a developer interested in this project. I had a little involvement
with MariaDB and now I would like to work on Postgres. I have never worked
with mailing lists, so I am not sure if this is the way I should interact.
I would like to be pointed to some tasks and documents to get started.



Do you mean you want to be involved with $subject, or that you just want 
to be involved in Postgres development generally? If the latter, then 
replying to a specific email thread is not the way to go, and the first 
thing to do is look at this wiki page 




cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com


Re: Track IO times in pg_stat_io

2024-09-02 Thread Michael Paquier
On Mon, Sep 02, 2024 at 03:00:32PM +, Bertrand Drouvot wrote:
> On Fri, Aug 23, 2024 at 07:32:16AM +, Bertrand Drouvot wrote:
>> FWIW, I think that would be great and plan to have a look at this (unless 
>> someone
>> beats me to it).
> 
> FWIW, here is the patch proposal for per backend I/O statistics [1].
> 
> [1]: 
> https://www.postgresql.org/message-id/ZtXR%2BCtkEVVE/LHF%40ip-10-97-1-34.eu-west-3.compute.internal

Cool, thanks!
--
Michael




Re: define PG_REPLSLOT_DIR

2024-09-02 Thread Michael Paquier
On Fri, Aug 30, 2024 at 12:21:29PM +, Bertrand Drouvot wrote:
> That said, I don't have a strong opinion on this one, I think that also makes
> sense to leave it as it is. Please find attached v4 doing so.

The changes in astreamer_file.c are actually wrong regarding the fact
that should_allow_existing_directory() needs to be able to work with
the branch where this code is located as well as back-branches,
because pg_basebackup from version N supports ~(N-1) versions down to
a certain version, so changing it is not right.  This is why pg_xlog
and pg_wal are both listed there.

Perhaps we should do more for the two entries in basebackup.c with the
relative paths, but I'm not sure that's worth bothering, either.  At
the end, I got no objections about the remaining pieces, so applied.

How do people feel about the suggestions to update the comments at the
end?  With the comment in relpath.h suggesting to not change that, the
current state of HEAD is fine by me.
--
Michael




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-02 Thread Tomas Vondra
On 9/2/24 01:53, Robert Haas wrote:
> On Sun, Sep 1, 2024 at 3:30 PM Tomas Vondra  wrote:
>> I don't think that's possible with hard-coded size of the array - that
>> allocates the memory for everyone. We'd need to make it variable-length,
>> and while doing those benchmarks I think we actually already have a GUC
>> for that - max_locks_per_transaction tells us exactly what we need to
>> know, right? I mean, if I know I'll need ~1000 locks, why not to make
>> the fast-path array large enough for that?
> 
> I really like this idea. I'm not sure about exactly how many fast path
> slots you should get for what value of max_locks_per_transaction, but
> coupling the two things together in some way sounds smart.
> 

I think we should keep that simple and make the cache large enough for
max_locks_per_transaction locks. That's the best information about
expected number of locks we have. If the GUC is left at the default
value, that probably means the backends need that many locks on
average. Yes, maybe there's an occasional spike in one of the backends,
but then that means other backends need fewer locks, and so there's less
contention for the shared lock table.

Of course, it's possible to construct counter-examples to this. Say a
single backend that needs a lot of these locks. But how's that different
from every other fixed-size cache with eviction?

The one argument to not tie this to max_locks_per_transaction is the
vastly different "per element" memory requirements. If you add one entry
to max_locks_per_transaction, that adds LOCK which is a whopping 152B.
OTOH one fast-path entry is ~5B, give or take. That's a pretty big
difference, and if the locks fit into the shared lock table but
you'd like to allow more fast-path locks, having to increase
max_locks_per_transaction is not great - pretty wasteful.

OTOH I'd really hate to just add another GUC and hope the users will
magically know how to set it correctly. That's pretty unlikely, IMO. I
myself wouldn't know what a good value is, I think.

But say we add a GUC and set it to -1 by default, in which case it just
inherits the max_locks_per_transaction value. And then also provide some
basic metric about this fast-path cache, so that people can tune this?

I think just knowing the "hit ratio" would be enough, i.e. counters for
how often it fits into the fast-path array, and how often we had to
promote it to the shared lock table would be enough, no?
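
To make that concrete, here is a minimal sketch (not taken from the
attached patches; FastPathLockSlotCount, fastpath_hits and
fastpath_promotions are made-up names, while max_locks_per_xact is the
existing C variable behind the GUC):

/*
 * Minimal sketch only, not from the attached patches.
 */
#include "postgres.h"
#include "port/pg_bitutils.h"
#include "storage/lock.h"

#define FP_LOCK_SLOTS_MIN	16	/* never shrink below the historic size */

static uint64 fastpath_hits;		/* lock fit into the fast-path array */
static uint64 fastpath_promotions;	/* had to fall back to the shared table */

static int
FastPathLockSlotCount(void)
{
	int			slots = Max(max_locks_per_xact, FP_LOCK_SLOTS_MIN);

	/* round up to a power of two so slot/group masking stays cheap */
	return (int) pg_nextpower2_32((uint32) slots);
}

Exposing fastpath_hits / (fastpath_hits + fastpath_promotions) somewhere
would give users the hit ratio they need to decide whether the default
sizing is enough.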

>> Of course, the consequence of this would be making PGPROC variable
>> length, or having to point to a memory allocated separately (I prefer
>> the latter option, I think). I haven't done any experiments, but it
>> seems fairly doable - of course, not sure if it might be more expensive
>> compared to compile-time constants.
> 
> I agree that this is a potential problem but it sounds like the idea
> works well enough that we'd probably still come out quite far ahead
> even with a bit more overhead.
> 

OK, I did some quick tests on this, and I don't see any regressions.

Attached are 4 patches:

1) 0001 - original patch, with some minor fixes (remove init, which is
   not necessary, that sort of thing)

2) 0002 - a bit of reworks, improving comments, structuring the macros a
   little bit better, etc. But still compile-time constants.

3) 0003 - dynamic sizing, based on max_locks_per_transaction. It's a bit
   ugly, because the size is calculated during shmem allocation - it
   should happen earlier, but good enough for PoC.

4) 0004 - introduce a separate GUC, this is mostly to allow testing of
   different values without changing max_locks_per_transaction


I've only did that on my smaller 32-core machine, but for three simple
tests it looks like this (throughput using 16 clients):

mode  test master12   34

prepared count   1460 1477 1488 14901491
  join  15556244512604425026   24237
   pgbench 148187   151192   151688   150389  152681

simple   count   1341 1351 1373 13741370
  join   4643 5439 5459 53935345
   pgbench 139763   141267   142796   141207  142600

Those are some simple benchmarks on 100 partitions, where the regular
pgbench and count(*) are expected to not be improved, and the join is
the partitioned join this thread started with. 1-4 are the attached
patches, to see the impact for each of them.

Translated to results relative to master:

mode test   1   2   3   4
-
preparedcount101%102%102%102%
 join157%167%161%156%
  pgbench102%102%101%103%
-
simple  

Re: Typos in the code and README

2024-09-02 Thread Alexander Lakhin

Hello,

12.08.2024 14:59, David Rowley wrote:

(I know Daniel mentioned he'd get to these, but the ScanDirection one
was my fault and I needed to clear that off my mind. I did a few
others while on this topic.)


Thank you, David, for working on that!

I've gathered another bunch of defects with the possible substitutions.
Please take a look:
adapated -> adapted

becasue -> because

cancelled -> canceled (introduced by 90f517821, but see 8c9da1441)

cange -> change

comand -> command

CommitTSSLRU -> CommitTsSLRU (introduced by 53c2a97a9; maybe the fix
 should be back-patched...)

connectOptions2 -> pqConnectOptions2 (see 774bcffe4)

Injections points -> Injection points

jsetate -> jsestate

LockShmemSize -> remove the sentence? (added by ec0baf949, outdated with
 a794fb068)

MaybeStartSlotSyncWorker -> LaunchMissingBackgroundProcesses (the logic to
 start B_SLOTSYNC_WORKER moved from the former to the latter function with
 3354f8528)

multixact_member_buffer -> multixact_member_buffers

per_data_data -> per_buffer_data (see code below the comment; introduced by
 b5a9b18cd)

per_buffer_private -> remove the function declaration? (the duplicate
 declaration was added by a858be17c)

performancewise -> performance-wise? (coined by a7f107df2)

pgstat_add_kind -> pgstat_register_kind (see 7949d9594)

pg_signal_autovacuum -> pg_signal_autovacuum_worker (see d2b74882c)

recoveery -> recovery

RegisteredWorker -> RegisteredBgWorker

RUNNING_XACT -> RUNNING_XACTS

sanpshot -> snapshot

TypeEntry -> TypeCacheEntry (align with AttoptCacheEntry, from the same
 commit 40064a8ee)

The corresponding patch is attached for your convenience.

Best regards,
Alexander
diff --git a/contrib/test_decoding/specs/skip_snapshot_restore.spec b/contrib/test_decoding/specs/skip_snapshot_restore.spec
index 3f1fb6f02c7..7b35dbcc9f3 100644
--- a/contrib/test_decoding/specs/skip_snapshot_restore.spec
+++ b/contrib/test_decoding/specs/skip_snapshot_restore.spec
@@ -39,7 +39,7 @@ step "s2_get_changes_slot0" { SELECT data FROM pg_logical_slot_get_changes('slot
 # serializes consistent snapshots to the disk at LSNs where are before
 # s0-transaction's commit. After s0-transaction commits, "s1_init" resumes but
 # must not restore any serialized snapshots and will reach the consistent state
-# when decoding a RUNNING_XACT record generated after s0-transaction's commit.
+# when decoding a RUNNING_XACTS record generated after s0-transaction's commit.
 # We check if the get_changes on 'slot1' will not return any s0-transaction's
 # changes as its confirmed_flush_lsn will be after the s0-transaction's commit
 # record.
diff --git a/doc/src/sgml/xfunc.sgml b/doc/src/sgml/xfunc.sgml
index 9bc23a9a938..af7864a1b5b 100644
--- a/doc/src/sgml/xfunc.sgml
+++ b/doc/src/sgml/xfunc.sgml
@@ -3891,8 +3891,8 @@ static const PgStat_KindInfo custom_stats = {
  it with pgstat_register_kind and a unique ID used to
  store the entries related to this type of statistics:
 
-extern PgStat_Kind pgstat_add_kind(PgStat_Kind kind,
-   const PgStat_KindInfo *kind_info);
+extern PgStat_Kind pgstat_register_kind(PgStat_Kind kind,
+const PgStat_KindInfo *kind_info);
 
  While developing a new extension, use
  PGSTAT_KIND_EXPERIMENTAL for
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index a03d56541d0..8c37d7eba76 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2017,7 +2017,7 @@ check_multixact_offset_buffers(int *newval, void **extra, GucSource source)
 }
 
 /*
- * GUC check_hook for multixact_member_buffer
+ * GUC check_hook for multixact_member_buffers
  */
 bool
 check_multixact_member_buffers(int *newval, void **extra, GucSource source)
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 91f0fd6ea3e..b2457f121a7 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -382,7 +382,7 @@ RefreshMatViewByOid(Oid matviewOid, bool is_create, bool skipData,
 	 * command tag is left false in cmdtaglist.h. Otherwise, the change of
 	 * completion tag output might break applications using it.
 	 *
-	 * When called from CREATE MATERIALIZED VIEW comand, the rowcount is
+	 * When called from CREATE MATERIALIZED VIEW command, the rowcount is
 	 * displayed with the command tag CMDTAG_SELECT.
 	 */
 	if (qc)
diff --git a/src/backend/commands/waitlsn.c b/src/backend/commands/waitlsn.c
index d9cf9e7d75e..d7065726749 100644
--- a/src/backend/commands/waitlsn.c
+++ b/src/backend/commands/waitlsn.c
@@ -369,7 +369,7 @@ pg_wal_replay_wait(PG_FUNCTION_ARGS)
 	 */
 	InvalidateCatalogSnapshot();
 
-	/* Give up if there is still an active or registered sanpshot. */
+	/* Give up if there is still an active or registered snapshot. */
 	if (GetOldestSnapshot())
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
diff --git a/src/backe

Re: [PATCH] Add roman support for to_number function

2024-09-02 Thread Maciek Sakrejda
Thanks for the contribution.

I took a look at the patch, and it works as advertised. It's too late
for the September commitfest, but I took the liberty of registering
your patch for the November CF [1]. In the course of that, I found an
older thread proposing this feature seven years ago [2]. That patch
was returned with feedback and (as far as I can tell) was not
followed up on by the author. You may want to review that thread for
feedback; I won't repeat it here.

On Fri, Aug 30, 2024 at 12:22 AM Hunaid Sohail  wrote:
>While looking at formatting.c file, I noticed a TODO about "add support for 
>roman number to standard number conversion"  
>(https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/formatting.c#L52)

Your patch should also remove the TODO =)

> - Is it okay for the function to handle Roman numerals in a case-insensitive 
> way? (e.g., 'XIV', 'xiv', and 'Xiv' are all seen as 14).

The patch in the thread I linked also took a case-insensitive
approach. I did not see any objections to that, and it seems
reasonable to me as well.

> - How should we handle Roman numerals with leading or trailing spaces, like ' 
> XIV' or 'MC '? Should we trim the spaces, or would it be better to throw an 
> error in such cases?

I thought we could reference existing to_number behavior here, but
after playing with it a bit, I'm not really sure what that is:

-- single leading space
maciek=# select to_number(' 1', '9');
 to_number
---
 1
(1 row)

-- two leading spaces
maciek=# select to_number('  1', '9');
ERROR:  invalid input syntax for type numeric: " "
-- two leading spaces and template pattern with a decimal
maciek=# select to_number('  1', '9D9');
 to_number
---
 1
(1 row)

Separately, I also noticed some unusual Roman representations work
with your patch:

postgres=# select to_number('viv', 'RN');
 to_number
---
 9
(1 row)

Is this expected? In contrast, some fairly common alternative
representations don't work:

postgres=# select to_number('', 'RN');
ERROR:  invalid roman numeral

I know this is expected, but is this the behavior we want? If so, we
probably want to reject the former case, too. If not, maybe that one
is okay, too.
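
One cheap way to pin down "canonical only" semantics is to re-encode the
parsed value and compare it with the input. The standalone sketch below is
only an illustration of that acceptance rule, not code from the patch; it
stays case-insensitive and rejects 'viv':

/* Standalone sketch: accept a Roman numeral only if it is in canonical form. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* write the canonical Roman spelling of n (1..3999) into buf */
static void
to_canonical_roman(int n, char *buf)
{
	static const int	vals[] = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
	static const char  *syms[] = {"M", "CM", "D", "CD", "C", "XC", "L", "XL",
								  "X", "IX", "V", "IV", "I"};

	buf[0] = '\0';
	for (int i = 0; i < 13; i++)
		while (n >= vals[i])
		{
			strcat(buf, syms[i]);
			n -= vals[i];
		}
}

/* return the value 1..3999, or -1 if the input is not canonical */
static int
parse_strict_roman(const char *input)
{
	char		upper[32];
	char		canon[32];
	size_t		len = strlen(input);

	if (len == 0 || len >= sizeof(upper))
		return -1;
	for (size_t i = 0; i < len; i++)
		upper[i] = toupper((unsigned char) input[i]);
	upper[len] = '\0';

	for (int n = 1; n <= 3999; n++)
	{
		to_canonical_roman(n, canon);
		if (strcmp(canon, upper) == 0)
			return n;
	}
	return -1;
}

int
main(void)
{
	printf("XIV -> %d\n", parse_strict_roman("XIV"));	/* 14 */
	printf("xiv -> %d\n", parse_strict_roman("xiv"));	/* 14, case-insensitive */
	printf("viv -> %d\n", parse_strict_roman("viv"));	/* -1, rejected */
	return 0;
}

Brute-forcing all 3999 values is of course not what formatting.c should
do, but it makes the accepted set unambiguous for this discussion.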

I know I've probably offered more questions than answers, but I hope
finding the old thread here is useful.

Thanks,
Maciek

[1]: https://commitfest.postgresql.org/50/5233/
[2]: https://www.postgresql.org/message-id/flat/CAGMVOduAJ9wKqJXBYnmFmEetKxapJxrG3afUwpbOZ6n_dWaUnA%40mail.gmail.com




Re: JIT: Remove some unnecessary instructions.

2024-09-02 Thread Andreas Karlsson

On 9/2/24 4:23 AM, Xing Guo wrote:

Thanks for testing it! I spotted another unnecessary store instruction
and added it in my V2 patch.


Another well-spotted unnecessary store. Nice!

I think this patch is ready for committer. It is simple and pretty 
obviously correct.


Andreas





Re: JIT: Remove some unnecessary instructions.

2024-09-02 Thread Andreas Karlsson

On 9/2/24 9:06 PM, Andreas Karlsson wrote:

On 9/2/24 4:23 AM, Xing Guo wrote:

Thanks for testing it! I spotted another unnecessary store instruction
and added it in my V2 patch.


Another well-spotted unnecessary store. Nice!

I think this patch is ready for committer. It is simple and pretty 
obviously correct.


Oh, and please add your patch to the commitfest app so it is not lost.

https://commitfest.postgresql.org/50/

Andreas





Re: query_id, pg_stat_activity, extended query protocol

2024-09-02 Thread Andrei Lepikhov

On 14/8/2024 23:05, Imseih (AWS), Sami wrote:

I think the testing discussion should be moved to a different thread.
What do you think?

See v4.

0001 deals with reporting queryId in exec_execute_message and 
exec_bind_message.

0002 deals with reporting queryId after a cache invalidation.

There are no tests as this requires more discussion in a separate thread(?)

At first glance, these patches look good.
But something seems slightly muddled here:
queryId should be initialised at the top-level query. At the same time,
the RevalidateCachedQuery routine can change this value when the query
tree is re-validated.
One could argue that this routine can't be called from a non-top-level query
right now, except via SPI. Yes, but what about extensions or future usage?


--
regards, Andrei Lepikhov





thread-safety: strerror_r()

2024-09-02 Thread Peter Eisentraut
There are only a few (not necessarily thread-safe) strerror() calls in 
the backend; most other potential users use %m in a format string.


In two cases, the reason for using strerror() was that we needed to 
print the error message twice, and so errno has to be reset for the 
second time.  And/or some of this code is from before snprintf() gained 
%m support.  This can easily be simplified now.


The other is a workaround for OpenSSL that we have already handled in an 
equivalent way in libpq.


(And there is one in postmaster.c, but that one is before forking.)

I think we can apply these patches now to check this off the list of 
not-thread-safe functions to check.From c2ce542d61d5e86ab138b72e2e0d74fdac589f04 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 2 Sep 2024 11:02:22 +0200
Subject: [PATCH 1/2] Remove a couple of strerror() calls

Change to using %m in the error message string.  We need to be a bit
careful here to preserve errno until we need to print it.

This change avoids the use of not-thread-safe strerror() and unifies
some error message strings, and maybe makes the code appear more
consistent.
---
 src/backend/libpq/hba.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index 75d588e36a1..2fd96a71294 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -624,8 +624,11 @@ open_auth_file(const char *filename, int elevel, int depth,
 errmsg("could not open file \"%s\": %m",
filename)));
if (err_msg)
-   *err_msg = psprintf("could not open file \"%s\": %s",
-   filename, strerror(save_errno));
+   {
+   errno = save_errno;
+   *err_msg = psprintf("could not open file \"%s\": %m",
+   filename);
+   }
/* the caller may care about some specific errno */
errno = save_errno;
return NULL;
@@ -762,8 +765,9 @@ tokenize_auth_file(const char *filename, FILE *file, List **tok_lines,
ereport(elevel,
(errcode_for_file_access(),
 errmsg("could not read file \"%s\": %m", filename)));
-   err_msg = psprintf("could not read file \"%s\": %s",
-  filename, strerror(save_errno));
+   errno = save_errno;
+   err_msg = psprintf("could not read file \"%s\": %m",
+  filename);
break;
}
 
-- 
2.46.0

From a2ad11452ac8c8036981bcfa5777ad7c5068aa4a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 2 Sep 2024 11:08:13 +0200
Subject: [PATCH 2/2] Avoid strerror()

Replace one strerror() with strerror_r(), mirroring the equivalent
code in frontend libpq.
---
 src/backend/libpq/be-secure-openssl.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/backend/libpq/be-secure-openssl.c b/src/backend/libpq/be-secure-openssl.c
index 60cf5d16e74..a04a514bff9 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -1456,7 +1456,7 @@ static const char *
 SSLerrmessage(unsigned long ecode)
 {
const char *errreason;
-   static char errbuf[36];
+   static char errbuf[128];
 
if (ecode == 0)
return _("no SSL error reported");
@@ -1473,7 +1473,10 @@ SSLerrmessage(unsigned long ecode)
 */
 #ifdef ERR_SYSTEM_ERROR
if (ERR_SYSTEM_ERROR(ecode))
-   return strerror(ERR_GET_REASON(ecode));
+   {
+   strerror_r(ERR_GET_REASON(ecode), errbuf, sizeof(errbuf));
+   return errbuf;
+   }
 #endif
 
/* No choice but to report the numeric ecode */
-- 
2.46.0



Re: thread-safety: strerror_r()

2024-09-02 Thread Tom Lane
Peter Eisentraut  writes:
> I think we can apply these patches now to check this off the list of 
> not-thread-safe functions to check.

+1 for the first patch.  I'm less happy with

-   static char errbuf[36];
+   static char errbuf[128];

As a minor point, shouldn't this be

+   static char errbuf[PG_STRERROR_R_BUFLEN];

But the bigger issue is that the use of a static buffer makes
this not thread-safe, so having it use strerror_r to fill that
buffer is just putting lipstick on a pig.  If we really want
to make this thread-ready, we need to adopt the approach used
in libpq's fe-secure-openssl.c, where callers have to free the
buffer later.  Or maybe we could just palloc the result, and
trust that it's not in a long-lived context?
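
For illustration, a palloc-based variant could look like the sketch below
(assumptions: the same fallback messages as the current function and
PG_STRERROR_R_BUFLEN for the buffer size; this is not a finished patch):

/* Sketch only: return a freshly allocated string, no static state. */
static char *
SSLerrmessage_palloc(unsigned long ecode)
{
	const char *errreason;

	if (ecode == 0)
		return pstrdup(_("no SSL error reported"));

	errreason = ERR_reason_error_string(ecode);
	if (errreason != NULL)
		return pstrdup(errreason);

#ifdef ERR_SYSTEM_ERROR
	if (ERR_SYSTEM_ERROR(ecode))
	{
		char		buf[PG_STRERROR_R_BUFLEN];

		strerror_r(ERR_GET_REASON(ecode), buf, sizeof(buf));
		return pstrdup(buf);
	}
#endif

	return psprintf(_("SSL error code %lu"), ecode);
}

That keeps the function free of shared state, at the cost of trusting the
caller's memory context, as noted above.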

regards, tom lane




Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-09-02 Thread Jehan-Guillaume de Rorthais
Hi,

On Tue, 20 Aug 2024 23:09:27 -0400
Alvaro Herrera  wrote:

> On 2024-Aug-20, Jehan-Guillaume de Rorthais wrote:
> 
> > I'm back on this issue as well. I start poking at this patch to review it,
> > test it, challenge it and then report here.
> > 
> > I'll try to check if some other issues might have lost/forgot on they way as
> > well.
> 
> Thanks, much appreciated, looking forward to your feedback.

Sorry, it took me a while to come back to you on this topic. It has been hard to
untangle subjects, reproductions and patch…

There's three distinct issues/thread:

* Constraint & trigger catalog cleanup [1] (this thread)
* FK broken after DETACH [2]
* Maintenance consideration about self referencing FK between partitions [3]

0. Splitting in two commits

  Your patch addresses two bugs:

* one for the constraint & trigger catalog cleanup;
* one for the FK broken after DETACH.

  These issues are unrelated, therefore I am wondering if it would be better
  to split their resolution into two different patches.

  Last year, I reported them in two different threads [1][2]. The first with
  implementation consideration, the second with a demo/proposal/draft fix.

  Unfortunately, this discussion about the first bug slipped to the second one
  when Tender stumbled on this bug as well and reported it. But, both bugs can
  be triggered independently, and have distinct fixes.

  Finally, splitting the patch might help setting finer patch co-authoring. I
  know my patch for [2] was a draft and somewhat trivial, but I spent a fair
  amount of time reporting it and producing a draft patch, so I was wondering if
  it would be a candidate for a co-author flag on this (small, humble and
  refactored by you) patch?

  I'm definitely not involved (yet) in the second part though.

1. Constraint & trigger catalog cleanup [1]

  I have been focusing on the current master branch and haven't taken into
  consideration backpatching related issues yet.

  When I first studied this bug and reported it, I held on writing a patch
  because it seemed it would duplicate some existing code. I wrote:

  > I poked around DetachPartitionFinalize() to try to find a way to fix this,
  > but it looks like it would duplicate a bunch of code from other code path
  > (eg. from CloneFkReferenced).

  My proposal was to clean everything related to the old FK and use some
  existing code path to create a fresh and cleaner one. This requires some
  refactoring in existing code, but we would win a common path of code between
  create/attach/detach, a cleaner catalog and easier code maintenance.

  I've finally been able to write a PoC that implement this by calling
  addFkRecurseReferenced() from DetachPartitionFinalize(). I can't join
  it here because it is currently an ugly draft and I still have some work
  to do. But I would really like to have a little more time (one or two days) to
  explore this avenue further before you commit yours, if you don't mind? Or
  maybe you already have considered this avenue and rejected it?


2. FK broken after DETACH [2]

  Comparing your patch to my draft from [2], I just have a question about the
  refactoring.

  Fencing the constraint/trigger removal inside a conditional
  RELKIND_PARTITIONED_TABLE block of code was obvious. It avoids some useless
  catalog scan compared to my draft patch.

  Also, the "contype == CONSTRAINT_FOREIGN" I had sounds safe to remove.

  However, is it clean/light enough to add the "conparentid == fk->conoid" in
  the scan key as I did? I'm not sure it saves anything else but the small
  conditional block you inserted inside the loop, but I wonder if there's a
  serious concern about this anyway?

  Last, considering the tests, I think we should add some rows in the tables,
  to make sure the FK is correctly enforced after DETACH. Something like:

CREATE SCHEMA fkpart12
  CREATE TABLE fk_p ( id bigint PRIMARY KEY ) PARTITION BY list (id)
  CREATE TABLE fk_p_1 PARTITION OF fk_p FOR VALUES IN (1)
  CREATE TABLE fk_p_2 PARTITION OF fk_p FOR VALUES IN (2)
  CREATE TABLE fk_r_1 ( id bigint PRIMARY KEY, p_id bigint NOT NULL)
  CREATE TABLE fk_r_2 ( id bigint PRIMARY KEY, p_id bigint NOT NULL)
  CREATE TABLE fk_r   ( id bigint PRIMARY KEY, p_id bigint NOT NULL,
 FOREIGN KEY (p_id) REFERENCES fk_p (id)
  ) PARTITION BY list (id);
SET search_path TO fkpart12;

INSERT INTO fk_p VALUES (1);

ALTER TABLE fk_r ATTACH PARTITION fk_r_2 FOR VALUES IN (2);

ALTER TABLE fk_r ATTACH PARTITION fk_r_1 FOR VALUES IN (1);
\d fk_r_1

INSERT INTO fk_r VALUES (1,1);

ALTER TABLE fk_r DETACH PARTITION fk_r_1;
\d fk_r_1

INSERT INTO fk_r_1 VALUES (2,2); -- fails as EXPECTED
DELETE FROM fk_p; -- should fail but was buggy

ALTER TABLE fk_r ATTACH PARTITION fk_r_1 FOR VALUES IN (1);
\d fk_r_1


3. Self referencing FK between partitions [3]

  You added to your commit message:

verify: 20230707175859.17c91538@karst

  I'm n

Re: Partitioned tables and [un]loggedness

2024-09-02 Thread Michael Paquier
On Thu, Aug 29, 2024 at 09:49:44AM -0500, Nathan Bossart wrote:
> IMHO continuing to allow partitioned tables to be marked UNLOGGED just
> preserves the illusion that it does something.  An ERROR could help dispel
> that misconception.

Okay.  This is going to be disruptive if we do nothing about pg_dump,
unfortunately.  How about tweaking dumpTableSchema() so as we'd never
issue UNLOGGED for a partitioned table?  We could filter that out as
there is tbinfo->relkind.
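
Roughly something like this in dumpTableSchema() (a sketch only;
qualified_name stands in for the existing name-building logic):

/*
 * Rough sketch.  Never emit UNLOGGED for a partitioned table, even if
 * relpersistence says 'u' in the source database.
 */
const char *persistence = "";

if (tbinfo->relpersistence == RELPERSISTENCE_UNLOGGED &&
	tbinfo->relkind != RELKIND_PARTITIONED_TABLE)
	persistence = "UNLOGGED ";

appendPQExpBuffer(q, "CREATE %sTABLE %s",
				  persistence,
				  qualified_name);	/* placeholder for the existing logic */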
--
Michael




Re: Flush pgstats file during checkpoints

2024-09-02 Thread Michael Paquier
On Mon, Sep 02, 2024 at 05:08:03PM +0300, Heikki Linnakangas wrote:
> Hmm, I'm a bit disappointed this doesn't address replication. It makes sense
> that scans are counted separately on a standby, but it would be nice if
> stats like last_vacuum were propagated from primary to standbys.  I guess
> that can be handled separately later.

Yes, it's not something that I'm planning to tackle for this thread.
Speaking of which, the design I had in mind for this area was not
"that" complicated:
- Add a new RMGR for all the stats.
- Add a first callback for stats kinds for WAL inserts, giving to each
stats the possibility to pass down data inserted to the record, as we
want to replicate a portion of the data depending on the kind dealt
with.
- Add a second callback for recovery, called depending on the kind ID.

I have not looked into the details yet, but stats to replicate should
be grouped in a single record on transaction commit or depending on
the flush timing for fixed-numbered stats.  Or we should just add them
in commit records?
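
Just to sketch the shape of that idea (nothing below exists in the tree;
the names are invented for discussion):

/* Invented callback shapes for the replication design described above. */

/* let a stats kind append its data to the record being built */
typedef void (*pgstat_build_replication_data_cb) (StringInfo buf);

/* apply one replicated entry of this kind during recovery */
typedef void (*pgstat_redo_cb) (XLogReaderState *record);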

> Reviewing v7-0001-Flush-pgstats-file-during-checkpoints.patch:
> 
> There are various race conditions where a stats entry can be leaked in the
> pgstats file. I.e. relation is dropped, but its stats entry is retained in
> the stats file after crash. In the worst case, such leaked entries can
> accumulate until the stats file is manually removed, which resets all stats
> again. Perhaps that's acceptable - it's still possible leak the actual
> relation file for a new relation on crash, after all, which is much worse
> (I'm glad Horiguchi-san is working on that [1]).

Yeah, that's not an easy issue.  We don't really have any protection
against that now, either.  Backends can also refer to stats entries
in their shutdown callback that have been dropped concurrently.  See
some details about that at https://commitfest.postgresql.org/49/5045/.

> Until 5891c7a8ed, there was a mechanism to garbage collect such orphaned
> entries (pgstat_vacuum()). How bad would it be to re-introduce that? Or can
> we make it more watertight so that there are no leaks?

Not sure about this part, TBH.  Doing that again in autovacuum does
not excite me much as it has a cost.

> I think it's failing to flush the stats file at the end of recovery
> checkpoint.

Missed that, oops.  I'll double-check this area.
--
Michael




Re: Partitioned tables and [un]loggedness

2024-09-02 Thread Nathan Bossart
On Tue, Sep 03, 2024 at 09:22:58AM +0900, Michael Paquier wrote:
> On Thu, Aug 29, 2024 at 09:49:44AM -0500, Nathan Bossart wrote:
>> IMHO continuing to allow partitioned tables to be marked UNLOGGED just
>> preserves the illusion that it does something.  An ERROR could help dispel
>> that misconception.
> 
> Okay.  This is going to be disruptive if we do nothing about pg_dump,
> unfortunately.  How about tweaking dumpTableSchema() so as we'd never
> issue UNLOGGED for a partitioned table?  We could filter that out as
> there is tbinfo->relkind.

That's roughly what I had in mind.

-- 
nathan




Re: macOS prefetching support

2024-09-02 Thread Thomas Munro
On Mon, Aug 19, 2024 at 1:35 AM Peter Eisentraut  wrote:
> On 17.08.24 00:01, Thomas Munro wrote:
> > I think that's fine.  I don't really like the word "prefetch", could
> > mean many different things.  What about "requires OS support for
> > issuing read-ahead advice", which uses a word that appears in both of
> > those interfaces?
>
> I like that term.

A couple of other places still use the old specific terminology.  PSA.
From 287a56f8d177e56983d68cdd7201866ba119c130 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Tue, 3 Sep 2024 13:30:22 +1200
Subject: [PATCH] Standardize "read-ahead advice" terminology.

Commit 6654bb920 added macOS's equivalent of POSIX_FADV_WILLNEED,
and adjusted a few explicit references to posix_fadvise to use a more
general name for the concept.  Update some remaining references.

Reviewed-by:
Discussion: https://postgr.es/m/0827edec-1317-4917-a186-035eb1e3241d%40eisentraut.org
---
 src/backend/access/transam/xlogprefetcher.c |  2 +-
 src/backend/storage/aio/read_stream.c   | 17 +
 2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlogprefetcher.c b/src/backend/access/transam/xlogprefetcher.c
index 3cb698d3bcb..3acaaea5b70 100644
--- a/src/backend/access/transam/xlogprefetcher.c
+++ b/src/backend/access/transam/xlogprefetcher.c
@@ -1083,7 +1083,7 @@ check_recovery_prefetch(int *new_value, void **extra, GucSource source)
 #ifndef USE_PREFETCH
 	if (*new_value == RECOVERY_PREFETCH_ON)
 	{
-		GUC_check_errdetail("\"recovery_prefetch\" is not supported on platforms that lack posix_fadvise().");
+		GUC_check_errdetail("\"recovery_prefetch\" is not supported on platforms that lack support for issuing read-ahead advice.");
 		return false;
 	}
 #endif
diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
index 93cdd35fea0..aae92e6b1f8 100644
--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -24,16 +24,17 @@
  * already.  There is no benefit to looking ahead more than one block, so
  * distance is 1.  This is the default initial assumption.
  *
- * B) I/O is necessary, but fadvise is undesirable because the access is
- * sequential, or impossible because direct I/O is enabled or the system
- * doesn't support fadvise.  There is no benefit in looking ahead more than
+ * B) I/O is necessary, but read-ahead advice is undesirable because the access
+ * is sequential and we can rely on the kernel's read-ahead heuristics, or
+ * impossible because direct I/O is enabled, or the system doesn't support
+ * read-ahead advice.  There is no benefit in looking ahead more than
  * io_combine_limit, because in this case the only goal is larger read system
- * calls.  Looking further ahead would pin many buffers and perform
- * speculative work looking ahead for no benefit.
+ * calls.  Looking further ahead would pin many buffers and perform speculative
+ * work for no benefit.
  *
- * C) I/O is necessary, it appears random, and this system supports fadvise.
- * We'll look further ahead in order to reach the configured level of I/O
- * concurrency.
+ * C) I/O is necessary, it appears to be random, and this system supports
+ * read-ahead advice.  We'll look further ahead in order to reach the
+ * configured level of I/O concurrency.
  *
  * The distance increases rapidly and decays slowly, so that it moves towards
  * those levels as different I/O patterns are discovered.  For example, a
-- 
2.46.0



Add callback in pgstats for backend initialization

2024-09-02 Thread Michael Paquier
Hi all,

Currently, the backend-level initialization of pgstats happens in
pgstat_initialize(), where we are using a shortcut for the WAL stats,
with pgstat_init_wal().

I'd like to propose to add a callback for that, so as in-core or
custom pgstats kinds can assign custom actions when initializing a
backend.  The use-case I had for this one are pretty close to what we
do for WAL, where we would rely on a difference of state depending on
what may have existed before reaching the initialization path.  So
this can offer more precision.  Another case, perhaps less
interesting, is to be able to initialize some static backend state.

I wanted to get that sent for the current commit fest, but did not get
back to it in time.  Anyway, here it is.  This gives the simple patch
attached.
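
As an illustration, a custom fixed-numbered kind could use the callback to
capture a per-backend baseline at startup. Everything named mystats_*
below is hypothetical, and most required PgStat_KindInfo fields are
omitted:

/* Hypothetical example, not part of the patch. */
static PgStat_MyStats mystats_baseline;		/* made-up local baseline */

static void
mystats_init_backend_cb(void)
{
	/* remember the state that later flushes will be diffed against */
	mystats_baseline = mystats_read_current();	/* made-up helper */
}

static const PgStat_KindInfo mystats_kind_info = {
	.name = "mystats",
	.fixed_amount = true,
	/* shared_size / shared_data_off / shared_data_len elided */
	.init_backend_cb = mystats_init_backend_cb,
	/* init_shmem_cb, reset_all_cb, snapshot_cb also needed */
};

Such a kind would be registered with pgstat_register_kind() from an
extension's _PG_init(), like any custom statistics kind.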

Thanks,
--
Michael
From 1e41a5155fb689f1d4ae8a5cfb244b824e913bae Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 3 Sep 2024 10:38:15 +0900
Subject: [PATCH] Add callback for backend initialization in pgstats

This is currently used by the WAL stats, and is now made available for
others as well.
---
 src/include/utils/pgstat_internal.h |  7 +++
 src/backend/utils/activity/pgstat.c | 14 --
 src/backend/utils/activity/pgstat_wal.c |  2 +-
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index fb132e439d..53f7d5c98c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -229,6 +229,12 @@ typedef struct PgStat_KindInfo
 	 */
 	uint32		pending_size;
 
+	/*
+	 * Perform custom actions when initializing a backend (standalone or under
+	 * postmaster). Optional.
+	 */
+	void		(*init_backend_cb) (void);
+
 	/*
 	 * For variable-numbered stats: flush pending stats. Required if pending
 	 * data is used.
@@ -676,6 +682,7 @@ extern bool pgstat_flush_wal(bool nowait);
 extern void pgstat_init_wal(void);
 extern bool pgstat_have_pending_wal(void);
 
+extern void pgstat_wal_init_backend_cb(void);
 extern void pgstat_wal_init_shmem_cb(void *stats);
 extern void pgstat_wal_reset_all_cb(TimestampTz ts);
 extern void pgstat_wal_snapshot_cb(void);
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b2ca3f39b7..3b6983ff5d 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -441,6 +441,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
 		.shared_data_off = offsetof(PgStatShared_Wal, stats),
 		.shared_data_len = sizeof(((PgStatShared_Wal *) 0)->stats),
 
+		.init_backend_cb = pgstat_wal_init_backend_cb,
 		.init_shmem_cb = pgstat_wal_init_shmem_cb,
 		.reset_all_cb = pgstat_wal_reset_all_cb,
 		.snapshot_cb = pgstat_wal_snapshot_cb,
@@ -604,10 +605,19 @@ pgstat_initialize(void)
 
 	pgstat_attach_shmem();
 
-	pgstat_init_wal();
-
 	pgstat_init_snapshot_fixed();
 
+	/* Backend initialization callbacks */
+	for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
+	{
+		const PgStat_KindInfo *kind_info = pgstat_get_kind_info(kind);
+
+		if (kind_info == NULL || kind_info->init_backend_cb == NULL)
+			continue;
+
+		kind_info->init_backend_cb();
+	}
+
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e2a3f6b865..8c19c3f2fd 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -138,7 +138,7 @@ pgstat_flush_wal(bool nowait)
 }
 
 void
-pgstat_init_wal(void)
+pgstat_wal_init_backend_cb(void)
 {
 	/*
 	 * Initialize prevWalUsage with pgWalUsage so that pgstat_flush_wal() can
-- 
2.45.2





Re: GUC names in messages

2024-09-02 Thread Peter Smith
Hi.

The cfbot was reporting my patches needed to be rebased.

Here is the rebased patch set v10*. Everything is the same as before
except now there are only 7 patches instead of 8. The previous v9-0001
("bool") patch no longer exists because those changes are now already
present in HEAD.

I hope these might be pushed soon to avoid further rebasing.

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v10-0001-Add-quotes-for-GUCs-int.patch
Description: Binary data


v10-0004-Add-quotes-for-GUCs-enum.patch
Description: Binary data


v10-0003-Add-quotes-for-GUCs-string.patch
Description: Binary data


v10-0002-Add-quotes-for-GUCs-real.patch
Description: Binary data


v10-0005-GUC-names-fix-case-intervalstyle.patch
Description: Binary data


v10-0006-GUC-names-fix-case-datestyle.patch
Description: Binary data


v10-0007-GUC-names-make-common-translatable-messages.patch
Description: Binary data


Re: [BUG] Fix DETACH with FK pointing to a partitioned table fails

2024-09-02 Thread Tender Wang
Jehan-Guillaume de Rorthais  wrote on Tue, Sep 3, 2024 at 05:02:

> Hi,
>
> On Tue, 20 Aug 2024 23:09:27 -0400
> Alvaro Herrera  wrote:
>
> > On 2024-Aug-20, Jehan-Guillaume de Rorthais wrote:
> >
> > > I'm back on this issue as well. I start poking at this patch to review
> it,
> > > test it, challenge it and then report here.
> > >
> > > I'll try to check if some other issues might have lost/forgot on they
> way as
> > > well.
> >
> > Thanks, much appreciated, looking forward to your feedback.
>
> Sorry, it took me a while to come back to you on this topic. It has been
> hard to
> untangle subjects, reproductions and patch…
>
> There's three distinct issues/thread:
>
> * Constraint & trigger catalog cleanup [1] (this thread)
> * FK broken after DETACH [2]
> * Maintenance consideration about self referencing FK between partitions
> [3]
>

The third issue has been fixed, and the code has been pushed.  Because of my
misunderstanding, it should not be listed here.


> 0. Splitting in two commits
>
>   Your patch addresses two bugs:
>
> * one for the constraint & trigger catalog cleanup;
> * one for the FK broken after DETACH.
>
>   These issues are unrelated, therefore I am wondering if it would be
> better
>   to split their resolution in two different patches.
>
>   Last year, I reported them in two different threads [1][2]. The first
> with
>   implementation consideration, the second with a demo/proposal/draft fix.
>
>   Unfortunately, this discussion about the first bug slipped to the second
> one
>   when Tender stumbled on this bug as well and reported it. But, both bugs
> can
>   be triggered independently, and have distinct fixes.
>

It's OK that these two issues are fixed together, because the current code
doesn't handle this case well
when the referenced side is a partitioned table.


>   Finally, splitting the patch might help setting finer patch
> co-authoring. I
>   know my patch for [2] was a draft and somewhat trivial, but I spend a
> fair
>   amount of time to report, then produce a draft patch, so I was wondering
> if
>   it would be candidate to a co-author flag on this (small, humble and
>   refactored by you) patch?
>
>   I'm definitely not involved (yet) in the second part though.
>
> 1. Constraint & trigger catalog cleanup [1]
>
>   I have been focusing on the current master branch and haven't taken into
>   consideration backpatching related issues yet.
>
>   When I first studied this bug and reported it, I held on writing a patch
>   because it seemed it would duplicate some existing code. I wrote:
>
>   > I poked around DetachPartitionFinalize() to try to find a way to fix
> this,
>   > but it looks like it would duplicate a bunch of code from other code
> path
>   > (eg. from CloneFkReferenced).
>
>   My proposal was to clean everything related to the old FK and use some
>   existing code path to create a fresh and cleaner one. This requires some
>   refactoring in existing code, but we would win a common path of code
> between
>   create/attach/detach, a cleaner catalog and easier code maintenance.
>
>   I've finally been able to write a PoC that implement this by calling
>   addFkRecurseReferenced() from DetachPartitionFinalize(). I can't join
>   it here because it is currently an ugly draft and I still have some work
>   to do. But I would really like to have a little more time (one or two
> days) to
>   explore this avenue further before you commit yours, if you don't mind?
> Or
>   maybe you already have considered this avenue and rejected it?
>
>
> 2. FK broken after DETACH [2]
>
>   Comparing your patch to my draft from [2], I just have a question about
> the
>   refactoring.
>
>   Fencing the constraint/trigger removal inside a conditional
>   RELKIND_PARTITIONED_TABLE block of code was obvious. It avoids some
> useless
>   catalog scan compared to my draft patch.
>
>   Also, the "contype == CONSTRAINT_FOREIGN" I had sounds safe to remove.
>
>   However, is it clean/light enough to add the "conparentid == fk->conoid"
> in
>   the scan key as I did? I'm not sure it saves anything else but the small
>   conditional block you inserted inside the loop, but I wonder if there's a
>   serious concern about this anyway?
>
>   Last, considering the tests, I think we should add some rows in the
> tables,
>   to make sure the FK is correctly enforced after DETACH. Something like:
>
> CREATE SCHEMA fkpart12
>   CREATE TABLE fk_p ( id bigint PRIMARY KEY ) PARTITION BY list (id)
>   CREATE TABLE fk_p_1 PARTITION OF fk_p FOR VALUES IN (1)
>   CREATE TABLE fk_p_2 PARTITION OF fk_p FOR VALUES IN (2)
>   CREATE TABLE fk_r_1 ( id bigint PRIMARY KEY, p_id bigint NOT NULL)
>   CREATE TABLE fk_r_2 ( id bigint PRIMARY KEY, p_id bigint NOT NULL)
>   CREATE TABLE fk_r   ( id bigint PRIMARY KEY, p_id bigint NOT NULL,
>  FOREIGN KEY (p_id) REFERENCES fk_p (id)
>   ) PARTITION BY list (id);
> SET search_path TO fkpart12;
>
> INSERT INTO fk_p VALUES (1);
>
> ALTER TABLE

Remove no-op PlaceHolderVars

2024-09-02 Thread Richard Guo
In [1] there was a short discussion about removing no-op
PlaceHolderVars.  This thread is for continuing that discussion.

We may insert PlaceHolderVars when pulling up a subquery that is
within the nullable side of an outer join: if subquery pullup needs to
replace a subquery-referencing Var that has nonempty varnullingrels
with an expression that is not simply a Var, we may need to wrap the
replacement expression into a PlaceHolderVar.  However, if the outer
join above is later reduced to an inner join, the PHVs would become
no-ops with no phnullingrels bits.

I think it's always desirable to remove these no-op PlaceHolderVars
because they can constrain optimization opportunities.  The issue
reported in [1] shows that they can block subexpression folding, which
can prevent an expression from being matched to an index.  I provided
another example in that thread which shows that no-op PlaceHolderVars
can force join orders, potentially forcing us to resort to nestloop
join in some cases.

As explained in the comment of remove_nulling_relids_mutator, PHVs are
used in some cases to enforce separate identity of subexpressions.  In
other cases, however, it should be safe to remove a PHV if its
phnullingrels becomes empty.
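
In code terms, the branch under discussion looks roughly like the sketch
below. It is simplified, field names are approximate, and "phkeep" is a
hypothetical marker for PHVs that must be kept to isolate their
subexpression:

/* Simplified sketch only; the real mutator has more bookkeeping. */
if (IsA(node, PlaceHolderVar))
{
	PlaceHolderVar *phv = (PlaceHolderVar *)
		expression_tree_mutator(node, remove_nulling_relids_mutator,
								(void *) context);

	phv->phnullingrels = bms_difference(phv->phnullingrels,
										context->removable_relids);

	/*
	 * If no nulling rels remain and the PHV is not needed to keep its
	 * contained expression's identity separate, drop the wrapper and
	 * return the contained expression instead.
	 */
	if (bms_is_empty(phv->phnullingrels) && !phv->phkeep)
		return (Node *) phv->phexpr;

	return (Node *) phv;
}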

Attached is a WIP patch that marks PHVs that need to be kept because
they are serving to isolate subexpressions, and removes all other PHVs
in remove_nulling_relids_mutator if their phnullingrels bits become
empty.  One problem with it is that a PHV's contained expression may
not have been fully preprocessed.  For example if the PHV is used as a
qual clause, its contained expression is not processed for AND/OR
flattening, which could trigger the Assert in make_restrictinfo that
the clause shouldn't be an AND clause.

For example:

create table t (a boolean, b boolean);

select * from t t1 left join
lateral (select (t1.a and t1.b) as x, * from t t2) s on true
where s.x;

The other problem with this is that it breaks 17 test cases.  We need
to modify them one by one to ensure that they still test what they are
supposed to, which is not a trivial task.

Before delving into these two problems, I'd like to know whether this
optimization is worthwhile, and whether I'm going in the right
direction.

[1] https://postgr.es/m/cambws4-9dyrf44pkpkfrpolxc_yh15dl8xt8j-ohggp_wvs...@mail.gmail.com

Thanks
Richard


v1-0001-Remove-no-op-PlaceHolderVars.patch
Description: Binary data


Re: Remove no-op PlaceHolderVars

2024-09-02 Thread Tom Lane
Richard Guo  writes:
> Attached is a WIP patch that marks PHVs that need to be kept because
> they are serving to isolate subexpressions, and removes all other PHVs
> in remove_nulling_relids_mutator if their phnullingrels bits become
> empty.  One problem with it is that a PHV's contained expression may
> not have been fully preprocessed.

Yeah.  I've been mulling over how we could do this, and the real
problem is that the expression containing the PHV *has* been fully
preprocessed by the time we get to outer join strength reduction
(cf. file header comment in prepjointree.c).  Simply dropping the PHV
can break various invariants that expression preprocessing is supposed
to establish, such as "no RelabelType directly above another" or "no
AND clause directly above another".  I haven't thought of a reliable
way to fix that short of re-running eval_const_expressions afterwards,
which seems like a mighty expensive answer.  We could try to make
remove_nulling_relids_mutator preserve all these invariants, but
keeping it in sync with what eval_const_expressions does seems like
a maintenance nightmare.

> The other problem with this is that it breaks 17 test cases.

I've not looked into that, but yeah, it would need some tedious
analysis to verify whether the changes are OK.

> Before delving into these two problems, I'd like to know whether this
> optimization is worthwhile, and whether I'm going in the right
> direction.

I believe this is an area worth working on.  I've been wondering
whether it'd be better to handle the expression-identity problems by
introducing an "expression wrapper" node type that is distinct from
PHV and has the sole purpose of demarcating a subexpression we don't
want to be folded into its parent.  I think that doesn't really move
the football in terms of fixing the problems mentioned above, but
perhaps it's conceptually cleaner than adding a bool field to PHV.

Another thought is that grouping sets provide one of the big reasons
why we need an "expression wrapper" or equivalent functionality.
So maybe we should try to move forward on your other patch to change
the representation of those before we spend too much effort here.

regards, tom lane




Re: Improving the latch handling between logical replication launcher and worker processes.

2024-09-02 Thread vignesh C
On Fri, 5 Jul 2024 at 18:38, Heikki Linnakangas  wrote:
>
> On 05/07/2024 14:07, vignesh C wrote:
> > On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas  wrote:
> >>
> >> I'm don't quite understand the problem we're trying to fix:
> >>
> >>> Currently the launcher's latch is used for the following: a) worker
> >>> process attach b) worker process exit and c) subscription creation.
> >>> Since this same latch is used for multiple cases, the launcher process
> >>> is not able to handle concurrent scenarios like: a) Launcher started a
> >>> new apply worker and waiting for apply worker to attach and b) create
> >>> subscription sub2 sending launcher wake up signal. In this scenario,
> >>> both of them will set latch of the launcher process, the launcher
> >>> process is not able to identify that both operations have occurred 1)
> >>> worker is attached 2) subscription is created and apply worker should
> >>> be started. As a result the apply worker does not get started for the
> >>> new subscription created immediately and gets started after the
> >>> timeout of 180 seconds.
> >>
> >> I don't see how we could miss a notification. Yes, both events will set
> >> the same latch. Why is that a problem?
> >
> > The launcher process waits for the apply worker to attach at
> > WaitForReplicationWorkerAttach function. The launcher's latch is
> > getting set concurrently for another create subscription and apply
> > worker attached. The launcher now detects the latch is set while
> > waiting at WaitForReplicationWorkerAttach, it resets the latch and
> > proceed to the main loop and waits for DEFAULT_NAPTIME_PER_CYCLE(as
> > the latch has already been reset). Further details are provided below.
> >
> > The loop will see that the new
> >> worker has attached, and that the new subscription has been created, and
> >> process both events. Right?
> >
> > Since the latch is reset at WaitForReplicationWorkerAttach, it skips
> > processing the create subscription event.
> >
> > Slightly detailing further:
> > In the scenario when we execute two concurrent create subscription
> > commands, first CREATE SUBSCRIPTION sub1, followed immediately by
> > CREATE SUBSCRIPTION sub2.
> > In a few random scenarios, the following events may unfold:
> > After the first create subscription command(sub1), the backend will
> > set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> > Subsequently, the launcher process will reset the latch and identify
> > the addition of a new subscription in the pg_subscription list. The
> > launcher process will proceed to request the postmaster to start the
> > apply worker background process (sub1) and request the postmaster to
> > notify the launcher once the apply worker(sub1) has been started.
> > Launcher will then wait for the apply worker(sub1) to attach in the
> > WaitForReplicationWorkerAttach function.
> > Meanwhile, the second CREATE SUBSCRIPTION command (sub2) which was
> > executed concurrently, also set the launcher's latch(because of
> > ApplyLauncherWakeupAtCommit).
> > Simultaneously when the launcher remains in the
> > WaitForReplicationWorkerAttach function waiting for apply worker of
> > sub1 to start, the postmaster will start the apply worker for
> > subscription sub1 and send a SIGUSR1 signal to the launcher process
> > via ReportBackgroundWorkerPID. Upon receiving this signal, the
> > launcher process will then set its latch.
> >
> > Now, the launcher's latch has been set concurrently after the apply
> > worker for sub1 is started and the execution of the CREATE
> > SUBSCRIPTION sub2 command.
> >
> > At this juncture, the launcher, which had been awaiting the attachment
> > of the apply worker, detects that the latch is set and proceeds to
> > reset it. The reset operation of the latch occurs within the following
> > section of code in WaitForReplicationWorkerAttach:
> > ...
> > rc = WaitLatch(MyLatch,
> > WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
> > 10L, WAIT_EVENT_BGWORKER_STARTUP);
> >
> > if (rc & WL_LATCH_SET)
> > {
> > ResetLatch(MyLatch);
> > CHECK_FOR_INTERRUPTS();
> > }
> > ...
> >
> > After resetting the latch here, the activation signal intended to
> > start the apply worker for subscription sub2 is no longer present. The
> > launcher will return to the ApplyLauncherMain function, where it will
> > await the DEFAULT_NAPTIME_PER_CYCLE, which is 180 seconds, before
> > processing the create subscription request (i.e., creating a new apply
> > worker for sub2).
> > The issue arises from the latch being reset in
> > WaitForReplicationWorkerAttach, which can occasionally delay the
> > synchronization of table data for the subscription.
>
> Ok, I see it now. Thanks for the explanation. So the problem isn't using
> the same latch for different purposes per se. It's that we're trying to
> use it in a nested fashion, resetting it in the inner loop.
>
> Looking at the proposed patch more closely:
>
> > @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(Logi

Re: define PG_REPLSLOT_DIR

2024-09-02 Thread Bertrand Drouvot
Hi,

On Tue, Sep 03, 2024 at 09:15:50AM +0900, Michael Paquier wrote:
> On Fri, Aug 30, 2024 at 12:21:29PM +, Bertrand Drouvot wrote:
> > That said, I don't have a strong opinion on this one, I think that also 
> > makes
> > sense to leave it as it is. Please find attached v4 doing so.
> 
> The changes in astreamer_file.c are actually wrong regarding the fact
> that should_allow_existing_directory() needs to be able to work with
> the branch where this code is located as well as back-branches,
> because pg_basebackup from version N supports ~(N-1) versions down to
> a certain version, so changing it is not right.  This is why pg_xlog
> and pg_wal are both listed there.

I understand why pg_xlog and pg_wal both need to be listed here, but I don't
get why the proposed changes were "wrong". Or, are you saying that if for any
reason PG_TBLSPC_DIR needs to be changed that would not work anymore? If
that's the case, then I guess we'd have to add a new define and test like:

 strcmp(filename, PG_TBLSPC_DIR) == 0 || strcmp(filename, NEW_PG_TBLSPC_DIR) == 0

, no? 

The question is more out of curiosity, not saying the changes should be applied
in astreamer_file.c though.

> Perhaps we should to more for the two entries in basebackup.c with the
> relative paths, but I'm not sure that's worth bothering, either.

I don't have a strong opinion on those ones.

> At
> the end, I got no objections about the remaining pieces, so applied.

Thanks!

> How do people feel about the suggestions to update the comments at the
> end?  With the comment in relpath.h suggesting to not change that, the
> current state of HEAD is fine by me.

Yeah, I think that's fine, especially because there are still some places
(outside of the comments) that don't rely on the define(s).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Add callbacks for fixed-numbered stats flush in pgstats

2024-09-02 Thread Michael Paquier
Hi all,

The last TODO item I had in my bucket about the generalization of
pgstats is the option of better controlling the flush of the stats
depending on the kind, for fixed-numbered stats.  Currently, this is
controlled by pgstat_report_stat(), that includes special handling for
WAL, IO and SLRU stats, with two generic concepts:
- Check if there are pending entries, allowing a fast-path exit.
- Do the actual flush, with a recheck on pending entries.

SLRU and IO control that with one variable each, and WAL uses a
routine for the same called pgstat_have_pending_wal().  Please find
attached a patch to generalize the concept, with two new callbacks
that can be used for fixed-numbered stats.  SLRU, IO and WAL are
switched to use these (the two pgstat_flush_* routines have been kept
on purpose).  This brings some clarity in the code, by making
have_iostats and have_slrustats static in their respective files.  The
two pgstat_flush_* wrappers do not need a boolean as return result.
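
To illustrate the idea, here is a minimal sketch (mine, not an excerpt of
the attached patch) of how a generic flush loop could drive the two new
callbacks for every fixed-numbered kind; the kind-iteration bounds and the
helper used to fetch the kind info are assumptions made only for the
example:

static bool
flush_fixed_numbered_stats(bool nowait)
{
	bool		pending = false;

	for (PgStat_Kind kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
	{
		const PgStat_KindInfo *info = pgstat_get_kind_info(kind);

		if (info == NULL || !info->fixed_amount)
			continue;

		/* Fast-path exit if this kind has nothing pending. */
		if (info->have_fixed_pending_cb && !info->have_fixed_pending_cb())
			continue;

		/* Do the flush, remembering if some data could not be written out. */
		if (info->flush_fixed_cb && info->flush_fixed_cb(nowait))
			pending = true;
	}

	return pending;
}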

Running Postgres on scissors with a read-only workload that does not
trigger stats, I was not able to see a difference in runtime, but that
was on my own laptop, and I am planning to do more measurements on a
bigger machine.

This is along the same line of thought as the recent thread about the
backend init callback, generalizing the whole facility further:
https://www.postgresql.org/message-id/ztzr1k4pldewc...@paquier.xyz

Like the other one, I wanted to send that a few days ago, but well,
life likes going its own ways sometimes.

Thanks,
--
Michael
From e98f8d50c17cd8fc521bffb4bdc73ef58b7fa430 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 3 Sep 2024 13:36:10 +0900
Subject: [PATCH] Add callbacks to control flush of fixed-numbered stats

This commit adds two callbacks in pgstats:
- have_fixed_pending_cb, to check if a stats kind has any pending stuff
waiting for a flush.
- flush_fixed_cb, to do the flush.

This is used by the SLRU, WAL and IO statistics, generalizing the
concept for all stats kinds (builtin and custom).
---
 src/include/utils/pgstat_internal.h  | 42 +---
 src/backend/utils/activity/pgstat.c  | 63 +++-
 src/backend/utils/activity/pgstat_io.c   | 22 -
 src/backend/utils/activity/pgstat_slru.c | 13 -
 src/backend/utils/activity/pgstat_wal.c  | 19 +--
 5 files changed, 119 insertions(+), 40 deletions(-)

diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index fb132e439d..d69aa05b1c 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -231,7 +231,7 @@ typedef struct PgStat_KindInfo
 
 	/*
 	 * For variable-numbered stats: flush pending stats. Required if pending
-	 * data is used.
+	 * data is used.  See flush_fixed_cb for fixed-numbered stats.
 	 */
 	bool		(*flush_pending_cb) (PgStat_EntryRef *sr, bool nowait);
 
@@ -259,6 +259,19 @@ typedef struct PgStat_KindInfo
 	 */
 	void		(*init_shmem_cb) (void *stats);
 
+	/*
+	 * For fixed-numbered statistics: Flush pending stats. Returns true if
+	 * some of the stats could not be flushed, due to lock contention for
+	 * example. Optional.
+	 */
+	bool		(*flush_fixed_cb) (bool nowait);
+
+	/*
+	 * For fixed-numbered statistics: Check for pending stats in need of
+	 * flush. Returns true if there are any stats pending for flush. Optional.
+	 */
+	bool		(*have_fixed_pending_cb) (void);
+
 	/*
 	 * For fixed-numbered statistics: Reset All.
 	 */
@@ -603,7 +616,10 @@ extern bool pgstat_function_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
  * Functions in pgstat_io.c
  */
 
-extern bool pgstat_flush_io(bool nowait);
+extern void pgstat_flush_io(bool nowait);
+
+extern bool pgstat_io_have_pending_cb(void);
+extern bool pgstat_io_flush_cb(bool nowait);
 extern void pgstat_io_init_shmem_cb(void *stats);
 extern void pgstat_io_reset_all_cb(TimestampTz ts);
 extern void pgstat_io_snapshot_cb(void);
@@ -662,7 +678,8 @@ extern PgStatShared_Common *pgstat_init_entry(PgStat_Kind kind,
  * Functions in pgstat_slru.c
  */
 
-extern bool pgstat_slru_flush(bool nowait);
+extern bool pgstat_slru_have_pending_cb(void);
+extern bool pgstat_slru_flush_cb(bool nowait);
 extern void pgstat_slru_init_shmem_cb(void *stats);
 extern void pgstat_slru_reset_all_cb(TimestampTz ts);
 extern void pgstat_slru_snapshot_cb(void);
@@ -672,10 +689,11 @@ extern void pgstat_slru_snapshot_cb(void);
  * Functions in pgstat_wal.c
  */
 
-extern bool pgstat_flush_wal(bool nowait);
 extern void pgstat_init_wal(void);
-extern bool pgstat_have_pending_wal(void);
+extern void pgstat_flush_wal(bool nowait);
 
+extern bool pgstat_wal_have_pending_cb(void);
+extern bool pgstat_wal_flush_cb(bool nowait);
 extern void pgstat_wal_init_shmem_cb(void *stats);
 extern void pgstat_wal_reset_all_cb(TimestampTz ts);
 extern void pgstat_wal_snapshot_cb(void);
@@ -705,20 +723,6 @@ extern void pgstat_create_transactional(PgStat_Kind kind, Oid dboid, Oid objoid)
 extern PGDLLIMPORT PgStat_Loc

Re: Virtual generated columns

2024-09-02 Thread jian he
On Thu, Aug 29, 2024 at 9:35 PM jian he  wrote:
>
> On Thu, Aug 29, 2024 at 8:15 PM Peter Eisentraut  wrote:
> >


> >
> > The new patch does some rebasing and contains various fixes to the
> > issues you presented.  As I mentioned, I'll look into improving the
> > rewriting.
>
>
> based on your latest patch (v4-0001-Virtual-generated-columns.patch),
> I did some minor cosmetic code change
> and tried to address get_attgenerated overhead.
>
> basically in expand_generated_columns_in_query
> and expand_generated_columns_in_expr preliminary collect (reloid,attnum)
> that have generated_virtual flag into expand_generated_context.
> later in expand_generated_columns_mutator use the collected information.
>
> deal with wholerow within the expand_generated_columns_mutator seems
> tricky, will try later.


Please just ignore v4-0001-Virtual-generated-columns_minorchange.no-cfbot,
in which I made some mistakes (though the tests still passed).

Please check the attachment to this mail instead:
v5-0001-Virtual-generated-wholerow-var-and-virtual-che.no-cfbot

It covers:
1. Minor cosmetic changes.
2. Whole-row Var references to virtual generated columns, with tests added.
3. Reducing the get_attgenerated overhead: instead of calling
get_attgenerated for each Var, walk the query tree once, collect the
relation OID and attribute number of every virtual generated column,
and use that information later (see the sketch below).
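
As an illustration only, a minimal sketch of what such a pre-collected
context could look like (struct and function names below are invented for
the example and are not taken from the patch; the backend's List API is
assumed):

typedef struct VirtualColumnKey
{
	Oid			reloid;		/* relation containing the column */
	AttrNumber	attnum;		/* attribute number of the virtual column */
} VirtualColumnKey;

typedef struct expand_generated_context
{
	List	   *virtual_cols;	/* collected VirtualColumnKey entries */
} expand_generated_context;

/* Remember one virtual generated column found while walking the query. */
static void
record_virtual_column(expand_generated_context *context,
					  Oid reloid, AttrNumber attnum)
{
	VirtualColumnKey *key = palloc(sizeof(VirtualColumnKey));

	key->reloid = reloid;
	key->attnum = attnum;
	context->virtual_cols = lappend(context->virtual_cols, key);
}

/* The mutator can then consult the list instead of hitting the catalogs. */
static bool
is_virtual_column(expand_generated_context *context,
				  Oid reloid, AttrNumber attnum)
{
	ListCell   *lc;

	foreach(lc, context->virtual_cols)
	{
		VirtualColumnKey *key = lfirst(lc);

		if (key->reloid == reloid && key->attnum == attnum)
			return true;
	}
	return false;
}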


I will check the view insert case later.


v5-0001-Virtual-generated-wholerow-var-and-virtual-che.no-cfbot
Description: Binary data


Re: v17 vs v16 performance comparison

2024-09-02 Thread Alexander Lakhin

Hello Thomas,

02.08.2024 12:00, Alexander Lakhin wrote:







I had paid attention to:
Best pg-src-17--.* worse than pg-src-16--.* by 57.9 percents (225.11 > 142.52): 
pg_tpcds.query15
Average pg-src-17--.* worse than pg-src-16--.* by 55.5 percents (230.57 > 
148.29): pg_tpcds.query15
in May, performed `git bisect` for this degradation, that led me to commit
b7b0f3f27 [1].

This time I bisected the following anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 23.6 percents (192.25 > 155.58): 
pg_tpcds.query21
Average pg-src-17--.* worse than pg-src-16--.* by 25.1 percents (196.19 > 
156.85): pg_tpcds.query21
and to my surprise I got "b7b0f3f27 is the first bad commit".

Moreover, bisecting of another anomaly:
Best pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24269.21 > 
19539.89): pg_tpcds.query72
Average pg-src-17--.* worse than pg-src-16--.* by 24.2 percents (24517.66 > 
19740.12): pg_tpcds.query72
pointed at the same commit again.

...

But beside that, I've found a separate regression. Bisecting for this 
degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 105.0 percents (356.63 > 
173.96): s64da_tpcds.query95
Average pg-src-17--.* worse than pg-src-16--.* by 105.2 percents (357.79 > 
174.38): s64da_tpcds.query95
pointed at f7816aec2.




Meanwhile I've bisected another degradation:
Best pg-src-17--.* worse than pg-src-16--.* by 11.3 percents (7.17 > 6.44): 
job.query6f
and came to the commit b7b0f3f27 again.


Now that the unfairness in all-cached parallel seq scan has been eliminated
(with 3ed3683618), I've re-run the same performance tests and got new results
(please see attached). As we can see, the aforementioned pg_tpcds.query72
got better:
  2024-05-15  2024-07-30 2024-09-03
pg-src-16--1   20492.58    19669.34    19913.32
pg-src-17--1   25286.10    24269.21    20654.95
pg-src-16--2   20769.88    19539.89    20429.72
pg-src-17--2   25771.90    24530.51    21244.92
pg-src-17--3   25978.55    24753.25    20904.09
pg-src-16--3   20943.10    20011.13    20086.61

We can also see the improvement of pg_tpcds.query16, but not on all runs:
  2024-05-15  2024-07-30 2024-09-03
pg-src-16--1 105.36   94.31   97.74
pg-src-17--1 145.74  145.53  145.51
pg-src-16--2 101.82   98.36   96.63
pg-src-17--2 140.07  146.90   96.93
pg-src-17--3 154.89  148.11  106.18
pg-src-16--3 101.03  100.94   93.44

So it looks like now we see the same instability, that we observed before
([1]).

Unfortunately, the troublesome tpcds.query15 hasn't produced good numbers
this time either:
  2024-05-15  2024-07-30 2024-09-03
pg-src-16--1 153.41  142.52  142.54
pg-src-17--1 229.84  225.11  212.51
pg-src-16--2 153.47  150.13  149.37
pg-src-17--2 236.34  227.15  232.73
pg-src-17--3 236.43  239.46  233.77
pg-src-16--3 151.03  152.23  144.90

From a bird's-eye view, the new v17-vs-v16 comparison has only 87 "worse"
entries, while the previous one had 115 (it requires deeper analysis, of
course, but still...).

[1] 
https://www.postgresql.org/message-id/d1fb5c09-dd03-2540-9ec2-86dbfdfa2c65%40gmail.com

Best regards,
Alexander

[Attachment: "Postgres benchmarking results" (HTML report). Recoverable
summary: benchmarking run from 2024-09-02T09:38:29 to 2024-09-03T03:22:21,
comparing three PostgreSQL 16.4 (REL_16_STABLE/2015dd5c90) instances with
three PostgreSQL 17beta3 (REL_17_STABLE/3ed3683618) instances across the
pgbench_native, pgbench_reference, pg_tpch and further suites; the full
result tables are not reproduced here.]

Re: Add callback in pgstats for backend initialization

2024-09-02 Thread Bertrand Drouvot
Hi,

On Tue, Sep 03, 2024 at 10:52:20AM +0900, Michael Paquier wrote:
> Hi all,
> 
> Currently, the backend-level initialization of pgstats happens in
> pgstat_initialize(), where we are using a shortcut for the WAL stats,
> with pgstat_init_wal().
> 
> I'd like to propose adding a callback for that, so that in-core or
> custom pgstats kinds can assign custom actions when initializing a
> backend.  The use case I had for this one is pretty close to what we
> do for WAL, where we would rely on a difference of state depending on
> what may have existed before reaching the initialization path.  So
> this can offer more precision.  Another case, perhaps less
> interesting, is to be able to initialize some static backend state.

I think the proposal makes sense and I can see the use cases too, so +1.

> This gives the simple patch
> attached.

Should we add a few words about this new callback (and the others in
PgStat_KindInfo while at it) in the "Custom Cumulative Statistics" section of
xfunc.sgml?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-02 Thread Amit Kapila
On Mon, Sep 2, 2024 at 9:14 AM shveta malik  wrote:
>
> On Mon, Sep 2, 2024 at 5:47 AM Peter Smith  wrote:
> > 
> >
> > To summarize, the current description wrongly describes the field as a
> > time duration:
> > "The time since the slot has become inactive."
> >
> > I suggest replacing it with:
> > "The slot has been inactive since this time."
> >
>
> +1 for the change. If I had read the document without knowing about
> the patch, I too would have interpreted it as a duration.
>

The suggested change looks good to me as well. I'll wait for a day or
two before pushing to see if anyone thinks otherwise.

-- 
With Regards,
Amit Kapila.




Re: Typos in the code and README

2024-09-02 Thread Michael Paquier
On Mon, Sep 02, 2024 at 09:00:00PM +0300, Alexander Lakhin wrote:
> I've gathered another bunch of defects with the possible substitutions.
> Please take a look:
> pgstat_add_kind -> pgstat_register_kind (see 7949d9594)

And here I thought I took care of these inconsistencies.  This one is
on me so I'll go fix that in a bit, with the rest while on it.
--
Michael


signature.asc
Description: PGP signature


Re: Use XLOG_CONTROL_FILE macro everywhere?

2024-09-02 Thread Kyotaro Horiguchi
At Mon, 2 Sep 2024 15:44:45 +0200, Daniel Gustafsson  wrote in 
> Summarizing the thread it seems consensus is using XLOG_CONTROL_FILE
> consistently as per the original patch.
> 
> A few comments on the patch though:
> 
> - * reads the data from $PGDATA/global/pg_control
> + * reads the data from $PGDATA/
> 
> I don't think this is an improvement, I'd leave that one as the filename
> spelled out.
> 
> - "the \".old\" suffix from %s/global/pg_control.old.\n"
> + "the \".old\" suffix from %s/%s.old.\n"
> 
> Same with that change, not sure I think that makes reading the error message
> code any easier.

I agree with the first point. In fact, I think it might be even better
to just write something like "reads the data from the control file" in
plain language rather than using the actual file name. As for the
second point, I'm fine either way, but if the main goal is to ensure
resilience against future changes to the value of XLOG_CONTROL_FILE,
then changing it makes sense. On the other hand, I don't see any
strong reasons not to change it. That said, I don't really expect the
value to change in the first place.
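
To make the contrast concrete, here is a standalone snippet of the two
styles under discussion (this is only an illustration; the macro value is
duplicated so the snippet compiles on its own, the real definition lives in
xlog_internal.h):

#include <stdio.h>

#define XLOG_CONTROL_FILE	"global/pg_control"	/* mirrors xlog_internal.h */

int
main(void)
{
	const char *datadir = "/pgdata";

	/* spelled out: easy to read, but tied to the literal file name */
	printf("remove the \".old\" suffix from %s/global/pg_control.old\n",
		   datadir);

	/* macro-based: resilient if the control file name ever changes */
	printf("remove the \".old\" suffix from %s/%s.old\n",
		   datadir, XLOG_CONTROL_FILE);

	return 0;
}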

The following point caught my attention.

> +++ b/src/backend/postmaster/postmaster.c
...
> +#include "access/xlog_internal.h"

The name xlog_internal suggests that the file should only be included
by modules dealing with XLOG details, and postmaster.c doesn't seem to
fit that category. If the macro is used more broadly, it might be
better to move it to a more public location. However, following the
current discussion, if we decide to keep the macro's name as it is, it
would make more sense to keep it in its current location.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Typos in the code and README

2024-09-02 Thread Michael Paquier
On Tue, Sep 03, 2024 at 02:24:32PM +0900, Michael Paquier wrote:
> On Mon, Sep 02, 2024 at 09:00:00PM +0300, Alexander Lakhin wrote:
> > I've gathered another bunch of defects with the possible substitutions.
> > Please take a look:
> > pgstat_add_kind -> pgstat_register_kind (see 7949d9594)
> 
> And here I thought I took care of these inconsistencies.  This one is
> on me so I'll go fix that in a bit, with the rest while on it.

And done that.

The bit about CommitTSSLRU -> CommitTsSLRU in lwlock.c should be
backpatched down to 17, indeed, but the branch is frozen until the RC
tag lands in the tree, so I have left it out for now. The tag should
show up tomorrow or so.  Good thing that you have noticed this issue
before the release.
--
Michael


signature.asc
Description: PGP signature


Re: Add callback in pgstats for backend initialization

2024-09-02 Thread Michael Paquier
On Tue, Sep 03, 2024 at 05:00:54AM +, Bertrand Drouvot wrote:
> Should we add a few words about this new callback (and the others in
> PgStat_KindInfo while at it) in the "Custom Cumulative Statistics" section of
> xfunc.sgml?

Not sure if it is worth having.  This adds extra maintenance and
there's already a mention of src/include/utils/pgstat_internal.h
telling where the callbacks are described.
--
Michael


signature.asc
Description: PGP signature


Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-02 Thread shveta malik
On Tue, Sep 3, 2024 at 10:43 AM Amit Kapila  wrote:
>
> > >
> > > To summarize, the current description wrongly describes the field as a
> > > time duration:
> > > "The time since the slot has become inactive."
> > >
> > > I suggest replacing it with:
> > > "The slot has been inactive since this time."
> > >
> >
> > +1 for the change. If I had read the document without knowing about
> > the patch, I too would have interpreted it as a duration.
> >
>
> The suggested change looks good to me as well. I'll wait for a day or
> two before pushing to see if anyone thinks otherwise.

Shall we make the change in code-comment as well:

typedef struct ReplicationSlot
{
...
/* The time since the slot has become inactive */
TimestampTz inactive_since;
}

thanks
Shveta




Re: Add const qualifiers to XLogRegister*() functions

2024-09-02 Thread Peter Eisentraut

On 28.08.24 12:04, Aleksander Alekseev wrote:

Hi,


On 04.10.23 16:37, Peter Eisentraut wrote:

On 03.10.23 13:28, Aleksander Alekseev wrote:

While examining the code for similar places I noticed that the
following functions can also be const'ified:



- XLogRegisterData (?)


I don't think this would work, at least without further work elsewhere,
because the data is stored in XLogRecData, which has no const handling.


I got around to fixing this.  Here is a patch.  It allows removing a few
unconstify() calls, which is nice.


LGTM.


committed


Note that this may affect third-party code. IMO this is not a big deal
in this particular case.


I don't think this will impact any third-party code.  Only maybe for the 
better, by being able to remove some casts.



Also by randomly checking one of the affected non-static functions I
found a bunch of calls like this:

XLogRegisterData((char *) msgs, ...)

... where the first argument is going to become (const char *). It
looks like the compilers are OK with implicitly casting (char*) to a
more restrictive (const char*) though.


Yes, that's ok.
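
For readers following along, a tiny standalone example (mine, not part of
the patch) showing why the implicit conversion is fine, while the reverse
direction would need a cast:

#include <stddef.h>

/* Stand-in for the new, const-qualified signature. */
static void
register_data(const char *data, size_t len)
{
	(void) data;
	(void) len;
}

int
main(void)
{
	char		buf[] = "abc";

	/* Fine: a char * argument converts implicitly to const char *. */
	register_data(buf, sizeof(buf));

	return 0;
}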





Re: v17 vs v16 performance comparison

2024-09-02 Thread Thomas Munro
On Tue, Sep 3, 2024 at 5:00 PM Alexander Lakhin  wrote:
>  From a bird's eye view, new v17-vs-v16 comparison has only 87 "worse",
> while the previous one had 115 (it requires deeper analysis, of course, but
> still...).

Any chance you could share that whole pgdata dir with me, assuming it
compresses to a manageable size?  Perhaps we could discuss that
off-list?




Re: Expand applicability of aggregate's sortop optimization

2024-09-02 Thread Andrei Lepikhov

On 18/7/2024 14:49, Matthias van de Meent wrote:

Aside: Arguably, checking for commutator operators would not be
incorrect when looking at it from "applied operators" point of view,
but if that commutative operator isn't registered as opposite ordering
of the same btree opclass, then we'd probably break some assumptions
of some aggregate's sortop - it could be registered with another
opclass, and that could cause us to select a different btree opclass
(thus: ordering) than is indicated to be required by the aggregate;
the thing we're trying to protect against here.

Hi,
This thread has gone idle. At the same time, the general idea of this
patch and the idea of utilising prosupport functions look promising. Are
you going to develop this feature further?


--
regards, Andrei Lepikhov





Re: define PG_REPLSLOT_DIR

2024-09-02 Thread Michael Paquier
On Tue, Sep 03, 2024 at 04:35:28AM +, Bertrand Drouvot wrote:
> I understand why pg_xlog and pg_wal both need to be listed here, but I don't
> get why the proposed changes were "wrong". Or, are you saying that if for any
> reason PG_TBLSPC_DIR needs to be changed that would not work
> anymore?

Yes.  Folks are not going to do that, but changing PG_TBLSPC_DIR would
change the decision-making when streaming from versions older than
v17 with client tools from v18~.
--
Michael


signature.asc
Description: PGP signature


Re: DOCS - pg_replication_slot . Fix the 'inactive_since' description

2024-09-02 Thread Bertrand Drouvot
Hi,

On Tue, Sep 03, 2024 at 10:43:14AM +0530, Amit Kapila wrote:
> On Mon, Sep 2, 2024 at 9:14 AM shveta malik  wrote:
> >
> > On Mon, Sep 2, 2024 at 5:47 AM Peter Smith  wrote:
> > > 
> > >
> > > To summarize, the current description wrongly describes the field as a
> > > time duration:
> > > "The time since the slot has become inactive."
> > >
> > > I suggest replacing it with:
> > > "The slot has been inactive since this time."
> > >
> >
> > +1 for the change. If I had read the document without knowing about
> > the patch, I too would have interpreted it as a duration.
> >
> 
> The suggested change looks good to me as well. I'll wait for a day or
> two before pushing to see if anyone thinks otherwise.

I'm not 100% convinced the current wording is confusing because:

- the field type is described as a "timestamptz".
- there is no duration unit in the wording (if we were to describe a duration,
we would probably add a unit to it, like "The time (in s)...").

That said, if we want to highlight that this is not a duration, what about?

"The time (not duration) since the slot has become inactive."

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: GUC names in messages

2024-09-02 Thread Michael Paquier
On Tue, Sep 03, 2024 at 12:00:19PM +1000, Peter Smith wrote:
> Here is the rebased patch set v10*. Everything is the same as before
> except now there are only 7 patches instead of 8. The previous v9-0001
> ("bool") patch no longer exists because those changes are now already
> present in HEAD.
> 
> I hope these might be pushed soon to avoid further rebasing.

0001~0004 could just be merged, they're the same thing, for different
GUC types.  The consensus mentioned in 17974ec25946 makes that clear.

0007 is a good thing for translators, indeed..  I'll see about doing
something here, at least.
--
Michael


signature.asc
Description: PGP signature


Re: per backend I/O statistics

2024-09-02 Thread Kyotaro Horiguchi
At Mon, 2 Sep 2024 14:55:52 +, Bertrand Drouvot 
 wrote in 
> Hi hackers,
> 
> Please find attached a patch to implement $SUBJECT.
> 
> While pg_stat_io provides cluster-wide I/O statistics, this patch adds a new
> pg_my_stat_io view to display "my" backend I/O statistics and a new
> pg_stat_get_backend_io() function to retrieve the I/O statistics for a given
> backend pid.
> 
> By having the per backend level of granularity, one could for example identify
> which running backend is responsible for most of the reads, most of the 
> extends
> and so on... The pg_my_stat_io view could also be useful to check the
> impact on the I/O made by some operations, queries,... in the current session.
> 
> Some remarks:
> 
> - it is split in 2 sub patches: 0001 introducing the necessary changes to 
> provide
> the pg_my_stat_io view and 0002 to add the pg_stat_get_backend_io() function.
> - the idea of having per backend I/O statistics has already been mentioned in
> [1] by Andres.
> 
> Some implementation choices:
> 
> - The KIND_IO stats are still "fixed amount" ones as the maximum number of
> backend is fixed.
> - The statistics snapshot is made for the global stats (the aggregated ones) 
> and
> for my backend stats. The snapshot is not build for all the backend stats 
> (that
> could be memory expensive depending on the number of max connections and given
> the fact that PgStat_IO is 16KB long).
> - The above point means that pg_stat_get_backend_io() behaves as if
> stats_fetch_consistency is set to none (each execution re-fetches counters
> from shared memory).
> - The above 2 points are also the reasons why the pg_my_stat_io view has been
> added (as its results takes care of the stats_fetch_consistency setting). I 
> think
> that makes sense to rely on it in that case, while I'm not sure that would 
> make
> a lot of sense to retrieve other's backend I/O stats and taking care of
> stats_fetch_consistency.
> 
> 
> [1]: 
> https://www.postgresql.org/message-id/20230309003438.rectf7xo7pw5t5cj%40awork3.anarazel.de

I'm not sure about the usefulness of having the stats only available
from the current session. Since they are stored in shared memory,
shouldn't we make them accessible to all backends? However, this would
introduce permission considerations and could become complex.

When I first looked at this patch, my initial thought was whether we
should let these stats stay "fixed." The reason why the current
PGSTAT_KIND_IO is fixed is that there is only one global statistics
storage for the entire database. If we have stats for a flexible
number of backends, it would need to be non-fixed, perhaps with the
entry for INVALID_PROC_NUMBER storing the global I/O stats, I
suppose. However, one concern with that approach would be the impact
on performance due to the frequent creation and deletion of stats
entries caused by high turnover of backends.

Just to be clear, the above comments are not meant to oppose the
current implementation approach. They are purely for the sake of
discussing comparisons with other possible approaches.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-02 Thread Peter Smith
Hi, my previous review posts did not cover the test code.

Here are my review comments for the v44-0001 test code

==
TEST CASE #1

1.
+# Wait for the inactive replication slot to be invalidated.
+$standby1->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE slot_name = 'lsub1_sync_slot' AND
+ invalidation_reason = 'inactive_timeout';
+])
+  or die
+  "Timed out while waiting for lsub1_sync_slot invalidation to be
synced on standby";
+

Is that comment correct? IIUC the synced slot should *already* be
invalidated from the primary, so here we are not really "waiting" for
it to be invalidated; instead, we are just "confirming" that the
synchronized slot is already invalidated with the correct reason as
expected.

~~~

2.
+# Synced slot mustn't get invalidated on the standby even after a checkpoint,
+# it must sync invalidation from the primary. So, we must not see the slot's
+# invalidation message in server log.
+$standby1->safe_psql('postgres', "CHECKPOINT");
+ok( !$standby1->log_contains(
+ "invalidating obsolete replication slot \"lsub1_sync_slot\"",
+ $standby1_logstart),
+ 'check that syned lsub1_sync_slot has not been invalidated on the standby'
+);
+

This test case seemed bogus, for a couple of reasons:

2a. IIUC this 'lsub1_sync_slot' is the same one that is already
invalid (from the primary), so nobody should be surprised that an
already invalid slot doesn't get flagged as invalid again. i.e.
Shouldn't your test scenario here be done using a valid synced slot?

2b. AFAICT it was only moments above this CHECKPOINT where you
assigned the standby inactivity timeout to 2s. So even if there was
some bug invalidating synced slots I don't think you gave it enough
time to happen -- e.g. I doubt 2s has elapsed yet.

~

3.
+# Stop standby to make the standby's replication slot on the primary inactive
+$standby1->stop;
+
+# Wait for the standby's replication slot to become inactive
+wait_for_slot_invalidation($primary, 'sb1_slot', $logstart,
+ $inactive_timeout);

This seems a bit tricky. Both these (the stop and the wait) seem to
belong together, so I think maybe a single bigger explanatory comment
covering both parts would help for understanding.

==
TEST CASE #2

4.
+# Stop subscriber to make the replication slot on publisher inactive
+$subscriber->stop;
+
+# Wait for the replication slot to become inactive and then invalidated due to
+# timeout.
+wait_for_slot_invalidation($publisher, 'lsub1_slot', $logstart,
+ $inactive_timeout);

IIUC, this is just like comment #3 above. Both these (the stop and the
wait) seem to belong together, so I think maybe a single bigger
explanatory comment covering both parts would help for understanding.

~~~

5.
+# Testcase end: Invalidate logical subscriber's slot due to
+# replication_slot_inactive_timeout.
+# =


IMO the rest of the comment after "Testcase end" isn't very useful.

==
sub wait_for_slot_invalidation

6.
+sub wait_for_slot_invalidation
+{

An explanatory header comment for this subroutine would be helpful.

~~~

7.
+ # Wait for the replication slot to become inactive
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE slot_name = '$slot_name' AND active = 'f';
+ ])
+   or die
+   "Timed out while waiting for slot $slot_name to become inactive on
node $name";
+
+ # Wait for the replication slot info to be updated
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE inactive_since IS NOT NULL
+ AND slot_name = '$slot_name' AND active = 'f';
+ ])
+   or die
+   "Timed out while waiting for info of slot $slot_name to be updated
on node $name";
+

Why are there 2 separate poll_query_until's here? Can't those be
combined into just one?

~~~

8.
+ # Sleep at least $inactive_timeout duration to avoid multiple checkpoints
+ # for the slot to get invalidated.
+ sleep($inactive_timeout);
+

Maybe this special sleep to prevent too many CHECKPOINTs should be
moved inside the other subroutine, which is actually doing those
CHECKPOINTs.

~~~

9.
+ # Wait for the inactive replication slot to be invalidated
+ $node->poll_query_until(
+ 'postgres', qq[
+ SELECT COUNT(slot_name) = 1 FROM pg_replication_slots
+ WHERE slot_name = '$slot_name' AND
+ invalidation_reason = 'inactive_timeout';
+ ])
+   or die
+   "Timed out while waiting for inactive slot $slot_name to be
invalidated on node $name";
+

The comment seems misleading. IIUC you are not "waiting" for the
invalidation here, because it is the other subroutine doing the
waiting for the invalidation message in the logs. Instead, here I
think you are just confirming the 'invalidation_reason' got set
correctly. The comment should say what it is really doing.

==
sub check_for_slot_invalidation_in_server_log

10.
+# Check for invalidation of slot in server log
+sub che