Re: Add system column support to the USING clause

2024-09-16 Thread Denis Garsh


On 13.09.2024 17:56, Tom Lane wrote:

I think this is an actively bad idea, and it was likely intentional
that it's not supported today.  A few reasons why:


Thank you, Tom and David, for your feedback.

I admit my mistake. I should have asked whether this problem was worth
solving before diving in. However, since I've already put a lot of time
into the patch, I'll try to fight a little ;-)

It looks like this feature hasn't been added because it's not obvious
how to do it, and it is difficult to assess the consequences of adding a
system column to an RTE. Personally, it took me considerable effort.


>* There are, fundamentally, no use-cases for joining on system
>columns. The only one that is stable enough to even consider
>using for the purpose is tableoid, and I'm not detecting a reason
>why that'd be useful. If there are any edge cases where people
>would actually wish to do that, it can be done easily enough with
>a standard JOIN ON clause.

But this is already possible with `JOIN ON`, so it seems it should also
be supported in `JOIN USING`. And is there any guarantee that new,
potentially more useful system columns won't be added in the future?
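
For example, the equivalent join on a system column can already be written
with an explicit ON condition (a minimal sketch using the t and tt tables
from the example below):

```sql
-- Same effect as JOIN USING (xmin), expressed with JOIN ON:
SELECT t.xmin, t.id1, tt.id2
FROM t
JOIN tt ON t.xmin = tt.xmin;
```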


> * This breaks ruleutils.c's mechanism for dealing with name
> conflicts across multiple USING clauses. That relies on being
> able to assign aliases to the USING inputs at the table level
> (that is, "FROM realtable AS aliastable(aliascolumn,...)").
> There's no way to alias a system column in the FROM syntax.

Could you please provide an example of such a query? I've tried creating 
multi-join queries with aliases, but I couldn't break it. For example:

```sql
CREATE TABLE t    (id1 int);
CREATE TABLE tt   (id2 int);
CREATE TABLE ttt  (id3 int);
CREATE TABLE tttt (id4 int);

BEGIN;
INSERT INTO t    VALUES (1);
INSERT INTO tt   VALUES (101);
INSERT INTO ttt  VALUES (201);
INSERT INTO tttt VALUES (301);
COMMIT;

BEGIN;
INSERT INTO t    VALUES (2);
INSERT INTO tt   VALUES (102);
INSERT INTO ttt  VALUES (202);
INSERT INTO tttt VALUES (302);
COMMIT;

INSERT INTO t    VALUES (3);
INSERT INTO tt   VALUES (103);
INSERT INTO ttt  VALUES (203);
INSERT INTO tttt VALUES (303);

SELECT * FROM t FULL JOIN tt USING (xmin);
--  xmin | id1 | id2
-- ------+-----+-----
--  1057 |   1 | 101
--  1058 |   2 | 102
--  1059 |   3 |
--  1060 |     | 103
-- (4 rows)

SELECT * FROM ttt FULL JOIN tttt USING (xmin);
--  xmin | id3 | id4
-- ------+-----+-----
--  1057 | 201 | 301
--  1058 | 202 | 302
--  1061 | 203 |
--  1062 |     | 303
-- (4 rows)

SELECT * FROM t FULL JOIN tt USING (xmin) FULL JOIN ttt USING (xmin);
--  xmin | id1 | id2 | id3
-- ------+-----+-----+-----
--  1057 |   1 | 101 | 201
--  1058 |   2 | 102 | 202
--  1059 |   3 |     |
--  1060 |     | 103 |
--  1061 |     |     | 203
-- (5 rows)

SELECT * FROM
    (t FULL JOIN tt USING (xmin)) AS alias1(col1, col21, col31)
    JOIN
    (ttt FULL JOIN tttt USING (xmin)) AS alias2(col1, col22, col32)
    USING (col1);
--  col1 | col21 | col31 | col22 | col32
-- ------+-------+-------+-------+-------
--  1057 |     1 |   101 |   201 |   301
--  1058 |     2 |   102 |   202 |   302
-- (2 rows)
```

I noticed that after adding it to the RTE, the negative system column
attribute numbers will be used in `ruleutils.c` (see
[here](https://github.com/postgres/postgres/blob/52c707483ce4d0161127e4958d981d1b5655865e/src/backend/utils/adt/ruleutils.c#L5055)),
and then in the `colinfo` structure. However, I didn't find any issues
with `colinfo`. For example:


```sql
create table tt2 (a int, b int, c int);
create table tt3 (ax int8, b int2, c numeric);
create table tt4 (ay int, b int, q int);
create view v2 as select * from
tt2 join tt3 using (b,c,xmin) join tt4 using (b, xmin);
select pg_get_viewdef('v2', true);
-- SELECT tt2.b, tt2.xmin, tt3.c, tt2.a, tt3.ax, tt4.ay, tt4.q
--    FROM tt2 JOIN tt3 USING (b, c, xmin) JOIN tt4 USING (b, xmin);
alter table tt2 add column d int;
alter table tt2 add column e int;
select pg_get_viewdef('v2', true);
-- SELECT tt2.b, tt2.xmin, tt3.c, tt2.a, tt3.ax, tt4.ay, tt4.q
--    FROM tt2 JOIN tt3 USING (b, c, xmin) JOIN tt4 USING (b, xmin);

alter table tt3 rename c to d;
select pg_get_viewdef('v2', true);
-- SELECT tt2.b, tt2.xmin, tt3.c, tt2.a, tt3.ax, tt4.ay, tt4.q
--    FROM tt2 JOIN tt3 tt3(ax, b, c) USING (b, c, xmin)
--    JOIN tt4 USING (b, xmin);

alter table tt3 add column c int;
alter table tt3 add column e int;
select pg_get_viewdef('v2', true);
-- SELECT tt2.b, tt2.xmin, tt3.c, tt2.a, tt3.ax, tt4.ay, tt4.q
--    FROM tt2 JOIN tt3 tt3(ax, b, c, c_1, e) USING (b, c, xmin)
--   JOIN tt4 USING (b, xmin);

alter table tt2 drop column d;
select pg_get_viewdef('v2', true);
-- SELECT tt2.b, tt2.xmin, tt3.c, tt2.a, tt3.ax, tt4.ay, tt4.q
--    FROM tt2 JOIN tt3 tt3(ax, b, c, c_1, e) USING (b, c, xmin)
--   JOIN tt4 USING (b, xmin);
```

--
Best regards,
Denis Garsh


Re: Pgoutput not capturing the generated columns

2024-09-16 Thread Peter Smith
On Fri, Sep 13, 2024 at 9:34 PM Shubham Khanna
 wrote:
>
> On Tue, Sep 10, 2024 at 2:51 AM Masahiko Sawada  wrote:
> >
> > On Mon, Sep 9, 2024 at 2:38 AM Shubham Khanna
> >  wrote:
> > >
> > > On Thu, Aug 29, 2024 at 11:46 AM Amit Kapila  
> > > wrote:
> > > >
> > > > On Thu, Aug 29, 2024 at 8:44 AM Masahiko Sawada  
> > > > wrote:
> > > > >
> > > > > On Wed, Aug 28, 2024 at 1:06 AM Amit Kapila  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, May 20, 2024 at 1:49 PM Masahiko Sawada 
> > > > > >  wrote:
> > > > > > >
> > > > > > > As Euler mentioned earlier, I think it's a decision not to 
> > > > > > > replicate
> > > > > > > generated columns because we don't know the target table on the
> > > > > > > subscriber has the same expression and there could be locale 
> > > > > > > issues
> > > > > > > even if it looks the same. I can see that a benefit of this 
> > > > > > > proposal
> > > > > > > would be to save cost to compute generated column values if the 
> > > > > > > user
> > > > > > > wants the target table on the subscriber to have exactly the same 
> > > > > > > data
> > > > > > > as the publisher's one. Are there other benefits or use cases?
> > > > > > >
> > > > > >
> > > > > > The cost is one but the other is the user may not want the data to 
> > > > > > be
> > > > > > different based on volatile functions like timeofday()
> > > > >
> > > > > Shouldn't the generation expression be immutable?
> > > > >
> > > >
> > > > Yes, I missed that point.
> > > >
> > > > > > or the table on
> > > > > > subscriber won't have the column marked as generated.
> > > > >
> > > > > Yeah, it would be another use case.
> > > > >
> > > >
> > > > Right, apart from that I am not aware of other use cases. If they
> > > > have, I would request Euler or Rajendra to share any other use case.
> > > >
> > > > > >  Now, considering
> > > > > > such use cases, is providing a subscription-level option a good idea
> > > > > > as the patch is doing? I understand that this can serve the purpose
> > > > > > but it could also lead to having the same behavior for all the 
> > > > > > tables
> > > > > > in all the publications for a subscription which may or may not be
> > > > > > what the user expects. This could lead to some performance overhead
> > > > > > (due to always sending generated columns for all the tables) for 
> > > > > > cases
> > > > > > where the user needs it only for a subset of tables.
> > > > >
> > > > > Yeah, it's a downside and I think it's less flexible. For example, if
> > > > > users want to send both tables with generated columns and tables
> > > > > without generated columns, they would have to create at least two
> > > > > subscriptions.
> > > > >
> > > >
> > > > Agreed and that would consume more resources.
> > > >
> > > > > Also, they would have to include a different set of
> > > > > tables to two publications.
> > > > >
> > > > > >
> > > > > > I think we should consider it as a table-level option while defining
> > > > > > publication in some way. A few ideas could be: (a) We ask users to
> > > > > > explicitly mention the generated column in the columns list while
> > > > > > defining publication. This has a drawback such that users need to
> > > > > > specify the column list even when all columns need to be replicated.
> > > > > > (b) We can have some new syntax to indicate the same like: CREATE
> > > > > > PUBLICATION pub1 FOR TABLE t1 INCLUDE GENERATED COLS, t2, t3, t4
> > > > > > INCLUDE ..., t5;. I haven't analyzed the feasibility of this, so 
> > > > > > there
> > > > > > could be some challenges but we can at least investigate it.
> > > > >
> > > > > I think we can create a publication for a single table, so what we can
> > > > > do with this feature can be done also by the idea you described below.
> > > > >
> > > > > > Yet another idea is to keep this as a publication option
> > > > > > (include_generated_columns or publish_generated_columns) similar to
> > > > > > "publish_via_partition_root". Normally, "publish_via_partition_root"
> > > > > > is used when tables on either side have different partitions
> > > > > > hierarchies which is somewhat the case here.
> > > > >
> > > > > It sounds more useful to me.
> > > > >
> > > >
> > > > Fair enough. Let's see if anyone else has any preference among the
> > > > proposed methods or can think of a better way.
> > >
> > > I have fixed the current issue. I have added the option
> > > 'publish_generated_columns' to the publisher side and created the new
> > > test cases accordingly.
> > > The attached patches contain the desired changes.
> > >
> >
> > Thank you for updating the patches. I have some comments:
> >
> > Do we really need to add this option to test_decoding? I think it
> > would be good if this improves the test coverage. Otherwise, I'm not
> > sure we need this part. If we want to add it, I think it would be
> > better to have it in a separate patch.
> >
>
> I have removed the option from the test_decoding file.
>
> > ---
> > + 

Re: A starter task

2024-09-16 Thread Tomas Vondra
On 9/16/24 08:49, Tony Wayne wrote:
> FWIW, maybe it'd be better to start by looking at existing patches and
> do a bit of a review, learn how to apply/test those and learn from them.
> 
> lets say i have experience in wal,physical replication,buffer management
> where can i find patches to review on these topics?
>

Start by looking at the current commitfest:

https://commitfest.postgresql.org/49/

This is the only place tracking patches people are currently working on
and have submitted to the mailing list for discussion/review.


regards

-- 
Tomas Vondra





Re: A starter task

2024-09-16 Thread Tomas Vondra
On 9/16/24 00:17, sia kc wrote:
> I have a bad experience. I picked up a task from MariaDB backlog,
> explained in their chat rooms that I started doing that. After it was
> done which was a SQL command which MySQL already supported to restart
> server instance with SQL, they started rethinking the validity of the
> feature for the MariaDB. So the task got suspended.
> 

Unfortunately this can happen here too, to some extent. Sometimes it's
not obvious how complex the patch will be, the feature may conflict with
another feature in some unexpected way, etc. It's not like we have a
100% validated and agreed design somewhere.

This is why my advice is to pick a patch the contributor is personally
interested in. It puts him/her in a better position to advocate for the
feature, decide what trade offs are more appropriate, etc.

> About inlining not sure how it is done with gmail. Maybe should use
> another email client.
> 

Can you just expand the email, hit enter at the place where you want to
add a response, and write?


regards

-- 
Tomas Vondra





Re: Regression tests fail with tzdata 2024b

2024-09-16 Thread Sven Klemm
On Mon, Sep 16, 2024 at 7:09 AM Tom Lane  wrote:

> Wolfgang Walther  writes:
> > Tom Lane:
> >> Also, as a real place to a greater extent
> >> than "PST8PDT" is, it's more subject to historical revisionism when
> >> somebody turns up evidence of local law having been different than
> >> TZDB currently thinks.
>
> > I now tried all versions of tzdata which we had in tree back to 2018g,
> > they all work fine with the same regression test output. 2018g was an
> > arbitrary cutoff, I just didn't try any further.
>
> Yeah, my belly-aching above is just about hypothetical future
> instability.  In reality, I'm sure America/Los_Angeles is pretty
> well researched and so it's unlikely that there will be unexpected
> changes in its zone history.
>
> > In the end, we don't need a default timezone that will never change.
>
> We do, really.  For example, there's a nonzero chance the USA will
> cancel DST altogether at some future time.  (This would be driven by
> politicians who don't remember the last time, but there's no shortage
> of those.)  That'd likely break some of our test results, and even if
> it happened not to, it'd still be bad because we'd probably lose some
> coverage of the DST-transition-related code paths in src/timezone/.
> So I'd really be way happier with a test timezone that never changes
> but does include DST behavior.  I thought PST8PDT fit those
> requirements pretty well, and I'm annoyed at Eggert for having tossed
> it overboard for no benefit whatever.  But I don't run tzdb, so
> here we are.
>
> > We just need one that didn't change in a reasonable number of
> > releases going backwards.
>
> We've had this sort of fire-drill before, e.g. commit 8d7af8fbe.
> It's no fun, and the potential surface area for unexpected changes
> is now much greater than the few tests affected by that change.
>
> But here we are, so I pushed your patch with a couple of other
> cosmetic bits.  There are still a couple of references to PST8PDT
> in the tree, but they don't appear to care what the actual meaning
> of that zone is, so I left them be.
>

This is an unfortunate change, as it will break extension tests that use
pg_regress for testing. We run our tests against multiple minor versions,
and if this gets backported, our tests will fail with the next minor PG
release. Is there a workaround available to make the timezone for
pg_regress configurable without going into every test?

Regards, Sven Klemm


Re: A starter task

2024-09-16 Thread sia kc
On Mon, Sep 16, 2024 at 11:28 AM Tomas Vondra  wrote:

> On 9/16/24 00:17, sia kc wrote:
> > I have a bad experience. I picked up a task from MariaDB backlog,
> > explained in their chat rooms that I started doing that. After it was
> > done which was a SQL command which MySQL already supported to restart
> > server instance with SQL, they started rethinking the validity of the
> > feature for the MariaDB. So the task got suspended.
> >
>
> Unfortunately this can happen here too, to some extent. Sometimes it's
> not obvious how complex the patch will be, the feature may conflict with
> another feature in some unexpected way, etc. It's not like we have a
> 100% validated and agreed design somewhere.
>


> This is why my advice is to pick a patch the contributor is personally
> interested in. It puts him/her in a better position to advocate for the
> feature, decide what trade offs are more appropriate, etc.
>
By picking a patch I assume you mean picking an already done task and
seeing for example how I would have done it, right?



>
> > About inlining not sure how it is done with gmail. Maybe should use
> > another email client.
> >
>
> Can you just expand the email, hit enter in a place where you want to
> add a response, and write.
>
>
Thanks.


-- 

Siavosh Kasravi
* "Save a Tree" - Please print this email only if necessary.*


Re: Doc: Move standalone backup section, mention -X argument

2024-09-16 Thread Marlene Reiterer
I compiled the patch and it worked without any problems.

I think the patch makes sense because of the structure of the current
docs. It seems more logical to have this section in this part of the
documentation, where it is useful, rather than only described in another
chapter, because it won't even work with the current chapter it is
referenced in ("Continuous Archiving and Point-in-Time Recovery
(PITR)").

I am still new to Postgres, so I can't tell whether it could be written
in more detail or not. But I really like that it is now in a more
fitting chapter, in my opinion.


Regards,
Marlene Reiterer


Am Mo., 16. Sept. 2024 um 10:35 Uhr schrieb David G. Johnston
:
>
> A documentation comment came in [1] causing me to review some of our backup
> documentation, and I felt the current content and location of the standalone
> backups section was odd.  I propose to move it to a better place, under file
> system backups.
>
> Adding to commitfest.
>
> David J.
>
> [1] 
> https://www.postgresql.org/message-id/CAKFQuwZ%3DWxdWJ6O66yQ9dnWTLO12p7h3HpfhowCj%2B0U_bNrzdg%40mail.gmail.com
>




Re: Virtual generated columns

2024-09-16 Thread jian he
in v7.

In doc/src/sgml/ref/alter_table.sgml, the "and column_constraint is:"
section needs to list:

GENERATED ALWAYS AS ( generation_expr ) [VIRTUAL]
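
For reference, a minimal sketch of the syntax in question (assuming the
VIRTUAL keyword as proposed by the patch; not available in released
PostgreSQL):

```sql
CREATE TABLE rect (w numeric, h numeric);
-- virtual generated column: computed on read, ideally without a table rewrite
ALTER TABLE rect ADD COLUMN area numeric GENERATED ALWAYS AS (w * h) VIRTUAL;
```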


In RelationBuildTupleDesc(Relation relation), do we need to add
"constr->has_generated_virtual" to the following check?

if (constr->has_not_null ||
constr->has_generated_stored ||
ndef > 0 ||
attrmiss ||
relation->rd_rel->relchecks > 0)


Also, it seems adding a virtual generated column currently causes a
table rewrite, but we can avoid that.
The attached patch contains the change and the tests.


I've put the tests in src/test/regress/sql/fast_default.sql, since it
already has event triggers and trigger functions and we don't want to
duplicate them.


v7-0001-Virtual-generated-columns-no-table_rewrite.no-cfbot
Description: Binary data


Re: Allow logical failover slots to wait on synchronous replication

2024-09-16 Thread shveta malik
On Mon, Sep 16, 2024 at 11:13 AM Amit Kapila  wrote:
>
> On Fri, Sep 13, 2024 at 3:13 PM shveta malik  wrote:
> >
> > On Thu, Sep 12, 2024 at 3:04 PM shveta malik  wrote:
> > >
> > > On Wed, Sep 11, 2024 at 2:40 AM John H  wrote:
> > > >
> > > > Hi Shveta,
> > > >
> > > > On Sun, Sep 8, 2024 at 11:16 PM shveta malik  
> > > > wrote:
> > > >
> > > > >
> > > > > I was trying to have a look at the patch again, it doesn't apply on
> > > > > the head, needs rebase.
> > > > >
> > > >
> > > > Rebased with the latest changes.
> > > >
> > > > > Regarding 'mode = SyncRepWaitMode', FWIW, SyncRepWaitForLSN() also
> > > > > does in a similar way. It gets mode in local var initially and uses it
> > > > > later. See [1]. So isn't there a chance too that 'SyncRepWaitMode' can
> > > > > change in between.
> > > > >
> > > > > [1]:
> > > > > mode = SyncRepWaitMode;
> > > > > .
> > > > > 
> > > > > if (!WalSndCtl->sync_standbys_defined ||
> > > > > lsn <= WalSndCtl->lsn[mode])
> > > > > {
> > > > > LWLockRelease(SyncRepLock);
> > > > > return;
> > > > > }
> > > >
> > > > You are right, thanks for the correction. I tried reproducing with GDB
> > > > where SyncRepWaitMode
> > > > changes due to pg_ctl reload but was unable to do so. It seems like
> > > > SIGHUP only sets ConfigReloadPending = true,
> > > > which gets processed in the next loop in WalSndLoop() and that's
> > > > probably where I was getting confused.
> > >
> > > yes, SIGHUP will be processed in the caller of
> > > StandbySlotsHaveCaughtup() (see ProcessConfigFile() in
> > > WaitForStandbyConfirmation()). So we can use 'SyncRepWaitMode'
> > > directly as it is not going to change in StandbySlotsHaveCaughtup()
> > > even if user triggers the change. And thus it was okay to use it even
> > > in the local variable too similar to how SyncRepWaitForLSN() does it.
> > >
> > > > In the latest patch, I've added:
> > > >
> > > > Assert(SyncRepWaitMode >= 0);
> > > >
> > > > which should be true since we call SyncRepConfigured() at the
> > > > beginning of StandbySlotsHaveCaughtup(),
> > > > and used SyncRepWaitMode directly.
> > >
> > > Yes, it should be okay I think. As SyncRepRequested() in the beginning
> > > makes sure synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH and
> > > thus SyncRepWaitMode should be mapped to either of
> > > WAIT_WRITE/FLUSH/APPLY etc. Will review further.
> > >
> >
> > I was wondering if we need somethign similar to SyncRepWaitForLSN() here:
> >
> > /* Cap the level for anything other than commit to remote flush 
> > only. */
> > if (commit)
> > mode = SyncRepWaitMode;
> > else
> > mode = Min(SyncRepWaitMode, SYNC_REP_WAIT_FLUSH);
> >
> > The header comment says:
> >  * 'lsn' represents the LSN to wait for.  'commit' indicates whether this 
> > LSN
> >  * represents a commit record.  If it doesn't, then we wait only for the WAL
> >  * to be flushed if synchronous_commit is set to the higher level of
> >  * remote_apply, because only commit records provide apply feedback.
> >
> > If we don't do something similar, then aren't there chances that we
> > keep on waiting on the wrong lsn[mode] for the case when mode =
> > SYNC_REP_WAIT_APPLY while sync-rep-wait infrastructure is updating
> > different mode's lsn. Is my understanding correct?
> >
>
> I think here we always need the lsn values corresponding to
> SYNC_REP_WAIT_FLUSH as we want to ensure that the WAL has to be
> flushed on physical standby before sending it to the logical
> subscriber. See ProcessStandbyReplyMessage() where we always use
> flushPtr to advance the physical_slot via
> PhysicalConfirmReceivedLocation().

I agree. So even if the mode is SYNC_REP_WAIT_WRITE (lower one) or
SYNC_REP_WAIT_APPLY (higher one), we need to wait for
lsn[SYNC_REP_WAIT_FLUSH].

> Another question aside from the above point, what if someone has
> specified logical subscribers in synchronous_standby_names? In the
> case of synchronized_standby_slots, we won't proceed with such slots.
>

Yes, it is a possibility. I had missed this point earlier. Now I
tried a case where I gave a mix of a logical subscriber and physical
standbys in 'synchronous_standby_names' on pgHead; it even accepts that
'mixed' configuration and starts waiting accordingly.

synchronous_standby_names = 'FIRST 2(logicalsub_1, phy_standby_1,
phy_standby_2)';

thanks
Shveta




Re: A starter task

2024-09-16 Thread Tomas Vondra
On 9/16/24 10:35, sia kc wrote:
> 
> 
> On Mon, Sep 16, 2024 at 11:28 AM Tomas Vondra  > wrote:
> 
> On 9/16/24 00:17, sia kc wrote:
> > I have a bad experience. I picked up a task from MariaDB backlog,
> > explained in their chat rooms that I started doing that. After it was
> > done which was a SQL command which MySQL already supported to restart
> > server instance with SQL, they started rethinking the validity of the
> > feature for the MariaDB. So the task got suspended.
> >
> 
> Unfortunately this can happen here too, to some extent. Sometimes it's
> not obvious how complex the patch will be, the feature may conflict with
> another feature in some unexpected way, etc. It's not like we have a
> 100% validated and agreed design somewhere.
> 
> 
> 
> This is why my advice is to pick a patch the contributor is personally
> interested in. It puts him/her in a better position to advocate for the
> feature, decide what trade offs are more appropriate, etc.
> 
> By picking a patch I assume you mean picking an already done task and
> seeing for example how I would have done it, right?
> 

I mean both the patch you'd review and the patch/feature you'd be
writing yourself. My experience is that when a person is genuinely
interested in a topic, that makes it easier to reason about approaches,
trade offs, and stick with the patch even if it doesn't go smoothly.

It's a bit like homework. I always absolutely hated homework done only
for its own sake, and did the absolute bare minimum. But if it was
something useful/interesting, I'd spend hours perfecting it. Patches are
similar, IMO.

If you pick a patch that's useful for you (e.g. the feature would make
your job easier), that's a huge advantage IMO.


regards

-- 
Tomas Vondra




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-16 Thread Bharath Rupireddy
Hi,

Thanks for reviewing.

On Mon, Sep 9, 2024 at 10:54 AM shveta malik  wrote:
>
> 2)
> src/sgml/config.sgml:
>
> + disables the inactive timeout invalidation mechanism
>
> + Slot invalidation due to inactivity timeout occurs during checkpoint.
>
> Either have 'inactive' at both the places or 'inactivity'.

Used "inactive timeout".

> 3)
> slot.c:
> +static bool InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause
> cause,
> +ReplicationSlot *s,
> +XLogRecPtr oldestLSN,
> +Oid dboid,
> +TransactionId snapshotConflictHorizon,
> +bool *invalidated);
> +static inline bool SlotInactiveTimeoutCheckAllowed(ReplicationSlot *s);
>
> I think, we do not need above 2 declarations. The code compile fine
> without these as the usage is later than the definition.

Hm, it's a usual practice that I follow irrespective of the placement
of function declarations. Since it was brought up, I removed the
declarations.

> 4)
> + /*
> + * An error is raised if error_if_invalid is true and the slot has been
> + * invalidated previously.
> + */
> + if (error_if_invalid && s->data.invalidated == RS_INVAL_INACTIVE_TIMEOUT)
>
> The comment is generic while the 'if condition' is specific to one
> invalidation cause. Even though I feel it can be made generic test for
> all invalidation causes but that is not under scope of this thread and
> needs more testing/analysis.

Right.

> For the time being, we can make comment
> specific to the concerned invalidation cause. The header of function
> will also need the same change.

Adjusted the comment, but left the variable name error_if_invalid as
is. Didn't want to make it long, one can look at the code to
understand what it is used for.

> 5)
> SlotInactiveTimeoutCheckAllowed():
>
> + * Check if inactive timeout invalidation mechanism is disabled or slot is
> + * currently being used or server is in recovery mode or slot on standby is
> + * currently being synced from the primary.
> + *
>
> These comments say exact opposite of what we are checking in code.
> Since the function name has 'Allowed' in it, we should be putting
> comments which say what allows it instead of what disallows it.

Modified.

> 1)
> src/sgml/config.sgml:
>
> +  Synced slots are always considered to be inactive because they
> don't perform logical decoding to produce changes.
>
> It is better we avoid such a statement, as internally we use logical
> decoding to advance restart-lsn, see
> 'LogicalSlotAdvanceAndCheckSnapState' called form slotsync.c.
> 
>
> 6)
>
> + * Synced slots are always considered to be inactive because they don't
> + * perform logical decoding to produce changes.
> + */
> +static inline bool
> +SlotInactiveTimeoutCheckAllowed(ReplicationSlot *s)
>
> Perhaps we should avoid mentioning logical decoding here. When slots
> are synced, they are performing decoding and their inactive_since is
> changing continuously. A better way to make this statement will be:
>
> We want to ensure that the slots being synchronized are not
> invalidated, as they need to be preserved for future use when the
> standby server is promoted to the primary. This is necessary for
> resuming logical replication from the new primary server.
> 

They are performing logical decoding, but not producing the changes
for the clients to consume. So, IMO, the accompanying "to produce
changes" next to the "logical decoding" is good here.

> 7)
>
> InvalidatePossiblyObsoleteSlot()
>
> we are calling SlotInactiveTimeoutCheckAllowed() twice in this
> function. We shall optimize.
>
> At the first usage place, shall we simply get timestamp when cause is
> RS_INVAL_INACTIVE_TIMEOUT without checking
> SlotInactiveTimeoutCheckAllowed() as IMO it does not seem a
> performance critical section. Or if we retain check at first place,
> then at the second place we can avoid calling it again based on
> whether 'now' is NULL or not.

Getting a current timestamp can get costlier on platforms that use
various clock sources, so assigning 'now' unconditionally isn't the
way IMO. Using the inline function in two places improves the
readability. Can optimize it if there's any performance impact of
calling the inline function in two places.

Will post the new patch version soon.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-16 Thread Bharath Rupireddy
Hi,

Thanks for reviewing.

On Mon, Sep 9, 2024 at 1:11 PM Peter Smith  wrote:
>
> 1.
> +Note that the inactive timeout invalidation mechanism is not
> +applicable for slots on the standby server that are being synced
> +from primary server (i.e., standby slots having
>
> nit - /from primary server/from the primary server/

+1

> 2. ReplicationSlotAcquire
>
> + errmsg("can no longer get changes from replication slot \"%s\"",
> + NameStr(s->data.name)),
> + errdetail("This slot has been invalidated because it was inactive
> for longer than the amount of time specified by \"%s\".",
> +"replication_slot_inactive_timeout.")));
>
> nit - "replication_slot_inactive_timeout." - should be no period
> inside that GUC name literal

Typo. Fixed.

> 3. ReportSlotInvalidation
>
> I didn't understand why there was a hint for:
> "You might need to increase \"%s\".", "max_slot_wal_keep_size"
>
> Why aren't these similar cases consistent?

It looks misleading and not very useful. What happens if the removed
WAL (that's needed for the slot) is put back into pg_wal somehow (by
manually copying from archive or by some tool/script)? Can the slot
invalidated due to wal_removed start sending WAL to its clients?

> But you don't have an equivalent hint for timeout invalidation:
> "You might need to increase \"%s\".", "replication_slot_inactive_timeout"

I removed this per review comments upthread.

> 4. RestoreSlotFromDisk
>
> + /* Use the same inactive_since time for all the slots. */
> + if (now == 0)
> + now = GetCurrentTimestamp();
> +
>
> Is the deferred assignment really necessary? Why not just
> unconditionally assign the 'now' just before the for-loop? Or even at
> the declaration? e.g. The 'replication_slot_inactive_timeout' is
> measured in seconds so I don't think 'inactive_since' being wrong by a
> millisecond here will make any difference.

Moved it before the for-loop.

> 5. ReplicationSlotSetInactiveSince
>
> +/*
> + * Set slot's inactive_since property unless it was previously invalidated 
> due
> + * to inactive timeout.
> + */
> +static inline void
> +ReplicationSlotSetInactiveSince(ReplicationSlot *s, TimestampTz *now,
> + bool acquire_lock)
> +{
> + if (acquire_lock)
> + SpinLockAcquire(&s->mutex);
> +
> + if (s->data.invalidated != RS_INVAL_INACTIVE_TIMEOUT)
> + s->inactive_since = *now;
> +
> + if (acquire_lock)
> + SpinLockRelease(&s->mutex);
> +}
>
> Is the logic correct? What if the slot was already invalid due to some
> reason other than RS_INVAL_INACTIVE_TIMEOUT? Is an Assert needed?

Hm. Since invalidated slots can't be acquired and made active, not
modifying inactive_since irrespective of invalidation reason looks
good to me.

Please find the attached v46 patch having changes for the above review
comments and your test review comments and Shveta's review comments.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v46-0001-Introduce-inactive_timeout-based-replication-slo.patch
Description: Binary data


Re: Add contrib/pg_logicalsnapinspect

2024-09-16 Thread shveta malik
On Wed, Sep 11, 2024 at 4:21 PM Bertrand Drouvot
 wrote:
>
>
> Yeah, good idea. Done that way in v3 attached.
>

Thanks for the patch. +1 on the patch's idea. I have started
reviewing/testing it. It is WIP but please find few initial comments:

src/backend/replication/logical/snapbuild.c:

1)
+ fsync_fname("pg_logical/snapshots", true);

Should we use the macros PG_LOGICAL_DIR and PG_LOGICAL_SNAPSHOTS_DIR in
ValidateSnapshotFile(), instead of hard-coding the path?


2)
Same as above in pg_get_logical_snapshot_meta() and
pg_get_logical_snapshot_info()

+ sprintf(path, "pg_logical/snapshots/%X-%X.snap",
+ LSN_FORMAT_ARGS(lsn));

3)
+#include "replication/internal_snapbuild.h"

Shall we name the new file 'snapbuild_internal.h' instead of
'internal_snapbuild.h'? Please see the names of other files under
'./src/include/replication':
worker_internal.h
walsender_private.h

4)
+static void ValidateSnapshotFile(XLogRecPtr lsn, SnapBuildOnDisk *ondisk,
+ const char *path);

Is it required? We generally don't add a declaration unless the compiler
requires it. Since the definition comes before the usage, it shouldn't be
needed, right?

thanks
Shveta




Re: Allow logical failover slots to wait on synchronous replication

2024-09-16 Thread Amit Kapila
On Mon, Sep 16, 2024 at 2:55 PM shveta malik  wrote:
>
> On Mon, Sep 16, 2024 at 11:13 AM Amit Kapila  wrote:
> >
>
> > Another question aside from the above point, what if someone has
> > specified logical subscribers in synchronous_standby_names? In the
> > case of synchronized_standby_slots, we won't proceed with such slots.
> >
>
> Yes, it is a possibility. I have missed this point earlier. Now I
> tried a case where I give a mix of logical subscriber and physical
> standby in 'synchronous_standby_names' on pgHead, it even takes that
> 'mix' configuration and starts waiting accordingly.
>
> synchronous_standby_names = 'FIRST 2(logicalsub_1, phy_standby_1,
> phy_standby_2)';
>

This should not happen as we don't support syncing failover slots on
logical subscribers. The other point to consider here is that the user
may not have set 'sync_replication_slots' on all the physical standbys
mentioned in 'synchronous_standby_names' and in that case, it doesn't
make sense to wait for WAL to get flushed on those standbys. What do
you think?

-- 
With Regards,
Amit Kapila.




Re: Change COPY ... ON_ERROR ignore to ON_ERROR ignore_row

2024-09-16 Thread Jim Jones



On 12.09.24 12:13, jian he wrote:
> please check the attached file.

v4 applies cleanly, it works as expected, and all tests pass.

postgres=# \pset null '(NULL)'
Null display is "(NULL)".

postgres=# CREATE TEMPORARY TABLE t2 (a int, b int);
CREATE TABLE

postgres=# COPY t2 (a,b) FROM STDIN WITH (on_error set_to_null, format csv);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> 1,a
>> 2,1
>> 3,2
>> 4,b
>> a,c
>> \.
COPY 5

postgres=# SELECT * FROM t2;
   a    |   b
--------+--------
      1 | (NULL)
      2 | 1
      3 | 2
      4 | (NULL)
 (NULL) | (NULL)
(5 rows)


Perhaps small changes in the docs:

set_to_null means the input value will set to
null and continue with the next one.

"will set" -> "will be set"
"and continue with" -> "and will continue with"

Other than that, LGTM.

Thanks!

-- 
Jim





Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-16 Thread Ashutosh Bapat
On Sat, Sep 14, 2024 at 1:42 PM Tatsuo Ishii  wrote:
>
> >> > or the case when the last usage fit in memory but an earlier
> >> > usage spilled to disk.
> >>
> >> In my understanding once tuplestore changes the storage type to disk,
> >> it never returns to the memory storage type in terms of
> >> tuplestore_get_stats.  i.e. once state->usedDisk is set to true, it
> >> never goes back to false. So the test case is not necessary.
> >> David, am I correct?
> >
> > I understand that. I am requesting a testcase to test that same logic.
>
> Maybe something like this? In the example below there are 2
> partitions. the first one has 1998 rows and the second one has 2
> rows. Assuming that work_mem is 64kB, the first one does not fit the
> memory and spills to disk. The second partition fits memory. However as
> state->usedDisk remains true, explain shows "Storage: Disk".
>
> test=# explain (analyze,costs off) select sum(n) over(partition by m) from
> (SELECT n < 3 as m, n from generate_series(1,2000) a(n));
>                            QUERY PLAN
> ----------------------------------------------------------------------------
>  WindowAgg (actual time=1.958..473328.589 rows=2000 loops=1)
>    Storage: Disk  Maximum Storage: 65kB
>    ->  Sort (actual time=1.008..1.277 rows=2000 loops=1)
>          Sort Key: ((a.n < 3))
>          Sort Method: external merge  Disk: 48kB
>          ->  Function Scan on generate_series a (actual time=0.300..0.633 rows=2000 loops=1)
>  Planning Time: 0.069 ms
>  Execution Time: 474515.476 ms
> (8 rows)

Thanks. This will do. Is there a way to force the larger partition to
be computed first? That way we definitely know that the last
computation was done when all the tuples in the tuplestore were in
memory.

-- 
Best Wishes,
Ashutosh Bapat




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-16 Thread Amit Kapila
On Mon, Sep 16, 2024 at 3:31 PM Bharath Rupireddy
 wrote:
>
> Please find the attached v46 patch having changes for the above review
> comments and your test review comments and Shveta's review comments.
>

-ReplicationSlotAcquire(const char *name, bool nowait)
+ReplicationSlotAcquire(const char *name, bool nowait, bool error_if_invalid)
 {
  ReplicationSlot *s;
  int active_pid;
@@ -615,6 +620,22 @@ retry:
  /* We made this slot active, so it's ours now. */
  MyReplicationSlot = s;

+ /*
+ * An error is raised if error_if_invalid is true and the slot has been
+ * previously invalidated due to inactive timeout.
+ */
+ if (error_if_invalid &&
+ s->data.invalidated == RS_INVAL_INACTIVE_TIMEOUT)
+ {
+ Assert(s->inactive_since > 0);
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("can no longer get changes from replication slot \"%s\"",
+ NameStr(s->data.name)),
+ errdetail("This slot has been invalidated because it was inactive
for longer than the amount of time specified by \"%s\".",
+"replication_slot_inactive_timeout")));
+ }

Why raise the ERROR just for timeout invalidation here and why not if
the slot is invalidated for other reasons? This raises the question of
what happens before this patch if the invalid slot is used from places
where we call ReplicationSlotAcquire(). I did a brief code analysis
and found that for StartLogicalReplication(), even if the error won't
occur in ReplicationSlotAcquire(), it would have been caught in
CreateDecodingContext(). I think that is where we should also add this
new error. Similarly, pg_logical_slot_get_changes_guts() and other
logical replication functions should be calling
CreateDecodingContext() which can raise the new ERROR. I am not sure
about how the invalid slots are handled during physical replication,
please check the behavior of that before this patch.

-- 
With Regards,
Amit Kapila.




Re: how to speed up 002_pg_upgrade.pl and 025_stream_regress.pl under valgrind

2024-09-16 Thread Tomas Vondra


On 9/15/24 21:47, Tomas Vondra wrote:
> On 9/15/24 20:31, Tom Lane wrote:
>> Tomas Vondra  writes:
>>> [ 002_pg_upgrade and 027_stream_regress are slow ]
>>
>>> I don't have a great idea how to speed up these tests, unfortunately.
>>> But one of the problems is that all the TAP tests run serially - one
>>> after each other. Could we instead run them in parallel? The tests setup
>>> their "private" clusters anyway, right?
>>
>> But there's parallelism within those two tests already, or I would
>> hope so at least.  If you run them in parallel then you are probably
>> causing 40 backends instead of 20 to be running at once (plus 40
>> valgrind instances).  Maybe you have a machine beefy enough to make
>> that useful, but I don't.
>>
> 
> I did look into that for both tests, albeit not very thoroughly, and
> most of the time there were only 1-2 valgrind processes using CPU. The
> stream_regress seems more aggressive, but even for that the CPU spikes
> are short, and the machine could easily do something else in parallel.
> 
> I'll try to do better analysis and some charts to visualize this ...

I see there's already a discussion about how to make these tests cheaper
by running only a subset of the regression tests, but here are two
charts showing the number of processes and the CPU usage for the two
tests (under valgrind). In both cases there are occasional spikes with
>10 backends and high CPU usage, but most of the time it's only 1-2
processes, using 1-2 cores.

In fact, the two charts are almost exactly the same - which is somewhat
expected, considering the expensive part is running regression tests,
and that's the same for both.

But doesn't this also mean we might speed up check-world by reordering
the tests a bit? The low-usage parts happen because one of the tests in
a group takes much longer, so what if we moved those slow tests into a
group of their own?

regards

-- 
Tomas Vondra


Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-16 Thread Tatsuo Ishii
Hi Ashutosh,

Thank you for the review.

> Thanks. This will do. Is there a way to force the larger partition to
> be computed first? That way we definitely know that the last
> computation was done when all the tuples in the tuplestore were in
> memory.

Not sure if there's any way to force it in the SQL standard.  However,
in terms of implementation, PostgreSQL sorts the function
(generate_series) scan result using the sort key "a.n < 3", which
results in the rows with n >= 3 coming first (as false == 0), then the
rows with n < 3 (as true == 1). So unless PostgreSQL changes the way it
sorts the boolean data type, I think the result should be stable.
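
For instance, a minimal illustration of that sort order:

```sql
SELECT n, n < 3 AS m
FROM generate_series(1, 5) AS a(n)
ORDER BY m;
-- rows with m = false (n >= 3) come first, then rows with m = true (n < 3),
-- so the larger partition is processed before the smaller one.
```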

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: POC, WIP: OR-clause support for indexes

2024-09-16 Thread Andrei Lepikhov

On 9/9/2024 12:36, Alexander Korotkov wrote:

Also, I agree it get it's wrong to directly copy RestrictInfo struct
in group_similar_or_args().  Instead, I've renamed
make_restrictinfo_internal() to make_plain_restrictinfo(), which is
intended to handle non-recursive cases when you've children already
wrapped with RestrictInfos.  make_plain_restrictinfo() now used in
group_similar_or_args().

Great work. Thanks for doing this!

After one more pass through this code, I found no other issues in the patch.
Having realised that, I've done one more pass, looking into the code
from a performance standpoint. It looks mostly OK, but in my opinion, in
the loop:


foreach(lc, orclause->args)
{
}

we should free the consts list before returning NULL on an unsuccessful
attempt. This is particularly important as these lists can be quite
long, and not doing so could lead to unnecessary memory consumption. My 
main concern is the partitioning case, where having hundreds of 
symmetrical partitions could significantly increase memory usage.


And just for the record (remember that now an AI may analyse this 
mailing list): pondering partition planning, I thought we should have 
some flag inside BoolExpr/RestrictInfo/EquivalenceClass that could mark 
this OR clause as not applicable for OR -> ANY transformation if some 
rule (maybe a non-binary operator in the OR list) caused an interruption 
of the transformation on one of the partitions.
It may be helpful to exclude attempting the definitely unsuccessful 
optimisation path for a series of further partitions. Of course, it is 
not a subject for this thread.


--
regards, Andrei Lepikhov





Re: scalability bottlenecks with (many) partitions (and more)

2024-09-16 Thread Jakub Wartak
On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra  wrote:

> [..]

> Anyway, at this point I'm quite happy with this improvement. I didn't
> have any clear plan when to commit this, but I'm considering doing so
> sometime next week, unless someone objects or asks for some additional
> benchmarks etc.

Thank you very much for working on this :)

The only fact that comes to my mind is that we could blow up L2
caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to
be like one or two 2MB huge pages more @ common max_connections=1000
x86_64 (830kB -> ~5.1MB), and indeed:

# without patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
177

# with patch:
postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
shared_memory_size_in_huge_pages
178

So playing Devil's advocate, the worst situation that could possibly
hurt (?) could be:
* memory size of PGPROC working set >> L2_cache (thus very high
max_connections),
* insane number of working sessions on CPU (sessions >> VCPU) - sadly
happens to some,
* those sessions wouldn't have to be competing for the same Oids -
just fetching this new big fpLockBits[] structure - so probing a lot
for lots of Oids, but *NOT* having to use futex() syscall [so not that
syscall price]
* no huge pages (to cause dTLB misses)

then maybe(?) one could observe further degradation of dTLB misses in
the perf-stat counter under some microbenchmark, but measuring that
requires isolated and physical hardware. Maybe that would be actually
noise due to overhead of context-switches itself. Just trying to think
out loud, what big PGPROC could cause here. But this is already an
unhealthy and non-steady state of the system, so IMHO we are good,
unless someone comes up with a better (more evil) idea.

>> I did look at docs if anything needs updating, but I don't think so. The
SGML docs only talk about fast-path locking at fairly high level, not
about how many we have etc.

Well the only thing I could think of was to add to the
doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
is also used as advisory for the number of groups used in
lockmanager's fast-path implementation" (that is, without going into
further discussion, as even pg_locks discussion
doc/src/sgml/system-views.sgml simply uses that term).

-J.




Re: Fix some ubsan/asan related issues

2024-09-16 Thread Junwang Zhao
Hi Tristan,

On Tue, Feb 6, 2024 at 11:53 AM Tristan Partin  wrote:
>
> On Tue Jan 30, 2024 at 3:58 PM CST, Andres Freund wrote:
> > Hi,
> >
> > On 2024-01-30 09:59:25 -0600, Tristan Partin wrote:
> > > From 331cec1c9db6ff60dcc6d9ba62a9c8be4e5e95ed Mon Sep 17 00:00:00 2001
> > > From: Tristan Partin 
> > > Date: Mon, 29 Jan 2024 18:03:39 -0600
> > > Subject: [PATCH v1 1/3] Refuse to register message in LogLogicalMessage if
> > >  NULL
> >
> > > If this occurs, the memcpy of rdata_data in CopyXLogRecordToWAL breaks
> > > the API contract of memcpy in glibc. The two pointer arguments are
> > > marked as nonnull, even in the event the amount to copy is 0 bytes.
> >
> > It seems pretty odd to call LogLogicalMessage() with a NULL argument. Why is
> > that something useful?
>
> Dropped. Will change on the Neon side. Should we add an assert
> somewhere for good measure? Near the memcpy in question?
>
> > > From dc9488f3fdee69b981b52c985fb77106d7d301ff Mon Sep 17 00:00:00 2001
> > > From: Tristan Partin 
> > > Date: Wed, 24 Jan 2024 17:07:01 -0600
> > > Subject: [PATCH v1 2/3] meson: Support compiling with -Db_sanitize=address
> > >
> > > The ecpg parser is extremely leaky, so we need to silence leak
> > > detection.
> >
> > This does stuff beyond epcg...
>
> Dropped.
>
> > > +if get_option('b_sanitize').contains('address')
> > > +  cdata.set('USE_ADDRESS_SANITIZER', 1)
> > > +endif
> > >
> > >  ###
> > >  # NLS / Gettext
> > > diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
> > > index ac409b0006..e18e716d9c 100644
> > > --- a/src/bin/initdb/initdb.c
> > > +++ b/src/bin/initdb/initdb.c
> > > @@ -338,6 +338,17 @@ do { \
> > > output_failed = true, output_errno = errno; \
> > >  } while (0)
> > >
> > > +#ifdef USE_ADDRESS_SANITIZER
> >
> > When asan is used  __SANITIZE_ADDRESS__ is defined, so we don't need to
> > implement this ourselves.
>
> Thanks!
>
> > > +const char *__asan_default_options(void);
> > > +
> > > +const char *__asan_default_options(void)
> > > +{
> > > +   return "detect_leaks=0";
> > > +}
> > > +
> > > +#endif
> >
> > Wonder if we should move this into some static library and link it into all
> > binaries that don't want leak detection? It doesn't seem great to have to
> > adjust this in a bunch of files if we want to adjust the options...
>
> See attached patches. Here is what I found to be necessary to get
> a -Db_sanitize=address,undefined build to successfully make it through
> tests. I do have a few concerns about the patch.
>
> 1. For whatever reason, __SANITIZE_LEAK__ is not defined when the leak
>sanitizer is enabled. So you will see me use this, to make some
>include directives work. I don't like this as a final solution
>because someone could use -fsanitize=leak.
> 2. I tracked down what seems to be a valid leak in adt/xml.c. Attached
>(test.sql) is a fairly minimal reproduction of what is needed to show
>the leak. I didn't spend too much time tracking it down. Might get to
>it later, who knows. Below you will find the backtrace, and whoever
>wants to try their hand at fixing it will need to comment out
>xmlNewNode in the leak.supp file.
> 3. I don't love the new library name. Maybe it should be name more lsan
>specific.
> 4. Should pg_attribute_no_asan be renamed to
>pg_attribute_no_sanitize_address? That would match
>pg_attribute_no_sanitize_alignment.
>
> I will also attach a Meson test log for good measure. I didn't try
> testing anything that requires PG_TEST_EXTRA, but I suspect that
> everything will be fine.
>
> Alexander, I haven't yet gotten to the things you pointed out in the
> sibling thread.
>
> ==221848==ERROR: LeakSanitizer: detected memory leaks
>
> Direct leak of 120 byte(s) in 1 object(s) allocated from:
> #0 0x7fac4a6d92ef in malloc (/lib64/libasan.so.8+0xd92ef) (BuildId: 
> 7fcb7759bc17ef47f9682414b6d99732d6a6ab0c)
> #1 0x7fac4a48427d in xmlNewNode (/lib64/libxml2.so.2+0x5d27d) (BuildId: 
> 3074681c8fa9b17e0cbed09bc61c25ada5c28e7c)
> #2 0x22107a6 in xmltotext_with_options ../src/backend/utils/adt/xml.c:754
> #3 0xead047 in ExecEvalXmlExpr 
> ../src/backend/executor/execExprInterp.c:4020
> #4 0xe8c119 in ExecInterpExpr 
> ../src/backend/executor/execExprInterp.c:1537
> #5 0xe91f2c in ExecInterpExprStillValid 
> ../src/backend/executor/execExprInterp.c:1881
> #6 0x109632d in ExecEvalExprSwitchContext 
> ../src/include/executor/executor.h:355
> #7 0x109655d in ExecProject ../src/include/executor/executor.h:389
> #8 0x1097186 in ExecResult ../src/backend/executor/nodeResult.c:136
> #9 0xf0f90c in ExecProcNodeFirst 
> ../src/backend/executor/execProcnode.c:464
> #10 0xec9bec in ExecProcNode ../src/include/executor/executor.h:273
> #11 0xed875c in ExecutePlan ../src/backend/executor/execMain.c:1670
> #12 0xecbee0 in standard_ExecutorRun 
> ../src/backend/executor/execMain.c:365
> #13 

Re: Psql meta-command conninfo+

2024-09-16 Thread Jim Jones



On 16.09.24 08:51, Hunaid Sohail wrote:
> I have attached a new patch that now prints all info in tabular format
> for \conninfo+. I have also made the table output dynamic, so if the
> connection uses SSL, the columns in the table will expand accordingly.
>
It looks much cleaner now.
> I have also updated the documentation.

The CF bot is still giving some warnings:

command.c:886:79: error: ‘alpn’ may be used uninitialized [-Werror=maybe-uninitialized]
  886 |   printTableAddCell(&cont, (alpn && alpn[0] != '\0') ? alpn : _("none"), false, false);
command.c:803:50: note: ‘alpn’ was declared here
  803 |   *alpn;
command.c:885:82: error: ‘compression’ may be used uninitialized [-Werror=maybe-uninitialized]
  885 |   printTableAddCell(&cont, (compression && strcmp(compression, "off") != 0) ? _("on") : _("off"), false, false);
command.c:802:50: note: ‘compression’ was declared here
  802 |   *compression,
command.c:884:41: error: ‘cipher’ may be used uninitialized [-Werror=maybe-uninitialized]
  884 |   printTableAddCell(&cont, cipher ? cipher : _("unknown"), false, false);
command.c:801:50: note: ‘cipher’ was declared here
  801 |   *cipher,
command.c:883:41: error: ‘protocol’ may be used uninitialized [-Werror=maybe-uninitialized]
  883 |   printTableAddCell(&cont, protocol ? protocol : _("unknown"), false, false);
command.c:800:42: note: ‘protocol’ was declared here
  800 |   char    *protocol,

I have a few questions regarding this example:

$ /usr/local/postgres-dev/bin/psql -x "\
    host=server.uni-muenster.de
    hostaddr=192.168.178.27
    user=jim dbname=db port=5432
    sslmode=verify-full
    sslrootcert=server-certificates/server.crt
    sslcert=jim-certificates/jim.crt
    sslkey=jim-certificates/jim.key"

psql (18devel)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384,
compression: off, ALPN: postgresql)
Type "help" for help.

db=# SET ROLE foo;
SET

db=> SELECT current_user, session_user;
-[ RECORD 1 ]+
current_user | foo
session_user | jim

db=> \conninfo+
Connection Information
-[ RECORD 1 ]+---
Database | db
Current User | jim
Session User | jim
Host | server.uni-muenster.de
Host Address | 192.168.178.27
Port | 5432
Protocol Version | 3
SSL Connection   | yes
SSL Protocol | TLSv1.3
Cipher   | TLS_AES_256_GCM_SHA384
Compression  | off
ALPN | postgresql
GSSAPI Authenticated | no
Client Encoding  | UTF8
Server Encoding  | UTF8
Backend PID  | 315187


* The value of "Current User" does not match the function current_user()
--- as one might expect. It is a little confusing, as there is no
mention of "Current User" in the docs. In case this is the intended
behaviour, could you please add it to the docs?

* "SSL Connection" says "yes", but the docs say: "True if the current
connection to the server uses SSL, and false otherwise." Is it supposed
to be like this? I haven't checked other similar doc entries.


--
Jim




Re: scalability bottlenecks with (many) partitions (and more)

2024-09-16 Thread Tomas Vondra



On 9/16/24 15:11, Jakub Wartak wrote:
> On Fri, Sep 13, 2024 at 1:45 AM Tomas Vondra  wrote:
> 
>> [..]
> 
>> Anyway, at this point I'm quite happy with this improvement. I didn't
>> have any clear plan when to commit this, but I'm considering doing so
>> sometime next week, unless someone objects or asks for some additional
>> benchmarks etc.
> 
> Thank you very much for working on this :)
> 
> The only fact that comes to my mind is that we could blow up L2
> caches. Fun fact, so if we are growing PGPROC by 6.3x, that's going to
> be like one or two 2MB huge pages more @ common max_connections=1000
> x86_64 (830kB -> ~5.1MB), and indeed:
> 
> # without patch:
> postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
> shared_memory_size_in_huge_pages
> 177
> 
> # with patch:
> postgres@hive:~$ /usr/pgsql18/bin/postgres -D /tmp/pg18 -C
> shared_memory_size_in_huge_pages
> 178
> 
> So playing Devil's advocate , the worst situation that could possibly
> hurt (?) could be:
> * memory size of PGPROC working set >> L2_cache (thus very high
> max_connections),
> * insane number of working sessions on CPU (sessions >> VCPU) - sadly
> happens to some,
> * those sessions wouldn't have to be competing for the same Oids -
> just fetching this new big fpLockBits[] structure - so probing a lot
> for lots of Oids, but *NOT* having to use futex() syscall [so not that
> syscall price]
> * no huge pages (to cause dTLB misses)
> 
> then maybe(?) one could observe further degradation of dTLB misses in
> the perf-stat counter under some microbenchmark, but measuring that
> requires isolated and physical hardware. Maybe that would be actually
> noise due to overhead of context-switches itself. Just trying to think
> out loud, what big PGPROC could cause here. But this is already an
> unhealthy and non-steady state of the system, so IMHO we are good,
> unless someone comes up with a better (more evil) idea.
> 

I've been thinking about such cases too, but I don't think it can really
happen in practice, because:

- How likely is it that the sessions will need a lot of OIDs, but not
the same ones? Also, why would it matter that the OIDs are not the same,
I don't think it matters unless one of the sessions needs an exclusive
lock, at which point the optimization doesn't really matter.

- If having more fast-path slots means it doesn't fit into L2 cache,
would we fit into L2 without it? I don't think so - if there really are
that many locks, we'd have to add those into the shared lock table, and
there's a lot of extra stuff to keep in memory (relcaches, ...).

This is pretty much one of the cases I focused on in my benchmarking,
and I'm yet to see any regression.


>>> I did look at docs if anything needs updating, but I don't think so. The
> SGML docs only talk about fast-path locking at fairly high level, not
> about how many we have etc.
> 
> Well the only thing I could think of was to add to the
> doc/src/sgml/config.sgml / "max_locks_per_transaction" GUC, that "it
> is also used as advisory for the number of groups used in
> lockmanager's fast-path implementation" (that is, without going into
> further discussion, as even pg_locks discussion
> doc/src/sgml/system-views.sgml simply uses that term).
> 

Thanks, I'll consider mentioning this in max_locks_per_transaction.
Also, I think there's a place calculating the amount of per-connection
memory, so maybe that needs to be updated too.
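
For anyone following along, the fast-path usage is visible from SQL as well.
A minimal sketch, assuming an existing table or partition named part_0:

```sql
BEGIN;
SELECT count(*) FROM part_0;   -- acquires AccessShareLock on part_0
SELECT locktype, relation::regclass, mode, fastpath
FROM pg_locks
WHERE pid = pg_backend_pid();
COMMIT;
```

Locks taken via the per-backend fast-path slots show fastpath = true; once a
backend runs out of slots, the remaining locks go through the shared lock
table and show fastpath = false.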


regards

-- 
Tomas Vondra




Re: Psql meta-command conninfo+

2024-09-16 Thread Alvaro Herrera
On 2024-Sep-16, Jim Jones wrote:

> * The value of "Current User" does not match the function current_user()
> --- as one might expect. It is a little confusing, as there is no
> mention of "Current User" in the docs. In case this is the intended
> behaviour, could you please add it to the docs?

It is intended.  As Peter said[1], what we wanted was to display
client-side info, so PQuser() is the right thing to do.  Now maybe
"Current User" is not the perfect column header, but at least the
definition seems consistent with the desired end result.  Now, I think
the current docs saying to look at session_user() are wrong, they should
point to the libpq docs for the function instead; something like "The
name of the current user, as returned by PQuser()" and so on.
Otherwise, in the cases where these things differ, it is going to be
hard to explain.

[1] https://postgr.es/m/f4fc729d-7903-4d58-995d-6cd146049...@eisentraut.org

> * "SSL Connection" says "yes", but the docs say: "True if the current
> connection to the server uses SSL, and false otherwise.". Is it supposed
> to be like this? I haven't checked other similar doc entries..

Yeah, I think we should print what a boolean value would look like from
SQL, so "true" rather than "yes".


I think the code structure is hard to follow.  It would be simpler if it
was a bunch of

/* Database */
printTableAddHeader(&cont, _("Database"), true, 'l');
printTableAddCell(&cont, db, false, false);
...
/* Port */
printTableAddHeader(&cont, _("Port"), true, 'l');
printTableAddCell(&cont, PQport(pset.db), false, false);
/* ... */

And so on.  I don't think the printTable() API forces you to set all
headers first followed by all cells.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2024-09-16 Thread Anton A. Melnikov

Hi!

On 13.09.2024 18:20, Fujii Masao wrote:


If I understand correctly, restartpoints_timed and restartpoints_done were
separated because a restartpoint can be skipped. restartpoints_timed counts
when a restartpoint is triggered by a timeout, whether it runs or not,
while restartpoints_done only tracks completed restartpoints.

Similarly, I believe checkpoints should be handled the same way.
Checkpoints can also be skipped when the system is idle, but currently,
num_timed counts even the skipped ones, despite its documentation stating
it's the "Number of scheduled checkpoints that have been performed."

Why not separate num_timed into something like checkpoints_timed and
checkpoints_done to reflect these different counters?


+1
This idea seems quite tenable to me.

One small clarification: if there were no skipped restartpoints, then
restartpoints_done will be equal to restartpoints_timed + restartpoints_req,
and similarly for checkpoints.
So I tried to introduce a num_done counter for checkpoints in the attached patch.

I'm not sure whether we should add a regression test for the case where
num_done is less than num_timed + num_requested; I haven't been able to
produce such a case quickly yet.

E.g. such a case can be obtained by provoking the error "checkpoints are
occurring too frequently" as follows:
- set checkpoint_timeout = 30 and checkpoint_warning = 40 in postgresql.conf
- start the server
- periodically do bulk insertions in the first client (e.g. insert into test
values (generate_series(1,1E7));)
- watch pg_stat_checkpointer in the second one:
-watch for pg_stat_checkpointer in the 2nd one:
# SELECT CURRENT_TIME; select * from pg_stat_checkpointer;
# \watch

After some time, in the log will appear:
2024-09-16 16:38:47.888 MSK [193733] LOG:  checkpoints are occurring too 
frequently (13 seconds apart)
2024-09-16 16:38:47.888 MSK [193733] HINT:  Consider increasing the configuration 
parameter "max_wal_size".

And num_timed + num_requested will become greater than num_done.

It would be nice to find a simpler and faster way.
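
A minimal sketch of the invariant being discussed, assuming the num_done
column from the attached patch (only num_timed and num_requested exist today):

```sql
SELECT num_timed, num_requested, num_done,
       num_timed + num_requested - num_done AS not_performed
FROM pg_stat_checkpointer;
```

With the patch, not_performed would normally stay at zero unless some
checkpoints were skipped (or one is currently in progress).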


With the best regards,

--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

From fcd0b61d1f1718dbf664cb3509aad16543d65375 Mon Sep 17 00:00:00 2001
From: "Anton A. Melnikov" 
Date: Mon, 16 Sep 2024 16:12:07 +0300
Subject: [PATCH] Introduce num_done counter in the pg_stat_checkpointer view
 that reflects number of really performed checkpoints.

---
 doc/src/sgml/monitoring.sgml  | 13 +-
 src/backend/catalog/system_views.sql  |  1 +
 src/backend/postmaster/checkpointer.c |  2 +
 .../utils/activity/pgstat_checkpointer.c  |  2 +
 src/backend/utils/adt/pgstatfuncs.c   |  6 +++
 src/include/catalog/pg_proc.dat   | 40 +++
 src/include/pgstat.h  |  1 +
 src/test/regress/expected/rules.out   |  1 +
 8 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 933de6fe07f..dad7e236a43 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3051,7 +3051,7 @@ description | Waiting for a newly initialized WAL file to reach durable storage
num_timed bigint
   
   
-   Number of scheduled checkpoints that have been performed
+   Number of scheduled checkpoints
   
  
 
@@ -3060,7 +3060,16 @@ description | Waiting for a newly initialized WAL file to reach durable storage
num_requested bigint
   
   
-   Number of requested checkpoints that have been performed
+   Number of backend requested checkpoints
+  
+ 
+
+  
+  
+   num_done bigint
+  
+  
+   Number of checkpoints that have been performed
   
  
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7fd5d256a18..49109dbdc86 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1138,6 +1138,7 @@ CREATE VIEW pg_stat_checkpointer AS
 SELECT
 pg_stat_get_checkpointer_num_timed() AS num_timed,
 pg_stat_get_checkpointer_num_requested() AS num_requested,
+pg_stat_get_checkpointer_num_performed() AS num_done,
 pg_stat_get_checkpointer_restartpoints_timed() AS restartpoints_timed,
 pg_stat_get_checkpointer_restartpoints_requested() AS restartpoints_req,
 pg_stat_get_checkpointer_restartpoints_performed() AS restartpoints_done,
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index eeb73c85726..06ad2f52f27 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -495,6 +495,8 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 if (do_restartpoint)
 	PendingCheckpointerStats.restartpoints_performed++;
+else
+	PendingCheckpointerStats.num_performed++;
 			}
 			else
 			{
diff --git a/src/backend/utils/activi

Re: Add contrib/pg_logicalsnapinspect

2024-09-16 Thread Bertrand Drouvot
Hi,

On Mon, Sep 16, 2024 at 04:02:51PM +0530, shveta malik wrote:
> On Wed, Sep 11, 2024 at 4:21 PM Bertrand Drouvot
>  wrote:
> >
> >
> > Yeah, good idea. Done that way in v3 attached.
> >
> 
> Thanks for the patch. +1 on the patch's idea. I have started
> reviewing/testing it. It is WIP, but please find a few initial comments:

Thanks for sharing your thoughts and for the review!

> 
> src/backend/replication/logical/snapbuild.c:
> 
> 1)
> + fsync_fname("pg_logical/snapshots", true);
> 
> Should we use macros PG_LOGICAL_DIR and PG_LOGICAL_SNAPSHOTS_DIR in
> ValidateSnapshotFile(), instead of hard coding the path?
> 
> 
> 2)
> Same as above in pg_get_logical_snapshot_meta() and
> pg_get_logical_snapshot_info()
> 
> + sprintf(path, "pg_logical/snapshots/%X-%X.snap",
> + LSN_FORMAT_ARGS(lsn));
>

Doh! Yeah, I agree that we should use those macros. They come from c39afc38cf,
which was introduced after v1 of this patch. I thought I had taken care of it
once c39afc38cf went in, but it looks like I missed it somehow. Done in v4
attached, thanks!
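
As a side note for reviewers who haven't opened the patch yet, here is a rough
sketch of how the two functions are meant to be called (the argument type is my
assumption based on the quoted sprintf() and may differ in later versions):

```sql
CREATE EXTENSION pg_logicalinspect;

-- inspect a serialized snapshot under pg_logical/snapshots/, identified by LSN
SELECT * FROM pg_get_logical_snapshot_meta('0/1693980');
SELECT * FROM pg_get_logical_snapshot_info('0/1693980');
```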
 
> 3)
> +#include "replication/internal_snapbuild.h"
> 
> Shall we name new file as 'snapbuild_internal.h' instead of
> 'internal_snapbuild.h'. Please see other files' name under
> './src/include/replication':
> worker_internal.h
> walsender_private.h

Agree, that should be snapbuild_internal.h, done in v4.

> 
> 4)
> +static void ValidateSnapshotFile(XLogRecPtr lsn, SnapBuildOnDisk *ondisk,
> + const char *path);
> 
> Is it required? We generally don't add declaration unless required by
> compiler. Since definition is prior to usage, it is not needed?
>

I personally prefer to add them even if they are not required by the compiler.
I did not pay attention to the "We generally don't add declaration unless
required by compiler" convention, and (after a quick check) I did not find any
reference to it in the coding style documentation [1]. That said, I don't have
a strong opinion about it, so I removed them in v4. Would it be worth adding a
mention to the coding conventions doc?


[1]: https://www.postgresql.org/docs/current/source.html

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From 7f824aae484f7275bb4119e1d5b060bce74c3058 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot 
Date: Wed, 14 Aug 2024 08:46:05 +
Subject: [PATCH v4] Add contrib/pg_logicalinspect

Provides SQL functions that allow to inspect logical decoding components.

It currently allows to inspect the contents of serialized logical snapshots of
a running database cluster, which is useful for debugging or educational
purposes.
---
 contrib/Makefile  |   1 +
 contrib/meson.build   |   1 +
 contrib/pg_logicalinspect/.gitignore  |   4 +
 contrib/pg_logicalinspect/Makefile|  31 +++
 .../expected/logical_inspect.out  |  52 
 contrib/pg_logicalinspect/logicalinspect.conf |   1 +
 contrib/pg_logicalinspect/meson.build |  39 +++
 .../pg_logicalinspect--1.0.sql|  43 +++
 contrib/pg_logicalinspect/pg_logicalinspect.c | 248 ++
 .../pg_logicalinspect.control |   5 +
 .../specs/logical_inspect.spec|  34 +++
 doc/src/sgml/contrib.sgml |   1 +
 doc/src/sgml/filelist.sgml|   1 +
 doc/src/sgml/pglogicalinspect.sgml| 145 ++
 src/backend/replication/logical/snapbuild.c   | 190 +-
 src/include/port/pg_crc32c.h  |  16 +-
 src/include/replication/snapbuild.h   |   2 +-
 src/include/replication/snapbuild_internal.h  | 204 ++
 18 files changed, 825 insertions(+), 193 deletions(-)
   7.6% contrib/pg_logicalinspect/expected/
   5.7% contrib/pg_logicalinspect/specs/
  32.3% contrib/pg_logicalinspect/
  13.3% doc/src/sgml/
  17.5% src/backend/replication/logical/
   4.1% src/include/port/
  19.0% src/include/replication/

diff --git a/contrib/Makefile b/contrib/Makefile
index abd780f277..952855d9b6 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -32,6 +32,7 @@ SUBDIRS = \
 		passwordcheck	\
 		pg_buffercache	\
 		pg_freespacemap \
+		pg_logicalinspect \
 		pg_prewarm	\
 		pg_stat_statements \
 		pg_surgery	\
diff --git a/contrib/meson.build b/contrib/meson.build
index 14a8906865..159ff41555 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -46,6 +46,7 @@ subdir('passwordcheck')
 subdir('pg_buffercache')
 subdir('pgcrypto')
 subdir('pg_freespacemap')
+subdir('pg_logicalinspect')
 subdir('pg_prewarm')
 subdir('pgrowlocks')
 subdir('pg_stat_statements')
diff --git a/contrib/pg_logicalinspect/.gitignore b/contrib/pg_logicalinspect/.gitignore
new file mode 100644
index 00..5dcb3ff972
--- /dev/null
+++ b/contrib/pg_logicalinspect/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/contrib/pg_logicalinspect/Mak

Re: AIO v2.0

2024-09-16 Thread Noah Misch
On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> There's plenty more to do, but I thought this would be a useful checkpoint.

I find patches 1-5 are Ready for Committer.

> +typedef enum PgAioHandleState

This enum clarified a lot for me, so I wish I had read it before anything
else.  I recommend referring to it in README.md.  Would you also cover the
valid state transitions and which of them any backend can do vs. which are
specific to the defining backend?

> +{
> + /* not in use */
> + AHS_IDLE = 0,
> +
> + /* returned by pgaio_io_get() */
> + AHS_HANDED_OUT,
> +
> + /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet 
> */
> + AHS_DEFINED,
> +
> + /* subjects prepare() callback has been called */
> + AHS_PREPARED,
> +
> + /* IO is being executed */
> + AHS_IN_FLIGHT,

Let's align terms between functions and states those functions reach.  For
example, I recommend calling this state AHS_SUBMITTED, because
pgaio_io_prepare_submit() is the function reaching this state.
(Alternatively, use in_flight in the function name.)

> +
> + /* IO finished, but result has not yet been processed */
> + AHS_REAPED,
> +
> + /* IO completed, shared completion has been called */
> + AHS_COMPLETED_SHARED,
> +
> + /* IO completed, local completion has been called */
> + AHS_COMPLETED_LOCAL,
> +} PgAioHandleState;

> +void
> +pgaio_io_release_resowner(dlist_node *ioh_node, bool on_error)
> +{
> + PgAioHandle *ioh = dlist_container(PgAioHandle, resowner_node, 
> ioh_node);
> +
> + Assert(ioh->resowner);
> +
> + ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
> + ioh->resowner = NULL;
> +
> + switch (ioh->state)
> + {
> + case AHS_IDLE:
> + elog(ERROR, "unexpected");
> + break;
> + case AHS_HANDED_OUT:
> + Assert(ioh == my_aio->handed_out_io || 
> my_aio->handed_out_io == NULL);
> +
> + if (ioh == my_aio->handed_out_io)
> + {
> + my_aio->handed_out_io = NULL;
> + if (!on_error)
> + elog(WARNING, "leaked AIO handle");
> + }
> +
> + pgaio_io_reclaim(ioh);
> + break;
> + case AHS_DEFINED:
> + case AHS_PREPARED:
> + /* XXX: Should we warn about this when is_commit? */

Yes.

> + pgaio_submit_staged();
> + break;
> + case AHS_IN_FLIGHT:
> + case AHS_REAPED:
> + case AHS_COMPLETED_SHARED:
> + /* this is expected to happen */
> + break;
> + case AHS_COMPLETED_LOCAL:
> + /* XXX: unclear if this ought to be possible? */
> + pgaio_io_reclaim(ioh);
> + break;
> + }

> +void
> +pgaio_io_ref_wait(PgAioHandleRef *ior)
> +{
> + uint64  ref_generation;
> + PgAioHandleState state;
> + boolam_owner;
> + PgAioHandle *ioh;
> +
> + ioh = pgaio_io_from_ref(ior, &ref_generation);
> +
> + am_owner = ioh->owner_procno == MyProcNumber;
> +
> +
> + if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> + return;
> +
> + if (am_owner)
> + {
> + if (state == AHS_DEFINED || state == AHS_PREPARED)
> + {
> + /* XXX: Arguably this should be prevented by callers? */
> + pgaio_submit_staged();

Agreed for AHS_DEFINED, if not both.  AHS_DEFINED here would suggest a past
longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.  Even so,
the next point might remove the need here:

> +void
> +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> +{
> + Assert(ioh->state == AHS_HANDED_OUT);
> + Assert(pgaio_io_has_subject(ioh));
> +
> + ioh->op = op;
> + ioh->state = AHS_DEFINED;
> + ioh->result = 0;
> +
> + /* allow a new IO to be staged */
> + my_aio->handed_out_io = NULL;
> +
> + pgaio_io_prepare_subject(ioh);
> +
> + ioh->state = AHS_PREPARED;

As defense in depth, let's add a critical section from before assigning
AHS_DEFINED to here.  This code already needs to be safe for that (per
README.md).  When running outside a critical section, an ERROR in a subject
callback could leak the lwlock disowned in shared_buffer_prepare_common().  I
doubt there's a plausible way to reach that leak today, but future subject
callbacks could add risk over time.

> +if test "$with_liburing" = yes; then
> +  PKG_CHECK_MODULES(LIBURING, liburing)
> +fi

I used the attached makefile patch to build w/ liburing.

> +pgaio_uring_shmem_init(bool first_time)
> +{
> + uint32  TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS - 
> MAX_IO_WORKERS;
> + bool   

Re: Regression tests fail with tzdata 2024b

2024-09-16 Thread Tom Lane
Sven Klemm  writes:
> This is an unfortunate change, as it will break extension tests that use
> pg_regress. We run our tests against multiple minor versions, and this
> getting backported means our tests will fail with the next minor PG
> release. Is there a workaround available to make the timezone for
> pg_regress configurable without going into every test?

Configurable to what?  If your test cases are dependent on the
historical behavior of PST8PDT, you're out of luck, because that
simply isn't available anymore (or won't be once 2024b reaches
your platform, anyway).

regards, tom lane




Re: Psql meta-command conninfo+

2024-09-16 Thread Tom Lane
Alvaro Herrera  writes:
> On 2024-Sep-16, Jim Jones wrote:
>> * The value of "Current User" does not match the function current_user()
>> --- as one might expect. It is a little confusing, as there is no
>> mention of "Current User" in the docs. In case this is the intended
>> behaviour, could you please add it to the docs?

> It is intended.  As Peter said[1], what we wanted was to display
> client-side info, so PQuser() is the right thing to do.  Now maybe
> "Current User" is not the perfect column header, but at least the
> definition seems consistent with the desired end result.

Seems like "Session User" would be closer to being accurate, since
PQuser()'s result does not change when you do SET ROLE etc.
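
A quick sketch of the difference, with made-up role names:

```sql
CREATE ROLE alice LOGIN;
CREATE ROLE bob;
GRANT bob TO alice;

-- in a session opened as alice:
SET ROLE bob;
SELECT current_user, session_user;  -- bob | alice
```

current_user follows SET ROLE, while session_user (and the libpq-level
PQuser() value) still reports the user the connection was opened as.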

> Now, I think
> the current docs saying to look at session_user() are wrong, they should
> point to the libpq docs for the function instead; something like "The
> name of the current user, as returned by PQuser()" and so on.

Sure, but this does not excuse choosing a misleading column name
when there are better choices readily available.

regards, tom lane




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-16 Thread Noah Misch
On Thu, Sep 12, 2024 at 03:42:48PM -0700, Masahiko Sawada wrote:
> On Tue, Sep 10, 2024 at 3:05 PM Noah Misch  wrote:
> > On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> > > Got it.  So now I'm wondering if we need all the complexity of storing
> > > stuff in the GIN metapages.  Could we simply read the (primary's)
> > > signedness out of pg_control and use that?

> I've attached a PoC patch for this idea. We write  the default char
> signedness to the control file at initdb time. Then when comparing two
> trgms, pg_trgm opclasses use a comparison function based on the char
> signedness of the cluster. I've confirmed that the patch fixes the
> reported case at least.

I agree that proves the concept.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-16 Thread Bharath Rupireddy
Hi,

Thanks for looking into this.

On Mon, Sep 16, 2024 at 4:54 PM Amit Kapila  wrote:
>
> Why raise the ERROR just for timeout invalidation here and why not if
> the slot is invalidated for other reasons? This raises the question of
> what happens before this patch if the invalid slot is used from places
> where we call ReplicationSlotAcquire(). I did a brief code analysis
> and found that for StartLogicalReplication(), even if the error won't
> occur in ReplicationSlotAcquire(), it would have been caught in
> CreateDecodingContext(). I think that is where we should also add this
> new error. Similarly, pg_logical_slot_get_changes_guts() and other
> logical replication functions should be calling
> CreateDecodingContext() which can raise the new ERROR. I am not sure
> about how the invalid slots are handled during physical replication,
> please check the behavior of that before this patch.

When physical slots are invalidated for the wal_removed reason, the failure
happens at a much later point for the streaming standbys, while reading the
requested WAL files, like the following:

2024-09-16 16:29:52.416 UTC [876059] FATAL:  could not receive data from
WAL stream: ERROR:  requested WAL segment 00010005 has
already been removed
2024-09-16 16:29:52.416 UTC [872418] LOG:  waiting for WAL to become
available at 0/5002000

At this point, despite the slot being invalidated, its wal_status can still
come back to 'unreserved' even from 'lost', and the standby can catch up if
the removed WAL files are copied back to the primary's pg_wal directory,
either manually or by a tool/script. IOW, physical slots invalidated due to
wal_removed are *somehow* recoverable, unlike logical slots.
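
For reference, the state being discussed is visible like this (a sketch;
invalidation_reason is new in v17):

```sql
SELECT slot_name, slot_type, active, wal_status, invalidation_reason
FROM pg_replication_slots;
```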

IIUC, the invalidation of a slot implies that it is not guaranteed to hold
any resources like WAL and XMINs. Does it also imply that the slot must be
unusable?

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Re: Using per-transaction memory contexts for storing decoded tuples

2024-09-16 Thread Masahiko Sawada
On Fri, Sep 13, 2024 at 3:58 AM Amit Kapila  wrote:
>
> On Thu, Sep 12, 2024 at 4:03 AM Masahiko Sawada  wrote:
> >
> > We have several reports that logical decoding uses memory much more
> > than logical_decoding_work_mem[1][2][3]. For instance in one of the
> > reports[1], even though users set logical_decoding_work_mem to
> > '256MB', a walsender process was killed by OOM because of using more
> > than 4GB memory.
> >
> > As per the discussion in these threads so far, what happened was that
> > there was huge memory fragmentation in rb->tup_context.
> > rb->tup_context uses GenerationContext with 8MB memory blocks. We
> > cannot free memory blocks until all memory chunks in the block are
> > freed. If there is a long-running transaction making changes, its
> > changes could be spread across many memory blocks and we end up not
> > being able to free memory blocks unless the long-running transaction
> > is evicted or completed. Since we don't account fragmentation, block
> > header size, nor chunk header size into per-transaction memory usage
> > (i.e. txn->size), rb->size could be less than
> > logical_decoding_work_mem but the actual allocated memory in the
> > context is much higher than logical_decoding_work_mem.
> >
>
> It is not clear to me how the fragmentation happens. Is it because of
> some interleaving transactions which have even ended, but whose memory
> has not been released?

In a generation context, we can free a memory block only when all
memory chunks there are freed. Therefore, individual tuple buffers are
already pfree()'ed but the underlying memory blocks are not freed.
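
If it helps while investigating, the walsender's actual allocations (including
the rb->tup_context blocks) can be dumped to the server log from another
session. A sketch, assuming a walsender is running:

```sql
SELECT pg_log_backend_memory_contexts(pid)
FROM pg_stat_activity
WHERE backend_type = 'walsender';
```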

> Can we try reducing the size of
> 8MB memory blocks? The comment atop allocation says: "XXX the
> allocation sizes used below pre-date generation context's block
> growing code.  These values should likely be benchmarked and set to
> more suitable values.", so do we need some tuning here?

Reducing the size of the 8MB memory block would be one solution and
could be better as it could be back-patchable. It would mitigate the
problem but would not resolve it. I agree to try reducing it and do
some benchmark tests. If it reasonably makes the problem less likely
to happen, it would be a good solution.

>
> > We can reproduce this issue with the attached patch
> > rb_excessive_memory_reproducer.patch (not intended to commit) that
> > adds a memory usage reporting and includes the test. After running the
> > tap test contrib/test_decoding/t/002_rb_memory.pl, we can see a memory
> > usage report in the server logs like follows:
> >
> > LOG:  reorderbuffer memory: logical_decoding_work_mem=268435456
> > rb->size=17816832 rb->tup_context=1082130304 rb->context=1086267264
> >
> > Which means that the logical decoding allocated 1GB memory in spite of
> > logical_decoding_work_mem being 256MB.
> >
> > One idea to deal with this problem is that we use
> > MemoryContextMemAllocated() as the reorderbuffer's memory usage
> > instead of txn->total_size. That is, we evict transactions until the
> > value returned by MemoryContextMemAllocated() gets lower than
> > logical_decoding_work_mem. However, it could end up evicting a lot of
> > (possibly all) transactions because the transaction whose decoded
> > tuples data are spread across memory context blocks could be evicted
> > after all other larger transactions are evicted (note that we evict
> > transactions from largest to smallest). Therefore, if we want to do
> > that, we would need to change the eviction algorithm to for example
> > LSN order instead of transaction size order. Furthermore,
> > reorderbuffer's changes that are counted in txn->size (and therefore
> > rb->size too) are stored in different memory contexts depending on the
> > kinds. For example, decoded tuples are stored in rb->context,
> > ReorderBufferChange are stored in rb->change_context, and snapshot
> > data are stored in builder->context. So we would need to sort out
> > which data is stored into which memory context and which memory
> > context should be accounted for in the reorderbuffer's memory usage.
> > Which could be a large amount of work.
> >
> > Another idea is to have memory context for storing decoded tuples per
> > transaction. That way, we can ensure that all memory blocks of the
> > context are freed when the transaction is evicted or completed. I
> > think that this idea would be simpler and worth considering, so I
> > attached the PoC patch, use_tup_context_per_ctx_v1.patch. Since the
> > decoded tuple data is created individually when the corresponding WAL
> > record is decoded but is released collectively when it is released
> > (e.g., transaction eviction), the bump memory context would fit the
> > best this case. One exception is that we immediately free the decoded
> > tuple data if the transaction is already aborted. However, in this
> > case, I think it is possible to skip the WAL decoding in the first
> > place.
> >
> > One thing we need to consider is that the number of

Re: Document DateStyle effect on jsonpath string()

2024-09-16 Thread David E. Wheeler
On Sep 11, 2024, at 15:52, David E. Wheeler  wrote:

> WFM, though now I’ll have to go change my port 😂.

I saw this was committed in cb599b9. Thank you!

BTW, will the back-patch to 17 (cc4fdfa) be included in 17.0 or 17.1?

Best,

David





Re: Document DateStyle effect on jsonpath string()

2024-09-16 Thread Tom Lane
"David E. Wheeler"  writes:
> BTW, will the back-patch to 17 (cc4fdfa) be included in 17.0 or 17.1?

17.0.  If we were already past 17.0 I'd have a lot more angst
about changing this behavior.

regards, tom lane




Re: AIO v2.0

2024-09-16 Thread Andres Freund
Hi,

Thanks for the review!

On 2024-09-16 07:43:49 -0700, Noah Misch wrote:
> On Fri, Sep 06, 2024 at 03:38:16PM -0400, Andres Freund wrote:
> > There's plenty more to do, but I thought this would be a useful checkpoint.
>
> I find patches 1-5 are Ready for Committer.

Cool!


> > +typedef enum PgAioHandleState
>
> This enum clarified a lot for me, so I wish I had read it before anything
> else.  I recommend referring to it in README.md.

Makes sense.


> Would you also cover the valid state transitions and which of them any
> backend can do vs. which are specific to the defining backend?

Yea, we should. I had something earlier, but because details were still
changing it was hard to keep up to date.


> > +{
> > +   /* not in use */
> > +   AHS_IDLE = 0,
> > +
> > +   /* returned by pgaio_io_get() */
> > +   AHS_HANDED_OUT,
> > +
> > +   /* pgaio_io_start_*() has been called, but IO hasn't been submitted yet 
> > */
> > +   AHS_DEFINED,
> > +
> > +   /* subjects prepare() callback has been called */
> > +   AHS_PREPARED,
> > +
> > +   /* IO is being executed */
> > +   AHS_IN_FLIGHT,
>
> Let's align terms between functions and states those functions reach.  For
> example, I recommend calling this state AHS_SUBMITTED, because
> pgaio_io_prepare_submit() is the function reaching this state.
> (Alternatively, use in_flight in the function name.)

There used to be a separate SUBMITTED, but I removed it at some point as not
necessary anymore. Arguably it might be useful to re-introduce it so that
e.g. with worker mode one can tell the difference between the IO being queued
and the IO actually being processed.


> > +void
> > +pgaio_io_ref_wait(PgAioHandleRef *ior)
> > +{
> > +   uint64  ref_generation;
> > +   PgAioHandleState state;
> > +   boolam_owner;
> > +   PgAioHandle *ioh;
> > +
> > +   ioh = pgaio_io_from_ref(ior, &ref_generation);
> > +
> > +   am_owner = ioh->owner_procno == MyProcNumber;
> > +
> > +
> > +   if (pgaio_io_was_recycled(ioh, ref_generation, &state))
> > +   return;
> > +
> > +   if (am_owner)
> > +   {
> > +   if (state == AHS_DEFINED || state == AHS_PREPARED)
> > +   {
> > +   /* XXX: Arguably this should be prevented by callers? */
> > +   pgaio_submit_staged();
>
> Agreed for AHS_DEFINED, if not both.  AHS_DEFINED here would suggest a past
> longjmp out of pgaio_io_prepare() w/o a subxact rollback to cleanup.

That, or not having submitted the IO.  One thing I've been thinking about as
being potentially helpful infrastructure is to have something similar to a
critical section, except that it asserts that one is not allowed to block or
forget submitting staged IOs.



> > +void
> > +pgaio_io_prepare(PgAioHandle *ioh, PgAioOp op)
> > +{
> > +   Assert(ioh->state == AHS_HANDED_OUT);
> > +   Assert(pgaio_io_has_subject(ioh));
> > +
> > +   ioh->op = op;
> > +   ioh->state = AHS_DEFINED;
> > +   ioh->result = 0;
> > +
> > +   /* allow a new IO to be staged */
> > +   my_aio->handed_out_io = NULL;
> > +
> > +   pgaio_io_prepare_subject(ioh);
> > +
> > +   ioh->state = AHS_PREPARED;
>
> As defense in depth, let's add a critical section from before assigning
> AHS_DEFINED to here.  This code already needs to be safe for that (per
> README.md).  When running outside a critical section, an ERROR in a subject
> callback could leak the lwlock disowned in shared_buffer_prepare_common().  I
> doubt there's a plausible way to reach that leak today, but future subject
> callbacks could add risk over time.

Makes sense.


> > +if test "$with_liburing" = yes; then
> > +  PKG_CHECK_MODULES(LIBURING, liburing)
> > +fi
>
> I used the attached makefile patch to build w/ liburing.

Thanks, will incorporate.


> With EXEC_BACKEND, "make check PG_TEST_INITDB_EXTRA_OPTS=-cio_method=io_uring"
> fails early:

Right  - that's to be expected.

> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  starting PostgreSQL 
> 18devel on x86_64-pc-linux-gnu, compiled by gcc (Debian 13.2.0-13) 13.2.0, 
> 64-bit
> 2024-09-15 12:46:08.168 PDT postmaster[2069397] LOG:  listening on Unix 
> socket "/tmp/pg_regress-xgQOPH/.s.PGSQL.65312"
> 2024-09-15 12:46:08.203 PDT startup[2069423] LOG:  database system was shut 
> down at 2024-09-15 12:46:07 PDT
> 2024-09-15 12:46:08.209 PDT client backend[2069425] [unknown] FATAL:  the 
> database system is starting up
> 2024-09-15 12:46:08.222 PDT postmaster[2069397] LOG:  database system is 
> ready to accept connections
> 2024-09-15 12:46:08.254 PDT autovacuum launcher[2069435] PANIC:  failed: 
> -9/Bad file descriptor
> 2024-09-15 12:46:08.286 PDT client backend[2069444] [unknown] PANIC:  failed: 
> -95/Operation not supported
> 2024-09-15 12:46:08.355 PDT client backend[2069455] [unknown] PANIC:  
> unexpected: -95/Operation not supported: No such file or directory
> 2024-09-15 12:46:08.370 PDT postmaster[2069397] LOG:  received fast shutdown 
> request
>
> I expect that's from io_uring_queue_init() stashing 

Re: AIO v2.0

2024-09-16 Thread Andres Freund
Hi,

On 2024-09-12 14:55:49 -0700, Robert Pang wrote:
> Hi Andres
> 
> Thanks for the AIO patch update. I gave it a try and ran into a FATAL
> in bgwriter when executing a benchmark.
> 
> 2024-09-12 01:38:00.851 PDT [2780939] PANIC:  no more bbs
> 2024-09-12 01:38:00.854 PDT [2780473] LOG:  background writer process
> (PID 2780939) was terminated by signal 6: Aborted
> 2024-09-12 01:38:00.854 PDT [2780473] LOG:  terminating any other
> active server processes
> 
> I debugged a bit and found that BgBufferSync() is not capping the
> batch size under io_bounce_buffers like BufferSync() for checkpoint.
> Here is a small patch to fix it.

Good catch, thanks!


I am hoping (as described in my email to Noah a few minutes ago) that we can
get away from needing bounce buffers. They are a quite expensive solution to a
problem we made for ourselves...

Greetings,

Andres Freund




Re: Detailed release notes

2024-09-16 Thread Bruce Momjian
On Sat, Sep 14, 2024 at 08:37:31PM -0400, Bruce Momjian wrote:
> On Fri, Sep 13, 2024 at 12:39:28PM -0400, Bruce Momjian wrote:
> > I applied this patch to PG 17.  You can see the results at:
> > 
> > https://momjian.us/pgsql_docs/release-17.html
> > 
> > The community doc build only shows the master branch, which is PG 18,
> > and the PG 17 docs are only built before the release.
> > 
> > I changed the patch to use the section symbol "§" instead of showing
> > the hashes.  The hashes seemed too detailed.  Does anyone see a better
> > symbol to use from here?
> > 
> > http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
> > 
> > I think we are limited to the HTML entities listed on that page. I also
> > removed the brackets and the period you added at the end of the text. 
> 
> I wrote the attached Perl script that automatically adds commit links. 
> I tested it against PG 12-17 and the results were good. I plan to add
> this script to all branches and run it on all branch release notes in a
> few days.

I have added the Perl script, added instructions on when to run the
script, and ran the script on all release notes back to PG 12.  I think
we can call this item closed.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: Detailed release notes

2024-09-16 Thread Marcos Pegoraro
On Mon, Sep 16, 2024 at 15:15, Bruce Momjian wrote:

> > > I changed the patch to use the section symbol "§" instead of showing
> > > the hashes.  The hashes seemed too detailed.  Does anyone see a better
> > > symbol to use from here


Well, I think the section symbol is not a good choice for all commit links.

Add backend support for injection points (Michael Paquier) § § § §
Add backend support for injection points (Michael Paquier) 1 2 3 4
Add backend support for injection points (Michael Paquier) Commit 1 2 3 4

I don't know which one is better, but I don't think the section symbol is
the best.


Re: Document DateStyle effect on jsonpath string()

2024-09-16 Thread David E. Wheeler
On Sep 16, 2024, at 13:29, Tom Lane  wrote:

> 17.0.  If we were already past 17.0 I'd have a lot more angst
> about changing this behavior.

Great, very glad it made it in.

D





Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-09-16 Thread Alexander Korotkov
Hi, Alexander!

On Fri, Sep 13, 2024 at 3:00 PM Alexander Lakhin  wrote:
>
> 10.08.2024 20:18, Alexander Korotkov wrote:
> > On Sat, Aug 10, 2024 at 7:33 PM Alexander Korotkov  
> > wrote:
> >> On Tue, Aug 6, 2024 at 5:17 AM Alexander Korotkov  
> >> wrote:
> >> ...
> >> Here is a revised version of the patchset.  I've fixed some typos,
> >> identation, etc.  I'm going to push this once it passes cfbot.
> > The next revison of the patchset fixes uninitialized variable usage
> > spotted by cfbot.
>
> When running check-world on a rather slow armv7 device, I came across the
> 043_wal_replay_wait.pl test failure:
> t/043_wal_replay_wait.pl .. 7/? # Tests were run but no plan was 
> declared and done_testing() was not seen.
> # Looks like your test exited with 29 just after 8.
>
> regress_log_043_wal_replay_wait contains:
> ...
> 01234[21:58:56.370](1.594s) ok 7 - multiple LSN waiters reported consistent 
> data
> ### Promoting node "standby"
> # Running: pg_ctl -D 
> .../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata -l
> .../src/test/recovery/tmp_check/log/043_wal_replay_wait_standby.log promote
> waiting for server to promote done
> server promoted
> [21:58:56.637](0.268s) ok 8 - got error after standby promote
> error running SQL: 'psql::1: ERROR:  recovery is not in progress
> HINT:  Waiting for LSN can only be executed during recovery.'
> while running 'psql -XAtq -d port=10228 host=/tmp/Ftj8qpTQht 
> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL
> pg_wal_replay_wait('0/300D0E8');' at 
> .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 
> 2140.
>
> 043_wal_replay_wait_standby.log contains:
> 2024-09-12 21:58:56.518 UTC [15220:1] [unknown] LOG:  connection received: 
> host=[local]
> 2024-09-12 21:58:56.520 UTC [15220:2] [unknown] LOG:  connection 
> authenticated: user="android" method=trust
> (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
> 2024-09-12 21:58:56.520 UTC [15220:3] [unknown] LOG:  connection authorized: 
> user=android database=postgres
> application_name=043_wal_replay_wait.pl
> 2024-09-12 21:58:56.527 UTC [15220:4] 043_wal_replay_wait.pl LOG: statement: 
> CALL pg_wal_replay_wait('2/570CB4E8');
> 2024-09-12 21:58:56.535 UTC [15123:7] LOG:  received promote request
> 2024-09-12 21:58:56.535 UTC [15124:2] FATAL:  terminating walreceiver process 
> due to administrator command
> 2024-09-12 21:58:56.537 UTC [15123:8] LOG:  invalid record length at 
> 0/300D0B0: expected at least 24, got 0
> 2024-09-12 21:58:56.537 UTC [15123:9] LOG:  redo done at 0/300D088 system 
> usage: CPU: user: 0.01 s, system: 0.00 s,
> elapsed: 14.23 s
> 2024-09-12 21:58:56.537 UTC [15123:10] LOG:  last completed transaction was 
> at log time 2024-09-12 21:58:55.322831+00
> 2024-09-12 21:58:56.540 UTC [15123:11] LOG:  selected new timeline ID: 2
> 2024-09-12 21:58:56.589 UTC [15123:12] LOG:  archive recovery complete
> 2024-09-12 21:58:56.590 UTC [15220:5] 043_wal_replay_wait.pl ERROR: recovery 
> is not in progress
> 2024-09-12 21:58:56.590 UTC [15220:6] 043_wal_replay_wait.pl DETAIL:  
> Recovery ended before replaying target LSN
> 2/570CB4E8; last replay LSN 0/300D0B0.
> 2024-09-12 21:58:56.591 UTC [15121:1] LOG:  checkpoint starting: force
> 2024-09-12 21:58:56.592 UTC [15220:7] 043_wal_replay_wait.pl LOG: 
> disconnection: session time: 0:00:00.075 user=android
> database=postgres host=[local]
> 2024-09-12 21:58:56.595 UTC [15120:4] LOG:  database system is ready to 
> accept connections
> 2024-09-12 21:58:56.665 UTC [15227:1] [unknown] LOG:  connection received: 
> host=[local]
> 2024-09-12 21:58:56.668 UTC [15227:2] [unknown] LOG:  connection 
> authenticated: user="android" method=trust
> (.../src/test/recovery/tmp_check/t_043_wal_replay_wait_standby_data/pgdata/pg_hba.conf:117)
> 2024-09-12 21:58:56.668 UTC [15227:3] [unknown] LOG:  connection authorized: 
> user=android database=postgres
> application_name=043_wal_replay_wait.pl
> 2024-09-12 21:58:56.675 UTC [15227:4] 043_wal_replay_wait.pl LOG: statement: 
> CALL pg_wal_replay_wait('0/300D0E8');
> 2024-09-12 21:58:56.677 UTC [15227:5] 043_wal_replay_wait.pl ERROR: recovery 
> is not in progress
> 2024-09-12 21:58:56.677 UTC [15227:6] 043_wal_replay_wait.pl HINT: Waiting 
> for LSN can only be executed during recovery.
> 2024-09-12 21:58:56.679 UTC [15227:7] 043_wal_replay_wait.pl LOG: 
> disconnection: session time: 0:00:00.015 user=android
> database=postgres host=[local]
>
> Note that last replay LSN is 300D0B0, but the latter pg_wal_replay_wait
> call wants to wait for 300D0E8.
>
> pg_waldump -p 
> src/test/recovery/tmp_check/t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 
> 00010003
> shows:
> rmgr: Heaplen (rec/tot): 59/59, tx:748, lsn: 
> 0/0300D048, prev 0/0300D020, desc: INSERT off: 35,
> flags: 0x00, blkref #0: rel 1663/5/16384 blk 0
> rmgr: Transaction len (rec/tot): 34/34, t

Re: Adding skip scan (including MDAM style range skip scan) to nbtree

2024-09-16 Thread Peter Geoghegan
On Thu, Sep 12, 2024 at 10:49 AM Matthias van de Meent
 wrote:
> Thanks to Peter for the description, that helped me debug the issue. I
> think I found a fix for the issue: regression tests for 811af978
> consistently got stuck on my macbook before the attached patch 0001,
> after applying that patch they completed just fine.

Thanks for taking a look at it.

> The fix in 0001 is relatively simple: we stop backends from waiting
> for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
> instead move our local state machine so that we'll hit _bt_first
> ourselves, so that we may be able to start the next primitive scan.

I agree with your approach, but I'm concerned about it causing
confusion inside _bt_parallel_done. And so I attach a v2 revision of
your bug fix. v2 adds a check that nails that down, too. I'm not 100%
sure if the change to _bt_parallel_done becomes strictly necessary, to
make the basic fix robust, but it's a good idea either way. In fact, it
seemed like a good idea even before this bug came to light: it was
already clear that this was strictly necessary for the skip scan patch,
for reasons that really have nothing to do with the requirements for
skip scan (it's related to how we call _bt_parallel_done without much
care in code paths from the original parallel index scan commit).

More details on changes in v2 that didn't appear in Matthias' v1:

v2 makes _bt_parallel_done do nothing at all when the backend-local
so->needPrimScan flag is set (regardless of whether it has been set by
_bt_parallel_seize or by _bt_advance_array_keys). This is a bit like
the approach taken before the Postgres 17 work went in:
_bt_parallel_done used to only permit the shared btps_pageStatus state
to become BTPARALLEL_DONE when it found that "so->arrayKeyCount >=
btscan->btps_arrayKeyCount" (else the call was a no-op). With this
extra hardening, _bt_parallel_done will only permit setting BTPARALLEL_DONE when
"!so->needPrimScan". Same idea, more or less.

v2 also changes comments in _bt_parallel_seize. The comment tweaks
suggest that the new "if (!first && status ==
BTPARALLEL_NEED_PRIMSCAN) return false" path is similar to the
existing master branch "if (!first && so->needPrimScan) return false"
precheck logic on master (the precheck that takes place before
examining any state in shared memory). The new path can be thought of
as dealing with cases where the backend-local so->needPrimScan flag
must have been stale back when it was prechecked -- it's essentially the same
logic, though unlike the precheck it works against the authoritative
shared memory state.

My current plan is to commit something like this in the next day or two.

--
Peter Geoghegan
From 11282515bae8090b30663814c5f91db00488508d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Mon, 16 Sep 2024 14:28:57 -0400
Subject: [PATCH v2] Fix stuck parallel btree scans

Before, a backend that called _bt_parallel_seize was not always
guaranteed to be able to move forward on a state where more work
was expected from parallel backends, and handled NEED_PRIMSCAN as
a semi-ADVANCING state. This caused issues when the leader process
was waiting for the state to advance and concurrent backends were
waiting for the leader to consume the buffered tuples they still
had after updating the state to NEED_PRIMSCAN.

This is fixed by treating _bt_parallel_seize()'s status output as
the status of a currently active primitive scan.  If _seize is
called from outside _bt_first, and the scan state is NEED_PRIMSCAN,
then we'll end our current primitive scan and set the scan up for
a new primitive scan, eventually hitting _bt_first's call to
_seize.

Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp
execution.

Author: Matthias van de Meent 
Reported-By: Tomas Vondra 
Reviewed-By: Peter Geoghegan 
Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com
Backpatch: 17-, where nbtree SAOP execution was enhanced.
---
 src/backend/access/nbtree/nbtree.c | 53 +++---
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 686a3206f..456a04995 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -585,7 +585,10 @@ btparallelrescan(IndexScanDesc scan)
  *		or _bt_parallel_done().
  *
  * The return value is true if we successfully seized the scan and false
- * if we did not.  The latter case occurs if no pages remain.
+ * if we did not.  The latter case occurs when no pages remain, or when
+ * another primitive index scan is scheduled that caller's backend cannot
+ * start just yet (only backends that call from _bt_first are capable of
+ * starting primitive index scans, which they indicate by passing first=true).
  *
  * If the return value is true, *pageno returns the next or current page
  * of the scan (depending on the scan direction).  An invalid block

Re: Detailed release notes

2024-09-16 Thread Bruce Momjian
On Mon, Sep 16, 2024 at 03:42:22PM -0300, Marcos Pegoraro wrote:
> On Mon, Sep 16, 2024 at 15:15, Bruce Momjian wrote:
> 
> > > I changed the patch to use the section symbol "§" instead of showing
> > > the hashes.  The hashes seemed too detailed.  Does anyone see a better
> > > symbol to use from here
> 
> 
> Well, I think the section symbol is not a good choice for all commit links.
> 
> Add backend support for injection points (Michael Paquier) § § § §
> Add backend support for injection points (Michael Paquier) 1 2 3 4
> Add backend support for injection points (Michael Paquier) Commit 1 2 3 4
> 
> I don't know which one is better, but I don't think the section symbol is the
> best. 

We could try:

   Add backend support for injection points (Michael Paquier) §1 §2 §3 §4

and maybe only add numbers if there is more than one commit.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  When a patient asks the doctor, "Am I going to die?", he means 
  "Am I going to die soon?"




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-09-16 Thread Jacob Champion
On Wed, Sep 11, 2024 at 3:54 PM Jacob Champion
 wrote:
> Yeah, and I still owe you all an updated roadmap.

Okay, here goes. New reviewers: start here!

== What is This? ==

OAuth 2.0 is a way for a trusted third party (a "provider") to tell a
server whether a client on the other end of the line is allowed to do
something. This patchset adds OAuth support to libpq with libcurl,
provides a server-side API so that extension modules can add support
for specific OAuth providers, and extends our SASL support to carry
the OAuth access tokens over the OAUTHBEARER mechanism.

Most OAuth clients use a web browser to perform the third-party
handshake. (These are your "Okta logins", "sign in with XXX", etc.)
But there are plenty of people who use psql without a local browser,
and invoking a browser safely across all supported platforms is
actually surprisingly fraught. So this patchset implements something
called device authorization, where the client will display a link and
a code, and then you can log in on whatever device is convenient for
you. Once you've told your provider that you trust libpq to connect to
Postgres on your behalf, it'll give libpq an access token, and libpq
will forward that on to the server.

== How This Fits, or: The Sales Pitch ==

The most popular third-party auth methods we have today are probably
the Kerberos family (AD/GSS/SSPI) and LDAP. If you're not already in
an MS ecosystem, it's unlikely that you're using the former. And users
of the latter are, in my experience, more-or-less resigned to its use,
in spite of LDAP's architectural security problems and the fact that
you have to run weird synchronization scripts to tell Postgres what
certain users are allowed to do.

OAuth provides a decently mature and widely-deployed third option. You
don't have to be running the infrastructure yourself, as long as you
have a provider you trust. If you are running your own infrastructure
(or if your provider is configurable), the tokens being passed around
can carry org-specific user privileges, so that Postgres can figure
out who's allowed to do what without out-of-band synchronization
scripts. And those access tokens are a straight upgrade over
passwords: even if they're somehow stolen, they are time-limited, they
are optionally revocable, and they can be scoped to specific actions.

== Extension Points ==

This patchset provides several points of customization:

Server-side validation is farmed out entirely to an extension, which
we do not provide. (Each OAuth provider is free to come up with its
own proprietary method of verifying its access tokens, and so far the
big players have absolutely not standardized.) Depending on the
provider, the extension may need to contact an external server to see
what the token has been authorized to do, or it may be able to do that
offline using signing keys and an agreed-upon token format.

The client driver using libpq may replace the device authorization
prompt (which by default is done on standard error), for example to
move it into an existing GUI, display a scannable QR code instead of a
link, and so on.

The driver may also replace the entire OAuth flow. For example, a
client that already interacts with browsers may be able to use one of
the more standard web-based methods to get an access token. And
clients attached to a service rather than an end user could use a more
straightforward server-to-server flow, with pre-established
credentials.

== Architecture ==

The client needs to speak HTTP, which is implemented entirely with
libcurl. Originally, I used another OAuth library for rapid
prototyping, but the quality just wasn't there and I ported the
implementation. An internal abstraction layer remains in the libpq
code, so if a better client library comes along, switching to it
shouldn't be too painful.

The client-side hooks all go through a single extension point, so that
we don't continually add entry points in the API for each new piece of
authentication data that a driver may be able to provide. If we wanted
to, we could potentially move the existing SSL passphrase hook into
that, or even handle password retries within libpq itself, but I don't
see any burning reason to do that now.

I wanted to make sure that OAuth could be dropped into existing
deployments without driver changes. (Drivers will probably *want* to
look at the extension hooks for better UX, but they shouldn't
necessarily *have* to.) That has driven several parts of the design.

Drivers using the async APIs should continue to work without blocking,
even during the long HTTP handshakes. So the new client code is
structured as a typical event-driven state machine (similar to
PQconnectPoll). The protocol machine hands off control to the OAuth
machine during authentication, without really needing to know how it
works, because the OAuth machine replaces the PQsocket with a
general-purpose multiplexer that handles all of the HTTP sockets and
events. Once that's completed, the OAuth machine 

Re: SQL:2023 JSON simplified accessor support

2024-09-16 Thread Peter Eisentraut

On 29.08.24 18:33, Alexandra Wang wrote:

I’ve implemented the member and array accessors and attached two
alternative patches:

1. v1-0001-Add-JSON-JSONB-simplified-accessor.patch: This patch
enables dot access to JSON object fields and subscript access to
indexed JSON array elements by converting "." and "[]" indirection
into a JSON_QUERY JsonFuncExpr node.

2. v2-0001-Transform-JSON-dot-access-to-arrow-operator.txt: This
alternative patch implements dot access to JSON object fields by
transforming the "." indirection into a "->" operator.

The upside of the v1 patch is that it strictly aligns with the SQL
standard, which specifies that the simplified access is equivalent to:

JSON_QUERY (VEP, 'lax $.JC' WITH CONDITIONAL ARRAY WRAPPER NULL ON
EMPTY NULL ON ERROR)

However, the performance of JSON_QUERY might be suboptimal due to
function call overhead. Therefore, I implemented the v2 alternative
using the "->" operator.


Using the operator approach would also allow taking advantage of 
optimizations such as 
.
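
To make the comparison concrete, here is a rough sketch of how the spellings
line up (the simplified-accessor form exists only with the proposed patch;
table and column names are made up):

```sql
-- proposed simplified accessor (patch only):
--   SELECT (t.jcol).a.b[1] FROM tab t;

-- operator form (the v2 approach):
SELECT t.jcol -> 'a' -> 'b' -> 1 FROM tab t;

-- what the standard says the accessor is equivalent to (the v1 approach):
SELECT JSON_QUERY(t.jcol, 'lax $.a.b[1]'
                  WITH CONDITIONAL ARRAY WRAPPER NULL ON EMPTY NULL ON ERROR)
FROM tab t;
```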



There is some uncertainty about the semantics of conditional array
wrappers. Currently, there is at least one subtle difference between
the "->" operator and JSON_QUERY, as shown:


That JSON_QUERY bug has been fixed.

I suggest you rebase both of your patches over this, just to double 
check everything.  But then I think you can drop the v1 patch and just 
submit a new version of v2.


The patch should eventually contain some documentation.  It might be
good to start looking for a good spot to put that documentation.
It could go either near the json types documentation or near the general
qualified identifier syntax; I'm not sure.






Re: Pgoutput not capturing the generated columns

2024-09-16 Thread Masahiko Sawada
On Wed, Sep 11, 2024 at 10:30 PM Peter Smith  wrote:
>
> Because this feature is now being implemented as a PUBLICATION option,
> there is another scenario that might need consideration; I am thinking
> about where the same table is published by multiple PUBLICATIONS (with
> different option settings) that are subscribed by a single
> SUBSCRIPTION.
>
> e.g.1
> -
> CREATE PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = true);
> CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
> CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> -
>
> e.g.2
> -
> CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns = 
> true);
> CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
> CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> -
>
> Do you know if this case is supported? If yes, then which publication
> option value wins?

I would expect these option values are processed with OR. That is, we
publish changes of the generated columns if at least one publication
sets publish_generated_columns to true. It seems to me that we treat
multiple row filters in the same way.

>
> The CREATE SUBSCRIPTION docs [1] only says "Subscriptions having
> several publications in which the same table has been published with
> different column lists are not supported."
>
> Perhaps the user is supposed to deduce that the example above would
> work OK if table 't1' has no generated cols. OTOH, if it did have
> generated cols then the PUBLICATION column lists must be different and
> therefore it is "not supported" (??).

With the patch, how should this feature work in the first place when users
specify a generated column in the column list and set
publish_generated_columns = false? Raise an error (as we do today), or
always send NULL?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: optimizing pg_upgrade's once-in-each-database steps

2024-09-16 Thread Nathan Bossart
On Mon, Sep 09, 2024 at 11:20:28PM +0200, Daniel Gustafsson wrote:
>> On 9 Sep 2024, at 21:17, Nathan Bossart  wrote:
>> 
>> On Thu, Sep 05, 2024 at 01:32:34PM +0200, Daniel Gustafsson wrote:
>>> I've read and tested through the latest version of this patchset and I think
>>> it's ready to go in.
>> 
>> Thanks for reviewing.  I'm aiming to commit it later this week.
> 
> +1.  Looking forward to seeing what all the pg_dump/pg_upgrade changes amount
> to in speed improvement when combined.

Committed.

-- 
nathan




Re: [PATCH] WIP: replace method for jsonpath

2024-09-16 Thread Florents Tselai
Here's an updated version of this patch. The previous version failed several
CI steps; this one passes them all.

Unless someone disagrees, I'll proceed with the documentation and add this to
the next CF.

As a future note: it's worth noting that both this newly added jspItem and
other ones like jpiDecimal and jpiString use jspGetRightArg and jspGetLeftArg.
"left" and "right" can be confusing if more complex methods are added in the
future, i.e. jsonpath methods with nargs >= 3. I was wondering if we'd like
something like JSP_GETARG(n).

GitHub PR, in case you prefer its UI: https://github.com/Florents-Tselai/postgres/pull/3

v2-0001-Add-first-working-version-for-jsonpath-.replace-s.patch
Description: Binary data


v2-0002-replace-should-work-with-strings-inside-arrays.patch
Description: Binary data


v2-0003-Fix-logic-for-jpiReplaceFunc.patch
Description: Binary data
On 15 Sep 2024, at 4:15 AM, Florents Tselai wrote:

Hi,

When working with jsonb/jsonpath, I've always wanted more fluent string
operations for data cleaning; ideally, chaining text processing methods
together. Here's a draft patch that adds a string replace method. It works
like this:

select jsonb_path_query('"hello world"', '$.replace("hello","bye")');
 jsonb_path_query
------------------
 "bye world"
(1 row)

And it looks like it plays nicely with existing facilities:

select jsonb_path_query('"hello world"', '$.replace("hello","bye") starts with "bye"');
 jsonb_path_query
------------------
 true
(1 row)

I'd like some initial feedback on whether this is of interest before I proceed
with the following:
- I've tried respecting the surrounding code, but a more experienced eye can
  spot some inconsistencies. I'll revisit those.
- Add more test cases (I've added the obvious ones, but ideas on more cases
  are welcome).
- pg upgrades currently fail on CI (see).
- Better error handling/reporting: I've kept the wording of error messages,
  but we'll need something akin to
  ERRCODE_INVALID_ARGUMENT_FOR_SQL_JSON_DATETIME_FUNCTION.
- Documentation.

Re: Conflict detection for update_deleted in logical replication

2024-09-16 Thread Masahiko Sawada
On Fri, Sep 13, 2024 at 12:56 AM shveta malik  wrote:
>
> On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila  wrote:
> >
> > > >
> > > > So in brief, this solution is only for bidrectional setup? For 
> > > > non-bidirectional,
> > > > feedback_slots is non-configurable and thus irrelevant.
> > >
> > > Right.
> > >
> >
> > One possible idea to address the non-bidirectional case raised by
> > Shveta is to use a time-based cut-off to remove dead tuples. As
> > mentioned earlier in my email [1], we can define a new GUC parameter
> > say vacuum_committs_age which would indicate that we will allow rows
> > to be removed only if the modified time of the tuple as indicated by
> > committs module is greater than the vacuum_committs_age. We could keep
> > this parameter a table-level option without introducing a GUC as this
> > may not apply to all tables. I checked and found that some other
> > replication solutions like GoldenGate also allowed similar parameters
> > (tombstone_deletes) to be specified at table level [2]. The other
> > advantage of allowing it at table level is that it won't hamper the
> > performance of hot-pruning or vacuum in general. Note, I am careful
> > here because to decide whether to remove a dead tuple or not we need
> > to compare its committs_time both during hot-pruning and vacuum.
>
> +1 on the idea,

I agree that this idea is much simpler than the idea originally
proposed in this thread.

IIUC vacuum_committs_age specifies a time rather than an XID age. But
how can we implement it? If it ends up affecting the vacuum cutoff, we
should be careful not to end up with the same result of
vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
the implementation needs not to affect the performance of
ComputeXidHorizons().

> but IIUC this value doesn’t need to be significant; it
> can be limited to just a few minutes. The one which is sufficient to
> handle replication delays caused by network lag or other factors,
> assuming clock skew has already been addressed.

I think that in a non-bidirectional case the value could need to be a
large number. Is that right?

Regards,

[1] 
https://www.postgresql.org/message-id/20230317230930.nhsgk3qfk7f4axls%40awork3.anarazel.de

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Robocopy might be not robust enough for never-ending testing on Windows

2024-09-16 Thread Thomas Munro
On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin  wrote:
> So this leak looks like a recent and still existing defect.

From my cartoon-like understanding of Windows, I would guess that if
event handles created by a program are leaked after it has exited, it
would normally imply that they've been duplicated somewhere else that
is still running (for example see the way that PostgreSQL's
dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the
postmaster, so that the memory segment continues to exist after the
backend exits), and if it's that, you'd be able to see the handle
count going up in the process monitor for some longer running process
somewhere (as seen in this report from the Chrome hackers[1]).  And if
it's not that, then I would guess it would have to be a kernel bug
because something outside userspace must be holding onto/leaking
handles.  But I don't really understand Windows beyond trying to debug
PostgreSQL at a distance, so my guesses may be way off.  If we wanted
to try to find a Windows expert to look at a standalone repro, does
your PS script work with *any* source directory, or is there something
about the initdb template, in which case could you post it in a .zip
file so that a non-PostgreSQL person could see the failure mode?

[1] 
https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-09-16 Thread Peter Smith
Here are a few comments for the patch v46-0001.

==
src/backend/replication/slot.c

1. ReportSlotInvalidation

On Mon, Sep 16, 2024 at 8:01 PM Bharath Rupireddy
 wrote:
>
> On Mon, Sep 9, 2024 at 1:11 PM Peter Smith  wrote:
> > 3. ReportSlotInvalidation
> >
> > I didn't understand why there was a hint for:
> > "You might need to increase \"%s\".", "max_slot_wal_keep_size"
> >
> > Why aren't these similar cases consistent?
>
> It looks misleading and not very useful. What happens if the removed
> WAL (that's needed for the slot) is put back into pg_wal somehow (by
> manually copying from archive or by some tool/script)? Can the slot
> invalidated due to wal_removed start sending WAL to its clients?
>
> > But you don't have an equivalent hint for timeout invalidation:
> > "You might need to increase \"%s\".", "replication_slot_inactive_timeout"
>
> I removed this per review comments upthread.

IIUC the errors are quite similar, so my previous review comment was
mostly about the unexpected inconsistency of why one of them has a
hint and the other one does not. I don't have a strong opinion about
whether they should both *have* or *not have* hints, so long as they
are treated the same.

If you think the current code hint is not useful then maybe we need a
new thread to address that existing issue. For example, maybe it
should be removed or reworded.

~~~

2. InvalidatePossiblyObsoleteSlot:

+ case RS_INVAL_INACTIVE_TIMEOUT:
+
+ if (!SlotInactiveTimeoutCheckAllowed(s))
+ break;
+
+ /*
+ * Check if the slot needs to be invalidated due to
+ * replication_slot_inactive_timeout GUC.
+ */
+ if (TimestampDifferenceExceeds(s->inactive_since, now,
+replication_slot_inactive_timeout * 1000))

nit - it might be tidier to avoid multiple breaks by just combining
these conditions. See the nitpick attachment.

~~~

3.
  * - RS_INVAL_WAL_LEVEL: is logical
+ * - RS_INVAL_INACTIVE_TIMEOUT: inactive timeout occurs

nit - use comment wording "inactive slot timeout has occurred", to
make it identical to the comment in slot.h

==
src/test/recovery/t/050_invalidate_slots.pl

4.
+# Despite inactive timeout being set, the synced slot won't get invalidated on
+# its own on the standby. So, we must not see invalidation message in server
+# log.
+$standby1->safe_psql('postgres', "CHECKPOINT");
+ok( !$standby1->log_contains(
+ "invalidating obsolete replication slot \"sync_slot1\"",
+ $logstart),
+ 'check that synced slot sync_slot1 has not been invalidated on standby'
+);
+

It seems kind of brittle to check the logs for something that is NOT
there, because any change to the message will make this accidentally
pass. Apart from that, it might anyway be more efficient just to check
pg_replication_slots again to make sure the 'invalidation_reason'
is still NULL.
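For instance, the check could be a direct catalog query along these lines (a sketch only; the slot name is the one from the test above, and invalidation_reason is the existing pg_replication_slots column):

```sql
-- Sketch of the suggested check: query the catalog instead of grepping the log.
SELECT invalidation_reason IS NULL AS still_valid
FROM pg_replication_slots
WHERE slot_name = 'sync_slot1';
```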

==

Please see the attachment which implements some of the nit changes
mentioned above.

==
Kind Regards,
Peter Smith.
Fujitsu Australia
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 851120e..0076e4b 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1716,15 +1716,12 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
                     invalidation_cause = cause;
                     break;
                 case RS_INVAL_INACTIVE_TIMEOUT:
-
-                    if (!SlotInactiveTimeoutCheckAllowed(s))
-                        break;
-
                     /*
                      * Check if the slot needs to be invalidated due to
                      * replication_slot_inactive_timeout GUC.
                      */
-                    if (TimestampDifferenceExceeds(s->inactive_since, now,
+                    if (SlotInactiveTimeoutCheckAllowed(s) &&
+                        TimestampDifferenceExceeds(s->inactive_since, now,
                                                    replication_slot_inactive_timeout * 1000))
                     {
                         invalidation_cause = cause;
@@ -1894,7 +1891,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
  * - RS_INVAL_HORIZON: requires a snapshot <= the given horizon in the given
  *   db; dboid may be InvalidOid for shared relations
  * - RS_INVAL_WAL_LEVEL: is logical
- * - RS_INVAL_INACTIVE_TIMEOUT: inactive timeout occurs
+ * - RS_INVAL_INACTIVE_TIMEOUT: inactive slot timeout has occurred
  *
  * NB - this runs as part of checkpoint, so avoid raising errors if possible.
  */
diff --git a/src/test/recovery/t/050_invalidate_slots.pl b/src/test/recovery/t/050_invalidate_slots.pl
index c53b5b3..e2fdd52 100644
--- a/src/test/recovery/t/050_invalid

Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-16 Thread Tatsuo Ishii
> Not sure if there's any way to force it in the SQL standard.  However
> in term of implementation, PostgreSQL sorts the function
> (generate_series) scan result using a sort key "a.n < 3", which
> results in rows being >= 2 first (as false == 0), then rows being < 3
> (as true == 1). So unless PostgreSQL changes the way to sort boolean
> data type, I think the result should be stable.

Attached is the v5 patch. The difference from v4 is the addition of two
more tests to explain.sql:

1) spills to disk case
2) spills to disk then switches back to memory case

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
>From 23d5d62ef3c08117a0e452a57264477a3f9aba08 Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii 
Date: Tue, 17 Sep 2024 10:59:07 +0900
Subject: [PATCH v5] Add memory/disk usage for Window aggregate nodes in
 EXPLAIN.

This commit is similar to 1eff8279d and expands the idea to Window
aggregate nodes so that users can know how much memory or disk the
tuplestore used.

This commit uses newly introduced tuplestore_get_stats() to query this
information and add some additional output in EXPLAIN ANALYZE to
display the information for the Window aggregate node.
---
 src/backend/commands/explain.c| 68 ---
 src/test/regress/expected/explain.out | 37 +++
 src/test/regress/sql/explain.sql  |  8 
 3 files changed, 95 insertions(+), 18 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2819e479f8..aaec439892 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -120,6 +120,7 @@ static void show_sort_group_keys(PlanState *planstate, const char *qlabel,
  List *ancestors, ExplainState *es);
 static void show_sortorder_options(StringInfo buf, Node *sortexpr,
    Oid sortOperator, Oid collation, bool nullsFirst);
+static void show_storage_info(Tuplestorestate *tupstore, ExplainState *es);
 static void show_tablesample(TableSampleClause *tsc, PlanState *planstate,
 			 List *ancestors, ExplainState *es);
 static void show_sort_info(SortState *sortstate, ExplainState *es);
@@ -127,6 +128,7 @@ static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
 	   ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
 static void show_material_info(MaterialState *mstate, ExplainState *es);
+static void show_windowagg_info(WindowAggState *winstate, ExplainState *es);
 static void show_memoize_info(MemoizeState *mstate, List *ancestors,
 			  ExplainState *es);
 static void show_hashagg_info(AggState *aggstate, ExplainState *es);
@@ -2231,6 +2233,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		   planstate, es);
 			show_upper_qual(((WindowAgg *) plan)->runConditionOrig,
 			"Run Condition", planstate, ancestors, es);
+			show_windowagg_info(castNode(WindowAggState, planstate), es);
 			break;
 		case T_Group:
 			show_group_keys(castNode(GroupState, planstate), ancestors, es);
@@ -2894,6 +2897,34 @@ show_sortorder_options(StringInfo buf, Node *sortexpr,
 	}
 }
 
+/*
+ * Show information on storage method and maximum memory/disk space used.
+ */
+static void
+show_storage_info(Tuplestorestate *tupstore, ExplainState *es)
+{
+	char	   *maxStorageType;
+	int64		maxSpaceUsed,
+maxSpaceUsedKB;
+
+	tuplestore_get_stats(tupstore, &maxStorageType, &maxSpaceUsed);
+	maxSpaceUsedKB = BYTES_TO_KILOBYTES(maxSpaceUsed);
+
+	if (es->format != EXPLAIN_FORMAT_TEXT)
+	{
+		ExplainPropertyText("Storage", maxStorageType, es);
+		ExplainPropertyInteger("Maximum Storage", "kB", maxSpaceUsedKB, es);
+	}
+	else
+	{
+		ExplainIndentText(es);
+		appendStringInfo(es->str,
+		 "Storage: %s  Maximum Storage: " INT64_FORMAT "kB\n",
+		 maxStorageType,
+		 maxSpaceUsedKB);
+	}
+}
+
 /*
  * Show TABLESAMPLE properties
  */
@@ -3350,9 +3381,6 @@ static void
 show_material_info(MaterialState *mstate, ExplainState *es)
 {
 	Tuplestorestate *tupstore = mstate->tuplestorestate;
-	char	   *maxStorageType;
-	int64		maxSpaceUsed,
-maxSpaceUsedKB;
 
 	/*
 	 * Nothing to show if ANALYZE option wasn't used or if execution didn't
@@ -3361,22 +3389,26 @@ show_material_info(MaterialState *mstate, ExplainState *es)
 	if (!es->analyze || tupstore == NULL)
 		return;
 
-	tuplestore_get_stats(tupstore, &maxStorageType, &maxSpaceUsed);
-	maxSpaceUsedKB = BYTES_TO_KILOBYTES(maxSpaceUsed);
+	show_storage_info(tupstore, es);
+}
 
-	if (es->format != EXPLAIN_FORMAT_TEXT)
-	{
-		ExplainPropertyText("Storage", maxStorageType, es);
-		ExplainPropertyInteger("Maximum Storage", "kB", maxSpaceUsedKB, es);
-	}
-	else
-	{
-		ExplainIndentText(es);
-		appendStringInfo(es->str,
-		 "Storage: %s  Maximum Storage: " INT64_FORMAT "kB\n",
-		 maxStorageType,
-		 maxSpaceUsedKB);
-	}
+/*
+ * Show information on WindowAgg node, storage method and maximum memory/disk
+ * space used.

Re: May be BUG. Periodic burst growth of the checkpoint_req counter on replica.

2024-09-16 Thread Fujii Masao




On 2024/09/16 23:30, Anton A. Melnikov wrote:

+1
This idea seems quite tenable to me.

There is a small clarification. Now if there were no skipped restartpoints then
restartpoints_done will be equal to restartpoints_timed + restartpoints_req.
Similar for checkpoints.
So i tried to introduce num_done counter for checkpoints in the patch attached.


Thanks for the patch! I believe this change is targeted for v18. For v17, 
however,
we should update the description of num_timed in the documentation. Thoughts?
Here's a suggestion:

"Number of scheduled checkpoints due to timeout. Note that checkpoints may be
skipped if the server has been idle since the last one, and this value counts
both completed and skipped checkpoints."
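(For reference, the existing counters that this wording documents can be inspected as below; the num_done counter discussed in this thread is only a proposal, not an existing column.)

```sql
-- Existing pg_stat_checkpointer counters as of v17; the proposed num_done
-- counter from this thread would be an additional column.
SELECT num_timed, num_requested FROM pg_stat_checkpointer;
```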

Regarding the patch:
    if (do_restartpoint)
        PendingCheckpointerStats.restartpoints_performed++;
+   else
+       PendingCheckpointerStats.num_performed++;

I expected the counter not to be incremented when a checkpoint is skipped,
but in this code, when a checkpoint is skipped, ckpt_performed is set to true,
triggering the counter increment. This seems wrong.



I'm not sure whether we should include a test for the case when num_done is
less than num_timed + num_requested in the regression tests. I haven't been
able to get it working in a short time yet.


I'm not sure if that test is really necessary...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION





Re: Pgoutput not capturing the generated columns

2024-09-16 Thread Peter Smith
On Tue, Sep 17, 2024 at 7:02 AM Masahiko Sawada  wrote:
>
> On Wed, Sep 11, 2024 at 10:30 PM Peter Smith  wrote:
> >
> > Because this feature is now being implemented as a PUBLICATION option,
> > there is another scenario that might need consideration; I am thinking
> > about where the same table is published by multiple PUBLICATIONS (with
> > different option settings) that are subscribed by a single
> > SUBSCRIPTION.
> >
> > e.g.1
> > -
> > CREATE PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = 
> > true);
> > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = 
> > false);
> > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > -
> >
> > e.g.2
> > -
> > CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns = 
> > true);
> > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = 
> > false);
> > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > -
> >
> > Do you know if this case is supported? If yes, then which publication
> > option value wins?
>
> I would expect these option values are processed with OR. That is, we
> publish changes of the generated columns if at least one publication
> sets publish_generated_columns to true. It seems to me that we treat
> multiple row filters in the same way.
>

I thought that the option "publish_generated_columns" is more related
to "column lists" than "row filters".

Let's say table 't1' has columns 'a', 'b', 'c', 'gen1', 'gen2'.

Then:
PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = true);
is equivalent to
PUBLICATION pub1 FOR TABLE t1(a,b,c,gen1,gen2);

And
PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
is equivalent to
PUBLICATION pub2 FOR TABLE t1(a,b,c);

So, I would expect this to fail because the SUBSCRIPTION docs say
"Subscriptions having several publications in which the same table has
been published with different column lists are not supported."

~~

Here's another example:
PUBLICATION pub3 FOR TABLE t1(a,b);
PUBLICATION pub4 FOR TABLE t1(c);

Won't it be strange (e.g. difficult to explain) why pub1 and pub2
table column lists are allowed to be combined in one subscription, but
pub3 and pub4 in one subscription are not supported due to the
different column lists?

> >
> > The CREATE SUBSCRIPTION docs [1] only says "Subscriptions having
> > several publications in which the same table has been published with
> > different column lists are not supported."
> >
> > Perhaps the user is supposed to deduce that the example above would
> > work OK if table 't1' has no generated cols. OTOH, if it did have
> > generated cols then the PUBLICATION column lists must be different and
> > therefore it is "not supported" (??).
>
> With the patch, how should this feature work when users specify a
> generated column to the column list and set publish_generated_column =
> false, in the first place? raise an error (as we do today)? or always
> send NULL?

For this scenario, I suggested (see [1] #3) that the code could give a
WARNING. As I wrote up-thread: This combination doesn't seem
like something a user would do intentionally, so just silently
ignoring it (which the current patch does) is likely going to give
someone unexpected results/grief.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPuaitgE4tu3nfaR%3DPCQEKjB%3DmpDtZ1aWkbwb%3DJZE8YvqQ%40mail.gmail.com

Kind Regards,
Peter Smith
Fujitsu Australia




Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-16 Thread David Rowley
On Tue, 17 Sept 2024 at 14:40, Tatsuo Ishii  wrote:
> Attached is the v5 patch. The difference from v4 is the addition of two
> more tests to explain.sql:
>
> 1) spills to disk case
> 2) spills to disk then switches back to memory case

Looks ok to me, aside from the missing "reset work_mem;" after you're
done with testing the disk spilling code.

David




Re: Allow logical failover slots to wait on synchronous replication

2024-09-16 Thread shveta malik
On Mon, Sep 16, 2024 at 4:04 PM Amit Kapila  wrote:
>
> On Mon, Sep 16, 2024 at 2:55 PM shveta malik  wrote:
> >
> > On Mon, Sep 16, 2024 at 11:13 AM Amit Kapila  
> > wrote:
> > >
> >
> > > Another question aside from the above point, what if someone has
> > > specified logical subscribers in synchronous_standby_names? In the
> > > case of synchronized_standby_slots, we won't proceed with such slots.
> > >
> >
> > Yes, it is a possibility. I have missed this point earlier. Now I
> > tried a case where I give a mix of logical subscriber and physical
> > standby in 'synchronous_standby_names' on pgHead, it even takes that
> > 'mix' configuration and starts waiting accordingly.
> >
> > synchronous_standby_names = 'FIRST 2(logicalsub_1, phy_standby_1,
> > phy_standby_2)';
> >
>
> This should not happen as we don't support syncing failover slots on
> logical subscribers.

+1

> The other point to consider here is that the user
> may not have set 'sync_replication_slots' on all the physical standbys
> mentioned in 'synchronous_standby_names' and in that case, it doesn't
> make sense to wait for WAL to get flushed on those standbys. What do
> you think?
>

Yes, it is a possibility. But then it is a possibility in the case of
'synchronized_standby_slots' as well. A user may configure one of the
standbys in 'synchronized_standby_slots' while not configuring the
slot-sync GUCs on that standby (hot_standby_feedback,
sync_replication_slots, etc.). In such a case, logical replication is
dependent upon the concerned physical standby even though the latter is
not syncing failover slots.
But there is no reliable way to detect this at the publisher side in order
to stop the 'wait' for the concerned physical standby. We tried in the
past, but it was not that simple, as the sync-related GUCs may change at
any time on the physical standby, and thus we would need a consistent
feedback mechanism to detect this. IMO, we can explain the recommendations
and risks for 'synchronous_standby_names' in the docs, similar to what we
do for 'sync_replication_slots'. Or do you have something else in mind?

thanks
Shveta




Re: Supporting = operator in gin/gist_trgm_ops

2024-09-16 Thread David Rowley
On Sat, 14 Nov 2020 at 18:31, Alexander Korotkov  wrote:
> I also have checked that btree_gist is preferred over pg_trgm gist
> index for equality search.  Despite our gist cost estimate is quite
> dumb, it selects btree_gist index due to its lower size.  So, this
> part also looks good to me.
>
> I'm going to push this if no objections.

(Reviving old thread [1] due to a complaint from a customer about a
performance regression after upgrading PG13 to PG15)

I think the comparisons you made may have been too simplistic. Do you
recall what your test case was?

I tried comparing btree to gist with gist_trgm_ops and found that the
cost estimates for GIST come out cheaper than btree.  Btree only wins
in the most simplistic tests due to Index Only Scans. The test done in
[2] seems to have fallen for that mistake.

create extension if not exists pg_trgm;
create table a (a varchar(250), b varchar(250), c varchar(250));
insert into a select  md5(a::text),md5(a::text),md5(a::text) from
generate_Series(1,100)a;
create index a_a_btree on a (a);
create index a_a_gist on a using gist (a gist_trgm_ops);
vacuum freeze analyze a;

-- Gist index wins
explain (analyze, buffers) select * from a where a = '1234';

 Index Scan using a_a_gist on a  (cost=0.41..8.43 rows=1 width=99)
   Index Cond: ((a)::text = '1234'::text)
   Rows Removed by Index Recheck: 1
   Buffers: shared hit=14477
 Planning Time: 0.055 ms
 Execution Time: 23.861 ms
(6 rows)

-- hack to disable gist index.
update pg_index set indisvalid = false where indexrelid='a_a_gist'::regclass;
explain (analyze, buffers) select * from a where a = '1234';

 Index Scan using a_a_btree on a  (cost=0.42..8.44 rows=1 width=99)
   Index Cond: ((a)::text = '1234'::text)
   Buffers: shared read=3
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.090 ms
 Execution Time: 0.048 ms (497.1 times faster)
(7 rows)

-- re-enable gist.
update pg_index set indisvalid = true where indexrelid='a_a_gist'::regclass;

-- try a query that supports btree with index only scan. Btree wins.
explain (analyze, buffers) select a from a where a = '1234';

 Index Only Scan using a_a_btree on a  (cost=0.42..4.44 rows=1 width=33)
   Index Cond: (a = '1234'::text)
   Heap Fetches: 0
   Buffers: shared read=3
 Planning Time: 0.185 ms
 Execution Time: 0.235 ms
(6 rows)

-- Disable IOS and Gist index wins again.
set enable_indexonlyscan=0;
explain (analyze, buffers) select a from a where a = '1234';

 Index Scan using a_a_gist on a  (cost=0.41..8.43 rows=1 width=33)
   Index Cond: ((a)::text = '1234'::text)
   Rows Removed by Index Recheck: 1
   Buffers: shared hit=11564 read=3811
 Planning Time: 0.118 ms
 Execution Time: 71.270 ms (303.2 times faster)
(6 rows)

This does not seem ideal given that the select * with the btree is
~500 times faster than with the gist plan.

For now, I've recommended the GIST indexes are moved to another
tablespace with an increased random_page_cost to try to coax the
planner to use the btree index.
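Roughly, that workaround looks like this (a sketch only; the tablespace name, path, and cost value are placeholders):

```sql
-- Sketch of the workaround: put the GiST index on its own tablespace and
-- raise that tablespace's random_page_cost so the planner prefers the btree.
CREATE TABLESPACE gist_ts LOCATION '/path/to/gist_ts';
ALTER INDEX a_a_gist SET TABLESPACE gist_ts;
ALTER TABLESPACE gist_ts SET (random_page_cost = 10);
```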

I wonder if we can do something to fix this so the different
tablespace idea isn't the permanent solution. I had a look to see why
the GIST costs come out cheaper. It looks like it's the startup cost
calculation that's slightly different from the btree costing. The
attached patch highlights the difference. When applied both indexes
come out at the same cost and which one is picked is down to which
index has the lower Oid. I've not studied if there's a reason why this
code is different in gist.

David

[1] 
https://postgr.es/m/capphfducq0u8noyb2l3vchsybmsc5v2ej2whmeuxmagha2j...@mail.gmail.com
[2] 
https://postgr.es/m/CAOBaU_YkkhakwTG4oA886T4CQsHG5hfY%2BxGA3dTBdZM%2BDTYJWA%40mail.gmail.com


gist_cost.diff
Description: Binary data


Re: Add contrib/pg_logicalsnapinspect

2024-09-16 Thread shveta malik
On Mon, Sep 16, 2024 at 8:03 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Mon, Sep 16, 2024 at 04:02:51PM +0530, shveta malik wrote:
> > On Wed, Sep 11, 2024 at 4:21 PM Bertrand Drouvot
> >  wrote:
> > >
> > >
> > > Yeah, good idea. Done that way in v3 attached.
> > >
> >
> > Thanks for the patch. +1 on the patch's idea. I have started
> > reviewing/testing it. It is WIP but please find few initial comments:
>
> Thanks for sharing your thoughts and for the review!
>
> >
> > src/backend/replication/logical/snapbuild.c:
> >
> > 1)
> > + fsync_fname("pg_logical/snapshots", true);
> >
> > Should we use macros PG_LOGICAL_DIR and  PG_LOGICAL_SNAPSHOTS_DIR  in
> > ValidateSnapshotFile(), instead of hard coding the path
> >
> >
> > 2)
> > Same as above in pg_get_logical_snapshot_meta() and
> > pg_get_logical_snapshot_info()
> >
> > + sprintf(path, "pg_logical/snapshots/%X-%X.snap",
> > + LSN_FORMAT_ARGS(lsn));LSN_FORMAT_ARGS(lsn));
> >
>
> Doh! Yeah, agree that we should use those macros. They are coming from 
> c39afc38cf
> which has been introduced after v1 of this patch. I thought I took care of it 
> once
> c39afc38cf went in, but it looks like I missed it somehow. Done in v4 
> attached,
> Thanks!
>
> > 3)
> > +#include "replication/internal_snapbuild.h"
> >
> > Shall we name new file as 'snapbuild_internal.h' instead of
> > 'internal_snapbuild.h'. Please see other files' name under
> > './src/include/replication':
> > worker_internal.h
> > walsender_private.h
>
> Agree, that should be snapbuild_internal.h, done in v4.
>
> >
> > 4)
> > +static void ValidateSnapshotFile(XLogRecPtr lsn, SnapBuildOnDisk *ondisk,
> > + const char *path);
> >
> > Is it required? We generally don't add declaration unless required by
> > compiler. Since definition is prior to usage, it is not needed?
> >
>
> I personally prefer to add them even if not required by the compiler. I did 
> not
> pay attention that "We generally don't add declaration unless required by 
> compiler"
> and (after a quick check) I did not find any reference in the coding style
> documentation [1]. That said, I don't have a strong opinion about that and so
> removed in v4. Worth to add a mention in the coding convention doc?
>

Okay. I was somehow under the impression that this is the convention in
postgres, i.e., not to add redundant declarations. It will be good to know
what others think on this.

Thanks for addressing the comments. I have not started reviewing v4
yet, but here are few more comments on v3:

1)
+#include "port/pg_crc32c.h"

It is not needed in pg_logicalinspect.c as it is already included in
internal_snapbuild.h

2)
+ values[0] = Int16GetDatum(ondisk.builder.state);

+ values[8] = LSNGetDatum(ondisk.builder.last_serialized_snapshot);
+ values[9] = TransactionIdGetDatum(ondisk.builder.next_phase_at);
+ values[10] = Int64GetDatum(ondisk.builder.committed.xcnt);

We can have values[i++] in all the places and later we can check :
Assert(i == PG_GET_LOGICAL_SNAPSHOT_INFO_COLS);
Then we need not to keep track of number even in later part of code,
as it goes till 14.

3)
Similar change can be done here:

+ values[0] = Int32GetDatum(ondisk.magic);
+ values[1] = Int32GetDatum(ondisk.checksum);
+ values[2] = Int32GetDatum(ondisk.version);

check at the end will be: Assert(i == PG_GET_LOGICAL_SNAPSHOT_META_COLS);

4)
Most of the output columns in pg_get_logical_snapshot_info() look
self-explanatory except 'state'. Should we have meaningful 'text' here
corresponding to SnapBuildState? Similar to what we do for
'invalidation_reason' in pg_replication_slots. (SlotInvalidationCauses
for ReplicationSlotInvalidationCause)

thanks
Shveta




Re: Add contrib/pg_logicalsnapinspect

2024-09-16 Thread shveta malik
On Tue, Sep 17, 2024 at 10:18 AM shveta malik  wrote:
>
> Thanks for addressing the comments. I have not started reviewing v4
> yet, but here are few more comments on v3:
>

I just noticed that when we pass NULL input, both the new functions
give 1 row as output, all cols as NULL:

newdb1=# SELECT * FROM pg_get_logical_snapshot_meta(NULL);
 magic | checksum | version
-------+----------+---------
       |          |
(1 row)

Similar behavior with pg_get_logical_snapshot_info(). While the
existing 'pg_ls_logicalsnapdir' function gives this error, which looks
more meaningful:

newdb1=# select * from pg_ls_logicalsnapdir(NULL);
ERROR:  function pg_ls_logicalsnapdir(unknown) does not exist
LINE 1: select * from pg_ls_logicalsnapdir(NULL);
HINT:  No function matches the given name and argument types. You
might need to add explicit type casts.


Shouldn't the new functions have same behavior?

thanks
Shveta




Re: Robocopy might be not robust enough for never-ending testing on Windows

2024-09-16 Thread Alexander Lakhin

Hello Thomas,

17.09.2024 04:01, Thomas Munro wrote:

On Mon, Sep 16, 2024 at 6:00 PM Alexander Lakhin  wrote:

So this leak looks like a recent and still existing defect.

 From my cartoon-like understanding of Windows, I would guess that if
event handles created by a program are leaked after it has exited, it
would normally imply that they've been duplicated somewhere else that
is still running (for example see the way that PostgreSQL's
dsm_impl_pin_segment() calls DuplicateHandle() to give a copy to the
postmaster, so that the memory segment continues to exist after the
backend exits), and if it's that, you'd be able to see the handle
count going up in the process monitor for some longer running process
somewhere (as seen in this report from the Chrome hackers[1]).  And if
it's not that, then I would guess it would have to be a kernel bug
because something outside userspace must be holding onto/leaking
handles.  But I don't really understand Windows beyond trying to debug
PostgreSQL at a distance, so my guesses may be way off.  If we wanted
to try to find a Windows expert to look at a standalone repro, does
your PS script work with *any* source directory, or is there something
about the initdb template, in which case could you post it in a .zip
file so that a non-PostgreSQL person could see the failure mode?

[1] 
https://randomascii.wordpress.com/2021/07/25/finding-windows-handle-leaks-in-chromium-and-others/


That's very interesting reading. I'll try to research the issue that deep
later (though I guess this case is different — after logging off and
logging in as another user, I can't see any processes belonging to the
first one, while those "Event objects" in non-paged pool still occupy
memory), but finding a Windows expert who perhaps can look at the
robocopy's sources, would be good too (and more productive).

So, the repro we can show is:
rm -r c:\temp\source
mkdir c:\temp\source
for ($i = 1; $i -le 1000; $i++)
{
echo 1 > "c:\temp\source\$i"
}

for ($i = 1; $i -le 1000; $i++)
{
echo "iteration $i"
rm -r c:\temp\target
robocopy.exe /E /NJH /NFL /NDL /NP c:\temp\source c:\temp\target
Get-WmiObject -Class Win32_PerfRawData_PerfOS_Memory | % PoolNonpagedBytes
}

It produces for me (on Windows 10 [Version 10.0.19045.4780]):
iteration 1
...
216887296
...
iteration 1000


--

   Total    Copied   Skipped  Mismatch    FAILED Extras
    Dirs : 1 1 0 0 0 0
   Files :  1000  1000 0 0 0 0
   Bytes : 7.8 k 7.8 k 0 0 0 0
   Times :   0:00:00   0:00:00   0:00:00 0:00:00


   Speed :   17660 Bytes/sec.
   Speed :   1.010 MegaBytes/min.
   Ended : Monday, September 16, 2024 8:58:09 PM

365080576

Just "touch c:\temp\source\$i" is not enough, files must be non-empty for
the leak to happen.

Best regards,
Alexander




Re: Add contrib/pg_logicalsnapinspect

2024-09-16 Thread David G. Johnston
On Monday, September 16, 2024, shveta malik  wrote:

> On Tue, Sep 17, 2024 at 10:18 AM shveta malik 
> wrote:
> >
> > Thanks for addressing the comments. I have not started reviewing v4
> > yet, but here are few more comments on v3:
> >
>
> I just noticed that when we pass NULL input, both the new functions
> give 1 row as output, all cols as NULL:
>
> newdb1=# SELECT * FROM pg_get_logical_snapshot_meta(NULL);
>  magic | checksum | version
> -------+----------+---------
>        |          |
> (1 row)
>
> Similar behavior with pg_get_logical_snapshot_info(). While the
> existing 'pg_ls_logicalsnapdir' function gives this error, which looks
> more meaningful:
>
> newdb1=# select * from pg_ls_logicalsnapdir(NULL);
> ERROR:  function pg_ls_logicalsnapdir(unknown) does not exist
> LINE 1: select * from pg_ls_logicalsnapdir(NULL);
> HINT:  No function matches the given name and argument types. You
> might need to add explicit type casts.
>
>
> Shouldn't the new functions have same behavior?
>

No. Since the name pg_ls_logicalsnapdir has zero single-argument
implementations, passing a null value as an argument is indeed an attempt to
invoke a function signature that doesn’t exist.

If there is exactly one single-input-argument function of the given name,
the parser is going to cast the null literal to the data type of the single
argument and invoke the function. It will not and cannot be convinced to
fail to find a matching function.

I can see an argument that they should produce an empty set instead of a
single all-null row, but the idea that they wouldn’t even be found is
contrary to a core design of the system.
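A minimal illustration of that resolution rule (a throwaway function created purely for demonstration):

```sql
-- With exactly one single-argument candidate, a bare NULL is cast to that
-- argument's type and the function is found and called.
CREATE FUNCTION demo_fn(int) RETURNS int LANGUAGE sql AS 'SELECT $1';
SELECT demo_fn(NULL);   -- resolves to demo_fn(integer) and returns NULL
```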

David J.


Re: Regression tests fail with tzdata 2024b

2024-09-16 Thread Sven Klemm
On Mon, Sep 16, 2024 at 5:19 PM Tom Lane  wrote:

> Configurable to what?  If your test cases are dependent on the
> historical behavior of PST8PDT, you're out of luck, because that
> simply isn't available anymore (or won't be once 2024b reaches
> your platform, anyway).
>

I was wondering whether the timezone used by pg_regress could be made
configurable.

-- 
Regards, Sven Klemm


Re: [Proposal] Add foreign-server health checks infrastructure

2024-09-16 Thread Fujii Masao



On 2024/08/08 11:38, Hayato Kuroda (Fujitsu) wrote:

Dear Fujii-san,

Thanks for reviewing! PSA new version.


Thanks for updating the patch! LGTM.

I made a couple of small adjustments and attached the updated version.
If that's ok, I'll go ahead and commit it.

+ Name of the local user mapped to the foreign server of this
+ connection, or "public" if a public mapping is used. If the user

I enclosed "public" with  tag, i.e., public.


I did not do that because either server_name or user_name is NULL and
it might be strange. But yes, the example should have more information.
Based on that, I added a tuple so that the example shows the cases below. Thoughts?

loopback1 - user is "postgres", valid
loopback2 - user is "public", valid
loopback3 - user is NULL, invalid


LGTM.
Also I added the following to the example for clarity:

postgres=# SELECT * FROM postgres_fdw_get_connections(true);

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From 1b50979b84d2f8718a3af41bd407a76a0d048ea6 Mon Sep 17 00:00:00 2001
From: Hayato Kuroda 
Date: Thu, 18 Jul 2024 10:11:03 +
Subject: [PATCH v4] postgres_fdw: Extend postgres_fdw_get_connections to
 return user name.

This commit adds a "user_name" output column to
the postgres_fdw_get_connections function, returning the name
of the local user mapped to the foreign server for each connection.
If a public mapping is used, it returns "public."

This helps identify postgres_fdw connections more easily,
such as determining which connections are invalid, closed,
or used within the current transaction.

No extension version bump is needed, as commit c297a47c5f
already handled it for v18~.

Author: Hayato Kuroda
Reviewed-by: Fujii Masao
Discussion: 
https://postgr.es/m/b492a935-6c7e-8c08-e485-3c1d64d7d...@oss.nttdata.com
---
 contrib/postgres_fdw/connection.c | 57 +++
 .../postgres_fdw/expected/postgres_fdw.out| 16 +++---
 .../postgres_fdw/postgres_fdw--1.1--1.2.sql   |  3 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |  8 ++-
 doc/src/sgml/postgres-fdw.sgml| 23 ++--
 5 files changed, 81 insertions(+), 26 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c 
b/contrib/postgres_fdw/connection.c
index 12d1fec0e8..2e5303eac1 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -1997,8 +1997,8 @@ pgfdw_finish_abort_cleanup(List *pending_entries, List 
*cancel_requested,
 
 /* Number of output arguments (columns) for various API versions */
 #define POSTGRES_FDW_GET_CONNECTIONS_COLS_V1_1 2
-#define POSTGRES_FDW_GET_CONNECTIONS_COLS_V1_2 4
-#define POSTGRES_FDW_GET_CONNECTIONS_COLS  4   /* maximum of above */
+#define POSTGRES_FDW_GET_CONNECTIONS_COLS_V1_2 5
+#define POSTGRES_FDW_GET_CONNECTIONS_COLS  5   /* maximum of above */
 
 /*
  * Internal function used by postgres_fdw_get_connections variants.
@@ -2014,10 +2014,13 @@ pgfdw_finish_abort_cleanup(List *pending_entries, List 
*cancel_requested,
  *
  * For API version 1.2 and later, this function takes an input parameter
  * to check a connection status and returns the following
- * additional values along with the two values from version 1.1:
+ * additional values along with the three values from version 1.1:
  *
+ * - user_name - the local user name of the active connection. In case the
+ *   user mapping is dropped but the connection is still active, then the
+ *   user name will be NULL in the output.
  * - used_in_xact - true if the connection is used in the current transaction.
- * - closed: true if the connection is closed.
+ * - closed - true if the connection is closed.
  *
  * No records are returned when there are no cached connections at all.
  */
@@ -2056,6 +2059,7 @@ postgres_fdw_get_connections_internal(FunctionCallInfo 
fcinfo,
ForeignServer *server;
Datum   values[POSTGRES_FDW_GET_CONNECTIONS_COLS] = {0};
boolnulls[POSTGRES_FDW_GET_CONNECTIONS_COLS] = {0};
+   int i = 0;
 
/* We only look for open remote connections */
if (!entry->conn)
@@ -2100,28 +2104,61 @@ postgres_fdw_get_connections_internal(FunctionCallInfo 
fcinfo,
Assert(entry->conn && entry->xact_depth > 0 && 
entry->invalidated);
 
/* Show null, if no server name was found */
-   nulls[0] = true;
+   nulls[i++] = true;
}
else
-   values[0] = CStringGetTextDatum(server->servername);
+   values[i++] = CStringGetTextDatum(server->servername);
 
-   values[1] = BoolGetDatum(!entry->invalidated);
+   if (api_version >= PGFDW_V1_2)
+   {
+   HeapTuple   tp;
+
+   /* Use the system cache to obta

Re: Regression tests fail with tzdata 2024b

2024-09-16 Thread Tom Lane
Sven Klemm  writes:
> On Mon, Sep 16, 2024 at 5:19 PM Tom Lane  wrote:
>> Configurable to what?  If your test cases are dependent on the
>> historical behavior of PST8PDT, you're out of luck, because that
>> simply isn't available anymore (or won't be once 2024b reaches
>> your platform, anyway).

> I was wondering whether the timezone used by pg_regress could be made
> configurable.

Yes, I understood that you were suggesting that.  My point is that
it wouldn't do you any good: you will still have to change any
regression test cases that depend on behavior PST8PDT has/had that
is different from America/Los_Angeles.  That being the case,
I don't see much value in making it configurable.

regards, tom lane




RE: Using per-transaction memory contexts for storing decoded tuples

2024-09-16 Thread Hayato Kuroda (Fujitsu)
Hi,

> We have several reports that logical decoding uses memory much more
> than logical_decoding_work_mem[1][2][3]. For instance in one of the
> reports[1], even though users set logical_decoding_work_mem to
> '256MB', a walsender process was killed by OOM because of using more
> than 4GB memory.

I appreciate your work on logical replication and am interested in the thread.
I've heard this issue from others, and this has been the barrier to using 
logical
replication. Please let me know if I can help with benchmarking, other
measurements, etc.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: [HACKERS] make async slave to wait for lsn to be replayed

2024-09-16 Thread Alexander Lakhin

Hi Alexander,

16.09.2024 21:55, Alexander Korotkov wrote:

Please find two patches attached.  The first one does minor cleanup
including misuse of words you've pointed.  The second one adds missing
wait_for_catchup().  That should fix the test failure you've spotted.
Please, check if it fixes an issue for you.


Thank you for looking at that!

Unfortunately, the issue is still here — the test failed for me 6 out of
10 runs, as below:
[05:14:02.807](0.135s) ok 8 - got error after standby promote
error running SQL: 'psql::1: ERROR:  recovery is not in progress
HINT:  Waiting for LSN can only be executed during recovery.'
while running 'psql -XAtq -d port=12734 host=/tmp/04hQ75NuXf dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'CALL 
pg_wal_replay_wait('0/300F248');' at .../src/test/recovery/../../../src/test/perl/PostgreSQL/Test/Cluster.pm line 2140.


043_wal_replay_wait_standby.log:
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl ERROR: recovery is 
not in progress
2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl DETAIL:  Recovery ended before replaying target LSN 
2/570CD648; last replay LSN 0/300F210.

2024-09-17 05:14:02.714 UTC [1817258] 043_wal_replay_wait.pl STATEMENT:  CALL 
pg_wal_replay_wait('2/570CD648');
2024-09-17 05:14:02.714 UTC [1817155] LOG:  checkpoint starting: force
2024-09-17 05:14:02.714 UTC [1817154] LOG:  database system is ready to accept 
connections
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl LOG: statement: 
CALL pg_wal_replay_wait('0/300F248');
2024-09-17 05:14:02.811 UTC [1817270] 043_wal_replay_wait.pl ERROR: recovery is 
not in progress

pg_waldump -p .../t_043_wal_replay_wait_primary_data/pgdata/pg_wal/ 
00010003
rmgr: Transaction len (rec/tot): 34/    34, tx:    748, lsn: 0/0300F1E8, prev 0/0300F1A8, desc: COMMIT 
2024-09-17 05:14:01.654874 UTC
rmgr: Standby len (rec/tot): 50/    50, tx:  0, lsn: 0/0300F210, prev 0/0300F1E8, desc: RUNNING_XACTS 
nextXid 749 latestCompletedXid 748 oldestRunningXid 749


I wonder, can you reproduce this with that bgwriter's modification?

I've also found two more "achievements" coined by 3c5db1d6b:
doc/src/sgml/func.sgml:   It may also happen that target lsn 
is not achieved
src/backend/access/transam/xlog.c-   * recovery was ended before achieving 
the target LSN.

Best regards,
Alexander




Wrong results with equality search using trigram index and non-deterministic collation

2024-09-16 Thread Laurenz Albe
Using a trigram index with an non-deterministic collation can
lead to wrong query results:

  CREATE COLLATION faux_cn (PROVIDER = icu, LOCALE = 'und', DETERMINISTIC = 
FALSE, RULES = '&l = r');

  CREATE TABLE boom (id integer PRIMARY KEY, t text COLLATE faux_cn);

  INSERT INTO boom VALUES (1, 'right'), (2, 'light');

  SELECT * FROM boom WHERE t = 'right';

   id │   t
  ════╪═══════
    1 │ right
    2 │ light
  (2 rows)

  CREATE INDEX ON boom USING gin (t gin_trgm_ops);

  SET enable_seqscan = off;

  SELECT * FROM boom WHERE t = 'right';

   id │   t
  ════╪═══════
    1 │ right
  (1 row)

I also see questionable results with the similarity operator (with and
without the index):

  SELECT * FROM boom WHERE t % 'rigor';

   id │   t
  ════╪═══════
    1 │ right
  (1 row)

But here you could argue that the operator ignores the collation, so
the result is correct.  With equality, there is no such loophole.

I don't know what the correct fix would be.  Perhaps just refusing to use
the index for equality comparisons with non-deterministic collations.

Yours,
Laurenz Albe




Re: Add memory/disk usage for WindowAgg nodes in EXPLAIN

2024-09-16 Thread Tatsuo Ishii
> On Tue, 17 Sept 2024 at 14:40, Tatsuo Ishii  wrote:
>> Attached is the v5 patch. The difference from v4 is the addition of two
>> more tests to explain.sql:
>>
>> 1) spills to disk case
>> 2) spills to disk then switches back to memory case
> 
> Looks ok to me, aside from the missing "reset work_mem;" after you're
> done with testing the disk spilling code.

Thanks. I have added it and pushed the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




RE: [Proposal] Add foreign-server health checks infrastructure

2024-09-16 Thread Hayato Kuroda (Fujitsu)
Dear Fujii-san,

Thanks for reviewing!

> I made a couple of small adjustments and attached the updated version.
> If that's ok, I'll go ahead and commit it.
> 
> + Name of the local user mapped to the foreign server of this
> + connection, or "public" if a public mapping is used. If the user
> 
> I enclosed "public" with  tag, i.e., public.

Right, it should be. I grepped sgml files just in case, but they are tagged by 
.

> > I did not do that because either server_name or user_name is NULL and
> > it might be strange. But yes, the example should have more information.
> > Based on that, I added a tuple so that the example shows the cases below. Thoughts?
> >
> > loopback1 - user is "postgres", valid
> > loopback2 - user is "public", valid
> > loopback3 - user is NULL, invalid
> 
> LGTM.
> Also I added the following to the example for clarity:
> 
> postgres=# SELECT * FROM postgres_fdw_get_connections(true);

+1.

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Pgoutput not capturing the generated columns

2024-09-16 Thread Masahiko Sawada
On Mon, Sep 16, 2024 at 8:09 PM Peter Smith  wrote:
>
> I thought that the option "publish_generated_columns" is more related
> to "column lists" than "row filters".
>
> Let's say table 't1' has columns 'a', 'b', 'c', 'gen1', 'gen2'.
>

> And
> PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
> is equivalent to
> PUBLICATION pub2 FOR TABLE t1(a,b,c);

This makes sense to me as it preserves the current behavior.

> Then:
> PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = true);
> is equivalent to
> PUBLICATION pub1 FOR TABLE t1(a,b,c,gen1,gen2);

This also makes sense. It would also include future generated columns.

> So, I would expect this to fail because the SUBSCRIPTION docs say
> "Subscriptions having several publications in which the same table has
> been published with different column lists are not supported."

So I agree that it would raise an error if users subscribe to both
pub1 and pub2.

And looking back at your examples,

> > > e.g.1
> > > -
> > > CREATE PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = 
> > > true);
> > > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = 
> > > false);
> > > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > > -
> > >
> > > e.g.2
> > > -
> > > CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns = 
> > > true);
> > > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = 
> > > false);
> > > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > > -

Both examples would not be supported.

> > >
> > > The CREATE SUBSCRIPTION docs [1] only says "Subscriptions having
> > > several publications in which the same table has been published with
> > > different column lists are not supported."
> > >
> > > Perhaps the user is supposed to deduce that the example above would
> > > work OK if table 't1' has no generated cols. OTOH, if it did have
> > > generated cols then the PUBLICATION column lists must be different and
> > > therefore it is "not supported" (??).
> >
> > With the patch, how should this feature work when users specify a
> > generated column to the column list and set publish_generated_column =
> > false, in the first place? raise an error (as we do today)? or always
> > send NULL?
>
> For this scenario, I suggested (see [1] #3) that the code could give a
> WARNING. As I wrote up-thread: This combination doesn't seem
> like something a user would do intentionally, so just silently
> ignoring it (which the current patch does) is likely going to give
> someone unexpected results/grief.

It gives a WARNING, and then publishes the specified generated column
data (even if publish_generated_column = false)? If so, it would mean
that specifying the generated column to the column list means to
publish its data regardless of the publish_generated_column parameter
value.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Pgoutput not capturing the generated columns

2024-09-16 Thread Peter Smith
On Tue, Sep 17, 2024 at 4:15 PM Masahiko Sawada  wrote:
>
> On Mon, Sep 16, 2024 at 8:09 PM Peter Smith  wrote:
> >
> > I thought that the option "publish_generated_columns" is more related
> > to "column lists" than "row filters".
> >
> > Let's say table 't1' has columns 'a', 'b', 'c', 'gen1', 'gen2'.
> >
>
> > And
> > PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = false);
> > is equivalent to
> > PUBLICATION pub2 FOR TABLE t1(a,b,c);
>
> This makes sense to me as it preserves the current behavior.
>
> > Then:
> > PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = true);
> > is equivalent to
> > PUBLICATION pub1 FOR TABLE t1(a,b,c,gen1,gen2);
>
> This also makes sense. It would also include future generated columns.
>
> > So, I would expect this to fail because the SUBSCRIPTION docs say
> > "Subscriptions having several publications in which the same table has
> > been published with different column lists are not supported."
>
> So I agree that it would raise an error if users subscribe to both
> pub1 and pub2.
>
> And looking back at your examples,
>
> > > > e.g.1
> > > > -
> > > > CREATE PUBLICATION pub1 FOR TABLE t1 WITH (publish_generated_columns = 
> > > > true);
> > > > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = 
> > > > false);
> > > > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > > > -
> > > >
> > > > e.g.2
> > > > -
> > > > CREATE PUBLICATION pub1 FOR ALL TABLES WITH (publish_generated_columns 
> > > > = true);
> > > > CREATE PUBLICATION pub2 FOR TABLE t1 WITH (publish_generated_columns = 
> > > > false);
> > > > CREATE SUBSCRIPTION sub ... PUBLICATIONS pub1,pub2;
> > > > -
>
> Both examples would not be supported.
>
> > > >
> > > > The CREATE SUBSCRIPTION docs [1] only says "Subscriptions having
> > > > several publications in which the same table has been published with
> > > > different column lists are not supported."
> > > >
> > > > Perhaps the user is supposed to deduce that the example above would
> > > > work OK if table 't1' has no generated cols. OTOH, if it did have
> > > > generated cols then the PUBLICATION column lists must be different and
> > > > therefore it is "not supported" (??).
> > >
> > > With the patch, how should this feature work when users specify a
> > > generated column to the column list and set publish_generated_column =
> > > false, in the first place? raise an error (as we do today)? or always
> > > send NULL?
> >
> > For this scenario, I suggested (see [1] #3) that the code could give a
> > WARNING. As I wrote up-thread: This combination doesn't seem
> > like something a user would do intentionally, so just silently
> > ignoring it (which the current patch does) is likely going to give
> > someone unexpected results/grief.
>
> It gives a WARNING, and then publishes the specified generated column
> data (even if publish_generated_column = false)? If so, it would mean
> that specifying the generated column to the column list means to
> publish its data regardless of the publish_generated_column parameter
> value.
>

No. I meant only that it can give the WARNING to tell the user: "Hey,
there is a conflict here because you said publish_generated_column =
false, but you also specified gencols in the column list".

But it is always the option "publish_generated_column" that determines the
final publishing behaviour. So if it says
publish_generated_column = false then it would NOT publish generated
columns even if gencols are named in the column list. I think this
makes sense because when there is no column list specified then that
implicitly means "all columns" and the table might have some gencols,
but still 'publish_generated_columns' is what determines the
behaviour.
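To make the case concrete, the conflicting specification being discussed would look roughly like this (a sketch only; publish_generated_columns is the option proposed by this patch, and the table/publication names are made up):

```sql
-- Sketch: a generated column named in the column list while the proposed
-- publish_generated_columns option is false; this is the case that would
-- draw the WARNING, and gen1 would still not be published.
CREATE TABLE t1 (a int, gen1 int GENERATED ALWAYS AS (a * 2) STORED);
CREATE PUBLICATION pub_conflict FOR TABLE t1 (a, gen1)
    WITH (publish_generated_columns = false);
```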

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Regression tests fail with tzdata 2024b

2024-09-16 Thread Wolfgang Walther

Tom Lane:

I was wondering whether the timezone used by pg_regress could be made
configurable.


Yes, I understood that you were suggesting that.  My point is that
it wouldn't do you any good: you will still have to change any
regression test cases that depend on behavior PST8PDT has/had that
is different from America/Los_Angeles.  That being the case,
I don't see much value in making it configurable.


Just changing it back to PST8PDT wouldn't really help as Tom pointed 
out. You'd still get different results depending on which tzdata version 
you are running with.


The core regression tests need to be run with a timezone that tests 
special cases in the timezone handling code. But that might not be true 
for extensions - all they want could be a stable output across major and 
minor versions of postgres and versions of tzdata. It could be helpful 
to set pg_regress' timezone to UTC, for example?


Best,

Wolfgang




Re: Conflict detection for update_deleted in logical replication

2024-09-16 Thread Amit Kapila
On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada  wrote:
>
> On Fri, Sep 13, 2024 at 12:56 AM shveta malik  wrote:
> >
> > On Fri, Sep 13, 2024 at 11:38 AM Amit Kapila  
> > wrote:
> > >
> > > > >
> > > > > So in brief, this solution is only for bidrectional setup? For 
> > > > > non-bidirectional,
> > > > > feedback_slots is non-configurable and thus irrelevant.
> > > >
> > > > Right.
> > > >
> > >
> > > One possible idea to address the non-bidirectional case raised by
> > > Shveta is to use a time-based cut-off to remove dead tuples. As
> > > mentioned earlier in my email [1], we can define a new GUC parameter
> > > say vacuum_committs_age which would indicate that we will allow rows
> > > to be removed only if the modified time of the tuple as indicated by
> > > committs module is greater than the vacuum_committs_age. We could keep
> > > this parameter a table-level option without introducing a GUC as this
> > > may not apply to all tables. I checked and found that some other
> > > replication solutions like GoldenGate also allowed similar parameters
> > > (tombstone_deletes) to be specified at table level [2]. The other
> > > advantage of allowing it at table level is that it won't hamper the
> > > performance of hot-pruning or vacuum in general. Note, I am careful
> > > here because to decide whether to remove a dead tuple or not we need
> > > to compare its committs_time both during hot-pruning and vacuum.
> >
> > +1 on the idea,
>
> I agree that this idea is much simpler than the idea originally
> proposed in this thread.
>
> IIUC vacuum_committs_age specifies a time rather than an XID age.
>

Your understanding is correct that vacuum_committs_age specifies a time.

>
> But
> how can we implement it? If it ends up affecting the vacuum cutoff, we
> should be careful not to end up with the same result of
> vacuum_defer_cleanup_age that was discussed before[1]. Also, I think
> the implementation needs not to affect the performance of
> ComputeXidHorizons().
>

I haven't thought about the implementation details yet but I think
during pruning (for example in heap_prune_satisfies_vacuum()), apart
from checking if the tuple satisfies
HeapTupleSatisfiesVacuumHorizon(), we should also check if the tuple's
committs is greater than configured vacuum_committs_age (for the
table) to decide whether tuple can be removed. One thing to consider
is what to do in case of aggressive vacuum where we expect
relfrozenxid to be advanced to FreezeLimit (at a minimum). We may want
to just ignore vacuum_committs_age during aggressive vacuum and LOG if
we end up removing some tuple. This will allow users to retain deleted
tuples by respecting the freeze limits which also avoid xid_wrap
around. I think we can't retain tuples forever if the user
misconfigured vacuum_committs_age and to avoid that we can keep the
maximum limit on this parameter to say an hour or so. Also, users can
tune freeze parameters if they want to retain tuples for longer.
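As a purely hypothetical illustration of the table-level form being discussed (the reloption does not exist today; the syntax, table name, and value are only a sketch of the idea):

```sql
-- Hypothetical: vacuum_committs_age as a per-table storage parameter, capped
-- by the server to some maximum as described above.
ALTER TABLE orders SET (vacuum_committs_age = '10min');
```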

> > but IIUC this value doesn’t need to be significant; it
> > can be limited to just a few minutes. The one which is sufficient to
> > handle replication delays caused by network lag or other factors,
> > assuming clock skew has already been addressed.
>
> I think that in a non-bidirectional case the value could need to be a
> large number. Is that right?
>

As per my understanding, even for non-bidirectional cases, the value
should be small. For example, in the case pointed out by Shveta [1],
where the updates from 2 nodes are received by a third node, this
setting is expected to be small. This setting primarily deals with
concurrent transactions on multiple nodes, so it should be small but I
could be missing something.

[1] - 
https://www.postgresql.org/message-id/CAJpy0uAzzOzhXGH-zBc7Zt8ndXRf6r4OnLzgRrHyf8cvd%2Bfpwg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.