Re: First draft of PG 17 release notes

2024-09-10 Thread Michael Banck
On Thu, Sep 05, 2024 at 09:51:25PM -0400, Bruce Momjian wrote:
> On Tue, Sep  3, 2024 at 10:44:01AM -0500, Nathan Bossart wrote:
> > While freely acknowledging that I am biased because I wrote it, I am a bit
> > surprised to see the DSM registry left out of the release notes (commit
> > 8b2bcf3, docs are here [0]).  This feature is intended to allow modules to
> > allocate shared memory after startup, i.e., without requiring the module to
> > be loaded via shared_preload_libraries.  IMHO that is worth mentioning.
> > 
> > [0] 
> > https://www.postgresql.org/docs/devel/xfunc-c.html#XFUNC-SHARED-ADDIN-AFTER-STARTUP
> 
> That seems more infrastructure/extension author stuff which isn't
> normally mentioned in the release notes.

If I understand the feature correctly, it allows extensions to simply be
CREATEd without having to add them to shared_preload_libraries,
i.e. saving the organization an instance restart and the associated downtime.

That seems important enough for end-users to know, even if they will
need to wait for extension authors to catch up to this (but I guess a
lot will).
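
For illustration, a minimal sketch of how an extension might use the new DSM
registry API described in the docs linked above; the struct and function names
here (MyExtState, my_ext_init, my_ext_attach) are made up for the example:

#include "postgres.h"
#include "storage/dsm_registry.h"
#include "storage/lwlock.h"

typedef struct MyExtState
{
	LWLock		lock;
	int			counter;
} MyExtState;

static void
my_ext_init(void *ptr)
{
	MyExtState *state = (MyExtState *) ptr;

	/* Runs only once, when the named segment is first created. */
	LWLockInitialize(&state->lock, LWLockNewTrancheId());
	state->counter = 0;
}

static MyExtState *
my_ext_attach(void)
{
	bool		found;
	MyExtState *state;

	/* Creates the named segment on first call, attaches on later calls. */
	state = (MyExtState *) GetNamedDSMSegment("my_extension",
											  sizeof(MyExtState),
											  my_ext_init,
											  &found);
	LWLockRegisterTranche(state->lock.tranche, "my_extension");
	return state;
}

Since this can be called lazily from any backend (e.g. from a SQL-callable
function), no shared_preload_libraries entry, and hence no restart, is needed.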


Michael




Re: GUC names in messages

2024-09-10 Thread Peter Smith
On Wed, Sep 4, 2024 at 3:54 PM Michael Paquier  wrote:
>
...
> 0001 and 0004 have been applied with these tweaks.  I am still not
> sure about the changes for DateStyle and IntervalStyle in 0002 and
> 0003.  Perhaps others have an opinion that could drive to a consensus.
>

Thanks for pushing the patches 0001 and 0004.

I have rebased the two remaining patches. See v12 attached.

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v12-0002-GUC-names-fix-case-datestyle.patch
Description: Binary data


v12-0001-GUC-names-fix-case-intervalstyle.patch
Description: Binary data


Re: json_query conditional wrapper bug

2024-09-10 Thread Amit Langote
Sorry for missing this report and thanks Andrew for the offlist heads up.

On Wed, Sep 4, 2024 at 7:16 PM Peter Eisentraut  wrote:
> On 28.08.24 11:21, Peter Eisentraut wrote:
> > These are ok:
> >
> > select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' without wrapper);
> >   json_query
> > 
> >   42
> >
> > select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with
> > unconditional wrapper);
> >   json_query
> > 
> >   [42]
> >
> > But this appears to be wrong:
> >
> > select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with conditional
> > wrapper);
> >   json_query
> > 
> >   [42]
> >
> > This should return an unwrapped 42.
>
> If I make the code change illustrated in the attached patch, then I get
> the correct result here.  And various regression test results change,
> which, to me, all look more correct after this patch.  I don't know what
> the code I removed was supposed to accomplish, but it seems to be wrong
> somehow.  In the current implementation, the WITH CONDITIONAL WRAPPER
> clause doesn't appear to work correctly in any case I could identify.

Agreed that this looks wrong.

I've wondered why the condition was like that but left it as-is,
because I thought at one point that that's needed to ensure that the
returned single scalar SQL/JSON item is valid jsonb.

I've updated your patch to include updated test outputs and an expanded
nearby code comment.  Do you intend to commit it, or would you prefer
that I do?
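
For clarity, the expected behavior once the fix is applied, per the report
quoted above (a single scalar result should not be wrapped):

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with conditional wrapper);
 json_query
------------
 42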

-- 
Thanks, Amit Langote


v2-0001-WIP-Fix-JSON_QUERY-WITH-CONDITIONAL-WRAPPER.patch
Description: Binary data


Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-09-10 Thread Peter Eisentraut

On 02.09.24 14:26, Daniel Gustafsson wrote:

On 2 Sep 2024, at 10:03, Daniel Gustafsson  wrote:


On 23 Aug 2024, at 01:56, Michael Paquier  wrote:

On Thu, Aug 22, 2024 at 11:13:15PM +0200, Daniel Gustafsson wrote:

On 22 Aug 2024, at 02:31, Michael Paquier  wrote:

Just do it :)


That's my plan, I wanted to wait a bit to see if anyone else chimed in with
concerns.


Cool, thanks!


Attached is a rebased v15 (only changes are commit-message changes noted by
Peter upthread) for the sake of archives, and for a green-check run in the
CFBot.  Assuming this builds green I intend to push this.


And pushed.  All BF owners with animals using 1.0.2 have been notified but not
all have been updated (or modified to skip SSL) so there will be some failing.


A small follow-up for this:  With the current minimum OpenSSL version 
being 1.1.0, we can remove an unconstify() call; see attached patch.


See this OpenSSL commit: 
.  The analogous 
LibreSSL change is here: 
. 
 I don't know if we have a concrete minimum LibreSSL version, but the 
change is about as old as the OpenSSL change.


From 91550eaac4883113b9e85361c5c049a6555cd2f3 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 10 Sep 2024 09:53:32 +0200
Subject: [PATCH] Remove obsolete unconstify()

This is no longer needed as of OpenSSL 1.1.0 (the current minimum
version).  LibreSSL made the same change around the same time as well.
---
 src/backend/libpq/be-secure-openssl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/libpq/be-secure-openssl.c 
b/src/backend/libpq/be-secure-openssl.c
index 1ebd3f2e6d3..8ec78c83304 100644
--- a/src/backend/libpq/be-secure-openssl.c
+++ b/src/backend/libpq/be-secure-openssl.c
@@ -1075,7 +1075,7 @@ load_dh_buffer(const char *buffer, size_t len)
BIO*bio;
DH *dh = NULL;
 
-   bio = BIO_new_mem_buf(unconstify(char *, buffer), len);
+   bio = BIO_new_mem_buf(buffer, len);
if (bio == NULL)
return NULL;
dh = PEM_read_bio_DHparams(bio, NULL, NULL, NULL);
-- 
2.46.0



Re: Index AM API cleanup

2024-09-10 Thread Peter Eisentraut

On 03.09.24 19:26, Mark Dilger wrote:

On Sep 3, 2024, at 9:52 AM, Peter Eisentraut  wrote:

Here is a cleaned-up version of the v17-0004 patch.  I have applied the 
renaming discussed above.  I have also made a wrapper function 
btgettreeheight() that calls _bt_getrootheight().  That way, if we ever want to 
change the API, we don't have to change _bt_getrootheight(), or disentangle it 
then.  I've also added documentation and put in a source code comment that the 
API could be expanded for additional uses.


Ok.


  Also, I have removed the addition to the IndexOptInfo struct; that didn't 
seem necessary.


Good catch.  I agree with the change.


I have committed this patch.  (It needed another small change to silence 
a compiler warning about an uninitialized variable.)
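
(A sketch of the thin wrapper described above, not necessarily the committed
source verbatim; it simply forwards to the existing internal function so that
_bt_getrootheight() stays decoupled from the index AM API:

int
btgettreeheight(Relation rel)
{
	return _bt_getrootheight(rel);
}
)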






RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 10, 2024 2:45 PM shveta malik  
wrote:
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's safe 
> > and
> won't cause side-effect to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> > If the slots in feedback_slots are removed, a final message containing
> > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > the slot.xmin.
> >
> > To detect update_deleted conflicts during update operations, if the
> > target row cannot be found, we perform an additional scan of the table using
> snapshotAny.
> > This scan aims to locate the most recently deleted row that matches
> > the old column values from the remote update operation and has not yet
> > been removed by VACUUM. If any such tuples are found, we report the
> > update_deleted conflict along with the origin and transaction information
> that deleted the tuple.
> >
> > Please refer to the attached POC patch set which implements above
> > design. The patch set is split into some parts to make it easier for the 
> > initial
> review.
> > Please note that each patch is interdependent and cannot work
> independently.
> >
> > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> >
> > Suggestions and comments are highly appreciated !
> >
> 
> Thank You Hou-San for explaining the design. But to make it easier to
> understand, would you be able to explain the sequence/timeline of the
> *new* actions performed by the walsender and the apply processes for the
> given example along with new feedback_slot config needed
> 
> Node A: (Procs: walsenderA, applyA)
>   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
>   T2: DELETE FROM t WHERE id = 1;   ts=10.02 AM
> 
> Node B: (Procs: walsenderB, applyB)
>   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM

Thanks for reviewing! Let me elaborate further on the example:

On node A, feedback_slots should include the logical slot that is used to
replicate changes from Node A to Node B. On node B, feedback_slots should
include the logical slot that replicates changes from Node B to Node A.

Assume the slot.xmin on Node A has been initialized to a valid number (740)
before the following flow:

Node A executed T1                                                       - 10.00 AM
T1 replicated and applied on Node B                                      - 10.0001 AM
Node B executed T3                                                       - 10.01 AM
Node A executed T2 (741)                                                 - 10.02 AM
T2 replicated and applied on Node B (delete_missing)                     - 10.03 AM
T3 replicated and applied on Node A (new action, detect update_deleted)  - 10.04 AM

(new action) The apply worker on Node B has confirmed that T2 has been applied
locally and that the transactions before T2 (e.g., T3) have been replicated and
applied to Node A (e.g. feedback_slot.confirmed_flush_lsn >= lsn of the locally
replayed T2), and thus sends the new feedback message to Node A.         - 10.05 AM
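
For illustration only, the POC subscription option above might be configured
roughly as follows (proposed, uncommitted syntax; the connection string and
names are assumptions, and the slot listed is the local slot that replicates
changes back to the remote node, as described in the design):

-- On Node A (which also publishes changes to Node B via slot 'b_sub'):
CREATE SUBSCRIPTION a_sub
    CONNECTION 'host=nodeB dbname=postgres'
    PUBLICATION b_pub
    WITH (feedback_slots = 'b_sub');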
  

Re: Cutting support for OpenSSL 1.0.1 and 1.0.2 in 17~?

2024-09-10 Thread Daniel Gustafsson
> On 10 Sep 2024, at 10:01, Peter Eisentraut  wrote:

>> And pushed.  All BF owners with animals using 1.0.2 have been notified but 
>> not
>> all have been updated (or modified to skip SSL) so there will be some 
>> failing.
> 
> A small follow-up for this:  With the current minimum OpenSSL version being 
> 1.1.0, we can remove an unconstify() call; see attached patch.

Nice catch.

> See this OpenSSL commit: 
> .  The analogous 
> LibreSSL change is here: 
> .
>   

> I don't know if we have a concrete minimum LibreSSL version, but the change 
> is about as old as the OpenSSL change.

We've never documented the minimum LibreSSL version we support, but given that
we regularly test LibreSSL and fix breakage in our support I think we should.

--
Daniel Gustafsson





Re: Pgoutput not capturing the generated columns

2024-09-10 Thread Peter Smith
IIUC, previously there was a subscriber side option
'include_generated_columns', but now since v30* there is a publisher
side option 'publish_generated_columns'.

Fair enough, but in the v30* patches I can still see remnants of the
old name 'include_generated_columns' all over the place:
- in the commit message
- in the code (struct field names, param names etc)
- in the comments
- in the docs

If the decision is to call the new PUBLICATION option
'publish_generated_columns', then can't we please use that one name
*everywhere* -- e.g. replace all cases where any old name is still
lurking?
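
(For reference, the publisher-side option under discussion would presumably be
specified along these lines; this reflects the proposed v30 syntax, not
committed behavior, and the publication/table names are just examples:

CREATE PUBLICATION pub_gen FOR TABLE t
    WITH (publish_generated_columns = true);
)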

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Retire support for OpenSSL 1.1.1 due to raised API requirements

2024-09-10 Thread Daniel Gustafsson
> On 10 Sep 2024, at 00:53, Michael Paquier  wrote:
> 
> On Mon, Sep 09, 2024 at 11:29:09PM +0200, Daniel Gustafsson wrote:
>> Agreed.  OpenSSL 1.1.1 is very different story and I suspect we'll be stuck 
>> on
>> that level for some time, but 1.1.0 is gone from production use.
> 
> The cleanup induced by the removal of 1.1.0 is minimal.  I'm on board
> about your argument with SSL_CTX_set_ciphersuites() to drop 1.1.0 and
> simplify the other feature.

Yeah, the change to existing code is trivial, but avoiding adding a kluge to
handle versions without the relevant API will save complexity.  Thanks for
the review.

This change will be committed together with the TLSv1.3 cipher suite patchset;
I just wanted to bring it up here and not hide it in another thread.

--
Daniel Gustafsson





Re: Add has_large_object_privilege function

2024-09-10 Thread Yugo NAGATA
On Mon, 9 Sep 2024 22:59:34 +0900
Fujii Masao  wrote:

> 
> 
> On 2024/07/02 16:34, Yugo NAGATA wrote:
> > So, I would like to propose to add
> > has_large_object_function for checking if a user has the privilege on a 
> > large
> > object.
> 
> +1

Thank you for looking into this!
I've attached an updated patch.

> 
> BTW since we already have pg_largeobject, using has_largeobject_privilege 
> might
> offer better consistency. However, I'm okay with the current name for now.
> Even after committing the patch, we can rename it if others prefer 
> has_largeobject_privilege.

I was also wondering which is better, so I renamed it because it seems at least
one person (you) thinks that has_largeobject_privilege might be better. If it
turns out that the majority prefer the previous name, I'll change it back.

> 
> As for 0001.patch, should we also remove the inclusion of "access/genam.h" and
> "access/htup_details.h" since they're no longer needed?

Removed.
 
> 
> > 0002 adds has_large_object_privilege function.There are three variations 
> > whose
> > arguments are combinations of large object OID with user name, user OID, or
> > implicit user (current_user).
> 
> As for 0002.patch, as the code in these three functions is mostly the same,
> it might be more efficient to create a common internal function and have
> the three functions call it for simplicity.

I made a new internal function "has_lo_priv_byid" that is called from these
functions.
 
> Here are other review comments for 0002.patch.
> 
> +  
> +   
> +
> + has_large_object_privilege
> 
> In the documentation, access privilege inquiry functions are listed
> alphabetically. So, this function's description should come right
> after has_language_privilege.

Fixed.

> 
> + * has_large_objec_privilege variants
> 
> Typo: s/objec/object

Fixed.

> 
> + *   The result is a boolean value: true if user has been granted
> + *   the indicated privilege or false if not.
> 
> The comment should clarify that NULL is returned if the specified
> large object doesn’t exist. For example,
> --
> The result is a boolean value: true if user has the indicated
> privilege, false if not, or NULL if object doesn't exist.
> --

Fixed.

> 
> +convert_large_object_priv_string(text *priv_text)
> 
> It would be better to use "priv_type_text" instead of "priv_text"
> for consistency with similar functions.
> 
> 
> + static const priv_map parameter_priv_map[] = {
> + {"SELECT", ACL_SELECT},
> + {"UPDATE", ACL_UPDATE},
> 
> parameter_priv_map should be large_object_priv_map.

Fixed.

> Additionally, the entries for "WITH GRANT OPTION" should be included here.

Fixed.

> 
> +-- not-existing user
> +SELECT has_large_object_privilege(-9, 1001, 'SELECT');   -- false
> + has_large_object_privilege
> +
> + t
> +(1 row)
> 
> 
> The comment states the result should be false, but the actual result is true.
> One of them seems incorrect.

I had assumed that has_table_privilege always returns false for a non-existent
user, but that is not correct.  Actually, it returns true if the privilege is
granted to public.  In the end I removed this check from the test because I
don't think it is important.
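
For reference, a usage sketch of the function added by the attached patch (the
role name and large object OID are just examples):

SELECT has_largeobject_privilege('joeuser', 1001, 'SELECT');
SELECT has_largeobject_privilege(1001, 'UPDATE');   -- implicit current_user
-- Returns NULL if large object 1001 does not exist; raises an error for an
-- unknown user name, mirroring has_table_privilege.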

Regards,
Yugo Nagata

-- 
Yugo NAGATA 
>From 98c2ce7ef1f03aac7a3bbfdc8c1599ea92d253f8 Mon Sep 17 00:00:00 2001
From: Yugo Nagata 
Date: Tue, 2 Jul 2024 15:12:17 +0900
Subject: [PATCH v2 2/2] Add has_largeobject_privilege function

This function is for checking whether a user has the privilege on a
large object. There are three variations whose arguments are combinations
of large object OID with user name, user OID, or implicit user (current_user).
It returns NULL if a non-existent large object OID is specified, and raises an
error if a non-existent user name is specified. This behavior is similar
to has_table_privilege.
---
 doc/src/sgml/func.sgml   |  18 +++
 src/backend/utils/adt/acl.c  | 140 
 src/include/catalog/pg_proc.dat  |  13 ++
 src/test/regress/expected/privileges.out | 162 +++
 src/test/regress/sql/privileges.sql  |  42 ++
 5 files changed, 375 insertions(+)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 461fc3f437..57f2e1f29b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25113,6 +25113,24 @@ SELECT has_function_privilege('joeuser', 'myfunc(int, text)', 'execute');

   
 
+  
+   
+
+ has_largeobject_privilege
+
+has_largeobject_privilege (
+   user name or oid, 
+  largeobject oid,
+  privilege text )
+boolean
+   
+   
+Does user have privilege for large object?
+Allowable privilege types are
+SELECT and UPDATE.
+   
+  
+
   

 
diff --git a/src/backend/utils/adt/acl.c b/src/backend/utils/adt/ac

Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-10 Thread Amit Kapila
On Tue, Sep 10, 2024 at 11:36 AM vignesh C  wrote:
>
> On Mon, 9 Sept 2024 at 13:12, Amit Kapila  wrote:
> >
> > The second part of the assertion is incomplete. The
> > IsIndexUsableForReplicaIdentityFull() should be used only when the
> > remote relation has REPLICA_IDENTITY_FULL set. I haven't tested all
> > possible cases yet but I think the attached should be a better way to
> > write this assertion.
>
> The changes look good to me.
>

Thanks, I'll push this tomorrow unless Dilip or anyone else has any
comments. BTW, as the current code doesn't lead to any bug or
assertion failure, I am planning to push this change to HEAD only, let
me know if you think otherwise.

With Regards,
Amit Kapila.




Re: long-standing data loss bug in initial sync of logical replication

2024-09-10 Thread Nitin Motiani
On Thu, Sep 5, 2024 at 4:04 PM Amit Kapila  wrote:
>
> On Mon, Sep 2, 2024 at 9:19 PM Nitin Motiani  wrote:
> >
> > I think that the partial data replication for one table is a bigger
> > issue than the case of data being sent for a subset of the tables in
> > the transaction. This can lead to inconsistent data if the same row is
> > updated multiple times or deleted in the same transaction. In such a
> > case if only the partial updates from the transaction are sent to the
> > subscriber, it might end up with the data which was never visible on
> > the publisher side.
> >
> > Here is an example I tried with the patch v8-001 :
> >
> > I created following 2 tables on the publisher and the subscriber :
> >
> > CREATE TABLE delete_test(id int primary key, name varchar(100));
> > CREATE TABLE update_test(id int primary key, name varchar(100));
> >
> > I added both the tables to the publication p on the publisher and
> > created a subscription s on the subscriber.
> >
> > I run 2 sessions on the publisher and do the following :
> >
> > Session 1 :
> > BEGIN;
> > INSERT INTO delete_test VALUES(0, 'Nitin');
> >
> > Session 2 :
> > ALTER PUBLICATION p DROP TABLE delete_test;
> >
> > Session 1 :
> > DELETE FROM delete_test WHERE id=0;
> > COMMIT;
> >
> > After the commit there should be no new row created on the publisher.
> > But because the partial data was replicated, this is what the select
> > on the subscriber shows :
> >
> > SELECT * FROM delete_test;
> >  id |   name
> > +---
> >   0 | Nitin
> > (1 row)
> >
> > I don't think the above is a common use case. But this is still an
> > issue because the subscriber has the data which never existed on the
> > publisher.
> >
>
> I don't think that is the correct conclusion because the user has
> intentionally avoided sending part of the transaction changes. This
> can happen in various ways without the patch as well. For example, if
> the user has performed the ALTER in the same transaction.
>
> Publisher:
> =
> BEGIN
> postgres=*# Insert into delete_test values(0, 'Nitin');
> INSERT 0 1
> postgres=*# Alter Publication pub1 drop table delete_test;
> ALTER PUBLICATION
> postgres=*# Delete from delete_test where id=0;
> DELETE 1
> postgres=*# commit;
> COMMIT
> postgres=# select * from delete_test;
>  id | name
> +--
> (0 rows)
>
> Subscriber:
> =
> postgres=# select * from delete_test;
>  id | name
> +---
>   0 | Nitin
> (1 row)
>
> This can also happen when the user has published only 'inserts' but
> not 'updates' or 'deletes'.
>

Thanks for the clarification. I didn't think of this case. The change
seems fine if this can already happen.

Thanks & Regards
Nitin Motiani
Google




Re: First draft of PG 17 release notes

2024-09-10 Thread Alvaro Herrera
On 2024-Sep-10, Jelte Fennema-Nio wrote:

> I think as an extension author there are usually three types of
> changes that are relevant:
>
> 1. New APIs/hooks that are meant for extension authors

> For 1, I think adding them to the release notes makes total sense,
> especially if the new APIs are documented not only in source code, but
> also on the website. Nathan his change is of this type, so I agree
> with him it should be in the release notes.

I agree.  The volume of such items should be pretty small.

> 3. Stuff that changes behaviour of existing APIs code in a
> incompatible but silent way

> For 3, it would be very useful if it would be in the release notes,
> but I think in many cases it's hard to know what commits do this. So
> unless it's obviously going to break a bunch of extensions silently, I
> think we don't have to add such changes to the release notes.

While we cannot be 100% vigilant (and it doesn't seem likely for
automated tools to detect this), we try to avoid API changes that would
still compile but behave incompatibly.  In many review discussions you
can see suggestions to change some function signature so that
third-party authors would be aware that they need to adapt their code to
new behavior, turning cases of (3) into (2).  I agree that these don't
need release notes items.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"XML!" Exclaimed C++.  "What are you doing here? You're not a programming
language."
"Tell that to the people who use me," said XML.
https://burningbird.net/the-parable-of-the-languages/




Re: Special-case executor expression steps for common combinations

2024-09-10 Thread Daniel Gustafsson
> On 22 Jul 2024, at 23:25, Andreas Karlsson  wrote:
> 
> I have bench marked the two patches now and failed to measure any speedup or 
> slowdown from the first patch (removing return) but I think it is a good idea 
> anyway.
> 
> For the second patch (optimize strict) I managed to measure a ~1% speed up 
> for the following query "SELECT sum(x + y + 1) FROM t;" over one million rows.

That's expected; this is mostly about refactoring the code to simplify the
JITed code (and making tiny strides towards JIT expression caching).

> I would say both patches are ready for committer modulo my proposed style 
> fixes.

I am a bit wary about removing the out_error label and goto since it may open
up for reports from static analyzers about control reaching the end of a
non-void function without a return. The other change has been incorporated.

The attached v3 is a rebase to handle executor changes done since v2, with the
above mentioned fix as well.  If there are no objections I think we should
apply this version.
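
For context, a sketch of the kind of micro-benchmark Andreas describes above
(only the query is from the thread; the table setup is an assumption):

CREATE TABLE t AS
    SELECT i AS x, i AS y FROM generate_series(1, 1000000) i;
SELECT sum(x + y + 1) FROM t;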

--
Daniel Gustafsson



v3-0001-Replace-EEOP_DONE-with-special-steps-for-return-n.patch
Description: Binary data


v3-0002-Add-special-case-fast-paths-for-strict-functions.patch
Description: Binary data


Re: Conflict detection for update_deleted in logical replication

2024-09-10 Thread shveta malik
On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, September 10, 2024 2:45 PM shveta malik  
> wrote:
> > > ---
> > > THE DESIGN
> > > ---
> > >
> > > To achieve the above, we plan to allow the logical walsender to
> > > maintain and advance the slot.xmin to protect the data in the user
> > > table and introduce a new logical standby feedback message. This
> > > message reports the WAL position that has been replayed on the logical
> > > standby *AND* the changes occurring on the logical standby before the
> > > WAL position are also replayed to the walsender's node (where the
> > > walsender is running). After receiving the new feedback message, the
> > > walsender will advance the slot.xmin based on the flush info, similar
> > > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > > of logical slot are unused during logical replication, so I think it's 
> > > safe and
> > won't cause side-effect to reuse the xmin for this feature.
> > >
> > > We have introduced a new subscription option
> > > (feedback_slots='slot1,...'), where these slots will be used to check
> > > condition (b): the transactions on logical standbys occurring before
> > > the replay of Node A's DELETE are replayed on Node A as well.
> > > Therefore, on Node B, users should specify the slots corresponding to
> > > Node A in this option. The apply worker will get the oldest confirmed
> > > flush LSN among the specified slots and send the LSN as a feedback
> > message to the walsender. -- I also thought of making it an automaic way, 
> > e.g.
> > > let apply worker select the slots that acquired by the walsenders
> > > which connect to the same remote server(e.g. if apply worker's
> > > connection info or some other flags is same as the walsender's
> > > connection info). But it seems tricky because if some slots are
> > > inactive which means the walsenders are not there, the apply worker
> > > could not find the correct slots to check unless we save the host along 
> > > with
> > the slot's persistence data.
> > >
> > > The new feedback message is sent only if feedback_slots is not NULL.
> > > If the slots in feedback_slots are removed, a final message containing
> > > InvalidXLogRecPtr will be sent to inform the walsender to forget about
> > > the slot.xmin.
> > >
> > > To detect update_deleted conflicts during update operations, if the
> > > target row cannot be found, we perform an additional scan of the table 
> > > using
> > snapshotAny.
> > > This scan aims to locate the most recently deleted row that matches
> > > the old column values from the remote update operation and has not yet
> > > been removed by VACUUM. If any such tuples are found, we report the
> > > update_deleted conflict along with the origin and transaction information
> > that deleted the tuple.
> > >
> > > Please refer to the attached POC patch set which implements above
> > > design. The patch set is split into some parts to make it easier for the 
> > > initial
> > review.
> > > Please note that each patch is interdependent and cannot work
> > independently.
> > >
> > > Thanks a lot to Kuroda-San and Amit for the off-list discussion.
> > >
> > > Suggestions and comments are highly appreciated !
> > >
> >
> > Thank You Hou-San for explaining the design. But to make it easier to
> > understand, would you be able to explain the sequence/timeline of the
> > *new* actions performed by the walsender and the apply processes for the
> > given example along with new feedback_slot config needed
> >
> > Node A: (Procs: walsenderA, applyA)
> >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> >   T2: DELETE FROM t WHERE id = 1;   ts=10.02 AM
> >
> > Node B: (Procs: walsenderB, applyB)
> >   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
>
> Thanks for reviewing! Let me elaborate further on the example:
>
> On node A, feedback_slots should include the logical slot that used to 
> replicate changes
> from Node A to Node B. On node B, feedback_slots should include the logical
> slot that replicate changes from Node B to Node A.
>
> Assume the slot.xmin on Node A has been initialized to a valid number(740) 
> before the
> following flow:
>
> Node A executed T1
>   - 10.00 AM
> T1 replicated and applied on Node B   
>   - 10.0001 AM
> Node B executed T3
>   - 10.01 AM
> Node A executed T2 (741)  
>   - 10.02 AM
> T2 replicated and applied on Node B (delete_missing)  
>   - 10.03 AM

Not related to this feature, but do you mean delete_origin_differ here?

> T3 replicated and applied on Node A (new action, detect update_deleted)   
>   - 10.04 AM
>
> (new action) Apply worker on Node B has confirmed that T2 has been 

Re: Speeding up ruleutils' name de-duplication code, redux

2024-09-10 Thread David Rowley
On Tue, 30 Jul 2024 at 10:14, Tom Lane  wrote:
> -
> for (my $i = 0; $i < 100; $i++)
> {
> print "CREATE TABLE test_inh_check$i (\n";
> for (my $j = 0; $j < 1000; $j++)
> {
> print "a$j float check (a$j > 10.2),\n";
> }
> print "b float);\n";
> print "CREATE TABLE test_inh_check_child$i() 
> INHERITS(test_inh_check$i);\n";
> }
> -
>
> On my development machine, it takes over 14 minutes to pg_upgrade
> this, and it turns out that that time is largely spent in column
> name de-duplication while deparsing the CHECK constraints.  The
> attached patch reduces that to about 3m45s.

I think this is worth doing. Reducing the --schema-only time in
pg_dump is a worthy goal to reduce downtime during upgrades.

I looked at the patch and tried it out. I wondered about the choice of
32 as the cut-off point so decided to benchmark using the attached
script. Here's an extract from the attached results:

Patched with 10 child tables
pg_dump for 16 columns real 0m0.068s
pg_dump for 31 columns real 0m0.080s
pg_dump for 32 columns real 0m0.083s

This gives me what I'd expect to see. I wanted to ensure the point
where you're switching to the hashing method was about the right
place. It seems to be, at least for my test.

The performance looks good too:

10 tables:
master: pg_dump for 1024 columns real   0m23.053s
patched: pg_dump for 1024 columns real   0m1.573s

100 tables:
master: pg_dump for 1024 columns real   3m29.857s
patched: pg_dump for 1024 columns real   0m23.053s

Perhaps you don't think it's worth the additional complexity, but I
see that in both locations you're calling build_colinfo_names_hash(),
it's done just after a call to expand_colnames_array_to(). I wondered
if it was worthwhile unifying both of those functions maybe with a new
name so that you don't need to loop over the always NULL element of
the colnames[] array when building the hash table. This is likely
quite a small overhead compared to the quadratic search you've
removed, so it might not move the needle any. I just wanted to point
it out as I've little else I can find to comment on.

David
tables=100
dbname=postgres


for cols in 2 4 8 16 31 32 64 128 256 512 1024 
do
psql -c "drop table parent cascade;" $dbname &> /dev/null
sql="create table parent ("
for i in $(seq 1 $cols)
do
sql="${sql}col$i int check (col$i > 0)"
if [ $i -lt $cols ]; then
sql="${sql},"
fi
done
sql="${sql});"

psql -c "$sql" $dbname &> /dev/null

for i in $(seq 1 $tables)
do
psql -c "create table child$i () inherits (parent);" $dbname > 
/dev/null
done

echo -n "pg_dump for $cols columns "
{ time pg_dump $dbname > /dev/null; } |& grep real

done

## AMD3990x  With tables=10

master
drowley@amd3990x:~$ ./pgdump_bench.sh
pg_dump for 2 columns real  0m0.106s
pg_dump for 4 columns real  0m0.075s
pg_dump for 8 columns real  0m0.077s
pg_dump for 16 columns real 0m0.083s
pg_dump for 31 columns real 0m0.100s
pg_dump for 32 columns real 0m0.099s
pg_dump for 64 columns real 0m0.130s
pg_dump for 128 columns real0m0.208s
pg_dump for 256 columns real0m0.587s
pg_dump for 512 columns real0m3.227s
pg_dump for 1024 columns real   0m23.053s

patched
drowley@amd3990x:~$ ./pgdump_bench.sh
pg_dump for 2 columns real  0m0.109s
pg_dump for 4 columns real  0m0.054s
pg_dump for 8 columns real  0m0.057s
pg_dump for 16 columns real 0m0.068s
pg_dump for 31 columns real 0m0.080s
pg_dump for 32 columns real 0m0.083s
pg_dump for 64 columns real 0m0.105s
pg_dump for 128 columns real0m0.153s
pg_dump for 256 columns real0m0.272s
pg_dump for 512 columns real0m0.558s
pg_dump for 1024 columns real   0m1.573s


## AMD3990x With tables=100

master
drowley@amd3990x:~$ ./pgdump_bench.sh
pg_dump for 2 columns real  0m0.150s
pg_dump for 4 columns real  0m0.079s
pg_dump for 8 columns real  0m0.095s
pg_dump for 16 columns real 0m0.112s
pg_dump for 31 columns real 0m0.134s
pg_dump for 32 columns real 0m0.136s
pg_dump for 64 columns real 0m0.278s
pg_dump for 128 columns real0m0.817s
pg_dump for 256 columns real0m3.998s
pg_dump for 512 columns real0m27.443s
pg_dump for 1024 columns real   3m29.857s

patched
drowley@amd3990x:~$ ./pgdump_bench.sh
pg_dump for 2 columns real  0m0.106s
pg_dump for 4 columns real  0m0.075s
pg_dump for 8 columns real  0m0.077s
pg_dump for 16 columns real 0m0.083s
pg_dump for 31 columns real 0m0.100s
pg_dump for 32 columns real 0m0.099s
pg_dump for 64 columns real 0m0.130s
pg_dump for 128 columns real0m0.208s
pg_dump for 256 columns real0m0.587s
pg_dump for 512 columns real0m3.227s
pg_dump for 1024 columns real   0m23.053s



Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-10 Thread wenhui qiu
Hi Shayon
   Thank you for your work on this; I think it's great to have this
feature implemented. I checked the documentation of other databases: it
seems both MySQL 8.0 and Oracle support it, while SQL Server needs the
index to be rebuilt after it has been disabled. There, disabling the index
is effectively equivalent to deleting it, except that the index's metadata
is still retained:
https://docs.oracle.com/cd/E17952_01/mysql-8.0-en/invisible-indexes.html
https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-index-transact-sql?view=sql-server-ver16
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/ALTER-INDEX.html

> A disabled index would still receive updates and enforce constraints as
> usual but would not be used for queries. This allows users to assess
> whether an index impacts query performance before deciding to drop it
> entirely.

In MySQL 8.0 and Oracle the setting is "invisible" rather than "disabled",
and the index contents are always kept up to date, so I would suggest
changing the statement to set the index invisible and visible instead.
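
For illustration, the evaluation workflow under the proposed (uncommitted)
syntax might look like the following; the index and table names are just
examples:

ALTER INDEX idx_orders_customer DISABLE;   -- still maintained, ignored by the planner
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;   -- check plans without it
ALTER INDEX idx_orders_customer ENABLE;    -- usable again immediately, no rebuild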



Thanks

David Rowley  于2024年9月10日周二 06:17写道:

> On Tue, 10 Sept 2024 at 09:39, Shayon Mukherjee  wrote:
> > Adding and removing indexes is a common operation in PostgreSQL. On
> larger databases, however, these operations can be resource-intensive. When
> evaluating the performance impact of one or more indexes, dropping them
> might not be ideal since as a user you may want a quicker way to test their
> effects without fully committing to removing & adding them back again.
> Which can be a time taking operation on larger tables.
> >
> > Proposal:
> > I propose adding an ALTER INDEX command that allows for enabling or
> disabling an index globally. This could look something like:
> >
> > ALTER INDEX index_name ENABLE;
> > ALTER INDEX index_name DISABLE;
> >
> > A disabled index would still receive updates and enforce constraints as
> usual but would not be used for queries. This allows users to assess
> whether an index impacts query performance before deciding to drop it
> entirely.
>
> I personally think having some way to alter an index to stop it from
> being used in query plans would be very useful for the reasons you
> mentioned.  I don't have any arguments against the syntax you've
> proposed.  We'd certainly have to clearly document that constraints
> are still enforced. Perhaps there is some other syntax which would
> self-document slightly better. I just can't think of it right now.
>
> > Implementation:
> > To keep this simple, I suggest toggling the indisvalid flag in pg_index
> during the enable/disable operation.
>
> That's not a good idea as it would allow ALTER INDEX ... ENABLE; to be
> used to make valid a failed concurrently created index.  I think this
> would need a new flag and everywhere in the planner would need to be
> adjusted to ignore indexes when that flag is false.
>
> > Additional Considerations:
> > - Keeping the index up-to-date while it’s disabled seems preferable, as
> it avoids the need to rebuild the index if it’s re-enabled later. The
> alternative would be dropping and rebuilding the index upon re-enabling,
> which I believe would introduce additional overhead in terms of application
> logic & complexity.
>
> I think the primary use case here is to assist in dropping useless
> indexes in a way that can very quickly be undone if the index is more
> useful than thought. If you didn't keep the index up-to-date then that
> would make the feature useless for that purpose.
>
> If we get the skip scan feature for PG18, then there's likely going to
> be lots of people with indexes that they might want to consider
> removing after upgrading. Maybe this is a good time to consider this
> feature as it possibly won't ever be more useful than it will be after
> we get skip scans.
>
> David
>
>
>


Re: Jargon and acronyms on this mailing list

2024-09-10 Thread Magnus Hagander
On Mon, Sep 9, 2024 at 7:20 PM Robert Haas  wrote:

> On Mon, Sep 9, 2024 at 1:03 PM Andrew Dunstan  wrote:
> > I guess I could try to write code to migrate everything, but it would be
> > somewhat fragile. And what would we do if we ever decided to migrate
> > "master" to another name like "main"? I do at least have code ready for
> > that eventuality, but it would (currently) still keep the visible name
> > of HEAD.
>
> Personally, I think using HEAD to mean master is really confusing. In
> git, master is a branch name, and HEAD is the tip of some branch, or
> the random commit you've checked out that isn't even a branch. I know
> that's not how it worked in CVS, but CVS was a very long time ago.
>

Yeah, and it gets extra confusing when some of the error messages in git
explicitly talk about HEAD and that HEAD is something completely different
from our terminology.


If we rename master to main or devel or something, we'll have to
> adjust the way we speak again, but that's not a reason to keep using
> the wrong terminology for the way things are now.
>

Agreed in general. But also if we are going to end up making technical
changes to handle it, then if we're ever going to make the change master ->
main (or whatever), it would save work and pain to do the two at the same
time.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-10 Thread Michael Banck
Hi,

On Tue, Sep 10, 2024 at 10:16:34AM +1200, David Rowley wrote:
> On Tue, 10 Sept 2024 at 09:39, Shayon Mukherjee  wrote:
> > Adding and removing indexes is a common operation in PostgreSQL. On
> > larger databases, however, these operations can be
> > resource-intensive. When evaluating the performance impact of one or
> > more indexes, dropping them might not be ideal since as a user you
> > may want a quicker way to test their effects without fully
> > committing to removing & adding them back again. Which can be a time
> > taking operation on larger tables.
> >
> > Proposal:
> > I propose adding an ALTER INDEX command that allows for enabling or 
> > disabling an index globally. This could look something like:
> >
> > ALTER INDEX index_name ENABLE;
> > ALTER INDEX index_name DISABLE;
> >
> > A disabled index would still receive updates and enforce constraints
> > as usual but would not be used for queries. This allows users to
> > assess whether an index impacts query performance before deciding to
> > drop it entirely.
> 
> I personally think having some way to alter an index to stop it from
> being used in query plans would be very useful for the reasons you
> mentioned.  I don't have any arguments against the syntax you've
> proposed.  We'd certainly have to clearly document that constraints
> are still enforced. Perhaps there is some other syntax which would
> self-document slightly better. I just can't think of it right now.
> 
> > Implementation:
> > To keep this simple, I suggest toggling the indisvalid flag in
> > pg_index during the enable/disable operation.
> 
> That's not a good idea as it would allow ALTER INDEX ... ENABLE; to be
> used to make valid a failed concurrently created index.  I think this
> would need a new flag and everywhere in the planner would need to be
> adjusted to ignore indexes when that flag is false.

How about the indislive flag instead? I haven't looked at the code, but
from the documentation ("If false, the index is in process of being
dropped, and should be ignored for all purposes") it sounds like we may
be able to piggy-back on that instead?


Michael




RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 10, 2024 5:56 PM shveta malik  
wrote:
> 
> On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Tuesday, September 10, 2024 2:45 PM shveta malik
>  wrote:
> > >
> > > Thank You Hou-San for explaining the design. But to make it easier
> > > to understand, would you be able to explain the sequence/timeline of
> > > the
> > > *new* actions performed by the walsender and the apply processes for
> > > the given example along with new feedback_slot config needed
> > >
> > > Node A: (Procs: walsenderA, applyA)
> > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > >   T2: DELETE FROM t WHERE id = 1;   ts=10.02 AM
> > >
> > > Node B: (Procs: walsenderB, applyB)
> > >   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
> >
> > Thanks for reviewing! Let me elaborate further on the example:
> >
> > On node A, feedback_slots should include the logical slot that used to
> > replicate changes from Node A to Node B. On node B, feedback_slots
> > should include the logical slot that replicate changes from Node B to Node 
> > A.
> >
> > Assume the slot.xmin on Node A has been initialized to a valid
> > number(740) before the following flow:
> >
> > Node A executed T1  
> > - 10.00 AM
> > T1 replicated and applied on Node B 
> > - 10.0001 AM
> > Node B executed T3  
> > - 10.01 AM
> > Node A executed T2 (741)
> > - 10.02 AM
> > T2 replicated and applied on Node B (delete_missing)
> > - 10.03 AM
> 
> Not related to this feature, but do you mean delete_origin_differ here?

Oh sorry, that was a typo. I meant delete_origin_differ.

> 
> > T3 replicated and applied on Node A (new action, detect
> update_deleted) - 10.04 AM
> >
> > (new action) Apply worker on Node B has confirmed that T2 has been
> > applied locally and the transactions before T2 (e.g., T3) has been
> > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> >= lsn of the local
> > replayed T2), thus send the new feedback message to Node A.
> - 10.05 AM
> >
> > (new action) Walsender on Node A received the message and would
> > advance the slot.xmin.- 10.06 AM
> >
> > Then, after the slot.xmin is advanced to a number greater than 741,
> > the VACUUM would be able to remove the dead tuple on Node A.
> >
> 
> Thanks for the example. Can you please review below and let me know if my
> understanding is correct.
> 
> 1)
> In a bidirectional replication setup, the user has to create slots in a way 
> that
> NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> A's feedback slot. And then only this feature will work well, is it correct 
> to say?

Yes, your understanding is correct.

> 
> 2)
> Now coming back to multiple feedback_slots in a subscription, is the below
> correct:
> 
> Say Node A has publications and subscriptions as follow:
> --
> A_pub1
> 
> A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> 
> 
> Say Node B has publications and subscriptions as follow:
> --
> B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> 
> B_pub1
> B_pub2
> B_pub3
> 
> Then what will be the feedback_slot configuration for all subscriptions of A 
> and
> B? Is below correct:
> --
> A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3

Right. The above configurations are correct.

> 
> 3)
> If the above is true, then do we have a way to make sure that the user  has
> given this configuration exactly the above way? If users end up giving
> feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> validate that? (I have not looked at code yet, just trying to understand 
> design
> first).

The patch doesn't validate whether the feedback slots belong to the correct
subscriptions on the remote server. It only validates that the slot is an
existing, valid, logical slot. I think there are a few challenges to
validating this further. E.g., we need a way to identify which server the
slot is replicating changes to, which could be tricky as the slot currently
doesn't have any info to identify the remote server. Besides, the slot could
be temporarily inactive due to some subscriber-side error, in which case we
cannot verify the subscription that uses it.

> 
> 4)
> Now coming to this:
> 
> > The apply worker will get the oldest
> > confirmed flush LSN among the specified slots and send the LSN as a
> > feedback message to the walsender.
> 
>  There will be one apply worker on B which will be due to B_sub1, so wil

Re: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Amit Kapila
On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
 wrote:
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, 
> to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. 
> If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
>   T1: INSERT INTO t (id, value) VALUES (1,1);
>   T2: DELETE FROM t WHERE id = 1;
>
> Node B:
>   T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 
> had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and 
> apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a 
> new
> logical standby feedback message. This message reports the WAL position that
> has been replayed on the logical standby *AND* the changes occurring on the
> logical standby before the WAL position are also replayed to the walsender's
> node (where the walsender is running). After receiving the new feedback
> message, the walsender will advance the slot.xmin based on the flush info,
> similar to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> of logical slot are unused during logical replication, so I think it's safe 
> and
> won't cause side-effect to reuse the xmin for this feature.
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed 
> on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automaic way, e.g.
> let apply worker select the slots that acquired by the walsenders which 
> connect
> to the same remote server(e.g. if apply worker's connection info or some other
> flags is same as the walsender's connection info). But it seems tricky because
> if some slots are inactive which means the walsenders are not there, the apply
> worker could not find the correct slots to check unless we save the host along
> with the slot's persistence data.
>
> The new feedback message is sent only if feedback_slots is not NULL.
>

Don't you need to deal with versioning stuff for sending this new
message? I mean what if the receiver side of this message is old and
doesn't support this new message.

One minor comment on 0003
===
1.
get_slot_confirmed_flush()
{
...
+ /*
+ * To prevent concurrent slot dropping and creation while filtering the
+ * slots, take the ReplicationSlotControlLock outside of the loop.
+ */
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+ foreach_ptr(String, name, MySubscription->feedback_slots)
+ {
+ XLogRecPtr confirmed_flush;
+ ReplicationSlot *slot;
+
+ slot = ValidateAndGetFeedbackSlot(strVal(name));

Why do we need to validate slots each time here? Isn't it better to do it once?

-- 
With Regards,
Amit Kapila.




Re: Use read streams in pg_visibility

2024-09-10 Thread Nazir Bilal Yavuz
Hi,

On Tue, 10 Sept 2024 at 00:32, Noah Misch  wrote:
>
> Copying the vm file is a good technique.  In my test runs, this does catch the
> "never detect" bug, but it doesn't catch the blkno bug.  Can you make it catch
> both?  It's possible my hand-patching to recreate the blkno bug is what went
> wrong, so I'm attaching what I used.  It consists of
> v1-0002-Use-read-stream-in-pg_visibility-in-collect_corru.patch plus these
> fixes for the "never detect" bug from your v6-0002:

Your patch is correct. I wrongly assumed it would catch the blockno bug;
the attached version catches it. I made blockno = 0 invisible and not
frozen before copying the vm file. So, in the blockno-buggy version, the
callback will skip that block but the main loop in collect_corrupt_items()
will not skip it. I tested it with your patch and there is exactly a
one-blockno difference between the expected and actual output.

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 342bd8ede69942424b6d0568574db52cb574c479 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz 
Date: Tue, 10 Sep 2024 13:15:11 +0300
Subject: [PATCH v7 1/2] Add tests that pg_check_[visible|frozen] report
 corrupted tuples

Currently, there are no tests where pg_check_[visible|frozen] need to
report corrupted tuples. Add such tests.

To add these tests, the visibility map needs to be corrupted. This is done
by overwriting the visibility map with an older state of itself.
---
 contrib/pg_visibility/meson.build |   1 +
 contrib/pg_visibility/t/002_corrupt_vm.pl | 107 ++
 2 files changed, 108 insertions(+)
 create mode 100644 contrib/pg_visibility/t/002_corrupt_vm.pl

diff --git a/contrib/pg_visibility/meson.build b/contrib/pg_visibility/meson.build
index f3c1263313a..609fc5f9f3e 100644
--- a/contrib/pg_visibility/meson.build
+++ b/contrib/pg_visibility/meson.build
@@ -36,6 +36,7 @@ tests += {
   'tap': {
 'tests': [
   't/001_concurrent_transaction.pl',
+  't/002_corrupt_vm.pl',
 ],
   },
 }
diff --git a/contrib/pg_visibility/t/002_corrupt_vm.pl b/contrib/pg_visibility/t/002_corrupt_vm.pl
new file mode 100644
index 000..e57f365576f
--- /dev/null
+++ b/contrib/pg_visibility/t/002_corrupt_vm.pl
@@ -0,0 +1,107 @@
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
# Check that pg_check_visible() and pg_check_frozen() report
# correct TIDs when there is a corruption.
+use strict;
+use warnings FATAL => 'all';
+use File::Copy;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $blck_size = $node->safe_psql("postgres", "SHOW block_size;");
+
+# Create a sample table with at least 10 pages and then run VACUUM. 10 is
+# selected manually as it is big enough to select 5 random tuples from the
+# relation.
+$node->safe_psql(
+	'postgres', qq(
+		CREATE EXTENSION pg_visibility;
+	CREATE TABLE corruption_test
+WITH (autovacuum_enabled = false) AS
+   SELECT
+i,
+repeat('a', 10) AS data
+			FROM
+			generate_series(1, $blck_size) i;
+	VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) corruption_test;
+)
+);
+
+# VACUUM is run, it is safe to get the number of pages from the relpages.
+my $npages = $node->safe_psql(
+	"postgres",
+	"SELECT relpages FROM pg_class
+		WHERE  relname = 'corruption_test';"
+);
+ok($npages >= 10, "There should be at least 10 pages in the table");
+
+my $file = $node->safe_psql("postgres",
+	"SELECT pg_relation_filepath('corruption_test');");
+
+# Delete the first block to make sure that it will be skipped as it is
+# not visible nor frozen.
+$node->safe_psql(
+	"postgres",
+	"DELETE FROM corruption_test
+		WHERE (ctid::text::point)[0] = 0;"
+);
+
+# Copy visibility map.
+$node->stop;
+my $vm_file = $node->data_dir . '/' . $file . '_vm';
+copy("$vm_file", "${vm_file}_temp");
+$node->start;
+
+# Select 5 random tuples that are starting from the second block of the
+# relation. The first block is skipped because it is deleted above.
+my $tuples = $node->safe_psql(
+	"postgres",
+	"SELECT ctid FROM (
+SELECT ctid FROM corruption_test
+			WHERE (ctid::text::point)[0] != 0
+ORDER BY random() LIMIT 5)
+ORDER BY ctid ASC;"
+);
+
+# Do the changes below to use tuples in the query.
+# "\n" -> ","
+# "(" -> "'("
+# ")" -> ")'"
+(my $tuples_query = $tuples) =~ s/\n/,/g;
+$tuples_query =~ s/\(/\'\(/g;
+$tuples_query =~ s/\)/\)\'/g;
+
+$node->safe_psql(
+	"postgres",
+	"DELETE FROM corruption_test
+		WHERE ctid in ($tuples_query);"
+);
+
+# Overwrite visibility map with the old one.
+$node->stop;
+move("${vm_file}_temp", "$vm_file");
+$node->start;
+
+my $result = $node->safe_psql(
+	"postgres",
+	"SELECT DISTINCT t_ctid
+		FROM pg_check_visible('corruption_test')
+		ORDER BY t_ctid ASC;"
+);
+is($result, $tuples, 'pg_check_visible must report tuples as corrupted');
+
+$result = $node->safe_psql(
+	"postgres",
+	"SELECT DISTINCT t_ctid
+		FROM pg_check_frozen('

Re: [PATCH] Fix small overread during SASLprep

2024-09-10 Thread Daniel Gustafsson
> On 9 Sep 2024, at 23:21, Daniel Gustafsson  wrote:
> 
>> On 9 Sep 2024, at 20:41, Jacob Champion  
>> wrote:
>> 
>> On Mon, Sep 9, 2024 at 11:30 AM Daniel Gustafsson  wrote:
>>> Just to make sure I understand, this is for guarding against overreads in
>>> validation of strings containing torn MB characters?
>> 
>> Right. Our SASLprep code doesn't require/enforce UTF8-encoded inputs.
> 
> Thanks for confirming, I'll have another look in the morning and will apply
> then unless there are objections.

Pushed, thanks!

--
Daniel Gustafsson





Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-10 Thread David Rowley
On Tue, 10 Sept 2024 at 22:46, Michael Banck  wrote:
> How about the indislive flag instead? I haven't looked at the code, but
> from the documentation ("If false, the index is in process of being
> dropped, and should be ignored for all purposes") it sounds like we may
> be able to piggy-back on that instead?

Doing that could cause an UPDATE which would ordinarily not be
eligible for a HOT-update to become a HOT-update. That would cause
issues if the index is enabled again as the index wouldn't have been
updated during the UPDATE.

I don't see the big deal with adding a new flag. There's even a free
padding byte to put this flag in after indisreplident, so we don't
have to worry about using more memory.

David




Re: ANALYZE ONLY

2024-09-10 Thread torikoshia

On 2024-09-09 16:56, Michael Harris wrote:

Thanks for updating the patch.
Here is a minor comment.

@@ -944,20 +948,32 @@ expand_vacuum_rel(VacuumRelation *vrel, MemoryContext vac_context,

	MemoryContextSwitchTo(oldcontext);
   }
..
+	 * Unless the user has specified ONLY, make relation list entries for
+	 * its partitions and/or descendant tables.

Regarding the addition of partition descendant tables, should we also
update the comment below on expand_vacuum_rel? Currently it refers only
to partitions:


|  * Given a VacuumRelation, fill in the table OID if it wasn't specified,
|  * and optionally add VacuumRelations for partitions of the table.

Other than this and the following, it looks good to me.

On Mon, Sep 9, 2024 at 10:27 AM David Rowley  
wrote:

Aside from those, that just leaves me with the behavioural change. I
noted Tom was ok with the change in behaviour for ANALYZE (mentioned
in [1]). Tom, wondering if you feel the same for VACUUM too?  If we're
doing this, I think we'd need to be quite clear about it on the
release notes.



--
Regards,

--
Atsushi Torikoshi
NTT DATA Group Corporation




Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-10 Thread Shayon Mukherjee
+1 for the new flag as well, since it'd be nice to be able to
enable/disable indexes without having to worry about the missed updates /
having to rebuild it.
Shayon

On Tue, Sep 10, 2024 at 8:02 AM David Rowley  wrote:

> On Tue, 10 Sept 2024 at 22:46, Michael Banck  wrote:
> > How about the indislive flag instead? I haven't looked at the code, but
> > from the documentation ("If false, the index is in process of being
> > dropped, and should be ignored for all purposes") it sounds like we may
> > be able to piggy-back on that instead?
>
> Doing that could cause an UPDATE which would ordinarily not be
> eligible for a HOT-update to become a HOT-update. That would cause
> issues if the index is enabled again as the index wouldn't have been
> updated during the UPDATE.
>
> I don't see the big deal with adding a new flag. There's even a free
> padding byte to put this flag in after indisreplident, so we don't
> have to worry about using more memory.
>
> David
>


-- 
Kind Regards,
Shayon Mukherjee


Re: Retiring is_pushed_down

2024-09-10 Thread Rafia Sabih
I see the following issue with the latest patch:

relnode.c:2122:32: error: use of undeclared identifier 'ojrelids'
RINFO_IS_PUSHED_DOWN(rinfo, ojrelids,
joinrel->relids))

On Thu, 18 Apr 2024 at 09:34, Richard Guo  wrote:

> Here is another rebase over 3af7040985.  Nothing else has changed.
>
> Thanks
> Richard
>


-- 
Regards,
Rafia Sabih


Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-10 Thread Shayon Mukherjee
Hello,

Thank you for the detailed information and feedback David. Comments inline.

P.S Re-sending it to the mailing list, because I accidentally didn't select
reply-all on the last reply.

On Mon, Sep 9, 2024 at 6:16 PM David Rowley  wrote:

> On Tue, 10 Sept 2024 at 09:39, Shayon Mukherjee  wrote:
> > Adding and removing indexes is a common operation in PostgreSQL. On
> larger databases, however, these operations can be resource-intensive. When
> evaluating the performance impact of one or more indexes, dropping them
> might not be ideal since as a user you may want a quicker way to test their
> effects without fully committing to removing & adding them back again.
> Which can be a time taking operation on larger tables.
> >
> > Proposal:
> > I propose adding an ALTER INDEX command that allows for enabling or
> disabling an index globally. This could look something like:
> >
> > ALTER INDEX index_name ENABLE;
> > ALTER INDEX index_name DISABLE;
> >
> > A disabled index would still receive updates and enforce constraints as
> usual but would not be used for queries. This allows users to assess
> whether an index impacts query performance before deciding to drop it
> entirely.
>
> I personally think having some way to alter an index to stop it from
> being used in query plans would be very useful for the reasons you
> mentioned.  I don't have any arguments against the syntax you've
> proposed.  We'd certainly have to clearly document that constraints
> are still enforced. Perhaps there is some other syntax which would
> self-document slightly better. I just can't think of it right now.
>

Thank you and likewise. I was thinking of piggy backing off of VALID / NOT
VALID, but that might have similar issues (if not more confusion) to the
current proposed syntax. Will be sure to update the documentation.



>
> > Implementation:
> > To keep this simple, I suggest toggling the indisvalid flag in pg_index
> during the enable/disable operation.
>
> That's not a good idea as it would allow ALTER INDEX ... ENABLE; to be
> used to make valid a failed concurrently created index.  I think this
> would need a new flag and everywhere in the planner would need to be
> adjusted to ignore indexes when that flag is false.
>

That is a great call and I wasn't thinking of the semantics with the
existing usage of concurrently created indexes.
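
As a rough illustration of what that might look like (my own sketch, not
from any existing patch): assuming a hypothetical new pg_index flag named
"indisenabled", the planner-side check would live roughly where
get_relation_info() in src/backend/optimizer/util/plancat.c collects the
candidate indexes for a relation:

    /*
     * Hypothetical sketch only: skip disabled indexes so they never get an
     * IndexOptInfo and stay invisible to planning, while index maintenance
     * during DML continues as usual.
     */
    if (!index->indisenabled)
        continue;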


>
> > Additional Considerations:
> > - Keeping the index up-to-date while it’s disabled seems preferable, as
> it avoids the need to rebuild the index if it’s re-enabled later. The
> alternative would be dropping and rebuilding the index upon re-enabling,
> which I believe would introduce additional overhead in terms of application
> logic & complexity.
>
> I think the primary use case here is to assist in dropping useless
> indexes in a way that can very quickly be undone if the index is more
> useful than thought. If you didn't keep the index up-to-date then that
> would make the feature useless for that purpose.
>

+1


>
> If we get the skip scan feature for PG18, then there's likely going to
> be lots of people with indexes that they might want to consider
> removing after upgrading. Maybe this is a good time to consider this
> feature as it possibly won't ever be more useful than it will be after
> we get skip scans.
>
> David
>

Thank you for the feedback again, I will look into the changes required and
accordingly propose a PATCH.

-- 
Kind Regards,
Shayon Mukherjee


Re: not null constraints, again

2024-09-10 Thread Alvaro Herrera
On 2024-Sep-09, jian he wrote:

> bold idea. print out the constraint name: violates not-null constraint \"%s\"
> for the following code:
> ereport(ERROR,
> (errcode(ERRCODE_NOT_NULL_VIOLATION),
>  errmsg("null value in column \"%s\" of
> relation \"%s\" violates not-null constraint",
> NameStr(att->attname),
> RelationGetRelationName(orig_rel)),
>  val_desc ? errdetail("Failing row contains
> %s.", val_desc) : 0,
>  errtablecol(orig_rel, attrChk)));

I gave this a quick run, but I'm not sure it actually improves things
much.  Here's one change from the regression tests.  What do you think?

 INSERT INTO reloptions_test VALUES (1, NULL), (NULL, NULL);
 -ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint
 +ERROR:  null value in column "i" of relation "reloptions_test" violates not-null constraint "reloptions_test_i_not_null"

What do I get from having the constraint name?  It's not like I'm going
to drop the constraint and retry the insert.

Here's the POC-quality patch for this.  It changes a lot of regression
tests, which I don't patch here.  (But that's not the reason for me
thinking that this isn't worth it.)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 29e186fa73d..d84137f4f43 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1907,6 +1907,7 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
  * have been converted from the original input tuple after tuple routing.
  * 'resultRelInfo' is the final result relation, after tuple routing.
  */
+#include "catalog/pg_constraint.h"
 void
 ExecConstraints(ResultRelInfo *resultRelInfo,
TupleTableSlot *slot, EState *estate)
@@ -1932,6 +1933,7 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 			char	   *val_desc;
 			Relation	orig_rel = rel;
 			TupleDesc	orig_tupdesc = RelationGetDescr(rel);
+			char	   *conname;
 
 			/*
 			 * If the tuple has been routed, it's been converted to the
@@ -1970,14 +1972,24 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
 													 tupdesc,
 													 modifiedCols,
 													 64);
+				{
+					HeapTuple	tuple;
+					Form_pg_constraint conForm;
+
+					tuple = findNotNullConstraintAttnum(RelationGetRelid(orig_rel),
+														att->attnum);
+					conForm = (Form_pg_constraint) GETSTRUCT(tuple);
+					conname = pstrdup(NameStr(conForm->conname));
+				}
 
 				ereport(ERROR,
-						(errcode(ERRCODE_NOT_NULL_VIOLATION),
-						 errmsg("null value in column \"%s\" of relation \"%s\" violates not-null constraint",
-								NameStr(att->attname),
-								RelationGetRelationName(orig_rel)),
-						 val_desc ? errdetail("Failing row contains %s.", val_desc) : 0,
-						 errtablecol(orig_rel, attrChk)));
+						errcode(ERRCODE_NOT_NULL_VIOLATION),
+						errmsg("null value in column \"%s\" of relation \"%s\" violates not-null constraint \"%s\"",
+							   NameStr(att->attname),
+							   RelationGetRelationName(orig_rel),
+							   conname),
+						val_desc ? errdetail("Failing row contains %s.", val_desc) : 0,
+						errtablecol(orig_rel, attrChk));
 			}
 		}
 	}
-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/




Re: not null constraints, again

2024-09-10 Thread Alvaro Herrera
On 2024-Sep-02, Tender Wang wrote:

> The attached patch adds  List *nnconstraints, which store the not-null
> definition, in struct CreateStmt.  This makes me a little confused
> about List *constraints in struct CreateStmt. Actually, the List
> constraints store ckeck constraint, and it will be better if the
> comments can reflect that. Re-naming it to List *ckconstraints seems
> more reasonable. But a lot of codes that use stmt->constraints will be
> changed.

Well, if you look at the comment about CreateStmt, there's this:

/* --
 *  Create Table Statement
 *
 * NOTE: in the raw gram.y output, ColumnDef and Constraint nodes are
 * intermixed in tableElts, and constraints and nnconstraints are NIL.  After
 * parse analysis, tableElts contains just ColumnDefs, nnconstraints contains
 * Constraint nodes of CONSTR_NOTNULL type from various sources, and
 * constraints contains just CONSTR_CHECK Constraint nodes.
 * --
 */

So if we were to rename 'constraints' to 'ckconstraints', it would no
longer reflect the fact that not-null ones can be in the former list
initially.

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"No hay ausente sin culpa ni presente sin disculpa" (Prov. francés)




Re: pg_combinebackup --clone doesn't work

2024-09-10 Thread Tomas Vondra
On 8/23/24 14:50, Tomas Vondra wrote:
> 
> On 8/23/24 14:00, Peter Eisentraut wrote:
>> On 30.06.24 20:58, Tomas Vondra wrote:
>>> I've pushed the first three patches, fixing the headers, adding the
>>> --copy option and PG_TEST_PG_COMBINEBACKUP_MODE variable.
>>>
>>> I haven't pushed the CI changes, I'm not sure if there's a clear
>>> consensus on which combination to test. It's something we can tweak
>>> later, I think.
>>>
>>> FWIW I've added the patch to the 2024-07 commitfest, but mostly to get
>>> some CI runs (runs on private fork fail with some macos issues unrelated
>>> to the patch).
>>
>> This last patch is still pending in the commitfest.  Personally, I think
>> it's good to commit as is.
>>
> 
> OK, thanks for reminding me. I'll take care of it after thinking about
> it a bit more.
> 

Took me longer than I expected, but pushed (into master only).


regards

-- 
Tomas Vondra




Re: Jargon and acronyms on this mailing list

2024-09-10 Thread Greg Sabino Mullane
On Tue, Sep 3, 2024 at 11:50 AM Nathan Bossart 
wrote:

> Do you think these acronyms make it difficult for some to contribute to
> Postgres?  I've always felt that they were pretty easy to figure out and a
> nice way to save some typing for common phrases, but I'm not sure it's ever
> really been discussed
>

I do think it raises the bar a bit, especially for non-native-English
speakers. Granted, the learning curve is not super high, and context plus
web searching can usually help people out, but the lists are dense enough
already, so I wanted to help people out. Also, mailing lists in general are
a pretty foreign concept to young developers, and as AFAICT [1], not all
the acronyms have crossed to the texting world.

Cheers,
Greg

[1] See what I did there? [2]

[2] Do people still learn about and use footnotes?


Re: Speeding up ruleutils' name de-duplication code, redux

2024-09-10 Thread Tom Lane
David Rowley  writes:
> On Tue, 30 Jul 2024 at 10:14, Tom Lane  wrote:
>> On my development machine, it takes over 14 minutes to pg_upgrade
>> this, and it turns out that that time is largely spent in column
>> name de-duplication while deparsing the CHECK constraints.  The
>> attached patch reduces that to about 3m45s.

> I looked at the patch and tried it out.

Thanks for looking!

> This gives me what I'd expect to see. I wanted to ensure the point
> where you're switching to the hashing method was about the right
> place. It seems to be, at least for my test.

Yeah, I was just going by gut feel there.  It's good to have some
numbers showing it's not a totally silly choice.

> Perhaps you don't think it's worth the additional complexity, but I
> see that in both locations you're calling build_colinfo_names_hash(),
> it's done just after a call to expand_colnames_array_to(). I wondered
> if it was worthwhile unifying both of those functions maybe with a new
> name so that you don't need to loop over the always NULL element of
> the colnames[] array when building the hash table. This is likely
> quite a small overhead compared to the quadratic search you've
> removed, so it might not move the needle any. I just wanted to point
> it out as I've little else I can find to comment on.

Hmm, but there are quite a few expand_colnames_array_to calls that
are not associated with build_colinfo_names_hash.  On the whole it
feels like those are separate concerns that are better kept separate.

We could accomplish what you suggest by re-ordering the calls so that
we build the hash table before enlarging the array.  0001 attached
is the same as before (modulo line number changes from being rebased
up to HEAD) and then 0002 implements this idea on top.  On the whole
though I find 0002 fairly ugly and would prefer to stick to 0001.
I really doubt that scanning any newly-created column positions is
going to take long enough to justify intertwining things like this.

regards, tom lane

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ee1b7f3dc9..badbf111ee 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -224,6 +224,10 @@ typedef struct
  * of aliases to columns of the right input.  Thus, positions in the printable
  * column alias list are not necessarily one-for-one with varattnos of the
  * JOIN, so we need a separate new_colnames[] array for printing purposes.
+ *
+ * Finally, when dealing with wide tables we risk O(N^2) costs in assigning
+ * non-duplicate column names.  We ameliorate that by using a hash table that
+ * holds all the strings appearing in colnames, new_colnames, and parentUsing.
  */
 typedef struct
 {
@@ -291,6 +295,15 @@ typedef struct
 	int		   *leftattnos;		/* left-child varattnos of join cols, or 0 */
 	int		   *rightattnos;	/* right-child varattnos of join cols, or 0 */
 	List	   *usingNames;		/* names assigned to merged columns */
+
+	/*
+	 * Hash table holding copies of all the strings appearing in this struct's
+	 * colnames, new_colnames, and parentUsing.  We use a hash table only for
+	 * sufficiently wide relations, and only during the colname-assignment
+	 * functions set_relation_column_names and set_join_column_names;
+	 * otherwise, names_hash is NULL.
+	 */
+	HTAB	   *names_hash;		/* entries are just strings */
 } deparse_columns;
 
 /* This macro is analogous to rt_fetch(), but for deparse_columns structs */
@@ -376,6 +389,9 @@ static bool colname_is_unique(const char *colname, deparse_namespace *dpns,
 static char *make_colname_unique(char *colname, deparse_namespace *dpns,
  deparse_columns *colinfo);
 static void expand_colnames_array_to(deparse_columns *colinfo, int n);
+static void build_colinfo_names_hash(deparse_columns *colinfo);
+static void add_to_names_hash(deparse_columns *colinfo, const char *name);
+static void destroy_colinfo_names_hash(deparse_columns *colinfo);
 static void identify_join_columns(JoinExpr *j, RangeTblEntry *jrte,
   deparse_columns *colinfo);
 static char *get_rtable_name(int rtindex, deparse_context *context);
@@ -4133,6 +4149,10 @@ has_dangerous_join_using(deparse_namespace *dpns, Node *jtnode)
  *
  * parentUsing is a list of all USING aliases assigned in parent joins of
  * the current jointree node.  (The passed-in list must not be modified.)
+ *
+ * Note that we do not use per-deparse_columns hash tables in this function.
+ * The number of names that need to be assigned should be small enough that
+ * we don't need to trouble with that.
  */
 static void
 set_using_names(deparse_namespace *dpns, Node *jtnode, List *parentUsing)
@@ -4408,6 +4428,9 @@ set_relation_column_names(deparse_namespace *dpns, RangeTblEntry *rte,
 	colinfo->new_colnames = (char **) palloc(ncolumns * sizeof(char *));
 	colinfo->is_new_col = (bool *) palloc(ncolumns * sizeof(bool));
 
+	/* If the RTE is wide enough, use a hash table to avoid O(N^2) costs *

Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-10 Thread Nathan Bossart
On Tue, Sep 10, 2024 at 10:16:34AM +1200, David Rowley wrote:
> I think the primary use case here is to assist in dropping useless
> indexes in a way that can very quickly be undone if the index is more
> useful than thought. If you didn't keep the index up-to-date then that
> would make the feature useless for that purpose.
> 
> If we get the skip scan feature for PG18, then there's likely going to
> be lots of people with indexes that they might want to consider
> removing after upgrading. Maybe this is a good time to consider this
> feature as it possibly won't ever be more useful than it will be after
> we get skip scans.

+1, this is something I've wanted for some time.  There was some past
discussion, too [0].

[0] 
https://postgr.es/m/flat/ed8c9ed7-bb5d-aaec-065b-ad4893645deb%402ndQuadrant.com

-- 
nathan




Re: [PATCH] Fix small overread during SASLprep

2024-09-10 Thread Jacob Champion
On Tue, Sep 10, 2024 at 4:39 AM Daniel Gustafsson  wrote:
> Pushed, thanks!

Thank you!

--Jacob




Re: Add contrib/pg_logicalsnapinspect

2024-09-10 Thread Bertrand Drouvot
Hi,

On Mon, Sep 09, 2024 at 04:24:09PM +0530, Amit Kapila wrote:
> On Fri, Aug 30, 2024 at 5:18 PM Bertrand Drouvot
>  wrote:
> > as we decided not to expose the SnapBuildOnDisk and SnapBuild structs to 
> > public
> > and to create/expose 2 new functions in snapbuild.c then the functions in 
> > the
> > module would do nothing but expose the data coming from the snapbuild.c's
> > functions (get the tuple and send it to the client). That sounds weird to 
> > me to
> > create a module that would "only" do so, that's why I thought that in core
> > functions taking care of everything make more sense.
> >
> 
> I see your point. Does anyone else have an opinion on the need for
> these functions and whether to expose them from a contrib module or
> have them as core functions?

I looked at when the SNAPBUILD_VERSION has been changed:

ec5896aed3 (2014)
a975ff4980 (2021)
8bdb1332eb (2021)
7f13ac8123 (2022)
bb19b70081 (2024)

So it's not like we are changing the SnapBuildOnDisk or SnapBuild structs that
frequently. Furthermore, those structs are serialized and so we have to preserve
their on-disk compatibility (means we can change them only in a major release
if we need to).

So, I think it would not be that much of an issue to expose those structs and
create a new contrib module (as v1 did propose) instead of adding new in-core
functions.

If we want to insist that external modules "should" not rely on those structs 
then
we could put them into a new internal_snapbuild.h file (instead of snapbuild.h
as proposed in v1).

In the end, I think that creating a contrib module and exposing those structs in
internal_snapbuild.h makes more sense (as compared to in-core functions).

Thoughts?

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Converting README documentation to Markdown

2024-09-10 Thread Tom Lane
Daniel Gustafsson  writes:
> Since there doesn't seem to be much interest in going all the way to Markdown,
> the attached 0001 is just the formatting changes for achieving (to some 
> degree)
> consistency among the README's.  This mostly boils down to using a consistent
> amount of whitespace around code, using the same indentation on bullet lists
> and starting sections the same way.  Inspecting the patch with git diff -w
> reveals that it's not much left once whitespace is ignored.  There might be a
> few markdown hunks left which I'll hunt down in case anyone is interested in
> this.

> As an added bonus this still makes most READMEs render nicely as Markdown, 
> just
> not automatically on Github as it doesn't know the filetype.

I did not inspect the patch in detail, but this approach seems
like a reasonable compromise.  However, if we're not officially
going to Markdown, how likely is it that these files will
stay valid in future edits?  I suspect most of us don't have
those syntax rules wired into our fingers (I sure don't).

regards, tom lane




Re: Psql meta-command conninfo+

2024-09-10 Thread Jim Jones



On 10.09.24 06:32, Hunaid Sohail wrote:
>
> I have attached a rebased patch.

Thanks.

Is \conninfo+ no longer supposed to return the results in tabular form?
At least it wasn't the case till v28.

$ /usr/local/postgres-dev/bin/psql -d postgres -h 0 -c "\conninfo+"
You are connected to database "postgres" as user "jim" on host "0"
(address "0.0.0.0") at port "5432".
Protocol Version: 3
SSL Connection: no
GSSAPI Authenticated: no
Client Encoding: UTF8
Server Encoding: UTF8
Session User: jim
Backend PID: 579041

$ /usr/local/postgres-dev/bin/psql -d postgres -h 127.0.0.1 -c "\conninfo+"
You are connected to database "postgres" as user "jim" on host
"127.0.0.1" at port "5432".
Protocol Version: 3
SSL Connection: no
GSSAPI Authenticated: no
Client Encoding: UTF8
Server Encoding: UTF8
Session User: jim
Backend PID: 579087

Sorry if I missed that in the thread.

v31 has a couple of small indentation problems:

/home/jim/patches/conninfo/v31-0001-Add-psql-meta-command-conninfo-plus.patch:87:
indent with spaces.
    show_verbose = strchr(cmd, '+') ? true : false;
/home/jim/patches/conninfo/v31-0001-Add-psql-meta-command-conninfo-plus.patch:106:
trailing whitespace.

Checking patch doc/src/sgml/ref/psql-ref.sgml...
Checking patch src/bin/psql/command.c...
Checking patch src/bin/psql/help.c...
Applied patch doc/src/sgml/ref/psql-ref.sgml cleanly.
Applied patch src/bin/psql/command.c cleanly.
Applied patch src/bin/psql/help.c cleanly.
warning: 2 lines add whitespace errors.


-- 
Jim





Re: Psql meta-command conninfo+

2024-09-10 Thread Alvaro Herrera
On 2024-Sep-10, Jim Jones wrote:

> Is \conninfo+ no longer supposed to return the results in tabular form?
> At least it wasn't the case till v28.

I suspect the reason it's no longer a table is that it was previously a
query (which is easily printed as a table by calling printQuery) and now
it's just a client-side thing, and Hunaid didn't know how to handle that
as a table.  The good news is, it should be really easy to do
printTableInit(), then a bunch of printTableAddHeader() and
printTableAddCell(), end with printTable().  I think the tabular format
is better for sure.
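
Roughly along these lines (just a sketch of the table-printing API from
fe_utils/print.h and psql's global "pset"; the two rows shown are made-up
examples, not the full field list):

    printTableContent cont;
    printTableOpt myopt = pset.popt.topt;

    printTableInit(&cont, &myopt, "Connection Information", 2, 2);

    printTableAddHeader(&cont, "Parameter", true, 'l');
    printTableAddHeader(&cont, "Value", true, 'l');

    /* one printTableAddCell() call per column, row by row */
    printTableAddCell(&cont, "Database", false, false);
    printTableAddCell(&cont, PQdb(pset.db), false, false);
    printTableAddCell(&cont, "Authenticated user", false, false);
    printTableAddCell(&cont, PQuser(pset.db), false, false);

    printTable(&cont, pset.queryFout, false, pset.logfile);
    printTableCleanup(&cont);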

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)




Re: Latches vs lwlock contention

2024-09-10 Thread Maxim Orlov
I looked at the patch set and found it quite useful.

The first 7 patches are just refactoring and may be committed separately if
needed.
There were minor problems: patch #5 didn't apply cleanly, and #8 complained
about partitionLock being unused if we build without asserts. So, I added a
PG_USED_FOR_ASSERTS_ONLY marker to solve the latter issue.
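
For reference, the marker goes on the declaration, roughly like this (just an
illustration of the pattern, not the actual hunk from the patch):

    LWLock     *partitionLock PG_USED_FOR_ASSERTS_ONLY;

    partitionLock = LockHashPartitionLock(hashcode);
    Assert(LWLockHeldByMeInMode(partitionLock, LW_EXCLUSIVE));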

Again, the overall patch set looks good and seems useful to me. Here is the
rebased v5 version, based on Heikki's patch set above.

-- 
Best regards,
Maxim Orlov.


v5-0004-Move-TRACE-calls-into-WaitOnLock.patch
Description: Binary data


v5-0001-Remove-LOCK_PRINT-call-that-could-point-to-garbag.patch
Description: Binary data


v5-0003-Set-MyProc-heldLocks-in-ProcSleep.patch
Description: Binary data


v5-0005-Remove-redundant-lockAwaited-global-variable.patch
Description: Binary data


v5-0002-Fix-comment-in-LockReleaseAll-on-when-locallock-n.patch
Description: Binary data


v5-0007-Release-partition-lock-a-little-earlier.patch
Description: Binary data


v5-0006-Update-local-lock-table-in-ProcSleep-s-caller.patch
Description: Binary data


v5-0008-Split-ProcSleep-function-into-JoinWaitQueue-and-P.patch
Description: Binary data


Re: First draft of PG 17 release notes

2024-09-10 Thread Masahiko Sawada
On Mon, Sep 9, 2024 at 11:29 PM Jelte Fennema-Nio  wrote:
>
> On Tue, 10 Sept 2024 at 04:47, Bruce Momjian  wrote:
> > Yes.  There are so many changes at the source code level it is unwise to
> > try and get them into the main release notes.  If someone wants to
> > create an addendum, like was suggested for pure performance
> > improvements, that would make sense.
>
> I agree that the release notes cannot fit every change. But I also
> don't think any extension author reads the complete git commit log
> every release, so taking the stance that they should be seems
> unhelpful. And the "Source Code" section does exist so at some level
> you seem to disagree with that too. So what is the way to decide that
> something makes the cut for the "Source Code" section?
>
> I think as an extension author there are usually three types of
> changes that are relevant:
> 1. New APIs/hooks that are meant for extension authors
> 2. Stuff that causes my existing code to not compile anymore
> 3. Stuff that changes the behaviour of existing API code in an
> incompatible but silent way
>
> For 1, I think adding them to the release notes makes total sense,
> especially if the new APIs are documented not only in source code, but
> also on the website. Nathan's change is of this type, so I agree
> with him it should be in the release notes.

+1. I think that the incremental JSON parser that is already mentioned
in the release notes would fall into this type too; it's not a feature
aimed just at extension authors, but it's more of a source-code and
internal change IMO. Since the DSM registry feature is described in the
docs, I think it would make sense to have it in the release notes,
probably with a link to the "Requesting Shared Memory After Startup" section.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: Refactoring postmaster's code to cleanup after child exit

2024-09-10 Thread Andres Freund
Hi,

On 2024-09-06 16:13:43 +0300, Heikki Linnakangas wrote:
> On 04/09/2024 17:35, Andres Freund wrote:
> > On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
> > >  From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
> > > From: Heikki Linnakangas 
> > > Date: Mon, 12 Aug 2024 10:59:04 +0300
> > > Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end 
> > > children
> > >
> > > And replace postmaster.c's own "backend type" codes with BackendType
> >
> > Hm - it seems a bit odd to open-code this when we actually have a "table
> > driven configuration" available?  Why isn't the type a field in
> > child_process_kind?
>
> Sorry, I didn't understand this. What exactly would you add to
> child_process_kind? Where would you use it?

I'm not entirely sure what I was thinking of. It might be partially triggering
a prior complaint I had about manually assigning things to MyBackendType,
despite actually having all the information already.

One thing that I just noticed is that this patch orphans comment references to
BACKEND_TYPE_AUTOVAC and BACKEND_TYPE_BGWORKER.

Seems a tad odd to have BACKEND_TYPE_ALL after removing everything else from
the BACKEND_TYPE_* "namespace".


To deal with the issue around bitmasks you had mentioned, I think we should at
least have a static inline function to convert B_* values to the bitmask
index.
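
Something like this, just to illustrate what I mean (the name is made up):

    static inline uint32
    btype_mask(BackendType btype)
    {
        return (uint32) 1 << btype;
    }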


> > > +/*
> > > + * MaxLivePostmasterChildren
> > > + *
> > > + * This reports the number of postmaster child processes that can be active.
> > > + * It includes all children except for dead_end children.  This allows the
> > > + * array in shared memory (PMChildFlags) to have a fixed maximum size.
> > > + */
> > > +int
> > > +MaxLivePostmasterChildren(void)
> > > +{
> > > +	int			n = 0;
> > > +
> > > +	/* We know exactly how many worker and aux processes can be active */
> > > +	n += autovacuum_max_workers;
> > > +	n += max_worker_processes;
> > > +	n += NUM_AUXILIARY_PROCS;
> > > +
> > > +	/*
> > > +	 * We allow more connections here than we can have backends because some
> > > +	 * might still be authenticating; they might fail auth, or some existing
> > > +	 * backend might exit before the auth cycle is completed.  The exact
> > > +	 * MaxBackends limit is enforced when a new backend tries to join the
> > > +	 * shared-inval backend array.
> > > +	 */
> > > +	n += 2 * (MaxConnections + max_wal_senders);
> > > +
> > > +	return n;
> > > +}
> >
> > I wonder if we could instead maintain at least some of this in
> > child_process_kinds? Manually listing different types of processes in
> > different places doesn't seem particularly sustainable.
>
> Hmm, you mean adding "max this kind of children" field to
> child_process_kinds? Perhaps.

Yep, that's what I meant.



> > > +/* Return the appropriate free-list for the given backend type */
> > > +static dlist_head *
> > > +GetFreeList(BackendType btype)
> > > +{
> > > + switch (btype)
> > > + {
> > > + case B_BACKEND:
> > > + case B_BG_WORKER:
> > > + case B_WAL_SENDER:
> > > + case B_SLOTSYNC_WORKER:
> > > + return &freeBackendList;
> >
> > Maybe a daft question - but why are all of these in the same list? Sure,
> > they're essentially all full backends, but they're covered by different 
> > GUCs?
>
> No reason. No particular reason they should *not* share the same list either
> though.

Aren't they controlled by distinct connection limits? Isn't there a danger
that we could use up entries and fail connections due to that, despite not
actually being above the limit?


> > Tangential: Why do we need a freelist for these and why do we choose a 
> > random
> > pgproc for these instead of assigning one statically?
> >
> > Background: I'd like to not provide AIO workers with "bounce buffers" (for 
> > IO
> > of buffers that can't be done in-place, like writes when checksums are
> > enabled). The varying proc numbers make that harder than it'd have to be...
>
> Yeah, we can make these fixed.

Cool.


> Currently, the # of slots reserved for aux processes is sized by
> NUM_AUXILIARY_PROCS, which is one smaller than the number of different aux
> proces kinds:

> > /*
> >  * We set aside some extra PGPROC structures for auxiliary processes,
> >  * ie things that aren't full-fledged backends but need shmem access.
> >  *
> >  * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
> >  * run during normal operation.  Startup process and WAL receiver also 
> > consume
> >  * 2 slots, but WAL writer is launched only after startup has exited, so we
> >  * only need 6 slots.
> >  */
> > #define NUM_AUXILIARY_PROCS 6
>
> For PMChildSlot numbers, we could certainly just allocate one more slot.
>
> It would probably make sense for PGPROCs too, even though PGPROC is a much
> larger struct.

I don't think it's worth worrying about that much. PGPROC is large, but not
*that* large. And the robustness win of properly 

Re: Use streaming read API in ANALYZE

2024-09-10 Thread Mats Kindahl
On Thu, Sep 5, 2024 at 11:12 AM Thomas Munro  wrote:

> On Thu, Sep 5, 2024 at 6:45 PM Mats Kindahl  wrote:
> > Forgive me for asking, but I am not entirely sure why the ReadStream
> struct is opaque. The usual reasons are:
> >
> > You want to provide an ABI to allow extensions to work with new major
> versions without re-compiling. Right now it is necessary to recompile
> extensions anyway, this does not seem to apply. (Because there are a lot of
> other changes that you need when switching versions because of the lack of
> a stable ABI for other parts of the code. However, it might be that the
> goal is to support it eventually, and then it would make sense to start
> making structs opaque.)
> > You want to ensure that you can make modifications inside a major
> version without breaking ABIs and requiring a re-compile. In this case, you
> could still follow safe practice of adding new fields last, not relying on
> the size of the struct for anything (e.g., no arrays of these structures,
> just pointers to them), etc. However, if you want to be very safe and
> support very drastic changes inside a major version, it needs to be opaque,
> so this could be the reason.
> >
> > Is it either of these reasons, or is there another reason?
> >
> > Making the ReadStream API non-opaque (that is, moving the definition to
> the header file) would at least solve our problem (unless I am mistaken).
> However, I am ignorant about long-term plans which might affect this, so
> there might be a good reason to revert it for reasons I am not aware of.
>
> The second thing.  Also there are very active plans[1] to change the
> internal design of ReadStream in 18, since the goal is to drive true
> asynchronous I/O, and the idea of ReadStream was to create a simple
> API to let many consumers start using it, so that we can drive
> efficient modern system interfaces below that API, so having people
> depending on how it works would not be great.
>

That is understandable, since you usually do not want to have to re-compile
the extension for different minor versions. However, extensions meddling with
this would be a rare case, so it might not turn out to be a big problem in
reality, as long as it is very clear to all involved that this might change
and that an effort is made to avoid the binary incompatibility that comes
from removing fields or changing their types.


> But let's talk about how that would actually look, for example if we
> exposed the struct or you took a photocopy of it...  I think your idea
> must be something like: if you could access struct ReadStream's
> internals, you could replace stream->callback with an interceptor
> callback, and if the BlockSampler had been given the fake N + M
> relation size, the interceptor could overwrite
> stream->ios[next_io_index].op.smgr and return x - N if the intercepted
> callback returned x >= N.  (Small detail: need to check
> stream->fast_path and use 0 instead or something like that, but maybe
> we could change that.)


Yes, this is what I had in mind, but I did not dig too deeply into the code.


> One minor problem that jumps out is that
> read_stream.c could inappropriately merge blocks from the two
> relations into one I/O.  Hmm, I guess you'd have to teach the
> interceptor not to allow that: if switching between the two relation,
> and if the block number would coincide with
> stream->pending_read_blocknum + stream->pending_read_nblocks, it would
> need to pick a new block instead (interfering with the block sampling
> algorithm, but only very rarely).  Is this what you had in mind, or
> something else?
>

Hmmm... I didn't look too closely at this. Since the block number comes
from the callback, I guess we could make sure to have a "padding" block
between the regions so that we "break" any run of blocks, which I think
is what you mean by "teach the interceptor not to allow that", but I
would have to write a patch to make sure.


>
> (BTW I have a patch to teach read_stream.c about multi-smgr-relation
> streams, by adding a different constructor with a different callback
> that returns smgr, fork, block instead of just the block, but it
> didn't make it into 17.)
>

Without having looked at the patch, this sounds like the correct way to do
it.


>
> [1]
> https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah@brqs62irg4dt
>


-- 
Best wishes,
Mats Kindahl, Timescale


Re: pg_verifybackup: TAR format backup verification

2024-09-10 Thread Robert Haas
+   pg_log_error("pg_waldump cannot read from a tar");

"tar" isn't normally used as a noun as you do here, so I think this
should say "pg_waldump cannot read tar files".

Technically, the position of this check could lead to an unnecessary
failure, if -n wasn't specified but pg_wal.tar(.whatever) also doesn't
exist on disk. But I think it's OK to ignore that case.

However, I also notice this change to the TAP tests in a few places:

-   [ 'pg_verifybackup', $tempdir ],
+   [ 'pg_verifybackup', '-Fp', $tempdir ],

It's not the end of the world to have to make a change like this, but
it seems easy to do better. Just postpone the call to
find_backup_format() until right after we call parse_manifest_file().
That also means postponing the check mentioned above until right after
that, but that's fine: after parse_manifest_file() and then
find_backup_format(), you can do if (!no_parse_wal && context.format
== 't') { bail out }.

+   if (stat(path, &sb) == 0)
+   result = 'p';
+   else
+   {
+   if (errno != ENOENT)
+   {
+   pg_log_error("cannot determine backup format : %m");
+   pg_log_error_hint("Try \"%s --help\" for more
information.", progname);
+   exit(1);
+   }
+
+   /* Otherwise, it is assumed to be a tar format backup. */
+   result = 't';
+   }

This doesn't look good, for a few reasons:

1. It would be clearer to structure this as if (stat(...) == 0) result
= 'p'; else if (errno == ENOENT) result = 't'; else { report an error;
} instead of the way you have it.

2. "cannot determine backup format" is not an appropriate way to
report the failure of stat(). The appropriate message is "could not
stat file \"%s\"".

3. It is almost never correct to put a space before a colon in an error message.

4. The hint doesn't look helpful, or necessary. I think you can just
delete that.

Regarding both point #2 and point #4, I think we should ask ourselves
how stat() could realistically fail here. On my system (macOS), the
document failed modes for stat() are EACCES (i.e. permission denied),
EFAULT (i.e. we have a bug in pg_verifybackup), EIO (I/O Error), ELOOP
(symlink loop), ENAMETOOLONG, ENOENT, ENOTDIR, and EOVERFLOW. In none
of those cases does it seem likely that specifying the format manually
will help anything. Thus, suggesting that the user look at the help,
presumably to find --format, is unlikely to solve anything, and
telling them that the error happened while trying to determine the
backup format isn't really helping anything, either. What the user
needs to know is that it was stat() that failed, and the pathname for
which it failed. Then they need to sort out whatever problem is
causing them to get one of those really weird errors.
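
To spell out the structure I have in mind for point #1 (a sketch only; the
exact error wording is up for grabs):

    if (stat(path, &sb) == 0)
        result = 'p';
    else if (errno == ENOENT)
        result = 't';   /* path does not exist, assume a tar-format backup */
    else
        pg_fatal("could not stat file \"%s\": %m", path);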

Aside from the above, 0010 looks pretty reasonable, although I'll
probably want to do some wordsmithing on some of the comments at some
point.

-   of the backup.  The backup must be stored in the "plain"
-   format; a "tar" format backup can be checked after extracting it.
+   of the backup.  The backup must be stored in the "plain" or "tar"
+   format.  Verification is supported for gzip,
+   lz4, and  zstd compressed tar backup;
+   any other compressed format backups can be checked after decompressing them.

I don't think that we need to say that the backup must be stored in
the plain or tar format, because those are the only backup formats
pg_basebackup knows about. Similarly, it doesn't seem help to me to
enumerate all the compression algorithms that pg_basebackup supports
and say we only support those; what else would a user expect?

What I would do is replace the original sentence ("The backup must be
stored...") with: The backup may be stored either in the "plain" or
the "tar" format; this includes "tar" backups compressed with any
algorithm supported by pg_basebackup. However, at present, WAL
verification is supported only for plain-format backups. Therefore, if
the backup is stored in "tar" format, the -n,
--no-parse-wal option should be used.

+   # Add tar backup format option
+   push @backup, ('-F', 't');
+   # Add switch to skip WAL verification.
+   push @verify, ('-n');

Say why, not what. The second comment should say something like "WAL
verification not yet supported for tar-format backups".

+   "$format backup fails with algorithm \"$algorithm\"");
+   $primary->command_ok(\@backup, "$format backup ok with
algorithm \"$algorithm\"");
+   ok(-f "$backup_path/backup_manifest", "$format backup
manifest exists");
+   "verify $format backup with algorithm \"$algorithm\"");

Personally I would change "$format" to "$format format" in all of
these places, so that we talk about a "tar format backup" or a "plain
format backup" instead of a "tar backup" or a "plain backup".

+   's

Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 02:29:57PM +0900, Michael Paquier wrote:
> You are adding twelve event points with only 5
> new wait names.  Couldn't it be better to have a one-one mapping
> instead, adding twelve entries in wait_event_names.txt?

No, I think the patch's level of detail is better.  One shouldn't expect the
two ldap_simple_bind_s() calls to have different-enough performance
characteristics to justify exposing that level of detail to the DBA.
ldap_search_s() and InitializeLDAPConnection() differ more, but the DBA mostly
just needs to know the scale of their LDAP responsiveness problem.

(Someday, it might be good to expose the file:line and/or backtrace associated
with a wait, like we do for ereport().  As a way to satisfy rare needs for
more detail, I'd prefer that over giving every pgstat_report_wait_start() a
different name.)
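
To be concrete, each instrumented call site just wraps the blocking call,
along these lines (a sketch; WAIT_EVENT_LDAP_BIND is a made-up name, not
necessarily what the patch uses, and ldap/fulluser/passwd stand in for the
surrounding auth.c variables):

    pgstat_report_wait_start(WAIT_EVENT_LDAP_BIND);
    r = ldap_simple_bind_s(ldap, fulluser, passwd);
    pgstat_report_wait_end();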




Re: Refactoring postmaster's code to cleanup after child exit

2024-09-10 Thread Robert Haas
On Tue, Sep 10, 2024 at 12:59 PM Andres Freund  wrote:
> I still think that we'd be better off to just return an error to the client in
> postmaster, rather than deal with this dead-end children mess. That was
> perhaps justified at some point, but now it seems to add way more complexity
> than it's worth. And it's absurdly expensive to fork to return an error. Way
> more expensive than just having postmaster send an error and close the socket.

The tricky case is the one where the client write() -- or SSL_write() -- blocks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [PATCH] Add CANONICAL option to xmlserialize

2024-09-10 Thread Tom Lane
Jim Jones  writes:
> This patch introduces the CANONICAL option to xmlserialize, which
> serializes xml documents in their canonical form - as described in
> the W3C Canonical XML Version 1.1 specification. This option can
> be used with the additional parameter WITH [NO] COMMENTS to keep
> or remove xml comments from the canonical xml output.

While I don't object to providing this functionality in some form,
I think that doing it with this specific syntax is a seriously
bad idea.  I think there's significant risk that at some point
the SQL committee will either standardize this syntax with a
somewhat different meaning or standardize some other syntax for
the same functionality.

How about instead introducing a plain function along the lines of
"xml_canonicalize(xml, bool keep_comments) returns text" ?  The SQL
committee will certainly never do that, but we won't regret having
created a plain function whenever they get around to doing something
in the same space.

regards, tom lane




Re: Refactoring postmaster's code to cleanup after child exit

2024-09-10 Thread Andres Freund
Hi,

On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
> @@ -2864,6 +2777,8 @@ PostmasterStateMachine(void)
>  	 */
>  	if (pmState == PM_STOP_BACKENDS)
>  	{
> +		uint32		targetMask;
> +
>  		/*
>  		 * Forget any pending requests for background workers, since we're no
>  		 * longer willing to launch any new workers.  (If additional requests
> @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
>  		 */
>  		ForgetUnstartedBackgroundWorkers();
>  
> -		/* Signal all backend children except walsenders and dead-end backends */
> -		SignalSomeChildren(SIGTERM,
> -						   BACKEND_TYPE_ALL & ~(1 << B_WAL_SENDER | 1 << B_DEAD_END_BACKEND));
> +		/* Signal all backend children except walsenders */
> +		/* dead-end children are not signalled yet */
> +		targetMask = (1 << B_BACKEND);
> +		targetMask |= (1 << B_BG_WORKER);
> +
>  		/* and the autovac launcher too */
> -		if (AutoVacPID != 0)
> -			signal_child(AutoVacPID, SIGTERM);
> +		targetMask |= (1 << B_AUTOVAC_LAUNCHER);
>  		/* and the bgwriter too */
> -		if (BgWriterPID != 0)
> -			signal_child(BgWriterPID, SIGTERM);
> +		targetMask |= (1 << B_BG_WRITER);
>  		/* and the walwriter too */
> -		if (WalWriterPID != 0)
> -			signal_child(WalWriterPID, SIGTERM);
> +		targetMask |= (1 << B_WAL_WRITER);
>  		/* If we're in recovery, also stop startup and walreceiver procs */
> -		if (StartupPID != 0)
> -			signal_child(StartupPID, SIGTERM);
> -		if (WalReceiverPID != 0)
> -			signal_child(WalReceiverPID, SIGTERM);
> -		if (WalSummarizerPID != 0)
> -			signal_child(WalSummarizerPID, SIGTERM);
> -		if (SlotSyncWorkerPID != 0)
> -			signal_child(SlotSyncWorkerPID, SIGTERM);
> +		targetMask |= (1 << B_STARTUP);
> +		targetMask |= (1 << B_WAL_RECEIVER);
> +
> +		targetMask |= (1 << B_WAL_SUMMARIZER);
> +		targetMask |= (1 << B_SLOTSYNC_WORKER);
>  		/* checkpointer, archiver, stats, and syslogger may continue for now */
>  
> +		SignalSomeChildren(SIGTERM, targetMask);
> +
>  		/* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
>  		pmState = PM_WAIT_BACKENDS;
>  	}

I think this might now omit shutting down at least autovac workers, which
afaict previously were included.

Greetings,

Andres Freund




Re: Disallow altering invalidated replication slots

2024-09-10 Thread Bharath Rupireddy
Hi,

Thanks for reviewing.

On Tue, Sep 10, 2024 at 8:40 AM Peter Smith  wrote:
>
> Commit message
>
> 1.
> ALTER_REPLICATION_SLOT on invalidated replication slots is unnecessary
> as there is no way...
>
> suggestion:
> ALTER_REPLICATION_SLOT for invalid replication slots should not be
> allowed because there is no way...

Modified.

> ==
> 2. Missing docs update
>
> Should this docs page [1] be updated to say ALTER_REPLICATION_SLOT is
> not allowed for invalid slots?

I haven't noticed that being done for other ERROR cases in the docs, e.g.
slots being synced or temporary slots. I'm not sure it's worth adding every
ERROR case to the docs.

> ==
> src/backend/replication/slot.c
>
> 3.
> + if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot alter replication slot \"%s\"", name),
> + errdetail("This replication slot was invalidated due to \"%s\".",
> +   SlotInvalidationCauses[MyReplicationSlot->data.invalidated]));
> +
>
> I thought including the reason "invalid" (e.g. "cannot alter invalid
> replication slot \"%s\"") in the message might be better, but OTOH I
> see the patch message is the same as an existing one. Maybe see what
> others think.

Changed.

> ==
> src/test/recovery/t/035_standby_logical_decoding.pl
>
> 3.
> There is already a comment about this test:
> ##
> # Recovery conflict: Invalidate conflicting slots, including in-use slots
> # Scenario 1: hot_standby_feedback off and vacuum FULL
> #
> # In passing, ensure that replication slot stats are not removed when the
> # active slot is invalidated.
> ##
>
> IMO we should update that "In passing..." sentence to something like:
>
> In passing, ensure that replication slot stats are not removed when
> the active slot is invalidated, and check that an error occurs when
> attempting to alter the invalid slot.

Added. But, keeping it closer to the test case doesn't hurt.

Please find the attached v2 patch also having Shveta's review comments
addressed.

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


v2-0001-Disallow-altering-invalidated-replication-slots.patch
Description: Binary data


Re: Converting README documentation to Markdown

2024-09-10 Thread Daniel Gustafsson
> On 10 Sep 2024, at 17:37, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> Since there doesn't seem to be much interest in going all the way to 
>> Markdown,
>> the attached 0001 is just the formatting changes for achieving (to some 
>> degree)
>> consistency among the README's.  This mostly boils down to using a consistent
>> amount of whitespace around code, using the same indentation on bullet lists
>> and starting sections the same way.  Inspecting the patch with git diff -w
>> reveals that it's not much left once whitespace is ignored.  There might be a
>> few markdown hunks left which I'll hunt down in case anyone is interested in
>> this.
> 
>> As an added bonus this still makes most READMEs render nicely as Markdown, 
>> just
>> not automatically on Github as it doesn't know the filetype.
> 
> I did not inspect the patch in detail, but this approach seems
> like a reasonable compromise.  However, if we're not officially
> going to Markdown, how likely is it that these files will
> stay valid in future edits?  I suspect most of us don't have
> those syntax rules wired into our fingers (I sure don't).

I'm not too worried, especially since we're not making any guarantees about
conforming to a set syntax.  We had already written more or less correct
Markdown; if we continue to create new content in the style of the surrounding
existing content, then I'm confident the files will stay very close to Markdown.

--
Daniel Gustafsson





Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-10 Thread Masahiko Sawada
On Mon, Sep 9, 2024 at 11:25 PM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > On Mon, Sep 9, 2024 at 4:42 PM Tom Lane  wrote:
> >> Do you have an idea for how we'd get
> >> this to happen during pg_upgrade, exactly?
>
> > What I was thinking is that we have "pg_dump --binary-upgrade" emit a
> > function call, say "SELECT binary_upgrade_update_gin_meta_page()" for
> > each GIN index, and the meta pages are updated when restoring the
> > schemas.
>
> Hmm, but ...
>
> 1. IIRC we don't move the relation files into the new cluster until
> after we've run the schema dump/restore step.  I think this'd have to
> be driven in some other way than from the pg_dump output.  I guess we
> could have pg_upgrade start up the new postmaster and call a function
> in each DB, which would have to scan for GIN indexes by itself.

You're right.

>
> 2. How will this interact with --link mode?  I don't see how it
> doesn't involve scribbling on files that are shared with the old
> cluster, leading to possible problems if the pg_upgrade fails later
> and the user tries to go back to using the old cluster.  It's not so
> much the metapage update that is worrisome --- we're assuming that
> that will modify storage that's unused in old versions.  But the
> change would be unrecorded in the old cluster's WAL, which sounds
> risky.
>
> Maybe we could get away with forcing --copy mode for affected
> indexes, but that feels a bit messy.  We'd not want to do it
> for unaffected indexes, so the copying code would need to know
> a great deal about this problem.

Good point. I agree that it would not be a very good idea to force --copy mode.

An alternative would be to store the char signedness in the control file,
and have the gin_trgm_ops opclass read it when the bytes in the meta page
show 'unset'. The char signedness in the control file would not be meant
for the compatibility check for physical replication, but would be used as
a hint. That could also be a bit messy, though.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

2024-09-10 Thread Jacob Champion
On Sun, Sep 8, 2024 at 1:08 PM Lars Kanis  wrote:
> I'm the maintainer of ruby-pg the ruby interface to the PostgreSQL
> database. This binding uses the asynchronous API of libpq by default to
> facilitate the ruby IO wait and scheduling mechanisms.
>
> This works well with the vanilla postgresql server, but it leads to
> starvation with other types of servers using the postgresql wire
> protocol 3. This is because the current functioning of the libpq async
> interface depends on a maximum size of SSL records of 8kB.

Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with
Python/psycopg2.

I was able to reproduce a hang when all of the following were true:
- psycopg2's async mode was enabled
- the client performs a PQconsumeInput/PQisBusy loop, waiting on
socket read events when the connection is busy (I used
psycopg2.extras.wait_select() for this)
- the server splits a large message over many large TLS records
- the server packs the final ReadyForQuery message into the same
record as the split message's final fragment

Gory details of the packet sizes, if it's helpful:
- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806
bytes, followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
  1) DataRow 1 fragment 1 (12k bytes)
  2) DataRow 1 fragment 2 (12k bytes)
  3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
   + CommandComplete + ReadyForQuery
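
For anyone who wants to reproduce this without psycopg2, the client-side
pattern is essentially the following (a minimal C sketch against plain
libpq; error handling omitted):

    #include <sys/select.h>
    #include <libpq-fe.h>

    static void
    wait_and_drain(PGconn *conn)
    {
        PGresult   *res;

        while (PQisBusy(conn))
        {
            int         sock = PQsocket(conn);
            fd_set      readable;

            FD_ZERO(&readable);
            FD_SET(sock, &readable);
            select(sock + 1, &readable, NULL, NULL, NULL);

            /*
             * If this consumes only one TLS record's worth of data and
             * leaves the rest buffered inside the SSL object, PQisBusy()
             * stays true and we go back to sleep in select() although
             * nothing more will arrive on the socket -- that's the hang.
             */
            if (!PQconsumeInput(conn))
                return;
        }

        while ((res = PQgetResult(conn)) != NULL)
            PQclear(res);
    }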

> To fix this issue the attached patch calls pqReadData() repeatedly in
> PQconsumeInput() until there is no buffered SSL data left to be read.
> Another solution could be to process buffered SSL read bytes in
> PQisBusy() instead of PQconsumeInput() .

I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?

I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?

https://commitfest.postgresql.org/50/

Thanks,
--Jacob




Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-10 Thread Robert Haas
On Tue, Sep 10, 2024 at 1:27 PM Noah Misch  wrote:
> On Tue, Sep 10, 2024 at 02:29:57PM +0900, Michael Paquier wrote:
> > You are adding twelve event points with only 5
> > new wait names.  Couldn't it be better to have a one-one mapping
> > instead, adding twelve entries in wait_event_names.txt?
>
> No, I think the patch's level of detail is better.  One shouldn't expect the
> two ldap_simple_bind_s() calls to have different-enough performance
> characteristics to justify exposing that level of detail to the DBA.
> ldap_search_s() and InitializeLDAPConnection() differ more, but the DBA mostly
> just needs to know the scale of their LDAP responsiveness problem.
>
> (Someday, it might be good to expose the file:line and/or backtrace associated
> with a wait, like we do for ereport().  As a way to satisfy rare needs for
> more detail, I'd prefer that over giving every pgstat_report_wait_start() a
> different name.)

I think unique names are a good idea. If a user doesn't care about the
difference between sdgjsA and sdgjsB, they can easily ignore the
trailing suffix, and IME, people typically do that without really
stopping to think about it. If on the other hand the two are lumped
together as sdgjs and a user needs to distinguish them, they can't. So
I see unique names as having much more upside than downside.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Document DateStyle effect on jsonpath string()

2024-09-10 Thread Tom Lane
"David E. Wheeler"  writes:
> Rebase on 47c9803. I also changed the commitfest item[1] to “ready for 
> committer”, since jian reviewed it, though I couldn’t see a way to add jian 
> as a reviewer in the app. Hope that makes sense.

Pushed with a little additional polishing.

I thought the best way to address jian's complaint about DateStyle not
being clearly locked down was to change horology.sql to verify the
prevailing setting, as it has long done for TimeZone.  That's the
lead test script for related stuff, so it makes the most sense to
do it there.  Having done that, I don't feel a need to duplicate
that elsewhere.

regards, tom lane




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-10 Thread Tom Lane
Masahiko Sawada  writes:
> An alternative way would be that we store the char signedness in the
> control file, and gin_trgm_ops opclass reads it if the bytes in the
> meta page shows 'unset'. The char signedness in the control file
> doesn't mean to be used for the compatibility check for physical
> replication but used as a hint. But it also could be a bit messy,
> though.

Yeah, that seems like it could work.  But are we sure that replicas
get a copy of the primary's control file rather than creating their
own?

regards, tom lane




Re: Converting README documentation to Markdown

2024-09-10 Thread Robert Haas
On Tue, Sep 10, 2024 at 8:51 AM Daniel Gustafsson  wrote:
> Since there doesn't seem to be much interest in going all the way to Markdown,

Just for the record, I suspect going to Markdown is actually the right
thing to do. I am personally unenthusiastic about it because I need
one more thing to worry about when committing like I need a hole in my
head, but a chronic complaint about the PostgreSQL project is that we
insist on doing everything our own way instead of admitting that there
is significant value in conforming to, or at least being compatible
with, widely-adopted development practices, and using Markdown files
to document stuff in git repos seems to be one of those. No single
change that we make is going to make the difference between us
attracting the next generation of developers and not, but if we always
prioritize what feels good to people who learned to code in the 1970s
or 1980s (like me!) over what feels good to people who learned to code
in the 2010s or 2020s, we will definitely run out of developers at
some point.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Document DateStyle effect on jsonpath string()

2024-09-10 Thread David E. Wheeler
On Sep 10, 2024, at 14:51, Tom Lane  wrote:

> Pushed with a little additional polishing.

Thank you! Do you think it’d be worthwhile to back port to 17?

> I thought the best way to address jian's complaint about DateStyle not
> being clearly locked down was to change horology.sql to verify the
> prevailing setting, as it has long done for TimeZone.  That's the
> lead test script for related stuff, so it makes the most sense to
> do it there.  Having done that, I don't feel a need to duplicate
> that elsewhere.

Yeah, that will help, but I still bet next time I go to figure out what it is 
I’ll stick that line in some test to make it fail with clear output for what 
it’s set to 😂.
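
As a quick illustration of both points (a sketch; the second output depends on
the prevailing DateStyle, which is exactly what the new documentation covers):

SHOW datestyle;                      -- check the prevailing setting
SET datestyle = 'ISO, MDY';
SELECT jsonb_path_query('"2023-08-15 12:34:56"', '$.datetime().string()');
-- "2023-08-15 12:34:56" here; a setting such as 'SQL, MDY' would render the
-- same value along the lines of "08/15/2023 12:34:56"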

D







Re: BUG #18598: AddressSanitizer detects use after free inside json_unique_hash_match()

2024-09-10 Thread Tomas Vondra
On 9/5/24 06:06, Junwang Zhao wrote:
> 
> ...
> 
> I found two other places called json_unique_check_key.
> 
> One is *json_build_object_worker*, and the usage is the same as
> *json_object_agg_transfn_worker*, I fix that the same way, PSA
> 
> The following sql should trigger the problem, I haven't tried asan
> but traced the *repalloc* in gdb, I will try this later when I set up my
> asan building.
> 
> SELECT JSON_OBJECT(1: 1, '2': NULL, '3': 1, repeat('x', 1000): 1, 2:
> repeat('a', 100) WITH UNIQUE);
> 
> The other place is json_unique_object_field_start, it is set
> as a callback of JsonSemAction, and in the parse state machine,
> the fname used by the callback has been correctly handled.
> see [1] & [2].
> 
> [1]: 
> https://github.com/postgres/postgres/blob/master/src/common/jsonapi.c#L1069-L1087
> [2]: 
> https://github.com/postgres/postgres/blob/master/src/common/jsonapi.c#L785-L823
> 
> 


Thanks for the fix and the idea for a function triggering issue in the
other place. I verified that it indeed triggers a valgrind report, and
the fix makes it go away.

Attached is a patch with proper commit message, explaining the issue,
and various other details. Can you please proofread it and check that I
got all the details right?

The patch also adds two regression tests, triggering the issue (without
the rest of the patch applied). It took me a while to realize the
existing tests pass simply because the objects are tiny and don't
require enlarging the buffer, and thus the repalloc.
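
For illustration, the shape of statement such a test needs (a sketch, not the
exact test added here): the keys must be long enough that the StringInfo
output buffer outgrows its initial 1024 bytes while keys are still being
checked for uniqueness.

SELECT JSON_OBJECTAGG(k : v WITH UNIQUE KEYS)
FROM (VALUES (repeat('a', 700), 1),
             (repeat('b', 700), 2),
             (repeat('a', 700), 3)) AS t(k, v);
-- must fail with a duplicate-key error; before the fix the dangling key
-- pointers could let the duplicate go unnoticed (or flag a false duplicate)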

The only question that bothers me a little bit is the possibility of a
memory leak - could it happen that we keep the copied key much longer
than needed? Or does aggcontext have the right life span? AFAICS
that's where we allocate the aggregate state, so it seems fine.

Also, how far back do we need to backpatch this? ISTM PG15 does not have
this issue, and it was introduced with the SQL/JSON stuff in PG16. Is
that correct?


regards

-- 
Tomas VondraFrom 45837d645a276912b400b5d99d80ac290e0ba8ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Tue, 10 Sep 2024 17:15:10 +0200
Subject: [PATCH] Fix unique key checks in JSON object constructors

When building a JSON object, the code builds a hash table of keys, to
allow checking if the keys are unique. The uniqueness check and adding
the new key happens in json_unique_check_key(), but this assumes the
pointer to the key remains valid.

Unfortunately, two places passed pointers to keys in a StringInfo
buffer, while also appending additional data (more key/value pairs) to
the buffer. With enough data the buffer would be resized using
enlargeStringInfo(), which does repalloc, invalidating the earlier key
pointers.

Due to this the uniqueness check may fail with both false negatives and
false positives, producing JSON objects with duplicate keys or failing
to produce a perfectly valid JSON object.

This affects four functions, introduced in PG16 with the new SQL/JSON:

- json_object_agg_unique / jsonb_object_agg_unique
- json_object / jsonb_objectagg

The existing regression tests did not detect this issue, simply because
the initial buffer size is 1024 and the objects were small enough not to
require the repalloc.

Reported by Alexander Lakhin, based on AddressSanitizer failure.
Valgrind notices this too, but both require a JSON object large enough
to trigger the repalloc.

Fixed by properly copying the key into the aggregate memory context, and
adding regression tests with enough data to repalloc the buffer.

Investigation and initial fix by Junwang Zhao, with various improvements
and tests by me.

Backpatch to 16, where these functions were introduced.

Reported-by: Alexander Lakhin
Author: Junwang Zhao, Tomas Vondra
Discussion: https://postgr.es/m/18598-3279ed972a234...@postgresql.org
Discussion: https://postgr.es/m/caeg8a3jjh0rejf2_o7-8luebo69bxphyexs95_x7+h9amwf...@mail.gmail.com
---
 src/backend/utils/adt/json.c          | 6 ++++--
 src/test/regress/expected/json.out    | 3 +++
 src/test/regress/expected/sqljson.out | 5 +++++
 src/test/regress/sql/json.sql         | 2 ++
 src/test/regress/sql/sqljson.sql      | 5 +++++
 5 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 4af0a60..3819afdc433 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -,7 +,8 @@ json_object_agg_transfn_worker(FunctionCallInfo fcinfo,
 
 	if (unique_keys)
 	{
-		const char *key = &out->data[key_offset];
+		const char *key = MemoryContextStrdup(aggcontext,
+			  &out->data[key_offset]);
 
 		if (!json_unique_check_key(&state->unique_check.check, key, 0))
 			ereport(ERROR,
@@ -1275,7 +1276,8 @@ json_build_object_worker(int nargs, const Datum *args, const bool *nulls, const
 		if (unique_keys)
 		{
 			/* check key uniqueness after key appending */
-			const char *key = &out->data[key_offset];
+			const char *key = MemoryContextStrdup(unique_check.mcxt,
+	

Re: Document DateStyle effect on jsonpath string()

2024-09-10 Thread Peter Eisentraut
Isn't this behavior actually a bug that should be fixed rather than 
documented?


These JSON path functions are specified by the SQL standard, so they 
shouldn't depend on PostgreSQL-specific settings.  At least in new 
functionality we should avoid that, no?



On 10.09.24 21:43, David E. Wheeler wrote:

On Sep 10, 2024, at 14:51, Tom Lane  wrote:


Pushed with a little additional polishing.


Thank you! Do you think it’d be worthwhile to back port to 17?


I thought the best way to address jian's complaint about DateStyle not
being clearly locked down was to change horology.sql to verify the
prevailing setting, as it has long done for TimeZone.  That's the
lead test script for related stuff, so it makes the most sense to
do it there.  Having done that, I don't feel a need to duplicate
that elsewhere.


Yeah, that will help, but I still bet next time I go to figure out what it is 
I’ll stick that line in some test to make it fail with clear output for what 
it’s set to 😂.






Re: json_query conditional wrapper bug

2024-09-10 Thread Peter Eisentraut

On 10.09.24 10:00, Amit Langote wrote:

Sorry for missing this report and thanks Andrew for the offlist heads up.

On Wed, Sep 4, 2024 at 7:16 PM Peter Eisentraut  wrote:

On 28.08.24 11:21, Peter Eisentraut wrote:

These are ok:

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' without wrapper);
   json_query

   42

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with
unconditional wrapper);
   json_query

   [42]

But this appears to be wrong:

select json_query('{"a": 1, "b": 42}'::jsonb, 'lax $.b' with conditional
wrapper);
   json_query

   [42]

This should return an unwrapped 42.


If I make the code change illustrated in the attached patch, then I get
the correct result here.  And various regression test results change,
which, to me, all look more correct after this patch.  I don't know what
the code I removed was supposed to accomplish, but it seems to be wrong
somehow.  In the current implementation, the WITH CONDITIONAL WRAPPER
clause doesn't appear to work correctly in any case I could identify.


Agreed that this looks wrong.

I've wondered why the condition was like that but left it as-is,
because I thought at one point that that's needed to ensure that the
returned single scalar SQL/JSON item is valid jsonb.

I've updated your patch to include updated test outputs and a nearby
code comment expanded.  Do you intend to commit it or do you prefer
that I do?


This change looks unrelated:

-ERROR:  new row for relation "test_jsonb_constraints" violates check 
constraint "test_jsonb_constraint4"
+ERROR:  new row for relation "test_jsonb_constraints" violates check 
constraint "test_jsonb_constraint5"


Is this some randomness in the way these constraints are evaluated?




Re: Document DateStyle effect on jsonpath string()

2024-09-10 Thread Tom Lane
Peter Eisentraut  writes:
> These JSON path functions are specified by the SQL standard, so they 
> shouldn't depend on PostgreSQL-specific settings.  At least in new 
> functionality we should avoid that, no?

Hmm ... but does the standard precisely define the output format?

Since these conversions are built on our own timestamp I/O code,
I rather imagine there is quite a lot of behavior there that's
not to be found in the standard.  That doesn't really trouble
me as long as the spec's behavior is a subset of it (i.e.,
reachable as long as you've got the right parameter settings).

regards, tom lane




Re: Document DateStyle effect on jsonpath string()

2024-09-10 Thread Tom Lane
"David E. Wheeler"  writes:
> On Sep 10, 2024, at 14:51, Tom Lane  wrote:
>> Pushed with a little additional polishing.

> Thank you! Do you think it’d be worthwhile to back port to 17?

Not as things stand.  If we adopt Peter's nearby position that
the current behavior is actually buggy, then probably back-patching
a corrected version would be worthwhile as a part of fixing it.

regards, tom lane




Re: [BUG?] XMLSERIALIZE( ... INDENT) won't work with blank nodes

2024-09-10 Thread Tom Lane
Jim Jones  writes:
> [ xmlserialize patches ]

Pushed with minor editorialization.  Notably, I got rid of scribbling
on xmlBufferContent's buffer --- I don't know how likely that is to
upset libxml2, but it seems like a fairly bad idea given that they
declare the result as "const xmlChar*".  Casting away the const is
poor form in any case.

regards, tom lane




Re: Speeding up ruleutils' name de-duplication code, redux

2024-09-10 Thread David Rowley
On Wed, 11 Sept 2024 at 03:06, Tom Lane  wrote:
> We could accomplish what you suggest by re-ordering the calls so that
> we build the hash table before enlarging the array.  0001 attached
> is the same as before (modulo line number changes from being rebased
> up to HEAD) and then 0002 implements this idea on top.  On the whole
> though I find 0002 fairly ugly and would prefer to stick to 0001.
> I really doubt that scanning any newly-created column positions is
> going to take long enough to justify intertwining things like this.

I'm fine with that.  I did test the performance with and without
v2-0002 and the performance is just a little too noisy to tell. Both
runs I did with v2-0002, it was slower, so I agree it's not worth
making the code uglier for.

I've no more comments. Looks good.

David




Re: Speeding up ruleutils' name de-duplication code, redux

2024-09-10 Thread Tom Lane
David Rowley  writes:
> On Wed, 11 Sept 2024 at 03:06, Tom Lane  wrote:
>> We could accomplish what you suggest by re-ordering the calls so that
>> we build the hash table before enlarging the array.  0001 attached
>> is the same as before (modulo line number changes from being rebased
>> up to HEAD) and then 0002 implements this idea on top.  On the whole
>> though I find 0002 fairly ugly and would prefer to stick to 0001.
>> I really doubt that scanning any newly-created column positions is
>> going to take long enough to justify intertwining things like this.

> I'm fine with that.  I did test the performance with and without
> v2-0002 and the performance is just a little too noisy to tell. Both
> runs I did with v2-0002, it was slower, so I agree it's not worth
> making the code uglier for.
> I've no more comments. Looks good.

Thanks for the review!  I'll go push just 0001.

regards, tom lane




Re: Document DateStyle effect on jsonpath string()

2024-09-10 Thread David E. Wheeler
On Sep 10, 2024, at 16:10, Peter Eisentraut  wrote:

> These JSON path functions are specified by the SQL standard, so they 
> shouldn't depend on PostgreSQL-specific settings.  At least in new 
> functionality we should avoid that, no?

Does that also apply to `datetime(template)`, where it uses the 
`to_timestamp()` templates? From the docs[1]:

> The datetime() and datetime(template) methods use the same parsing rules as 
> the to_timestamp SQL function does (see Section 9.8[2]), with three 
> exceptions. First, these methods don't allow unmatched template patterns. 
> Second, only the following separators are allowed in the template string: 
> minus sign, period, solidus (slash), comma, apostrophe, semicolon, colon and 
> space. Third, separators in the template string must exactly match the input 
> string.

Does the standard specify a formatting language?
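
For reference, the template form in question looks like this (behaviour as
described in the docs quoted above; output shown for illustration):

SELECT jsonb_path_query('"15-08-2023"', '$.datetime("DD-MM-YYYY")');
-- "2023-08-15"  (datetime items are emitted in ISO 8601 form, unlike
-- string(), which goes through the DateStyle-dependent output functions)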

Best,

David

[1]: https://www.postgresql.org/docs/devel/functions-json.html
[2]: https://www.postgresql.org/docs/devel/functions-formatting.html



Re: Document DateStyle effect on jsonpath string()

2024-09-10 Thread David E. Wheeler
On Sep 10, 2024, at 16:17, Tom Lane  wrote:

> Not as things stand.  If we adopt Peter's nearby position that
> the current behavior is actually buggy, then probably back-patching
> a corrected version would be worthwhile as a part of fixing it.

Oh, I see now that my reply to him points out the same issue as yours.

So annoying that the standard is not publicly available for any one of us to go 
look.

D





Re: Jargon and acronyms on this mailing list

2024-09-10 Thread Andrew Dunstan



On 2024-09-09 Mo 3:54 PM, Andrew Dunstan wrote:


On 2024-09-09 Mo 1:19 PM, Robert Haas wrote:
On Mon, Sep 9, 2024 at 1:03 PM Andrew Dunstan  
wrote:
I guess I could try to write code to migrate everything, but it 
would be

somewhat fragile. And what would we do if we ever decided to migrate
"master" to another name like "main"? I do at least have code ready for
that eventuality, but it would (currently) still keep the visible name
of HEAD.

Personally, I think using HEAD to mean master is really confusing. In
git, master is a branch name, and HEAD is the tip of some branch, or
the random commit you've checked out that isn't even a branch. I know
that's not how it worked in CVS, but CVS was a very long time ago.

If we rename master to main or devel or something, we'll have to
adjust the way we speak again, but that's not a reason to keep using
the wrong terminology for the way things are now.



There are some serious obstacles to changing it all over, though. I 
don't want to rewrite all the history, for example.


What we could do relatively simply is change what is seen publicly. 
e.g. we could rewrite the status page to read "Branch: master". We 
could also change URLs we generate to use master instead of HEAD (and 
change it back when processing the URLs). And so on.



I've done this. Nothing in the client or the database has changed, but 
the fact that we refer to "master" as "HEAD" is pretty much hidden now 
from the web app and the emails it sends out. That should help lessen 
any confusion in casual viewers.


Comments welcome. I don't think I have missed anything but it's always 
possible.



cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 02:51:23PM -0400, Robert Haas wrote:
> On Tue, Sep 10, 2024 at 1:27 PM Noah Misch  wrote:
> > On Tue, Sep 10, 2024 at 02:29:57PM +0900, Michael Paquier wrote:
> > > You are adding twelve event points with only 5
> > > new wait names.  Couldn't it be better to have a one-one mapping
> > > instead, adding twelve entries in wait_event_names.txt?
> >
> > No, I think the patch's level of detail is better.  One shouldn't expect the
> > two ldap_simple_bind_s() calls to have different-enough performance
> > characteristics to justify exposing that level of detail to the DBA.
> > ldap_search_s() and InitializeLDAPConnection() differ more, but the DBA 
> > mostly
> > just needs to know the scale of their LDAP responsiveness problem.
> >
> > (Someday, it might be good to expose the file:line and/or backtrace 
> > associated
> > with a wait, like we do for ereport().  As a way to satisfy rare needs for
> > more detail, I'd prefer that over giving every pgstat_report_wait_start() a
> > different name.)
> 
> I think unique names are a good idea. If a user doesn't care about the
> difference between sdgjsA and sdjgsB, they can easily ignore the
> trailing suffix, and IME, people typically do that without really
> stopping to think about it. If on the other hand the two are lumped
> together as sdjgs and a user needs to distinguish them, they can't. So
> I see unique names as having much more upside than downside.

I agree a person can ignore the distinction, but that requires the person to
be consuming the raw event list.  It's reasonable to tell your monitoring tool
to give you the top N wait events.  Individual AuthnLdap* events may all miss
the cut even though their aggregate would have made the cut.  Before you know
to teach that monitoring tool to group AuthnLdap* together, it won't show you
any of those names.
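
A sketch of the sort of query such a monitoring setup runs (the AuthnLdap*
names are the hypothetical per-call-site names from this discussion):

SELECT wait_event_type, wait_event, count(*) AS backends
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC
LIMIT 10;

Each individual AuthnLdap* row can land below the LIMIT even when their sum
would have made the cut, and folding them back together requires already
knowing which names belong to the same family.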

I felt commit c789f0f also chose sub-optimally in this respect, particularly
with the DblinkGetConnect/DblinkConnect pair.  I didn't feel strongly enough
to complain at the time, but a rule of "each wait event appears in one
pgstat_report_wait_start()" would be a rule I don't want.  One needs
familiarity with the dblink implementation internals to grasp the
DblinkGetConnect/DblinkConnect distinction, and a plausible refactor of dblink
would make those names cease to fit.  I see this level of fine-grained naming
as making the event name a sort of stable proxy for FILE:LINE.  I'd value
exposing such a proxy, all else being equal, but I don't think wait event
names like AuthLdapBindLdapbinddn/AuthLdapBindUser are the right way.  Wait
event names should be more independent of today's code-level details.




Re: Allow logical failover slots to wait on synchronous replication

2024-09-10 Thread John H
Hi Shveta,

On Sun, Sep 8, 2024 at 11:16 PM shveta malik  wrote:

>
> I was trying to have a look at the patch again, it doesn't apply on
> the head, needs rebase.
>

Rebased with the latest changes.

> Regarding 'mode = SyncRepWaitMode', FWIW, SyncRepWaitForLSN() also
> does in a similar way. It gets mode in local var initially and uses it
> later. See [1]. So isn't there a chance too that 'SyncRepWaitMode' can
> change in between.
>
> [1]:
> mode = SyncRepWaitMode;
> .
> 
> if (!WalSndCtl->sync_standbys_defined ||
> lsn <= WalSndCtl->lsn[mode])
> {
> LWLockRelease(SyncRepLock);
> return;
> }

You are right, thanks for the correction. I tried reproducing with GDB
where SyncRepWaitMode
changes due to pg_ctl reload but was unable to do so. It seems like
SIGHUP only sets ConfigReloadPending = true,
which gets processed in the next loop in WalSndLoop() and that's
probably where I was getting confused.

In the latest patch, I've added:

Assert(SyncRepWaitMode >= 0);

which should be true since we call SyncRepConfigured() at the
beginning of StandbySlotsHaveCaughtup(),
and used SyncRepWaitMode directly.

Thank you

-- 
John Hsu - Amazon Web Services


0005-Wait-on-synchronous-replication-by-default-for-logic.patch
Description: Binary data


Re: Proposal to Enable/Disable Index using ALTER INDEX

2024-09-10 Thread David Rowley
On Wed, 11 Sept 2024 at 03:12, Nathan Bossart  wrote:
>
> On Tue, Sep 10, 2024 at 10:16:34AM +1200, David Rowley wrote:
> > If we get the skip scan feature for PG18, then there's likely going to
> > be lots of people with indexes that they might want to consider
> > removing after upgrading. Maybe this is a good time to consider this
> > feature as it possibly won't ever be more useful than it will be after
> > we get skip scans.
>
> +1, this is something I've wanted for some time.  There was some past
> discussion, too [0].
>
> [0] 
> https://postgr.es/m/flat/ed8c9ed7-bb5d-aaec-065b-ad4893645deb%402ndQuadrant.com

Thanks for digging that up. I'd forgotten about that.  I see there was
pushback from having this last time, which is now over 6 years ago.
In the meantime, we still have nothing to make this easy for people.

I think the most important point I read in that thread is [1]. Maybe
what I mentioned in [2] is a good workaround.

Additionally, I think there will need to be syntax in CREATE INDEX for
this. Without that pg_get_indexdef() might return SQL that does not
reflect the current state of the index. MySQL seems to use "CREATE
INDEX name ON table (col) [VISIBLE|INVISIBLE]".
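
For instance, one possible shape, loosely following that MySQL form (purely
illustrative syntax, nothing proposed or agreed here):

CREATE INDEX idx_t_col ON t (col) INVISIBLE;
ALTER INDEX idx_t_col VISIBLE;   -- make the planner consider it again
-- pg_get_indexdef() would then also have to emit the VISIBLE/INVISIBLE state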

David

[1] 
https://www.postgresql.org/message-id/20180618215635.m5vrnxdxhxytvmcm%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/CAKJS1f_L7y_BTGESp5Qd6BSRHXP0mj3x9O9C_U27GU478UwpBw%40mail.gmail.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-10 Thread Masahiko Sawada
On Tue, Sep 10, 2024 at 11:57 AM Tom Lane  wrote:
>
> Masahiko Sawada  writes:
> > An alternative way would be that we store the char signedness in the
> > control file, and gin_trgm_ops opclass reads it if the bytes in the
> > meta page shows 'unset'. The char signedness in the control file
> > doesn't mean to be used for the compatibility check for physical
> > replication but used as a hint. But it also could be a bit messy,
> > though.
>
> Yeah, that seems like it could work.  But are we sure that replicas
> get a copy of the primary's control file rather than creating their
> own?

Yes, I think so. Since at least the system identifiers of primary and
replicas must be identical for physical replication, if replicas use
their own control files then they cannot start the replication.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-10 Thread Tom Lane
Masahiko Sawada  writes:
> On Tue, Sep 10, 2024 at 11:57 AM Tom Lane  wrote:
>> Yeah, that seems like it could work.  But are we sure that replicas
>> get a copy of the primary's control file rather than creating their
>> own?

> Yes, I think so. Since at least the system identifiers of primary and
> replicas must be identical for physical replication, if replicas use
> their own control files then they cannot start the replication.

Got it.  So now I'm wondering if we need all the complexity of storing
stuff in the GIN metapages.  Could we simply read the (primary's)
signedness out of pg_control and use that?  We might need some caching
mechanism to make that cheap enough, but accessing the current index's
metapage is far from free either.

regards, tom lane




Re: pg_trgm comparison bug on cross-architecture replication due to different char implementation

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 05:56:47PM -0400, Tom Lane wrote:
> Masahiko Sawada  writes:
> > On Tue, Sep 10, 2024 at 11:57 AM Tom Lane  wrote:
> >> Yeah, that seems like it could work.  But are we sure that replicas
> >> get a copy of the primary's control file rather than creating their
> >> own?
> 
> > Yes, I think so. Since at least the system identifiers of primary and
> > replicas must be identical for physical replication, if replicas use
> > their own control files then they cannot start the replication.
> 
> Got it.  So now I'm wondering if we need all the complexity of storing
> stuff in the GIN metapages.  Could we simply read the (primary's)
> signedness out of pg_control and use that?

Yes.




Re: Retire support for OpenSSL 1.1.1 due to raised API requirements

2024-09-10 Thread Michael Paquier
On Tue, Sep 10, 2024 at 10:44:42AM +0200, Daniel Gustafsson wrote:
> This change will be committed together with the TLSv1.3 cipher suite pathcset,
> just wanted to bring it up here and not hide it in another thread.

As you wish ;)
--
Michael


signature.asc
Description: PGP signature


Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-10 Thread Michael Paquier
On Tue, Sep 10, 2024 at 01:58:50PM -0700, Noah Misch wrote:
> On Tue, Sep 10, 2024 at 02:51:23PM -0400, Robert Haas wrote:
>> I think unique names are a good idea. If a user doesn't care about the
>> difference between sdgjsA and sdjgsB, they can easily ignore the
>> trailing suffix, and IME, people typically do that without really
>> stopping to think about it. If on the other hand the two are lumped
>> together as sdjgs and a user needs to distinguish them, they can't. So
>> I see unique names as having much more upside than downside.
> 
> I agree a person can ignore the distinction, but that requires the person to
> be consuming the raw event list.  It's reasonable to tell your monitoring tool
> to give you the top N wait events.  Individual AuthnLdap* events may all miss
> the cut even though their aggregate would have made the cut.  Before you know
> to teach that monitoring tool to group AuthnLdap* together, it won't show you
> any of those names.

That's a fair point.  I use a bunch of aggregates with group bys for
any monitoring queries looking for event point patterns.  In my
experience, when dealing with enough connections, patterns show up
anyway even if there is noise because some of the events that I was
looking for are rather short-term, like sync events interleaving
with locks, by storing an average of the events into a secondary table
with some INSERT SELECT.
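
A rough sketch of that pattern (table and column names made up for
illustration), run periodically so that short-lived events still leave a
trace:

CREATE TABLE wait_event_samples (
    sample_time     timestamptz DEFAULT now(),
    wait_event_type text,
    wait_event      text,
    backends        bigint
);

INSERT INTO wait_event_samples (wait_event_type, wait_event, backends)
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY 1, 2;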

> I felt commit c789f0f also chose sub-optimally in this respect, particularly
> with the DblinkGetConnect/DblinkConnect pair.  I didn't feel strongly enough
> to complain at the time, but a rule of "each wait event appears in one
> pgstat_report_wait_start()" would be a rule I don't want.  One needs
> familiarity with the dblink implementation internals to grasp the
> DblinkGetConnect/DblinkConnect distinction, and a plausible refactor of dblink
> would make those names cease to fit.  I see this level of fine-grained naming
> as making the event name a sort of stable proxy for FILE:LINE.  I'd value
> exposing such a proxy, all else being equal, but I don't think wait event
> names like AuthLdapBindLdapbinddn/AuthLdapBindUser are the right way.  Wait
> event names should be more independent of today's code-level details.

Depends.  I'd rather choose more granularity to know exactly which
part of the code I am dealing with, especially in the case of this
thread where these are embedded around external function calls.  If,
for example, one notices that a stack of pg_stat_activity scans are
complaining about a specific step in the authentication process, it is
going to offer a much better hint than having to guess which part of
the authentication step is slow, like in LDAP.

Wait event additions are also kind of cheap in terms of maintenance in
core, creating a new translation cost.  So I also think there are more
upsides to be wilder here with more points and more granularity.
--
Michael


signature.asc
Description: PGP signature


Re: Refactoring postmaster's code to cleanup after child exit

2024-09-10 Thread Andres Freund
Hi,

On 2024-09-10 13:33:36 -0400, Robert Haas wrote:
> On Tue, Sep 10, 2024 at 12:59 PM Andres Freund  wrote:
> > I still think that we'd be better off to just return an error to the client 
> > in
> > postmaster, rather than deal with this dead-end children mess. That was
> > perhaps justified at some point, but now it seems to add way more complexity
> > than it's worth. And it's absurdly expensive to fork to return an error. Way
> > more expensive than just having postmaster send an error and close the 
> > socket.
> 
> The tricky case is the one where the client write() -- or SSL_write() -- 
> blocks.

Yea, SSL definitely does make it harder. But it's not exactly rocket science
to do non-blocking SSL connection establishment. After all, we do manage to
do so in libpq...

Greetings,

Andres Freund




Re: Use read streams in pg_visibility

2024-09-10 Thread Noah Misch
On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:
> Your patch is correct. I wrongly assumed it would catch blockno bug,
> the attached version catches it. I made blockno = 0 invisible and not
> frozen before copying the vm file. So, in the blockno buggy version;
> callback will skip that block but the main loop in the
> collect_corrupt_items() will not skip it. I tested it with your patch
> and there is exactly 1 blockno difference between expected and result
> output.

Pushed.  I added autovacuum=off so auto-analyze of a system catalog can't take
a snapshot that blocks VACUUM updating the vismap.  I doubt that could happen
under default settings, but this lets us disregard the possibility entirely.

I also fixed the mix of tabs and spaces inside test file string literals.




Re: Windows socket problems, interesting connection to AIO

2024-09-10 Thread Noah Misch
On Mon, Sep 02, 2024 at 09:20:21PM +1200, Thomas Munro wrote:
> 2.  If a Windows client tries to send() and gets an ECONNRESET/EPIPE
> error, then the network stack seems to drop already received data, so
> a following recv() will never see it.  In other words, it depends on
> whether the application-level protocol is strictly request/response
> based, or has sequence points at which both ends might send().  AFAIK
> the main consequence for real users is that FATAL recovery conflict,
> idle termination, etc messages are not delivered to clients, leaving
> just "server closed the connection unexpectedly".

> The new thought I had about the second category of problem is: if you
> use asynchronous networking APIs, then the kernel *can't* throw your
> data out, because it doesn't even have it.  If the server's FATAL
> message arrives before the client calls send(), then the data is
> already written to user space memory and the I/O is marked as
> complete.

Good point.

> just wanted to share this observation.

Thanks for sharing that and the test program.




Re: Make printtup a bit faster

2024-09-10 Thread Andy Fan


Hello David & Andreas, 

> On 8/29/24 1:51 PM, David Rowley wrote:
>> I had planned to work on this for PG18, but I'd be happy for some
>> assistance if you're willing.
>
> I am interested in working on this, unless Andy Fan wants to do this
> work. :) I believe that optimizing the out, in and send functions would
> be worth the pain. I get Tom's objections but I do not think adding a
> small check would add much overhead compared to the gains we can get.

Just to be clearer, I'd like work on the out function only due to my
internal assignment. (Since David planned it for PG18, so it is better
say things clearer earlier). I'd put parts of out(print) function
refactor in the next 2 days. I think it deserves a double check before
working on *all* the out function.

select count(*), count(distinct typoutput) from pg_type;
 count | count 
---+---
   621 |97
(1 row)

select typoutput, count(*) from pg_type group by typoutput having
count(*) > 1 order by 2 desc;

typoutput| count 
-+---
 array_out   |   296
 record_out  |   214
 multirange_out  | 6
 range_out   | 6
 varcharout  | 3
 int4out | 2
 timestamptz_out | 2
 nameout | 2
 textout | 2
(9 rows)

-- 
Best Regards
Andy Fan





Re: not null constraints, again

2024-09-10 Thread jian he
On Wed, Sep 11, 2024 at 2:18 AM Alvaro Herrera  wrote:
>
> Hello, here's a v2 of this patch.  I have fixed --I think-- all the
> issues you and Tender Wang reported (unless I declined a fix in some
> previous email).
>

+ /*
+ * The constraint must appear as inherited in children, so create a
+ * modified constraint object to use.
+ */
+ constr = copyObject(constr);
+ constr->inhcount = 1;

in ATAddCheckNNConstraint, we don't need the above copyObject call,
because at the beginning of ATAddCheckNNConstraint, we do
newcons = AddRelationNewConstraints(rel, NIL,
                                    list_make1(copyObject(constr)),
                                    recursing || is_readd,  /* allow_merge */
                                    !recursing,             /* is_local */
                                    is_readd,               /* is_internal */
                                    NULL);                  /* queryString not available here */


pg_constraint manual <<constr->has_not_null)
{
lst = RelationGetNotNullConstraints(RelationGetRelid(relation), false);
cxt->nnconstraints = list_concat(cxt->nnconstraints, lst);
}




we can do:
create table parent (a text, b int);
create table child () inherits (parent);
alter table child no inherit parent;

so comments in AdjustNotNullInheritance
 * AdjustNotNullInheritance
 *Adjust not-null constraints' inhcount/islocal for
 *ALTER TABLE [NO] INHERITS

"ALTER TABLE [NO] INHERITS"
should be
"ALTER TABLE ALTER COLUMN [NO] INHERITS"
?

Also, it seems AdjustNotNullInheritance is never called/used?




Re: Make printtup a bit faster

2024-09-10 Thread Tom Lane
Andy Fan  writes:
> Just to be clearer, I'd like work on the out function only due to my
> internal assignment. (Since David planned it for PG18, so it is better
> say things clearer earlier). I'd put parts of out(print) function
> refactor in the next 2 days. I think it deserves a double check before
> working on *all* the out function.

Well, sure.  You *cannot* write a patch that breaks existing output
functions.  Not at the start, and not at the end either.  You
should focus on writing the infrastructure and, for starters,
converting just a few output functions as a demonstration.  If
that gets accepted then you can work on converting other output
functions a few at a time.  But they'll never all be done, because
we can't realistically force extensions to convert.

There are lots of examples of similar incremental conversions in our
project's history.  I think the most recent example is the "soft error
handling" work (d9f7f5d32, ccff2d20e, and many follow-on patches).

regards, tom lane




Re: Disallow altering invalidated replication slots

2024-09-10 Thread Peter Smith
On Wed, Sep 11, 2024 at 3:54 AM Bharath Rupireddy
 wrote:
>
> Hi,
>
> Thanks for reviewing.
>
> On Tue, Sep 10, 2024 at 8:40 AM Peter Smith  wrote:
> >
> > Commit message
> >
> > 1.
> > ALTER_REPLICATION_SLOT on invalidated replication slots is unnecessary
> > as there is no way...
> >
> > suggestion:
> > ALTER_REPLICATION_SLOT for invalid replication slots should not be
> > allowed because there is no way...
>
> Modified.
>
> > ==
> > 2. Missing docs update
> >
> > Should this docs page [1] be updated to say ALTER_REPLICATION_SLOT is
> > not allowed for invalid slots?
>
> Haven't noticed for other ERROR cases in the docs, e.g. slots being
> synced, temporary slots. Not sure if it's worth adding every ERROR
> case to the docs.
>

OK.

> > ==
> > src/backend/replication/slot.c
> >
> > 3.
> > + if (MyReplicationSlot->data.invalidated != RS_INVAL_NONE)
> > + ereport(ERROR,
> > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg("cannot alter replication slot \"%s\"", name),
> > + errdetail("This replication slot was invalidated due to \"%s\".",
> > +   SlotInvalidationCauses[MyReplicationSlot->data.invalidated]));
> > +
> >
> > I thought including the reason "invalid" (e.g. "cannot alter invalid
> > replication slot \"%s\"") in the message might be better, but OTOH I
> > see the patch message is the same as an existing one. Maybe see what
> > others think.
>
> Changed.
>
> > ==
> > src/test/recovery/t/035_standby_logical_decoding.pl
> >
> > 3.
> > There is already a comment about this test:
> > ##
> > # Recovery conflict: Invalidate conflicting slots, including in-use slots
> > # Scenario 1: hot_standby_feedback off and vacuum FULL
> > #
> > # In passing, ensure that replication slot stats are not removed when the
> > # active slot is invalidated.
> > ##
> >
> > IMO we should update that "In passing..." sentence to something like:
> >
> > In passing, ensure that replication slot stats are not removed when
> > the active slot is invalidated, and check that an error occurs when
> > attempting to alter the invalid slot.
>
> Added. But, keeping it closer to the test case doesn't hurt.
>
> Please find the attached v2 patch also having Shveta's review comments
> addressed.
>

The v2 patch looks OK to me.

==
Kind Regards,
Peter Smith.
Fujitsu Australia




Re: Eager aggregation, take 3

2024-09-10 Thread Tender Wang
Richard Guo  于2024年8月21日周三 15:11写道:

> On Fri, Aug 16, 2024 at 4:14 PM Richard Guo 
> wrote:
> > I had a self-review of this patchset and made some refactoring,
> > especially to the function that creates the RelAggInfo structure for a
> > given relation.  While there were no major changes, the code should
> > now be simpler.
>
> I found a bug in v10 patchset: when we generate the GROUP BY clauses
> for the partial aggregation that is pushed down to a non-aggregated
> relation, we may produce a clause with a tleSortGroupRef that
> duplicates one already present in the query's groupClause, which would
> cause problems.
>
> Attached is the updated version of the patchset that fixes this bug
> and includes further code refactoring.
>
>
I continue to review the v11 version patches. Here are some of my thoughts.

1. In make_one_rel(), we have the below codes:
/*
* Build grouped base relations for each base rel if possible.
*/
setup_base_grouped_rels(root);

As far as I know, each base rel only has one grouped base relation, if
possible.
The comments may be changed to "Build a grouped base relation for each base
rel if possible."

2.  According to the comments of generate_grouped_paths(), we may generate
paths for a grouped
relation on top of paths of a join relation. So the "rel_plain" argument in
generate_grouped_paths() may be
confusing. "plain" usually means "base rel". How about renaming rel_plain
to input_rel?

3. In create_partial_grouping_paths(), the partially_grouped_rel could have
already been created due to eager
aggregation. If partially_grouped_rel exists, its reltarget has already been
created. So do we need the logic below?

/*
* Build target list for partial aggregate paths.  These paths cannot just
* emit the same tlist as regular aggregate paths, because (1) we must
* include Vars and Aggrefs needed in HAVING, which might not appear in
* the result tlist, and (2) the Aggrefs must be set in partial mode.
*/
partially_grouped_rel->reltarget =
   make_partial_grouping_target(root, grouped_rel->reltarget,
extra->havingQual);


--
Thanks,
Tender Wang


RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Tuesday, September 10, 2024 7:25 PM Amit Kapila  
wrote:
> 
> On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > ---
> > ISSUES and SOLUTION
> > ---
> >
> > To detect update_deleted conflicts, we need to search for dead tuples
> > in the table. However, dead tuples can be removed by VACUUM at any
> > time. Therefore, to ensure consistent and accurate conflict detection,
> > tuples deleted by other origins must not be removed by VACUUM before
> > the conflict detection process. If the tuples are removed prematurely,
> > it might lead to incorrect conflict identification and resolution, causing 
> > data
> divergence between nodes.
> >
> > Here is an example of how VACUUM could affect conflict detection and
> > how to prevent this issue. Assume we have a bidirectional cluster with
> > two nodes, A and B.
> >
> > Node A:
> >   T1: INSERT INTO t (id, value) VALUES (1,1);
> >   T2: DELETE FROM t WHERE id = 1;
> >
> > Node B:
> >   T3: UPDATE t SET value = 2 WHERE id = 1;
> >
> > To retain the deleted tuples, the initial idea was that once
> > transaction T2 had been applied to both nodes, there was no longer a
> > need to preserve the dead tuple on Node A. However, a scenario arises
> > where transactions T3 and T2 occur concurrently, with T3 committing
> > slightly earlier than T2. In this case, if Node B applies T2 and Node
> > A removes the dead tuple (1,1) via VACUUM, and then Node A applies T3
> > after the VACUUM operation, it can only result in an update_missing
> > conflict. Given that the default resolution for update_missing
> > conflicts is apply_or_skip (e.g. convert update to insert if possible
> > and apply the insert), Node A will eventually hold a row (1,2) while Node B
> becomes empty, causing data inconsistency.
> >
> > Therefore, the strategy needs to be expanded as follows: Node A cannot
> > remove the dead tuple until:
> > (a) The DELETE operation is replayed on all remote nodes, *AND*
> > (b) The transactions on logical standbys occurring before the replay
> > of Node A's DELETE are replayed on Node A as well.
> >
> > ---
> > THE DESIGN
> > ---
> >
> > To achieve the above, we plan to allow the logical walsender to
> > maintain and advance the slot.xmin to protect the data in the user
> > table and introduce a new logical standby feedback message. This
> > message reports the WAL position that has been replayed on the logical
> > standby *AND* the changes occurring on the logical standby before the
> > WAL position are also replayed to the walsender's node (where the
> > walsender is running). After receiving the new feedback message, the
> > walsender will advance the slot.xmin based on the flush info, similar
> > to the advancement of catalog_xmin. Currently, the effective_xmin/xmin
> > of logical slot are unused during logical replication, so I think it's safe 
> > and
> won't cause side-effect to reuse the xmin for this feature.
> >
> > We have introduced a new subscription option
> > (feedback_slots='slot1,...'), where these slots will be used to check
> > condition (b): the transactions on logical standbys occurring before
> > the replay of Node A's DELETE are replayed on Node A as well.
> > Therefore, on Node B, users should specify the slots corresponding to
> > Node A in this option. The apply worker will get the oldest confirmed
> > flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making it an automatic way, e.g.
> > let apply worker select the slots that acquired by the walsenders
> > which connect to the same remote server(e.g. if apply worker's
> > connection info or some other flags is same as the walsender's
> > connection info). But it seems tricky because if some slots are
> > inactive which means the walsenders are not there, the apply worker
> > could not find the correct slots to check unless we save the host along with
> the slot's persistence data.
> >
> > The new feedback message is sent only if feedback_slots is not NULL.
> >
> 
> Don't you need to deal with versioning stuff for sending this new message? I
> mean what if the receiver side of this message is old and doesn't support this
> new message.

Yes, I think we can avoid sending the new message if the remote server version
doesn't support handling this message (e.g. server_version < 18). Will address
this in next version.

> 
> One minor comment on 0003
> ===
> 1.
> get_slot_confirmed_flush()
> {
> ...
> + /*
> + * To prevent concurrent slot dropping and creation while filtering the
> + * slots, take the ReplicationSlotControlLock outside of the loop.
> + */
> + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +
> + foreach_ptr(String, name, MySubscription->feedback_slots) { XLogRecPtr
> + confirmed_flush; ReplicationSlot *slot;
> +
> + slot = ValidateAndGetFeedbackSlot(strVal(name));
> 
> Why do we need to validate slots each time here? Isn't it better to do it 
> once?

I think it's possible tha

Re: query_id, pg_stat_activity, extended query protocol

2024-09-10 Thread Michael Paquier
On Mon, Sep 09, 2024 at 06:20:01PM -0500, Sami Imseih wrote:
> On 14/8/2024 23:05, Imseih (AWS), Sami wrote:
>> Also, while writing the test, I found out that now, JumbleQuery takes 
>> into account constants of the A_Const node, and calls of the same 
>> prepared statement with different parameters generate different 
>> query_id. Is it a reason to introduce JumbleQuery options and allow 
>> different logic of queryid generation?
> 
> Can you start a new thread for this prepared statement scenario?

Yes, please, this makes the thread rather confusing by adding
different problems into the mix that require different analysis and
actions.  Let's only focus on the issue that the query ID reporting
in pg_stat_activity is missing for the extended query protocol here.
--
Michael


signature.asc
Description: PGP signature


Re: Disallow altering invalidated replication slots

2024-09-10 Thread shveta malik
On Tue, Sep 10, 2024 at 11:24 PM Bharath Rupireddy
 wrote:
>
>
> Please find the attached v2 patch also having Shveta's review comments
> addressed.

The v2 patch looks good to me.

thanks
Shveta




Re: Pgoutput not capturing the generated columns

2024-09-10 Thread Peter Smith
Here are a some more review comments for patch v30-0001.

==
src/sgml/ref/create_publication.sgml

1.
+ 
+  If the publisher-side column is also a generated column
then this option
+  has no effect; the publisher column will be filled as normal with the
+  publisher-side computed or default data.
+ 

It should say "subscriber-side"; not "publisher-side". The same was
already reported by Sawada-San [1].

~~~

2.
+ 
+ This parameter can only be set true if
copy_data is
+ set to false.
+ 

IMO this limitation should be addressed by patch 0001 like it was
already done in the previous patches (e.g. v22-0002). I think
Sawada-san suggested the same [1].

Anyway, 'copy_data' is not a PUBLICATION option, so the fact it is
mentioned like this without any reference to the SUBSCRIPTION seems
like a cut/paste error from the previous implementation.

==
src/backend/catalog/pg_publication.c

3. pub_collist_validate
- if (TupleDescAttr(tupdesc, attnum - 1)->attgenerated)
- ereport(ERROR,
- errcode(ERRCODE_INVALID_COLUMN_REFERENCE),
- errmsg("cannot use generated column \"%s\" in publication column list",
-colname));
-

Instead of just removing this ERROR entirely here, I thought it would
be more user-friendly to give a WARNING if the PUBLICATION's explicit
column list includes generated cols when the option
"publish_generated_columns" is false. This combination doesn't seem
like something a user would do intentionally, so just silently
ignoring it (like the current patch does) is likely going to give
someone unexpected results/grief.
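
A sketch of the combination in question (the option comes from the patch under
review, and the WARNING is only what this comment suggests, not existing
behaviour):

CREATE TABLE gtest (a int, b int GENERATED ALWAYS AS (a * 2) STORED);
CREATE PUBLICATION pub_gtest FOR TABLE gtest (a, b)
    WITH (publish_generated_columns = false);
-- suggested: WARNING, because generated column "b" is listed explicitly but
-- would be silently skipped with the option set to false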

==
src/backend/replication/logical/proto.c

4. logicalrep_write_tuple, and logicalrep_write_attrs:

- if (att->attisdropped || att->attgenerated)
+ if (att->attisdropped)
  continue;

Why aren't you also checking the new PUBLICATION option here and
skipping all gencols if the "publish_generated_columns" option is
false? Or is the BMS of pgoutput_column_list_init handling this case?
Maybe there should be an Assert for this?

==
src/backend/replication/pgoutput/pgoutput.c

5. send_relation_and_attrs

- if (att->attisdropped || att->attgenerated)
+ if (att->attisdropped)
  continue;

Same question as #4.

~~~

6. prepare_all_columns_bms and pgoutput_column_list_init

+ if (att->attgenerated && !pub->pubgencolumns)
+ cols = bms_del_member(cols, i + 1);

IIUC, the algorithm seems overly tricky filling the BMS with all
columns, before straight away conditionally removing the generated
columns. Can't it be refactored to assign all the correct columns
up-front, to avoid calling bms_del_member()?

==
src/bin/pg_dump/pg_dump.c

7. getPublications

IIUC, there is lots of missing SQL code here (for all older versions)
that should be saying "false AS pubgencolumns".
e.g. compare the SQL with how "false AS pubviaroot" is used.

==
src/bin/pg_dump/t/002_pg_dump.pl

8. Missing tests?

I expected to see a pg_dump test for this new PUBLICATION option.

==
src/test/regress/sql/publication.sql

9. Missing tests?

How about adding another test case that checks this new option must be
"Boolean"?

~~~

10. Missing tests?

--- error: generated column "d" can't be in list
+-- ok: generated columns can be in the list too
 ALTER PUBLICATION testpub_fortable ADD TABLE testpub_tbl5 (a, d);
+ALTER PUBLICATION testpub_fortable DROP TABLE testpub_tbl5;

(see my earlier comment #3)

IMO there should be another test case for a WARNING here if the user
attempts to include generated column 'd' in an explicit PUBLICATION
column list while "publish_generated_columns" is false.

==
[1]  
https://www.postgresql.org/message-id/CAD21AoA-tdTz0G-vri8KM2TXeFU8RCDsOpBXUBCgwkfokF7%3DjA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia




Allow pushdown of HAVING clauses with grouping sets

2024-09-10 Thread Richard Guo
In some cases, we may want to transfer a HAVING clause into WHERE in
hopes of eliminating tuples before aggregation instead of after.

Previously, we couldn't do this if there were any nonempty grouping
sets, because we didn't have a way to tell if the HAVING clause
referenced any columns that were nullable by the grouping sets, and
moving such a clause into WHERE could potentially change the results.

Now, with expressions marked nullable by grouping sets with the RT
index of the RTE_GROUP RTE, it is much easier to identify those
clauses that reference any nullable-by-grouping-sets columns: we just
need to check if the RT index of the RTE_GROUP RTE is present in the
clause.  For other HAVING clauses, they can be safely pushed down.
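
For example, with a made-up table t(a, b, c) and assuming the patch:

-- b appears in every grouping set, so a HAVING clause on b alone can be
-- moved into WHERE and filter rows before aggregation
SELECT a, b, sum(c) FROM t
GROUP BY GROUPING SETS ((a, b), (b)) HAVING b > 10;

-- a is nullable by the grouping sets (it is absent from the set (b)),
-- so a HAVING clause on a has to stay where it is
SELECT a, b, sum(c) FROM t
GROUP BY GROUPING SETS ((a, b), (b)) HAVING a > 10;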

I'm not sure how common it is in practice to have a HAVING clause
where all referenced columns are present in all the grouping sets.
But it seems to me that this optimization doesn't cost too much.  Not
implementing it seems like leaving money on the table.

Any thoughts?

Thanks
Richard


v1-0001-Allow-pushdown-of-HAVING-clauses-with-grouping-sets.patch
Description: Binary data


Re: Conflict detection for update_deleted in logical replication

2024-09-10 Thread shveta malik
On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, September 10, 2024 5:56 PM shveta malik  
> wrote:
> >
> > On Tue, Sep 10, 2024 at 1:40 PM Zhijie Hou (Fujitsu) 
> > 
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 2:45 PM shveta malik
> >  wrote:
> > > >
> > > > Thank You Hou-San for explaining the design. But to make it easier
> > > > to understand, would you be able to explain the sequence/timeline of
> > > > the
> > > > *new* actions performed by the walsender and the apply processes for
> > > > the given example along with new feedback_slot config needed
> > > >
> > > > Node A: (Procs: walsenderA, applyA)
> > > >   T1: INSERT INTO t (id, value) VALUES (1,1);  ts=10.00 AM
> > > >   T2: DELETE FROM t WHERE id = 1;   ts=10.02 AM
> > > >
> > > > Node B: (Procs: walsenderB, applyB)
> > > >   T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
> > >
> > > Thanks for reviewing! Let me elaborate further on the example:
> > >
> > > On node A, feedback_slots should include the logical slot that used to
> > > replicate changes from Node A to Node B. On node B, feedback_slots
> > > should include the logical slot that replicate changes from Node B to 
> > > Node A.
> > >
> > > Assume the slot.xmin on Node A has been initialized to a valid
> > > number(740) before the following flow:
> > >
> > > Node A executed T1
> > >   - 10.00 AM
> > > T1 replicated and applied on Node B   
> > >   - 10.0001 AM
> > > Node B executed T3
> > >   - 10.01 AM
> > > Node A executed T2 (741)  
> > >   - 10.02 AM
> > > T2 replicated and applied on Node B (delete_missing)  
> > >   - 10.03 AM
> >
> > Not related to this feature, but do you mean delete_origin_differ here?
>
> Oh sorry, It's a miss. I meant delete_origin_differ.
>
> >
> > > T3 replicated and applied on Node A (new action, detect
> > update_deleted) - 10.04 AM
> > >
> > > (new action) Apply worker on Node B has confirmed that T2 has been
> > > applied locally and the transactions before T2 (e.g., T3) has been
> > > replicated and applied to Node A (e.g. feedback_slot.confirmed_flush_lsn
> > >= lsn of the local
> > > replayed T2), thus send the new feedback message to Node A.
> > - 10.05 AM
> > >
> > > (new action) Walsender on Node A received the message and would
> > > advance the slot.xmin.- 10.06 AM
> > >
> > > Then, after the slot.xmin is advanced to a number greater than 741,
> > > the VACUUM would be able to remove the dead tuple on Node A.
> > >
> >
> > Thanks for the example. Can you please review below and let me know if my
> > understanding is correct.
> >
> > 1)
> > In a bidirectional replication setup, the user has to create slots in a way 
> > that
> > NodeA's sub's slot is Node B's feedback_slot and Node B's sub's slot is Node
> > A's feedback slot. And then only this feature will work well, is it correct 
> > to say?
>
> Yes, your understanding is correct.
>
> >
> > 2)
> > Now coming back to multiple feedback_slots in a subscription, is the below
> > correct:
> >
> > Say Node A has publications and subscriptions as follow:
> > --
> > A_pub1
> >
> > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> >
> >
> > Say Node B has publications and subscriptions as follow:
> > --
> > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> >
> > B_pub1
> > B_pub2
> > B_pub3
> >
> > Then what will be the feedback_slot configuration for all subscriptions of 
> > A and
> > B? Is below correct:
> > --
> > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
>
> Right. The above configurations are correct.

Okay. It seems difficult to understand configuration from user's perspective.

> >
> > 3)
> > If the above is true, then do we have a way to make sure that the user  has
> > given this configuration exactly the above way? If users end up giving
> > feedback_slots as some random slot  (say A_slot4 or incomplete list), do we
> > validate that? (I have not looked at code yet, just trying to understand 
> > design
> > first).
>
> The patch doesn't validate if the feedback slots belong to the correct
> subscriptions on the remote server. It only validates if the slot is an existing,
> valid, logical slot. I think there are a few challenges to validate it further.
> E.g., we need a way to identify which server the slot is replicating
> changes to, which could be tricky as the slot currently doesn't have any info
> to identify the remote server. Besides, the slot could be inactive temporarily
> due to some subscriber side error, in which case we cannot verify the
> subscription that used it.

Re: Invalid Assert while validating REPLICA IDENTITY?

2024-09-10 Thread Amit Kapila
On Tue, Sep 10, 2024 at 2:16 PM Amit Kapila  wrote:
>
> On Tue, Sep 10, 2024 at 11:36 AM vignesh C  wrote:
> >
> > On Mon, 9 Sept 2024 at 13:12, Amit Kapila  wrote:
> > >
> > > The second part of the assertion is incomplete. The
> > > IsIndexUsableForReplicaIdentityFull() should be used only when the
> > > remote relation has REPLICA_IDENTITY_FULL set. I haven't tested all
> > > possible cases yet but I think the attached should be a better way to
> > > write this assertion.
> >
> > The changes look good to me.
> >
>
> Thanks, I'll push this tomorrow unless Dilip or anyone else has any
> comments. BTW, as the current code doesn't lead to any bug or
> assertion failure, I am planning to push this change to HEAD only, let
> me know if you think otherwise.
>

Pushed.

-- 
With Regards,
Amit Kapila.




RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Wednesday, September 11, 2024 12:18 PM shveta malik  
wrote:
> 
> On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) 
> wrote:
> >
> > On Tuesday, September 10, 2024 5:56 PM shveta malik
>  wrote:
> > >
> > > Thanks for the example. Can you please review below and let me know
> > > if my understanding is correct.
> > >
> > > 1)
> > > In a bidirectional replication setup, the user has to create slots
> > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > B's sub's slot is Node A's feedback slot. And then only this feature will
> work well, is it correct to say?
> >
> > Yes, your understanding is correct.
> >
> > >
> > > 2)
> > > Now coming back to multiple feedback_slots in a subscription, is the
> > > below
> > > correct:
> > >
> > > Say Node A has publications and subscriptions as follow:
> > > --
> > > A_pub1
> > >
> > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > >
> > >
> > > Say Node B has publications and subscriptions as follow:
> > > --
> > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > >
> > > B_pub1
> > > B_pub2
> > > B_pub3
> > >
> > > Then what will be the feedback_slot configuration for all
> > > subscriptions of A and B? Is below correct:
> > > --
> > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> >
> > Right. The above configurations are correct.
> 
> Okay. It seems difficult to understand configuration from user's perspective.

Right. I think we could give an example in the document to make it clear.
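
As a rough illustration (the feedback_slots subscription option only exists
in the patch under discussion, so the option name and exact syntax below are
assumptions rather than a committed API), the Node A / Node B setup above
could be configured along these lines:

    -- On Node A: the subscriptions applying changes from Node B reference
    -- the slot through which Node A's own changes flow to Node B
    ALTER SUBSCRIPTION A_sub1 SET (feedback_slots = 'B_sub1');
    ALTER SUBSCRIPTION A_sub2 SET (feedback_slots = 'B_sub1');
    ALTER SUBSCRIPTION A_sub3 SET (feedback_slots = 'B_sub1');

    -- On Node B: the single subscription applying changes from Node A lists
    -- all the slots that stream Node B's changes back to Node A
    ALTER SUBSCRIPTION B_sub1 SET (feedback_slots = 'A_sub1, A_sub2, A_sub3');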

> 
> > >
> > > 3)
> > > If the above is true, then do we have a way to make sure that the
> > > user  has given this configuration exactly the above way? If users
> > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > incomplete list), do we validate that? (I have not looked at code
> > > yet, just trying to understand design first).
> >
> > The patch doesn't validate if the feedback slots belong to the correct
> > subscriptions on the remote server. It only validates if the slot is an
> > existing, valid, logical slot. I think there are a few challenges to
> > validate it further.
> > E.g., we need a way to identify which server the slot is
> > replicating changes to, which could be tricky as the slot currently
> > doesn't have any info to identify the remote server. Besides, the slot
> > could be inactive temporarily due to some subscriber side error, in
> > which case we cannot verify the subscription that used it.
> 
> Okay, I understand the challenges here.
> 
> > >
> > > 4)
> > > Now coming to this:
> > >
> > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > specified slots and send the LSN as a feedback message to the
> > > > walsender.
> > >
> > >  There will be one apply worker on B which will be due to B_sub1, so
> > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > Won't it be sufficient to check the confirmed_lsn of, say, slot A_sub1
> > > alone which has subscribed to table 't' on which the delete has been
> > > performed? The rest of the slots (A_sub2, A_sub3) might have subscribed to
> > > different tables?
> >
> > I think it's theoretically correct to only check the A_sub1. We could
> > document that user can do this by identifying the tables that each
> > subscription replicates, but it may not be user friendly.
> >
> 
> Sorry, I fail to understand how the user can identify the tables and give
> feedback_slots accordingly? I thought feedback_slots is a one-time
> configuration when replication is set up (or when the setup changes in the
> future); it cannot keep on changing with each query. Or am I missing
> something?

I meant that the user has all the publication information (including the tables
added in a publication) that the subscription subscribes to, and could also
have the slot_name, so I think it's possible to identify the tables that each
subscription includes and add the feedback_slots correspondingly before
starting the replication. It would be pretty complicated, although possible, so I
prefer not to mention it in the first place if it does not bring much
benefit.

> 
> IMO, it is something which should be identified internally. Since the query
> is on table 't1', the feedback slot which is for 't1' should be used to check
> the LSN. But on rethinking, this optimization may not be worth the effort;
> the identification part could be tricky, so it might be okay to check all
> the slots.

I agree that identifying these internally would add complexity.

> 
> ~~
> 
> Another query is about 3 node setup. I couldn't figure out what would be
> feedback_slots setting when it is not bidirectional, as in consider the case
> where there are three nodes A,B,C. Node C is subscribing to both Node A and
> Node B. Node A and Node B are the ones doing concurrent "update" and "delete"
> which will both be replicated to Node C. In this case what will be the
> feedback_slots setting on Node C? We don't have any slots here which will be
> replicating changes from Node C to Node A and Node C to Node B. This is given
> in [3] in your first email ([1])

Thanks for pointing this, the link was a bit misleading. I think the solution
proposed in this thread is only used to allow detecting update_deleted
reliably in a bidirectional cluster. For non-bidirectional cases, it would be
more tricky to predict the timing till when we should retain the dead tuples.

Re: Add contrib/pg_logicalsnapinspect

2024-09-10 Thread Amit Kapila
On Tue, Sep 10, 2024 at 8:56 PM Bertrand Drouvot
 wrote:
>
> On Mon, Sep 09, 2024 at 04:24:09PM +0530, Amit Kapila wrote:
> > On Fri, Aug 30, 2024 at 5:18 PM Bertrand Drouvot
> >  wrote:
> > > as we decided not to expose the SnapBuildOnDisk and SnapBuild structs to 
> > > public
> > > and to create/expose 2 new functions in snapbuild.c then the functions in 
> > > the
> > > module would do nothing but expose the data coming from the snapbuild.c's
> > > functions (get the tuple and send it to the client). That sounds weird to 
> > > me to
> > > create a module that would "only" do so, that's why I thought that in core
> > > functions taking care of everything make more sense.
> > >
> >
> > I see your point. Does anyone else have an opinion on the need for
> > these functions and whether to expose them from a contrib module or
> > have them as core functions?
>
> I looked at when the SNAPBUILD_VERSION has been changed:
>
> ec5896aed3 (2014)
> a975ff4980 (2021)
> 8bdb1332eb (2021)
> 7f13ac8123 (2022)
> bb19b70081 (2024)
>
> So it's not like we are changing the SnapBuildOnDisk or SnapBuild structs that
> frequently. Furthermore, those structs are serialized and so we have to 
> preserve
> their on-disk compatibility (means we can change them only in a major release
> if we need to).
>
> So, I think it would not be that much of an issue to expose those structs and
> create a new contrib module (as v1 did propose) instead of in core new 
> functions.
>
> If we want to insist that external modules "should" not rely on those structs 
> then
> we could put them into a new internal_snapbuild.h file (instead of snapbuild.h
> as proposed in v1).
>

Adding snapbuild_internal.h sounds like a good idea.

> At the end, I think that creating a contrib module and exposing those structs 
> in
> internal_snapbuild.h makes more sense (as compared to in-core functions).
>

Fair enough. We can keep the module name as logicalinspect so that we
can extend it for other logical decoding/replication-related files in
the future.
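
Just to illustrate the shape of the interface being discussed, a usage sketch
(the extension and function names below follow the proposal in this thread
and are assumptions, not a committed API):

    CREATE EXTENSION pg_logicalinspect;
    -- inspect a serialized snapshot file under pg_logical/snapshots/
    SELECT * FROM pg_get_logical_snapshot_meta('0-40796E18.snap');
    SELECT * FROM pg_get_logical_snapshot_info('0-40796E18.snap');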


-- 
With Regards,
Amit Kapila.




Re: Conflict detection for update_deleted in logical replication

2024-09-10 Thread shveta malik
On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, September 11, 2024 12:18 PM shveta malik 
>  wrote:
> >
> > On Tue, Sep 10, 2024 at 4:30 PM Zhijie Hou (Fujitsu) 
> > 
> > wrote:
> > >
> > > On Tuesday, September 10, 2024 5:56 PM shveta malik
> >  wrote:
> > > >
> > > > Thanks for the example. Can you please review below and let me know
> > > > if my understanding is correct.
> > > >
> > > > 1)
> > > > In a bidirectional replication setup, the user has to create slots
> > > > in a way that NodeA's sub's slot is Node B's feedback_slot and Node
> > > > B's sub's slot is Node A's feedback slot. And then only this feature 
> > > > will
> > work well, is it correct to say?
> > >
> > > Yes, your understanding is correct.
> > >
> > > >
> > > > 2)
> > > > Now coming back to multiple feedback_slots in a subscription, is the
> > > > below
> > > > correct:
> > > >
> > > > Say Node A has publications and subscriptions as follow:
> > > > --
> > > > A_pub1
> > > >
> > > > A_sub1 (subscribing to B_pub1 with the default slot_name of A_sub1)
> > > > A_sub2 (subscribing to B_pub2 with the default slot_name of A_sub2)
> > > > A_sub3 (subscribing to B_pub3 with the default slot_name of A_sub3)
> > > >
> > > >
> > > > Say Node B has publications and subscriptions as follow:
> > > > --
> > > > B_sub1 (subscribing to A_pub1 with the default slot_name of B_sub1)
> > > >
> > > > B_pub1
> > > > B_pub2
> > > > B_pub3
> > > >
> > > > Then what will be the feedback_slot configuration for all
> > > > subscriptions of A and B? Is below correct:
> > > > --
> > > > A_sub1, A_sub2, A_sub3: feedback_slots=B_sub1
> > > > B_sub1: feedback_slots=A_sub1,A_sub2, A_sub3
> > >
> > > Right. The above configurations are correct.
> >
> > Okay. It seems difficult to understand configuration from user's 
> > perspective.
>
> Right. I think we could give an example in the document to make it clear.
>
> >
> > > >
> > > > 3)
> > > > If the above is true, then do we have a way to make sure that the
> > > > user  has given this configuration exactly the above way? If users
> > > > end up giving feedback_slots as some random slot  (say A_slot4 or
> > > > incomplete list), do we validate that? (I have not looked at code
> > > > yet, just trying to understand design first).
> > >
> > > The patch doesn't validate if the feedback slots belong to the correct
> > > subscriptions on the remote server. It only validates if the slot is an
> > > existing, valid, logical slot. I think there are a few challenges to
> > > validate it further.
> > > E.g., we need a way to identify which server the slot is
> > > replicating changes to, which could be tricky as the slot currently
> > > doesn't have any info to identify the remote server. Besides, the slot
> > > could be inactive temporarily due to some subscriber side error, in
> > > which case we cannot verify the subscription that used it.
> >
> > Okay, I understand the challenges here.
> >
> > > >
> > > > 4)
> > > > Now coming to this:
> > > >
> > > > > The apply worker will get the oldest confirmed flush LSN among the
> > > > > specified slots and send the LSN as a feedback message to the
> > > > > walsender.
> > > >
> > > >  There will be one apply worker on B which will be due to B_sub1, so
> > > > will it check confirmed_lsn of all slots A_sub1,A_sub2, A_sub3?
> > > > Won't it be sufficient to check the confirmed_lsn of, say, slot A_sub1
> > > > alone which has subscribed to table 't' on which the delete has been
> > > > performed? The rest of the slots (A_sub2, A_sub3) might have subscribed to
> > > > different tables?
> > >
> > > I think it's theoretically correct to only check the A_sub1. We could
> > > document that user can do this by identifying the tables that each
> > > subscription replicates, but it may not be user friendly.
> > >
> >
> > Sorry, I fail to understand how the user can identify the tables and give
> > feedback_slots accordingly? I thought feedback_slots is a one-time
> > configuration when replication is set up (or when the setup changes in the
> > future); it cannot keep on changing with each query. Or am I missing
> > something?
>
> I meant that the user has all the publication information (including the tables
> added in a publication) that the subscription subscribes to, and could also
> have the slot_name, so I think it's possible to identify the tables that each
> subscription includes and add the feedback_slots correspondingly before
> starting the replication. It would be pretty complicated, although possible, so I
> prefer not to mention it in the first place if it does not bring much
> benefit.
>
> >
> > IMO, it is something which should be identified internally. Since the query
> > is on table 't1', the feedback slot which is for 't1' should be used to check
> > the LSN. But on rethinking, this optimization may not be worth the effort;
> > the identification part could be tricky, so it might be okay to check all
> > the slots.
>
> I agree that identifying these internally would add complexity.

RE: Conflict detection for update_deleted in logical replication

2024-09-10 Thread Zhijie Hou (Fujitsu)
On Wednesday, September 11, 2024 1:03 PM shveta malik  
wrote:
> 
> On Wed, Sep 11, 2024 at 10:15 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > On Wednesday, September 11, 2024 12:18 PM shveta malik
>  wrote:
> > >
> > > ~~
> > >
> > > Another query is about 3 node setup. I couldn't figure out what
> > > would be feedback_slots setting when it is not bidirectional, as in
> > > consider the case where there are three nodes A,B,C. Node C is
> > > subscribing to both Node A and Node B. Node A and Node B are the
> > > ones doing concurrent "update" and "delete" which will both be
> > > replicated to Node C. In this case what will be the feedback_slots
> > > setting on Node C? We don't have any slots here which will be
> > > replicating changes from Node C to Node A and Node C to Node B. This
> > > is given in [3] in your first email ([1])
> >
> > Thanks for pointing this, the link was a bit misleading. I think the
> > solution proposed in this thread is only used to allow detecting
> > update_deleted reliably in a bidirectional cluster.  For non-bidirectional
> > cases, it would be more tricky to predict the timing till when we should
> > retain the dead tuples.
> >
> 
> So in brief, this solution is only for bidirectional setups? For
> non-bidirectional, feedback_slots is non-configurable and thus irrelevant.

Right.

> 
> Irrespective of the above, if the user ends up setting feedback_slot to some
> random but existing slot which is not consuming changes at all, then it may
> so happen that the node will never send a feedback message to the other node,
> resulting in accumulation of dead tuples on the other node. Is that a
> possibility?

Yes, it's possible. I think this is a common situation for this kind of
user-specified option. For example, user DML will be blocked if any inactive
standby names are added to synchronous_standby_names.
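
To make the analogy concrete (this is just existing synchronous replication
behaviour, unrelated to the patch): commits wait if the only listed standby
is not actually streaming.

    ALTER SYSTEM SET synchronous_standby_names = 'not_running_standby';
    SELECT pg_reload_conf();
    -- any committing write transaction now waits for a standby that never
    -- confirms, much like a feedback slot that never advances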

Best Regards,
Hou zj




Re: query_id, pg_stat_activity, extended query protocol

2024-09-10 Thread Michael Paquier
On Mon, Sep 02, 2024 at 10:11:43AM +0900, Michael Paquier wrote:
> I need to spend a bit more time with my head down for this thread, but
> couldn't we use these commands with various query patterns in
> pg_stat_statements and look at the shmem counters reported through its
> view?

My apologies for the time it took, but here you go with a patch set.

I have looked at this thread overall, and there are two problems at
hand regarding the lack of reporting of the query ID in backend
entries for the extended query protocol:
1) ExecutorRun() misses the reports, which happens when a query
does an ExecutorStart(), then a series of ExecutorRun() through a
portal with bind messages.  Robert has mentioned that separately a few
days ago at [1].  But that's not everything.
2) A query executed through a portal with tuples to return in a
tuplestore also misses the query ID report.  For example, a DML
RETURNING with the extended protocol would use an execute (with
ExecutorStart and ExecutorRun) followed by a series of execute fetch.
pg_stat_activity would report the query ID for the execute, not for
the fetches, while pg_stat_activity has the query string.  That's
confusing.
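
For instance, while the fetch messages of such a portal are being processed,
a second session looking at pg_stat_activity would see the query text but no
query ID (a simplified way to observe the symptom, not one of the tests from
the patch set):

    -- run from another session while the extended-protocol fetches are ongoing
    SELECT pid, query_id, state, left(query, 60) AS query
    FROM pg_stat_activity
    WHERE backend_type = 'client backend';
    -- the row for the fetching backend shows the DML RETURNING text
    -- with a NULL query_id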

The patch series attached addresses these two in 0001 and 0003.  0001
should be backpatched (still need to wordsmith the comments), where
I've come down to the approach of using a report in ExecutorRun()
because it is simpler and it does the job.  Perhaps also 0003, but
nobody has complained about that, either.

I have also looked at the tests proposed (isolation, TAP, custom
module); all of them are a bit disappointing because they duplicate
some patterns that are already tested in pg_stat_statements, while
willing to check the contents of pg_stat_statements.  I am afraid that
it is not going to age well because we'd need to have the same query
patterns in more than one place.  We should have tests, definitely,
but we can do an equivalent of pg_stat_activity lookups by calling
pgstat_get_my_query_id() in strategic places, making sure that all
dedicated paths always have the query ID reported:
- Check pgstat_get_my_query_id() in the run, finish and end executor
hooks.
- In the parse-analyze hook, before the query ID is reported (except
for a PREPARE), check that the ID in a Query is set.

The test proposed by Robert on the other thread was fancy enough that
I've added it.  All that is in 0002, and that's enough to cause 0001
to fail, planning only these on HEAD.  Tests in 0003 require fetch
messages, and I don't have a trick up my sleeve unless we invent a
new meta-command in psql.

There are other problems mentioned on this thread, with plan caching
for example.  Let's deal with that separately, in separate threads.

[1]: 
https://www.postgresql.org/message-id/CA+TgmoZxtnf_jZ=vqbsyau8hfukkwojcj6ufy4lgpxaunkr...@mail.gmail.com
--
Michael
From d4d469da18d6442fed5a6fd5824f827e28cf5bb3 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 11 Sep 2024 14:12:24 +0900
Subject: [PATCH 1/3] Fix reporting of query ID with extended query protocol

A query run through the extended query protocol would miss reporting its
query ID to pg_stat_statements, as the value is reset when beginning the
processing of a new execute message.

ExecutorStart() was calling pgstat_report_query_id(), but it missed the
fact that multiple ExecutorRun() calls could be issued for a single
query with an initial ExecutorStart() across multiple execute messages,
hence the query ID would be missing in the second execute.

Backpatch-through: 14
---
 src/backend/executor/execMain.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 29e186fa73..94e18f2e36 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -119,10 +119,11 @@ void
 ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
 	/*
-	 * In some cases (e.g. an EXECUTE statement) a query execution will skip
-	 * parse analysis, which means that the query_id won't be reported.  Note
-	 * that it's harmless to report the query_id multiple times, as the call
-	 * will be ignored if the top level query_id has already been reported.
+	 * In some cases (e.g. an EXECUTE statement or an execute message with
+	 * the extended query protocol) the query_id won't be reported, so do it
+	 * now.  Note that it's harmless to report the query_id multiple times,
+	 * as the call will be ignored if the top level query_id has already been
+	 * reported.
 	 */
 	pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);
 
@@ -293,6 +294,17 @@ ExecutorRun(QueryDesc *queryDesc,
 			ScanDirection direction, uint64 count,
 			bool execute_once)
 {
+	/*
+	 * When executing a query with the extended query protocol,
+	 * ExecutorStart() may not be called, causing the query ID to not be
+	 * reported.  Hence, do it again here in case it was missed.
+	 *
+	 * Reporting multiple times the query ID is harmless.
+	 

Re: Use read streams in pg_visibility

2024-09-10 Thread Nazir Bilal Yavuz
Hi,

On Wed, 11 Sept 2024 at 01:38, Noah Misch  wrote:
>
> On Tue, Sep 10, 2024 at 02:35:46PM +0300, Nazir Bilal Yavuz wrote:
> > Your patch is correct. I wrongly assumed it would catch the blockno bug;
> > the attached version catches it. I made blockno = 0 invisible and not
> > frozen before copying the vm file. So, in the blockno-buggy version, the
> > callback will skip that block but the main loop in
> > collect_corrupt_items() will not skip it. I tested it with your patch
> > and there is exactly 1 blockno difference between expected and result
> > output.
>
> Pushed.  I added autovacuum=off so auto-analyze of a system catalog can't take
> a snapshot that blocks VACUUM updating the vismap.  I doubt that could happen
> under default settings, but this lets us disregard the possibility entirely.

Thanks!

> I also fixed the mix of tabs and spaces inside test file string literals.

I ran both pgindent and pgperltidy but they didn't catch them. Is
there an automated way to catch them?

-- 
Regards,
Nazir Bilal Yavuz
Microsoft




Inconsistency in lock structure after reporting "lock already held" error

2024-09-10 Thread 高增琦
At the end of SetupLockInTable(), there is a check for the "lock already
held" error.
Because the nRequested and requested[lockmode] value of a lock is bumped
before "lock already held" error, and there is no way to reduce them later
for
this situation, then it will keep the inconsistency in lock structure until
cluster restart or reset.

The inconsistency is:
* nRequested will never reduce to zero, the lock will never be
garbage-collected
* if there is a waitMask for this lock, the waitMast will never be removed,
then
  new proc will be blocked to wait for a lock with zero holder
  (looks weird in the pg_locks table)

I think moving the "lock already held" error before the bump operation of
nRequested
and requested[lockmode] value in SetupLockInTable() will fix it.
(maybe also fix the lock_twophase_recover() function)

To recreate the inconsistency:
1. create a backend 1 to lock table a, keep it idle in transaction
2. terminate backend 1 and hack it to skip the LockReleaseAll() function
3. create another backend 2 to lock table a, it will wait for the lock to
release
4. reuse the backend 1 (reuse the same proc) to lock table a again,
   it will trigger the "lock already held" error
5. quit both backend 1 and 2
6. create backend 3 to lock table a, it will wait for the lock's waitMask
7. check the pg_locks table (see the example query below)
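
For step 7, a query along these lines shows the stale entry (just a sketch;
it assumes the table is named "a" as in the steps above):

    SELECT locktype, relation::regclass AS rel, mode, granted, pid
    FROM pg_locks
    WHERE relation = 'a'::regclass;

With the inconsistency in place, backend 3 shows up with granted = false
even though no other backend holds a conflicting lock on the table.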

-- 
GaoZengqi
pgf...@gmail.com
zengqi...@gmail.com