Re: Synchronizing slots from primary to standby

2023-11-21 Thread shveta malik
On Tue, Nov 21, 2023 at 10:02 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, November 17, 2023 7:39 PM Amit Kapila  
> wrote:
> >
> > On Thu, Nov 16, 2023 at 5:34 PM shveta malik 
> > wrote:
> > >
> > > PFA v35.
> > >
> >
> > Review v35-0002*
> > ==============
>
> Thanks for the comments.
>
> > 1.
> > As quoted in the commit message,
> > >
> > If a logical slot is invalidated on the primary, the slot on the standby is
> > also invalidated. If a logical slot on the primary is valid but is
> > invalidated on the standby due to a conflict (say required rows were removed
> > on the primary), then that slot is dropped and recreated on the standby in
> > the next sync-cycle. It is okay to recreate such slots as long as these are
> > not consumable on the standby (which is the case currently).
> > >
> >
> > I think this won't happen normally because of the physical slot and
> > hot_standby_feedback, but it probably can occur in cases like the user
> > temporarily switching hot_standby_feedback from on to off. Are there any
> > other reasons? I think we can mention the cases along with it as well, at
> > least for now. Additionally, I think this should be covered in code
> > comments as well.
>
> I will collect all these cases and update them in the next version.
>
> >
> > 2.
> >  #include "postgres.h"
> > -
> > +#include "access/genam.h"
> >
> > Spurious line removal.
>
> Removed.
>
> >
> > 3.
> >    A password needs to be provided too, if the sender demands password
> >    authentication.  It can be provided in the
> >    primary_conninfo string, or in a separate
> > -  ~/.pgpass file on the standby server (use
> > -  replication as the database name).
> > -  Do not specify a database name in the
> > -  primary_conninfo string.
> > +  ~/.pgpass file on the standby server.
> > + 
> > + 
> > +  Specify dbname in
> > +  primary_conninfo string to allow synchronization
> > +  of slots from the primary server to the standby server.
> > +  This will only be used for slot synchronization. It is ignored
> > +  for streaming.
> >
> > Is there a reason to remove part of the earlier sentence "use
> > replication as the database name"?
>
> Added it back.
>
> >
> > 4.
> > +   enable_syncslot configuration parameter
> > +  
> > +  
> > +  
> > +   
> > +It enables a physical standby to synchronize logical failover slots
> > +from the primary server so that logical subscribers are not blocked
> > +after failover.
> > +   
> > +   
> > +It is enabled by default. This parameter can only be set in the
> > +postgresql.conf file or on the server command line.
> > +   
> >
> > I think you forgot to update the documentation for the default value of
> > this variable.
>
> Updated.
>
> >
> > 5.
> > + *   a) start the logical replication workers for every enabled subscription
> > + *  when not in standby_mode
> > + *   b) start the slot-sync worker for logical failover slots synchronization
> > + *  from the primary server when in standby_mode.
> >
> > Either use a full stop after both lines or after neither.
>
> Added a full stop.
>
> >
> > 6.
> > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker);
> >
> > There shouldn't be a space between * and worker.
>
> Removed, and added the type to typedefs.list.
>
> >
> > 7.
> > + if (!SlotSyncWorker->hdr.in_use)
> > + {
> > + LWLockRelease(SlotSyncWorkerLock);
> > + ereport(ERROR,
> > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg("replication slot-sync worker not initialized, "
> > + "cannot attach")));
> > + }
> > +
> > + if (SlotSyncWorker->hdr.proc)
> > + {
> > + LWLockRelease(SlotSyncWorkerLock);
> > + ereport(ERROR,
> > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > + errmsg("replication slot-sync worker is "
> > + "already running, cannot attach")));
> > + }
> >
> > Using slot-sync in the error messages looks a bit odd to me. Can we use
> > "replication slot sync worker ..." in both these and other similar
> > messages? I think it would be better if we don't split the messages into
> > multiple lines in these cases, as the messages don't appear too long to me.
>
> Changed as suggested.
>
> >
> > 8.
> > +/*
> > + * Detach the worker from DSM and update 'proc' and 'in_use'.
> > + * Logical replication launcher will come to know using these
> > + * that the worker has shutdown.
> > + */
> > +void
> > +slotsync_worker_detach(int code, Datum arg) {
> >
> > I think the reference to DSM is leftover from the previous version of
> > the patch. Can we change the above comments as per the new code?
>
> Changed.
>
> >
> > 9.
> > +static bool
> > +slotsync_worker_launch()
> > {
> > ...
> > + /* TODO: do we really need 'generation', analyse more here */
> > + worker->hdr.generation++;
> >
> > We should do something about this TODO.

Re: Synchronizing slots from primary to standby

2023-11-21 Thread shveta malik
On Tue, Nov 21, 2023 at 1:13 PM Drouvot, Bertrand
 wrote:
>
> Hi,
>
> On 11/21/23 6:16 AM, Amit Kapila wrote:
> > On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand
> >  wrote:
> >> As far as the 'i' state here, from what I see, it is currently useful for:
> >>
> >> 1. Cascading standbys, to not sync slots with state = 'i' from
> >> the first standby.
> >> 2. Easily reporting slots that did not catch up on the primary yet.
> >> 3. Avoiding inactive slots blocking the creation of "active" ones.
> >>
> >> So not creating those slots should not be an issue for 1. (syncs are
> >> not needed on a cascading standby as the slots are not created on the
> >> first standby yet) but is an issue for 2. (unless we provide another way
> >> to keep track of and report such slots) and 3. (as I think we would
> >> still need to reserve WAL).
> >>
> >> I've a question: we'd still need to reserve WAL for those slots, no?
> >>
> >> If that's the case and we don't call ReplicationSlotCreate(), then
> >> ReplicationSlotReserveWal() would not work, as MyReplicationSlot would
> >> be NULL.
> >>
> >
> > Yes, we need to reserve WAL to see if we can sync the slot. We are
> > currently creating an RS_EPHEMERAL slot and if we don't explicitly
> > persist it when we can't sync, then it will be dropped when we do
> > ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the
> > loss is probably that the next time we try to sync the slot, we need
> > to create it again and may need to wait for a newer restart_lsn on the
> > standby
>
> Yeah, and doing so we'd reduce the time window to give the slot a chance
> to catch up (as opposed to creating it a single time and maintaining an 'i' state).
>
> > which could be avoided if we have the slot in 'i' state from
> > the previous run.
>
> Right.
>
> > I don't deny the importance of having 'i'
> > (initialized) state but was just trying to say that it has additional
> > code complexity.
>
> Right, and I think it's worth it.
>
> > OTOH, having it may give even users better visibility into slots that
> > are not active (say manually created slots on the primary).
>
> Agree.
>
> All that being said, on my side I'm +1 on keeping the 'i' state behavior
> as it is implemented currently (would be happy to hear others' opinions too).
>

+1 for 'i' state. I feel it gives better slot-sync functionality
(optimizing redo effort for inactive slots, and inactive slots not blocking
active ones) along with its usage for monitoring purposes.




Re: About #13489, array dimensions and CREATE TABLE ... LIKE

2023-11-21 Thread Laurenz Albe
On Mon, 2023-11-20 at 21:13 -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
> > > Bruce Momjian  writes:
> > > > An alternate approach would
> > > > be to remove pg_attribute.attndims so we don't even try to preserve 
> > > > dimensionality.
> 
> > > I could get behind that, perhaps.  It looks like we're not using the
> > > field in any meaningful way, and we could simplify TupleDescInitEntry
> > > and perhaps some other APIs.
> 
> > So should I work on that patch or do you want to try?  I think we should
> > do something.
> 
> Let's wait for some other opinions, first ...

Looking at the code, I get the impression that we wouldn't lose anything
without "pg_attribute.attndims", so +1 for removing it.

This would call for some documentation work.  We should remove most of the
documentation about the nonexistent difference between declaring a column
"integer[]", "integer[][]" or "integer[3][3]" and just describe the first
variant in detail, perhaps mentioning that the other notations are
accepted for backward compatibility.

I also think that it would be helpful to emphasize that while dimensionality
does not matter to a column definition, it matters for individual array values.
Perhaps it would make sense to recommend a check constraint if one wants
to make sure that an array column contains only a certain kind of array.

Yours,
Laurenz Albe




Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)

2023-11-21 Thread Amit Kapila
On Sat, Nov 18, 2023 at 4:54 PM Amit Kapila  wrote:
>
> On Thu, Nov 16, 2023 at 6:09 PM Bharath Rupireddy
>  wrote:
> > PSA v18 patch.
> >
>
> LGTM. I'll push this early next week unless there are further
> suggestions or comments.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: How to accurately determine when a relation should use local buffers?

2023-11-21 Thread Aleksander Alekseev
Hi,

> I would like to clarify, what the correct way is to determine that a given 
> relation is using local buffers. Local buffers, as far as I know, are used 
> for temporary tables in backends. There are two functions/macros (bufmgr.c): 
> SmgrIsTemp, RelationUsesLocalBuffers. The first function verifies that the 
> current process is a regular session backend, while the other macro verifies 
> the relation persistence characteristic. It seems, the use of each function 
> independently is not correct. I think, these functions should be applied in 
> pair to check for local buffers use, but, it seems, these functions are used 
> independently. It works until temporary tables are allowed only in session 
> backends.

Could you please provide a specific example when the current code will
do something wrong/unintended?

> I'm concerned, how to determine the use of local buffers in some other 
> theoretical cases? For example, if we decide to replicate temporary tables? 
> Are there the other cases, when local buffers can be used with relations in 
> the Vanilla? Do we allow the use of relations with RELPERSISTENCE_TEMP not 
> only in session backends?

Temporary tables, by definition, are visible only within one session.
I can't imagine how and why they would be replicated.

-- 
Best regards,
Aleksander Alekseev




Re: [PATCH] fix race condition in libpq (related to ssl connections)

2023-11-21 Thread Aleksander Alekseev
Hi,

> I've found a race condition in libpq. It is about the initialization of
> the my_bio_methods static variable in fe-secure-openssl.c, which is not
> protected by any lock. The race condition may make the initialization of
> the connection fail, and as an additional weird consequence, it might
> cause openssl to call close(0), so stdin of the client application gets
> closed.

Thanks for the patch. Interestingly enough we have PQisthreadsafe()
[1], but its current implementation is:

```
/* libpq is thread-safe? */
int
PQisthreadsafe(void)
{
return true;
}
```

I wonder if we should just document that libpq is thread-safe as of PG
v??? and deprecate PQisthreadsafe() at some point. Currently the
documentation gives the impression that the library may or may not be
thread-safe depending on the circumstances.

> I've prepared a patch to protect the initialization of my_bio_methods
> from the race. This is my first patch submission to the postgresql
> project, so I hope I didn't miss anything. Any comments and suggestions
> are of course very welcome.
>
> I also prepared a testcase. In the testcase tarball, there is a patch
> that adds sleeps at the right positions to make the close(0) problem
> occur practically always. It also includes comments to explain how the
> race can end up calling close(0).
>
> Concerning the patch, it is only tested on Linux. I'm unsure about
> whether the simple initialization of the mutex would work nowadays also
> on Windows or whether the more complicated initialization also to be
> found for the ssl_config_mutex in the same source file needs to be used.
> Let me know whether I should adapt that.

Please add the patch to the nearest open commit fest [2]. The patch
will be automatically picked up by cfbot [3] and tested on different
platforms. Also this way it will not be lost among other patches.

The code looks OK but I would appreciate a second opinion from cfbot.
Also maybe a few comments before my_BIO_methods_init_mutex and/or
pthread_mutex_lock would be appropriate. Personally I am inclined to
think that the automatic test in this particular case is redundant.
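
For reference, the kind of guarded initialization being discussed might
look like the sketch below. This is not the submitted patch; it is a
minimal sketch assuming POSIX threads and OpenSSL 1.1+, and it elides the
actual BIO method setup:

```c
#include <pthread.h>
#include <openssl/bio.h>

static pthread_mutex_t my_BIO_methods_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static BIO_METHOD *my_bio_methods = NULL;

static BIO_METHOD *
my_BIO_s_socket(void)
{
	/* Serialize initialization so concurrent connection attempts can't race. */
	pthread_mutex_lock(&my_BIO_methods_init_mutex);
	if (my_bio_methods == NULL)
	{
		my_bio_methods = BIO_meth_new(BIO_TYPE_SOCKET, "libpq socket");
		/* ... BIO_meth_set_read()/BIO_meth_set_write() calls elided ... */
	}
	pthread_mutex_unlock(&my_BIO_methods_init_mutex);
	return my_bio_methods;
}
```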

[1]: https://www.postgresql.org/docs/current/libpq-threading.html
[2]: https://commitfest.postgresql.org/
[3]: http://cfbot.cputube.org/

-- 
Best regards,
Aleksander Alekseev




Re: Change GUC hashtable to use simplehash?

2023-11-21 Thread John Naylor
On Mon, Nov 20, 2023 at 5:54 AM Jeff Davis  wrote:
>
> Attached are a bunch of tiny patches and some perf numbers based on
> simple test described here:
>
> https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com

I tried taking I/O out, like this, thinking the times would be less variable:

cat bench.sql
select 1 from generate_series(1,50) x(x), lateral (SELECT
inc_ab(x)) a offset 1000;

(with turbo off)
pgbench -n -T 30 -f bench.sql -M prepared

master:
latency average = 643.625 ms
0001-0005:
latency average = 607.354 ms

...about 5.5% less time, similar to what Jeff found.

I get a noticeable regression in 0002, though, and I think I see why:

 guc_name_hash(const char *name)
 {
- uint32 result = 0;
+ const unsigned char *bytes = (const unsigned char *)name;
+ int  blen  = strlen(name);

The strlen call required for hashbytes() is not free. The lack of
mixing in the (probably inlined after 0001) previous hash function can be
remedied directly, as in the attached:

0001-0002 only:
latency average = 670.059 ms

0001-0002, plus revert hashbytes, add finalizer:
latency average = 656.810 ms

-#define SH_EQUAL(tb, a, b) (guc_name_compare(a, b) == 0)
+#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0)

Likewise, I suspect calling out to the C library is going to throw
away some of the gains that were won by not needing to downcase all
the time, but I haven't dug deeper.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b51d10dbc0..124b8fbe85 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1363,10 +1363,17 @@ guc_name_compare(const char *namea, const char *nameb)
 static uint32
 guc_name_hash(const char *name)
 {
-   const unsigned char *bytes = (const unsigned char *)name;
-   int  blen  = strlen(name);
+   uint32  result = 0;
 
-   return hash_bytes(bytes, blen);
+   while (*name)
+   {
+   charch = *name++;
+
+   /* Merge into hash ... not very bright, but it needn't be */
+   result = pg_rotate_left32(result, 5);
+   result ^= (uint32) ch;
+   }
+   return murmurhash32(result);
 }
 
 /*


Re: Typo with amtype = 's' in opr_sanity.sql

2023-11-21 Thread Aleksander Alekseev
Hi,

> While rebasing a patch from 2016 related to sequence AMs (more about
> that later), I've bumped into a mistake from 8586bf7ed888 in
> opr_sanity.sql, as of:
> +SELECT p1.oid, p1.amname, p2.oid, p2.proname
> +FROM pg_am AS p1, pg_proc AS p2
> +WHERE p2.oid = p1.amhandler AND p1.amtype = 's' AND

Good catch.

> It seems to me that this has been copy-pasted on HEAD from the
> sequence AM patch, forgetting to update amtype to 't'.  While that's
> maybe cosmetic, I think that this could lead to unexpected results, so
> perhaps there is a point in doing a backpatch?

I disagree that it's cosmetic. The test doesn't check what it's supposed to.

-- 
Best regards,
Aleksander Alekseev




Re: How to accurately determine when a relation should use local buffers?

2023-11-21 Thread Vitaly Davydov
Hi Aleksander,

Thank you for the reply.

> Could you please provide a specific example when the current code will do
> something wrong/unintended?

I can't say that something is wrong in vanilla. But if you decide to
replicate DDL in some solutions like multimaster, you might want to
replicate CREATE TEMPORARY TABLE. Furthermore, there is some possible
inconsistency in the code shown below (REL_16_STABLE) in the bufmgr.c file:

   - FlushRelationBuffers and PrefetchBuffer use
   RelationUsesLocalBuffers(rel).
   - ExtendBufferedRel_common finally uses
   BufferManagerRelation.relpersistence, which is actually
   rd_rel->relpersistence and works like RelationUsesLocalBuffers.
   - ReadBuffer_common uses isLocalBuf = SmgrIsTemp(smgr), which checks
   rlocator.backend for InvalidBackendId.

I would like to clarify: do we completely refuse the use of temporary
tables in contexts other than backends, or is there some work in progress
to allow other usage contexts? If so, the check of
rd_rel->relpersistence is enough. I am not sure why we use SmgrIsTemp
instead of RelationUsesLocalBuffers in ReadBuffer_common.
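
To illustrate the pairing, a strict check would have to test both
properties at once. A minimal sketch against REL_16_STABLE follows; this
helper does not exist anywhere, it is just an illustration:

```c
#include "postgres.h"
#include "miscadmin.h"		/* MyBackendId */
#include "utils/rel.h"		/* Relation, RelationUsesLocalBuffers */

/*
 * Hypothetical helper: true only when the relation is temporary *and*
 * belongs to this backend, i.e. when the RelationUsesLocalBuffers()
 * and SmgrIsTemp()-style tests would both agree.
 */
static inline bool
RelationUsesLocalBuffersStrict(Relation rel)
{
	return RelationUsesLocalBuffers(rel) &&		/* relpersistence == 't' */
		rel->rd_backend == MyBackendId;			/* owned by this backend */
}
```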


With best regards,

Vitaly Davydov

Tue, Nov 21, 2023 at 11:52, Aleksander Alekseev :

> Hi,
>
> > I would like to clarify, what the correct way is to determine that a
> given relation is using local buffers. Local buffers, as far as I know, are
> used for temporary tables in backends. There are two functions/macros
> (bufmgr.c): SmgrIsTemp, RelationUsesLocalBuffers. The first function
> verifies that the current process is a regular session backend, while the
> other macro verifies the relation persistence characteristic. It seems, the
> use of each function independently is not correct. I think, these functions
> should be applied in pair to check for local buffers use, but, it seems,
> these functions are used independently. It works until temporary tables are
> allowed only in session backends.
>
> Could you please provide a specific example when the current code will
> do something wrong/unintended?
>
> > I'm concerned, how to determine the use of local buffers in some other
> theoretical cases? For example, if we decide to replicate temporary tables?
> Are there the other cases, when local buffers can be used with relations in
> the Vanilla? Do we allow the use of relations with RELPERSISTENCE_TEMP not
> only in session backends?
>
> Temporary tables, by definition, are visible only within one session.
> I can't imagine how and why they would be replicated.
>
> --
> Best regards,
> Aleksander Alekseev
>


-- 
With best regards,
Vitaly Davydov
http://www.vdavydov.ru


RE: Random pg_upgrade test failure on drongo

2023-11-21 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

This email gives an update. The machine drongo failed the test a week ago [1],
and I finally got the logfiles. PSA files.

## Observed failure

pg_upgrade_server.log is a server log taken during the pg_upgrade command.
According to it, the TRUNCATE command seems to have failed due to a "File
exists" error.

```
2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR:  could not create file "base/1/2683": File exists
2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT:
-- For binary upgrade, preserve pg_largeobject and index relfilenodes
SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid);
SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid);
TRUNCATE pg_catalog.pg_largeobject;
...
```

## Analysis

I think it was caused by the STATUS_DELETE_PENDING failure, and is not
related to the recent updates for pg_upgrade.

The file "base/1/2683" is an index file for pg_largeobject_loid_pn_index, and
the output means that the file creation failed. Below is a backtrace.

```
pgwin32_open() // <-- this returns -1
open()
BasicOpenFilePerm()
PathNameOpenFilePerm()
PathNameOpenFile()
mdcreate()
smgrcreate()
RelationCreateStorage()
RelationSetNewRelfilenumber()
ExecuteTruncateGuts()
ExecuteTruncate()
```

But this is strange. Before calling mdcreate(), we surely unlink the file
which has the same name. Below is a trace up to the unlink.

```
pgunlink()
unlink()
mdunlinkfork()
mdunlink()
smgrdounlinkall()
RelationSetNewRelfilenumber() // common path with above
ExecuteTruncateGuts()
ExecuteTruncate()
```

I found that Thomas said [2] that pgunlink() sometimes cannot remove a file
even if it returns OK; at that time the NTSTATUS is STATUS_DELETE_PENDING.
Also, a comment in pgwin32_open_handle() mentions the same thing:

```
/*
 * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet
 * gone (Windows NT status code is STATUS_DELETE_PENDING).  In that
 * case, we'd better ask for the NT status too so we can translate it
 * to a more Unix-like error.  We hope that nothing clobbers the NT
 * status in between the internal NtCreateFile() call and CreateFile()
 * returning.
 *
```

The definition of STATUS_DELETE_PENDING can be seen in [3]. Based on that,
indeed, open() would be able to fail with STATUS_DELETE_PENDING if the
deletion is pending but the file is opened again before it is gone.

Another thread [4] also hits the issue while doing rmtree->unlink, and it
retries the removal if it fails with STATUS_DELETE_PENDING. So, should we
likewise retry the open when it fails? Anyway, such a fix seems out of scope
for pg_upgrade.
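
For what it's worth, a minimal sketch of the retry idea could look like
the following. It is purely illustrative: the function name is invented,
and the real fix would presumably live near pgwin32_open() in src/port:

```c
#include "postgres.h"
#include "miscadmin.h"		/* pg_usleep() */
#include <fcntl.h>

/* Hypothetical helper: retry open() while a pending deletion makes it fail. */
static int
open_retry_delete_pending(const char *path, int flags, mode_t mode)
{
	int			fd;

	for (int i = 0; i < 100; i++)
	{
		fd = open(path, flags, mode);	/* pgwin32_open() on Windows */
		if (fd >= 0 || errno != EEXIST)
			break;				/* success, or an unrelated error */
		pg_usleep(10000L);		/* 10ms; wait out STATUS_DELETE_PENDING */
	}
	return fd;
}
```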

What do you think? Do you have any thoughts about it?

## Acknowledgement

I want to say thanks to Sholk and Vingesh for helping with the analysis.

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-11-15%2006%3A16%3A15
[2]: 
https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com
[3]: 
https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55
[4]: 
https://www.postgresql.org/message-id/flat/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de#6ae5e2ba3dd6e1fd680dcc34eab710d5

Best Regards,
Hayato Kuroda
FUJITSU LIMITED





Re: Why is src/test/modules/committs/t/002_standby.pl flaky?

2023-11-21 Thread Alexander Lakhin

Hello Thomas,

10.11.2023 06:31, Thomas Munro wrote:

Here is a new attempt to fix this mess.  Disclaimer: this is based
entirely on reading the manual and vicariously hacking a computer I
don't have via CI.


As it also might (and I would like it to) be the final attempt, I decided
to gather information on all the cases that we have had on this topic.
At least, for the last 5 years we've seen:

[1] 2019-01-18: Re: BUG #15598: PostgreSQL Error Code is not reported when
connection terminated due to idle-in-transaction timeout
    test 099_case_15598 made (in attachment)
no commit

[2] 2019-01-22: Rare SSL failures on eelpout
    references [1]
    test 099_rare_ssl_failures made (in attachment)
commit 2019-03-19 1f39a1c06: Restructure libpq's handling of send failures.

[3] 2019-11-28: pgsql: Add tests for '-f' option in dropdb utility.
    test 051_dropdb_force was proposed (included in the attachment)
commit 2019-11-30 98a9b37ba: Revert commits 290acac92b and 8a7e9e9dad.

[4] 2019-12-06: closesocket behavior in different platforms
    references [1], [2], [3]; a documentation change proposed
no commit

[5] 2020-06-03: libpq copy error handling busted
    test 099_pgbench_with_server_off made (in attachment)
commit 2020-06-07 7247e243a: Try to read data from the socket in pqSendSome's
write_failed paths. (a fix for 1f39a1c06)

[6] 2020-10-19: BUG #16678: The ecpg connect/test5 test sometimes fails on
Windows
no commit

[7] 2021-11-17: Windows: Wrong error message at connection termination
    references [6]
commit: 2021-12-02 6051857fc: On Windows, close the client socket explicitly
during backend shutdown.

[8] 2021-12-05 17:03:18: MSVC SSL test failure
commit: 2021-12-07 ed52c3707: On Windows, also call shutdown() while closing
the client socket.

[9] 2021-12-30: Why is src/test/modules/committs/t/002_standby.pl flaky?
    additional test 099_postgres_fdw_disconnect made (in attachment)
commit 2022-01-26 75674c7ec: Revert "graceful shutdown" changes for Windows,
in back branches only. (REL_14_STABLE)
commit 2022-03-22 29992a6a5: Revert "graceful shutdown" changes for Windows.
(master)

[10] 2022-02-02 19:19:22: BUG #17391: While using --with-ssl=openssl and
PG_TEST_EXTRA='ssl' options, SSL tests fail on OpenBSD 7.0
commit 2022-02-12 335fa5a26: Fix thinko in PQisBusy(). (a fix for 1f39a1c06)
commit 2022-02-12 faa189c93: Move libpq's write_failed mechanism down to
pqsecure_raw_write(). (a fix for 1f39a1c06)

As it becomes difficult to test/check all those cases scattered around, I
decided to retell the whole story by means of tests. Please look at the
script win-sock-tests-01.cmd attached, which can be executed both on
Windows (as regular cmd) and on Linux (bash win-sock-tests-01.cmd).

At the end of the script we can see several things.
First, the last patchset you posted, applied to b282fa88d~1, fixes the
issue discussed in this thread (it eliminates failures of
commit_ts_002_standby (and also 099_postgres_fdw_disconnect)).

Second, with the sleep added (see [6]), I had the same results of
`meson test` on Windows and on Linux.
Namely, there are some tests failing (see win-sock-tests-01.cmd) due to
walsender preventing server stop.
I describe this issue separately (see details in walsender-cannot-exit.txt;
maybe it's worth discussing in a separate thread) as it's kind of
off-topic. With the supplementary sleep() added to WalSndLoop(), the
complete `meson test` passes successfully both on Windows and on Linux.

Third, cases [1] and [3] are still broken, due to a Windows peculiarity.
Please see server.c and client.c attached, which demonstrate:
Case "no shutdown/closesocket" on Windows:
C:\src>server
Listening for incoming connections...
                        C:\src>client
Client connected: 127.0.0.1:64395
                        Connection to server established. Enter message: msg
Client message: msg
Sending message...
                        Sleeping...
Exiting...
C:\src>
                        Calling recv()...
                        recv() failed

Case "no shutdown/closesocket" on Linux:
$ server
Listening for incoming connections...
                        $ client
Client connected: 127.0.0.1:33044
                        Connection to server established. Enter message: msg
Client message: msg
Sending message...
                        Sleeping...
Exiting...
$
                        Calling recv()...
                        Server message: MESSAGE

Case "shutdown/closesocket" on Windows:
C:\src>server shutdown closesocket
Listening for incoming connections...
                        C:\src>client
Client connected: 127.0.0.1:64396
                        Connection to server established. Enter message: msg
Client message: msg
Sending message...
                        Sleeping...
Calling shutdown()...
Calling closesocket()...
Exiting...
C:\src>
                        Calling recv()...
                        Server message: MESSAGE

That's okay so far, but what makes cases [1]/[3] different

Re: Synchronizing slots from primary to standby

2023-11-21 Thread Drouvot, Bertrand

Hi,

On 11/21/23 10:32 AM, shveta malik wrote:

On Tue, Nov 21, 2023 at 2:02 PM shveta malik  wrote:





v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, so I
rebased the patches.  PFA the v37_2 patches.


Thanks!

Regarding the promotion flow: if the primary is available and reachable, I
don't think we currently try to ensure that slots are in sync. I think we'd
miss the activity between the last sync and the promotion request, or am I
missing something?

If the primary is available and reachable, shouldn't we launch a last round
of synchronization (skipping all the slots that are not in the 'r' state)?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Use of backup_label not noted in log

2023-11-21 Thread David Steele

On 11/20/23 23:54, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 03:31:20PM -0400, David Steele wrote:


I still wonder if we need "base backup" in the messages? That sort of
implies (at least to me) you used pg_basebackup but that may not be the
case.


Or just s/base backup/backup/?


That's what I meant but did not explain very well.

Regards,
-David




Re: POC, WIP: OR-clause support for indexes

2023-11-21 Thread Alena Rybakina

On 21.11.2023 03:50, Alena Rybakina wrote:

On 20.11.2023 11:52, Andrei Lepikhov wrote:
Looking into the patch, I found some trivial improvements (see
attachment).
Also, it is not obvious that using a string representation of the
clause as a hash table key is needed here. Also, by making a copy of
the node in get_key_nconst_node(), you replace the location field, but
in the case of a complex expression, you don't do the same with other
nodes.
I propose to generate an expression hash instead + prove the equality
of two expressions by calling equal().


I was thinking about your last email and a possible error where the 
location field may not be cleared in complex expressions. 
Unfortunately, I didn't come up with any examples either, but I think 
I was able to handle this with a special function that removes 
location-related patterns. The alternative to this is to bypass this 
expression, but I think it will be more invasive. In addition, I have 
added changes related to the hash table: now the key is of type int.


All changes are displayed in the attached 
v9-0001-Replace-OR-clause-to_ANY.diff.txt file.


I haven't measured it yet. But what do you think about these changes?


Sorry, I lost your changes during the revision process. I returned
them. I raised the patch version, just in case, to run CI successfully.


--
Regards,
Alena Rybakina
Postgres Professional
From 66d1c479347d80ea45fc7aebaed873d45f9c4351 Mon Sep 17 00:00:00 2001
From: Alena Rybakina 
Date: Tue, 21 Nov 2023 14:20:56 +0300
Subject: [PATCH] [PATCH] Replace OR clause to ANY expressions. Replace (X=N1)
 OR (X=N2) ... with X = ANY(N1, N2) on the stage of the optimiser when we are
 still working with a tree expression. Firstly, we do not try to make a
 transformation for "non-or" expressions or inequalities and the creation of a
 relation with "or" expressions occurs according to the same scenario.
 Secondly, we do not make transformations if there are less than set
 or_transform_limit. Thirdly, it is worth considering that we consider "or"
 expressions only at the current level.

---
 src/backend/parser/parse_expr.c   | 280 +-
 src/backend/utils/misc/guc_tables.c   |  10 +
 src/include/parser/parse_expr.h   |   1 +
 src/test/regress/expected/create_index.out| 115 +++
 src/test/regress/expected/guc.out |   3 +-
 src/test/regress/expected/join.out|  50 
 src/test/regress/expected/partition_prune.out | 156 ++
 src/test/regress/expected/sysviews.out|   3 +-
 src/test/regress/expected/tidscan.out |  17 ++
 src/test/regress/sql/create_index.sql |  32 ++
 src/test/regress/sql/join.sql |  10 +
 src/test/regress/sql/partition_prune.sql  |  22 ++
 src/test/regress/sql/tidscan.sql  |   6 +
 src/tools/pgindent/typedefs.list  |   1 +
 14 files changed, 703 insertions(+), 3 deletions(-)

diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c
index 64c582c344c..7b76c9f29a1 100644
--- a/src/backend/parser/parse_expr.c
+++ b/src/backend/parser/parse_expr.c
@@ -18,6 +18,7 @@
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
 #include "catalog/pg_type.h"
+#include "common/hashfn.h"
 #include "commands/dbcommands.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
@@ -43,6 +44,7 @@
 
 /* GUC parameters */
 bool		Transform_null_equals = false;
+bool		enable_or_transformation = false;
 
 
 static Node *transformExprRecurse(ParseState *pstate, Node *expr);
@@ -98,7 +100,283 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname,
 			  Node *ltree, Node *rtree, int location);
 static Node *make_nulltest_from_distinct(ParseState *pstate,
 		 A_Expr *distincta, Node *arg);
+typedef struct OrClauseGroupEntry
+{
+	int32			hash_leftvar_key;
+
+	Node		   *node;
+	List		   *consts;
+	Oidscalar_type;
+	Oidopno;
+	Node 		   *expr;
+} OrClauseGroupEntry;
+
+/*
+ * TODO: consider different algorithm to manage complexity
+ * in the case of many different clauses,
+ * like Rabin-Karp or Boyer–Moore algorithms.
+ */
+static char *
+clear_patterns(const char *str, const char *start_pattern)
+{
+	int			i = 0;
+	int			j = 0;
+	int			start_pattern_len = strlen(start_pattern);
+	char	   *res = palloc0(sizeof(*res) * (strlen(str) + 1));
+
+	for (i = 0; str[i];)
+	{
+		if (i >= start_pattern_len && strncmp(&str[i - start_pattern_len],
+			  start_pattern,
+			  start_pattern_len) == 0)
+		{
+			while (str[i] && !(str[i] == '{' || str[i] == '}'))
+i++;
+		}
+		if (str[i])
+			res[j++] = str[i++];
+	}
+
+	return res;
+}
+
+static int32
+get_key_nconst_node(Node *nconst_node)
+{
+	char *str = nodeToString(nconst_node);
+
+	str = clear_patterns(str, " :location");
+
+	if (str == NULL)
+		return 0;
+
+	return DatumGetInt32(hash_any((unsigned char *)str, strlen(str)));
+}
+
+static Node *
+transformBoolExprOr(ParseState *pstate, Bool

Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/20/23 19:58, Andres Freund wrote:


On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup


That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?


I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is absent. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for a
pg_basebackup style backup, but it wouldn't do anything for external backups.
But imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.
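
For concreteness, the recovery-time cross-check being described might look
roughly like the sketch below; the pg_control field and the
backup_label_exists() helper are invented for illustration only:

```c
/* Sketch only: backupLabelRequired and backup_label_exists() are hypothetical. */
if (ControlFile->backupLabelRequired && !backup_label_exists())
	ereport(FATAL,
			(errmsg("backup_label is required to recover from this backup"),
			 errhint("The backup_label file may have been removed after the "
					 "base backup was taken.")));
```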


I'm not sure why you think the patch under discussion doesn't do 
anything for external backups. It provides the same benefits to both 
pg_basebackup and external backups, i.e. they both receive the updated 
version of pg_control.


I really dislike the idea of pg_basebackup having a special mechanism 
for making recovery safer that is not generally available to external 
backup software. It might be easy enough for some (e.g. pgBackRest) to 
manipulate pg_control but would be out of reach for most.


Regards,
-David





undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-11-21 Thread Tomas Vondra
Hi,

I decided to do some stress-testing of the built-in logical replication,
as part of the sequence decoding work. And I soon ran into an undetected
deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(

The attached bash scripts trigger that in a couple of seconds for me. The
main script looks complicated, but most of the code is waiting for sync to
complete, catchup, and that sort of thing.

What the script does is pretty simple:

1) initialize two clusters, set them as publisher/subscriber pair

2) create some number of tables, add them to publication and wait for
   the sync to complete

3) start two pgbench runs in the background, modifying the publication
   (one removes+adds all tables in a single transaction, one does that
with transaction per table)

4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
   in a loop (now that I think about it, could be another pgbench
   script, but well ...)

5) some consistency checks, but the lockup happens earlier so this does
   not really matter

After a small number of refresh cycles (for me it's usually a couple
dozen), we end up with a couple stuck locks (I shortened the backend
type string a bit, for formatting reasons):

  test=# select a.pid, classid, objid, backend_type, query
   from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
  where not granted;

     pid   | classid | objid |   backend_type   |        query
   --------+---------+-------+------------------+----------------------
   2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
                                                  REFRESH PUBLICATION
   2691837 |    6100 | 16785 | tablesync worker |
   2691936 |    6100 | 16785 | tablesync worker |
  (3 rows)

All these backends wait for 6100/16785, which is the subscription row in
pg_subscription. The tablesync workers are requesting AccessShareLock,
the client backend however asks for AccessExclusiveLock.

The entry is currently locked by:

  test=# select a.pid, mode, backend_type from pg_locks l
   join pg_stat_activity a on (a.pid = l.pid)
  where classid=6100 and objid=16785 and granted;

     pid   |      mode       |           backend_type
   --------+-----------------+----------------------------------
   2690477 | AccessShareLock | logical replication apply worker
  (1 row)

But the apply worker is not waiting for any locks, so what's going on?

Well, the problem is the apply worker is waiting for notification from
the tablesync workers that the relation is synced, which happens through
updating the pg_subscription_rel row. And that wait happens in
wait_for_relation_state_change, which simply checks the row in a loop,
with a sleep by WaitLatch().

Unfortunately, the tablesync workers can't update the row because the
client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in, and waits for an AccessExclusiveLock. So the tablesync
workers are stuck in the queue and can't proceed.

The client backend can't proceed, because it's waiting for a lock held
by the apply worker.

The tablesync workers can't proceed because their lock request is stuck
behind the AccessExclusiveLock request.

And the apply worker can't proceed, because it's waiting for status
update from the tablesync workers.

And the deadlock is undetected, because the apply worker is not waiting
on a lock, but sleeping on a latch :-(


I don't know what's the right solution here. I wonder if the apply
worker might release the lock before waiting for the update, that'd
solve this whole issue.

Alternatively, ALTER SUBSCRIPTION might wait for the lock only for a
limited amount of time, and try again (but then it'd be susceptible to
starvation, of course).

Or maybe there's a way to make this work in a way that would be visible
to the deadlock detector? That'd mean we occasionally get processes
killed to resolve a deadlock, but that's still better than processes
stuck indefinitely ...


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

lockup-test.sh
Description: application/shellscript


refresh.sh
Description: application/shellscript


Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-11-21 Thread Amit Kapila
On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
 wrote:
>
> I decided to do some stress-testing of the built-in logical replication,
> as part of the sequence decoding work. And I soon ran into an undetected
> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>
> The attached bash scripts triggers that in a couple seconds for me. The
> script looks complicated, but most of the code is waiting for sync to
> complete, catchup, and that sort of thing.
>
> What the script does is pretty simple:
>
> 1) initialize two clusters, set them as publisher/subscriber pair
>
> 2) create some number of tables, add them to publication and wait for
>the sync to complete
>
> 3) start two pgbench runs in the background, modifying the publication
>(one removes+adds all tables in a single transaction, one does that
> with transaction per table)
>
> 4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>in a loop (now that I think about it, could be another pgbench
>script, but well ...)
>
> 5) some consistency checks, but the lockup happens earlier so this does
>not really matter
>
> After a small number of refresh cycles (for me it's usually a couple
> dozen), we end up with a couple stuck locks (I shortened the backend
> type string a bit, for formatting reasons):
>
>   test=# select a.pid, classid, objid, backend_type, query
>from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>   where not granted;
>
>      pid   | classid | objid |   backend_type   |        query
>    --------+---------+-------+------------------+----------------------
>    2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>                                                   REFRESH PUBLICATION
>    2691837 |    6100 | 16785 | tablesync worker |
>    2691936 |    6100 | 16785 | tablesync worker |
>   (3 rows)
>
> All these backends wait for 6100/16785, which is the subscription row in
> pg_subscription. The tablesync workers are requesting AccessShareLock,
> the client backend however asks for AccessExclusiveLock.
>
> The entry is currently locked by:
>
>   test=# select a.pid, mode, backend_type from pg_locks l
>join pg_stat_activity a on (a.pid = l.pid)
>   where classid=6100 and objid=16785 and granted;
>
>      pid   |      mode       |           backend_type
>    --------+-----------------+----------------------------------
>    2690477 | AccessShareLock | logical replication apply worker
>   (1 row)
>
> But the apply worker is not waiting for any locks, so what's going on?
>
> Well, the problem is the apply worker is waiting for notification from
> the tablesync workers that the relation is synced, which happens through
> updating the pg_subscription_rel row. And that wait happens in
> wait_for_relation_state_change, which simply checks the row in a loop,
> with a sleep by WaitLatch().
>
> Unfortunately, the tablesync workers can't update the row because the
> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
> sneaked in, and waits for an AccessExclusiveLock. So the tablesync
> workers are stuck in the queue and can't proceed.
>
> The client backend can't proceed, because it's waiting for a lock held
> by the apply worker.
>

It seems there is some inconsistency in what you have written for
client backends/tablesync worker vs. apply worker. The above text
seems to be saying that the client backend and table sync worker are
waiting on a "subscription row in pg_subscription" and the apply
worker is operating on "pg_subscription_rel". So, if that is true then
they shouldn't get stuck.

I think here client backend and tablesync worker seems to be blocked
for a lock on pg_subscription_rel.

> The tablesync workers can't proceed because their lock request is stuck
> behind the AccessExclusiveLock request.
>
> And the apply worker can't proceed, because it's waiting for status
> update from the tablesync workers.
>

This part is not clear to me because
wait_for_relation_state_change()->GetSubscriptionRelState() seems to
be releasing the lock while closing the relation. Am I missing
something?

-- 
With Regards,
Amit Kapila.




Re: meson documentation build open issues

2023-11-21 Thread Peter Eisentraut

On 21.11.23 02:56, Andres Freund wrote:

One remaining question is whether we should adjust install-doc-{html,man} to
be install-{html,man}, to match the docs targets.


Ah didn't notice that one; yes please.




Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-11-21 Thread Tomas Vondra



On 11/21/23 14:16, Amit Kapila wrote:
> On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
>  wrote:
>>
>> I decided to do some stress-testing of the built-in logical replication,
>> as part of the sequence decoding work. And I soon ran into an undetected
>> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-(
>>
>> The attached bash scripts triggers that in a couple seconds for me. The
>> script looks complicated, but most of the code is waiting for sync to
>> complete, catchup, and that sort of thing.
>>
>> What the script does is pretty simple:
>>
>> 1) initialize two clusters, set them as publisher/subscriber pair
>>
>> 2) create some number of tables, add them to publication and wait for
>>the sync to complete
>>
>> 3) start two pgbench runs in the background, modifying the publication
>>(one removes+adds all tables in a single transaction, one does that
>> with transaction per table)
>>
>> 4) run refresh.sh which does ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>>in a loop (now that I think about it, could be another pgbench
>>script, but well ...)
>>
>> 5) some consistency checks, but the lockup happens earlier so this does
>>not really matter
>>
>> After a small number of refresh cycles (for me it's usually a couple
>> dozen), we end up with a couple stuck locks (I shortened the backend
>> type string a bit, for formatting reasons):
>>
>>   test=# select a.pid, classid, objid, backend_type, query
>>from pg_locks l join pg_stat_activity a on (a.pid = l.pid)
>>   where not granted;
>>
>>      pid   | classid | objid |   backend_type   |        query
>>    --------+---------+-------+------------------+----------------------
>>    2691941 |    6100 | 16785 | client backend   | ALTER SUBSCRIPTION s
>>                                                   REFRESH PUBLICATION
>>    2691837 |    6100 | 16785 | tablesync worker |
>>    2691936 |    6100 | 16785 | tablesync worker |
>>   (3 rows)
>>
>> All these backends wait for 6100/16785, which is the subscription row in
>> pg_subscription. The tablesync workers are requesting AccessShareLock,
>> the client backend however asks for AccessExclusiveLock.
>>
>> The entry is currently locked by:
>>
>>   test=# select a.pid, mode, backend_type from pg_locks l
>>join pg_stat_activity a on (a.pid = l.pid)
>>   where classid=6100 and objid=16785 and granted;
>>
>>      pid   |      mode       |           backend_type
>>    --------+-----------------+----------------------------------
>>    2690477 | AccessShareLock | logical replication apply worker
>>   (1 row)
>>
>> But the apply worker is not waiting for any locks, so what's going on?
>>
>> Well, the problem is the apply worker is waiting for notification from
>> the tablesync workers that the relation is synced, which happens through
>> updating the pg_subscription_rel row. And that wait happens in
>> wait_for_relation_state_change, which simply checks the row in a loop,
>> with a sleep by WaitLatch().
>>
>> Unfortunately, the tablesync workers can't update the row because the
>> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
>> sneaked in, and waits for an AccessExclusiveLock. So the tablesync
>> workers are stuck in the queue and can't proceed.
>>
>> The client backend can't proceed, because it's waiting for a lock held
>> by the apply worker.
>>
> 
> It seems there is some inconsistency in what you have written for
> client backends/tablesync worker vs. apply worker. The above text
> seems to be saying that the client backend and table sync worker are
> waiting on a "subscription row in pg_subscription" and the apply
> worker is operating on "pg_subscription_rel". So, if that is true then
> they shouldn't get stuck.
> 
> I think here client backend and tablesync worker seems to be blocked
> for a lock on pg_subscription_rel.
> 

Not really, they are all locking the subscription. All the locks are on
classid=6100, which is pg_subscription:

  test=# select 6100::regclass;
  regclass
  -
   pg_subscription
  (1 row)

The thing is, the tablesync workers call UpdateSubscriptionRelState,
which locks the pg_subscription catalog at the very beginning:

   LockSharedObject(SubscriptionRelationId, ...);

So that's the issue. I haven't explored why it's done this way, and
there's no comment explaining why locking the subscription is needed ...

>> The tablesync workers can't proceed because their lock request is stuck
>> behind the AccessExclusiveLock request.
>>
>> And the apply worker can't proceed, because it's waiting for status
>> update from the tablesync workers.
>>
> 
> This part is not clear to me because
> wait_for_relation_state_change()->GetSubscriptionRelState() seems to
> be releasing the lock while closing the relation. Am, I missing
> something?
> 

I think you're missing the fact that GetSubscriptionRelState() acquires
and releases the lock on pg_subscription_rel, but that's not the lock
causing th

Re: Annoying build warnings from latest Apple toolchain

2023-11-21 Thread Robert Haas
On Mon, Nov 20, 2023 at 11:37 PM Andres Freund  wrote:
> WRT Robert seeing those warnings and Tom not: There's something odd going
> on. I couldn't immediately reproduce it. Then I realized it reproduces against
> a homebrew install but not a macports one.
>
> Robert, which are you using?

macports

> Meson actually *tries* to avoid this warning without resulting in incorrect
> results due to ordering. But homebrew does something odd, with libxml-2.0,
> zlib and a few others: Unless you do something to change that, brew has
> /opt/homebrew/Library/Homebrew/os/mac/pkgconfig/14/ in its search path, but
> those libraries aren't from homebrew, they're referencing macos
> frameworks. Apparently meson isn't able to understand which files those .pc
> files link to and gives up on the deduplicating.
>
> If I add to the pkg-config search path, e.g. via
> meson configure 
> -Dpkg_config_path=$OTHER_STUFF:/opt/homebrew/opt/zlib/lib/pkgconfig/:/opt/homebrew/opt/libxml2/lib/pkgconfig/
>
> the warnings about duplicate libraries vanish.

Hmm, I'm happy to try something here, but I'm not sure exactly what.
I'm not setting pkg_config_path at all right now. I'm using this:

meson setup $HOME/pgsql $HOME/pgsql-meson -Dcassert=true -Ddebug=true
-Dextra_include_dirs=/opt/local/include
-Dextra_lib_dirs=/opt/local/lib -Dprefix=$HOME/install/dev

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: trying again to get incremental backup

2023-11-21 Thread Jakub Wartak
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas  wrote:
>
> On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera  
> wrote:
> > I made a pass over pg_combinebackup for NLS.  I propose the attached
> > patch.
>
> This doesn't quite compile for me so I changed a few things and
> incorporated it. Hopefully I didn't mess anything up.
>
> Here's v11.
[..]

> I wish I had better ideas about how to thoroughly test this. [..]

Hopefully the below adds some confidence. I've done some further
quick(?) checks today and the results are good:

make check-world #GOOD
test_full_pri__incr_stby__restore_on_pri.sh #GOOD
test_full_pri__incr_stby__restore_on_stby.sh #GOOD*
test_full_stby__incr_stby__restore_on_pri.sh #GOOD
test_full_stby__incr_stby__restore_on_stby.sh #GOOD*
test_many_incrementals_dbcreate.sh #GOOD
test_many_incrementals.sh #GOOD
test_multixact.sh #GOOD
test_pending_2pc.sh #GOOD
test_reindex_and_vacuum_full.sh #GOOD
test_truncaterollback.sh #GOOD
test_unlogged_table.sh #GOOD
test_across_wallevelminimal.sh # GOOD (expected failure: WAL summaries
are off while wal_level=minimal, so an incremental backup cannot be
taken --> a full backup needs to be taken after wal_level=minimal)

CFbot failed on two hosts this time, I haven't looked at the details
yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? ->
LOG:  WAL summarizer process (PID 71511) was terminated by signal 6:
Aborted?)

The remaining test idea is to have a longer running DB under stress
test in more real-world conditions and try to recover using chained
incremental backups (one such test was carried out on patchset v6 and
the result was good back then).

-J.




Re: About #13489, array dimensions and CREATE TABLE ... LIKE

2023-11-21 Thread Bruce Momjian
On Tue, Nov 21, 2023 at 09:33:18AM +0100, Laurenz Albe wrote:
> On Mon, 2023-11-20 at 21:13 -0500, Tom Lane wrote:
> > Bruce Momjian  writes:
> > > On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote:
> > > > Bruce Momjian  writes:
> > > > > An alternate approach would
> > > > > be to remove pg_attribute.attndims so we don't even try to preserve 
> > > > > dimensionality.
> > 
> > > > I could get behind that, perhaps.  It looks like we're not using the
> > > > field in any meaningful way, and we could simplify TupleDescInitEntry
> > > > and perhaps some other APIs.
> > 
> > > So should I work on that patch or do you want to try?  I think we should
> > > do something.
> > 
> > Let's wait for some other opinions, first ...
> 
> Looking at the code, I get the impression that we wouldn't lose anything
> without "pg_attribute.attndims", so +1 for removing it.
> 
> This would call for some documentation.  We should remove most of the
> documentation about the non-existing difference between declaring a column
> "integer[]", "integer[][]" or "integer[3][3]" and just describe the first
> variant in detail, perhaps mentioning that the other notations are
> accepted for backward compatibility.

Agreed, I see:

https://www.postgresql.org/docs/current/arrays.html

However, the current implementation ignores any supplied array
size limits, i.e., the behavior is the same as for arrays of
unspecified length.

The current implementation does not enforce the declared number
of dimensions either.

So both size limits and dimensions would be ignored.

> I also think that it would be helpful to emphasize that while dimensionality
> does not matter to a column definition, it matters for individual array
> values. Perhaps it would make sense to recommend a check constraint if one
> wants to make sure that an array column contains only a certain kind of
> array.

The CHECK constraint idea is very good.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }

2023-11-21 Thread Robert Haas
On Mon, Nov 20, 2023 at 5:27 PM Jeff Davis  wrote:
> Of course I welcome others to profile and see what they can do. There's
> a setjmp() call, and a couple allocations, and maybe some other stuff
> to look at. There are also higher-level ideas, like avoiding calling
> into guc.c in some cases, but that starts to get tricky as you pointed
> out:
>
> https://www.postgresql.org/message-id/CA%2BTgmoa8uKQgak5wH0%3D7sL-ukqbwnCPMXA2ZW7Ccdt7tdNGkzg%40mail.gmail.com
>
> It seems others are also interested in this problem, so I can put some
> more effort in after this round of patches goes in. I don't have a
> specific target other than "low enough overhead that we can reasonably
> recommend it as a best practice in multi-user environments".

The two things that jump out at me are the setjmp() and the
hash_search() call inside find_option(). As to the first, could we
remove the setjmp() and instead have abort-time processing know
something about this? For example, imagine we just push something onto
a stack like we do for ErrorContextCallback, do whatever, and then pop
it off. But if an error is thrown then the abort path knows to look at
that variable and do whatever. As to the second, could we somehow come
up with an API for guc.c where you can ask for an opaque handle that
will later allow you to do a fast-SET of a GUC? The opaque handle
would basically be the hashtable entry, perhaps with some kind of
wrapping or decoration. Then fmgr_security_definer() could obtain the
opaque handles and cache them in fn_extra.

(I'm just spitballing here.)
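
For concreteness, the opaque-handle idea might look roughly like the sketch
below. All of these names are invented; nothing like this exists in guc.c
today:

```c
/* Hypothetical API sketch -- none of these declarations exist in guc.c. */
typedef struct config_generic *GucHandle;	/* opaque: really the hash entry */

/* Pay for the hash_search() once, e.g. on first call, and cache the result. */
extern GucHandle GucGetHandle(const char *name);

/* Fast SET that skips the name lookup entirely. */
extern void GucFastSet(GucHandle handle, const char *value,
					   GucContext context, GucSource source);
```

fmgr_security_definer() could then stash the handles in fn_extra, as
suggested above.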

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: PSQL error: total cell count of XXX exceeded

2023-11-21 Thread Alvaro Herrera
On 2023-Sep-13, Hongxu Ma wrote:

> After double-checking, it looks like `int64` from src/include/c.h is the
> proper type for it.
> Uploaded the v4 patch to fix it.

Right.  I made a few more adjustments, including the additional overflow
check in printTableInit that Tom Lane suggested, and pushed this.

It's a bit annoying that the error recovery decision of this code is to
exit the process with an error.  If somebody were to be interested in a
fun improvement exercise, it may be worth redoing the print.c API so
that it returns errors that psql can report and recover from, instead of
just closing the process.

TBH though, I've never hit that code in real usage.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




pg_class.relpages not updated for toast index

2023-11-21 Thread Fabrízio de Royes Mello
Hi all,

I was doing a relation size estimation based on pg_class.relpages of the
relation and the related objects (index, toast) and noticed that it is not
updated for the toast index. For example:

fabrizio=# CREATE TABLE t(c TEXT);
INSERT INTO t VALUES (repeat('x', (8192^2)::int));

VACUUM (ANALYZE) t;
CREATE TABLE
INSERT 0 1
VACUUM
fabrizio=# \x on
Expanded display is on.
fabrizio=# SELECT
  c.oid,
  c.relname,
  c.relpages,
  t.relname,
  t.relpages AS toast_pages,
  ci.relname,
  ci.relpages AS toast_index_pages,
  (pg_stat_file(pg_relation_filepath(ci.oid))).size AS toast_index_size
FROM
  pg_class c
  JOIN pg_class t ON t.oid = c.reltoastrelid
  JOIN pg_index i ON i.indrelid = t.oid
  JOIN pg_class ci ON ci.oid = i.indexrelid
WHERE
  c.oid = 't'::regclass;
-[ RECORD 1 ]-+-
oid   | 17787
relname   | t
relpages  | 1
relname   | pg_toast_17787
toast_pages   | 97
relname   | pg_toast_17787_index
toast_index_pages | 1
toast_index_size  | 16384

Are there any reasons for toast index relpages not to be updated? Or is it
a bug?

Regards,

-- 
Fabrízio de Royes Mello


Re: PSQL error: total cell count of XXX exceeded

2023-11-21 Thread Tom Lane
Alvaro Herrera  writes:
> Right.  I made a few more adjustments, including the additional overflow
> check in printTableInit that Tom Lane suggested, and pushed this.

Committed patch LGTM.

> It's a bit annoying that the error recovery decision of this code is to
> exit the process with an error.  If somebody were to be interested in a
> fun improvement exercise, it may be worth redoing the print.c API so
> that it returns errors that psql can report and recover from, instead of
> just closing the process.
> TBH though, I've never hit that code in real usage.

Yeah, I think the reason it's stayed like that for 25 years is that
nobody's hit the case in practice.  Changing the API would be a bit
troublesome, too, because we don't know if anybody besides psql
uses it.

regards, tom lane




Re: meson documentation build open issues

2023-11-21 Thread Peter Eisentraut

On 21.11.23 14:23, Peter Eisentraut wrote:
> On 21.11.23 02:56, Andres Freund wrote:
>> One remaining question is whether we should adjust
>> install-doc-{html,man} to be install-{html,man}, to match the docs
>> targets.
>
> Ah didn't notice that one; yes please.

I think this was done?




Re: Annoying build warnings from latest Apple toolchain

2023-11-21 Thread Peter Eisentraut

On 21.11.23 14:35, Robert Haas wrote:
> On Mon, Nov 20, 2023 at 11:37 PM Andres Freund  wrote:
>> WRT Robert seeing those warnings and Tom not: There's something odd going
>> on. I couldn't immediately reproduce it. Then I realized it reproduces
>> against a homebrew install but not a macports one.
>>
>> Robert, which are you using?
>
> macports


Btw., I'm also seeing warnings like this.  I'm using homebrew.  Here is 
a sample:


[145/147] Linking target src/test/modules/test_shm_mq/test_shm_mq.dylib
-macosx_version_min has been renamed to -macos_version_min
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lgcc'
[146/147] Linking target src/test/modules/test_slru/test_slru.dylib





Re: How to accurately determine when a relation should use local buffers?

2023-11-21 Thread Aleksander Alekseev
Hi,

> Furthermore, there is some possible inconsistency in the code show below 
> (REL_16_STABLE) in bufmgr.c file:
>
> FlushRelationBuffers, PrefetchBuffer uses RelationUsesLocalBuffers(rel).
> ExtendBufferedRel_common finally use BufferManagerRelation.relpersistence 
> which is actually rd_rel->relpersistence, works like RelationUsesLocalBuffers.
> ReadBuffer_common uses isLocalBuf = SmgrIsTemp(smgr), that checks 
> rlocator.backend for InvalidBackendId.

I didn't do a deep investigation of the code in this particular aspect
but that could be a fair point. Would you like to propose a
refactoring that unifies the way we check if the relation is
temporary?

> I would like to clarify, do we completely refuse the use of temporary tables 
> in other contexts than in backends or there is some work-in-progress to allow 
> some other usage contexts? If so, the check of rd_rel->relpersistence is 
> enough. Not sure why we use SmgrIsTemp instead of RelationUsesLocalBuffers in 
> ReadBuffer_common.

According to the comments in relfilelocator.h:

```
/*
 * Augmenting a relfilelocator with the backend ID provides all the information
 * we need to locate the physical storage.  The backend ID is InvalidBackendId
 * for regular relations (those accessible to more than one backend), or the
 * owning backend's ID for backend-local relations.  Backend-local relations
 * are always transient and removed in case of a database crash; they are
 * never WAL-logged or fsync'd.
 */
typedef struct RelFileLocatorBackend
{
    RelFileLocator locator;
    BackendId      backend;
} RelFileLocatorBackend;

#define RelFileLocatorBackendIsTemp(rlocator) \
    ((rlocator).backend != InvalidBackendId)
```

And this is what ReadBuffer_common() and other callers of SmgrIsTemp()
are using. So no, you can't have a temporary table without an assigned
RelFileLocatorBackend.backend.

It is my understanding that SmgrIsTemp() and RelationUsesLocalBuffers()
are equivalent, except that the first macro works with SMgrRelation
objects and the second one with Relation objects.
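
For reference, the two definitions in REL_16_STABLE (rel.h and smgr.h
respectively; quoting from memory, so double-check the headers):

```
#define RelationUsesLocalBuffers(relation) \
    ((relation)->rd_rel->relpersistence == RELPERSISTENCE_TEMP)

#define SmgrIsTemp(smgr) \
    RelFileLocatorBackendIsTemp((smgr)->smgr_rlocator)
```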

-- 
Best regards,
Aleksander Alekseev




Re: Hide exposed impl detail of wchar.c

2023-11-21 Thread Eric Ridge
> On Nov 20, 2023, at 7:10 PM, Andres Freund  wrote:
> 
> 
> What I don't quite get is why SIMD headers are particularly more problematic
> than others - there's other headers that are compiler specific?

The short answer is the rust-based bindings generation tool pgrx uses (bindgen) 
is a little brain dead and gets confused when there’s multiple compiler 
builtins headers on the host.

This simd header is the first of its kind we’ve run across that’s exposed via 
Postgres’ “public api”. And bindgen wants to follow all the includes, it gets 
confused, picks the wrong one, and then errors happen.

And I don’t know that it makes much sense for Postgres to include such a header 
into 3rd-party code anyways. 

I think Jubilee is also working with them to fix this, but we’re hoping 
Jubilee’s patch here (or similar) can get merged so we can clear our build 
drama.

eric



Re: Annoying build warnings from latest Apple toolchain

2023-11-21 Thread Aleksander Alekseev
Hi,

> > On Mon, Nov 20, 2023 at 11:37 PM Andres Freund  wrote:
> >> WRT Robert seeing those warnings and Tom not: There's something odd going
> >> on. I couldn't immediately reproduce it. Then I realized it reproduces 
> >> against
> >> a homebrew install but not a macports one.
> >>
> >> Robert, which are you using?
> >
> > macports
>
> Btw., I'm also seeing warnings like this.  I'm using homebrew.  Here is
> a sample:
>
> [145/147] Linking target src/test/modules/test_shm_mq/test_shm_mq.dylib
> -macosx_version_min has been renamed to -macos_version_min
> ld: warning: -undefined error is deprecated
> ld: warning: ignoring duplicate libraries: '-lgcc'
> [146/147] Linking target src/test/modules/test_slru/test_slru.dylib

I prefer not to build Postgres on Mac because it slows down GMail and
Slack. After reading this discussion I tried and I can confirm I see
the same warnings Robert does:

```
[322/1905] Linking target src/interfaces/libpq/libpq.5.dylib
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lz'
[326/1905] Linking target src/timezone/zic
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lz'
[1113/1905] Linking target src/backend/postgres
ld: warning: -undefined error is deprecated
ld: warning: ignoring duplicate libraries: '-lpam', '-lxml2', '-lz'
[1124/1905] Linking target src/backend/replication/pgoutput/pgoutput.dylib
ld: warning: -undefined error is deprecated
[1125/1905] Linking target
src/backend/replication/libpqwalreceiver/libpqwalreceiver.dylib
ld: warning: -undefined error is deprecated

... many more ...
```

My laptop is an Intel MacBook Pro 2019. The MacOS version is Sonoma
14.0 and I'm using homebrew. `xcode-select --version` says 2399. I'm
using the following command:

```
XML_CATALOG_FILES=/usr/local/etc/xml/catalog time -p sh -c 'git clean
-dfx && meson setup --buildtype debug -DPG_TEST_EXTRA="kerberos ldap
ssl" -Dldap=disabled -Dssl=openssl -Dcassert=true -Dtap_tests=enabled
-Dprefix=/Users/eax/pginstall build && ninja -C build && ninja -C
build docs && meson test -C build'
```

I don't see any warnings when using Autotools.

--
Best regards,
Aleksander Alekseev




Re: Do away with a few backwards compatibility macros

2023-11-21 Thread Nathan Bossart
On Tue, Nov 21, 2023 at 12:05:36AM -0500, Tom Lane wrote:
> No objection here, but should we try to establish some sort of project
> policy around this sort of change (ie, removal of backwards-compatibility
> support)?  "Once it no longer matters for any supported version" sounds
> about right to me, but maybe somebody has an argument for thinking about
> it differently.

That seems reasonable to me.  I don't think we need to mandate that
backwards-compatibility support be removed as soon as it is eligible, but
it can be considered fair game at that point.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key

2023-11-21 Thread Nathan Bossart
On Mon, Nov 20, 2023 at 10:37:47PM -0800, Jeff Davis wrote:
> It would be interesting to know how often it's a good idea to turn it
> on, though. I could try turning it on for various other uses of
> simplehash, and see where it tends to win.

That seems worthwhile to me.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Remove distprep

2023-11-21 Thread Alvaro Herrera
On 2023-Nov-07, Michael Paquier wrote:

> On Mon, Nov 06, 2023 at 04:21:40PM +0100, Peter Eisentraut wrote:
> > done
> 
> Nice to see 721856ff24b3 in, thanks!

Hmm, do we still need to have README.git as a separate file from README?

Also, looking at README, I see it refers to the INSTALL file in the
root, but that doesn't exist.  "make -C doc/src/sgml INSTALL" creates
it, but it's not copied to the root directory.  Do we need some fixup
for that?

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Learn about compilers. Then everything looks like either a compiler or
a database, and now you have two problems but one of them is fun."
https://twitter.com/thingskatedid/status/1456027786158776329




Re: PSQL error: total cell count of XXX exceeded

2023-11-21 Thread Alvaro Herrera
On 2023-Nov-21, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Right.  I made a few more adjustments, including the additional overflow
> > check in printTableInit that Tom Lane suggested, and pushed this.
> 
> Committed patch LGTM.

Thanks for looking!

> > It's a bit annoying that the error recovery decision of this code is to
> > exit the process with an error.  [...]
> > TBH though, I've never hit that code in real usage.
> 
> Yeah, I think the reason it's stayed like that for 25 years is that
> nobody's hit the case in practice.  Changing the API would be a bit
> troublesome, too, because we don't know if anybody besides psql
> uses it.

True.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread Andres Freund
Hi,

On 2023-11-21 07:42:42 -0400, David Steele wrote:
> On 11/20/23 19:58, Andres Freund wrote:
> > On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:
> > > On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:
> > > > Given that, I wonder if what we should do is to just add a new field to
> > > > pg_control that says "error out if backup_label does not exist", that we
> > > > set when creating a streaming base backup
> > > 
> > > That would mean that one still needs to take an extra step to update a
> > > control file with this byte set, which is something you had a concern
> > > with in terms of compatibility when it comes to external backup
> > > solutions because more steps are necessary to take a backup, no?
> > 
> > I was thinking we'd just set it in the pg_basebackup style path, and we'd
> > error out if it's set and backup_label is present. But we'd still use
> > backup_label without the pg_control flag set.
> > 
> > So it'd just provide a cross-check that backup_label was not removed for
> > pg_basebackup style backup, but wouldn't do anything for external backups.
> > But imo the proposal to just use pg_control doesn't actually do anything
> > for external backups either - which is why I think my proposal would
> > achieve as much, for a much lower price.
> 
> I'm not sure why you think the patch under discussion doesn't do anything
> for external backups. It provides the same benefits to both pg_basebackup
> and external backups, i.e. they both receive the updated version of
> pg_control.

Sure. They also receive a backup_label today. If an external solution forgets
to replace pg_control copied as part of the filesystem copy, they won't get an
error after the removal of backup_label, just like they don't get one today if
they don't put backup_label in the data directory.  Given that users don't do
the right thing with backup_label today, why can we rely on them doing the
right thing with pg_control?

Greetings,

Andres Freund




Re: Locks on unlogged tables are locked?!

2023-11-21 Thread Bruce Momjian


Uh, was this ever addressed?  I don't see the patch applied or the code
in this area modified.

---

On Thu, May 24, 2018 at 04:33:11PM +0200, Laurenz Albe wrote:
> While looking at this:
> https://stackoverflow.com/q/50413623/6464308
> I realized that "LOCK TABLE " puts a
> Standby/LOCK into the WAL stream, which causes a log flush
> at COMMIT time.
> 
> That hurts performance, and I doubt that it is necessary.
> At any rate, DROP TABLE on an unlogged table logs nothing.
> 
> The attached patch would take care of it, but I'm not sure
> if that's the right place to check.
> 
> Yours,
> Laurenz Albe

> From d474e7b41298944e43bb9141eb33adbdd9ea1098 Mon Sep 17 00:00:00 2001
> From: Laurenz Albe 
> Date: Tue, 22 May 2018 18:13:31 +0200
> Subject: [PATCH] Don't log locks on unlogged tables
> 
> ---
>  src/backend/storage/lmgr/lock.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
> index dc3d8d9817..70cac47ab3 100644
> --- a/src/backend/storage/lmgr/lock.c
> +++ b/src/backend/storage/lmgr/lock.c
> @@ -37,6 +37,7 @@
>  #include "access/twophase_rmgr.h"
>  #include "access/xact.h"
>  #include "access/xlog.h"
> +#include "catalog/pg_class.h"
>  #include "miscadmin.h"
>  #include "pg_trace.h"
>  #include "pgstat.h"
> @@ -47,6 +48,7 @@
>  #include "storage/standby.h"
>  #include "utils/memutils.h"
>  #include "utils/ps_status.h"
> +#include "utils/rel.h"
>  #include "utils/resowner_private.h"
>  
>  
> @@ -1041,13 +1043,25 @@ LockAcquireExtended(const LOCKTAG *locktag,
>  	 */
>  	if (log_lock)
>  	{
> -		/*
> -		 * Decode the locktag back to the original values, to avoid sending
> -		 * lots of empty bytes with every message.  See lock.h to check how a
> -		 * locktag is defined for LOCKTAG_RELATION
> -		 */
> -		LogAccessExclusiveLock(locktag->locktag_field1,
> -							   locktag->locktag_field2);
> +		bool		unlogged_rel = false;
> +
> +		if (locktag->locktag_type == LOCKTAG_RELATION)
> +		{
> +			Relation	r = RelationIdGetRelation(locktag->locktag_field2);
> +			unlogged_rel = r->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED;
> +			RelationClose(r);
> +		}
> +
> +		if (!unlogged_rel)
> +		{
> +			/*
> +			 * Decode the locktag back to the original values, to avoid sending
> +			 * lots of empty bytes with every message.  See lock.h to check how a
> +			 * locktag is defined for LOCKTAG_RELATION
> +			 */
> +			LogAccessExclusiveLock(locktag->locktag_field1,
> +								   locktag->locktag_field2);
> +		}
>  	}
>  
>  	return LOCKACQUIRE_OK;
> -- 
> 2.14.3
> 


-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key

2023-11-21 Thread Andres Freund
Hi,

On 2023-11-20 22:37:47 -0800, Jeff Davis wrote:
> On Mon, 2023-11-20 at 22:50 -0600, Nathan Bossart wrote:
> > I'm mostly thinking out loud here, but could we just always do this? 
> > I
> > guess you might want to avoid it if your SH_EQUAL is particularly
> > expensive
> > and you know repeated lookups are rare, but maybe that's uncommon
> > enough
> > that we don't really care.
> 
> I like that simplehash is simple, so I'm not inclined to introduce an
> always-on feature.

I think it'd be a bad idea to make it always on - there's plenty cases where
it just would make things slower because the hit rate is low. A equal
comparison is far from free.

I am not quite sure this kind of cache best lives in simplehash - ISTM that
quite often it'd be more beneficial to have a cache that you can test more
cheaply higher up.

Greetings,

Andres Freund




Re: Change GUC hashtable to use simplehash?

2023-11-21 Thread Jeff Davis
On Tue, 2023-11-21 at 16:42 +0700, John Naylor wrote:
> The strlen call required for hashbytes() is not free.

Should we have a hash_string() that's like hash_bytes() but checks for
the NUL terminator itself?

That wouldn't be inlinable, but it would save on the strlen() call. It
might benefit some other callers?
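
Something like the following, shape-wise -- the loop body is just a
stand-in for whatever mixing logic we'd actually lift from hash_bytes():

static uint32
hash_string_sketch(const char *s)
{
	uint32		h = 5381;		/* illustrative seed/mixer only */

	while (*s)
		h = (h << 5) + h + (unsigned char) *s++;

	return h;
}

The point being that the bytes are consumed in a single pass, with no
separate strlen() walk over the string first.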

Regards,
Jeff Davis





Re: Partial aggregates pushdown

2023-11-21 Thread Robert Haas
On Mon, Nov 20, 2023 at 5:48 PM Bruce Momjian  wrote:
> > I do have a concern about this, though. It adds a lot of bloat. It
> > adds a whole lot of additional entries to pg_aggregate, and every new
> > aggregate we add in the future will require a bonus entry for this,
> > and it needs a bunch of new pg_proc entries as well. One idea that
> > I've had in the past is to instead introduce syntax that just does
> > this, without requiring a separate aggregate definition in each case.
> > For example, maybe instead of changing string_agg(whatever) to
> > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
> > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
> > something. Then all aggregates could be treated in a generic way. I'm
> > not completely sure that's better, but I think it's worth considering.
>
> So use an SQL keyword to indicates a pushdown call?  We could then
> automate the behavior rather than requiring special catalog functions?

Right. It would require more infrastructure in the parser, planner,
and executor, but it would be infinitely reusable instead of needing a
new thing for every aggregate. I think that might be better, but to be
honest I'm not totally sure.
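
For example, something like this -- the keyword and spelling are purely
illustrative, nothing here is implemented:

SELECT PARTIAL_AGGREGATE string_agg(val, ',') FROM remote_tab;

where the planner and executor would know to stop at the transition (or
serialized) state instead of running the final function.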

> > I don't think the patch does a good job explaining why HAVING,
> > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
> > shouldn't really be a problem, because HAVING is basically a WHERE
> > clause that occurs after aggregation is complete, and whether or not
> > the aggregation is safe shouldn't depend on what we're going to do
> > with the value afterward. The HAVING clause can't necessarily be
> > pushed to the remote side, but I don't see how or why it could make
> > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
> > little trickier: if we pushed down DISTINCT, we'd still have to
> > re-DISTINCT-ify when combining locally, and if we pushed down ORDER
> > BY, we'd have to do a merge pass to combine the returned values unless
> > we could prove that the partitions were non-overlapping ranges that
> > would be visited in the correct order. Although that all sounds
> > doable, I think it's probably a good thing that the current patch
> > doesn't try to handle it -- this is complicated already. But it should
> > explain why it's not handling it and maybe even a bit about how it
> > could be handling in the future, rather than just saying "well, this
> > kind of thing is not safe." The trouble with that explanation is that
> > it does nothing to help the reader understand whether the thing in
> > question is *fundamentally* unsafe or whether we just don't have the
> > right code to make it work.
>
> Makes sense.

Actually, I think I was wrong about this. We can't handle ORDER BY or
DISTINCT because we can't distinct-ify or order after we've already
partially aggregated. At least not in general, and not without
additional aggregate support functions. So what I said above was wrong
with respect to those. Or so I believe, anyway. But I still don't see
why HAVING should be a problem.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/21/23 12:41, Andres Freund wrote:
> On 2023-11-21 07:42:42 -0400, David Steele wrote:
>> On 11/20/23 19:58, Andres Freund wrote:
>>> On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:
>>>> On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:
>>>>> Given that, I wonder if what we should do is to just add a new field to
>>>>> pg_control that says "error out if backup_label does not exist", that we
>>>>> set when creating a streaming base backup
>>>>
>>>> That would mean that one still needs to take an extra step to update a
>>>> control file with this byte set, which is something you had a concern
>>>> with in terms of compatibility when it comes to external backup
>>>> solutions because more steps are necessary to take a backup, no?
>>>
>>> I was thinking we'd just set it in the pg_basebackup style path, and we'd
>>> error out if it's set and backup_label is present. But we'd still use
>>> backup_label without the pg_control flag set.
>>>
>>> So it'd just provide a cross-check that backup_label was not removed for
>>> pg_basebackup style backup, but wouldn't do anything for external backups.
>>> But imo the proposal to just use pg_control doesn't actually do anything
>>> for external backups either - which is why I think my proposal would
>>> achieve as much, for a much lower price.
>>
>> I'm not sure why you think the patch under discussion doesn't do anything
>> for external backups. It provides the same benefits to both pg_basebackup
>> and external backups, i.e. they both receive the updated version of
>> pg_control.
>
> Sure. They also receive a backup_label today. If an external solution
> forgets to replace pg_control copied as part of the filesystem copy, they
> won't get an error after the removal of backup_label, just like they don't
> get one today if they don't put backup_label in the data directory.  Given
> that users don't do the right thing with backup_label today, why can we
> rely on them doing the right thing with pg_control?

I think reliable backup software does the right thing with backup_label,
but if the user starts getting errors on recovery they then decide to
remove backup_label. I know we can't do much about bad backup software,
but we can at least make this a bit more resistant to user error after
the fact.

It doesn't help that one of our hints suggests removing backup_label. In
highly automated systems, the user might not even know they just
restored from a backup. They are only in the loop because the restore
failed and they are trying to figure out what is going wrong. When they
remove backup_label the cluster comes up just fine. Victory!

This is the scenario I've seen most often -- not the backup/restore
process getting it wrong but the user removing backup_label on their own
initiative. And because it yields such a positive result, at least
initially, they remember in the future that the thing to do is to remove
backup_label whenever they see the error.

If they only have pg_control, then their only choice is to get it right
or run pg_resetwal. Most users have no knowledge of pg_resetwal so it
will take them longer to get there. Also, I think that tool makes it
pretty clear that corruption will result and the only thing to do is a
logical dump and restore after using it.

There are plenty of ways a user can mess things up. What I'd like to
prevent is the appearance of everything being OK when in fact they have
corrupted their cluster. That's the situation we have now with
backup_label. Is this new solution perfect? No, but I do think it checks
several boxes, and is a worthwhile improvement.

Regards,
-David




Re: Locks on unlogged tables are locked?!

2023-11-21 Thread Robert Haas
On Tue, Nov 21, 2023 at 11:49 AM Bruce Momjian  wrote:
> Uh, was this ever addressed?  I don't see the patch applied or the code
> in this area modified.

I never saw this email originally, but I don't think I believe
Laurenz's argument. Are all AEL-requiring operations on unlogged
tables safe to perform on a standby without AEL? I believe I would be
quite surprised if that turned out to be true.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Hide exposed impl detail of wchar.c

2023-11-21 Thread Eric Ridge
(I hope you don't mind I'm reposting your reply -- I accidentally replied 
directly to you b/c phone)

> On Nov 21, 2023, at 11:56 AM, Andres Freund  wrote:
> 
> Hi,
> 
> On 2023-11-21 10:11:18 -0500, Eric Ridge wrote:
>> On Mon, Nov 20, 2023 at 7:10 PM Andres Freund  wrote:



>> And I don’t know that it makes much sense for Postgres to include such a
>> header into 3rd-party code anyways.
> 
> Well, we want to expose such functionality to extensions. For some cases using
> full functions would to be too expensive, hence using static inlines. In case
> of exposing simd stuff, that means we need to include headers.

Sure.  Probably not my place to make that kind of broad statement anyways.  The 
"static inlines" are problematic for us in pgrx-land too, but that's a 
different problem for another day.


> I'm not against splitting this out of pg_wchar.h, to be clear - that's a too
> widely used header for, so there's a good independent reason for such a
> change. I just don't really believe that moving simd.h out of there will end
> the issue, we'll add more inlines using simd over time...

Yeah and that's why Jubilee is working with the bindgen folks to tighten this 
up for good.

(We tracked all of the pg16 betas and didn't run into this until after pg16 
went gold.  I personally haven't groveled through the git logs to see when this 
header/static inline was added, but we'd have reported this sooner had we found 
it sooner.)

eric



Re: pg_upgrade and logical replication

2023-11-21 Thread vignesh C
On Mon, 20 Nov 2023 at 10:44, Peter Smith  wrote:
>
> Here are some review comments for patch v15-0001
>
> ==
> src/bin/pg_dump/pg_dump.c
>
> 1. getSubscriptions
>
> + if (fout->remoteVersion >= 17)
> + appendPQExpBufferStr(query, "o.remote_lsn AS suboriginremotelsn\n");
> + else
> + appendPQExpBufferStr(query, "NULL AS suboriginremotelsn\n");
> +
>
> There should be preceding spaces in those append strings to match the
> other ones.

Modified

> ~~~
>
> 2. dumpSubscriptionTable
>
> +/*
> + * dumpSubscriptionTable
> + *   Dump the definition of the given subscription table mapping. This
> + *   will be used only in binary-upgrade mode.
> + */
> +static void
> +dumpSubscriptionTable(Archive *fout, const SubRelInfo *subrinfo)
> +{
> + DumpOptions *dopt = fout->dopt;
> + SubscriptionInfo *subinfo = subrinfo->subinfo;
> + PQExpBuffer query;
> + char*tag;
> +
> + /* Do nothing in data-only dump */
> + if (dopt->dataOnly)
> + return;
> +
> + Assert(fout->dopt->binary_upgrade || fout->remoteVersion >= 17);
>
> The function comment says this is only for binary-upgrade mode, so why
> does the Assert use || (OR)?

Added comments

> ==
> src/bin/pg_upgrade/check.c
>
> 3. check_and_dump_old_cluster
>
> + /*
> + * Subscription dependencies can be migrated since PG17. See comments atop
> + * get_old_cluster_subscription_count().
> + */
> + if (GET_MAJOR_VERSION(old_cluster.major_version) >= 1700)
> + check_old_cluster_subscription_state(&old_cluster);
> +
>
> Should this be combined with the other adjacent check so there is only
> one "if (GET_MAJOR_VERSION(old_cluster.major_version) >= 1700)"
> needed?

Modified

> ~~~
>
> 4. check_new_cluster
>
>   check_new_cluster_logical_replication_slots();
> +
> + check_new_cluster_subscription_configuration();
>
> When checking the old cluster, the subscription was checked before the
> slots, but here for the new cluster, the slots are checked before the
> subscription. Maybe it makes no difference but it might be tidier to
> do these old/new checks in the same order.

Modified

> ~~~
>
> 5. check_new_cluster_logical_replication_slots
>
> - /* Quick return if there are no logical slots to be migrated. */
> + /* Quick return if there are no logical slots to be migrated */
>
> Change is not relevant for this patch.

Removed it

> ~~~
>
> 6.
>
> + res = executeQueryOrDie(conn, "SELECT setting FROM pg_settings "
> + "WHERE name IN ('max_replication_slots') "
> + "ORDER BY name DESC;");
>
> Using IN and ORDER BY in this SQL seems unnecessary when you are only
> searching for one name.

Modified
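
In other words, something like:

SELECT setting FROM pg_settings WHERE name = 'max_replication_slots';

(modulo the exact C string literals in the patch).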

> ==
> src/bin/pg_upgrade/info.c
>
> 7. statics
>
> -
> +static void get_old_cluster_subscription_count(DbInfo *dbinfo);
>
> This change also removes an existing blank line -- not sure if that
> was intentional

Modified

> ~~~
>
> 8.
> @@ -365,7 +369,6 @@ get_template0_info(ClusterInfo *cluster)
>   PQfinish(conn);
>  }
>
> -
>  /*
>   * get_db_infos()
>   *
>
> This blank line change (before get_db_infos) should not be part of this patch.

Modified

> ~~~
>
> 9. get_old_cluster_subscription_count
>
> It seems a slightly misleading function name because this is a PER-DB
> count, not a cluster count.

Modified

> ~~~
>
>
> 10.
> + /* Subscriptions can be migrated since PG17. */
> + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
> + return;
>
> IMO it is better to compare < 1700 instead of <= 1600. It keeps the
> code more aligned with the comment.

Modified

> ~~~
>
> 11. count_old_cluster_subscriptions
>
> +/*
> + * count_old_cluster_subscriptions()
> + *
> + * Returns the number of subscription for all databases.
> + *
> + * Note: this function always returns 0 if the old_cluster is PG16 and prior
> + * because we gather subscriptions only for cluster versions greater than or
> + * equal to PG17. See get_old_cluster_subscription_count().
> + */
> +int
> +count_old_cluster_subscriptions(void)
> +{
> + int nsubs = 0;
> +
> + for (int dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++)
> + nsubs += old_cluster.dbarr.dbs[dbnum].nsubs;
> +
> + return nsubs;
> +}
>
> 11a.
> /subscription/subscriptions/

Modified

> ~
>
> 11b.
> The code is now consistent with the slots code which looks good. OTOH
> I thought that 'pg_catalog.pg_subscription' is shared across all
> databases of the cluster, so isn't this code inefficient to be
> querying again and again for every database (if there are many of
> them) instead of just querying 1 time only for the whole cluster?

My earlier version was like that; I changed it to keep the code
consistent with the logical replication slots code.

> ==
> src/bin/pg_upgrade/t/004_subscription.pl
>
> 12.
> It is difficult to keep track of all the tables (upgraded and not
> upgraded) at each step of these tests. Maybe the comments can be more
> explicit along the way. e.g
>
> BEFORE
> +# Add tab_not_upgraded1 to the publication
>
> SUGGESTION
> +# Add tab_not_upgraded1 to the publication. Now publication has 
>
> and
>
> BEFORE
> +# Subscript

Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/20/23 16:41, Andres Freund wrote:
> On 2023-11-20 15:56:19 -0400, David Steele wrote:
>> I understand this is an option -- but does it need to be? What is the
>> benefit of excluding the manifest?
>
> It's not free to create the manifest, particularly if checksums are
> enabled.

It's virtually free, even with the basic CRCs. Anyway, would you really
want a backup without a manifest? How would you know something is
missing? In particular, for page incremental how do you know something
is new (but not WAL logged) if there is no manifest? Is the plan to just
recopy anything not WAL logged with each incremental?

> Also, for external backups, there's no manifest...

There certainly is a manifest for many external backup solutions. Not
having a manifest is just running with scissors, backup-wise.

Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread Andres Freund
Hi,

On 2023-11-21 13:41:15 -0400, David Steele wrote:
> On 11/20/23 16:41, Andres Freund wrote:
> >
> > On 2023-11-20 15:56:19 -0400, David Steele wrote:
> > > I understand this is an option -- but does it need to be? What is the
> > > benefit of excluding the manifest?
> >
> > It's not free to create the manifest, particularly if checksums are enabled.
>
> It's virtually free, even with the basic CRCs.

Huh?

perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar > /dev/null

          4,423.81 msec task-clock                #    0.626 CPUs utilized
            433,475      context-switches          #   97.987 K/sec
                  5      cpu-migrations            #    1.130 /sec
                599      page-faults               #  135.404 /sec
     12,208,261,153      cycles                    #    2.760 GHz
      6,805,401,520      instructions              #    0.56  insn per cycle
      1,273,896,027      branches                  #  287.964 M/sec
         14,233,126      branch-misses             #    1.12% of all branches

        7.068946385 seconds time elapsed

        1.106072000 seconds user
        3.403793000 seconds sys


perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --manifest-checksums=CRC32C > /dev/null

          4,324.64 msec task-clock                #    0.640 CPUs utilized
            433,306      context-switches          #  100.195 K/sec
                  3      cpu-migrations            #    0.694 /sec
                598      page-faults               #  138.277 /sec
     11,952,475,908      cycles                    #    2.764 GHz
      6,816,888,845      instructions              #    0.57  insn per cycle
      1,275,949,455      branches                  #  295.042 M/sec
         13,721,376      branch-misses             #    1.08% of all branches

        6.760321433 seconds time elapsed

        1.113256000 seconds user
        3.302907000 seconds sys


perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --no-manifest > /dev/null

          3,925.38 msec task-clock                #    0.823 CPUs utilized
            257,467      context-switches          #   65.590 K/sec
                  4      cpu-migrations            #    1.019 /sec
                552      page-faults               #  140.624 /sec
     11,577,054,842      cycles                    #    2.949 GHz
      5,933,731,797      instructions              #    0.51  insn per cycle
      1,108,784,719      branches                  #  282.466 M/sec
         11,867,511      branch-misses             #    1.07% of all branches

        4.770347012 seconds time elapsed

        1.002521000 seconds user
        2.991769000 seconds sys


I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".


And this actually *under* selling the cost - we waste a lot of cycles due to
bad buffering decisions. Once we fix that, the cost differential increases
further.


> Anyway, would you really want a backup without a manifest? How would you
> know something is missing? In particular, for page incremental how do you
> know something is new (but not WAL logged) if there is no manifest? Is the
> plan to just recopy anything not WAL logged with each incremental?

Shrug. If you just want to create a new standby by copying the primary, I
don't think creating and then validating the manifest buys you much. Long term
backups are a different story, particularly if data files are stored
individually, rather than in a single checksummed file.


> > Also, for external backups, there's no manifest...
>
> There certainly is a manifest for many external backup solutions. Not having
> a manifest is just running with scissors, backup-wise.

You mean that you have an external solution gin up a backup manifest? I fail
to see how that's relevant here?

Greetings,

Andres Freund




Re: PSQL error: total cell count of XXX exceeded

2023-11-21 Thread Tom Lane
Alvaro Herrera  writes:
> On 2023-Nov-21, Tom Lane wrote:
>> Alvaro Herrera  writes:
>>> It's a bit annoying that the error recovery decision of this code is to
>>> exit the process with an error.  [...]
>>> TBH though, I've never hit that code in real usage.

>> Yeah, I think the reason it's stayed like that for 25 years is that
>> nobody's hit the case in practice.  Changing the API would be a bit
>> troublesome, too, because we don't know if anybody besides psql
>> uses it.

> True.

It strikes me that perhaps a workable compromise behavior could be
"report the error to wherever we would have printed the table, and
return normally".  I'm still not excited about doing anything about
it, but ...

regards, tom lane




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/20/23 16:37, Andres Freund wrote:
> On 2023-11-20 11:11:13 -0500, Robert Haas wrote:
>> I think we need more votes to make a change this big. I have a
>> concern, which I think I've expressed before, that we keep whacking
>> around the backup APIs, and that has a cost which is potentially
>> larger than the benefits.
>
> +1.  The amount of whacking around in this area has been substantial, and
> it's hard for operators to keep up. And realistically, with data sizes
> today, the pressure to do basebackups with disk snapshots etc is not going
> to shrink.

True enough, but disk snapshots aren't really backups in themselves, in
most scenarios, because they reside on the same storage as the cluster.
Of course, snapshots can be exported, but that's also expensive.

I see snapshots as an adjunct to backups -- a safe backup offsite
somewhere for DR and snapshots for day to day operations. Even so,
managing snapshots as backups is harder than people think. It is easy to
get wrong and end up with silent corruption.

> Leaving that concern aside, I am still on the fence about this proposal. I
> think it does decrease the chance of getting things wrong in the
> streaming-basebackup case. But for external backups, it seems almost
> universally worse (with the exception of the torn pg_control issue, that
> we also can address otherwise):

Why universally worse? The software stores pg_control instead of backup
label. The changes to pg_basebackup were pretty trivial and the changes
to external backup are pretty much the same, at least in my limited
sample of one.

And I don't believe we have a satisfactory solution to the torn
pg_control issue yet. Certainly it has not been committed and Thomas has
shown enthusiasm for this approach, to the point of hoping it could be
back patched (it can't).

> It doesn't reduce the risk of getting things wrong, you can still omit
> placing a file into the data directory and get silent corruption as a
> consequence. In addition, it's harder to see when looking at a base
> backup whether the process was right or not, because now the good and bad
> state look the same if you just look on the filesystem level!

This is one of the reasons I thought writing just the first 512 bytes of
pg_control would be valuable. It would give an easy indicator that
pg_control came from a backup. Michael was not in favor of conflating
that change with this patch -- but I still think it's a good idea.

> Then there's the issue of making ad-hoc debugging harder by not having a
> human readable file with information anymore, including when looking at
> the history, via backup_label.old.

Yeah, you'd need to use pg_controldata instead. But as Michael has
suggested, we could also write backup_label as backup_info so there is
human-readable information available.

> Given that, I wonder if what we should do is to just add a new field to
> pg_control that says "error out if backup_label does not exist", that we
> set when creating a streaming base backup

I'm not in favor of a change only accessible to pg_basebackup or
external software that can manipulate pg_control.

Regards,
-David




Re: Locks on unlogged tables are locked?!

2023-11-21 Thread Tom Lane
Bruce Momjian  writes:
> Uh, was this ever addressed?  I don't see the patch applied or the code
> in this area modified.

This patch as-is would surely be disastrous: having LockAcquire
try to open the relcache entry for the thing we're trying to lock
is going to be circular in at least some cases.  I'm not convinced
that there's a problem worth solving here, but if there is, it'd
have to be done in some other way.

regards, tom lane




Re: Remove distprep

2023-11-21 Thread Tom Lane
Alvaro Herrera  writes:
> Hmm, do we still need to have README.git as a separate file from README?

> Also, looking at README, I see it refers to the INSTALL file in the
> root, but that doesn't exist.  "make -C doc/src/sgml INSTALL" creates
> it, but it's not copied to the root directory.  Do we need some fixup
> for that?

Yeah, we clearly need to rethink this area if the plan is that tarballs
will be pristine git pulls.  I think we want just README at the top
level, and I propose we give up on the text INSTALL file altogether
(thereby removing a documentation build gotcha that catches people
every so often).  I propose that in 2023 it ought to be sufficient
for the README file to point at build instructions on the web.

regards, tom lane




Re: Report planning memory in EXPLAIN ANALYZE

2023-11-21 Thread Alvaro Herrera
I gave this a quick look.  I think the usefulness aspect is already
established in general terms; the bit I'm not so sure about is whether
we want it enabled by default.  For many cases it'd just be noise.
Perhaps we want it hidden behind something like "EXPLAIN (MEMORY)" or
such, particularly since things like "allocated" (which, per David,
seems to be the really useful metric) seems too much a PG-developer
value rather than an end-user value.
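
That is, usage would presumably look something like this (hypothetical,
assuming such an option is added):

EXPLAIN (ANALYZE, MEMORY) SELECT count(*) FROM pg_class;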

If EXPLAIN (MEMORY) is added, then probably auto_explain needs a
corresponding flag, and doc updates.

Some code looks to be in weird places.  Why is calc_mem_usage, which
operates on MemoryContextCounters, in explain.c instead of mcxt.c?
Why is MemUsage in explain.h instead of memnodes.h?  I moved both.  I
also renamed them to MemoryContextSizeDifference() and MemoryUsage
respectively; fixup patch attached.

I see no reason for this to be three separate patches anymore.

The EXPLAIN docs (explain.sgml) need an update to mention the new flag
and the new output, too.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
commit 48b0c6682a9e8cf07096b979693fac09b2f7a0ba
Author: Alvaro Herrera  [Álvaro Herrera ]
AuthorDate: Tue Nov 21 18:20:32 2023 +0100
CommitDate: Tue Nov 21 19:18:18 2023 +0100

review

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9cd9b577c7..8c7f27b661 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -123,7 +123,7 @@ static void show_buffer_usage(ExplainState *es, const BufferUsage *usage,
 								  bool planning);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
 static void show_planning_memory(ExplainState *es,
-								 const MemUsage *usage);
+								 const MemoryUsage *usage);
 static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir,
 									ExplainState *es);
 static void ExplainScanTarget(Scan *plan, ExplainState *es);
@@ -395,14 +395,14 @@ ExplainOneQuery(Query *query, int cursorOptions,
 	else
 	{
 		PlannedStmt *plan;
+		MemoryContext planner_ctx;
 		instr_time	planstart,
 					planduration;
 		BufferUsage bufusage_start,
 					bufusage;
 		MemoryContextCounters mem_counts_start;
 		MemoryContextCounters mem_counts_end;
-		MemUsage	mem_usage;
-		MemoryContext planner_ctx;
+		MemoryUsage mem_usage;
 		MemoryContext saved_ctx;
 
 		/*
@@ -415,7 +415,6 @@ ExplainOneQuery(Query *query, int cursorOptions,
 		planner_ctx = AllocSetContextCreate(CurrentMemoryContext,
 											"explain analyze planner context",
 											ALLOCSET_DEFAULT_SIZES);
-
 		if (es->buffers)
 			bufusage_start = pgBufferUsage;
 		MemoryContextMemConsumed(planner_ctx, &mem_counts_start);
@@ -429,7 +428,7 @@ ExplainOneQuery(Query *query, int cursorOptions,
 		INSTR_TIME_SUBTRACT(planduration, planstart);
 		MemoryContextSwitchTo(saved_ctx);
 		MemoryContextMemConsumed(planner_ctx, &mem_counts_end);
-		calc_mem_usage(&mem_usage, &mem_counts_end, &mem_counts_start);
+		MemoryContextSizeDifference(&mem_usage, &mem_counts_start, &mem_counts_end);
 
 		/* calc differences of buffer counters. */
 		if (es->buffers)
@@ -551,7 +550,7 @@ void
 ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			   const char *queryString, ParamListInfo params,
 			   QueryEnvironment *queryEnv, const instr_time *planduration,
-			   const BufferUsage *bufusage, const MemUsage *mem_usage)
+			   const BufferUsage *bufusage, const MemoryUsage *mem_usage)
 {
 	DestReceiver *dest;
 	QueryDesc  *queryDesc;
@@ -656,9 +655,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 
 	if (es->summary && mem_usage)
 	{
-		ExplainOpenGroup("Planning Memory", "Planning Memory", true, es);
+		ExplainOpenGroup("Planner Memory", "Planner Memory", true, es);
 		show_planning_memory(es, mem_usage);
-		ExplainCloseGroup("Planning Memory", "Planning Memory", true, es);
+		ExplainCloseGroup("Planner Memory", "Planner Memory", true, es);
 	}
 
 	/* Print info about runtim

Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key

2023-11-21 Thread Jeff Davis
On Tue, 2023-11-21 at 08:51 -0800, Andres Freund wrote:
> I am not quite sure this kind of cache best lives in simplehash -
> ISTM that
> quite often it'd be more beneficial to have a cache that you can test
> more
> cheaply higher up.

Yeah. I suppose when a few more callers are likely to benefit we can
reconsider.

Though it makes it easy to test a few other callers, just to see what
numbers appear.

Regards,
Jeff Davis





Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/21/23 13:59, Andres Freund wrote:
> On 2023-11-21 13:41:15 -0400, David Steele wrote:
>> On 11/20/23 16:41, Andres Freund wrote:
>>> On 2023-11-20 15:56:19 -0400, David Steele wrote:
>>>> I understand this is an option -- but does it need to be? What is the
>>>> benefit of excluding the manifest?
>>>
>>> It's not free to create the manifest, particularly if checksums are
>>> enabled.
>>
>> It's virtually free, even with the basic CRCs.
>
> Huh?
>
> [...]
>
> I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".

OK, but how does that look with compression -- to a remote location?
Uncompressed backup to local storage doesn't seem very realistic. With
gzip compression we measure SHA1 checksums at about 5% of total CPU.
Obviously that goes up with zstd or lz4, but parallelism helps offset
that cost, at least in clock time.

I can't overstate how valuable checksums are in finding corruption,
especially in long-lived backups.

>> Anyway, would you really want a backup without a manifest? How would you
>> know something is missing? In particular, for page incremental how do you
>> know something is new (but not WAL logged) if there is no manifest? Is
>> the plan to just recopy anything not WAL logged with each incremental?
>
> Shrug. If you just want to create a new standby by copying the primary, I
> don't think creating and then validating the manifest buys you much. Long
> term backups are a different story, particularly if data files are stored
> individually, rather than in a single checksummed file.

Fine, but you are probably not using page incremental if just using
pg_basebackup to create a standby. With page incremental, at least one
of the backups will already exist, which argues for a manifest.

>>> Also, for external backups, there's no manifest...
>>
>> There certainly is a manifest for many external backup solutions. Not
>> having a manifest is just running with scissors, backup-wise.
>
> You mean that you have an external solution gin up a backup manifest? I
> fail to see how that's relevant here?

Just saying that for external backups there *is* often a manifest and it
is a good thing to have.

Regards,
-David




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread Andres Freund
Hi,

On 2023-11-21 14:48:59 -0400, David Steele wrote:
> > I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".
> 
> OK, but how does that look with compression

With compression it's obviously somewhat different - but that part is done in
parallel, potentially on a different machine with client side compression,
whereas I think right now the checksumming is single-threaded, on the server
side.

With parallel server side compression, it's still 20% slower with the default
checksumming than none. With client side it's 15%.


> -- to a remote location?

I think this one unfortunately makes checksums a bigger issue, not a smaller
one. The network interaction piece is single-threaded, adding another
significant use of CPU onto the same thread means that you are hit harder by
using substantial amount of CPU for checksumming in the same thread.

Once you go beyond the small instances, you have plenty network bandwidth in
cloud environments. We top out well before the network on bigger instances.


> Uncompressed backup to local storage doesn't seem very realistic. With gzip
> compression we measure SHA1 checksums at about 5% of total CPU.

IMO using gzip is basically infeasible for non-toy sized databases today. I
think we're doing our users a disservice by defaulting to it in a bunch of
places. Even if another default exposes them to difficulty due to potentially
using a different compiled binary with fewer supported compression methods -
that's gonna be very rare in practice.


> I can't overstate how valuable checksums are in finding corruption,
> especially in long-lived backups.

I agree!  But I think we need faster checksum algorithms or a faster
implementation of the existing ones. And probably default to something faster
once we have it.

Greetings,

Andres Freund




Re: Add recovery to pg_control and remove backup_label

2023-11-21 Thread David Steele

On 11/21/23 16:00, Andres Freund wrote:
> Hi,
>
> On 2023-11-21 14:48:59 -0400, David Steele wrote:
>>> I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".
>>
>> OK, but how does that look with compression
>
> With compression it's obviously somewhat different - but that part is done
> in parallel, potentially on a different machine with client side
> compression, whereas I think right now the checksumming is
> single-threaded, on the server side.

Ah, yes, that's certainly a bottleneck.

> With parallel server side compression, it's still 20% slower with the
> default checksumming than none. With client side it's 15%.

Yeah, that still seems a lot. But to a large extent it sounds like a
limitation of the current implementation.

>> -- to a remote location?
>
> I think this one unfortunately makes checksums a bigger issue, not a
> smaller one. The network interaction piece is single-threaded, adding
> another significant use of CPU onto the same thread means that you are
> hit harder by using substantial amount of CPU for checksumming in the
> same thread.
>
> Once you go beyond the small instances, you have plenty network bandwidth
> in cloud environments. We top out well before the network on bigger
> instances.
>
>> Uncompressed backup to local storage doesn't seem very realistic. With
>> gzip compression we measure SHA1 checksums at about 5% of total CPU.
>
> IMO using gzip is basically infeasible for non-toy sized databases today.
> I think we're doing our users a disservice by defaulting to it in a bunch
> of places. Even if another default exposes them to difficulty due to
> potentially using a different compiled binary with fewer supported
> compression methods - that's gonna be very rare in practice.

Yeah, I don't use gzip anymore, but there are still some platforms that
do not provide zstd (at least not easily) and lz4 compresses less. One
thing people do seem to have is a lot of cores.

>> I can't overstate how valuable checksums are in finding corruption,
>> especially in long-lived backups.
>
> I agree!  But I think we need faster checksum algorithms or a faster
> implementation of the existing ones. And probably default to something
> faster once we have it.

We've been using xxHash to generate checksums for our block-level
incremental and it is seriously fast, written by the same guy who did
zstd and lz4.
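
The one-shot API makes it trivial to adopt, too. A minimal illustration
(XXH3_64bits() is the real function from libxxhash's <xxhash.h>; the
wrapper and its names are just for show):

#include <stddef.h>
#include <stdint.h>
#include <xxhash.h>

/* Illustrative only: checksum one backup block in a single call. */
static uint64_t
block_checksum(const void *block, size_t len)
{
	return (uint64_t) XXH3_64bits(block, len);
}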


Regards,
-David




Re: Locks on unlogged tables are locked?!

2023-11-21 Thread Bruce Momjian
On Tue, Nov 21, 2023 at 01:16:19PM -0500, Tom Lane wrote:
> Bruce Momjian  writes:
> > Uh, was this ever addressed?  I don't see the patch applied or the code
> > in this area modified.
> 
> This patch as-is would surely be disastrous: having LockAcquire
> try to open the relcache entry for the thing we're trying to lock
> is going to be circular in at least some cases.  I'm not convinced
> that there's a problem worth solving here, but if there is, it'd
> have to be done in some other way.

Thank you, and Robert, for the feedback.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: vacuum_cost_limit doc description patch

2023-11-21 Thread Bruce Momjian
On Fri, Apr 13, 2018 at 09:56:18AM -0300, Martín Marqués wrote:
> El 11/04/18 a las 02:04, David Rowley escribió:
> > On 11 April 2018 at 09:13, Martín Marqués  wrote:
> >> This is a patch to add some further description, plus the upper and
> >> lower limits it has.
> > 
> > Hi,
> > 
> > + for vacuum_cost_delay. The parameter can take a value
> > between 1 and 10000.
> > 
> > vacuum_cost_delay should be in  tags.
> > 
> > +1 to mentioning that we sleep for vacuum_cost_delay, but I just don't
> > see many other GUCs with mention of their supported range.
> 
> Thanks David for having a look.
> 
> New version attached with the missing  tags.
> 
> > effective_io_concurrency mentions the range it supports, but this
> > happens to depend on USE_POSIX_FADVISE, which if undefined the maximum
> > setting is 0, which means the docs are wrong in some cases on that.
> > 
> > vacuum_cost_limit seems fairly fixed at 0-10000 with no compile-time
> > conditions, so perhaps it's okay, providing we remember and update the
> > docs if that ever changes.
> 
> I'm also adding a second patch over the config.sgml doc to fix what I
> believe is a misguidance in the minimum resolution time modern systems have.
> 
> The patch just changes *many* for *some* systems which have a minimum
> resolution time of 10 milliseconds.

Patch applied to master, though I removed the valid range part of the
patch.  I also liked the many-to-some change since it is more accurate.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Partial aggregates pushdown

2023-11-21 Thread Bruce Momjian
On Tue, Nov 21, 2023 at 12:16:41PM -0500, Robert Haas wrote:
> On Mon, Nov 20, 2023 at 5:48 PM Bruce Momjian  wrote:
> > > I do have a concern about this, though. It adds a lot of bloat. It
> > > adds a whole lot of additional entries to pg_aggregate, and every new
> > > aggregate we add in the future will require a bonus entry for this,
> > > and it needs a bunch of new pg_proc entries as well. One idea that
> > > I've had in the past is to instead introduce syntax that just does
> > > this, without requiring a separate aggregate definition in each case.
> > > For example, maybe instead of changing string_agg(whatever) to
> > > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE
> > > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or
> > > something. Then all aggregates could be treated in a generic way. I'm
> > > not completely sure that's better, but I think it's worth considering.
> >
> > So use an SQL keyword to indicates a pushdown call?  We could then
> > automate the behavior rather than requiring special catalog functions?
> 
> Right. It would require more infrastructure in the parser, planner,
> and executor, but it would be infinitely reusable instead of needing a
> new thing for every aggregate. I think that might be better, but to be
> honest I'm not totally sure.

It would make it automatic.  I guess we need to look at how big the
patch is to do it.

> > > I don't think the patch does a good job explaining why HAVING,
> > > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
> > > shouldn't really be a problem, because HAVING is basically a WHERE
> > > clause that occurs after aggregation is complete, and whether or not
> > > the aggregation is safe shouldn't depend on what we're going to do
> > > with the value afterward. The HAVING clause can't necessarily be
> > > pushed to the remote side, but I don't see how or why it could make
> > > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
> > > little trickier: if we pushed down DISTINCT, we'd still have to
> > > re-DISTINCT-ify when combining locally, and if we pushed down ORDER
> > > BY, we'd have to do a merge pass to combine the returned values unless
> > > we could prove that the partitions were non-overlapping ranges that
> > > would be visited in the correct order. Although that all sounds
> > > doable, I think it's probably a good thing that the current patch
> > > doesn't try to handle it -- this is complicated already. But it should
> > > explain why it's not handling it and maybe even a bit about how it
> > > could be handling in the future, rather than just saying "well, this
> > > kind of thing is not safe." The trouble with that explanation is that
> > > it does nothing to help the reader understand whether the thing in
> > > question is *fundamentally* unsafe or whether we just don't have the
> > > right code to make it work.
> >
> > Makes sense.
> 
> Actually, I think I was wrong about this. We can't handle ORDER BY or
> DISTINCT because we can't distinct-ify or order after we've already
> partially aggregated. At least not in general, and not without
> additional aggregate support functions. So what I said above was wrong
> with respect to those. Or so I believe, anyway. But I still don't see
> why HAVING should be a problem.

This should probably be documented in the patch.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: pg_class.reltuples of brin indexes

2023-11-21 Thread Bruce Momjian
On Tue, Mar 27, 2018 at 08:58:11PM +0900, Masahiko Sawada wrote:
> Hi,
> 
> I found that pg_class.reltuples of brin indexes can be either the
> number of index tuples or the number of heap tuples.
> 
> =# create table test as select generate_series(1,100000) as c;
> =# create index test_brin on test using brin (c);
> =# analyze test;
> =# select relname, reltuples, relpages from pg_class where relname in
> ('test', 'test_brin');
>   relname  | reltuples | relpages
> -----------+-----------+----------
>  test      |    100000 |      443
>  test_brin |    100000 |        3
> (2 rows)
> 
> =# vacuum test;
> =# select relname, reltuples, relpages from pg_class where relname in
> ('test', 'test_brin');
>   relname  | reltuples | relpages
> -----------+-----------+----------
>  test      |    100000 |      443
>  test_brin |         3 |        3
> (2 rows)
> 
> If I understand correctly pg_class.reltuples of indexes should have
> the number of index tuples but especially for brin indexes it would be
> hard to estimate it in the analyze code. I thought that we can change
> brinvacuumcleanup so that it returns the estimated number of index
> tuples and do vac_update_relstats using that value but it would break
> API contract. Better ideas?

I assume there is nothing to do on this issue.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: proposal: possibility to read dumped table's name from file

2023-11-21 Thread Daniel Gustafsson
> On 20 Nov 2023, at 06:20, Pavel Stehule  wrote:
>
>> I was pondering replacing the is_include handling with returning an enum for
>> the operation, to keep things more future proof in case we add more operations
>> (and also a bit less magic IMHO).
>
> +1
>
> I did it.

Nice, I think it's an improvement.

+   extension: data on foreign servers, works like
+   --extension. This keyword can only be
+   used with the include keyword.
This seems like a copy-pasteo, fixed in the attached.

I've spent some time polishing this version of the patch, among other things
trying to make the docs and --help screen consistent across the tools.  I've
added the diff as a txt file to this email (to keep the CFbot from applying
it), it's mainly reformatting a few comments and making things consistent.

The attached is pretty close to a committable patch IMO, review is welcome on
both the patch and commit message.  I tried to identify all reviewers over the
past 3+ years but I might have missed someone.

--
Daniel Gustafsson

commit 4a3c0bdaf3fd21b75e17244691fbeb9340e960e1
Author: Daniel Gustafsson 
Date:   Tue Nov 21 15:08:27 2023 +0100

Fixups and tweaks

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index e2f100d552..0e5ba4f712 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -873,49 +873,52 @@ PostgreSQL documentation
 
  
   
-   extension: data on foreign servers, works like
-   --extension. This keyword can only be
+   extension: extensions, works like the
+   --extension option. This keyword can only be
used with the include keyword.
   
  
  
   
foreign_data: data on foreign servers, works like
-   --include-foreign-data. This keyword can only be
-   used with the include keyword.
+   the --include-foreign-data option. This keyword can
+   only be used with the include keyword.
   
  
  
   
-   table: tables, works like
-   -t/--table
+   table: tables, works like the
+   -t/--table option.
   
  
  
   
-   table_and_children: tables, works like
-   --table-and-children
+   table_and_children: tables including any partitions
+   or inheritance child tables, works like the
+   --table-and-children option.
   
  
  
   
-   table_data: table data, works like
-   --exclude-table-data. This keyword can only be
-   used with the exclude keyword.
+   table_data: table data of any tables matching
+   pattern, works like the
+   --exclude-table-data option. This keyword can only
+   be used with the exclude keyword.
   
  
  
   
-   table_data_and_children: table data of any
-   partitions or inheritance child, works like
-   --exclude-table-data-and-children. This keyword can only be
-   used with the exclude keyword.
+   table_data_and_children: table data of any tables
+   matching pattern as well as any partitions
+   or inheritance children of the table(s), works like the
+   --exclude-table-data-and-children option. This
+   keyword can only be used with the exclude keyword.
   
  
  
   
-   schema: schemas, works like
-   -n/--schema
+   schema: schemas, works like the
+   -n/--schema option.
   
  
 
@@ -923,9 +926,9 @@ PostgreSQL documentation
 

 Lines starting with # are considered comments and
-ignored. Comments can be placed after filter as well. Blank lines
-are also ignored. See  for how to
-perform quoting in patterns.
+ignored. Comments can be placed after an object pattern row as well.
+Blank lines are also ignored. See 
+for how to perform quoting in patterns.

 

@@ -1715,8 +1718,8 @@ CREATE DATABASE foo WITH TEMPLATE template0;
 
 
   
-   To dump all tables with names starting with mytable, except for table
-   mytable2, specify a filter file
+   To dump all tables whose names start with mytable, except
+   for table mytable2, specify a filter file
filter.txt like:
 
 include table mytable*
diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index 75ba03f3ad..4d7c046468 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -130,20 +130,29 @@ PostgreSQL documentation
   

 Specify a filename from which to read patterns for databases excluded
-from the dump. The patterns are interpretted according to the same rules
+from the dump. The patterns are interpreted according to the s

common signal handler protection

2023-11-21 Thread Nathan Bossart
In commit 97550c0, I added a "MyProcPid == getpid()" check in the SIGTERM
handler for the startup process.  This ensures that processes forked by
system(3) (i.e., for restore_command) that have yet to install their own
signal handlers do not call proc_exit() upon receiving SIGTERM.  Without
this protection, both the startup process and the restore_command process
might try to remove themselves from the PGPROC shared array (among other
things), which can end badly.

Since then, I've been exploring a more general approach that would offer
protection against similar issues in the future.  We probably don't want
signal handlers in these grandchild processes to touch shared memory at
all.  The attached 0001 is an attempt at adding such protection for all
handlers installed via pqsignal().  In short, it stores the actual handler
functions in a separate array, and sigaction() is given a wrapper function
that performs the "MyProcPid == getpid()" check.  If that check fails, the
wrapper function installs the default signal handler and calls it.

Besides allowing us to revert commit 97550c0 (see attached 0003), this
wrapper handler could also restore errno, as shown in 0002.  Right now,
individual signal handlers must do this today as needed, but that seems
easy to miss and prone to going unnoticed for a long time.
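
For illustration, a minimal sketch of the 0002 idea, reusing the names from
the attached 0001 (the pid check is elided here; only the errno handling is
new):

static void
wrapper_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	/* ... the MyProcPid == getpid() check from 0001 goes here ... */

	(*pqsignal_handlers[postgres_signal_arg]) (postgres_signal_arg);

	/*
	 * Restore errno centrally, so that individual handlers no longer need
	 * to save and restore it themselves.
	 */
	errno = save_errno;
}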

I see two main downsides of this proposal:

* Overhead: The wrapper handler calls a function pointer and getpid(),
  which AFAICT is a real system call on most platforms.  That might not be
  a tremendous amount of overhead, but it's not zero, either.  I'm
  particularly worried about signal-heavy code like synchronous
  replication.  (Are there other areas that should be tested?)  If this is
  a concern, perhaps we could allow certain processes to opt out of this
  wrapper handler, provided we believe it is unlikely to fork or that the
  handler code is safe to run in grandchild processes.

* Race conditions: With these patches, pqsignal() becomes quite racy when
  used within signal handlers.  Specifically, you might get a bogus return
  value.  However, there are no in-tree callers of pqsignal() that look at
  the return value (and I don't see any reason they should), and it seems
  unlikely that pqsignal() will be used within signal handlers frequently,
  so this might not be a deal-breaker.  I did consider trying to convert
  pqsignal() into a void function, but IIUC that would require an SONAME
  bump.  For now, I've just documented the bogosity of the return values.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 5f46dc0150aba3ef75053e91f7c9a4d6624e2c4f Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 17 Nov 2023 21:38:18 -0600
Subject: [PATCH v1 1/3] Check that MyProcPid == getpid() in all signal
 handlers.

In commit 97550c0711, we added a similar check to the SIGTERM
handler for the startup process.  This commit adds this check to
all signal handlers installed with pqsignal().  This is done by
using a wrapper function that performs the check before calling the
actual handler.

The hope is that this will offer more general protection against
grandchildren processes inadvertently modifying shared memory due
to inherited signal handlers.  Another potential follow-up
improvement is to use this wrapper handler function to restore
errno instead of relying on each individual handler function to do
so.

This commit makes the changes in commit 97550c0711 obsolete but
leaves reverting it for a follow-up commit.

Reviewed-by: ???
Discussion: ???
---
 src/port/pqsignal.c | 72 +++--
 1 file changed, 70 insertions(+), 2 deletions(-)

diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c
index 83d876db6c..978877dec6 100644
--- a/src/port/pqsignal.c
+++ b/src/port/pqsignal.c
@@ -28,23 +28,85 @@
 #include "c.h"
 
 #include 
+#ifndef FRONTEND
+#include 
+#endif
 
 #ifndef FRONTEND
 #include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#endif
+
+#ifdef NSIG
+#define PG_NSIG (NSIG)
+#else
+#define PG_NSIG (64)			/* XXX: wild guess */
+#endif
+
+static volatile pqsigfunc pqsignal_handlers[PG_NSIG];
+
+/*
+ * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function
+ * as the handler for all signals.  This wrapper handler function checks that
+ * it is called within a process that the server knows about, and not a
+ * grandchild process forked by system(3), etc.  This check ensures that such
+ * grandchildren processes do not modify shared memory, which could be
+ * detrimental.  If the check succeeds, the function originally provided to
+ * pqsignal() is called.  Otherwise, the default signal handler is installed
+ * and then called.
+ */
+static void
+wrapper_handler(SIGNAL_ARGS)
+{
+#ifndef FRONTEND
+	Assert(MyProcPid);
+
+	if (unlikely(MyProcPid != (int) getpid()))
+	{
+		pqsignal(postgres_signal_arg, SIG_DFL);
+		raise(postgres_signal_arg);
+		return;
+	}
 #endif
 
+	(*pqsignal_handlers[postgres_signal_arg]) (p

Re: Remove distprep

2023-11-21 Thread Andrew Dunstan



On 2023-11-21 Tu 13:23, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Hmm, do we still need to have README.git as a separate file from README?
>> Also, looking at README, I see it refers to the INSTALL file in the
>> root, but that doesn't exist.  "make -C doc/src/sgml INSTALL" creates
>> it, but it's not copied to the root directory.  Do we need some fixup
>> for that?
>
> Yeah, we clearly need to rethink this area if the plan is that tarballs
> will be pristine git pulls.  I think we want just README at the top
> level, and I propose we give up on the text INSTALL file altogether
> (thereby removing a documentation build gotcha that catches people
> every so often).  I propose that in 2023 it ought to be sufficient
> for the README file to point at build instructions on the web.

+1

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Changing baserel to foreignrel in postgres_fdw functions

2023-11-21 Thread Bruce Momjian


Should this patch be applied?

---

On Thu, Feb 15, 2018 at 06:57:50PM +0530, Ashutosh Bapat wrote:
> Hi,
> I noticed that functions is_foreign_expr(), classifyConditions() and
> appendOrderByClause() had variables/arguments named baserel when the
> relations passed to those could be join or upper relation as well.
> Here's patch renaming those as foreignrel.
> -- 
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company

> diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
> index e111b09..bcf9bea 100644
> --- a/contrib/postgres_fdw/deparse.c
> +++ b/contrib/postgres_fdw/deparse.c
> @@ -198,7 +198,7 @@ static void get_relation_column_alias_ids(Var *node, 
> RelOptInfo *foreignrel,
>   */
>  void
>  classifyConditions(PlannerInfo *root,
> -RelOptInfo *baserel,
> +RelOptInfo *foreignrel,
>  List *input_conds,
>  List **remote_conds,
>  List **local_conds)
> @@ -212,7 +212,7 @@ classifyConditions(PlannerInfo *root,
>   {
>   RestrictInfo *ri = lfirst_node(RestrictInfo, lc);
>  
> - if (is_foreign_expr(root, baserel, ri->clause))
> + if (is_foreign_expr(root, foreignrel, ri->clause))
>   *remote_conds = lappend(*remote_conds, ri);
>   else
>   *local_conds = lappend(*local_conds, ri);
> @@ -224,29 +224,29 @@ classifyConditions(PlannerInfo *root,
>   */
>  bool
>  is_foreign_expr(PlannerInfo *root,
> - RelOptInfo *baserel,
> + RelOptInfo *foreignrel,
>   Expr *expr)
>  {
>   foreign_glob_cxt glob_cxt;
>   foreign_loc_cxt loc_cxt;
> - PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) 
> (baserel->fdw_private);
> + PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) 
> (foreignrel->fdw_private);
>  
>   /*
>* Check that the expression consists of nodes that are safe to execute
>* remotely.
>*/
>   glob_cxt.root = root;
> - glob_cxt.foreignrel = baserel;
> + glob_cxt.foreignrel = foreignrel;
>  
>   /*
>* For an upper relation, use relids from its underneath scan relation,
>* because the upperrel's own relids currently aren't set to anything
>* meaningful by the core code.  For other relation, use their own 
> relids.
>*/
> - if (IS_UPPER_REL(baserel))
> + if (IS_UPPER_REL(foreignrel))
>   glob_cxt.relids = fpinfo->outerrel->relids;
>   else
> - glob_cxt.relids = baserel->relids;
> + glob_cxt.relids = foreignrel->relids;
>   loc_cxt.collation = InvalidOid;
>   loc_cxt.state = FDW_COLLATE_NONE;
>   if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt))
> @@ -301,7 +301,7 @@ foreign_expr_walker(Node *node,
>   if (node == NULL)
>   return true;
>  
> - /* May need server info from baserel's fdw_private struct */
> + /* May need server info from foreignrel's fdw_private struct */
>   fpinfo = (PgFdwRelationInfo *) (glob_cxt->foreignrel->fdw_private);
>  
>   /* Set up inner_cxt for possible recursion to child nodes */
> @@ -2965,7 +2965,7 @@ appendGroupByClause(List *tlist, deparse_expr_cxt 
> *context)
>  }
>  
>  /*
> - * Deparse ORDER BY clause according to the given pathkeys for given base
> + * Deparse ORDER BY clause according to the given pathkeys for given foreign
>   * relation. From given pathkeys expressions belonging entirely to the given
>   * base relation are obtained and deparsed.
>   */
> @@ -2975,7 +2975,7 @@ appendOrderByClause(List *pathkeys, deparse_expr_cxt 
> *context)
>   ListCell   *lcell;
>   int nestlevel;
>   char   *delim = " ";
> - RelOptInfo *baserel = context->scanrel;
> + RelOptInfo *foreignrel = context->scanrel;
>   StringInfo  buf = context->buf;
>  
>   /* Make sure any constants in the exprs are printed portably */
> @@ -2987,7 +2987,7 @@ appendOrderByClause(List *pathkeys, deparse_expr_cxt 
> *context)
>   PathKey*pathkey = lfirst(lcell);
>   Expr   *em_expr;
>  
> - em_expr = find_em_expr_for_rel(pathkey->pk_eclass, baserel);
> + em_expr = find_em_expr_for_rel(pathkey->pk_eclass, foreignrel);
>   Assert(em_expr != NULL);
>  
>   appendStringInfoString(buf, delim);


-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: common signal handler protection

2023-11-21 Thread Nathan Bossart
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote:
> +#ifdef NSIG
> +#define PG_NSIG (NSIG)
> +#else
> +#define PG_NSIG (64) /* XXX: wild guess */
> +#endif

> + Assert(signo < PG_NSIG);

cfbot seems unhappy with this on Windows.  IIUC we need to use
PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one
macro for all platforms.
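
Something like the following shape might work (a sketch only;
PG_SIGNAL_COUNT is the constant used by src/port's Windows signal
emulation):

#if defined(WIN32)
#define PG_NSIG (PG_SIGNAL_COUNT)	/* Windows signal emulation */
#elif defined(NSIG)
#define PG_NSIG (NSIG)
#else
#define PG_NSIG (64)			/* XXX: wild guess */
#endif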

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




WIP: libpq: add a possibility to not send D(escribe) when executing a prepared statement

2023-11-21 Thread Ivan Trofimov

Hi!


Currently libpq sends B(ind), D(escribe), E(xecute), S(ync) when
executing a prepared statement. The response for that D message is a
RowDescription, which doesn't change during the prepared statement's
lifetime (with the attribute formats being an exception, as they aren't
known before execution).

In a presumably very common case of repeatedly executing the same
statement, this leads to both client and server parsing/sending exactly
the same RowDescription data over and over again.

Instead, the library user could acquire a statement's result
RowDescription once (via PQdescribePrepared), and reuse it in subsequent
calls to PQexecPrepared and/or its async friends. Doing it this way
saves a measurable amount of CPU for both client and server and saves a
lot of network traffic. For example: when selecting a single row from a
table with 30 columns, where each column has a 10-symbol name and every
value in a row is a 10-symbol TEXT, I'm seeing the amount of bytes sent
to the client decreased by a factor of 2.8, and the CPU time the client
spends in userland decreased by a factor of ~1.5.

The patch attached adds a family of functions
PQsendQueryPreparedPredescribed, PQgetResultPredescribed,
PQisBusyPredescribed, which allow a user to do just that.

If the idea seems reasonable I'd be happy to extend these to
PQexecPrepared as well and cover it with tests.

P.S. This is my first time ever sending a patch via email, so please
don't hesitate to point out mistakes I'm making in the process.
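
To make the intended usage concrete, here is a rough client-side sketch
using the signatures from the attached patch (error handling elided; a
parameterless statement is assumed):

/* fetch the statement's RowDescription once */
PGresult   *desc = PQdescribePrepared(conn, "my_stmt");

/* later, execute repeatedly without paying for D(escribe) each time */
for (int i = 0; i < nruns; i++)
{
	PGresult   *res;

	if (!PQsendQueryPreparedPredescribed(conn, "my_stmt",
										 0, NULL, NULL, NULL,
										 0 /* text format */ , desc))
		break;
	while ((res = PQgetResultPredescribed(conn, desc)) != NULL)
		PQclear(res);
}
PQclear(desc);
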
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 850734ac96..989354ca06 100644
--- a/src/interfaces/libpq/exports.txt
+++ b/src/interfaces/libpq/exports.txt
@@ -191,3 +191,6 @@ PQclosePrepared   188
 PQclosePortal 189
 PQsendClosePrepared   190
 PQsendClosePortal 191
+PQsendQueryPreparedPredescribed 192
+PQgetResultPredescribed 193
+PQisBusyPredescribed194
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index 04610ccf5e..a18bead9e6 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -72,8 +72,19 @@ static int	PQsendQueryGuts(PGconn *conn,
 			const char *const *paramValues,
 			const int *paramLengths,
 			const int *paramFormats,
-			int resultFormat);
-static void parseInput(PGconn *conn);
+			int resultFormat,
+			bool sendDescribe);
+static int	PQsendQueryPreparedInternal(PGconn *conn,
+		const char *stmtName,
+		int nParams,
+		const char *const *paramValues,
+		const int *paramLengths,
+		const int *paramFormats,
+		int resultFormat,
+		bool sendDescribe);
+static int	PQisBusyInternal(PGconn *conn, PGresult *description);
+static PGresult *PQgetResultInternal(PGconn *conn, PGresult *description);
+static void parseInput(PGconn *conn, PGresult *description);
 static PGresult *getCopyResult(PGconn *conn, ExecStatusType copytype);
 static bool PQexecStart(PGconn *conn);
 static PGresult *PQexecFinish(PGconn *conn);
@@ -1528,7 +1539,8 @@ PQsendQueryParams(PGconn *conn,
 		   paramValues,
 		   paramLengths,
 		   paramFormats,
-		   resultFormat);
+		   resultFormat,
+		   true /* send Describe */ );
 }
 
 /*
@@ -1643,6 +1655,26 @@ PQsendQueryPrepared(PGconn *conn,
 	const int *paramLengths,
 	const int *paramFormats,
 	int resultFormat)
+{
+	return PQsendQueryPreparedInternal(conn,
+	   stmtName,
+	   nParams,
+	   paramValues,
+	   paramLengths,
+	   paramFormats,
+	   resultFormat,
+	   true /* send Describe */ );
+}
+
+int
+PQsendQueryPreparedInternal(PGconn *conn,
+			const char *stmtName,
+			int nParams,
+			const char *const *paramValues,
+			const int *paramLengths,
+			const int *paramFormats,
+			int resultFormat,
+			bool sendDescribe)
 {
 	if (!PQsendQueryStart(conn, true))
 		return 0;
@@ -1668,7 +1700,50 @@ PQsendQueryPrepared(PGconn *conn,
 		   paramValues,
 		   paramLengths,
 		   paramFormats,
-		   resultFormat);
+		   resultFormat,
+		   sendDescribe);
+}
+
+int
+PQsendQueryPreparedPredescribed(PGconn *conn,
+const char *stmtName,
+int nParams,
+const char *const *paramValues,
+const int *paramLengths,
+const int *paramFormats,
+int resultFormat,
+PGresult *description)
+{
+	int			i;
+	int			nFields;
+	int			send_result;
+
+	if (!description)
+	{
+		libpq_append_conn_error(conn, "query result description is a null pointer");
+		return 0;
+	}
+
+	send_result = PQsendQueryPreparedInternal(conn,
+			  stmtName,
+			  nParams,
+			  paramValues,
+			  paramLengths,
+			  paramFormats,
+			  resultFormat,
+			  false /* send Describe */ );
+
+	/* We only need to adjust attributes format if 

Re: [HACKERS] legitimacy of using PG_TRY , PG_CATCH , PG_END_TRY in C function

2023-11-21 Thread Bruce Momjian
On Mon, Oct 23, 2017 at 01:27:43AM -0700, Andres Freund wrote:
> On 2017-10-22 23:04:50 -0400, Tom Lane wrote:
> > John Lumby  writes:
> > > I have a C function (a trigger function) which uses the PG_TRY() 
> > > construct to handle certain ERROR conditions.
> > > One example is where invoked as INSTEAD OF INSERT into a view.  It 
> > > PG_TRYs INSERT into the real base table,
> > > but this table may not yet exist  (it is a partitioned child of an 
> > > inheritance parent).
> > > If the error is  ERRCODE_UNDEFINED_TABLE,  then the CATCH issues 
> > > FlushErrorState() and returns to caller who CREATes the table and 
> > > re-issues the insert.
> > > All works perfectly (on every release of 9.x).
> > 
> > If it works, it's only because you didn't try very hard to break it.
> > In general you can't catch and ignore errors without a full-fledged
> > subtransaction --- BeginInternalSubTransaction, then either
> > ReleaseCurrentSubTransaction or RollbackAndReleaseCurrentSubTransaction,
> > not just FlushErrorState.  See e.g. plpgpsql's exec_stmt_block.
> > 
> > There may well be specific scenarios where an error gets thrown without
> > having done anything that requires transaction cleanup.  But when you
> > hit a scenario where that's not true, or when a scenario that used to
> > not require cleanup now does, nobody is going to consider that a PG bug.
> 
> It'd probably be a good idea to expand on this in the sgml docs. This
> has confused quite anumber of people...

I know this is from 2017, but where would we document this?  I don't see
PG_TRY/PG_CATCH mentioned in the SGML docs at all.
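
For reference, the pattern Tom describes (modeled on plpgsql's
exec_stmt_block) looks roughly like this for the ERRCODE_UNDEFINED_TABLE
case from the original report. A sketch of the technique, not drop-in
code:

	MemoryContext oldcontext = CurrentMemoryContext;
	ResourceOwner oldowner = CurrentResourceOwner;

	BeginInternalSubTransaction(NULL);
	PG_TRY();
	{
		/* ... attempt the INSERT into the base table ... */
		ReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();

		/* abort the subtransaction, cleaning up any partial state */
		RollbackAndReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;

		if (edata->sqlerrcode != ERRCODE_UNDEFINED_TABLE)
			ReThrowError(edata);
		FreeErrorData(edata);
		/* ... now safe to CREATE the table and retry the INSERT ... */
	}
	PG_END_TRY();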

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  Only you can decide what is important to you.




Re: Typo with amtype = 's' in opr_sanity.sql

2023-11-21 Thread Michael Paquier
On Tue, Nov 21, 2023 at 01:02:40PM +0300, Aleksander Alekseev wrote:
>> It seems to me that this has been copy-pasted on HEAD from the
>> sequence AM patch, but forgot to update amtype to 't'.  While that's
>> maybe cosmetic, I think that this could lead to unexpected results, so
>> perhaps there is a point in doing a backpatch?
> 
> I disagree that it's cosmetic. The test doesn't check what it's supposed to.

Yes, I've backpatched that all the way down to 12 at the end.
--
Michael




Re: pg_upgrade and logical replication

2023-11-21 Thread Peter Smith
Thanks for addressing my past review comments.

Here are some more review comments for patch v16-0001

==
doc/src/sgml/ref/pgupgrade.sgml

1.
+  
+   Create all the new tables that were created in the publication during
+   upgrade and refresh the publication by executing
+   ALTER
SUBSCRIPTION ... REFRESH PUBLICATION.
+  

"Create all ... that were created" sounds a bit strange.

SUGGESTION (maybe like this or similar?)
Create equivalent subscriber tables for anything that became newly
part of the publication during the upgrade and

==
src/bin/pg_dump/pg_dump.c

2. getSubscriptionTables

+/*
+ * getSubscriptionTables
+ *   Get information about subscription membership for dumpable tables. This
+ *will be used only in binary-upgrade mode.
+ */
+void
+getSubscriptionTables(Archive *fout)
+{
+ DumpOptions *dopt = fout->dopt;
+ SubscriptionInfo *subinfo = NULL;
+ SubRelInfo *subrinfo;
+ PQExpBuffer query;
+ PGresult   *res;
+ int i_srsubid;
+ int i_srrelid;
+ int i_srsubstate;
+ int i_srsublsn;
+ int ntups;
+ Oid last_srsubid = InvalidOid;
+
+ if (dopt->no_subscriptions || !dopt->binary_upgrade ||
+ fout->remoteVersion < 17)
+ return;

This function comment says "used only in binary-upgrade mode." and the
Assert says the same. But, is this compatible with the other function
dumpSubscriptionTable() where it says "used only in binary-upgrade
mode and for PG17 or later versions"?

==
src/bin/pg_upgrade/check.c

3. check_new_cluster_subscription_configuration

+static void
+check_new_cluster_subscription_configuration(void)
+{
+ PGresult   *res;
+ PGconn*conn;
+ int nsubs_on_old;
+ int max_replication_slots;
+
+ /* Logical slots can be migrated since PG17. */
+ if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
+ return;

IMO it is better to say < 1700 in this check, instead of <= 1600.

~~~

4.
+ /* Quick return if there are no subscriptions to be migrated */
+ if (nsubs_on_old == 0)
+ return;

Missing period in comment.

~~~

5.
+/*
+ * check_old_cluster_subscription_state()
+ *
+ * Verify that each of the subscriptions has all their corresponding tables in
+ * i (initialize), r (ready) or s (synchronized) state.
+ */
+static void
+check_old_cluster_subscription_state(ClusterInfo *cluster)

This function is only for the old cluster (hint: the function name) so
there is no need to pass the 'cluster' parameter here. Just directly
use old_cluster in the function body.

==
src/bin/pg_upgrade/t/004_subscription.pl

6.
+# Add tab_not_upgraded1 to the publication. Now publication has tab_upgraded1
+# and tab_upgraded2 tables.
+$publisher->safe_psql('postgres',
+ "ALTER PUBLICATION regress_pub ADD TABLE tab_upgraded2");

Typo in comment. You added tab_not_upgraded2, not tab_not_upgraded1

~~

7.
+# Subscription relations should be preserved. The upgraded won't know
+# about 'tab_not_upgraded1' because the subscription is not yet refreshed.

Typo or missing word in comment?

"The upgraded" ??

==
Kind Regards,
Peter Smith.
Fujitsu Australia




dblink query interruptibility

2023-11-21 Thread Noah Misch
=== Background

Something as simple as the following doesn't respond to cancellation.  In
v15+, any DROP DATABASE will hang as long as it's running:

  SELECT dblink_exec(
$$dbname='$$||current_database()||$$' port=$$||current_setting('port'),
'SELECT pg_sleep(15)');

https://postgr.es/m/4b584c99.8090...@enterprisedb.com proposed a fix back in
2010.  Latches and the libpqsrv facility have changed the server programming
environment since that patch.  The problem statement also came up here:

On Thu, Dec 08, 2022 at 06:08:15PM -0800, Andres Freund wrote:
> dblink.c uses a lot of other blocking libpq functions, which obviously also
> isn't ok.


=== Key decisions

This patch adds to the libpqsrv facility.  It dutifully follows the existing
naming scheme.  For greppability, I am favoring renaming new and old functions
such that the libpq name is a substring of this facility's name.  That is,
rename libpqsrv_disconnect to srvPQfinish or maybe libpqsrv_PQfinish().  Now
is better than later, while pgxn contains no references to libpqsrv.  Does
anyone else have a preference between naming schemes?  If nobody does, I'll
keep today's libpqsrv_disconnect() style.

I was tempted to add a timeout argument to each libpqsrv function, which would
allow libpqsrv_get_result_last() to replace pgfdw_get_cleanup_result().  We
can always add a timeout-accepting function later and make this thread's
function name a thin wrapper around it.  Does anyone feel a mandatory timeout
argument, accepting -1 for no timeout, would be the right thing?


=== Minor topics

It would be nice to replace libpqrcv_PQgetResult() and friends with the new
functions.  I refrained since they use ProcessWalRcvInterrupts(), not
CHECK_FOR_INTERRUPTS().  Since walreceiver already reaches
CHECK_FOR_INTERRUPTS() via libpqsrv_connect_params(), things might just work.

This patch contains a PQexecParams() wrapper, called nowhere in
postgresql.git.  It's inessential, but twelve pgxn modules do mention
PQexecParams.  Just one mentions PQexecPrepared, and none mention PQdescribe*.

The patch makes postgres_fdw adopt its functions, as part of confirming the
functions are general enough.  postgres_fdw create_cursor() has been passing
the "DECLARE CURSOR FOR inner_query" string for some error messages and just
inner_query for others.  I almost standardized on the longer one, but the test
suite checks that.  Hence, I standardized on just inner_query.

I wrote this because pglogical needs these functions to cooperate with v15+
DROP DATABASE (https://github.com/2ndQuadrant/pglogical/issues/418).

Thanks,
nm
Author: Noah Misch 
Commit: Noah Misch 

Make dblink interruptible, via new libpqsrv APIs.

This replaces dblink's blocking libpq calls, allowing cancellation and
allowing DROP DATABASE (of a database not involved in the query).  Apart
from explicit dblink_cancel_query() calls, dblink still doesn't cancel
the remote side.  The replacement for the blocking calls consists of
new, general-purpose query execution wrappers in the libpqsrv facility.
Out-of-tree extensions should adopt these.  Use them in postgres_fdw,
replacing a local implementation from which the libpqsrv implementation
derives.  This is a bug fix for dblink.  Code inspection identified the
bug at least thirteen years ago, but user complaints have not appeared.
Hence, no back-patch for now.

Reviewed by FIXME.

Discussion: https://postgr.es/m/FIXME

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 195b278..4624e53 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -133,6 +133,7 @@ static HTAB *remoteConnHash = NULL;
 /* custom wait event values, retrieved from shared memory */
 static uint32 dblink_we_connect = 0;
 static uint32 dblink_we_get_conn = 0;
+static uint32 dblink_we_get_result = 0;
 
 /*
  * Following is list that holds multiple remote connections.
@@ -252,6 +253,9 @@ dblink_init(void)
 {
if (!pconn)
{
+   if (dblink_we_get_result == 0)
+   dblink_we_get_result = 
WaitEventExtensionNew("DblinkGetResult");
+
pconn = (remoteConn *) MemoryContextAlloc(TopMemoryContext, 
sizeof(remoteConn));
pconn->conn = NULL;
pconn->openCursorCount = 0;
@@ -442,7 +446,7 @@ dblink_open(PG_FUNCTION_ARGS)
/* If we are not in a transaction, start one */
if (PQtransactionStatus(conn) == PQTRANS_IDLE)
{
-   res = PQexec(conn, "BEGIN");
+   res = libpqsrv_exec(conn, "BEGIN", dblink_we_get_result);
if (PQresultStatus(res) != PGRES_COMMAND_OK)
dblink_res_internalerror(conn, res, "begin error");
PQclear(res);
@@ -461,7 +465,7 @@ dblink_open(PG_FUNCTION_ARGS)
(rconn->openCursorCount)++;
 
appendStringInfo(&buf, "DECLARE %s CURSOR FOR %s", curname, sql);
-   res = PQexec(conn, buf.d

Re: [PATCH] fix race condition in libpq (related to ssl connections)

2023-11-21 Thread Michael Paquier
On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote:

Thanks for the report, Willi, and the test case!  Thanks Aleksander
for the reply.

> I wonder if we should just document that libpq is thread safe as of PG
> v??? and deprecate PQisthreadsafe() at some point. Currently the
> documentation gives an impression that the library may or may not be
> thread safe depending on the circumstances.

Because --disable-thread-safe has been removed recently in
68a4b58eca03.  The routine could be marked as deprecated on top of
saying that it always returns 1 for 17~.

> Please add the patch to the nearest open commit fest [2]. The patch
> will be automatically picked up by cfbot [3] and tested on different
> platforms. Also this way it will not be lost among other patches.
> 
> The code looks OK but I would appreciate a second opinion from cfbot.
> Also maybe a few comments before my_BIO_methods_init_mutex and/or
> pthread_mutex_lock would be appropriate. Personally I am inclined to
> think that the automatic test in this particular case is redundant.

I am not really convinced that we require a second mutex here, as
there is always a concern with inter-locking changes.  I may be
missing something, of course, but BIO_s_socket() is just a pointer to
a set of callbacks hidden in bss_sock.c with BIO_meth_new() and
BIO_get_new_index() assigning some centralized data to handle the
methods in a controlled way in OpenSSL.  We only care about
initializing once for the sake of libpq's threads, so wouldn't it be
better to move my_BIO_s_socket() in pgtls_init() where the
initialization of the BIO methods would be protected by
ssl_config_mutex?  That's basically what Willi means in his first
message, no?
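
In other words, something with this shape (a rough sketch;
my_BIO_methods_init() stands in for whatever one-time
BIO_meth_new()/BIO_get_new_index() setup my_BIO_s_socket() currently
performs lazily, and my_bio_methods/ssl_config_mutex are the existing
fe-secure-openssl.c statics):

	/* in pgtls_init(), alongside the other one-time initialization */
	pthread_mutex_lock(&ssl_config_mutex);
	if (my_bio_methods == NULL && my_BIO_methods_init() < 0)
	{
		pthread_mutex_unlock(&ssl_config_mutex);
		return -1;
	}
	pthread_mutex_unlock(&ssl_config_mutex);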
--
Michael




Re: [PATCH] fix race condition in libpq (related to ssl connections)

2023-11-21 Thread Michael Paquier
On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote:
> Please add the patch to the nearest open commit fest [2]. The patch
> will be automatically picked up by cfbot [3] and tested on different
> platforms. Also this way it will not be lost among other patches.

I have noticed that this was not tracked yet, so I have added an entry
here:
https://commitfest.postgresql.org/46/4670/

Willi, note that this requires a PostgreSQL community account, and it
does not seem like you have one, or I would have added you as author
;)
--
Michael




Re: [PATCH] fix race condition in libpq (related to ssl connections)

2023-11-21 Thread Thomas Munro
On Wed, Nov 22, 2023 at 2:44 PM Michael Paquier  wrote:
> On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote:
> > I wonder if we should just document that libpq is thread safe as of PG
> > v??? and deprecate PQisthreadsafe() at some point. Currently the
> > documentation gives an impression that the library may or may not be
> > thread safe depending on the circumstances.
>
> Because --disable-thread-safe has been removed recently in
> 68a4b58eca03.  The routine could be marked as deprecated on top of
> saying that it always returns 1 for 17~.

See also commit ce0b0fa3 "Doc: Adjust libpq docs about thread safety."




Re: [HACKERS] Changing references of password encryption to hashing

2023-11-21 Thread Bruce Momjian


Is there any interest in fixing our documentation that says encrypted
when it means hashed?  Should I pursue this?

---

On Fri, Mar 10, 2017 at 11:16:02AM +0900, Michael Paquier wrote:
> Hi all,
> 
> As discussed here:
> https://www.postgresql.org/message-id/98cafcd0-5557-0bdf-4837-0f2b7782d...@joeconway.com
> We are using in documentation and code comments "encryption" to define
> what actually is hashing, which is confusing.
> 
> Attached is a patch for HEAD to change the documentation to match hashing.
> 
> There are a couple of things I noticed on the way:
> 1) There is the user-visible PQencryptPassword in libpq, which
> actually hashes the password, and not encrypts it :)
> 2) There is as well pg_md5_encrypt() in the code, as well as there is
> pg_md5_hash(). Those may be better if renamed, at least I would think
> that pg_md5_encrypt should be pg_md5_password_hash, because a salt is
> used as well, and the routine is dedicated to work on passwords.
> Thoughts?
> 3) createuser also has --encrypt and --unencrypted, perhaps those
> should be renamed? Honestly I don't really think that this is worth a
> breakage and the option names match with the SQL commands.
> 
> I did not bother about those things in the attached, which works only
> documentation and comment changes.
> 
> An open item has been added on the wiki.
> 
> Thanks,
> -- 
> Michael

> diff --git a/contrib/pgcrypto/crypt-des.c b/contrib/pgcrypto/crypt-des.c
> index a4aa4966bf..1f65f286ea 100644
> --- a/contrib/pgcrypto/crypt-des.c
> +++ b/contrib/pgcrypto/crypt-des.c
> @@ -753,7 +753,7 @@ px_crypt_des(const char *key, const char *setting)
>   output[0] = setting[0];
>  
>   /*
> -  * If the encrypted password that the salt was extracted from 
> is only
> +  * If the hashed password that the salt was extracted from is 
> only
>* 1 character long, the salt will be corrupted.  We need to 
> ensure
>* that the output string doesn't have an extra NUL in it!
>*/
> diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> index 28cdabe6fe..abbd5dd19e 100644
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -1334,8 +1334,8 @@
>rolpassword
>text
>
> -   Password (possibly encrypted); null if none. The format depends
> -   on the form of encryption used.
> +   Password (possibly hashed); null if none. The format depends
> +   on the form of hashing used.
>
>   
>  
> @@ -1350,19 +1350,20 @@
>
>  
>
> -   For an MD5 encrypted password, rolpassword
> +   For an MD5-hashed password, rolpassword
> column will begin with the string md5 followed by a
> 32-character hexadecimal MD5 hash. The MD5 hash will be of the user's
> password concatenated to their user name. For example, if user
> joe has password xyzzy, PostgreSQL
> will store the md5 hash of xyzzyjoe.  If the password is
> -   encrypted with SCRAM-SHA-256, it consists of 5 fields separated by colons.
> +   hashed with SCRAM-SHA-256, it consists of 5 fields separated by colons.
> The first field is the constant scram-sha-256, to
> identify the password as a SCRAM-SHA-256 verifier. The second field is a
> salt, Base64-encoded, and the third field is the number of iterations used
> to generate the password.  The fourth field and fifth field are the stored
> key and server key, respectively, in hexadecimal format. A password that
> -   does not follow either of those formats is assumed to be unencrypted.
> +   does not follow either of those formats is assumed to be in plain format,
> +   non-hashed.
>
>   
>  
> @@ -10269,9 +10270,9 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts 
> ppx
>passwd
>text
>
> -  Password (possibly encrypted); null if none.  See
> +  Password (possibly hashed); null if none.  See
> linkend="catalog-pg-authid">pg_authid
> -  for details of how encrypted passwords are stored.
> +  for details of how hashed passwords are stored.
>   
>  
>   
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index 69844e5b29..994ed6c1bd 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -1190,11 +1190,11 @@ include_dir 'conf.d'
>  When a password is specified in  or
>   without writing either 
> ENCRYPTED
>  or UNENCRYPTED, this parameter determines whether the
> -password is to be encrypted. The default value is md5, 
> which
> +password is to be hashed. The default value is md5, which
>  stores the password as an MD5 hash. Setting this to 
> plain stores
>  it in plaintext. on and off are also 
> accepted, as
>  aliases for md5 and plain, respectively.  
> Setting
> -this parameter to scram will encrypt th

Re: proposal: possibility to read dumped table's name from file

2023-11-21 Thread Erik Rijkers

Op 11/21/23 om 22:10 schreef Daniel Gustafsson:

On 20 Nov 2023, at 06:20, Pavel Stehule  wrote:





The attached is pretty close to a committable patch IMO, review is welcome on
both the patch and commit message.  I tried to identify all reviewers over the
past 3+ years but I might have missed someone.


I've tested this, albeit mostly in the initial iterations  (*shrug* but 
a mention is nice)


Erik Rijkers



--
Daniel Gustafsson







RE: Synchronizing slots from primary to standby

2023-11-21 Thread Zhijie Hou (Fujitsu)
On Tuesday, November 21, 2023 5:33 PM shveta malik wrote:
> 
> 
> v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, so I rebased
> the patches.  PFA v37_2 patches.

Thanks for updating the patches.

I'd like to discuss one issue related to the correct handling of the failover
flag when executing ALTER SUBSCRIPTION SET (slot_name = 'new_slot').

Since the command intends to use a new slot on the primary, the new slot needs
to reflect the "failover" state that the subscription currently has. If the
failover state of the subscription is LOGICALREP_FAILOVER_STATE_ENABLED, then I
can reset it to LOGICALREP_FAILOVER_STATE_PENDING and allow the apply worker to
handle it the way it is handled today (just like two_phase handling).

But if the failover state is LOGICALREP_FAILOVER_STATE_DISABLED, the original
idea is to call walrcv_alter_slot and alter the slot from the "ALTER
SUBSCRIPTION" handling backend itself. This works if the subscription is
currently disabled. But the "ALTER SUBSCRIPTION SET (slot_name = 'new_slot')"
command is supported even if the subscription is enabled. If the subscription
is enabled, then calling walrcv_alter_slot() fails because the slot is still
acquired by the apply worker.

So, I am thinking: do we need a new mechanism to change the failover flag to
false on an enabled subscription? For example, we could call walrcv_alter_slot
on startup of the apply worker if AllTablesyncsReady(), for both true and false
values of the failover flag. This way, every time the apply worker is started,
it calls walrcv_alter_slot to set the failover flag on the primary.

Or we could just document that it is the user's responsibility to match the
failover property in case they change the slot_name.

Thoughts ?

Best Regards,
Hou zj


Re: [PoC] Reducing planning time when tables have many partitions

2023-11-21 Thread Yuya Watari
Hello Alena, Andrei, and all,

Thank you for reviewing this patch. I really apologize for not
updating this thread for a while.

On Sat, Nov 18, 2023 at 6:04 AM Alena Rybakina  wrote:
> Hi, all!
>
> While I was reviewing the patches, I noticed that they needed some rebasing, 
> and in one of the patches (Introduce-indexes-for-RestrictInfo.patch) there 
> was a conflict with the recently added self-join-removal feature [1]. So, I 
> rebased patches and resolved the conflicts. While I was doing this, I found a 
> problem that I also fixed:

Thank you very much for rebasing these patches and fixing the issue.
The bug seemed to be caused because these indexes were in
RangeTblEntry, and the handling of their serialization was not
correct. Thank you for fixing it.

On Mon, Nov 20, 2023 at 1:45 PM Andrei Lepikhov
 wrote:
> During the work on committing the SJE feature [1], Alexander Korotkov
> pointed out the silver lining in this work [2]: he proposed that we
> shouldn't remove RelOptInfo from simple_rel_array at all but replace it
> with an 'Alias', which will refer the kept relation. It can simplify
> further optimizations on removing redundant parts of the query.

Thank you for sharing this information. I think the idea suggested by
Alexander Korotkov is also helpful for our patch. As mentioned above,
the indexes are in RangeTblEntry in the current implementation.
However, I think RangeTblEntry is not the best place to store them. An
'alias' relids may help solve this and simplify fixing the above bug.
I will try this approach soon.

Unfortunately, I've been busy due to work, so I won't be able to
respond for several weeks. I'm really sorry for not being able to see
the patches. As soon as I'm not busy, I will look at them, consider
the above approach, and reply to this thread. If there is no
objection, I will move this CF entry forward to next CF.

Again, thank you very much for looking at this thread, and I'm sorry
for my late work.

-- 
Best regards,
Yuya Watari




Re: Schema variables - new implementation for Postgres 15

2023-11-21 Thread Julien Rouhaud
Hi,

On Tue, Oct 17, 2023 at 08:52:13AM +0200, Pavel Stehule wrote:
>
> When I thought about global temporary tables, I got one maybe interesting
> idea. The one significant problem of global temporary tables is place for
> storing info about size or column statistics.
>
> I think so these data can be stored simply in session variables. Any global
> temporary table can get assigned one session variable, that can hold these
> data.

I don't know how realistic this would be.  For instance, it will require
properly linking the global temporary table's life cycle with the session
variable, and I'm afraid it would require adding some hacks to make it work as
needed.

But this still raises the question of whether this feature could be used
internally for the need of another feature.  If we think it's likely, should we
try to act right now and reserve the "pg_" prefix for internal use rather than
do that a few years down the line and probably break some user code as it was
done recently for the role names?




Re: Postgres picks suboptimal index after building of an extended statistics

2023-11-21 Thread Andrei Lepikhov

Thanks for the detailed answer,

On 3/11/2023 00:37, Tomas Vondra wrote:
> On 9/25/23 06:30, Andrey Lepikhov wrote:
>> ...
>> I can't stop thinking about this issue. It is bizarre when Postgres
>> chooses a non-unique index if a unique index gives us proof of minimum
>> scan.
>
> That's true, but no one implemented this heuristic. So the "proof of
> minimum scan" is merely hypothetical - the optimizer is unaware of it.

See the simple patch in the attachment. There, I have attempted to
resolve situations of uncertainty to avoid making decisions based solely
on the order of indexes in the list.

>> I don't see a reason to teach the clauselist_selectivity() routine to
>> estimate UNIQUE indexes. We add some cycles, but it will work with btree
>> indexes only.
>
> I'm not sure I understand what this is meant to say. Can you elaborate?
> We only allow UNIQUE for btree indexes anyway, so what exactly is the
> problem here?

Partly, you already answered yourself below: we have unique-index
estimation in a few estimation calls, but go through the list of indexes
each time. Also, for this sake, we would add some input parameters,
usually NULL, because many estimations don't involve indexes at all.

>> Maybe to change compare_path_costs_fuzzily() and add some heuristic, for
>> example:
>> "If selectivity of both paths gives us no more than 1 row, prefer to use
>> a unique index or an index with least selectivity."
>
> I don't understand how this would work. What do you mean by "selectivity
> of a path"? AFAICS the problem here is that we estimate a path to return
> more rows (while we know there can only be 1, thanks to UNIQUE index).

Oops, I meant cardinality. See the patch in the attachment.

> So how would you know the path does not give us more than 1 row? Surely
> you're not proposing compare_path_costs_fuzzily() to do something
> expensive like analyzing the paths, or something.

I solely propose to make the optimizer more consistent in its decisions:
if we have one row for both path candidates, use uniqueness of the index
or the value of selectivity as one more parameter.

> Also, how would it work in upper levels? If you just change which path
> we keep, but leave the inaccurate row estimate in place, that may fix
> that level, but it's certainly going to confuse later planning steps.

It is designed for the scan level only.

> IMHO the problem here is that we produce a wrong estimate, so we better
> fix that, instead of adding band-aids to other places.

Agree. I am looking for a solution to help users somehow resolve such
problems. As an alternative solution, I can propose a selectivity hook
or (maybe even better) - use the pathlist approach and add indexes into
the index list with some predefined order - at first positions, place
unique indexes with more columns, etc.

> This happens because functional dependencies are very simple type of
> statistics - it has some expectations about the input data and also the
> queries executed on it. For example it assumes the data is reasonably
> homogeneous, so that we can calculate a global "degree".
>
> But the data in the example directly violates that - it has 1000 rows
> that are very random (so we'd find no dependencies), and 1000 rows with
> perfect dependencies. Hence we end with degree=0.5, which approximates
> the dependencies to all data. Not great, true, but that's the price for
> simplicity of this statistics kind.
>
> So the simplest solution is to disable dependencies on such data sets.
> It's a bit inconvenient/unfortunate we build dependencies by default,
> and people may not be aware of their assumptions.
>
> Perhaps we could/should make dependency_degree() more pessimistic when
> we find some "violations" of the rule (we intentionally are not strict
> about it, because data are imperfect). I don't have a good idea how to
> change the formulas, but I'm open to the idea in principle.

Thanks for the explanation!

> The other thing we could do is checking for unique indexes in
> clauselist_selectivity, and if we find a match we can just skip the
> extended statistics altogether. Not sure how expensive this would be,
> but for typical cases (with modest number of indexes) perhaps OK. It
> wouldn't work without a unique index, but I don't have a better idea.

It looks a bit expensive to me. But I am ready to try, if the current
solution doesn't look applicable.


--
regards,
Andrei Lepikhov
Postgres Professional
From 21677861e101eed22c829e0b14290a56786a12c2 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Wed, 22 Nov 2023 12:01:39 +0700
Subject: [PATCH] Use an index path with the best selectivity estimation

---
 src/backend/optimizer/util/pathnode.c | 40 +
 .../expected/drop-index-concurrently-1.out| 16 ---
 src/test/regress/expected/functional_deps.out | 43 +++
 src/test/regress/expected/join.out| 40 +
 src/test/regress/sql/functional_deps.sql  | 36 
 5 files changed, 149 insertions(+), 26 deletions(-)

diff --git a/src/backend/optimizer/util/pa

Re: Partial aggregates pushdown

2023-11-21 Thread Alexander Pyhalov

Robert Haas wrote on 2023-11-21 20:16:


>>> I don't think the patch does a good job explaining why HAVING,
>>> DISTINCT, and ORDER BY are a problem. It seems to me that HAVING
>>> shouldn't really be a problem, because HAVING is basically a WHERE
>>> clause that occurs after aggregation is complete, and whether or not
>>> the aggregation is safe shouldn't depend on what we're going to do
>>> with the value afterward. The HAVING clause can't necessarily be
>>> pushed to the remote side, but I don't see how or why it could make
>>> the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a
>>> little trickier: if we pushed down DISTINCT, we'd still have to
>>> re-DISTINCT-ify when combining locally, and if we pushed down ORDER
>>> BY, we'd have to do a merge pass to combine the returned values unless
>>> we could prove that the partitions were non-overlapping ranges that
>>> would be visited in the correct order. Although that all sounds
>>> doable, I think it's probably a good thing that the current patch
>>> doesn't try to handle it -- this is complicated already. But it should
>>> explain why it's not handling it and maybe even a bit about how it
>>> could be handling in the future, rather than just saying "well, this
>>> kind of thing is not safe." The trouble with that explanation is that
>>> it does nothing to help the reader understand whether the thing in
>>> question is *fundamentally* unsafe or whether we just don't have the
>>> right code to make it work.
>>
>> Makes sense.
>
> Actually, I think I was wrong about this. We can't handle ORDER BY or
> DISTINCT because we can't distinct-ify or order after we've already
> partially aggregated. At least not in general, and not without
> additional aggregate support functions. So what I said above was wrong
> with respect to those. Or so I believe, anyway. But I still don't see
> why HAVING should be a problem.


Hi. HAVING is also a problem. Consider the following query:

SELECT count(a) FROM t HAVING count(a) > 10

We can't push it down to the foreign server, as HAVING needs the full
aggregate result, but the foreign server doesn't know that.

--
Best regards,
Alexander Pyhalov,
Postgres Professional




Re: WIP: libpq: add a possibility to not send D(escribe) when executing a prepared statement

2023-11-21 Thread Andrey M. Borodin
Hi Ivan,

thank you for the patch.

> On 22 Nov 2023, at 03:58, Ivan Trofimov  wrote:
> 
> Currently libpq sends B(ind), D(escribe), E(execute), S(ync) when executing a 
> prepared statement.
> The response for that D message is a RowDescription, which doesn't change 
> during prepared
> statement lifetime (with the attributes format being an exception, as they 
> aren't know before execution) .
From my POV the idea seems reasonable (though I’m not a real libpq expert).
BTW some drivers also send Describe even before Bind. This creates some fuss 
for routing connection poolers.

> In a presumably very common case of repeatedly executing the same statement, 
> this leads to
> both client and server parsing/sending exactly the same RowDescritpion data 
> over and over again.
> Instead, library user could acquire a statement result RowDescription once 
> (via PQdescribePrepared),
> and reuse it in subsequent calls to PQexecPrepared and/or its async friends.
But what if query result structure changes? Will we detect this error 
gracefully and return correct error?


Best regards, Andrey Borodin.



Re: Change GUC hashtable to use simplehash?

2023-11-21 Thread John Naylor
On Wed, Nov 22, 2023 at 12:00 AM Jeff Davis  wrote:
>
> On Tue, 2023-11-21 at 16:42 +0700, John Naylor wrote:
> > The strlen call required for hashbytes() is not free.
>
> Should we have a hash_string() that's like hash_bytes() but checks for
> the NUL terminator itself?
>
> That wouldn't be inlinable, but it would save on the strlen() call. It
> might benefit some other callers?

We do have string_hash(), which...calls strlen. :-)

Thinking some more, I'm not quite comfortable with the number of
places in these patches that have to know about the pre-downcased
strings, or whether we need that in the first place. If lower case is
common enough to optimize for, it seems the equality function can just
check strict equality on the char and only on mismatch try downcasing
before returning false. Doing our own function would allow the
compiler to inline it, or at least keep it on the same page. Further,
the old hash function shouldn't need to branch to do the same
downcasing, since hashing is lossy anyway. In the keyword hashes, we
just do "*ch |= 0x20", which downcases letters and turns undercores to
DEL. I can take a stab at that later.
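
To make that concrete, a sketch of such an equality function (assuming
ASCII-only downcasing, like the existing GUC code does):

static inline bool
guc_name_eq(const char *a, const char *b)
{
	while (*a && *b)
	{
		char		ca = *a++;
		char		cb = *b++;

		if (ca != cb)
		{
			/* only on mismatch, downcase A-Z and compare again */
			if (ca >= 'A' && ca <= 'Z')
				ca += 'a' - 'A';
			if (cb >= 'A' && cb <= 'Z')
				cb += 'a' - 'A';
			if (ca != cb)
				return false;
		}
	}
	return *a == *b;			/* equal only if both ended */
}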




initdb --no-locale=C doesn't work as specified when the environment is not C

2023-11-21 Thread Kyotaro Horiguchi
Commit 3e51b278db leaves lc_* conf lines as commented-out when
their value is "C". This leads to the following behavior.

$ echo $LANG
ja_JP.UTF8
$ initdb --no-locale hoge
$ grep lc_ hoge/postgresql.conf
#lc_messages = 'C'  # locale for system error message
#lc_monetary = 'C'  # locale for monetary formatting
#lc_numeric = 'C'   # locale for number formatting
#lc_time = 'C'  # locale for time formatting

In this scenario, the postmaster ends up emitting log messages in
Japanese, which contradicts the documentation.

https://www.postgresql.org/docs/devel/app-initdb.html

> --locale=locale 
>   Sets the default locale for the database cluster. If this option is
>   not specified, the locale is inherited from the environment that
>   initdb runs in. Locale support is described in Section 24.1.
> 
..
> --lc-messages=locale
>   Like --locale, but only sets the locale in the specified category.

Here's a somewhat amusing case:

$ echo $LANG
ja_JP.UTF8
$ initdb --lc-messages=C hoge
$ grep lc_ hoge/postgresql.conf 
#lc_messages = 'C'  # locale for system error message
lc_monetary = 'ja_JP.UTF8'  # locale for monetary formatting
lc_numeric = 'ja_JP.UTF8'   # locale for number formatting
lc_time = 'ja_JP.UTF8'  # locale for time formatting

Hmm, it seems that initdb replaces the values of all categories
*except the specified one*. This behavior seems incorrect to
me. initdb should replace the value when explicitly specified in the
command line. If you use -c lc_messages=C, it does perform the
expected behavior to some extent, but I believe this is a separate
matter.

I have doubts about not replacing these lines for purely cosmetic
reasons. In this mail, I've attached three possible solutions for the
original issue: the first one enforces replacement only when specified
on the command line, the second one simply always performs
replacement, and the last one addresses the concern about the absence
of quotes around "C" by allowing explicit specification. (FWIW, I
prefer the last one.)

What do you think about these?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..56a8c5b60e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -144,6 +144,10 @@ static char *lc_monetary = NULL;
 static char *lc_numeric = NULL;
 static char *lc_time = NULL;
 static char *lc_messages = NULL;
+static bool lc_monetary_specified = false;
+static bool lc_numeric_specified = false;
+static bool lc_time_specified = false;
+static bool lc_messages_specified = false;
 static char locale_provider = COLLPROVIDER_LIBC;
 static char *icu_locale = NULL;
 static char *icu_rules = NULL;
@@ -1230,19 +1234,19 @@ setup_config(void)
 * Hack: don't replace the LC_XXX GUCs when their value is 'C', because
 * replace_guc_value will decide not to quote that, which looks strange.
 */
-   if (strcmp(lc_messages, "C") != 0)
+   if (strcmp(lc_messages, "C") != 0 || lc_messages_specified)
        conflines = replace_guc_value(conflines, "lc_messages",
                                      lc_messages, false);
 
-   if (strcmp(lc_monetary, "C") != 0)
+   if (strcmp(lc_monetary, "C") != 0 || lc_monetary_specified)
        conflines = replace_guc_value(conflines, "lc_monetary",
                                      lc_monetary, false);
 
-   if (strcmp(lc_numeric, "C") != 0)
+   if (strcmp(lc_numeric, "C") != 0 || lc_numeric_specified)
        conflines = replace_guc_value(conflines, "lc_numeric",
                                      lc_numeric, false);
 
-   if (strcmp(lc_time, "C") != 0)
+   if (strcmp(lc_time, "C") != 0 || lc_time_specified)
        conflines = replace_guc_value(conflines, "lc_time",
                                      lc_time, false);
 
@@ -2374,6 +2378,15 @@ setlocales(void)
icu_locale = locale;
}
 
+   /*
+    * At this point, null means that the value is not specified on the
+    * command line.
+    */
+   if (lc_numeric != NULL) lc_numeric_specified = true;
+   if (lc_time != NULL) lc_time_specified = true;
+   if (lc_monetary != NULL) lc_monetary_specified = true;
+   if (lc_messages != NULL) lc_messages_specified = true;
+
/*
 * canonicalize locale names, and obtain any missing values from our
 * current environment
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 0c6f5ceb0a..646e8f29f6 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -1226,25 +1226,17 @@ setup_config(void)
conflines = replace_guc_value(conflines, "shared_buf

Re: Make mesage at end-of-recovery less scary.

2023-11-21 Thread Kyotaro Horiguchi
Anyway, this requires rebasing, and done.

Thanks to John (Naylor) for pointing this out.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e56f1f24523e3e562a4db166dfeaadc79fd7b27a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Tue, 7 Mar 2023 14:55:58 +0900
Subject: [PATCH v27] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary, as "reached end of WAL". Instead, raise
the error level for other kinds of WAL failure to WARNING.  To make
sure that the detection is correct, this patch checks whether all
trailing bytes in the same page are zeroed in that case.
---
 src/backend/access/transam/xlogreader.c   | 134 ++
 src/backend/access/transam/xlogrecovery.c |  94 +++
 src/backend/replication/walreceiver.c |   7 +-
 src/bin/pg_waldump/pg_waldump.c   |  13 ++-
 src/include/access/xlogreader.h   |   1 +
 src/test/recovery/t/035_recovery.pl   | 130 +
 6 files changed, 327 insertions(+), 52 deletions(-)
 create mode 100644 src/test/recovery/t/035_recovery.pl

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..ce65f99c60 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -48,6 +48,8 @@ static int	ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
 			 int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
 		pfree(state);
 		return NULL;
 	}
+	state->EndOfWAL = false;
 	state->errormsg_buf[0] = '\0';
 
 	/*
@@ -525,6 +528,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
 	/* reset error state */
 	state->errormsg_buf[0] = '\0';
 	decoded = NULL;
+	state->EndOfWAL = false;
 
 	state->abortedRecPtr = InvalidXLogRecPtr;
 	state->missingContrecPtr = InvalidXLogRecPtr;
@@ -608,16 +612,12 @@ restart:
 	Assert(pageHeaderSize <= readOff);
 
 	/*
-	 * Read the record length.
+	 * Verify the record header.
 	 *
 	 * NB: Even though we use an XLogRecord pointer here, the whole record
-	 * header might not fit on this page. xl_tot_len is the first field of the
-	 * struct, so it must be on this page (the records are MAXALIGNed), but we
-	 * cannot access any other fields until we've verified that we got the
-	 * whole header.
+	 * header might not fit on this page.
 	 */
 	record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-	total_len = record->xl_tot_len;
 
 	/*
 	 * If the whole record header is on this page, validate it immediately.
@@ -636,19 +636,19 @@ restart:
 	}
 	else
 	{
-		/* There may be no next page if it's too small. */
-		if (total_len < SizeOfXLogRecord)
-		{
-			report_invalid_record(state,
-  "invalid record length at %X/%X: expected at least %u, got %u",
-  LSN_FORMAT_ARGS(RecPtr),
-  (uint32) SizeOfXLogRecord, total_len);
+		/*
+		 * xl_tot_len is the first field of the struct, so it must be on this
+		 * page (the records are MAXALIGNed), but we cannot access any other
+		 * fields until we've verified that we got the whole header.
+		 */
+		if (!ValidXLogRecordLength(state, RecPtr, record))
 			goto err;
-		}
-		/* We'll validate the header once we have the next page. */
+
 		gotheader = false;
 	}
 
+	total_len = record->xl_tot_len;
+
 	/*
 	 * Try to find space to decode this record, if we can do so without
 	 * calling palloc.  If we can't, we'll try again below after we've
@@ -1091,25 +1091,81 @@ XLogReaderInvalReadState(XLogReaderState *state)
 	state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+	  XLogRecord *record)
+{
+	if (record->xl_tot_len == 0)
+	{
+		char	   *p;
+		char	   *pe;
+
+		/*
+		 * We are almost sure reaching the end of WAL, make sure that the
+		 * whole page after the recor
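
A rough sketch of the trailing-zeros check the commit message describes (my
illustration, not necessarily the patch's exact code):

	if (record->xl_tot_len == 0)
	{
		char	   *p = (char *) record;
		char	   *pe = (char *) state->readBuf + XLOG_BLCKSZ;

		/* skip over the zero bytes up to the end of the page */
		while (p < pe && *p == 0)
			p++;

		if (p == pe)
		{
			/* the whole rest of the page is zeroed: genuine end of WAL */
			state->EndOfWAL = true;
			report_invalid_record(state, "reached end of WAL at %X/%X",
								  LSN_FORMAT_ARGS(RecPtr));
			return false;
		}
	}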

Re: remaining sql/json patches

2023-11-21 Thread Andres Freund
Hi,

On 2023-11-22 15:09:36 +0900, Amit Langote wrote:
> OK, I will keep polishing 0001-0003 with the intent to push it next
> week barring objections / damning findings.

I don't think the patchset is quite there yet. It's definitely getting closer
though!  I'll try to do another review next week.

Greetings,

Andres Freund




Re: Synchronizing slots from primary to standby

2023-11-21 Thread Peter Smith
In addition to my recent v35-0001 comment not yet addressed [1], here
are some review comments for patch v37-0001.

==
src/backend/replication/walsender.c

1. PhysicalWakeupLogicalWalSnd
+/*
+ * Wake up logical walsenders with failover-enabled slots if the physical slot
+ * of the current walsender is specified in standby_slot_names GUC.
+ */
+void
+PhysicalWakeupLogicalWalSnd(void)
+{
+	ListCell   *lc;
+	List	   *standby_slots;
+	bool		slot_in_list = false;
+
+	Assert(MyReplicationSlot != NULL);
+	Assert(SlotIsPhysical(MyReplicationSlot));
+
+	standby_slots = GetStandbySlotList(false);
+
+	foreach(lc, standby_slots)
+	{
+		char	   *name = lfirst(lc);
+
+		if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
+		{
+			slot_in_list = true;
+			break;
+		}
+	}
+
+	if (slot_in_list)
+		ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
+}

1a.
Easier to have a single assertion -- Assert(MyReplicationSlot &&
SlotIsPhysical(MyReplicationSlot));

~

1b.
Why bother with the 'slot_in_list' and break, when you can just call
the ConditionVariableBroadcast() and return without having the extra
variable?
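
For example (sketch only, building on the quoted code), the loop could
broadcast and return directly:

	foreach(lc, standby_slots)
	{
		char	   *name = lfirst(lc);

		if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
		{
			ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
			return;
		}
	}

dropping 'slot_in_list' entirely.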

==
src/test/recovery/t/050_verify_slot_order.pl

~~~

2.
Should you name the global objects with a 'regress_' prefix, which
seems to be the standard for other new TAP tests?

~~~

3.
+#
+#| > standby1 (connected via streaming replication)
+# | > standby2 (connected via streaming replication)
+# primary - |
+# | > subscriber1 (connected via logical replication)
+# | > subscriber2 (connected via logical replication)
+#
+#
+# Set up is configured in such a way that primary never lets subscriber1 ahead
+# of standby1.

3a.
Misaligned "|" in comment?

~

3b.
IMO it would be better to give an overview of how this all works
instead of just saying "configured in such a way".


~~~

4.
+# Configure primary to disallow specified logical replication slot (lsub1_slot)
+# getting ahead of specified physical replication slot (sb1_slot).
+$primary->append_conf(

It is confusing because there is no "lsub1_slot" specified anywhere
until much later. Would you be able to provide some more details?

~~~

5.
+# Create another subscriber node, wait for sync to complete
+my $subscriber2 = PostgreSQL::Test::Cluster->new('subscriber2');
+$subscriber2->init(allows_streaming => 'logical');
+$subscriber2->start;
+$subscriber2->safe_psql('postgres', "CREATE TABLE tab_int (a int PRIMARY KEY);");
+$subscriber2->safe_psql('postgres',
+ "CREATE SUBSCRIPTION mysub2 CONNECTION '$publisher_connstr' "
+   . "PUBLICATION mypub WITH (slot_name = lsub2_slot);");
+$subscriber2->wait_for_subscription_sync;

Maybe this comment should explicitly say there is no failover enabled
here. Maybe the SUBSCRIPTION should explicitly set failover=false?

~~~

6.
+# The subscription that's up and running and is enabled for failover
+# doesn't get the data from primary and keeps waiting for the
+# standby specified in standby_slot_names.
+$result = $subscriber1->safe_psql('postgres',
+ "SELECT count(*) = 0 FROM tab_int;");
+is($result, 't', "subscriber1 doesn't get data from primary until standby1 acknowledges changes");

Might it be better to write as "SELECT count(*) = $primary_row_count
FROM tab_int;" and expect it to return false?

==
src/test/regress/expected/subscription.out

7.
Everything here displays the "Failover" state 'd' (disabled). How
about tests for different state values?

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD55bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia