Re: Synchronizing slots from primary to standby
On Tue, Nov 21, 2023 at 10:02 AM Zhijie Hou (Fujitsu) wrote: > > On Friday, November 17, 2023 7:39 PM Amit Kapila > wrote: > > > > On Thu, Nov 16, 2023 at 5:34 PM shveta malik > > wrote: > > > > > > PFA v35. > > > > > > > Review v35-0002* > > == > > Thanks for the comments. > > > 1. > > As quoted in the commit message, > > > > > If a logical slot is invalidated on the primary, slot on the standby is also > > invalidated. If a logical slot on the primary is valid but is invalidated > > on the > > standby due to conflict (say required rows removed on the primary), then > > that > > slot is dropped and recreated on the standby in next sync-cycle. > > It is okay to recreate such slots as long as these are not consumable on the > > standby (which is the case currently). > > > > > > > I think this won't happen normally because of the physical slot and > > hot_standby_feedback but probably can occur in cases like if the user > > temporarily switches hot_standby_feedback from on to off. Are there any > > other > > reasons? I think we can mention the cases along with it as well at least > > for now. > > Additionally, I think this should be covered in code comments as well. > > I will collect all these cases and update in next version. > > > > > 2. > > #include "postgres.h" > > - > > +#include "access/genam.h" > > > > Spurious line removal. > > Removed. > > > > > 3. > >A password needs to be provided too, if the sender demands > > password > >authentication. It can be provided in the > >primary_conninfo string, or in a separate > > - ~/.pgpass file on the standby server (use > > - replication as the database name). > > - Do not specify a database name in the > > - primary_conninfo string. > > + ~/.pgpass file on the standby server. > > + > > + > > + Specify dbname in > > + primary_conninfo string to allow > > synchronization > > + of slots from the primary server to the standby server. > > + This will only be used for slot synchronization. It is ignored > > + for streaming. > > > > Is there a reason to remove part of the earlier sentence "use > > replication as the database name"? > > Added it back. > > > > > 4. > > + enable_syncslot configuration > > parameter > > + > > + > > + > > + > > +It enables a physical standby to synchronize logical failover slots > > +from the primary server so that logical subscribers are not blocked > > +after failover. > > + > > + > > +It is enabled by default. This parameter can only be set in the > > +postgresql.conf file or on the server > > command line. > > + > > > > I think you forgot to update the documentation for the default value of this > > variable. > > Updated. > > > > > 5. > > + * a) start the logical replication workers for every enabled > > subscription > > + * when not in standby_mode > > + * b) start the slot-sync worker for logical failover slots > > synchronization > > + * from the primary server when in standby_mode. > > > > Either use a full stop after both lines or none of these. > > Added a full stop. > > > > > 6. > > +static void slotsync_worker_cleanup(SlotSyncWorkerInfo * worker); > > > > There shouldn't be space between * and the worker. > > Removed, and added the type to typedefs.list. > > > > > 7. 
> > + if (!SlotSyncWorker->hdr.in_use) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker not initialized, " > > + "cannot attach"))); > > + } > > + > > + if (SlotSyncWorker->hdr.proc) > > + { > > + LWLockRelease(SlotSyncWorkerLock); > > + ereport(ERROR, > > + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > > + errmsg("replication slot-sync worker is " > > + "already running, cannot attach"))); > > + } > > > > Using slot-sync in the error messages looks a bit odd to me. Can we use > > "replication slot sync worker ..." in both these and other similar > > messages? I > > think it would be better if we don't split the messages into multiple lines > > in > > these cases as messages don't appear too long to me. > > Changed as suggested. > > > > > 8. > > +/* > > + * Detach the worker from DSM and update 'proc' and 'in_use'. > > + * Logical replication launcher will come to know using these > > + * that the worker has shutdown. > > + */ > > +void > > +slotsync_worker_detach(int code, Datum arg) { > > > > I think the reference to DSM is leftover from the previous version of the > > patch. > > Can we change the above comments as per the new code? > > Changed. > > > > > 9. > > +static bool > > +slotsync_worker_launch() > > { > > ... > > + /* TODO: do we really need 'generation', analyse more here */ > > + worker->hdr.generation++; > > > > We should do something about this TODO.
Re: Synchronizing slots from primary to standby
On Tue, Nov 21, 2023 at 1:13 PM Drouvot, Bertrand wrote: > > Hi, > > On 11/21/23 6:16 AM, Amit Kapila wrote: > > On Mon, Nov 20, 2023 at 6:51 PM Drouvot, Bertrand > > wrote: > >> As far the 'i' state here, from what I see, it is currently useful for: > >> > >> 1. Cascading standby to not sync slots with state = 'i' from > >> the first standby. > >> 2. Easily report Slots that did not catch up on the primary yet. > >> 3. Avoid inactive slots to block "active" ones creation. > >> > >> So not creating those slots should not be an issue for 1. (sync are > >> not needed on cascading standby as not created on the first standby yet) > >> but is an issue for 2. (unless we provide another way to keep track and > >> report > >> such slots) and 3. (as I think we should still need to reserve WAL). > >> > >> I've a question: we'd still need to reserve WAL for those slots, no? > >> > >> If that's the case and if we don't call ReplicationSlotCreate() then > >> ReplicationSlotReserveWal() > >> would not work as MyReplicationSlot would be NULL. > >> > > > > Yes, we need to reserve WAL to see if we can sync the slot. We are > > currently creating an RS_EPHEMERAL slot and if we don't explicitly > > persist it when we can't sync, then it will be dropped when we do > > ReplicationSlotRelease() at the end of synchronize_one_slot(). So, the > > loss is probably, the next time we again try to sync the slot, we need > > to again create it and may need to wait for newer restart_lsn on > > standby > > Yeah, and doing so we'd reduce the time window to give the slot a chance > to catch up (as opposed to create it a single time and maintain an 'i' state). > > > which could be avoided if we have the slot in 'i' state from > > the previous run. > > Right. > > > I don't deny the importance of having 'i' > > (initialized) state but was just trying to say that it has additional > > code complexity. > > Right, and I think it's worth it. > > > OTOH, having it may give better visibility to even > > users about slots that are not active (say manually created slots on > > the primary). > > Agree. > > All that being said, on my side I'm +1 on keeping the 'i' state behavior > as it is implemented currently (would be happy to hear others' opinions too). > +1 for 'i' state. I feel it gives a better slot-sync functionality (optimizing redo-effort for inactive slots, inactive not blocking active ones) along with its usage for monitoring purposes.
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
On Mon, 2023-11-20 at 21:13 -0500, Tom Lane wrote: > Bruce Momjian writes: > > On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote: > > > Bruce Momjian writes: > > > > An alternate approach would > > > > be to remove pg_attribute.attndims so we don't even try to preserve > > > > dimensionality. > > > > I could get behind that, perhaps. It looks like we're not using the > > > field in any meaningful way, and we could simplify TupleDescInitEntry > > > and perhaps some other APIs. > > > So should I work on that patch or do you want to try? I think we should > > do something. > > Let's wait for some other opinions, first ... Looking at the code, I get the impression that we wouldn't lose anything without "pg_attribute.attndims", so +1 for removing it. This would call for some documentation. We should remove most of the documentation about the non-existing difference between declaring a column "integer[]", "integer[][]" or "integer[3][3]" and just describe the first variant in detail, perhaps mentioning that the other notations are accepted for backward compatibility. I also think that it would be helpful to emphasize that while dimensionality does not matter to a column definition, it matters for individual array values. Perhaps it would make sense to recommend a check constraint if one wants to make sure that an array column should contain only a certain kind of array. Yours, Laurenz Albe
Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)
On Sat, Nov 18, 2023 at 4:54 PM Amit Kapila wrote: > > On Thu, Nov 16, 2023 at 6:09 PM Bharath Rupireddy > wrote: > > PSA v18 patch. > > > > LGTM. I'll push this early next week unless there are further > suggestions or comments. > Pushed. -- With Regards, Amit Kapila.
Re: How to accurately determine when a relation should use local buffers?
Hi, > I would like to clarify, what the correct way is to determine that a given > relation is using local buffers. Local buffers, as far as I know, are used > for temporary tables in backends. There are two functions/macros (bufmgr.c): > SmgrIsTemp, RelationUsesLocalBuffers. The first function verifies that the > current process is a regular session backend, while the other macro verifies > the relation persistence characteristic. It seems, the use of each function > independently is not correct. I think, these functions should be applied in > pair to check for local buffers use, but, it seems, these functions are used > independently. It works until temporary tables are allowed only in session > backends. Could you please provide a specific example when the current code will do something wrong/unintended? > I'm concerned, how to determine the use of local buffers in some other > theoretical cases? For example, if we decide to replicate temporary tables? > Are there the other cases, when local buffers can be used with relations in > the Vanilla? Do we allow the use of relations with RELPERSISTENCE_TEMP not > only in session backends? Temporary tables, by definition, are visible only within one session. I can't imagine how and why they would be replicated. -- Best regards, Aleksander Alekseev
Re: [PATCH] fix race condition in libpq (related to ssl connections)
Hi, > I've found a race condition in libpq. It is about the initialization of > the my_bio_methods static variable in fe-secure-openssl.c, which is not > protected by any lock. The race condition may make the initialization of > the connection fail, and as an additional weird consequence, it might > cause openssl call close(0), so stdin of the client application gets > closed. Thanks for the patch. Interestingly enough we have PQisthreadsafe() [1], but its current implementation is: ``` /* libpq is thread-safe? */ int PQisthreadsafe(void) { return true; } ``` I wonder if we should just document that libpq is thread safe as of PG v??? and deprecate PQisthreadsafe() at some point. Currently the documentation gives the impression that the library may or may not be thread safe depending on the circumstances. > I've prepared a patch to protect the initialization of my_bio_methods > from the race. This is my first patch submission to the postgresql > project, so I hope I didn't miss anything. Any comments and suggestions > are of course very welcome. > > I also prepared a testcase. In the testcase tarball, there is a patch > that adds sleeps at the right positions to make the close(0) problem > occur practically always. It also includes comments to explain how the > race can end up calling close(0). > > Concerning the patch, it is only tested on Linux. I'm unsure about > whether the simple initialization of the mutex would work nowadays also > on Windows or whether the more complicated initialization also to be > found for the ssl_config_mutex in the same source file needs to be used. > Let me know whether I should adapt that. Please add the patch to the nearest open commit fest [2]. The patch will be automatically picked up by cfbot [3] and tested on different platforms. Also, this way it will not be lost among other patches. The code looks OK but I would appreciate a second opinion from cfbot. Also, maybe a few comments before my_BIO_methods_init_mutex and/or pthread_mutex_lock would be appropriate. Personally I am inclined to think that the automatic test in this particular case is redundant. [1]: https://www.postgresql.org/docs/current/libpq-threading.html [2]: https://commitfest.postgresql.org/ [3]: http://cfbot.cputube.org/ -- Best regards, Aleksander Alekseev
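For readers following along, here is a rough, generic sketch of the kind of once-only initialization guard being discussed. It is not the submitted patch: the names and the placeholder helper are made up, and it uses the simple static POSIX mutex initializer that the submitter is unsure about on Windows.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for libpq's lazily created BIO method table. */
static void *my_bio_methods = NULL;
static pthread_mutex_t my_bio_methods_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Placeholder for the real OpenSSL BIO_METHOD construction. */
static void *
create_bio_methods(void)
{
	static int	dummy;

	return &dummy;
}

/*
 * Return the shared method table, initializing it exactly once even when
 * several threads open SSL connections concurrently.  Without the mutex,
 * two threads can race and one of them may use a half-initialized table.
 */
static void *
get_bio_methods(void)
{
	void	   *methods;

	pthread_mutex_lock(&my_bio_methods_mutex);
	if (my_bio_methods == NULL)
		my_bio_methods = create_bio_methods();
	methods = my_bio_methods;
	pthread_mutex_unlock(&my_bio_methods_mutex);

	return methods;
}
```

The real fix would also need to handle a failed allocation inside the critical section and, as noted above, may need the more elaborate Windows-safe mutex initialization already used for ssl_config_mutex.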
Re: Change GUC hashtable to use simplehash?
On Mon, Nov 20, 2023 at 5:54 AM Jeff Davis wrote: > > Attached are a bunch of tiny patches and some perf numbers based on > simple test described here: > > https://www.postgresql.org/message-id/04c8592dbd694e4114a3ed87139a7a04e4363030.camel%40j-davis.com I tried taking I/O out, like this, thinking the times would be less variable: cat bench.sql select 1 from generate_series(1,50) x(x), lateral (SELECT inc_ab(x)) a offset 1000; (with turbo off) pgbench -n -T 30 -f bench.sql -M prepared master: latency average = 643.625 ms 0001-0005: latency average = 607.354 ms ...about 5.5% less time, similar to what Jeff found. I get a noticeable regression in 0002, though, and I think I see why: guc_name_hash(const char *name) { - uint32 result = 0; + const unsigned char *bytes = (const unsigned char *)name; + int blen = strlen(name); The strlen call required for hashbytes() is not free. The lack of mixing in the (probably inlined after 0001) previous hash function can be remedied directly, as in the attached: 0001-0002 only: latency average = 670.059 ms 0001-0002, plus revert hashbytes, add finalizer: latency average = 656.810 ms -#define SH_EQUAL(tb, a, b) (guc_name_compare(a, b) == 0) +#define SH_EQUAL(tb, a, b) (strcmp(a, b) == 0) Likewise, I suspect calling out to the C library is going to throw away some of the gains that were won by not needing to downcase all the time, but I haven't dug deeper. diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index b51d10dbc0..124b8fbe85 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -1363,10 +1363,17 @@ guc_name_compare(const char *namea, const char *nameb) static uint32 guc_name_hash(const char *name) { - const unsigned char *bytes = (const unsigned char *)name; - int blen = strlen(name); + uint32 result = 0; - return hash_bytes(bytes, blen); + while (*name) + { + char ch = *name++; + + /* Merge into hash ... not very bright, but it needn't be */ + result = pg_rotate_left32(result, 5); + result ^= (uint32) ch; + } + return murmurhash32(result); } /*
Re: Typo with amtype = 's' in opr_sanity.sql
Hi, > While rebasing a patch from 2016 related to sequence AMs (more about > that later), I've bumped on a mistake from 8586bf7ed888 in > opr_sanity.sql, as of: > +SELECT p1.oid, p1.amname, p2.oid, p2.proname > +FROM pg_am AS p1, pg_proc AS p2 > +WHERE p2.oid = p1.amhandler AND p1.amtype = 's' AND Good catch. > It seems to me that this has been copy-pasted on HEAD from the > sequence AM patch, but forgot to update amtype to 't'. While that's > maybe cosmetic, I think that this could lead to unexpected results, so > perhaps there is a point in doing a backpatch? I disagree that it's cosmetic. The test doesn't check what it's supposed to. -- Best regards, Aleksander Alekseev
Re: How to accurately determine when a relation should use local buffers?
Hi Aleksander, Thank you for the reply. > Could you please provide a specific example when the current code will do > something wrong/unintended? I can't say that something is wrong in vanilla. But if you decide to replicate DDL in some solutions like multimaster, you might want to replicate CREATE TEMPORARY TABLE. Furthermore, there is some possible inconsistency in the code shown below in the bufmgr.c file (REL_16_STABLE): - FlushRelationBuffers and PrefetchBuffer use RelationUsesLocalBuffers(rel). - ExtendBufferedRel_common finally uses BufferManagerRelation.relpersistence, which is actually rd_rel->relpersistence and works like RelationUsesLocalBuffers. - ReadBuffer_common uses isLocalBuf = SmgrIsTemp(smgr), which checks rlocator.backend for InvalidBackendId. I would like to clarify: do we completely refuse the use of temporary tables in contexts other than backends, or is there some work in progress to allow other usage contexts? If so, the check of rd_rel->relpersistence is enough. I am not sure why we use SmgrIsTemp instead of RelationUsesLocalBuffers in ReadBuffer_common. With best regards, Vitaly Davydov Tue, Nov 21, 2023 at 11:52, Aleksander Alekseev : > Hi, > > > I would like to clarify, what the correct way is to determine that a > given relation is using local buffers. Local buffers, as far as I know, are > used for temporary tables in backends. There are two functions/macros > (bufmgr.c): SmgrIsTemp, RelationUsesLocalBuffers. The first function > verifies that the current process is a regular session backend, while the > other macro verifies the relation persistence characteristic. It seems, the > use of each function independently is not correct. I think, these functions > should be applied in pair to check for local buffers use, but, it seems, > these functions are used independently. It works until temporary tables are > allowed only in session backends. > > Could you please provide a specific example when the current code will > do something wrong/unintended? > > > I'm concerned, how to determine the use of local buffers in some other > theoretical cases? For example, if we decide to replicate temporary tables? > Are there the other cases, when local buffers can be used with relations in > the Vanilla? Do we allow the use of relations with RELPERSISTENCE_TEMP not > only in session backends? > > Temporary tables, by definition, are visible only within one session. > I can't imagine how and why they would be replicated. > > -- > Best regards, > Aleksander Alekseev > -- Best regards, Vitaly Davydov http://www.vdavydov.ru
RE: Random pg_upgrade test failure on drongo
Dear hackers, This email provides an update. The machine drongo failed the test a week ago [1] and I finally got the logfiles. PSA files. ## Observed failure pg_upgrade_server.log is a server log during the pg_upgrade command. According to it, the TRUNCATE command seems to have failed due to a "File exists" error. ``` 2023-11-15 00:02:02.239 UTC [1752:18] 003_logical_slots.pl ERROR: could not create file "base/1/2683": File exists 2023-11-15 00:02:02.239 UTC [1752:19] 003_logical_slots.pl STATEMENT: -- For binary upgrade, preserve pg_largeobject and index relfilenodes SELECT pg_catalog.binary_upgrade_set_next_index_relfilenode('2683'::pg_catalog.oid); SELECT pg_catalog.binary_upgrade_set_next_heap_relfilenode('2613'::pg_catalog.oid); TRUNCATE pg_catalog.pg_largeobject; ... ``` ## Analysis I think it was caused by the STATUS_DELETE_PENDING failure and is not related to the recent updates for pg_upgrade. The file "base/1/2683" is an index file for pg_largeobject_loid_pn_index, and the output means that the file creation failed. Below is a backtrace. ``` pgwin32_open() // <-- this returns -1 open() BasicOpenFilePerm() PathNameOpenFilePerm() PathNameOpenFile() mdcreate() smgrcreate() RelationCreateStorage() RelationSetNewRelfilenumber() ExecuteTruncateGuts() ExecuteTruncate() ``` But this is strange. Before calling mdcreate(), we surely unlink the file which has the same name. Below is a trace down to unlink. ``` pgunlink() unlink() mdunlinkfork() mdunlink() smgrdounlinkall() RelationSetNewRelfilenumber() // common path with above ExecuteTruncateGuts() ExecuteTruncate() ``` I found that Thomas said [2] that pgunlink sometimes cannot actually remove the file even if it returns OK; in that case the NTSTATUS is STATUS_DELETE_PENDING. Also, a comment in pgwin32_open_handle() mentions the same thing: ``` /* * ERROR_ACCESS_DENIED is returned if the file is deleted but not yet * gone (Windows NT status code is STATUS_DELETE_PENDING). In that * case, we'd better ask for the NT status too so we can translate it * to a more Unix-like error. We hope that nothing clobbers the NT * status in between the internal NtCreateFile() call and CreateFile() * returning. * ``` The definition of STATUS_DELETE_PENDING can be seen in [3]. Based on that, open() could indeed fail with STATUS_DELETE_PENDING if the deletion of the file is still pending when we try to open it. Another thread [4] also hits the issue while doing rmtree->unlink, and there the removal is retried if it fails with STATUS_DELETE_PENDING. So, should we retry the open when it fails as well? Anyway, such a fix seems out of scope for pg_upgrade. What do you think? Do you have any thoughts about it? ## Acknowledgement I want to say thanks to Sholk and Vingesh for helping with the analysis. [1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2023-11-15%2006%3A16%3A15 [2]: https://www.postgresql.org/message-id/CA%2BhUKGKsdzw06c5nnb%3DKYG9GmvyykoVpJA_VR3k0r7cZOKcx6Q%40mail.gmail.com [3]: https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-erref/596a1078-e883-4972-9bbc-49e60bebca55 [4]: https://www.postgresql.org/message-id/flat/20220919213217.ptqfdlcc5idk5xup%40awork3.anarazel.de#6ae5e2ba3dd6e1fd680dcc34eab710d5 Best Regards, Hayato Kuroda FUJITSU LIMITED
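To make the retry suggestion concrete, here is a minimal, hypothetical sketch of what a bounded retry around the failing open could look like. It is not the actual pgwin32_open()/BasicOpenFilePerm() code; the retry count, the sleep interval, and the assumption that the pending delete surfaces as EEXIST (as the "File exists" log message suggests) are all placeholders.

```c
#include "postgres.h"

#include <errno.h>
#include <fcntl.h>

#include "miscadmin.h"			/* for pg_usleep() */

/*
 * Hypothetical helper: retry an open() that can transiently fail on
 * Windows while a previous unlink() of the same file name is still in
 * the STATUS_DELETE_PENDING state.
 */
static int
open_with_retry(const char *path, int flags, int mode)
{
	for (int attempt = 0; attempt < 10; attempt++)
	{
		int			fd = open(path, flags, mode);

		if (fd >= 0)
			return fd;

		/*
		 * EEXIST here may just mean that the previously unlinked file has
		 * not finished disappearing yet, so wait a little and try again.
		 */
		if (errno != EEXIST)
			break;

		pg_usleep(100 * 1000L); /* 100ms */
	}

	return -1;					/* give up; caller reports the error */
}
```

Whether such a retry belongs in the backend's file-creation path or, as the comment in pgwin32_open_handle() hints, down in the Windows open/unlink wrappers is exactly the open question above.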
Re: Why is src/test/modules/committs/t/002_standby.pl flaky?
Hello Thomas, 10.11.2023 06:31, Thomas Munro wrote: Here is a new attempt to fix this mess. Disclaimer: this based entirely on reading the manual and vicariously hacking a computer I don't have via CI. As it also might (and I would like it to) be the final attempt, I decided to gather information and all the cases that we had on this topic. At least, for the last 5 years we've seen: [1] 2019-01-18: Re: BUG #15598: PostgreSQL Error Code is not reported when connection terminated due to idle-in-transaction timeout test 099_case_15598 made (in attachment) no commit [2] 2019-01-22: Rare SSL failures on eelpout references [1] test 099_rare_ssl_failures made (in attachment) commit 2019-03-19 1f39a1c06: Restructure libpq's hqandling of send failures. [3] 2019-11-28: pgsql: Add tests for '-f' option in dropdb utility. test 051_dropdb_force was proposed (included in the attachment) commit 2019-11-30 98a9b37ba: Revert commits 290acac92b and 8a7e9e9dad. [4] 2019-12-06: closesocket behavior in different platforms references [1], [2], [3]; a documentation change proposed no commit [5] 2020-06-03: libpq copy error handling busted test 099_pgbench_with_server_off made (in attachment) commit 2020-06-07 7247e243a: Try to read data from the socket in pqSendSome's write_failed paths. (a fix for 1f39a1c06) [6] 2020-10-19: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows no commit [7] 2021-11-17: Windows: Wrong error message at connection termination references [6] commit: 2021-12-02 6051857fc: On Windows, close the client socket explicitly during backend shutdown. [8] 2021-12-05 17:03:18: MSVC SSL test failure commit: 2021-12-07 ed52c3707: On Windows, also call shutdown() while closing the client socket. [9] 2021-12-30: Why is src/test/modules/committs/t/002_standby.pl flaky? additional test 099_postgres_fdw_disconnect made (in attachment) commit 2022-01-26 75674c7ec: Revert "graceful shutdown" changes for Windows, in back branches only. (REL_14_STABLE) commit 2022-03-22 29992a6a5: Revert "graceful shutdown" changes for Windows. (master) [10] 2022-02-02 19:19:22: BUG #17391: While using --with-ssl=openssl and PG_TEST_EXTRA='ssl' options, SSL tests fail on OpenBSD 7.0 commit 2022-02-12 335fa5a26: Fix thinko in PQisBusy(). (a fix for 1f39a1c06) commit 2022-02-12 faa189c93: Move libpq's write_failed mechanism down to pqsecure_raw_write(). (a fix for 1f39a1c06) As it becomes difficult to test/check all those cases scattered around, I decided to retell the whole story by means of tests. Please look at the script win-sock-tests-01.cmd attached, which can be executed both on Windows (as regular cmd) and on Linux (bash win-sock-tests-01.cmd). At the end of the script we can see several things. First, the last patchset you posted, applied to b282fa88d~1, fixes the issue discussed in this thread (it eliminates failures of commit_ts_002_standby (and also 099_postgres_fdw_disconnect)). Second, with the sleep added (see [6]), I had the same results of `meson test` on Windows and on Linux. Namely, there are some tests failing (see win-sock-tests-01.cmd) due to walsender preventing server stop. I describe this issue separately (see details in walsender-cannot-exit.txt; maybe it's worth to discuss it in a separate thread) as it's kind of off-topic. With the supplementary sleep() added to WalSndLoop(), the complete `meson test` passes successfully both on Windows and on Linux. Third, cases [1] and [3] are still broken, due to a Windows peculiarity. 
Please see server.c and client.c attached, which demonstrate: Case "no shutdown/closesocket" on Windows: C:\src>server Listening for incoming connections... C:\src>client Client connected: 127.0.0.1:64395 Connection to server established. Enter message: msg Client message: msg Sending message... Sleeping... Exiting... C:\src> Calling recv()... recv() failed Case "no shutdown/closesocket" on Linux: $ server Listening for incoming connections... $ client Client connected: 127.0.0.1:33044 Connection to server established. Enter message: msg Client message: msg Sending message... Sleeping... Exiting... $ Calling recv()... Server message: MESSAGE Case "shutdown/closesocket" on Windows: C:\src>server shutdown closesocket Listening for incoming connections... C:\src>client Client connected: 127.0.0.1:64396 Connection to server established. Enter message: msg Client message: msg Sending message... Sleeping... Calling shutdown()... Calling closesocket()... Exiting... C:\src> Calling recv()... Server message: MESSAGE That's okay so far, but what makes cases [1]/[3] different
Re: Synchronizing slots from primary to standby
Hi, On 11/21/23 10:32 AM, shveta malik wrote: On Tue, Nov 21, 2023 at 2:02 PM shveta malik wrote: v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, rebased the patches. PFA v37_2 patches. Thanks! Regarding the promotion flow: if the primary is available and reachable, I don't think we currently try to ensure that slots are in sync. I think we'd miss any activity between the last sync and the promotion request, or am I missing something? If the primary is available and reachable, shouldn't we launch a last round of synchronization (skipping all the slots that are not in 'r' state)? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Use of backup_label not noted in log
On 11/20/23 23:54, Michael Paquier wrote: On Mon, Nov 20, 2023 at 03:31:20PM -0400, David Steele wrote: I still wonder if we need "base backup" in the messages? That sort of implies (at least to me) you used pg_basebackup but that may not be the case. Or just s/base backup/backup/? That's what I meant but did not explain very well. Regards, -David
Re: POC, WIP: OR-clause support for indexes
On 21.11.2023 03:50, Alena Rybakina wrote: On 20.11.2023 11:52, Andrei Lepikhov wrote: Looking into the patch, I found some trivial improvements (see attachment). Also, it is not obvious that using a string representation of the clause as a hash table key is needed here. Also, by making a copy of the node in the get_key_nconst_node(), you replace the location field, but in the case of complex expression, you don't do the same with other nodes. I propose to generate expression hash instead + prove the equality of two expressions by calling equal(). I was thinking about your last email and a possible error where the location field may not be cleared in complex expressions. Unfortunately, I didn't come up with any examples either, but I think I was able to handle this with a special function that removes location-related patterns. The alternative to this is to bypass this expression, but I think it will be more invasive. In addition, I have added changes related to the hash table: now the key is of type int. All changes are displayed in the attached v9-0001-Replace-OR-clause-to_ANY.diff.txt file. I haven't measured it yet. But what do you think about these changes? Sorry, I lost your changes during the revision process. I returned them. I raised the patch version just in case to run ci successfully. -- Regards, Alena Rybakina Postgres Professional From 66d1c479347d80ea45fc7aebaed873d45f9c4351 Mon Sep 17 00:00:00 2001 From: Alena Rybakina Date: Tue, 21 Nov 2023 14:20:56 +0300 Subject: [PATCH] [PATCH] Replace OR clause to ANY expressions. Replace (X=N1) OR (X=N2) ... with X = ANY(N1, N2) on the stage of the optimiser when we are still working with a tree expression. Firstly, we do not try to make a transformation for "non-or" expressions or inequalities and the creation of a relation with "or" expressions occurs according to the same scenario. Secondly, we do not make transformations if there are less than set or_transform_limit. Thirdly, it is worth considering that we consider "or" expressions only at the current level. 
--- src/backend/parser/parse_expr.c | 280 +- src/backend/utils/misc/guc_tables.c | 10 + src/include/parser/parse_expr.h | 1 + src/test/regress/expected/create_index.out| 115 +++ src/test/regress/expected/guc.out | 3 +- src/test/regress/expected/join.out| 50 src/test/regress/expected/partition_prune.out | 156 ++ src/test/regress/expected/sysviews.out| 3 +- src/test/regress/expected/tidscan.out | 17 ++ src/test/regress/sql/create_index.sql | 32 ++ src/test/regress/sql/join.sql | 10 + src/test/regress/sql/partition_prune.sql | 22 ++ src/test/regress/sql/tidscan.sql | 6 + src/tools/pgindent/typedefs.list | 1 + 14 files changed, 703 insertions(+), 3 deletions(-) diff --git a/src/backend/parser/parse_expr.c b/src/backend/parser/parse_expr.c index 64c582c344c..7b76c9f29a1 100644 --- a/src/backend/parser/parse_expr.c +++ b/src/backend/parser/parse_expr.c @@ -18,6 +18,7 @@ #include "catalog/pg_aggregate.h" #include "catalog/pg_proc.h" #include "catalog/pg_type.h" +#include "common/hashfn.h" #include "commands/dbcommands.h" #include "miscadmin.h" #include "nodes/makefuncs.h" @@ -43,6 +44,7 @@ /* GUC parameters */ bool Transform_null_equals = false; +bool enable_or_transformation = false; static Node *transformExprRecurse(ParseState *pstate, Node *expr); @@ -98,7 +100,283 @@ static Expr *make_distinct_op(ParseState *pstate, List *opname, Node *ltree, Node *rtree, int location); static Node *make_nulltest_from_distinct(ParseState *pstate, A_Expr *distincta, Node *arg); +typedef struct OrClauseGroupEntry +{ + int32 hash_leftvar_key; + + Node *node; + List *consts; + Oidscalar_type; + Oidopno; + Node *expr; +} OrClauseGroupEntry; + +/* + * TODO: consider different algorithm to manage complexity + * in the case of many different clauses, + * like Rabin-Karp or Boyer–Moore algorithms. + */ +static char * +clear_patterns(const char *str, const char *start_pattern) +{ + int i = 0; + int j = 0; + int start_pattern_len = strlen(start_pattern); + char *res = palloc0(sizeof(*res) * (strlen(str) + 1)); + + for (i = 0; str[i];) + { + if (i >= start_pattern_len && strncmp(&str[i - start_pattern_len], + start_pattern, + start_pattern_len) == 0) + { + while (str[i] && !(str[i] == '{' || str[i] == '}')) +i++; + } + if (str[i]) + res[j++] = str[i++]; + } + + return res; +} + +static int32 +get_key_nconst_node(Node *nconst_node) +{ + char *str = nodeToString(nconst_node); + + str = clear_patterns(str, " :location"); + + if (str == NULL) + return 0; + + return DatumGetInt32(hash_any((unsigned char *)str, strlen(str))); +} + +static Node * +transformBoolExprOr(ParseState *pstate, Bool
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 19:58, Andres Freund wrote: On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: Given that, I wonder if what we should do is to just add a new field to pg_control that says "error out if backup_label does not exist", that we set when creating a streaming base backup That would mean that one still needs to take an extra step to update a control file with this byte set, which is something you had a concern with in terms of compatibility when it comes to external backup solutions because more steps are necessary to take a backup, no? I was thinking we'd just set it in the pg_basebackup style path, and we'd error out if it's set and backup_label is present. But we'd still use backup_label without the pg_control flag set. So it'd just provide a cross-check that backup_label was not removed for pg_basebackup style backup, but wouldn't do anything for external backups. But imo the proposal to just us pg_control doesn't actually do anything for external backups either - which is why I think my proposal would achieve as much, for a much lower price. I'm not sure why you think the patch under discussion doesn't do anything for external backups. It provides the same benefits to both pg_basebackup and external backups, i.e. they both receive the updated version of pg_control. I really dislike the idea of pg_basebackup having a special mechanism for making recovery safer that is not generally available to external backup software. It might be easy enough for some (e.g. pgBackRest) to manipulate pg_control but would be out of reach for most. Regards, -David
undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Hi, I decided to do some stress-testing of the built-in logical replication, as part of the sequence decoding work. And I soon ran into an undetected deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-( The attached bash scripts triggers that in a couple seconds for me. The script looks complicated, but most of the code is waiting for sync to complete, catchup, and that sort of thing. What the script does is pretty simple: 1) initialize two clusters, set them as publisher/subscriber pair 2) create some number of tables, add them to publication and wait for the sync to complete 3) start two pgbench runs in the background, modifying the publication (one removes+adds all tables in a single transaction, one does that with transaction per table) 4) run refresh.sh which does ALTER PUBLICATION ... REFRESH PUBLICATION in a loop (now that I think about it, could be another pgbench script, but well ...) 5) some consistency checks, but the lockup happens earlier so this does not really matter After a small number of refresh cycles (for me it's usually a couple dozen), we end up with a couple stuck locks (I shortened the backend type string a bit, for formatting reasons): test=# select a.pid, classid, objid, backend_type, query from pg_locks l join pg_stat_activity a on (a.pid = l.pid) where not granted; pid | classid | objid | backend_type | query -+-+---+--+-- 2691941 |6100 | 16785 | client backend | ALTER SUBSCRIPTION s REFRESH PUBLICATION 2691837 |6100 | 16785 | tablesync worker | 2691936 |6100 | 16785 | tablesync worker | (3 rows) All these backends wait for 6100/16785, which is the subscription row in pg_subscription. The tablesync workers are requesting AccessShareLock, the client backend however asks for AccessExclusiveLock. The entry is currently locked by: test=# select a.pid, mode, backend_type from pg_locks l join pg_stat_activity a on (a.pid = l.pid) where classid=6100 and objid=16785 and granted; pid | mode | backend_type -+-+-- 2690477 | AccessShareLock | logical replication apply worker (1 row) But the apply worker is not waiting for any locks, so what's going on? Well, the problem is the apply worker is waiting for notification from the tablesync workers the relation is synced, which happens through updating the pg_subscription_rel row. And that wait happens in wait_for_relation_state_change, which simply checks the row in a loop, with a sleep by WaitLatch(). Unfortunately, the tablesync workers can't update the row because the client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION sneaked in, and waits for an AccessExclusiveLock. So the tablesync workers are stuck in the queue and can't proceed. The client backend can't proceed, because it's waiting for a lock held by the apply worker. The tablesync workers can't proceed because their lock request is stuck behind the AccessExclusiveLock request. And the apply worker can't proceed, because it's waiting for status update from the tablesync workers. And the deadlock is undetected, because the apply worker is not waiting on a lock, but sleeping on a latch :-( I don't know what's the right solution here. I wonder if the apply worker might release the lock before waiting for the update, that'd solve this whole issue. Alternatively, ALTER PUBLICATION might wait for the lock only for a limited amount of time, and try again (but then it'd be susceptible to starving, of course). Or maybe there's a way to make this work in a way that would be visible to the deadlock detector? 
That'd mean we occasionally get processes killed to resolve a deadlock, but that's still better than processes stuck indefinitely ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company lockup-test.sh Description: application/shellscript refresh.sh Description: application/shellscript
Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra wrote: > > I decided to do some stress-testing of the built-in logical replication, > as part of the sequence decoding work. And I soon ran into an undetected > deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-( > > The attached bash scripts triggers that in a couple seconds for me. The > script looks complicated, but most of the code is waiting for sync to > complete, catchup, and that sort of thing. > > What the script does is pretty simple: > > 1) initialize two clusters, set them as publisher/subscriber pair > > 2) create some number of tables, add them to publication and wait for >the sync to complete > > 3) start two pgbench runs in the background, modifying the publication >(one removes+adds all tables in a single transaction, one does that > with transaction per table) > > 4) run refresh.sh which does ALTER PUBLICATION ... REFRESH PUBLICATION >in a loop (now that I think about it, could be another pgbench >script, but well ...) > > 5) some consistency checks, but the lockup happens earlier so this does >not really matter > > After a small number of refresh cycles (for me it's usually a couple > dozen), we end up with a couple stuck locks (I shortened the backend > type string a bit, for formatting reasons): > > test=# select a.pid, classid, objid, backend_type, query >from pg_locks l join pg_stat_activity a on (a.pid = l.pid) > where not granted; > > pid | classid | objid | backend_type | query > -+-+---+--+-- >2691941 |6100 | 16785 | client backend | ALTER SUBSCRIPTION s > REFRESH PUBLICATION >2691837 |6100 | 16785 | tablesync worker | >2691936 |6100 | 16785 | tablesync worker | > (3 rows) > > All these backends wait for 6100/16785, which is the subscription row in > pg_subscription. The tablesync workers are requesting AccessShareLock, > the client backend however asks for AccessExclusiveLock. > > The entry is currently locked by: > > test=# select a.pid, mode, backend_type from pg_locks l >join pg_stat_activity a on (a.pid = l.pid) > where classid=6100 and objid=16785 and granted; > > pid | mode | backend_type > -+-+-- >2690477 | AccessShareLock | logical replication apply worker > (1 row) > > But the apply worker is not waiting for any locks, so what's going on? > > Well, the problem is the apply worker is waiting for notification from > the tablesync workers the relation is synced, which happens through > updating the pg_subscription_rel row. And that wait happens in > wait_for_relation_state_change, which simply checks the row in a loop, > with a sleep by WaitLatch(). > > Unfortunately, the tablesync workers can't update the row because the > client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION > sneaked in, and waits for an AccessExclusiveLock. So the tablesync > workers are stuck in the queue and can't proceed. > > The client backend can't proceed, because it's waiting for a lock held > by the apply worker. > It seems there is some inconsistency in what you have written for client backends/tablesync worker vs. apply worker. The above text seems to be saying that the client backend and table sync worker are waiting on a "subscription row in pg_subscription" and the apply worker is operating on "pg_subscription_rel". So, if that is true then they shouldn't get stuck. I think here client backend and tablesync worker seems to be blocked for a lock on pg_subscription_rel. > The tablesync workers can't proceed because their lock request is stuck > behind the AccessExclusiveLock request. 
> > And the apply worker can't proceed, because it's waiting for status > update from the tablesync workers. > This part is not clear to me because wait_for_relation_state_change()->GetSubscriptionRelState() seems to be releasing the lock while closing the relation. Am, I missing something? -- With Regards, Amit Kapila.
Re: meson documentation build open issues
On 21.11.23 02:56, Andres Freund wrote: One remaining question is whether we should adjust install-doc-{html,man} to be install-{html,man}, to match the docs targets. Ah didn't notice that one; yes please.
Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
On 11/21/23 14:16, Amit Kapila wrote: > On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra > wrote: >> >> I decided to do some stress-testing of the built-in logical replication, >> as part of the sequence decoding work. And I soon ran into an undetected >> deadlock related to ALTER SUBSCRIPTION ... REFRESH PUBLICATION :-( >> >> The attached bash scripts triggers that in a couple seconds for me. The >> script looks complicated, but most of the code is waiting for sync to >> complete, catchup, and that sort of thing. >> >> What the script does is pretty simple: >> >> 1) initialize two clusters, set them as publisher/subscriber pair >> >> 2) create some number of tables, add them to publication and wait for >>the sync to complete >> >> 3) start two pgbench runs in the background, modifying the publication >>(one removes+adds all tables in a single transaction, one does that >> with transaction per table) >> >> 4) run refresh.sh which does ALTER PUBLICATION ... REFRESH PUBLICATION >>in a loop (now that I think about it, could be another pgbench >>script, but well ...) >> >> 5) some consistency checks, but the lockup happens earlier so this does >>not really matter >> >> After a small number of refresh cycles (for me it's usually a couple >> dozen), we end up with a couple stuck locks (I shortened the backend >> type string a bit, for formatting reasons): >> >> test=# select a.pid, classid, objid, backend_type, query >>from pg_locks l join pg_stat_activity a on (a.pid = l.pid) >> where not granted; >> >> pid | classid | objid | backend_type | query >> -+-+---+--+-- >>2691941 |6100 | 16785 | client backend | ALTER SUBSCRIPTION s >> REFRESH PUBLICATION >>2691837 |6100 | 16785 | tablesync worker | >>2691936 |6100 | 16785 | tablesync worker | >> (3 rows) >> >> All these backends wait for 6100/16785, which is the subscription row in >> pg_subscription. The tablesync workers are requesting AccessShareLock, >> the client backend however asks for AccessExclusiveLock. >> >> The entry is currently locked by: >> >> test=# select a.pid, mode, backend_type from pg_locks l >>join pg_stat_activity a on (a.pid = l.pid) >> where classid=6100 and objid=16785 and granted; >> >> pid | mode | backend_type >> -+-+-- >>2690477 | AccessShareLock | logical replication apply worker >> (1 row) >> >> But the apply worker is not waiting for any locks, so what's going on? >> >> Well, the problem is the apply worker is waiting for notification from >> the tablesync workers the relation is synced, which happens through >> updating the pg_subscription_rel row. And that wait happens in >> wait_for_relation_state_change, which simply checks the row in a loop, >> with a sleep by WaitLatch(). >> >> Unfortunately, the tablesync workers can't update the row because the >> client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION >> sneaked in, and waits for an AccessExclusiveLock. So the tablesync >> workers are stuck in the queue and can't proceed. >> >> The client backend can't proceed, because it's waiting for a lock held >> by the apply worker. >> > > It seems there is some inconsistency in what you have written for > client backends/tablesync worker vs. apply worker. The above text > seems to be saying that the client backend and table sync worker are > waiting on a "subscription row in pg_subscription" and the apply > worker is operating on "pg_subscription_rel". So, if that is true then > they shouldn't get stuck. 
> > I think here client backend and tablesync worker seems to be blocked > for a lock on pg_subscription_rel. > Not really, they are all locking the subscription. All the locks are on classid=6100, which is pg_subscription: test=# select 6100::regclass; regclass - pg_subscription (1 row) The thing is, the tablesync workers call UpdateSubscriptionRelState, which locks the pg_subscription catalog at the very beginning: LockSharedObject(SubscriptionRelationId, ...); So that's the issue. I haven't explored why it's done this way, and there's no comment explaining locking the subscriptions is needed ... >> The tablesync workers can't proceed because their lock request is stuck >> behind the AccessExclusiveLock request. >> >> And the apply worker can't proceed, because it's waiting for status >> update from the tablesync workers. >> > > This part is not clear to me because > wait_for_relation_state_change()->GetSubscriptionRelState() seems to > be releasing the lock while closing the relation. Am, I missing > something? > I think you're missing the fact that GetSubscriptionRelState() acquires and releases the lock on pg_subscription_rel, but that's not the lock causing th
Re: Annoying build warnings from latest Apple toolchain
On Mon, Nov 20, 2023 at 11:37 PM Andres Freund wrote: > WRT Robert seeing those warnings and Tom not: There's something odd going > on. I couldn't immediately reproduce it. Then I realized it reproduces against > a homebrew install but not a macports one. > > Robert, which are you using? macports > Meson actually *tries* to avoid this warning without resulting in incorrect > results due to ordering. But homebrew does something odd, with libxml-2.0, > zlib and a few others: Unless you do something to change that, brew has > /opt/homebrew/Library/Homebrew/os/mac/pkgconfig/14/ in its search path, but > those libraries aren't from homebrew, they're referencing macos > frameworks. Apparently meson isn't able to understand which files those .pc > files link to and gives up on the deduplicating. > > If I add to the pkg-config search path, e.g. via > meson configure > -Dpkg_config_path=$OTHER_STUFF:/opt/homebrew/opt/zlib/lib/pkgconfig/:/opt/homebrew/opt/libxml2/lib/pkgconfig/ > > the warnings about duplicate libraries vanish. Hmm, I'm happy to try something here, but I'm not sure exactly what. I'm not setting pkg_config_path at all right now. I'm using this: meson setup $HOME/pgsql $HOME/pgsql-meson -Dcassert=true -Ddebug=true -Dextra_include_dirs=/opt/local/include -Dextra_lib_dirs=/opt/local/lib -Dprefix=$HOME/install/dev -- Robert Haas EDB: http://www.enterprisedb.com
Re: trying again to get incremental backup
On Mon, Nov 20, 2023 at 4:43 PM Robert Haas wrote: > > On Fri, Nov 17, 2023 at 5:01 AM Alvaro Herrera > wrote: > > I made a pass over pg_combinebackup for NLS. I propose the attached > > patch. > > This doesn't quite compile for me so I changed a few things and > incorporated it. Hopefully I didn't mess anything up. > > Here's v11. [..] > I wish I had better ideas about how to thoroughly test this. [..] Hopefully the below add some confidence, I've done some further quick(?) checks today and results are good: make check-world #GOOD test_full_pri__incr_stby__restore_on_pri.sh #GOOD test_full_pri__incr_stby__restore_on_stby.sh #GOOD* test_full_stby__incr_stby__restore_on_pri.sh #GOOD test_full_stby__incr_stby__restore_on_stby.sh #GOOD* test_many_incrementals_dbcreate.sh #GOOD test_many_incrementals.sh #GOOD test_multixact.sh #GOOD test_pending_2pc.sh #GOOD test_reindex_and_vacuum_full.sh #GOOD test_truncaterollback.sh #GOOD test_unlogged_table.sh #GOOD test_across_wallevelminimal.sh # GOOD(expected failure, that walsummaries are off during walminimal and incremental cannot be taken--> full needs to be taken after wal_level=minimal) CFbot failed on two hosts this time, I haven't looked at the details yet (https://cirrus-ci.com/task/6425149646307328 -> end of EOL? -> LOG: WAL summarizer process (PID 71511) was terminated by signal 6: Aborted?) The remaining test idea is to have a longer running DB under stress test in more real-world conditions and try to recover using chained incremental backups (one such test was carried out on patchset v6 and the result was good back then). -J.
Re: About #13489, array dimensions and CREATE TABLE ... LIKE
On Tue, Nov 21, 2023 at 09:33:18AM +0100, Laurenz Albe wrote: > On Mon, 2023-11-20 at 21:13 -0500, Tom Lane wrote: > > Bruce Momjian writes: > > > On Mon, Nov 20, 2023 at 09:04:21PM -0500, Tom Lane wrote: > > > > Bruce Momjian writes: > > > > > An alternate approach would > > > > > be to remove pg_attribute.attndims so we don't even try to preserve > > > > > dimensionality. > > > > > > I could get behind that, perhaps. It looks like we're not using the > > > > field in any meaningful way, and we could simplify TupleDescInitEntry > > > > and perhaps some other APIs. > > > > > So should I work on that patch or do you want to try? I think we should > > > do something. > > > > Let's wait for some other opinions, first ... > > Looking at the code, I get the impression that we wouldn't lose anything > without "pg_attribute.attndims", so +1 for removing it. > > This would call for some documentation. We should remove most of the > documentation about the non-existing difference between declaring a column > "integer[]", "integer[][]" or "integer[3][3]" and just describe the first > variant in detail, perhaps mentioning that the other notations are > accepted for backward compatibility. Agreed, I see: https://www.postgresql.org/docs/current/arrays.html However, the current implementation ignores any supplied array size limits, i.e., the behavior is the same as for arrays of unspecified length. The current implementation does not enforce the declared number of dimensions either. So both size limits and dimensions would be ignored. > I also think that it would be helpful to emphasize that while dimensionality > does not matter to a column definition, it matters for individual array > values. > Perhaps it would make sense to recommend a check constraint if one wants > to make sure that an array column should contain only a certain kind of array. The CHECK constraint idea is very good. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: CREATE FUNCTION ... SEARCH { DEFAULT | SYSTEM | SESSION }
On Mon, Nov 20, 2023 at 5:27 PM Jeff Davis wrote: > Of course I welcome others to profile and see what they can do. There's > a setjmp() call, and a couple allocations, and maybe some other stuff > to look at. There are also higher-level ideas, like avoiding calling > into guc.c in some cases, but that starts to get tricky as you pointed > out: > > https://www.postgresql.org/message-id/CA%2BTgmoa8uKQgak5wH0%3D7sL-ukqbwnCPMXA2ZW7Ccdt7tdNGkzg%40mail.gmail.com > > It seems others are also interested in this problem, so I can put some > more effort in after this round of patches goes in. I don't have a > specific target other than "low enough overhead that we can reasonably > recommend it as a best practice in multi-user environments". The two things that jump out at me are the setjmp() and the hash_search() call inside find_option(). As to the first, could we remove the setjmp() and instead have abort-time processing know something about this? For example, imagine we just push something onto a stack like we do for ErrorContextCallback, do whatever, and then pop it off. But if an error is thrown then the abort path knows to look at that variable and do whatever. As to the second, could we somehow come up with an API for guc.c where you can ask for an opaque handle that will later allow you to do a fast-SET of a GUC? The opaque handle would basically be the hashtable entry, perhaps with some kind of wrapping or decoration. Then fmgr_security_definer() could obtain the opaque handles and cache them in fn_extra. (I'm just spitballing here.) -- Robert Haas EDB: http://www.enterprisedb.com
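To make the second idea a little more concrete, here is one possible shape such an interface could take. Everything below is hypothetical — these functions do not exist in guc.c today — and is only meant to illustrate "resolve the name once, then SET through a cached handle":

```c
#include "postgres.h"

#include "utils/guc.h"

/* Opaque outside guc.c; in practice it could wrap the hashtable entry. */
typedef struct config_generic config_generic;

/*
 * Resolve a GUC name (e.g. "search_path") to its hashtable entry once,
 * so repeated SETs can skip find_option()'s hash lookup.  Returns NULL
 * if the name is unknown.
 */
extern config_generic *guc_get_handle(const char *name);

/* Equivalent of SetConfigOption(), but driven by a resolved handle. */
extern void guc_set_with_handle(config_generic *handle,
								const char *value,
								GucContext context,
								GucSource source);
```

With something like this, fmgr_security_definer() could look up the handles for the function's proconfig entries once and cache them in fn_extra, next to the other per-call state it already keeps.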
Re: PSQL error: total cell count of XXX exceeded
On 2023-Sep-13, Hongxu Ma wrote: > After double check, looks `int64` of src/include/c.h is the proper type for > it. > Uploaded the v4 patch to fix it. Right. I made a few more adjustments, including the additional overflow check in printTableInit that Tom Lane suggested, and pushed this. It's a bit annoying that the error recovery decision of this code is to exit the process with an error. If somebody were to be interested in a fun improvement exercise, it may be worth redoing the print.c API so that it returns errors that psql can report and recover from, instead of just closing the process. TBH though, I've never hit that code in real usage. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
pg_class.relpages not updated for toast index
Hi all, Was doing a relation size estimation based on pg_class.relpages of the relation and the related objects (index, toast) and noticed that it is not updated for the toast index, for example: fabrizio=# CREATE TABLE t(c TEXT); INSERT INTO t VALUES (repeat('x', (8192^2)::int)); VACUUM (ANALYZE) t; CREATE TABLE INSERT 0 1 VACUUM fabrizio=# \x on Expanded display is on. fabrizio=# SELECT c.oid, c.relname, c.relpages, t.relname, t.relpages AS toast_pages, ci.relname, ci.relpages AS toast_index_pages, (pg_stat_file(pg_relation_filepath(ci.oid))).size AS toast_index_size FROM pg_class c JOIN pg_class t ON t.oid = c.reltoastrelid JOIN pg_index i ON i.indrelid = t.oid JOIN pg_class ci ON ci.oid = i.indexrelid WHERE c.oid = 't'::regclass; -[ RECORD 1 ]-+- oid | 17787 relname | t relpages | 1 relname | pg_toast_17787 toast_pages | 97 relname | pg_toast_17787_index toast_index_pages | 1 toast_index_size | 16384 Are there any reasons for toast index relpages not to be updated? Or is it a bug? Regards, -- Fabrízio de Royes Mello
Re: PSQL error: total cell count of XXX exceeded
Alvaro Herrera writes: > Right. I made a few more adjustments, including the additional overflow > check in printTableInit that Tom Lane suggested, and pushed this. Committed patch LGTM. > It's a bit annoying that the error recovery decision of this code is to > exit the process with an error. If somebody were to be interested in a > fun improvement exercise, it may be worth redoing the print.c API so > that it returns errors that psql can report and recover from, instead of > just closing the process. > TBH though, I've never hit that code in real usage. Yeah, I think the reason it's stayed like that for 25 years is that nobody's hit the case in practice. Changing the API would be a bit troublesome, too, because we don't know if anybody besides psql uses it. regards, tom lane
Re: meson documentation build open issues
On 21.11.23 14:23, Peter Eisentraut wrote: On 21.11.23 02:56, Andres Freund wrote: One remaining question is whether we should adjust install-doc-{html,man} to be install-{html,man}, to match the docs targets. Ah didn't notice that one; yes please. I think this was done?
Re: Annoying build warnings from latest Apple toolchain
On 21.11.23 14:35, Robert Haas wrote: On Mon, Nov 20, 2023 at 11:37 PM Andres Freund wrote: WRT Robert seeing those warnings and Tom not: There's something odd going on. I couldn't immediately reproduce it. Then I realized it reproduces against a homebrew install but not a macports one. Robert, which are you using? macports Btw., I'm also seeing warnings like this. I'm using homebrew. Here is a sample: [145/147] Linking target src/test/modules/test_shm_mq/test_shm_mq.dylib -macosx_version_min has been renamed to -macos_version_min ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lgcc' [146/147] Linking target src/test/modules/test_slru/test_slru.dylib
Re: How to accurately determine when a relation should use local buffers?
Hi,

> Furthermore, there is some possible inconsistency in the code show below > (REL_16_STABLE) in bufmgr.c file: > > FlushRelationBuffers, PrefetchBuffer uses RelationUsesLocalBuffers(rel). > ExtendBufferedRel_common finally use BufferManagerRelation.relpersistence > which is actually rd_rel->relpersistence, works like RelationUsesLocalBuffers. > ReadBuffer_common uses isLocalBuf = SmgrIsTemp(smgr), that checks > rlocator.backend for InvalidBackendId.

I didn't do a deep investigation of the code in this particular aspect but that could be a fair point. Would you like to propose a refactoring that unifies the way we check if the relation is temporary?

> I would like to clarify, do we completely refuse the use of temporary tables > in other contexts than in backends or there is some work-in-progress to allow > some other usage contexts? If so, the check of rd_rel->relpersistence is > enough. Not sure why we use SmgrIsTemp instead of RelationUsesLocalBuffers in > ReadBuffer_common.

According to the comments in relfilelocator.h:

```
/*
 * Augmenting a relfilelocator with the backend ID provides all the information
 * we need to locate the physical storage. The backend ID is InvalidBackendId
 * for regular relations (those accessible to more than one backend), or the
 * owning backend's ID for backend-local relations. Backend-local relations
 * are always transient and removed in case of a database crash; they are
 * never WAL-logged or fsync'd.
 */
typedef struct RelFileLocatorBackend
{
	RelFileLocator locator;
	BackendId	backend;
} RelFileLocatorBackend;

#define RelFileLocatorBackendIsTemp(rlocator) \
	((rlocator).backend != InvalidBackendId)
```

And this is what ReadBuffer_common() and other callers of SmgrIsTemp() are using. So no, you can't have a temporary table without an assigned RelFileLocatorBackend.backend. It is my understanding that SmgrIsTemp() and RelationUsesLocalBuffers() are equivalent except the fact that the first macro works with SMgrRelation objects and the second one - with Relation objects.

-- Best regards, Aleksander Alekseev
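For what it's worth, one minimal shape such a unifying refactoring could take is a thin helper that documents (and asserts) the equivalence wherever both the Relation and its SMgrRelation are at hand. This is only a sketch: RelationUsesLocalBuffers(), SmgrIsTemp() and RelationGetSmgr() exist today, but the helper name and where one would call it are made up here:

```
#include "postgres.h"
#include "storage/smgr.h"
#include "utils/rel.h"

/*
 * Illustrative helper only: report whether 'rel' should use local buffers,
 * asserting that the catalog-based and smgr-based checks agree.
 */
static inline bool
RelationNeedsLocalBuffers(Relation rel)
{
	/* rd_rel->relpersistence and the smgr backend ID must tell the same story */
	Assert(RelationUsesLocalBuffers(rel) == SmgrIsTemp(RelationGetSmgr(rel)));

	return RelationUsesLocalBuffers(rel);
}
```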
Re: Hide exposed impl detail of wchar.c
> On Nov 20, 2023, at 7:10 PM, Andres Freund wrote: > > > What I don't quite get is why SIMD headers are particularly more problematic > than others - there's other headers that are compiler specific? The short answer is the rust-based bindings generation tool pgrx uses (bindgen) is a little brain dead and gets confused when there’s multiple compiler builtins headers on the host. This simd header is the first of its kind we’ve run across that’s exposed via Postgres’ “public api”. And bindgen wants to follow all the includes, it gets confused, picks the wrong one, and then errors happen. And I don’t know that it makes much sense for Postgres to include such a header into 3rd-party code anyways. I think Jubilee is also working with them to fix this, but we’re hoping Jubilee’s patch here (or similar) can get merged so we can clear our build drama. eric
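To make the header-layering idea concrete, the general shape of the fix being discussed is roughly the toy sketch below. The file and function names are invented, and the scalar loop merely stands in for the real SIMD code; the point is only that port/simd.h stays out of a header that third-party tooling ingests:

```
/* ---- public header (pg_wchar.h-like): declaration only, no simd.h ---- */
extern bool string_is_all_ascii(const char *s, size_t len);

/* ---- implementation file (or backend-only header): free to use simd.h ---- */
#include "postgres.h"
/* #include "port/simd.h"      the vectorized path would live behind this */

bool
string_is_all_ascii(const char *s, size_t len)
{
	/* plain scalar fallback, shown only so the example is complete */
	for (size_t i = 0; i < len; i++)
	{
		if ((unsigned char) s[i] & 0x80)
			return false;
	}
	return true;
}
```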
Re: Annoying build warnings from latest Apple toolchain
Hi, > > On Mon, Nov 20, 2023 at 11:37 PM Andres Freund wrote: > >> WRT Robert seeing those warnings and Tom not: There's something odd going > >> on. I couldn't immediately reproduce it. Then I realized it reproduces > >> against > >> a homebrew install but not a macports one. > >> > >> Robert, which are you using? > > > > macports > > Btw., I'm also seeing warnings like this. I'm using homebrew. Here is > a sample: > > [145/147] Linking target src/test/modules/test_shm_mq/test_shm_mq.dylib > -macosx_version_min has been renamed to -macos_version_min > ld: warning: -undefined error is deprecated > ld: warning: ignoring duplicate libraries: '-lgcc' > [146/147] Linking target src/test/modules/test_slru/test_slru.dylib I prefer not to build Postgres on Mac because it slows down GMail and Slack. After reading this discussion I tried and I can confirm I see the same warnings Robert does: ``` [322/1905] Linking target src/interfaces/libpq/libpq.5.dylib ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [326/1905] Linking target src/timezone/zic ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lz' [1113/1905] Linking target src/backend/postgres ld: warning: -undefined error is deprecated ld: warning: ignoring duplicate libraries: '-lpam', '-lxml2', '-lz' [1124/1905] Linking target src/backend/replication/pgoutput/pgoutput.dylib ld: warning: -undefined error is deprecated [1125/1905] Linking target src/backend/replication/libpqwalreceiver/libpqwalreceiver.dylib ld: warning: -undefined error is deprecated ... many more ... ``` My laptop is an Intel MacBook Pro 2019. The MacOS version is Sonoma 14.0 and I'm using homebrew. `xcode-select --version` says 2399. I'm using the following command: ``` XML_CATALOG_FILES=/usr/local/etc/xml/catalog time -p sh -c 'git clean -dfx && meson setup --buildtype debug -DPG_TEST_EXTRA="kerberos ldap ssl" -Dldap=disabled -Dssl=openssl -Dcassert=true -Dtap_tests=enabled -Dprefix=/Users/eax/pginstall build && ninja -C build && ninja -C build docs && meson test -C build' ``` I don't see any warnings when using Autotools. -- Best regards, Aleksander Alekseev
Re: Do away with a few backwards compatibility macros
On Tue, Nov 21, 2023 at 12:05:36AM -0500, Tom Lane wrote: > No objection here, but should we try to establish some sort of project > policy around this sort of change (ie, removal of backwards-compatibility > support)? "Once it no longer matters for any supported version" sounds > about right to me, but maybe somebody has an argument for thinking about > it differently. That seems reasonable to me. I don't think we need to mandate that backwards-compatibility support be removed as soon as it is eligible, but it can be considered fair game at that point. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
On Mon, Nov 20, 2023 at 10:37:47PM -0800, Jeff Davis wrote: > It would be interesting to know how often it's a good idea to turn it > on, though. I could try turning it on for various other uses of > simplehash, and see where it tends to win. That seems worthwhile to me. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Remove distprep
On 2023-Nov-07, Michael Paquier wrote: > On Mon, Nov 06, 2023 at 04:21:40PM +0100, Peter Eisentraut wrote: > > done > > Nice to see 721856ff24b3 in, thanks! Hmm, do we still need to have README.git as a separate file from README? Also, looking at README, I see it refers to the INSTALL file in the root, but that doesn't exist. "make -C doc/src/sgml INSTALL" creates it, but it's not copied to the root directory. Do we need some fixup for that? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Learn about compilers. Then everything looks like either a compiler or a database, and now you have two problems but one of them is fun." https://twitter.com/thingskatedid/status/1456027786158776329
Re: PSQL error: total cell count of XXX exceeded
On 2023-Nov-21, Tom Lane wrote: > Alvaro Herrera writes: > > Right. I made a few more adjustments, including the additional overflow > > check in printTableInit that Tom Lane suggested, and pushed this. > > Committed patch LGTM. Thanks for looking! > > It's a bit annoying that the error recovery decision of this code is to > > exit the process with an error. [...] > > TBH though, I've never hit that code in real usage. > > Yeah, I think the reason it's stayed like that for 25 years is that > nobody's hit the case in practice. Changing the API would be a bit > troublesome, too, because we don't know if anybody besides psql > uses it. True. -- Álvaro Herrera   Breisgau, Deutschland — https://www.EnterpriseDB.com/
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-21 07:42:42 -0400, David Steele wrote: > On 11/20/23 19:58, Andres Freund wrote: > > On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: > > > On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: > > > > Given that, I wonder if what we should do is to just add a new field to > > > > pg_control that says "error out if backup_label does not exist", that > > > > we set > > > > when creating a streaming base backup > > > > > > That would mean that one still needs to take an extra step to update a > > > control file with this byte set, which is something you had a concern > > > with in terms of compatibility when it comes to external backup > > > solutions because more steps are necessary to take a backup, no? > > > > I was thinking we'd just set it in the pg_basebackup style path, and we'd > > error out if it's set and backup_label is present. But we'd still use > > backup_label without the pg_control flag set. > > > > So it'd just provide a cross-check that backup_label was not removed for > > pg_basebackup style backup, but wouldn't do anything for external backups. > > But > > imo the proposal to just us pg_control doesn't actually do anything for > > external backups either - which is why I think my proposal would achieve as > > much, for a much lower price. > > I'm not sure why you think the patch under discussion doesn't do anything > for external backups. It provides the same benefits to both pg_basebackup > and external backups, i.e. they both receive the updated version of > pg_control. Sure. They also receive a backup_label today. If an external solution forgets to replace pg_control copied as part of the filesystem copy, they won't get an error after the remove of backup_label, just like they don't get one today if they don't put backup_label in the data directory. Given that users don't do the right thing with backup_label today, why can we rely on them doing the right thing with pg_control? Greetings, Andres Freund
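For clarity, the cross-check sketched above would be tiny. A hypothetical illustration (the field name, the helper, and its placement are invented here, not taken from any patch):

```
/*
 * Hypothetical sketch only.  The idea: BASE_BACKUP sets a flag in the copy
 * of pg_control it ships, and recovery errors out if that flag is set but
 * backup_label has gone missing.  'backupLabelRequired' does not exist in
 * the real ControlFileData.
 */
static void
check_backup_label_presence(bool haveBackupLabel)
{
	if (ControlFile->backupLabelRequired && !haveBackupLabel)
		ereport(FATAL,
				(errmsg("backup_label is missing, but this cluster was copied by a streaming base backup"),
				 errhint("Restore backup_label from the backup instead of removing it.")));
}
```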
Re: Locks on unlogged tables are locked?!
Uh, was this ever addressed? I don't see the patch applied or the code in this area modified. --- On Thu, May 24, 2018 at 04:33:11PM +0200, Laurenz Albe wrote: > While looking at this: > https://stackoverflow.com/q/50413623/6464308 > I realized that "LOCK TABLE " puts a > Standby/LOCK into the WAL stream, which causes a log flush > at COMMIT time. > > That hurts performance, and I doubt that it is necessary. > At any rate, DROP TABLE on an unlogged table logs nothing. > > The attached patch would take care of it, but I'm not sure > if that's the right place to check. > > Yours, > Laurenz Albe > From d474e7b41298944e43bb9141eb33adbdd9ea1098 Mon Sep 17 00:00:00 2001 > From: Laurenz Albe > Date: Tue, 22 May 2018 18:13:31 +0200 > Subject: [PATCH] Don't log locks on unlogged tables > > --- > src/backend/storage/lmgr/lock.c | 28 +--- > 1 file changed, 21 insertions(+), 7 deletions(-) > > diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c > index dc3d8d9817..70cac47ab3 100644 > --- a/src/backend/storage/lmgr/lock.c > +++ b/src/backend/storage/lmgr/lock.c > @@ -37,6 +37,7 @@ > #include "access/twophase_rmgr.h" > #include "access/xact.h" > #include "access/xlog.h" > +#include "catalog/pg_class.h" > #include "miscadmin.h" > #include "pg_trace.h" > #include "pgstat.h" > @@ -47,6 +48,7 @@ > #include "storage/standby.h" > #include "utils/memutils.h" > #include "utils/ps_status.h" > +#include "utils/rel.h" > #include "utils/resowner_private.h" > > > @@ -1041,13 +1043,25 @@ LockAcquireExtended(const LOCKTAG *locktag, >*/ > if (log_lock) > { > - /* > - * Decode the locktag back to the original values, to avoid > sending > - * lots of empty bytes with every message. See lock.h to check > how a > - * locktag is defined for LOCKTAG_RELATION > - */ > - LogAccessExclusiveLock(locktag->locktag_field1, > - > locktag->locktag_field2); > + bool unlogged_rel = false; > + > + if (locktag->locktag_type == LOCKTAG_RELATION) > + { > + Relation r = > RelationIdGetRelation(locktag->locktag_field2); > + unlogged_rel = r->rd_rel->relpersistence == > RELPERSISTENCE_UNLOGGED; > + RelationClose(r); > + } > + > + if (!unlogged_rel) > + { > + /* > + * Decode the locktag back to the original values, to > avoid sending > + * lots of empty bytes with every message. See lock.h > to check how a > + * locktag is defined for LOCKTAG_RELATION > + */ > + LogAccessExclusiveLock(locktag->locktag_field1, > + > locktag->locktag_field2); > + } > } > > return LOCKACQUIRE_OK; > -- > 2.14.3 > -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
Hi, On 2023-11-20 22:37:47 -0800, Jeff Davis wrote: > On Mon, 2023-11-20 at 22:50 -0600, Nathan Bossart wrote: > > I'm mostly thinking out loud here, but could we just always do this? > > I > > guess you might want to avoid it if your SH_EQUAL is particularly > > expensive > > and you know repeated lookups are rare, but maybe that's uncommon > > enough > > that we don't really care. > > I like that simplehash is simple, so I'm not inclined to introduce an > always-on feature. I think it'd be a bad idea to make it always on - there's plenty cases where it just would make things slower because the hit rate is low. A equal comparison is far from free. I am not quite sure this kind of cache best lives in simplehash - ISTM that quite often it'd be more beneficial to have a cache that you can test more cheaply higher up. Greetings, Andres Freund
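As a concrete illustration of testing the cache "higher up", here is a sketch of a caller-side cache; the hash type, the lookup function, and the assumption that repeated lookups pass the same key pointer are illustrative rather than taken from any existing caller:

```
typedef struct LookupCache
{
	const char *last_key;		/* only valid if callers reuse the same key pointer */
	void	   *last_entry;
} LookupCache;

static void *
lookup_with_cache(LookupCache *cache, my_hash *ht, const char *key)
{
	void	   *entry;

	/* cheap test first: same key pointer as the previous call */
	if (key != NULL && key == cache->last_key)
		return cache->last_entry;

	entry = my_hash_lookup(ht, key);	/* the generated simplehash lookup */

	cache->last_key = key;
	cache->last_entry = entry;
	return entry;
}
```

The fast path is a single pointer comparison, so it costs almost nothing even when it misses, and simplehash itself stays untouched.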
Re: Change GUC hashtable to use simplehash?
On Tue, 2023-11-21 at 16:42 +0700, John Naylor wrote: > The strlen call required for hashbytes() is not free. Should we have a hash_string() that's like hash_bytes() but checks for the NUL terminator itself? That wouldn't be inlinable, but it would save on the strlen() call. It might benefit some other callers? Regards, Jeff Davis
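For what it's worth, the shape of such a function is simple. A self-contained sketch (the real thing would presumably reuse the hash_bytes() mixing code rather than the FNV-1a constants used here to keep the example short):

```
#include "postgres.h"

/*
 * hash_string_sketch: hash a NUL-terminated string without a separate
 * strlen() pass.  FNV-1a is used only to keep this sketch self-contained;
 * a real hash_string() would want hash_bytes()-quality mixing.
 */
static uint32
hash_string_sketch(const char *s)
{
	uint32		h = 0x811c9dc5;		/* FNV-1a offset basis */

	for (; *s != '\0'; s++)
	{
		h ^= (unsigned char) *s;
		h *= 0x01000193;			/* FNV-1a prime */
	}

	return h;
}
```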
Re: Partial aggregates pushdown
On Mon, Nov 20, 2023 at 5:48 PM Bruce Momjian wrote: > > I do have a concern about this, though. It adds a lot of bloat. It > > adds a whole lot of additional entries to pg_aggregate, and every new > > aggregate we add in the future will require a bonus entry for this, > > and it needs a bunch of new pg_proc entries as well. One idea that > > I've had in the past is to instead introduce syntax that just does > > this, without requiring a separate aggregate definition in each case. > > For example, maybe instead of changing string_agg(whatever) to > > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE > > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or > > something. Then all aggregates could be treated in a generic way. I'm > > not completely sure that's better, but I think it's worth considering. > > So use an SQL keyword to indicates a pushdown call? We could then > automate the behavior rather than requiring special catalog functions? Right. It would require more infrastructure in the parser, planner, and executor, but it would be infinitely reusable instead of needing a new thing for every aggregate. I think that might be better, but to be honest I'm not totally sure. > > I don't think the patch does a good job explaining why HAVING, > > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING > > shouldn't really be a problem, because HAVING is basically a WHERE > > clause that occurs after aggregation is complete, and whether or not > > the aggregation is safe shouldn't depend on what we're going to do > > with the value afterward. The HAVING clause can't necessarily be > > pushed to the remote side, but I don't see how or why it could make > > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a > > little trickier: if we pushed down DISTINCT, we'd still have to > > re-DISTINCT-ify when combining locally, and if we pushed down ORDER > > BY, we'd have to do a merge pass to combine the returned values unless > > we could prove that the partitions were non-overlapping ranges that > > would be visited in the correct order. Although that all sounds > > doable, I think it's probably a good thing that the current patch > > doesn't try to handle it -- this is complicated already. But it should > > explain why it's not handling it and maybe even a bit about how it > > could be handling in the future, rather than just saying "well, this > > kind of thing is not safe." The trouble with that explanation is that > > it does nothing to help the reader understand whether the thing in > > question is *fundamentally* unsafe or whether we just don't have the > > right code to make it work. > > Makes sense. Actually, I think I was wrong about this. We can't handle ORDER BY or DISTINCT because we can't distinct-ify or order after we've already partially aggregated. At least not in general, and not without additional aggregate support functions. So what I said above was wrong with respect to those. Or so I believe, anyway. But I still don't see why HAVING should be a problem. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add recovery to pg_control and remove backup_label
On 11/21/23 12:41, Andres Freund wrote: On 2023-11-21 07:42:42 -0400, David Steele wrote: On 11/20/23 19:58, Andres Freund wrote: On 2023-11-21 08:52:08 +0900, Michael Paquier wrote: On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote: Given that, I wonder if what we should do is to just add a new field to pg_control that says "error out if backup_label does not exist", that we set when creating a streaming base backup That would mean that one still needs to take an extra step to update a control file with this byte set, which is something you had a concern with in terms of compatibility when it comes to external backup solutions because more steps are necessary to take a backup, no? I was thinking we'd just set it in the pg_basebackup style path, and we'd error out if it's set and backup_label is present. But we'd still use backup_label without the pg_control flag set. So it'd just provide a cross-check that backup_label was not removed for pg_basebackup style backup, but wouldn't do anything for external backups. But imo the proposal to just us pg_control doesn't actually do anything for external backups either - which is why I think my proposal would achieve as much, for a much lower price. I'm not sure why you think the patch under discussion doesn't do anything for external backups. It provides the same benefits to both pg_basebackup and external backups, i.e. they both receive the updated version of pg_control. Sure. They also receive a backup_label today. If an external solution forgets to replace pg_control copied as part of the filesystem copy, they won't get an error after the remove of backup_label, just like they don't get one today if they don't put backup_label in the data directory. Given that users don't do the right thing with backup_label today, why can we rely on them doing the right thing with pg_control? I think reliable backup software does the right thing with backup_label, but if the user starts getting errors on recovery they the decide to remove backup_label. I know we can't do much about bad backup software, but we can at least make this a bit more resistant to user error after the fact. It doesn't help that one of our hints suggests removing backup_label. In highly automated systems, the user might not even know they just restored from a backup. They are only in the loop because the restore failed and they are trying to figure out what is going wrong. When they remove backup_label the cluster comes up just fine. Victory! This is the scenario I've seen most often -- not the backup/restore process getting it wrong but the user removing backup_label on their own initiative. And because it yields such a positive result, at least initially, they remember in the future that the thing to do is to remove backup_label whenever they see the error. If they only have pg_control, then their only choice is to get it right or run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take them longer to get there. Also, I think that tool make it pretty clear that corruption will result and the only thing to do is a logical dump and restore after using it. There are plenty of ways a user can mess things up. What I'd like to prevent is the appearance of everything being OK when in fact they have corrupted their cluster. That's the situation we have now with backup_label. Is this new solution perfect? No, but I do think it checks several boxes, and is a worthwhile improvement. Regards, -David Regards, -David
Re: Locks on unlogged tables are locked?!
On Tue, Nov 21, 2023 at 11:49 AM Bruce Momjian wrote: > Uh, was this ever addressed? I don't see the patch applied or the code > in this area modified. I never saw this email originally, but I don't think I believe Laurenz's argument. Are all AEL-requiring operations on unlogged tables safe to perform on a standby without AEL? I believe I would be quite surprised if that turned out to be true. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Hide exposed impl detail of wchar.c
(I hope you don't mind I'm reposting your reply -- I accidentally replied directly to you b/c phone) > On Nov 21, 2023, at 11:56 AM, Andres Freund wrote: > > Hi, > > On 2023-11-21 10:11:18 -0500, Eric Ridge wrote: >> On Mon, Nov 20, 2023 at 7:10 PM Andres Freund wrote: >> And I don’t know that it makes much sense for Postgres to include such a >> header into 3rd-party code anyways. > > Well, we want to expose such functionality to extensions. For some cases using > full functions would to be too expensive, hence using static inlines. In case > of exposing simd stuff, that means we need to include headers. Sure. Probably not my place to make that kind of broad statement anyways. The "static inlines" are problematic for us in pgrx-land too, but that's a different problem for another day. > I'm not against splitting this out of pg_wchar.h, to be clear - that's a too > widely used header for, so there's a good independent reason for such a > change. I just don't really believe that moving simd.h out of there will end > the issue, we'll add more inlines using simd over time... Yeah and that's why Jubilee is working with the bindgen folks to tighten this up for good. (We tracked all of the pg16 betas and didn't run into this until after pg16 went gold. I personally haven't groveled through the git logs to see when this header/static inline was added, but we'd have reported this sooner had we found it sooner.) eric
Re: pg_upgrade and logical replication
On Mon, 20 Nov 2023 at 10:44, Peter Smith wrote: > > Here are some review comments for patch v15-0001 > > == > src/bin/pg_dump/pg_dump.c > > 1. getSubscriptions > > + if (fout->remoteVersion >= 17) > + appendPQExpBufferStr(query, "o.remote_lsn AS suboriginremotelsn\n"); > + else > + appendPQExpBufferStr(query, "NULL AS suboriginremotelsn\n"); > + > > There should be preceding spaces in those append strings to match the > other ones. Modified > ~~~ > > 2. dumpSubscriptionTable > > +/* > + * dumpSubscriptionTable > + * Dump the definition of the given subscription table mapping. This will > be > + *used only in binary-upgrade mode. > + */ > +static void > +dumpSubscriptionTable(Archive *fout, const SubRelInfo *subrinfo) > +{ > + DumpOptions *dopt = fout->dopt; > + SubscriptionInfo *subinfo = subrinfo->subinfo; > + PQExpBuffer query; > + char*tag; > + > + /* Do nothing in data-only dump */ > + if (dopt->dataOnly) > + return; > + > + Assert(fout->dopt->binary_upgrade || fout->remoteVersion >= 17); > > The function comment says this is only for binary-upgrade mode, so why > does the Assert use || (OR)? Added comments > == > src/bin/pg_upgrade/check.c > > 3. check_and_dump_old_cluster > > + /* > + * Subscription dependencies can be migrated since PG17. See comments atop > + * get_old_cluster_subscription_count(). > + */ > + if (GET_MAJOR_VERSION(old_cluster.major_version) >= 1700) > + check_old_cluster_subscription_state(&old_cluster); > + > > Should this be combined with the other adjacent check so there is only > one "if (GET_MAJOR_VERSION(old_cluster.major_version) >= 1700)" > needed? Modified > ~~~ > > 4. check_new_cluster > > check_new_cluster_logical_replication_slots(); > + > + check_new_cluster_subscription_configuration(); > > When checking the old cluster, the subscription was checked before the > slots, but here for the new cluster, the slots are checked before the > subscription. Maybe it makes no difference but it might be tidier to > do these old/new checks in the same order. Modified > ~~~ > > 5. check_new_cluster_logical_replication_slots > > - /* Quick return if there are no logical slots to be migrated. */ > + /* Quick return if there are no logical slots to be migrated */ > > Change is not relevant for this patch. Removed it > ~~~ > > 6. > > + res = executeQueryOrDie(conn, "SELECT setting FROM pg_settings " > + "WHERE name IN ('max_replication_slots') " > + "ORDER BY name DESC;"); > > Using IN and ORDER BY in this SQL seems unnecessary when you are only > searching for one name. Modified > == > src/bin/pg_upgrade/info.c > > 7. statics > > - > +static void get_old_cluster_subscription_count(DbInfo *dbinfo); > > This change also removes an existing blank line -- not sure if that > was intentional Modified > ~~~ > > 8. > @@ -365,7 +369,6 @@ get_template0_info(ClusterInfo *cluster) > PQfinish(conn); > } > > - > /* > * get_db_infos() > * > > This blank line change (before get_db_infos) should not be part of this patch. Modified > ~~~ > > 9. get_old_cluster_subscription_count > > It seems a slightly misleading function name because this is a PER-DB > count, not a cluster count. Modified > ~~~ > > > 10. > + /* Subscriptions can be migrated since PG17. */ > + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600) > + return; > > IMO it is better to compare < 1700 instead of <= 1600. It keeps the > code more aligned with the comment. Modified > ~~~ > > 11. 
count_old_cluster_subscriptions > > +/* > + * count_old_cluster_subscriptions() > + * > + * Returns the number of subscription for all databases. > + * > + * Note: this function always returns 0 if the old_cluster is PG16 and prior > + * because we gather subscriptions only for cluster versions greater than or > + * equal to PG17. See get_old_cluster_subscription_count(). > + */ > +int > +count_old_cluster_subscriptions(void) > +{ > + int nsubs = 0; > + > + for (int dbnum = 0; dbnum < old_cluster.dbarr.ndbs; dbnum++) > + nsubs += old_cluster.dbarr.dbs[dbnum].nsubs; > + > + return nsubs; > +} > > 11a. > /subscription/subscriptions/ Modified > ~ > > 11b. > The code is now consistent with the slots code which looks good. OTOH > I thought that 'pg_catalog.pg_subscription' is shared across all > databases of the cluster, so isn't this code inefficient to be > querying again and again for every database (if there are many of > them) instead of just querying 1 time only for the whole cluster? My earlier version was like that, changed it to keep the code consistent to logical replication slots. > == > src/bin/pg_upgrade/t/004_subscription.pl > > 12. > It is difficult to keep track of all the tables (upgraded and not > upgraded) at each step of these tests. Maybe the comments can be more > explicit along the way. e.g > > BEFORE > +# Add tab_not_upgraded1 to the publication > > SUGGESTION > +# Add tab_not_upgraded1 to the publication. Now publication has > > and > > BEFORE > +# Subscript
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 16:41, Andres Freund wrote: On 2023-11-20 15:56:19 -0400, David Steele wrote: I understand this is an option -- but does it need to be? What is the benefit of excluding the manifest? It's not free to create the manifest, particularly if checksums are enabled. It's virtually free, even with the basic CRCs. Anyway, would you really want a backup without a manifest? How would you know something is missing? In particular, for page incremental how do you know something is new (but not WAL logged) if there is no manifest? Is the plan to just recopy anything not WAL logged with each incremental? Also, for external backups, there's no manifest... There certainly is a manifest for many external backup solutions. Not having a manifest is just running with scissors, backup-wise. Regards, -David
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-21 13:41:15 -0400, David Steele wrote: > On 11/20/23 16:41, Andres Freund wrote: > > > > On 2023-11-20 15:56:19 -0400, David Steele wrote: > > > I understand this is an option -- but does it need to be? What is the > > > benefit of excluding the manifest? > > > > It's not free to create the manifest, particularly if checksums are enabled. > > It's virtually free, even with the basic CRCs. Huh? perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar > /dev/null 4,423.81 msec task-clock #0.626 CPUs utilized 433,475 context-switches # 97.987 K/sec 5 cpu-migrations #1.130 /sec 599 page-faults # 135.404 /sec 12,208,261,153 cycles #2.760 GHz 6,805,401,520 instructions #0.56 insn per cycle 1,273,896,027 branches # 287.964 M/sec 14,233,126 branch-misses#1.12% of all branches 7.068946385 seconds time elapsed 1.106072000 seconds user 3.403793000 seconds sys perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --manifest-checksums=CRC32C > /dev/null 4,324.64 msec task-clock #0.640 CPUs utilized 433,306 context-switches # 100.195 K/sec 3 cpu-migrations #0.694 /sec 598 page-faults # 138.277 /sec 11,952,475,908 cycles #2.764 GHz 6,816,888,845 instructions #0.57 insn per cycle 1,275,949,455 branches # 295.042 M/sec 13,721,376 branch-misses#1.08% of all branches 6.760321433 seconds time elapsed 1.113256000 seconds user 3.302907000 seconds sys perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --no-manifest > /dev/null 3,925.38 msec task-clock #0.823 CPUs utilized 257,467 context-switches # 65.590 K/sec 4 cpu-migrations #1.019 /sec 552 page-faults # 140.624 /sec 11,577,054,842 cycles #2.949 GHz 5,933,731,797 instructions #0.51 insn per cycle 1,108,784,719 branches # 282.466 M/sec 11,867,511 branch-misses#1.07% of all branches 4.770347012 seconds time elapsed 1.002521000 seconds user 2.991769000 seconds sys I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". And this actually *under* selling the cost - we waste a lot of cycles due to bad buffering decisions. Once we fix that, the cost differential increases further. > Anyway, would you really want a backup without a manifest? How would you > know something is missing? In particular, for page incremental how do you > know something is new (but not WAL logged) if there is no manifest? Is the > plan to just recopy anything not WAL logged with each incremental? Shrug. If you just want to create a new standby by copying the primary, I don't think creating and then validating the manifest buys you much. Long term backups are a different story, particularly if data files are stored individually, rather than in a single checksummed file. > > Also, for external backups, there's no manifest... > > There certainly is a manifest for many external backup solutions. Not having > a manifest is just running with scissors, backup-wise. You mean that you have an external solution gin up a backup manifest? I fail to see how that's relevant here? Greetings, Andres Freund
Re: PSQL error: total cell count of XXX exceeded
Alvaro Herrera writes: > On 2023-Nov-21, Tom Lane wrote: >> Alvaro Herrera writes: >>> It's a bit annoying that the error recovery decision of this code is to >>> exit the process with an error. [...] >>> TBH though, I've never hit that code in real usage. >> Yeah, I think the reason it's stayed like that for 25 years is that >> nobody's hit the case in practice. Changing the API would be a bit >> troublesome, too, because we don't know if anybody besides psql >> uses it. > True. It strikes me that perhaps a workable compromise behavior could be "report the error to wherever we would have printed the table, and return normally". I'm still not excited about doing anything about it, but ... regards, tom lane
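For the archives, the direction Alvaro mentioned would look roughly like the following; the bool return value and the 'error' field are hypothetical, shown only to illustrate the shape (the real functions in fe_utils/print.c return void and exit on failure):

```
/*
 * Hypothetical variant of printTableAddCell() that reports failure instead
 * of exiting; psql would then print content->error and carry on.  The
 * 'error' field does not exist in the real printTableContent.
 */
bool
printTableAddCell(printTableContent *const content, char *cell,
				  const bool translate, const bool mustfree)
{
	if (content->cellsadded >= (int64) content->ncolumns * content->nrows)
	{
		content->error = pg_strdup("total cell count exceeded");
		return false;			/* caller reports the error and recovers */
	}

	/* (the existing cell-appending logic would follow here, unchanged) */
	return true;
}
```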
Re: Add recovery to pg_control and remove backup_label
On 11/20/23 16:37, Andres Freund wrote: On 2023-11-20 11:11:13 -0500, Robert Haas wrote: I think we need more votes to make a change this big. I have a concern, which I think I've expressed before, that we keep whacking around the backup APIs, and that has a cost which is potentially larger than the benefits. +1. The amount of whacking around in this area has been substantial, and it's hard for operators to keep up. And realistically, with data sizes today, the pressure to do basebackups with disk snapshots etc is not going to shrink. True enough, but disk snapshots aren't really backups in themselves, in most scenarios, because they reside on the same storage as the cluster. Of course, snapshots can be exported, but that's also expensive. I see snapshots as an adjunct to backups -- a safe backup offsite somewhere for DR and snapshots for day to day operations. Even so, managing snapshots as backups is harder than people think. It is easy to get wrong and end up with silent corruption. Leaving that concern aside, I am still on the fence about this proposal. I think it does decrease the chance of getting things wrong in the streaming-basebackup case. But for external backups, it seems almost universally worse (with the exception of the torn pg_control issue, that we also can address otherwise): Why universally worse? The software stores pg_control instead of backup label. The changes to pg_basebackup were pretty trivial and the changes to external backup are pretty much the same, at least in my limited sample of one. And I don't believe we have a satisfactory solution to the torn pg_control issue yet. Certainly it has not been committed and Thomas has shown enthusiasm for this approach, to the point of hoping it could be back patched (it can't). It doesn't reduce the risk of getting things wrong, you can still omit placing a file into the data directory and get silent corruption as a consequence. In addition, it's harder to see when looking at a base backup whether the process was right or not, because now the good and bad state look the same if you just look on the filesystem level! This is one of the reasons I thought writing just the first 512 bytes of pg_control would be valuable. It would give an easy indicator that pg_control came from a backup. Michael was not in favor of conflating that change with this patch -- but I still think it's a good idea. Then there's the issue of making ad-hoc debugging harder by not having a human readable file with information anymore, including when looking at the history, via backup_label.old. Yeah, you'd need to use pg_controldata instead. But as Michael has suggested, we could also write backup_label as backup_info so there is human-readable information available. Given that, I wonder if what we should do is to just add a new field to pg_control that says "error out if backup_label does not exist", that we set when creating a streaming base backup I'm not in favor of a change only accessible to pg_basebackup or external software that can manipulate pg_control. Regards, -David
Re: Locks on unlogged tables are locked?!
Bruce Momjian writes: > Uh, was this ever addressed? I don't see the patch applied or the code > in this area modified. This patch as-is would surely be disastrous: having LockAcquire try to open the relcache entry for the thing we're trying to lock is going to be circular in at least some cases. I'm not convinced that there's a problem worth solving here, but if there is, it'd have to be done in some other way. regards, tom lane
Re: Remove distprep
Alvaro Herrera writes: > Hmm, do we still need to have README.git as a separate file from README? > Also, looking at README, I see it refers to the INSTALL file in the > root, but that doesn't exist. "make -C doc/src/sgml INSTALL" creates > it, but it's not copied to the root directory. Do we need some fixup > for that? Yeah, we clearly need to rethink this area if the plan is that tarballs will be pristine git pulls. I think we want just README at the top level, and I propose we give up on the text INSTALL file altogether (thereby removing a documentation build gotcha that catches people every so often). I propose that in 2023 it ought to be sufficient for the README file to point at build instructions on the web. regards, tom lane
Re: Report planning memory in EXPLAIN ANALYZE
I gave this a quick look. I think the usefulness aspect is already established in general terms; the bit I'm not so sure about is whether we want it enabled by default. For many cases it'd just be noise. Perhaps we want it hidden behind something like "EXPLAIN (MEMORY)" or such, particularly since things like "allocated" (which, per David, seems to be the really useful metric) seems too much a PG-developer value rather than an end-user value. If EXPLAIN (MEMORY) is added, then probably auto_explain needs a corresponding flag, and doc updates. Some code looks to be in weird places. Why is calc_mem_usage, which operates on MemoryContextCounters, in explain.c instead of mcxt.c? why is MemUsage in explain.h instead of memnodes.h? I moved both. I also renamed them, to MemoryContextSizeDifference() and MemoryUsage respectively; fixup patch attached. I see no reason for this to be three separate patches anymore. The EXPLAIN docs (explain.sgml) need an update to mention the new flag and the new output, too. -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/ commit 48b0c6682a9e8cf07096b979693fac09b2f7a0ba Author: Alvaro Herrera [Álvaro Herrera ] AuthorDate: Tue Nov 21 18:20:32 2023 +0100 CommitDate: Tue Nov 21 19:18:18 2023 +0100 review diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 9cd9b577c7..8c7f27b661 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -123,7 +123,7 @@ static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning); static void show_wal_usage(ExplainState *es, const WalUsage *usage); static void show_planning_memory(ExplainState *es, -const MemUsage *usage); +const MemoryUsage *usage); static void ExplainIndexScanDetails(Oid indexid, ScanDirection indexorderdir, ExplainState *es); static void ExplainScanTarget(Scan *plan, ExplainState *es); @@ -395,14 +395,14 @@ ExplainOneQuery(Query *query, int cursorOptions, else { PlannedStmt *plan; + MemoryContext planner_ctx; instr_time planstart, planduration; BufferUsage bufusage_start, bufusage; MemoryContextCounters mem_counts_start; MemoryContextCounters mem_counts_end; - MemUsagemem_usage; - MemoryContext planner_ctx; + MemoryUsage mem_usage; MemoryContext saved_ctx; /* @@ -415,7 +415,6 @@ ExplainOneQuery(Query *query, int cursorOptions, planner_ctx = AllocSetContextCreate(CurrentMemoryContext, "explain analyze planner context", ALLOCSET_DEFAULT_SIZES); - if (es->buffers) bufusage_start = pgBufferUsage; MemoryContextMemConsumed(planner_ctx, &mem_counts_start); @@ -429,7 +428,7 @@ ExplainOneQuery(Query *query, int cursorOptions, INSTR_TIME_SUBTRACT(planduration, planstart); MemoryContextSwitchTo(saved_ctx); MemoryContextMemConsumed(planner_ctx, &mem_counts_end); - calc_mem_usage(&mem_usage, &mem_counts_end, &mem_counts_start); + MemoryContextSizeDifference(&mem_usage, &mem_counts_start, &mem_counts_end); /* calc differences of buffer counters. 
*/ if (es->buffers) @@ -551,7 +550,7 @@ void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es, const char *queryString, ParamListInfo params, QueryEnvironment *queryEnv, const instr_time *planduration, - const BufferUsage *bufusage, const MemUsage *mem_usage) + const BufferUsage *bufusage, const MemoryUsage *mem_usage) { DestReceiver *dest; QueryDesc *queryDesc; @@ -656,9 +655,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es, if (es->summary && mem_usage) { - ExplainOpenGroup("Planning Memory", "Planning Memory", true, es); + ExplainOpenGroup("Planner Memory", "Planner Memory", true, es); show_planning_memory(es, mem_usage); - ExplainCloseGroup("Planning Memory", "Planning Memory", true, es); + ExplainCloseGroup("Planner Memory", "Planner Memory", true, es); } /* Print info about runtim
Re: simplehash: SH_OPTIMIZE_REPEAT for optimizing repeated lookups of the same key
On Tue, 2023-11-21 at 08:51 -0800, Andres Freund wrote: > I am not quite sure this kind of cache best lives in simplehash - > ISTM that > quite often it'd be more beneficial to have a cache that you can test > more > cheaply higher up. Yeah. I suppose when a few more callers are likely to benefit we can reconsider. Though it makes it easy to test a few other callers, just to see what numbers appear. Regards, Jeff Davis
Re: Add recovery to pg_control and remove backup_label
On 11/21/23 13:59, Andres Freund wrote: On 2023-11-21 13:41:15 -0400, David Steele wrote: On 11/20/23 16:41, Andres Freund wrote: On 2023-11-20 15:56:19 -0400, David Steele wrote: I understand this is an option -- but does it need to be? What is the benefit of excluding the manifest? It's not free to create the manifest, particularly if checksums are enabled. It's virtually free, even with the basic CRCs. Huh? I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". OK, but how does that look with compression -- to a remote location? Uncompressed backup to local storage doesn't seem very realistic. With gzip compression we measure SHA1 checksums at about 5% of total CPU. Obviously that goes up with zstd or lz4. but parallelism helps offset that cost, at least in clock time. I can't understate how valuable checksums are in finding corruption, especially in long-lived backups. Anyway, would you really want a backup without a manifest? How would you know something is missing? In particular, for page incremental how do you know something is new (but not WAL logged) if there is no manifest? Is the plan to just recopy anything not WAL logged with each incremental? Shrug. If you just want to create a new standby by copying the primary, I don't think creating and then validating the manifest buys you much. Long term backups are a different story, particularly if data files are stored individually, rather than in a single checksummed file. Fine, but you are probably not using page incremental if just using pg_basebackup to create a standby. With page incremental, at least one of the backups will already exist, which argues for a manifest. Also, for external backups, there's no manifest... There certainly is a manifest for many external backup solutions. Not having a manifest is just running with scissors, backup-wise. You mean that you have an external solution gin up a backup manifest? I fail to see how that's relevant here? Just saying that for external backups there *is* often a manifest and it is a good thing to have. Regards, -David
Re: Add recovery to pg_control and remove backup_label
Hi, On 2023-11-21 14:48:59 -0400, David Steele wrote: > > I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". > > OK, but how does that look with compression With compression it's obviously somewhat different - but that part is done in parallel, potentially on a different machine with client side compression, whereas I think right now the checksumming is single-threaded, on the server side. With parallel server side compression, it's still 20% slower with the default checksumming than none. With client side it's 15%. > -- to a remote location? I think this one unfortunately makes checksums a bigger issue, not a smaller one. The network interaction piece is single-threaded, adding another significant use of CPU onto the same thread means that you are hit harder by using substantial amount of CPU for checksumming in the same thread. Once you go beyond the small instances, you have plenty network bandwidth in cloud environments. We top out well before the network on bigger instances. > Uncompressed backup to local storage doesn't seem very realistic. With gzip > compression we measure SHA1 checksums at about 5% of total CPU. IMO using gzip is basically infeasible for non-toy sized databases today. I think we're using our users a disservice by defaulting to it in a bunch of places. Even if another default exposes them to difficulty due to potentially using a different compiled binary with fewer supported compression methods - that's gona be very rare in practice. > I can't understate how valuable checksums are in finding corruption, > especially in long-lived backups. I agree! But I think we need faster checksum algorithms or a faster implementation of the existing ones. And probably default to something faster once we have it. Greetings, Andres Freund
Re: Add recovery to pg_control and remove backup_label
On 11/21/23 16:00, Andres Freund wrote: Hi, On 2023-11-21 14:48:59 -0400, David Steele wrote: I'd not call 7.06->4.77 or 6.76->4.77 "virtually free". OK, but how does that look with compression With compression it's obviously somewhat different - but that part is done in parallel, potentially on a different machine with client side compression, whereas I think right now the checksumming is single-threaded, on the server side. Ah, yes, that's certainly a bottleneck. With parallel server side compression, it's still 20% slower with the default checksumming than none. With client side it's 15%. Yeah, that still seems a lot. But to a large extent it sounds like a limitation of the current implementation. -- to a remote location? I think this one unfortunately makes checksums a bigger issue, not a smaller one. The network interaction piece is single-threaded, adding another significant use of CPU onto the same thread means that you are hit harder by using substantial amount of CPU for checksumming in the same thread. Once you go beyond the small instances, you have plenty network bandwidth in cloud environments. We top out well before the network on bigger instances. Uncompressed backup to local storage doesn't seem very realistic. With gzip compression we measure SHA1 checksums at about 5% of total CPU. IMO using gzip is basically infeasible for non-toy sized databases today. I think we're using our users a disservice by defaulting to it in a bunch of places. Even if another default exposes them to difficulty due to potentially using a different compiled binary with fewer supported compression methods - that's gona be very rare in practice. Yeah, I don't use gzip anymore, but there are still some platforms that do not provide zstd (at least not easily) and lz4 compresses less. One thing people do seem to have is a lot of cores. I can't understate how valuable checksums are in finding corruption, especially in long-lived backups. I agree! But I think we need faster checksum algorithms or a faster implementation of the existing ones. And probably default to something faster once we have it. We've been using xxHash to generate checksums for our block-level incremental and it is seriously fast, written by the same guy who did zstd and lz4. Regards, -David
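For reference, the streaming side of xxHash is about as simple as checksum code gets. A stand-alone sketch of checksumming an open file with XXH64 (plain libxxhash usage, not PostgreSQL code):

```
#include <stdio.h>
#include <xxhash.h>				/* libxxhash */

/* Return the XXH64 digest of an already-opened file. */
static XXH64_hash_t
checksum_file(FILE *fp)
{
	XXH64_state_t *state = XXH64_createState();
	char		buf[65536];
	size_t		nread;
	XXH64_hash_t digest;

	XXH64_reset(state, 0);
	while ((nread = fread(buf, 1, sizeof(buf), fp)) > 0)
		XXH64_update(state, buf, nread);

	digest = XXH64_digest(state);
	XXH64_freeState(state);
	return digest;
}
```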
Re: Locks on unlogged tables are locked?!
On Tue, Nov 21, 2023 at 01:16:19PM -0500, Tom Lane wrote: > Bruce Momjian writes: > > Uh, was this ever addressed? I don't see the patch applied or the code > > in this area modified. > > This patch as-is would surely be disastrous: having LockAcquire > try to open the relcache entry for the thing we're trying to lock > is going to be circular in at least some cases. I'm not convinced > that there's a problem worth solving here, but if there is, it'd > have to be done in some other way. Thank you, and Robert, for the feedback. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: vacuum_cost_limit doc description patch
On Fri, Apr 13, 2018 at 09:56:18AM -0300, Martín Marqués wrote: > On 11/04/18 at 02:04, David Rowley wrote: > > On 11 April 2018 at 09:13, Martín Marqués wrote: > >> This is a patch to add some further description, plus the upper and > >> lower limits it has. > > > > Hi, > > > > + for vacuum_cost_delay. The parameter can take a value > > between 1 and 10000. > > > > vacuum_cost_delay should be in tags. > > > > +1 to mentioning that we sleep for vacuum_cost_delay, but I just don't > > see many other GUCs with mention of their supported range. > > Thanks David for having a look. > > New version attached with the missing tags. > > > effective_io_concurrency mentions the range it supports, but this > > happens to depend on USE_POSIX_FADVISE, which if undefined the maximum > > setting is 0, which means the docs are wrong in some cases on that. > > > > vacuum_cost_limit seems fairly fixed at 0-10000 with no compile-time > > conditions, so perhaps it's okay, providing we remember and update the > > docs if that ever changes. > > I'm also adding a second patch over the config.sgml doc to fix what I > believe is a misguidance in the minimum resolution time modern systems have. > > The patch just changes *many* for *some* systems which have a minimum > resolution time of 10 milliseconds. Patch applied to master, though I removed the valid range part of the patch. I also liked the many-to-some change since it is more accurate. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: Partial aggregates pushdown
On Tue, Nov 21, 2023 at 12:16:41PM -0500, Robert Haas wrote: > On Mon, Nov 20, 2023 at 5:48 PM Bruce Momjian wrote: > > > I do have a concern about this, though. It adds a lot of bloat. It > > > adds a whole lot of additional entries to pg_aggregate, and every new > > > aggregate we add in the future will require a bonus entry for this, > > > and it needs a bunch of new pg_proc entries as well. One idea that > > > I've had in the past is to instead introduce syntax that just does > > > this, without requiring a separate aggregate definition in each case. > > > For example, maybe instead of changing string_agg(whatever) to > > > string_agg_p_text_text(whatever), you can say PARTIAL_AGGREGATE > > > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or > > > something. Then all aggregates could be treated in a generic way. I'm > > > not completely sure that's better, but I think it's worth considering. > > > > So use an SQL keyword to indicates a pushdown call? We could then > > automate the behavior rather than requiring special catalog functions? > > Right. It would require more infrastructure in the parser, planner, > and executor, but it would be infinitely reusable instead of needing a > new thing for every aggregate. I think that might be better, but to be > honest I'm not totally sure. It would make it automatic. I guess we need to look at how big the patch is to do it. > > > I don't think the patch does a good job explaining why HAVING, > > > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING > > > shouldn't really be a problem, because HAVING is basically a WHERE > > > clause that occurs after aggregation is complete, and whether or not > > > the aggregation is safe shouldn't depend on what we're going to do > > > with the value afterward. The HAVING clause can't necessarily be > > > pushed to the remote side, but I don't see how or why it could make > > > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a > > > little trickier: if we pushed down DISTINCT, we'd still have to > > > re-DISTINCT-ify when combining locally, and if we pushed down ORDER > > > BY, we'd have to do a merge pass to combine the returned values unless > > > we could prove that the partitions were non-overlapping ranges that > > > would be visited in the correct order. Although that all sounds > > > doable, I think it's probably a good thing that the current patch > > > doesn't try to handle it -- this is complicated already. But it should > > > explain why it's not handling it and maybe even a bit about how it > > > could be handling in the future, rather than just saying "well, this > > > kind of thing is not safe." The trouble with that explanation is that > > > it does nothing to help the reader understand whether the thing in > > > question is *fundamentally* unsafe or whether we just don't have the > > > right code to make it work. > > > > Makes sense. > > Actually, I think I was wrong about this. We can't handle ORDER BY or > DISTINCT because we can't distinct-ify or order after we've already > partially aggregated. At least not in general, and not without > additional aggregate support functions. So what I said above was wrong > with respect to those. Or so I believe, anyway. But I still don't see > why HAVING should be a problem. This should probably be documented in the patch. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: pg_class.reltuples of brin indexes
On Tue, Mar 27, 2018 at 08:58:11PM +0900, Masahiko Sawada wrote: > Hi, > > I found that pg_class.reltuples of brin indexes can be either the > number of index tuples or the number of heap tuples.
>
> =# create table test as select generate_series(1,100000) as c;
> =# create index test_brin on test using brin (c);
> =# analyze test;
> =# select relname, reltuples, relpages from pg_class where relname in
> ('test', 'test_brin');
>   relname  | reltuples | relpages
> -----------+-----------+----------
>  test      |    100000 |      443
>  test_brin |    100000 |        3
> (2 rows)
>
> =# vacuum test;
> =# select relname, reltuples, relpages from pg_class where relname in
> ('test', 'test_brin');
>   relname  | reltuples | relpages
> -----------+-----------+----------
>  test      |    100000 |      443
>  test_brin |         3 |        3
> (2 rows)
>
> If I understand correctly pg_class.reltuples of indexes should have > the number of index tuples but especially for brin indexes it would be > hard to estimate it in the analyze code. I thought that we can change > brinvacuumcleanup so that it returns the estimated number of index > tuples and do vac_update_relstats using that value but it would break > API contract. Better ideas? I assume there is nothing to do on this issue. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: proposal: possibility to read dumped table's name from file
> On 20 Nov 2023, at 06:20, Pavel Stehule wrote: > I was pondering replacing the is_include handling with returning an enum for > the operation, to keep things more future proof in case we add more operations > (and also a bit less magic IMHO). > > +1 > > I did it. Nice, I think it's an improvement. + extension: data on foreign servers, works like + --extension. This keyword can only be + used with the include keyword. This seems like a copy-pasteo, fixed in the attached. I've spent some time polishing this version of the patch, among other things trying to make the docs and --help screen consistent across the tools. I've added the diff as a txt file to this email (to keep the CFbot from applying it), it's mainly reformatting a few comments and making things consistent. The attached is pretty close to a committable patch IMO, review is welcome on both the patch and commit message. I tried to identify all reviewers over the past 3+ years but I might have missed someone. -- Daniel Gustafsson commit 4a3c0bdaf3fd21b75e17244691fbeb9340e960e1 Author: Daniel Gustafsson Date: Tue Nov 21 15:08:27 2023 +0100 Fixups and tweaks diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index e2f100d552..0e5ba4f712 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -873,49 +873,52 @@ PostgreSQL documentation - extension: data on foreign servers, works like - --extension. This keyword can only be + extension: extensions, works like the + --extension option. This keyword can only be used with the include keyword. foreign_data: data on foreign servers, works like - --include-foreign-data. This keyword can only be - used with the include keyword. + the --include-foreign-data option. This keyword can + only be used with the include keyword. - table: tables, works like - -t/--table + table: tables, works like the + -t/--table option. - table_and_children: tables, works like - --table-and-children + table_and_children: tables including any partitions + or inheritance child tables, works like the + --table-and-children option. - table_data: table data, works like - --exclude-table-data. This keyword can only be - used with the exclude keyword. + table_data: table data of any tables matching + pattern, works like the + --exclude-table-data option. This keyword can only + be used with the exclude keyword. - table_data_and_children: table data of any - partitions or inheritance child, works like - --exclude-table-data-and-children. This keyword can only be - used with the exclude keyword. + table_data_and_children: table data of any tables + matching pattern as well as any partitions + or inheritance children of the table(s), works like the + --exclude-table-data-and-children option. This + keyword can only be used with the exclude keyword. - schema: schemas, works like - -n/--schema + schema: schemas, works like the + -n/--schema option. @@ -923,9 +926,9 @@ PostgreSQL documentation Lines starting with # are considered comments and -ignored. Comments can be placed after filter as well. Blank lines -are also ignored. See for how to -perform quoting in patterns. +ignored. Comments can be placed after an object pattern row as well. +Blank lines are also ignored. See +for how to perform quoting in patterns. 
@@ -1715,8 +1718,8 @@ CREATE DATABASE foo WITH TEMPLATE template0; - To dump all tables with names starting with mytable, except for table - mytable2, specify a filter file + To dump all tables whose names start with mytable, except + for table mytable2, specify a filter file filter.txt like: include table mytable* diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml index 75ba03f3ad..4d7c046468 100644 --- a/doc/src/sgml/ref/pg_dumpall.sgml +++ b/doc/src/sgml/ref/pg_dumpall.sgml @@ -130,20 +130,29 @@ PostgreSQL documentation Specify a filename from which to read patterns for databases excluded -from the dump. The patterns are interpretted according to the same rules +from the dump. The patterns are interpreted according to the s
common signal handler protection
In commit 97550c0, I added a "MyProcPid == getpid()" check in the SIGTERM handler for the startup process. This ensures that processes forked by system(3) (i.e., for restore_command) that have yet to install their own signal handlers do not call proc_exit() upon receiving SIGTERM. Without this protection, both the startup process and the restore_command process might try to remove themselves from the PGPROC shared array (among other things), which can end badly. Since then, I've been exploring a more general approach that would offer protection against similar issues in the future. We probably don't want signal handlers in these grandchild processes to touch shared memory at all. The attached 0001 is an attempt at adding such protection for all handlers installed via pqsignal(). In short, it stores the actual handler functions in a separate array, and sigaction() is given a wrapper function that performs the "MyProcPid == getpid()" check. If that check fails, the wrapper function installs the default signal handler and calls it. Besides allowing us to revert commit 97550c0 (see attached 0003), this wrapper handler could also restore errno, as shown in 0002. Right now, individual signal handlers must do this today as needed, but that seems easy to miss and prone to going unnoticed for a long time. I see two main downsides of this proposal: * Overhead: The wrapper handler calls a function pointer and getpid(), which AFAICT is a real system call on most platforms. That might not be a tremendous amount of overhead, but it's not zero, either. I'm particularly worried about signal-heavy code like synchronous replication. (Are there other areas that should be tested?) If this is a concern, perhaps we could allow certain processes to opt out of this wrapper handler, provided we believe it is unlikely to fork or that the handler code is safe to run in grandchild processes. * Race conditions: With these patches, pqsignal() becomes quite racy when used within signal handlers. Specifically, you might get a bogus return value. However, there are no in-tree callers of pqsignal() that look at the return value (and I don't see any reason they should), and it seems unlikely that pqsignal() will be used within signal handlers frequently, so this might not be a deal-breaker. I did consider trying to convert pqsignal() into a void function, but IIUC that would require an SONAME bump. For now, I've just documented the bogosity of the return values. Thoughts? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 5f46dc0150aba3ef75053e91f7c9a4d6624e2c4f Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 17 Nov 2023 21:38:18 -0600 Subject: [PATCH v1 1/3] Check that MyProcPid == getpid() in all signal handlers. In commit 97550c0711, we added a similar check to the SIGTERM handler for the startup process. This commit adds this check to all signal handlers installed with pqsignal(). This is done by using a wrapper function that performs the check before calling the actual handler. The hope is that this will offer more general protection against grandchildren processes inadvertently modifying shared memory due to inherited signal handlers. Another potential follow-up improvement is to use this wrapper handler function to restore errno instead of relying on each individual handler function to do so. This commit makes the changes in commit 97550c0711 obsolete but leaves reverting it for a follow-up commit. Reviewed-by: ??? Discussion: ??? 
--- src/port/pqsignal.c | 72 +++-- 1 file changed, 70 insertions(+), 2 deletions(-) diff --git a/src/port/pqsignal.c b/src/port/pqsignal.c index 83d876db6c..978877dec6 100644 --- a/src/port/pqsignal.c +++ b/src/port/pqsignal.c @@ -28,23 +28,85 @@ #include "c.h" #include +#ifndef FRONTEND +#include +#endif #ifndef FRONTEND #include "libpq/pqsignal.h" +#include "miscadmin.h" +#endif + +#ifdef NSIG +#define PG_NSIG (NSIG) +#else +#define PG_NSIG (64) /* XXX: wild guess */ +#endif + +static volatile pqsigfunc pqsignal_handlers[PG_NSIG]; + +/* + * Except when called with SIG_IGN or SIG_DFL, pqsignal() sets up this function + * as the handler for all signals. This wrapper handler function checks that + * it is called within a process that the server knows about, and not a + * grandchild process forked by system(3), etc. This check ensures that such + * grandchildren processes do not modify shared memory, which could be + * detrimental. If the check succeeds, the function originally provided to + * pqsignal() is called. Otherwise, the default signal handler is installed + * and then called. + */ +static void +wrapper_handler(SIGNAL_ARGS) +{ +#ifndef FRONTEND + Assert(MyProcPid); + + if (unlikely(MyProcPid != (int) getpid())) + { + pqsignal(postgres_signal_arg, SIG_DFL); + raise(postgres_signal_arg); + return; + } #endif + (*pqsignal_handlers[postgres_signal_arg]) (p
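For readers who want to see the shape of this idea outside the tree, here is a minimal standalone sketch of the wrapper-handler pattern described above: the real handlers are stashed in an array, and the installed handler falls back to the default action when getpid() no longer matches the process that registered them. All names here (install_wrapped, on_sigterm, registered_pid) are illustrative, not the patch's; the actual patch hooks this into pqsignal() and checks MyProcPid.

#include <signal.h>
#include <unistd.h>

#define MAX_SIG 64

static volatile sig_atomic_t registered_pid;    /* pid that installed the handlers */
static void (*volatile real_handlers[MAX_SIG]) (int);

static void
wrapper_handler(int signo)
{
    /*
     * A process forked after handler installation (e.g. by system(3)) fails
     * this check and gets the default disposition instead of running code
     * that assumes the parent's state.
     */
    if (getpid() != (pid_t) registered_pid)
    {
        signal(signo, SIG_DFL);
        raise(signo);
        return;
    }
    real_handlers[signo] (signo);
}

/* install "func" for "signo" behind the wrapper */
static void
install_wrapped(int signo, void (*func) (int))
{
    struct sigaction sa = {0};

    registered_pid = (sig_atomic_t) getpid();
    real_handlers[signo] = func;
    sa.sa_handler = wrapper_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(signo, &sa, NULL);
}

static void
on_sigterm(int signo)
{
    (void) signo;               /* unused */
    /* in the real server, this is where shared-state work would happen */
    write(STDERR_FILENO, "parent handler ran\n", 19);
}

int
main(void)
{
    install_wrapped(SIGTERM, on_sigterm);

    if (fork() == 0)
    {
        /* child inherits the handler but fails the pid check: default action */
        raise(SIGTERM);
        _exit(1);               /* not reached */
    }
    raise(SIGTERM);             /* the parent itself still runs on_sigterm */
    return 0;
}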
Re: Remove distprep
On 2023-11-21 Tu 13:23, Tom Lane wrote: Alvaro Herrera writes: Hmm, do we still need to have README.git as a separate file from README? Also, looking at README, I see it refers to the INSTALL file in the root, but that doesn't exist. "make -C doc/src/sgml INSTALL" creates it, but it's not copied to the root directory. Do we need some fixup for that? Yeah, we clearly need to rethink this area if the plan is that tarballs will be pristine git pulls. I think we want just README at the top level, and I propose we give up on the text INSTALL file altogether (thereby removing a documentation build gotcha that catches people every so often). I propose that in 2023 it ought to be sufficient for the README file to point at build instructions on the web. +1 cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Changing baserel to foreignrel in postgres_fdw functions
Should this patch be applied? --- On Thu, Feb 15, 2018 at 06:57:50PM +0530, Ashutosh Bapat wrote: > Hi, > I noticed that functions is_foreign_expr(), classifyConditions() and > appendOrderByClause() had variables/arguments named baserel when the > relations passed to those could be join or upper relation as well. > Here's patch renaming those as foreignrel. > -- > Best Wishes, > Ashutosh Bapat > EnterpriseDB Corporation > The Postgres Database Company > diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c > index e111b09..bcf9bea 100644 > --- a/contrib/postgres_fdw/deparse.c > +++ b/contrib/postgres_fdw/deparse.c > @@ -198,7 +198,7 @@ static void get_relation_column_alias_ids(Var *node, > RelOptInfo *foreignrel, > */ > void > classifyConditions(PlannerInfo *root, > -RelOptInfo *baserel, > +RelOptInfo *foreignrel, > List *input_conds, > List **remote_conds, > List **local_conds) > @@ -212,7 +212,7 @@ classifyConditions(PlannerInfo *root, > { > RestrictInfo *ri = lfirst_node(RestrictInfo, lc); > > - if (is_foreign_expr(root, baserel, ri->clause)) > + if (is_foreign_expr(root, foreignrel, ri->clause)) > *remote_conds = lappend(*remote_conds, ri); > else > *local_conds = lappend(*local_conds, ri); > @@ -224,29 +224,29 @@ classifyConditions(PlannerInfo *root, > */ > bool > is_foreign_expr(PlannerInfo *root, > - RelOptInfo *baserel, > + RelOptInfo *foreignrel, > Expr *expr) > { > foreign_glob_cxt glob_cxt; > foreign_loc_cxt loc_cxt; > - PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) > (baserel->fdw_private); > + PgFdwRelationInfo *fpinfo = (PgFdwRelationInfo *) > (foreignrel->fdw_private); > > /* >* Check that the expression consists of nodes that are safe to execute >* remotely. >*/ > glob_cxt.root = root; > - glob_cxt.foreignrel = baserel; > + glob_cxt.foreignrel = foreignrel; > > /* >* For an upper relation, use relids from its underneath scan relation, >* because the upperrel's own relids currently aren't set to anything >* meaningful by the core code. For other relation, use their own > relids. >*/ > - if (IS_UPPER_REL(baserel)) > + if (IS_UPPER_REL(foreignrel)) > glob_cxt.relids = fpinfo->outerrel->relids; > else > - glob_cxt.relids = baserel->relids; > + glob_cxt.relids = foreignrel->relids; > loc_cxt.collation = InvalidOid; > loc_cxt.state = FDW_COLLATE_NONE; > if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt)) > @@ -301,7 +301,7 @@ foreign_expr_walker(Node *node, > if (node == NULL) > return true; > > - /* May need server info from baserel's fdw_private struct */ > + /* May need server info from foreignrel's fdw_private struct */ > fpinfo = (PgFdwRelationInfo *) (glob_cxt->foreignrel->fdw_private); > > /* Set up inner_cxt for possible recursion to child nodes */ > @@ -2965,7 +2965,7 @@ appendGroupByClause(List *tlist, deparse_expr_cxt > *context) > } > > /* > - * Deparse ORDER BY clause according to the given pathkeys for given base > + * Deparse ORDER BY clause according to the given pathkeys for given foreign > * relation. From given pathkeys expressions belonging entirely to the given > * base relation are obtained and deparsed. 
> */ > @@ -2975,7 +2975,7 @@ appendOrderByClause(List *pathkeys, deparse_expr_cxt > *context) > ListCell *lcell; > int nestlevel; > char *delim = " "; > - RelOptInfo *baserel = context->scanrel; > + RelOptInfo *foreignrel = context->scanrel; > StringInfo buf = context->buf; > > /* Make sure any constants in the exprs are printed portably */ > @@ -2987,7 +2987,7 @@ appendOrderByClause(List *pathkeys, deparse_expr_cxt > *context) > PathKey*pathkey = lfirst(lcell); > Expr *em_expr; > > - em_expr = find_em_expr_for_rel(pathkey->pk_eclass, baserel); > + em_expr = find_em_expr_for_rel(pathkey->pk_eclass, foreignrel); > Assert(em_expr != NULL); > > appendStringInfoString(buf, delim); -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: common signal handler protection
On Tue, Nov 21, 2023 at 03:20:08PM -0600, Nathan Bossart wrote: > +#ifdef NSIG > +#define PG_NSIG (NSIG) > +#else > +#define PG_NSIG (64) /* XXX: wild guess */ > +#endif > + Assert(signo < PG_NSIG); cfbot seems unhappy with this on Windows. IIUC we need to use PG_SIGNAL_COUNT there instead, but I'd like to find a way to have just one macro for all platforms. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
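If it helps, one possible shape for a single macro across platforms (a sketch only, assuming the Windows port's PG_SIGNAL_COUNT is the right bound there; not a tested definition):

#ifdef WIN32
#define PG_NSIG (PG_SIGNAL_COUNT)   /* from win32_port.h */
#elif defined(NSIG)
#define PG_NSIG (NSIG)
#else
#define PG_NSIG (64)                /* XXX: wild guess */
#endif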
WIP: libpq: add a possibility to not send D(escribe) when executing a prepared statement
Hi! Currently libpq sends B(ind), D(escribe), E(execute), S(ync) when executing a prepared statement. The response for that D message is a RowDescription, which doesn't change during prepared statement lifetime (with the attributes format being an exception, as they aren't know before execution) . In a presumably very common case of repeatedly executing the same statement, this leads to both client and server parsing/sending exactly the same RowDescritpion data over and over again. Instead, library user could acquire a statement result RowDescription once (via PQdescribePrepared), and reuse it in subsequent calls to PQexecPrepared and/or its async friends. Doing it this way saves a measurable amount of CPU for both client and server and saves a lot of network traffic, for example: when selecting a single row from a table with 30 columns, where each column has 10-symbols name, and every value in a row is a 10-symbols TEXT, i'm seeing an amount of bytes sent to client decreased by a factor of 2.8, and the CPU time client spends in userland decreased by a factor of ~1.5. The patch attached adds a family of functions PQsendQueryPreparedPredescribed, PQgetResultPredescribed, PQisBusyPredescribed, which allow a user to do just that. If the idea seems reasonable i'd be happy to extend these to PQexecPrepared as well and cover it with tests. P.S. This is my first time ever sending a patch via email, so please don't hesitate to point at mistakes i'm doing in the process. diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index 850734ac96..989354ca06 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -191,3 +191,6 @@ PQclosePrepared 188 PQclosePortal 189 PQsendClosePrepared 190 PQsendClosePortal 191 +PQsendQueryPreparedPredescribed 192 +PQgetResultPredescribed 193 +PQisBusyPredescribed194 diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 04610ccf5e..a18bead9e6 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -72,8 +72,19 @@ static int PQsendQueryGuts(PGconn *conn, const char *const *paramValues, const int *paramLengths, const int *paramFormats, - int resultFormat); -static void parseInput(PGconn *conn); + int resultFormat, + bool sendDescribe); +static int PQsendQueryPreparedInternal(PGconn *conn, + const char *stmtName, + int nParams, + const char *const *paramValues, + const int *paramLengths, + const int *paramFormats, + int resultFormat, + bool sendDescribe); +static int PQisBusyInternal(PGconn *conn, PGresult *description); +static PGresult *PQgetResultInternal(PGconn *conn, PGresult *description); +static void parseInput(PGconn *conn, PGresult *description); static PGresult *getCopyResult(PGconn *conn, ExecStatusType copytype); static bool PQexecStart(PGconn *conn); static PGresult *PQexecFinish(PGconn *conn); @@ -1528,7 +1539,8 @@ PQsendQueryParams(PGconn *conn, paramValues, paramLengths, paramFormats, - resultFormat); + resultFormat, + true /* send Describe */ ); } /* @@ -1643,6 +1655,26 @@ PQsendQueryPrepared(PGconn *conn, const int *paramLengths, const int *paramFormats, int resultFormat) +{ + return PQsendQueryPreparedInternal(conn, + stmtName, + nParams, + paramValues, + paramLengths, + paramFormats, + resultFormat, + true /* send Describe */ ); +} + +int +PQsendQueryPreparedInternal(PGconn *conn, + const char *stmtName, + int nParams, + const char *const *paramValues, + const int *paramLengths, + const int *paramFormats, + int resultFormat, + bool 
sendDescribe) { if (!PQsendQueryStart(conn, true)) return 0; @@ -1668,7 +1700,50 @@ PQsendQueryPrepared(PGconn *conn, paramValues, paramLengths, paramFormats, - resultFormat); + resultFormat, + sendDescribe); +} + +int +PQsendQueryPreparedPredescribed(PGconn *conn, +const char *stmtName, +int nParams, +const char *const *paramValues, +const int *paramLengths, +const int *paramFormats, +int resultFormat, +PGresult *description) +{ + int i; + int nFields; + int send_result; + + if (!description) + { + libpq_append_conn_error(conn, "query result description is a null pointer"); + return 0; + } + + send_result = PQsendQueryPreparedInternal(conn, + stmtName, + nParams, + paramValues, + paramLengths, + paramFormats, + resultFormat, + false /* send Describe */ ); + + /* We only need to adjust attributes format if
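To make the intended call pattern concrete, here is a rough client-side sketch of how the proposed functions would be used; the function names and signatures follow the patch above, not released libpq, and run_many plus the statement name "stmt" are illustrative only. Error handling is elided.

#include <libpq-fe.h>

/* Execute the already-prepared statement "stmt" many times, reusing one
 * RowDescription obtained up front instead of receiving one per execution. */
static void
run_many(PGconn *conn, int iterations)
{
    /* Fetch the statement's RowDescription once. */
    PGresult   *desc = PQdescribePrepared(conn, "stmt");

    for (int i = 0; i < iterations; i++)
    {
        PGresult   *res;

        /* Bind/Execute without asking the server to Describe again. */
        if (!PQsendQueryPreparedPredescribed(conn, "stmt", 0,
                                             NULL, NULL, NULL,
                                             0 /* text format */ , desc))
            break;

        /* Result attributes come from "desc" rather than from the wire. */
        while ((res = PQgetResultPredescribed(conn, desc)) != NULL)
        {
            /* ... consume rows ... */
            PQclear(res);
        }
    }

    PQclear(desc);
}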
Re: [HACKERS] legitimacy of using PG_TRY , PG_CATCH , PG_END_TRY in C function
On Mon, Oct 23, 2017 at 01:27:43AM -0700, Andres Freund wrote: > On 2017-10-22 23:04:50 -0400, Tom Lane wrote: > > John Lumby writes: > > > I have a C function (a trigger function) which uses the PG_TRY() > > > construct to handle certain ERROR conditions. > > > One example is where invoked as INSTEAD OF INSERT into a view. It > > > PG_TRYs INSERT into the real base table, > > > but this table may not yet exist (it is a partitioned child of an > > > inheritance parent). > > > If the error is ERRCODE_UNDEFINED_TABLE, then the CATCH issues > > > FlushErrorState() and returns to caller who CREATes the table and > > > re-issues the insert. > > > All works perfectly (on every release of 9.x). > > > > If it works, it's only because you didn't try very hard to break it. > > In general you can't catch and ignore errors without a full-fledged > > subtransaction --- BeginInternalSubTransaction, then either > > ReleaseCurrentSubTransaction or RollbackAndReleaseCurrentSubTransaction, > > not just FlushErrorState. See e.g. plpgsql's exec_stmt_block. > > > > There may well be specific scenarios where an error gets thrown without > > having done anything that requires transaction cleanup. But when you > > hit a scenario where that's not true, or when a scenario that used to > > not require cleanup now does, nobody is going to consider that a PG bug. > > It'd probably be a good idea to expand on this in the sgml docs. This > has confused quite a number of people... I know this is from 2017, but where would we document this? I don't see PG_TRY/PG_CATCH mentioned in the SGML docs at all. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
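For the record, the pattern Tom describes looks roughly like this in a C function. This is a sketch only, modeled on (and simplified from) plpgsql's exec_stmt_block; the function name try_risky_insert is illustrative.

#include "postgres.h"

#include "access/xact.h"
#include "utils/resowner.h"

/*
 * Sketch: attempt an operation and swallow only ERRCODE_UNDEFINED_TABLE,
 * using an internal subtransaction rather than a bare FlushErrorState().
 */
static bool
try_risky_insert(void)
{
    MemoryContext oldcontext = CurrentMemoryContext;
    ResourceOwner oldowner = CurrentResourceOwner;
    bool        ok = true;

    BeginInternalSubTransaction(NULL);
    /* do the risky work in the caller's memory context */
    MemoryContextSwitchTo(oldcontext);

    PG_TRY();
    {
        /* ... attempt the INSERT into the possibly-missing table here ... */

        /* success: commit the subtransaction and restore state */
        ReleaseCurrentSubTransaction();
        MemoryContextSwitchTo(oldcontext);
        CurrentResourceOwner = oldowner;
    }
    PG_CATCH();
    {
        ErrorData  *edata;

        /* save the error before cleaning up */
        MemoryContextSwitchTo(oldcontext);
        edata = CopyErrorData();
        FlushErrorState();

        /* this rollback is the cleanup a bare FlushErrorState() would skip */
        RollbackAndReleaseCurrentSubTransaction();
        MemoryContextSwitchTo(oldcontext);
        CurrentResourceOwner = oldowner;

        if (edata->sqlerrcode != ERRCODE_UNDEFINED_TABLE)
            ReThrowError(edata);
        FreeErrorData(edata);
        ok = false;             /* caller can CREATE the table and retry */
    }
    PG_END_TRY();

    return ok;
}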
Re: Typo with amtype = 's' in opr_sanity.sql
On Tue, Nov 21, 2023 at 01:02:40PM +0300, Aleksander Alekseev wrote: >> It seems to me that this has been copy-pasted on HEAD from the >> sequence AM patch, but forgot to update amtype to 't'. While that's >> maybe cosmetic, I think that this could lead to unexpected results, so >> perhaps there is a point in doing a backpatch? > > I disagree that it's cosmetic. The test doesn't check what it's supposed to. Yes, I've backpatched that all the way down to 12 at the end. -- Michael signature.asc Description: PGP signature
Re: pg_upgrade and logical replication
Thanks for addressing my past review comments. Here are some more review comments for patch v16-0001 == doc/src/sgml/ref/pgupgrade.sgml 1. + + Create all the new tables that were created in the publication during + upgrade and refresh the publication by executing + ALTER SUBSCRIPTION ... REFRESH PUBLICATION. + "Create all ... that were created" sounds a bit strange. SUGGESTION (maybe like this or similar?) Create equivalent subscriber tables for anything that became newly part of the publication during the upgrade and == src/bin/pg_dump/pg_dump.c 2. getSubscriptionTables +/* + * getSubscriptionTables + * Get information about subscription membership for dumpable tables. This + *will be used only in binary-upgrade mode. + */ +void +getSubscriptionTables(Archive *fout) +{ + DumpOptions *dopt = fout->dopt; + SubscriptionInfo *subinfo = NULL; + SubRelInfo *subrinfo; + PQExpBuffer query; + PGresult *res; + int i_srsubid; + int i_srrelid; + int i_srsubstate; + int i_srsublsn; + int ntups; + Oid last_srsubid = InvalidOid; + + if (dopt->no_subscriptions || !dopt->binary_upgrade || + fout->remoteVersion < 17) + return; This function comment says "used only in binary-upgrade mode." and the Assert says the same. But, is this compatible with the other function dumpSubscriptionTable() where it says "used only in binary-upgrade mode and for PG17 or later versions"? == src/bin/pg_upgrade/check.c 3. check_new_cluster_subscription_configuration +static void +check_new_cluster_subscription_configuration(void) +{ + PGresult *res; + PGconn*conn; + int nsubs_on_old; + int max_replication_slots; + + /* Logical slots can be migrated since PG17. */ + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600) + return; IMO it is better to say < 1700 in this check, instead of <= 1600. ~~~ 4. + /* Quick return if there are no subscriptions to be migrated */ + if (nsubs_on_old == 0) + return; Missing period in comment. ~~~ 5. +/* + * check_old_cluster_subscription_state() + * + * Verify that each of the subscriptions has all their corresponding tables in + * i (initialize), r (ready) or s (synchronized) state. + */ +static void +check_old_cluster_subscription_state(ClusterInfo *cluster) This function is only for the old cluster (hint: the function name) so there is no need to pass the 'cluster' parameter here. Just directly use old_cluster in the function body. == src/bin/pg_upgrade/t/004_subscription.pl 6. +# Add tab_not_upgraded1 to the publication. Now publication has tab_upgraded1 +# and tab_upgraded2 tables. +$publisher->safe_psql('postgres', + "ALTER PUBLICATION regress_pub ADD TABLE tab_upgraded2"); Typo in comment. You added tab_not_upgraded2, not tab_not_upgraded1 ~~ 7. +# Subscription relations should be preserved. The upgraded won't know +# about 'tab_not_upgraded1' because the subscription is not yet refreshed. Typo or missing word in comment? "The upgraded" ?? == Kind Regards, Peter Smith. Fujitsu Australia
dblink query interruptibility
=== Background Something as simple as the following doesn't respond to cancellation. In v15+, any DROP DATABASE will hang as long as it's running: SELECT dblink_exec( $$dbname='$$||current_database()||$$' port=$$||current_setting('port'), 'SELECT pg_sleep(15)'); https://postgr.es/m/4b584c99.8090...@enterprisedb.com proposed a fix back in 2010. Latches and the libpqsrv facility have changed the server programming environment since that patch. The problem statement also came up here: On Thu, Dec 08, 2022 at 06:08:15PM -0800, Andres Freund wrote: > dblink.c uses a lot of other blocking libpq functions, which obviously also > isn't ok. === Key decisions This patch adds to libpqsrv facility. It dutifully follows the existing naming scheme. For greppability, I am favoring renaming new and old functions such that the libpq name is a substring of this facility's name. That is, rename libpqsrv_disconnect to srvPQfinish or maybe libpqsrv_PQfinish(). Now is better than later, while pgxn contains no references to libpqsrv. Does anyone else have a preference between naming schemes? If nobody does, I'll keep today's libpqsrv_disconnect() style. I was tempted to add a timeout argument to each libpqsrv function, which would allow libpqsrv_get_result_last() to replace pgfdw_get_cleanup_result(). We can always add a timeout-accepting function later and make this thread's function name a thin wrapper around it. Does anyone feel a mandatory timeout argument, accepting -1 for no timeout, would be the right thing? === Minor topics It would be nice to replace libpqrcv_PQgetResult() and friends with the new functions. I refrained since they use ProcessWalRcvInterrupts(), not CHECK_FOR_INTERRUPTS(). Since walreceiver already reaches CHECK_FOR_INTERRUPTS() via libpqsrv_connect_params(), things might just work. This patch contains a PQexecParams() wrapper, called nowhere in postgresql.git. It's inessential, but twelve pgxn modules do mention PQexecParams. Just one mentions PQexecPrepared, and none mention PQdescribe*. The patch makes postgres_fdw adopt its functions, as part of confirming the functions are general enough. postgres_fdw create_cursor() has been passing the "DECLARE CURSOR FOR inner_query" string for some error messages and just inner_query for others. I almost standardized on the longer one, but the test suite checks that. Hence, I standardized on just inner_query. I wrote this because pglogical needs these functions to cooperate with v15+ DROP DATABASE (https://github.com/2ndQuadrant/pglogical/issues/418). Thanks, nm Author: Noah Misch Commit: Noah Misch Make dblink interruptible, via new libpqsrv APIs. This replaces dblink's blocking libpq calls, allowing cancellation and allowing DROP DATABASE (of a database not involved in the query). Apart from explicit dblink_cancel_query() calls, dblink still doesn't cancel the remote side. The replacement for the blocking calls consists of new, general-purpose query execution wrappers in the libpqsrv facility. Out-of-tree extensions should adopt these. Use them in postgres_fdw, replacing a local implementation from which the libpqsrv implementation derives. This is a bug fix for dblink. Code inspection identified the bug at least thirteen years ago, but user complaints have not appeared. Hence, no back-patch for now. Reviewed by FIXME. 
Discussion: https://postgr.es/m/FIXME diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index 195b278..4624e53 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -133,6 +133,7 @@ static HTAB *remoteConnHash = NULL; /* custom wait event values, retrieved from shared memory */ static uint32 dblink_we_connect = 0; static uint32 dblink_we_get_conn = 0; +static uint32 dblink_we_get_result = 0; /* * Following is list that holds multiple remote connections. @@ -252,6 +253,9 @@ dblink_init(void) { if (!pconn) { + if (dblink_we_get_result == 0) + dblink_we_get_result = WaitEventExtensionNew("DblinkGetResult"); + pconn = (remoteConn *) MemoryContextAlloc(TopMemoryContext, sizeof(remoteConn)); pconn->conn = NULL; pconn->openCursorCount = 0; @@ -442,7 +446,7 @@ dblink_open(PG_FUNCTION_ARGS) /* If we are not in a transaction, start one */ if (PQtransactionStatus(conn) == PQTRANS_IDLE) { - res = PQexec(conn, "BEGIN"); + res = libpqsrv_exec(conn, "BEGIN", dblink_we_get_result); if (PQresultStatus(res) != PGRES_COMMAND_OK) dblink_res_internalerror(conn, res, "begin error"); PQclear(res); @@ -461,7 +465,7 @@ dblink_open(PG_FUNCTION_ARGS) (rconn->openCursorCount)++; appendStringInfo(&buf, "DECLARE %s CURSOR FOR %s", curname, sql); - res = PQexec(conn, buf.d
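As background, the essence of what the libpqsrv wrappers do, compared to a blocking PQexec()/PQgetResult(), is a wait loop along these lines. This is a simplified sketch of the pattern in libpq-be-fe-helpers.h, not the patch's exact code, and sketch_get_result is an illustrative name.

#include "postgres.h"

#include "libpq-fe.h"
#include "miscadmin.h"
#include "storage/latch.h"

/* Sketch of an interruptible replacement for a blocking PQgetResult(). */
static PGresult *
sketch_get_result(PGconn *conn, uint32 wait_event_info)
{
    while (PQisBusy(conn))
    {
        int         rc;

        /* sleep until the socket is readable or our latch is set */
        rc = WaitLatchOrSocket(MyLatch,
                               WL_EXIT_ON_PM_DEATH | WL_LATCH_SET |
                               WL_SOCKET_READABLE,
                               PQsocket(conn),
                               -1L /* no timeout */ ,
                               wait_event_info);

        if (rc & WL_LATCH_SET)
        {
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();     /* this is what makes it cancellable */
        }

        if ((rc & WL_SOCKET_READABLE) && !PQconsumeInput(conn))
            break;              /* connection trouble; let the caller see it */
    }

    return PQgetResult(conn);
}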
Re: [PATCH] fix race condition in libpq (related to ssl connections)
On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote: Thanks for the report, Willi, and the test case! Thanks Aleksander for the reply. > I wonder if we should just document that libpq is thread safe as of PG > v??? and deprecate PQisthreadsafe() at some point. Currently the > documentation gives an impression that the library may or may not be > thread safe depending on the circumstances. Because --disable-thread-safe has been removed recently in 68a4b58eca03. The routine could be marked as deprecated on top of saying that it always returns 1 for 17~. > Please add the patch to the nearest open commit fest [2]. The patch > will be automatically picked up by cfbot [3] and tested on different > platforms. Also this way it will not be lost among other patches. > > The code looks OK but I would appreciate a second opinion from cfbot. > Also maybe a few comments before my_BIO_methods_init_mutex and/or > pthread_mutex_lock would be appropriate. Personally I am inclined to > think that the automatic test in this particular case is redundant. I am not really convinced that we require a second mutex here, as there is always a concern with inter-locking changes. I may be missing something, of course, but BIO_s_socket() is just a pointer to a set of callbacks hidden in bss_sock.c with BIO_meth_new() and BIO_get_new_index() assigning some centralized data to handle the methods in a controlled way in OpenSSL. We only care about initializing once for the sake of libpq's threads, so wouldn't it be better to move my_BIO_s_socket() into pgtls_init(), where the initialization of the BIO methods would be protected by ssl_config_mutex? That's basically what Willi means in his first message, no? -- Michael signature.asc Description: PGP signature
Re: [PATCH] fix race condition in libpq (related to ssl connections)
On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote: > Please add the patch to the nearest open commit fest [2]. The patch > will be automatically picked up by cfbot [3] and tested on different > platforms. Also this way it will not be lost among other patches. I have noticed that this was not tracked yet, so I have added an entry here: https://commitfest.postgresql.org/46/4670/ Willi, note that this requires a PostgreSQL community account, and it does not seem like you have one, or I would have added you as author ;) -- Michael signature.asc Description: PGP signature
Re: [PATCH] fix race condition in libpq (related to ssl connections)
On Wed, Nov 22, 2023 at 2:44 PM Michael Paquier wrote: > On Tue, Nov 21, 2023 at 12:14:16PM +0300, Aleksander Alekseev wrote: > > I wonder if we should just document that libpq is thread safe as of PG > > v??? and deprecate PQisthreadsafe() at some point. Currently the > > documentation gives an impression that the library may or may not be > > thread safe depending on the circumstances. > > Because --disable-thread-safe has been removed recently in > 68a4b58eca03. The routine could be marked as deprecated on top of > saying that it always returns 1 for 17~. See also commit ce0b0fa3 "Doc: Adjust libpq docs about thread safety."
Re: [HACKERS] Changing references of password encryption to hashing
Is there any interest in fixing our documentation that says encrypted when it means hashed? Should I pursue this? --- On Fri, Mar 10, 2017 at 11:16:02AM +0900, Michael Paquier wrote: > Hi all, > > As discussed here: > https://www.postgresql.org/message-id/98cafcd0-5557-0bdf-4837-0f2b7782d...@joeconway.com > We are using in documentation and code comments "encryption" to define > what actually is hashing, which is confusing. > > Attached is a patch for HEAD to change the documentation to match hashing. > > There are a couple of things I noticed on the way: > 1) There is the user-visible PQencryptPassword in libpq, which > actually hashes the password, and not encrypts it :) > 2) There is as well pg_md5_encrypt() in the code, as well as there is > pg_md5_hash(). Those may be better if renamed, at least I would think > that pg_md5_encrypt should be pg_md5_password_hash, because a salt is > used as well, and the routine is dedicated to work on passwords. > Thoughts? > 3) createuser also has --encrypt and --unencrypted, perhaps those > should be renamed? Honestly I don't really think that this is worth a > breakage and the option names match with the SQL commands. > > I did not bother about those things in the attached, which works only > documentation and comment changes. > > An open item has been added on the wiki. > > Thanks, > -- > Michael > diff --git a/contrib/pgcrypto/crypt-des.c b/contrib/pgcrypto/crypt-des.c > index a4aa4966bf..1f65f286ea 100644 > --- a/contrib/pgcrypto/crypt-des.c > +++ b/contrib/pgcrypto/crypt-des.c > @@ -753,7 +753,7 @@ px_crypt_des(const char *key, const char *setting) > output[0] = setting[0]; > > /* > - * If the encrypted password that the salt was extracted from > is only > + * If the hashed password that the salt was extracted from is > only >* 1 character long, the salt will be corrupted. We need to > ensure >* that the output string doesn't have an extra NUL in it! >*/ > diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml > index 28cdabe6fe..abbd5dd19e 100644 > --- a/doc/src/sgml/catalogs.sgml > +++ b/doc/src/sgml/catalogs.sgml > @@ -1334,8 +1334,8 @@ >rolpassword >text > > - Password (possibly encrypted); null if none. The format depends > - on the form of encryption used. > + Password (possibly hashed); null if none. The format depends > + on the form of hashing used. > > > > @@ -1350,19 +1350,20 @@ > > > > - For an MD5 encrypted password, rolpassword > + For an MD5-hashed password, rolpassword > column will begin with the string md5 followed by a > 32-character hexadecimal MD5 hash. The MD5 hash will be of the user's > password concatenated to their user name. For example, if user > joe has password xyzzy, PostgreSQL > will store the md5 hash of xyzzyjoe. If the password is > - encrypted with SCRAM-SHA-256, it consists of 5 fields separated by colons. > + hashed with SCRAM-SHA-256, it consists of 5 fields separated by colons. > The first field is the constant scram-sha-256, to > identify the password as a SCRAM-SHA-256 verifier. The second field is a > salt, Base64-encoded, and the third field is the number of iterations used > to generate the password. The fourth field and fifth field are the stored > key and server key, respectively, in hexadecimal format. A password that > - does not follow either of those formats is assumed to be unencrypted. > + does not follow either of those formats is assumed to be in plain format, > + non-hashed. 
> > > > @@ -10269,9 +10270,9 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts > ppx >passwd >text > > - Password (possibly encrypted); null if none. See > + Password (possibly hashed); null if none. See > linkend="catalog-pg-authid">pg_authid > - for details of how encrypted passwords are stored. > + for details of how hashed passwords are stored. > > > > diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml > index 69844e5b29..994ed6c1bd 100644 > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -1190,11 +1190,11 @@ include_dir 'conf.d' > When a password is specified in or > without writing either > ENCRYPTED > or UNENCRYPTED, this parameter determines whether the > -password is to be encrypted. The default value is md5, > which > +password is to be hashed. The default value is md5, which > stores the password as an MD5 hash. Setting this to > plain stores > it in plaintext. on and off are also > accepted, as > aliases for md5 and plain, respectively. > Setting > -this parameter to scram will encrypt th
Re: proposal: possibility to read dumped table's name from file
On 11/21/23 at 22:10, Daniel Gustafsson wrote: On 20 Nov 2023, at 06:20, Pavel Stehule wrote: The attached is pretty close to a committable patch IMO, review is welcome on both the patch and commit message. I tried to identify all reviewers over the past 3+ years but I might have missed someone. I've tested this, albeit mostly in the initial iterations (*shrug* but a mention is nice) Erik Rijkers -- Daniel Gustafsson
RE: Synchronizing slots from primary to standby
On Tuesday, November 21, 2023 5:33 PM shveta malik wrote: > > > v37 fails to apply to HEAD due to a recent commit e83aa9f92fdd, rebased the > patches. PFA v37_2 patches. Thanks for updating the patches. I'd like to discuss one issue related to the correct handling of failover flag when executing ALTER SUBSCRIPTION SET (slot_name = 'new_slot')". Since the command intends to use a new slot on the primary, the new slot needs to reflect the "failover" state that the subscription currently has. If the failoverstate of the Subscription is LOGICALREP_FAILOVER_STATE_ENABLED, then I can reset it to LOGICALREP_FAILOVER_STATE_PENDING and allow the apply worker to handle it the way it is handled today (just like two_phase handling). But if the failoverstate is LOGICALREP_FAILOVER_STATE_DISABLED, the original idea is to call walrcv_alter_slot and alter the slot from the "ALTER SUBSCRIPTION" handling backend itself. This works if the slot is currently disabled. But the " ALTER SUBSCRIPTION SET (slot_name = 'new_slot')" command is supported even if the subscription is enabled. If the subscription is enabled, then calling walrcv_alter_slot() fails because the slot is still acquired by apply worker. So, I am thinking do we need a new mechanism to change the failover flag to false on an enabled subscription ? For example, we could call walrcv_alter_slot on startup of apply worker if AllTablesyncsReady(), for both true and false values of failover flag. This way, every time apply worker is started, it calls walrcv_alter_slot to set the failover flag on the primary. Or we could just document that it is user's responsibility to match the failover property in case it changes the slot_name. Thoughts ? Best Regards, Hou zj
Re: [PoC] Reducing planning time when tables have many partitions
Hello Alena, Andrei, and all, Thank you for reviewing this patch. I really apologize for not updating this thread for a while. On Sat, Nov 18, 2023 at 6:04 AM Alena Rybakina wrote: > Hi, all! > > While I was reviewing the patches, I noticed that they needed some rebasing, > and in one of the patches (Introduce-indexes-for-RestrictInfo.patch) there > was a conflict with the recently added self-join-removal feature [1]. So, I > rebased patches and resolved the conflicts. While I was doing this, I found a > problem that I also fixed: Thank you very much for rebasing these patches and fixing the issue. The bug seemed to be caused because these indexes were in RangeTblEntry, and the handling of their serialization was not correct. Thank you for fixing it. On Mon, Nov 20, 2023 at 1:45 PM Andrei Lepikhov wrote: > During the work on committing the SJE feature [1], Alexander Korotkov > pointed out the silver lining in this work [2]: he proposed that we > shouldn't remove RelOptInfo from simple_rel_array at all but replace it > with an 'Alias', which will refer the kept relation. It can simplify > further optimizations on removing redundant parts of the query. Thank you for sharing this information. I think the idea suggested by Alexander Korotkov is also helpful for our patch. As mentioned above, the indexes are in RangeTblEntry in the current implementation. However, I think RangeTblEntry is not the best place to store them. An 'alias' relids may help solve this and simplify fixing the above bug. I will try this approach soon. Unfortunately, I've been busy due to work, so I won't be able to respond for several weeks. I'm really sorry for not being able to see the patches. As soon as I'm not busy, I will look at them, consider the above approach, and reply to this thread. If there is no objection, I will move this CF entry forward to next CF. Again, thank you very much for looking at this thread, and I'm sorry for my late work. -- Best regards, Yuya Watari
Re: Schema variables - new implementation for Postgres 15
Hi, On Tue, Oct 17, 2023 at 08:52:13AM +0200, Pavel Stehule wrote: > > When I thought about global temporary tables, I got one maybe interesting > idea. The one significant problem of global temporary tables is place for > storing info about size or column statistics. > > I think so these data can be stored simply in session variables. Any global > temporary table can get assigned one session variable, that can hold these > data. I don't know how realistic this would be. For instance it will require to properly link the global temporary table life cycle with the session variable and I'm afraid it would require to add some hacks to make it work as needed. But this still raises the question of whether this feature could be used internally for the need of another feature. If we think it's likely, should we try to act right now and reserve the "pg_" prefix for internal use rather than do that a few years down the line and probably break some user code as it was done recently for the role names?
Re: Postgres picks suboptimal index after building of an extended statistics
Thanks for detaied answer, On 3/11/2023 00:37, Tomas Vondra wrote: On 9/25/23 06:30, Andrey Lepikhov wrote: ... I can't stop thinking about this issue. It is bizarre when Postgres chooses a non-unique index if a unique index gives us proof of minimum scan. That's true, but no one implemented this heuristics. So the "proof of minimum scan" is merely hypothetical - the optimizer is unaware of it. See the simple patch in the attachment. There, I have attempted to resolve situations of uncertainty to avoid making decisions based solely on the order of indexes in the list. I don't see a reason to teach the clauselist_selectivity() routine to estimate UNIQUE indexes. We add some cycles, but it will work with btree indexes only. I'm not sure I understand what this is meant to say. Can you elaborate? We only allow UNIQUE for btree indexes anyway, so what exactly is the problem here? Partly, you already answered yourself below: we have unique index estimation in a few estimation calls, but go through the list of indexes each time. Also, for this sake, we would add some input parameters, usually NULL, because many estimations don't involve indexes at all. Maybe to change compare_path_costs_fuzzily() and add some heuristic, for example: "If selectivity of both paths gives us no more than 1 row, prefer to use a unique index or an index with least selectivity." I don't understand how this would work. What do yo mean by "selectivity of a path"? AFAICS the problem here is that we estimate a path to return more rows (while we know there can only be 1, thanks to UNIQUE index). Oops, I meant cardinality. See the patch in the attachment. So how would you know the path does not give us more than 1 row? Surely you're not proposing compare_path_costs_fuzzily() to do something expensive like analyzing the paths, or something. I solely propose to make optimizer more consecutive in its decisions: if we have one row for both path candidates, use uniqueness of the index or value of selectivity as one more parameter. Also, how would it work in upper levels? If you just change which path we keep, but leave the inaccurate row estimate in place, that may fix that level, but it's certainly going to confuse later planning steps. It is designed for the only scan level. IMHO the problem here is that we produce wrong estimate, so we better fix that, instead of adding band-aids to other places. Agree. I am looking for a solution to help users somehow resolve such problems. As an alternative solution, I can propose a selectivity hook or (maybe even better) - use the pathlist approach and add indexes into the index list with some predefined order - at first positions, place unique indexes with more columns, etc. This happens because functional dependencies are very simple type of statistics - it has some expectations about the input data and also the queries executed on it. For example it assumes the data is reasonably homogeneous, so that we can calculate a global "degree". But the data in the example directly violates that - it has 1000 rows that are very random (so we'd find no dependencies), and 1000 rows with perfect dependencies. Hence we end with degree=0.5, which approximates the dependencies to all data. Not great, true, but that's the price for simplicity of this statistics kind. So the simplest solution is to disable dependencies on such data sets. It's a bit inconvenient/unfortunate we build dependencies by default, and people may not be aware of there assumptions. 
Perhaps we could/should make dependency_degree() more pessimistic when we find some "violations" of the rule (we intentionally are not strict about it, because data are imperfect). I don't have a good idea how to change the formulas, but I'm open to the idea in principle. Thanks for the explanation! The other thing we could do is checking for unique indexes in clauselist_selectivity, and if we find a match we can just skip the extended statistics altogether. Not sure how expensive this would be, but for typical cases (with modest number of indexes) perhaps OK. It wouldn't work without a unique index, but I don't have a better idea. It looks a bit expensive for me. But I am ready to try, if current solution doesn't look applicable. -- regards, Andrei Lepikhov Postgres Professional From 21677861e101eed22c829e0b14290a56786a12c2 Mon Sep 17 00:00:00 2001 From: "Andrey V. Lepikhov" Date: Wed, 22 Nov 2023 12:01:39 +0700 Subject: [PATCH] Use an index path with the best selectivity estimation --- src/backend/optimizer/util/pathnode.c | 40 + .../expected/drop-index-concurrently-1.out| 16 --- src/test/regress/expected/functional_deps.out | 43 +++ src/test/regress/expected/join.out| 40 + src/test/regress/sql/functional_deps.sql | 36 5 files changed, 149 insertions(+), 26 deletions(-) diff --git a/src/backend/optimizer/util/pa
Re: Partial aggregates pushdown
Robert Haas wrote on 2023-11-21 20:16: > I don't think the patch does a good job explaining why HAVING, > DISTINCT, and ORDER BY are a problem. It seems to me that HAVING > shouldn't really be a problem, because HAVING is basically a WHERE > clause that occurs after aggregation is complete, and whether or not > the aggregation is safe shouldn't depend on what we're going to do > with the value afterward. The HAVING clause can't necessarily be > pushed to the remote side, but I don't see how or why it could make > the aggregate itself unsafe to push down. DISTINCT and ORDER BY are a > little trickier: if we pushed down DISTINCT, we'd still have to > re-DISTINCT-ify when combining locally, and if we pushed down ORDER > BY, we'd have to do a merge pass to combine the returned values unless > we could prove that the partitions were non-overlapping ranges that > would be visited in the correct order. Although that all sounds > doable, I think it's probably a good thing that the current patch > doesn't try to handle it -- this is complicated already. But it should > explain why it's not handling it and maybe even a bit about how it > could be handled in the future, rather than just saying "well, this > kind of thing is not safe." The trouble with that explanation is that > it does nothing to help the reader understand whether the thing in > question is *fundamentally* unsafe or whether we just don't have the > right code to make it work. Makes sense. Actually, I think I was wrong about this. We can't handle ORDER BY or DISTINCT because we can't distinct-ify or order after we've already partially aggregated. At least not in general, and not without additional aggregate support functions. So what I said above was wrong with respect to those. Or so I believe, anyway. But I still don't see why HAVING should be a problem. Hi. HAVING is also a problem. Consider the following query: SELECT count(a) FROM t HAVING count(a) > 10. We can't push it down to the foreign server, because HAVING needs the full aggregate result, which the foreign server doesn't have. -- Best regards, Alexander Pyhalov, Postgres Professional
Re: WIP: libpq: add a possibility to not send D(escribe) when executing a prepared statement
Hi Ivan, thank you for the patch. > On 22 Nov 2023, at 03:58, Ivan Trofimov wrote: > > Currently libpq sends B(ind), D(escribe), E(execute), S(ync) when executing a > prepared statement. > The response for that D message is a RowDescription, which doesn't change > during prepared > statement lifetime (with the attributes format being an exception, as they > aren't know before execution) . From my POV the idea seems reasonable (though I’m not a real libpq expert). BTW some drivers also send Describe even before Bind. This creates some fuss for routing connection poolers. > In a presumably very common case of repeatedly executing the same statement, > this leads to > both client and server parsing/sending exactly the same RowDescritpion data > over and over again. > Instead, library user could acquire a statement result RowDescription once > (via PQdescribePrepared), > and reuse it in subsequent calls to PQexecPrepared and/or its async friends. But what if query result structure changes? Will we detect this error gracefully and return correct error? Best regards, Andrey Borodin.
Re: Change GUC hashtable to use simplehash?
On Wed, Nov 22, 2023 at 12:00 AM Jeff Davis wrote: > > On Tue, 2023-11-21 at 16:42 +0700, John Naylor wrote: > > The strlen call required for hashbytes() is not free. > > Should we have a hash_string() that's like hash_bytes() but checks for > the NUL terminator itself? > > That wouldn't be inlinable, but it would save on the strlen() call. It > might benefit some other callers? We do have string_hash(), which...calls strlen. :-) Thinking some more, I'm not quite comfortable with the number of places in these patches that have to know about the pre-downcased strings, or whether we need that in the first place. If lower case is common enough to optimize for, it seems the equality function can just check strict equality on the char and only on mismatch try downcasing before returning false. Doing our own function would allow the compiler to inline it, or at least keep it on the same page. Further, the old hash function shouldn't need to branch to do the same downcasing, since hashing is lossy anyway. In the keyword hashes, we just do "*ch |= 0x20", which downcases letters and turns underscores to DEL. I can take a stab at that later.
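Roughly what that equality function could look like (a sketch of the idea only, not the eventual patch; guc_name_eq and ascii_downcase are illustrative names): compare bytes strictly first, and fall back to ASCII downcasing only on a mismatch.

#include <stdbool.h>

static inline char
ascii_downcase(char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? ch + ('a' - 'A') : ch;
}

/* Case-insensitive GUC-name comparison that is cheap when both sides are
 * already lower case, assumed to be the common case. */
static bool
guc_name_eq(const char *a, const char *b)
{
    for (;; a++, b++)
    {
        char        ca = *a;
        char        cb = *b;

        if (ca == cb)
        {
            if (ca == '\0')
                return true;    /* matched through the terminator */
            continue;           /* fast path: bytes already equal */
        }
        /* slow path: retry the comparison after downcasing both bytes */
        if (ascii_downcase(ca) != ascii_downcase(cb))
            return false;
    }
}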
initdb --no-locale=C doesn't work as specified when the environment is not C
Commit 3e51b278db leaves lc_* conf lines as commented-out when their value is "C". This leads to the following behavior. $ echo LANG ja_JP.UTF8 $ initdb --no-locale hoge $ grep lc_ hoge/postgresql.conf #lc_messages = 'C' # locale for system error message #lc_monetary = 'C' # locale for monetary formatting #lc_numeric = 'C' # locale for number formatting #lc_time = 'C' # locale for time formatting In this scenario, the postmaster ends up emitting log massages in Japanese, which contradicts the documentation. https://www.postgresql.org/docs/devel/app-initdb.html > --locale=locale > Sets the default locale for the database cluster. If this option is > not specified, the locale is inherited from the environment that > initdb runs in. Locale support is described in Section 24.1. > .. > --lc-messages=locale > Like --locale, but only sets the locale in the specified category. Here's a somewhat amusing case: $ echo LANG ja_JP.UTF8 $ initdb --lc_messages=C $ grep lc_ hoge/postgresql.conf #lc_messages = 'C' # locale for system error message lc_monetary = 'ja_JP.UTF8' # locale for monetary formatting lc_numeric = 'ja_JP.UTF8' # locale for number formatting lc_time = 'ja_JP.UTF8' # locale for time formatting Hmm. it seems that initdb replaces the values of all categories *except the specified one*. This behavior seems incorrect to me. initdb should replace the value when explicitly specified in the command line. If you use -c lc_messages=C, it does perform the expected behavior to some extent, but I believe this is a separate matter. I have doubts about not replacing these lines for purely cosmetic reasons. In this mail, I've attached three possible solutions for the original issue: the first one enforces replacement only when specified on the command line, the second one simply always performs replacement, and the last one addresses the concern about the absence of quotes around "C" by allowing explicit specification. (FWIW, I prefer the last one.) What do you think about these? regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 0c6f5ceb0a..56a8c5b60e 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -144,6 +144,10 @@ static char *lc_monetary = NULL; static char *lc_numeric = NULL; static char *lc_time = NULL; static char *lc_messages = NULL; +static bool lc_monetary_specified = false; +static bool lc_numeric_specified = false; +static bool lc_time_specified = false; +static bool lc_messages_specified = false; static char locale_provider = COLLPROVIDER_LIBC; static char *icu_locale = NULL; static char *icu_rules = NULL; @@ -1230,19 +1234,19 @@ setup_config(void) * Hack: don't replace the LC_XXX GUCs when their value is 'C', because * replace_guc_value will decide not to quote that, which looks strange. 
*/ - if (strcmp(lc_messages, "C") != 0) + if (strcmp(lc_messages, "C") != 0 || lc_messages_specified) conflines = replace_guc_value(conflines, "lc_messages", lc_messages, false); - if (strcmp(lc_monetary, "C") != 0) + if (strcmp(lc_monetary, "C") != 0 || lc_monetary_specified) conflines = replace_guc_value(conflines, "lc_monetary", lc_monetary, false); - if (strcmp(lc_numeric, "C") != 0) + if (strcmp(lc_numeric, "C") != 0 || lc_numeric_specified) conflines = replace_guc_value(conflines, "lc_numeric", lc_numeric, false); - if (strcmp(lc_time, "C") != 0) + if (strcmp(lc_time, "C") != 0 || lc_time_specified) conflines = replace_guc_value(conflines, "lc_time", lc_time, false); @@ -2374,6 +2378,15 @@ setlocales(void) icu_locale = locale; } + /* +* At this point, null means that the value is not specifed in the command +* line. +*/ + if (lc_numeric != NULL) lc_numeric_specified = true; + if (lc_time != NULL) lc_time_specified = true; + if (lc_monetary != NULL) lc_monetary_specified = true; + if (lc_messages != NULL) lc_messages_specified = true; + /* * canonicalize locale names, and obtain any missing values from our * current environment diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index 0c6f5ceb0a..646e8f29f6 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -1226,25 +1226,17 @@ setup_config(void) conflines = replace_guc_value(conflines, "shared_buf
Re: Make mesage at end-of-recovery less scary.
Anyway, this requires rebsaing, and done. Thanks for John (Naylor) for pointing this out. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From e56f1f24523e3e562a4db166dfeaadc79fd7b27a Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Tue, 7 Mar 2023 14:55:58 +0900 Subject: [PATCH v27] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. To make sure that the detection is correct, this patch checks if all trailing bytes in the same page are zeroed in that case. --- src/backend/access/transam/xlogreader.c | 134 ++ src/backend/access/transam/xlogrecovery.c | 94 +++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/035_recovery.pl | 130 + 6 files changed, 327 insertions(+), 52 deletions(-) create mode 100644 src/test/recovery/t/035_recovery.pl diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index e0baa86bd3..ce65f99c60 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -48,6 +48,8 @@ static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -149,6 +151,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -525,6 +528,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking) /* reset error state */ state->errormsg_buf[0] = '\0'; decoded = NULL; + state->EndOfWAL = false; state->abortedRecPtr = InvalidXLogRecPtr; state->missingContrecPtr = InvalidXLogRecPtr; @@ -608,16 +612,12 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Verify the record header. * * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. + * header might not fit on this page. */ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; /* * If the whole record header is on this page, validate it immediately. @@ -636,19 +636,19 @@ restart: } else { - /* There may be no next page if it's too small. 
*/ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: expected at least %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + /* + * xl_tot_len is the first field of the struct, so it must be on this + * page (the records are MAXALIGNed), but we cannot access any other + * fields until we've verified that we got the whole header. + */ + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } - /* We'll validate the header once we have the next page. */ + gotheader = false; } + total_len = record->xl_tot_len; + /* * Try to find space to decode this record, if we can do so without * calling palloc. If we can't, we'll try again below after we've @@ -1091,25 +1091,81 @@ XLogReaderInvalReadState(XLogReaderState *state) state->readLen = 0; } +/* + * Validate record length of an XLOG record header. + * + * This is substantially a part of ValidXLogRecordHeader. But XLogReadRecord + * needs this separate from the function in case of a partial record header. + * + * Returns true if the xl_tot_len header field has a seemingly valid value, + * which means the caller can proceed reading to the following part of the + * record. + */ +static bool +ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record) +{ + if (record->xl_tot_len == 0) + { + char *p; + char *pe; + + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the recor
Re: remaining sql/json patches
Hi,

On 2023-11-22 15:09:36 +0900, Amit Langote wrote:
> OK, I will keep polishing 0001-0003 with the intent to push it next
> week barring objections / damning findings.

I don't think the patchset is quite there yet. It's definitely getting
closer though! I'll try to do another review next week.

Greetings,

Andres Freund
Re: Synchronizing slots from primary to standby
In addition to my recent v35-0001 comment not yet addressed [1], here are
some review comments for patch v37-0001.

==
src/backend/replication/walsender.c

1. PhysicalWakeupLogicalWalSnd

+/*
+ * Wake up logical walsenders with failover-enabled slots if the physical slot
+ * of the current walsender is specified in standby_slot_names GUC.
+ */
+void
+PhysicalWakeupLogicalWalSnd(void)
+{
+	ListCell   *lc;
+	List	   *standby_slots;
+	bool		slot_in_list = false;
+
+	Assert(MyReplicationSlot != NULL);
+	Assert(SlotIsPhysical(MyReplicationSlot));
+
+	standby_slots = GetStandbySlotList(false);
+
+	foreach(lc, standby_slots)
+	{
+		char	   *name = lfirst(lc);
+
+		if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
+		{
+			slot_in_list = true;
+			break;
+		}
+	}
+
+	if (slot_in_list)
+		ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
+}

1a.
Easier to have a single assertion --
Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot));

~

1b.
Why bother with the 'slot_in_list' variable and the break, when you can just
call the ConditionVariableBroadcast() and return without having the extra
variable? (See the sketch following this review.)

==
src/test/recovery/t/050_verify_slot_order.pl

~~~

2. Should you name the global objects with a 'regress_' prefix, which seems
to be the standard for other new TAP tests?

~~~

3.
+#
+#| > standby1 (connected via streaming replication)
+# | > standby2 (connected via streaming replication)
+# primary - |
+# | > subscriber1 (connected via logical replication)
+# | > subscriber2 (connected via logical replication)
+#
+#
+# Set up is configured in such a way that primary never lets subscriber1 ahead
+# of standby1.

3a. Misaligned "|" in comment?

~

3b. IMO it would be better to give an overview of how this all works instead
of just saying "configured in such a way".

~~~

4.
+# Configure primary to disallow specified logical replication slot (lsub1_slot)
+# getting ahead of specified physical replication slot (sb1_slot).
+$primary->append_conf(

It is confusing because there is no "lsub1_slot" specified anywhere until
much later. Would you be able to provide some more details?

~~~

5.
+# Create another subscriber node, wait for sync to complete
+my $subscriber2 = PostgreSQL::Test::Cluster->new('subscriber2');
+$subscriber2->init(allows_streaming => 'logical');
+$subscriber2->start;
+$subscriber2->safe_psql('postgres', "CREATE TABLE tab_int (a int PRIMARY KEY);");
+$subscriber2->safe_psql('postgres',
+	"CREATE SUBSCRIPTION mysub2 CONNECTION '$publisher_connstr' "
+	. "PUBLICATION mypub WITH (slot_name = lsub2_slot);");
+$subscriber2->wait_for_subscription_sync;

Maybe this comment should explicitly say there is no failover enabled here.
Maybe the SUBSCRIPTION should explicitly set failover=false?

~~~

6.
+# The subscription that's up and running and is enabled for failover
+# doesn't get the data from primary and keeps waiting for the
+# standby specified in standby_slot_names.
+$result = $subscriber1->safe_psql('postgres',
+	"SELECT count(*) = 0 FROM tab_int;");
+is($result, 't', "subscriber1 doesn't get data from primary until standby1 acknowledges changes");

Might it be better to write this as "SELECT count(*) = $primary_row_count
FROM tab_int;" and expect it to return false?

==
src/test/regress/expected/subscription.out

7. Everything here displays the "Failover" state 'd' (disabled). How about
tests for different state values?

==
[1] https://www.postgresql.org/message-id/CAHut%2BPv-yu71ogj_hRi6cCtmD55bsyw7XTxj1Nq8yVFKpY3NDQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia
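As a concrete illustration of review comments 1a and 1b, the quoted function
could be reduced to a single assertion plus broadcast-and-return. This is
only a reviewer's sketch reusing the identifiers visible in the hunk quoted
above (GetStandbySlotList, WalSndCtl->wal_confirm_rcv_cv, etc.), not code
from the patch itself.

/*
 * Sketch of the suggested simplification: one assertion, and broadcast as
 * soon as our physical slot is found in standby_slot_names, with no
 * bookkeeping variable.
 */
void
PhysicalWakeupLogicalWalSnd(void)
{
	ListCell   *lc;

	Assert(MyReplicationSlot && SlotIsPhysical(MyReplicationSlot));

	foreach(lc, GetStandbySlotList(false))
	{
		char	   *name = lfirst(lc);

		if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
		{
			ConditionVariableBroadcast(&WalSndCtl->wal_confirm_rcv_cv);
			return;
		}
	}
}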