Re: remove volatile qualifiers from pg_stat_statements
On Wed, Jul 31, 2024 at 07:01:38AM +, Bertrand Drouvot wrote: > I share the same understanding and I think those can be removed. > > The patch LGTM. That sounds about right. All the volatile references we have here have been kept under the assumption that a memory barrier is required. As we hold spin locks in these areas, that should not be necessary anyway. So LGTM as well. A quick lookup at the rest of contrib/ is showing me that the remaining volatile references point at uses with TRY/CATCH blocks, where we require them. -- Michael
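For readers following along, the pattern in question looks roughly like the sketch below. This is only an illustration of the idea (the entry and field names are simplified stand-ins for the pg_stat_statements structures), not the exact hunks of the committed patch:

```c
/*
 * Illustrative sketch only: the volatile-qualified pointer was taken on the
 * assumption that it was needed for ordering, but SpinLockAcquire() and
 * SpinLockRelease() already act as compiler/memory barriers for the fields
 * protected by the lock.
 */

/* Old style: cast through a volatile pointer "for the barrier" */
volatile pgssEntry *e = (volatile pgssEntry *) entry;

SpinLockAcquire(&e->mutex);
e->counters.calls += 1;
SpinLockRelease(&e->mutex);

/* Equivalent without volatile: the spinlock itself provides the barrier */
SpinLockAcquire(&entry->mutex);
entry->counters.calls += 1;
SpinLockRelease(&entry->mutex);
```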
Re: Logical Replication of sequences
On Tue, 6 Aug 2024 at 10:24, shveta malik wrote: > > On Tue, Aug 6, 2024 at 9:54 AM Amit Kapila wrote: > > > > On Tue, Aug 6, 2024 at 8:49 AM shveta malik wrote: > > > > > > > > > I was reviewing the design of patch003, and I have a query. Do we > > > > > > need > > > > > > to even start an apply worker and create replication slot when > > > > > > subscription created is for 'sequences only'? IIUC, currently > > > > > > logical > > > > > > replication apply worker is the one launching sequence-sync worker > > > > > > whenever needed. I think it should be the launcher doing this job > > > > > > and > > > > > > thus apply worker may even not be needed for current functionality > > > > > > of > > > > > > sequence sync? > > > > > > > > > > > > > But that would lead to maintaining all sequence-sync of each > > > > subscription by launcher. Say there are 100 sequences per subscription > > > > and some of them from each subscription are failing due to some > > > > reasons then the launcher will be responsible for ensuring all the > > > > sequences are synced. I think it would be better to handle > > > > per-subscription work by the apply worker. > > > > > > I thought we can give that task to sequence-sync worker. Once sequence > > > sync worker is started by launcher, it keeps on syncing until all the > > > sequences are synced (even failed ones) and then exits only after all > > > are synced; instead of apply worker starting it multiple times for > > > failed sequences. Launcher to start sequence sync worker when signaled > > > by 'alter-sub refresh seq'. > > > But after going through details given by Vignesh in [1], I also see > > > the benefits of using apply worker for this task. Since apply worker > > > is already looping and doing that for table-sync, we can reuse the > > > same code for sequence sync and maintenance will be easy. So looks > > > okay if we go with existing apply worker design. > > > > > > > Fair enough. However, I was wondering whether apply_worker should exit > > after syncing all sequences for a sequence-only subscription > > If apply worker exits, then on next sequence-refresh, we need a way to > wake-up launcher to start apply worker which then will start > table-sync worker. Instead, won't it be better if the launcher starts > table-sync worker directly without the need of apply worker being > present (which I stated earlier). I favour the current design because it ensures the system remains extendable for future incremental sequence synchronization. If the launcher were responsible for starting the sequence sync worker, it would add extra load that could hinder its ability to service other subscriptions and complicate the design for supporting incremental sync of sequences. Additionally, this approach offers the other benefits mentioned in [1]. > > or should > > it be there for future commands that can refresh the subscription and > > add additional tables or sequences? > > If we stick with apply worker starting table sync worker when needed > by continuously checking seq-sync states ('i'/'r'), then IMO, it is > better that apply-worker stays. But if we want apply-worker to exit > and start only when needed, then why not to start sequence-sync worker > directly for seq-only subscriptions? There is a risk that sequence synchronization might fail if the sequence value from the publisher falls outside the defined minvalue or maxvalue range. 
The apply worker must be active to determine whether to initiate the sequence sync worker after the wal_retrieve_retry_interval period. Publications consisting solely of sequences are uncommon. However, if a user wishes to use such publications, they can disable the subscription if necessary and re-enable it when a sequence refresh is needed. [1] - https://www.postgresql.org/message-id/CALDaNm1KO8f3Fj%2BRHHXM%3DUSGwOcW242M1jHee%3DX_chn2ToiCpw%40mail.gmail.com Regards, Vignesh
Re: Remove support for old realpath() API
> On 6 Aug 2024, at 07:43, Michael Paquier wrote: > I am dubious that it is a good idea to remove > this code knowing that we have a thread from a few months ago about > the fact that we have folks complaining about AIX support and that we > should bring it back: According to upthread, it has been supported since AIX 7.1, which shipped in 2010, so even if AIX support materializes, this code still wouldn't be needed. -- Daniel Gustafsson
Re: Remove unnecessary forward declaration for heapam_methods variable
On Wed, Jul 31, 2024 at 10:36:11AM +0800, Japin Li wrote: > I think the forward declaration for heapam_methods variable in > heapam_handler.c > is unnecessary, right? True. This can be removed because all the code paths using heapam_methods are after its declaration, so duplicating it makes little sense. Thanks, applied. -- Michael
Re: Injection points: preloading and runtime arguments
On Sun, Aug 04, 2024 at 01:02:22AM +0900, Michael Paquier wrote: > On Fri, Aug 02, 2024 at 07:03:58PM +0300, Andrey M. Borodin wrote: > > The test works fine with SQL interface “select > > injection_points_load('get-new-multixact-id');”. > > Yes, just use a load() here to make sure that the DSM required by the > waits are properly initialized before entering in the critical section > where the wait of the point get-new-multixact-id happens. Hmm. How about loading injection_points with shared_preload_libraries now that it has a _PG_init() thanks to 75534436a477 to take care of the initialization you need here? We could add two hooks to request some shmem based on a size and to do the shmem initialization. -- Michael
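For context, the generic way a module loaded through shared_preload_libraries reserves and initializes shared memory is via the shmem request/startup hooks registered from its _PG_init(). The sketch below shows that boilerplate under assumed names (injection_shmem_request, injection_shmem_startup and InjectionSharedState are illustrative, not the actual injection_points code; the module-level hooks Michael proposes would sit on top of this):

```c
#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

/* Illustrative shared state; the real contents are up to the module. */
typedef struct InjectionSharedState
{
	int			dummy;
} InjectionSharedState;

static InjectionSharedState *shared_state = NULL;

static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

/* Ask for the shared memory the module needs. */
static void
injection_shmem_request(void)
{
	if (prev_shmem_request_hook)
		prev_shmem_request_hook();
	RequestAddinShmemSpace(sizeof(InjectionSharedState));
}

/* Initialize (or attach to) that shared memory once shmem is up. */
static void
injection_shmem_startup(void)
{
	bool		found;

	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();

	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
	shared_state = ShmemInitStruct("injection_points",
								   sizeof(InjectionSharedState), &found);
	if (!found)
		memset(shared_state, 0, sizeof(InjectionSharedState));
	LWLockRelease(AddinShmemInitLock);
}

void
_PG_init(void)
{
	/* These hooks only make sense under shared_preload_libraries. */
	if (!process_shared_preload_libraries_in_progress)
		return;

	prev_shmem_request_hook = shmem_request_hook;
	shmem_request_hook = injection_shmem_request;
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = injection_shmem_startup;
}
```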
Re: psql client does not handle WSAEWOULDBLOCK on Windows
Hi Umar, In the gss_read() function, if I print the values of errno and SOCK_ERRNO separately, I find that they differ: *ret = pqsecure_raw_read(conn, recv_buffer, length); if (*ret < 0) { printf("errno: %d\n", errno); printf("result_errno: %d\n", SOCK_ERRNO); ... errno: 0 result_errno: 10035 Also, per https://learn.microsoft.com/en-us/windows/win32/api/winsock/nf-winsock-wsagetlasterror , Windows Sockets functions do not set errno but use WSAGetLastError to report errors. And per https://learn.microsoft.com/en-us/windows/win32/api/winsock2/nf-winsock2-recv , if the function fails, the caller should use WSAGetLastError to obtain the extended error information. This further shows that socket operation errors are not reported through errno. So changing the error-code check to use SOCK_ERRNO instead of errno lets EAGAIN, EWOULDBLOCK and EINTR be handled properly on both Windows and other platforms. To reproduce the issue, I used the following versions of Postgres and MIT Kerberos: PostgreSQL version: 16.3 MIT Kerberos Version 4.1 Operating System: Windows 11 Visual Studio 2022 On Tue, Jul 30, 2024 at 4:47 PM Ning wrote: > > *Description:* The connection fails with a non-blocking socket error when > using psql on > Windows to connect to a PostgreSQL server with GSSAPI enabled. The error is > because the socket error code is obtained by WSAGetLastError() instead of > errno. This causes the value of errno to be incorrect when handling a > non-blocking socket error. > > > *Steps to Reproduce:* 1. Compile PostgreSQL client (psql) on Windows. > a. Make sure MIT Kerberos is installed. I use the latest version MIT > Kerberos > Version 4.1. > b. Make sure GSSAPI is enabled > 2. Attempt to connect to a PostgreSQL server using psql. > a. Set up the Kerberos server and configure the PostgreSQL server by > referring > to https://github.com/50wu/kerberos-docker/blob/main/POSTGRES.README.md > b. change the entry to hostgssenc on the PostgreSQL server's pg_hba.conf and > restart > hostgssenc all all 0.0.0.0/0 gss include_realm=0 > krb_realm=GPDB.KRB > c. Use the following command to connect to the database server > psql -h -U "postgres/krb5-service-example-com.example.com" -d > postgres > 3. The connection fails with a non-blocking socket error. The error is > something like: > psql: error: connection to server at "xxx", port 5432 failed: > > *Environment*: > PostgreSQL version: 16.3 > Operating System: Windows 11 > > > *Fix Steps:* In the gss_read function of > src/interfaces/libpq/fe-secure-gssapi.c, change the > check of the error code to use SOCK_ERRNO to make sure that EAGAIN, > EWOULDBLOCK and EINTR can be properly handled on Windows and other > platforms. > > The patch file is attached to this email, please review and consider > merging it into > the main code library. > > Thanks, > Ning Wu >
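To make the proposal concrete, the retry check inside gss_read() would look roughly like this (an approximate sketch of the idea rather than the exact patch text). SOCK_ERRNO reads WSAGetLastError() on Windows and plain errno elsewhere, so the WSAEWOULDBLOCK case (the 10035 shown above) is recognized as a retryable condition:

```c
	/* Approximate sketch of the proposed change in gss_read(), not the
	 * exact patch: check SOCK_ERRNO instead of errno so that Windows
	 * socket errors (reported via WSAGetLastError) are seen as well. */
	*ret = pqsecure_raw_read(conn, recv_buffer, length);
	if (*ret < 0)
	{
		if (SOCK_ERRNO == EAGAIN || SOCK_ERRNO == EWOULDBLOCK ||
			SOCK_ERRNO == EINTR)
			return PGRES_POLLING_READING;	/* no data yet, try again later */
		else
			return PGRES_POLLING_FAILED;	/* a real error */
	}
```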
Rename C23 keyword
constexpr is a keyword in C23. Rename a conflicting identifier for future-proofing. Obviously, C23 is way in the future, but this is a hard error that prevents any further exploration. (To be clear: This only happens if you explicitly select C23 mode. I'm not aware of a compiler where this is the default yet.) From fa0f5ce119bff1bafd9fd278334c5caad242b6d2 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 6 Aug 2024 09:51:20 +0200 Subject: [PATCH] Rename C23 keyword constexpr is a keyword in C23. Rename a conflicting identifier for future-proofing. --- src/backend/optimizer/util/predtest.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c index 6e3b376f3d3..50ea8077367 100644 --- a/src/backend/optimizer/util/predtest.c +++ b/src/backend/optimizer/util/predtest.c @@ -948,7 +948,7 @@ boolexpr_startup_fn(Node *clause, PredIterInfo info) typedef struct { OpExpr opexpr; - Const constexpr; + Const const_expr; int next_elem; int num_elems; Datum *elem_values; @@ -992,13 +992,13 @@ arrayconst_startup_fn(Node *clause, PredIterInfo info) state->opexpr.args = list_copy(saop->args); /* Set up a dummy Const node to hold the per-element values */ - state->constexpr.xpr.type = T_Const; - state->constexpr.consttype = ARR_ELEMTYPE(arrayval); - state->constexpr.consttypmod = -1; - state->constexpr.constcollid = arrayconst->constcollid; - state->constexpr.constlen = elmlen; - state->constexpr.constbyval = elmbyval; - lsecond(state->opexpr.args) = &state->constexpr; + state->const_expr.xpr.type = T_Const; + state->const_expr.consttype = ARR_ELEMTYPE(arrayval); + state->const_expr.consttypmod = -1; + state->const_expr.constcollid = arrayconst->constcollid; + state->const_expr.constlen = elmlen; + state->const_expr.constbyval = elmbyval; + lsecond(state->opexpr.args) = &state->const_expr; /* Initialize iteration state */ state->next_elem = 0; @@ -1011,8 +1011,8 @@ arrayconst_next_fn(PredIterInfo info) if (state->next_elem >= state->num_elems) return NULL; - state->constexpr.constvalue = state->elem_values[state->next_elem]; - state->constexpr.constisnull = state->elem_nulls[state->next_elem]; + state->const_expr.constvalue = state->elem_values[state->next_elem]; + state->const_expr.constisnull = state->elem_nulls[state->next_elem]; state->next_elem++; return (Node *) &(state->opexpr); } -- 2.46.0
Re: [PATCH] Add crc32(text) & crc32(bytea)
Hi, > This looks pretty good to me. The only point that I think deserves more > discussion is the return type. Does bytea make the most sense here? Or > should we consider int/bigint? Personally I would choose BYTEA in order to be consistent with sha*() functions. It can be cast to TEXT if the user wants a result similar to the one md5() returns: ``` SELECT encode(crc32('PostgreSQL'), 'hex'); ``` ... and, somewhat less conveniently, to BIGINT: ``` SELECT ((get_byte(crc, 0) :: bigint << 24) | (get_byte(crc, 1) << 16) | (get_byte(crc, 2) << 8) | get_byte(crc, 3)) FROM (SELECT crc32('PostgreSQL') AS crc); ``` I don't like the `integer` option because a crc32 value is typically considered unsigned and `integer` is not large enough to represent uint32. Perhaps we need get_int4() / get_int8() / get_numeric() as there seems to be demand [1][2], and they would allow us to easily cast a `bytea` value to `integer` or `bigint`. This is probably another topic though. [1]: https://stackoverflow.com/questions/32944267/postgresql-converting-bytea-to-bigint [2]: https://postgr.es/m/AANLkTikip9xs8iXc8e%2BMgz1T1701i8Xk6QtbVB3KJQzX%40mail.gmail.com -- Best regards, Aleksander Alekseev
RE: Conflict detection and logging in logical replication
On Monday, August 5, 2024 6:52 PM Amit Kapila wrote: > > On Sun, Aug 4, 2024 at 1:22 PM Zhijie Hou (Fujitsu) > wrote: > > > > On Friday, August 2, 2024 7:03 PM Amit Kapila > wrote: > > > > > > > Here is the V11 patch set which addressed above and Kuroda-san[1] > comments. > > > > A few design-level points: > > * > @@ -525,10 +602,33 @@ ExecSimpleRelationInsert(ResultRelInfo > *resultRelInfo, > /* OK, store the tuple and create index entries for it */ > simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot); > > + conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes; > + > if (resultRelInfo->ri_NumIndices > 0) > recheckIndexes = ExecInsertIndexTuples(resultRelInfo, > -slot, estate, false, false, > -NULL, NIL, false); > +slot, estate, false, > +conflictindexes ? true : false, > +&conflict, > +conflictindexes, false); > + > + /* > + * Checks the conflict indexes to fetch the conflicting local tuple > + * and reports the conflict. We perform this check here, instead of > + * performing an additional index scan before the actual insertion and > + * reporting the conflict if any conflicting tuples are found. This is > + * to avoid the overhead of executing the extra scan for each INSERT > + * operation, even when no conflict arises, which could introduce > + * significant overhead to replication, particularly in cases where > + * conflicts are rare. > + * > + * XXX OTOH, this could lead to clean-up effort for dead tuples added > + * in heap and index in case of conflicts. But as conflicts shouldn't > + * be a frequent thing so we preferred to save the performance overhead > + * of extra scan before each insertion. > + */ > + if (conflict) > + CheckAndReportConflict(resultRelInfo, estate, CT_INSERT_EXISTS, > +recheckIndexes, slot); > > I was thinking about this case where we have some pros and cons of doing > additional scans only after we found the conflict. I was wondering how we will > handle the resolution strategy for this when we have to remote_apply the tuple > for insert_exists/update_exists cases. > We would have already inserted the remote tuple in the heap and index before > we found the conflict which means we have to roll back that change and then > start a fresh transaction to perform remote_apply which probably has to > update the existing tuple. We may have to perform something like speculative > insertion and then abort it. That doesn't sound cheap either. Do you have any > better ideas? Since most of the conflict detection code can be reused in the later resolution patch, I am thinking we can go for the re-scan-after-insertion approach for the detection patch. Then in the resolution patch we can probably have a check such that if the resolver is remote_apply/last_update_win we detect the conflict before insertion, and otherwise detect it after. This way we can avoid adding a subscription option in the detection patch because we are not introducing overhead for the detection. And we can also save some overhead in the resolution patch if there is no need to do a prior check. There could be some duplicated code in the resolution patch, as it would have code for both the prior check and the after-check, but that seems acceptable. > > * > -ERROR: duplicate key value violates unique constraint "test_pkey" > -DETAIL: Key (c)=(1) already exists. > +ERROR: conflict insert_exists detected on relation "public.test" > +DETAIL: Key (c)=(1) already exists in unique index "t_pkey", which > was modified locally in transaction 740 at 2024-06-26 10:47:04.727375+08. 
> > I think the format to display conflicts is not very clear. The conflict > should be > apparent just by seeing the LOG/ERROR message. I am thinking of something > like below: > > LOG: CONFLICT: ; > DESCRIPTION: If any .. ; RESOLUTION: (This one can be added later) > DETAIL: remote_tuple (tuple values); local_tuple (tuple values); > > With the above, one can easily identify the conflict's reason and action > taken by > apply worker. Thanks for the idea! I thought about a few styles based on the suggested format; what do you think about the following? --- Version 1 --- LOG: CONFLICT: insert_exists; DESCRIPTION: remote INSERT violates unique constraint "uniqueindex" on relation "public.test". DETAIL: Existing local tuple (a, b, c) = (2, 3, 4) xid=123,origin="pub",timestamp=xxx; remote tuple (a, b, c) = (2, 4, 5). LOG: CONFLICT: update_differ; DESCRIPTION: updating a row with key (a, b) = (2, 4) on relation "public.test" was modified by a different source. DETAIL: Existing local tuple (a, b, c) = (2, 3, 4) xid=123,origin="pub",timestamp=xxx; remote tuple (a, b, c) = (2, 4, 5). LOG: CONFLICT: update_missing; DESCRIPTION: did not find the row with key (a, b) = (2, 4) on "public.test" to update. DETAIL: remote tuple (a, b, c) = (2, 4, 5). --- Version 2 It moves most of the details to the DETAIL line compared to Version 1. --- LOG: CONFLICT: insert_exists on relation "public.test". DETAIL: Key (a)=(1) already exists in u
Re: Wrong results with grouping sets
On Wed, Jun 5, 2024 at 5:42 PM Richard Guo wrote: > I found a bug in the v6 patch. The following query would trigger the > Assert in make_restrictinfo that the given subexpression should not be > an AND clause. > > select max(a) from t group by a > b and a = b having a > b and a = b; > > This is because the expression 'a > b and a = b' in the HAVING clause is > replaced by a Var that references the GROUP RTE. When we preprocess the > columns of the GROUP RTE, we do not know whether the grouped expression > is a havingQual or not, so we do not perform make_ands_implicit for it. > As a result, after we replace the group Var in the HAVING clause with > the underlying grouping expression, we will have a havingQual that is an > AND clause. > > As we know, in the planner we need to first preprocess all the columns > of the GROUP RTE. We also need to replace any Vars in the targetlist > and HAVING clause that reference the GROUP RTE with the underlying > grouping expressions. To fix the mentioned issue, I chose to perform > this replacement before we preprocess the targetlist and havingQual, so > that the make_ands_implicit would be performed when we preprocess the > havingQual. I've realized that there is something wrong with this conclusion. If we perform the replacement of GROUP Vars with the underlying grouping expressions before we've done with expression preprocessing on targetlist and havingQual, we may end up failing to match the expressions that are part of grouping items to lower target items. Consider: create table t (a int, b int); insert into t values (1, 2); select a < b and b < 3 from t group by rollup(a < b and b < 3) having a < b and b < 3; The expression preprocessing step would convert the HAVING clause to implicit-AND format and thus it would fail to be matched to lower target items. Another example is: create table t1 (a boolean); insert into t1 values (true); select not a from t1 group by rollup(not a) having not not a; This HAVING clause 'not not a' would be reduced to 'a' and thus fail to be matched to lower tlist. I fixed this issue in v13 by performing the replacement of GROUP Vars after we've done with expression preprocessing on targetlist and havingQual. An ensuing effect of this approach is that a HAVING clause may contain expressions that are not fully preprocessed if they are part of grouping items. This is not an issue as long as the clause remains in HAVING. But if the clause is moved or copied into WHERE, we need to re-preprocess these expressions. Please see the attached for the changes. Thanks Richard v13-0001-Introduce-an-RTE-for-the-grouping-step.patch v13-0002-Mark-expressions-nullable-by-grouping-sets.patch v13-0003-Unwrap-a-PlaceHolderVar-when-safe.patch
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier wrote: > On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote: > > The 0001 patch is intended to improve this situation. Actually, it's > > not right to just put RecoveryInProgress() after > > GetXLogReplayRecPtr(), because more wal could be replayed between > > these calls. Instead we need to recheck GetXLogReplayRecPtr() after > > getting negative result of RecoveryInProgress() because WAL replay > > position couldn't get updated after. > > 0002 patch comprises fix for the header comment of WaitLSNSetLatches() > > function > > 0003 patch comprises tests for pg_wal_replay_wait() errors. > > Before adding more tests, could it be possible to stabilize what's in > the tree? drongo has reported one failure with the recovery test > 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54 Thank you for pointing this out! Surely, I'll fix this first. -- Regards, Alexander Korotkov Supabase
Re: Memory growth observed with C++ application consuming libpq.dll on Windows
I ran a trial version of a memory leak detector called Deleaker on Windows and found some modules that are listed as having leaks. I ran the program on Linux under valgrind and I do not see any leaks reported there. I have attached the reported leaks on Windows as windows_leaks.txt and the valgrind summary report as valgrind.txt. I am working on generating a trimmed down version of the sample program to share with you. Let me know if you have any questions. Thanks, Rajesh On Fri, Aug 2, 2024 at 10:19 PM Rajesh Kokkonda wrote: > We did run our application under valgrind on Linux. We did not see any > leaks. There is no platform dependent code in our application. We are > seeing gradual memory growth only on windows. > > That is what led me to believe the leak may be present in postgresql. I > will run under available memory tools on windows and get back to you. > > I will also try to create a sample and see if I can reproduce the problem. > > Thanks, > Rajesh > > On Fri, 2 Aug 2024, 21:45 Ranier Vilela, wrote: > >> On Fri, Aug 2, 2024 at 11:54 AM Rajesh Kokkonda < >> rajeshk.kokko...@gmail.com> wrote: >> >>> Okay. I will try to create one sample program and send it to you >>> sometime next week. In the meantime, I am listing down all the methods we >>> are consuming from libpq. >>> >>> PQconnectdbParams >>> PQstatus >>> PQerrorMessage >>> PQpingParams >>> PQfinish >>> PQresultStatus >>> PQclear >>> PQsetSingleRowMode >>> PQntuples >>> PQnfields >>> PQftype >>> PQgetvalue >>> PQgetlength >>> PQgetisnull >>> PQgetCancel >>> PQfreeCancel >>> PQcancel >>> PQsetErrorVerbosity >>> PQsendPrepare >>> PQsendQueryPrepared >>> PQgetResult >>> PQconsumeInput >>> PQisBusy >>> PQsetnonblocking >>> PQflush >>> PQsocket >>> PQtransactionStatus >>> PQresultErrorField >>> >>> It is highly likely that the memory consumption is caused by your >> application. >> Perhaps due to the lack of freeing up the resources used by the library. >> You can try using this tool, to find out the root cause. >> >> https://drmemory.org/ >> >> best regards, >> Ranier Vilela >> > ==1659429== LEAK SUMMARY: ==1659429== definitely lost: 0 bytes in 0 blocks ==1659429== indirectly lost: 0 bytes in 0 blocks ==1659429== possibly lost: 0 bytes in 0 blocks ==1659429== still reachable: 36,482 bytes in 121 blocks ==1659429== of which reachable via heuristic: ==1659429== stdstring : 537 bytes in 1 blocks ==1659429== suppressed: 0 bytes in 0 blocks ==1659429== ==1659429== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Re: Injection points: preloading and runtime arguments
> On 6 Aug 2024, at 12:47, Michael Paquier wrote: > > Hmm. How about loading injection_points with shared_preload_libraries > now that it has a _PG_init() thanks to 75534436a477 to take care of > the initialization you need here? We could add two hooks to request > some shmem based on a size and to do the shmem initialization. SQL initialisation is fine for test purposes. I just considered that I'd better share that doing the same from C code is non-trivial. Thanks! Best regards, Andrey Borodin.
Fix comments in instr_time.h and remove an unneeded cast to int64
Hi hackers, While working on [1], I came across what seems to be incorrect comments in instr_time.h and an unneeded cast to int64. Indeed, 03023a2664 represented time as an int64 on all platforms but forgot to update the comment related to INSTR_TIME_GET_MICROSEC() and provided an incorrect comment for INSTR_TIME_GET_NANOSEC(). Please find attached a tiny patch to correct those and, in passing, remove what I think is an unneeded cast to int64. [1]: https://www.postgresql.org/message-id/19E276C9-2C2B-435A-B275-8FA2AEB8%40gmail.com Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 1d1f8089a9eba4e87af66d7c0f23f30d52dc042c Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Tue, 6 Aug 2024 08:04:43 + Subject: [PATCH v1] Fix comments in instr_time.h and remove an unneeded cast to int64 03023a2664 represented time as an int64 on all platforms but forgot to update the comment related to INSTR_TIME_GET_MICROSEC() and provided an incorrect comment for INSTR_TIME_GET_NANOSEC(). In passing removing an unneeded cast to int64. --- src/include/portability/instr_time.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) 100.0% src/include/portability/ diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index a6fc1922f2..8f9ba2f151 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -32,9 +32,9 @@ * * INSTR_TIME_GET_MILLISEC(t) convert t to double (in milliseconds) * - * INSTR_TIME_GET_MICROSEC(t) convert t to uint64 (in microseconds) + * INSTR_TIME_GET_MICROSEC(t) get t in microseconds * - * INSTR_TIME_GET_NANOSEC(t) convert t to uint64 (in nanoseconds) + * INSTR_TIME_GET_NANOSEC(t) get t in nanoseconds * * Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert * absolute times to intervals. The INSTR_TIME_GET_xxx operations are @@ -123,7 +123,7 @@ pg_clock_gettime_ns(void) ((t) = pg_clock_gettime_ns()) #define INSTR_TIME_GET_NANOSEC(t) \ - ((int64) (t).ticks) + ((t).ticks) #else /* WIN32 */ -- 2.34.1
Re: Restart pg_usleep when interrupted
Hi, On Mon, Aug 05, 2024 at 03:07:34PM -0500, Sami Imseih wrote: > > > yeah, we already have a few macros that access the .ticks, so maybe we > > could add > > 2 new ones, say: > > > > 1. INSTR_TIME_ADD_MS(t1, msec) > > 2. INSTR_TIME_IS_GREATER(t1, t2) > > > > I think the less operations is done in the while loop the better. > > > > See v4. it includes 2 new instr_time.h macros to simplify the > code inside the while loop. Thanks! 1 === +#define INSTR_TIME_IS_GREATER(x,y) \ + ((bool) (x).ticks > (y).ticks) The surrounding parentheses are missing; it should be ((bool) ((x).ticks > (y).ticks)). I did not pay attention to it initially but found it was the culprit of breaks not occurring (while my test case produces some). That said, I don't think the cast is necessary here and that we could get rid of it. 2 === What about providing a quick comment about the 2 new macros in the header of instr_time.h (like is done for the other macros)? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
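For reference, with the parenthesization fix from point 1 applied, the two helpers could be spelled roughly as below. This is only a sketch of what the v4 macros are aiming at, assuming the Unix representation where .ticks holds nanoseconds (and assuming the NS_PER_MS constant from instr_time.h); a real patch would also need to cover the Windows tick representation:

```c
/* Sketch only: advance t1 by msec milliseconds (ticks are nanoseconds). */
#define INSTR_TIME_ADD_MS(t1, msec) \
	((t1).ticks += (int64) (msec) * NS_PER_MS)

/* Sketch only: true if t1 is strictly later than t2 (note the parentheses
 * around the whole comparison, per the review comment above). */
#define INSTR_TIME_IS_GREATER(t1, t2) \
	((t1).ticks > (t2).ticks)
```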
Re: Logical Replication of sequences
Here are some review comments for the patch v20240805_2-0003. == doc/src/sgml/catalogs.sgml nitpick - removed the word "either" == doc/src/sgml/ref/alter_subscription.sgml I felt the discussions about "how to handle warnings" are a bit scattered: e.g.1 - ALTER SUBSCRIPTION REFRESH PUBLICATION copy data referred to CREATE SUBSCRIPTION copy data. e.g.2 - ALTER SUBSCRIPTION REFRESH explains what to do, but now the explanation is in 2 places. e.g.3 - CREATE SUBSCRIPTION copy data explains what to do (again), but IMO it belongs better in the common "Notes" part FYI, I've moved all the information to one place (in the CREATE SUBSCRIPTION "Notes") and others refer to this central place. See the attached nitpicks diff. REFRESH PUBLICATION copy_data nitpick - now refers to CREATE SUBSCRIPTION "Notes". I also moved it to be nearer to the other sequence stuff. REFRESH PUBLICATION SEQUENCES: nitpick - now refers to CREATE SUBSCRIPTION "Notes". == doc/src/sgml/ref/create_subscription.sgml REFRESH PUBLICATION copy_data nitpick - now refers to CREATE SUBSCRIPTION "Notes" Notes: nitpick - the explanation of, and what to do about sequence WARNINGS, is moved to here == src/backend/commands/sequence.c pg_sequence_state: nitpick - I just moved the comment in pg_sequence_state() to below the NOTE, which talks about "page LSN". == src/backend/catalog/pg_subscription.c 1. HasSubscriptionRelations Should function 'HasSubscriptionRelations' be renamed to 'HasSubscriptionTables'? ~~~ GetSubscriptionRelations: nitpick - tweak some "skip" comments. == src/backend/commands/subscriptioncmds.c 2. CreateSubscription tables = fetch_table_list(wrconn, publications); - foreach(lc, tables) + foreach_ptr(RangeVar, rv, tables) + { + Oid relid; + + relid = RangeVarGetRelid(rv, AccessShareLock, false); + + /* Check for supported relkind. */ + CheckSubscriptionRelkind(get_rel_relkind(relid), + rv->schemaname, rv->relname); + + AddSubscriptionRelState(subid, relid, table_state, + InvalidXLogRecPtr, true); + } + + /* Add the sequences in init state */ + sequences = fetch_sequence_list(wrconn, publications); + foreach_ptr(RangeVar, rv, sequences) These 2 loops (first for tables and then for sequences) seem to be executing the same code. If you wanted you could combine the lists up-front, and then have one code loop instead of 2. It would mean less code. OTOH, maybe the current code is more readable? I am not sure what is best, so just bringing this to your attention. ~~~ AlterSubscription_refresh: nitpick = typo /indicating tha/indicating that/ ~~~ 3. fetch_sequence_list + appendStringInfoString(&cmd, "SELECT DISTINCT n.nspname, c.relname, s.seqtypid, s.seqmin, s.seqmax, s.seqstart, s.seqincrement, s.seqcycle" +" FROM pg_publication p, LATERAL pg_get_publication_sequences(p.pubname::text) gps(relid)," +" pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace JOIN pg_sequence s ON c.oid = s.seqrelid" +" WHERE c.oid = gps.relid AND p.pubname IN ("); + get_publications_str(publications, &cmd, true); + appendStringInfoChar(&cmd, ')'); Please wrap this better to make the SQL more readable. ~~ 4. 
+ if (seqform->seqtypid != seqtypid || seqform->seqmin != seqmin || + seqform->seqmax != seqmax || seqform->seqstart != seqstart || + seqform->seqincrement != seqincrement || + seqform->seqcycle != seqcycle) + ereport(WARNING, + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("Sequence option in remote and local is not same for \"%s.%s\"", +get_namespace_name(get_rel_namespace(relid)), get_rel_name(relid)), + errhint("Alter/Re-create the sequence using the same options as in remote.")); 4a. Are these really known as "options"? Or should they be called "sequence parameters", or something else, like "sequence attributes"? 4b. Is there a way to give more helpful information by identifying what was different in the log? OTOH, maybe it would become too messy if there were multiple differences... == src/backend/replication/logical/launcher.c 5. logicalrep_sync_worker_count - if (isTablesyncWorker(w) && w->subid == subid) + if ((isTableSyncWorker(w) || isSequenceSyncWorker(w)) && + w->subid == subid) You could micro-optimize this -- it may be more efficient to write the condition the other way around. SUGGESTION if (w->subid == subid && (isTableSyncWorker(w) || isSequenceSyncWorker(w))) == .../replication/logical/sequencesync.c File header comment: nitpick - there seems to be a large cut/paste mistake (the first 2 paragraphs are almost the same). nitpick - reworded with the help of Chat-GPT for slightly better clarity. Also fixed a couple of typos. nitpick - it mentioned MAX_SEQUENCES_SYNC_PER_BATCH several times so I changed the wording of one of them ~~~ fetch_remote_sequence_data: nitpick - all other params have the same name as sequence members, so change the parameter name /lsn/page_lsn/ ~ copy_sequence: nitpick - rename var /seq_lsn/seq_page_lsn/ == src/backen
Re: Support specify tablespace for each merged/split partition
On Mon, Aug 5, 2024 at 9:05 PM Junwang Zhao wrote: > > Hi Amul, > > Thanks for your review. > > On Mon, Aug 5, 2024 at 8:38 PM Amul Sul wrote: > > > > On Mon, Jul 15, 2024 at 11:19 AM Junwang Zhao wrote: > > > > >[...] > > static Relation > > -createPartitionTable(RangeVar *newPartName, Relation modelRel, > > -AlterTableUtilityContext *context) > > +createPartitionTable(RangeVar *newPartName, char *tablespacename, > > + > > > > The comment should mention the tablespace setting in the same way it > > mentions the access method. > > I'm not good at wording, can you give some advice? My suggestion is to rewrite the third paragraph as follows, but someone else might have a better version: --- The new partitions will also be created in the same tablespace as the parent if not specified. Also, this function sets the new partition access method same as parent table access methods (similarly to CREATE TABLE ... PARTITION OF). It checks that parent and child tables have compatible persistence. --- > > > > +SELECT tablename, tablespace FROM pg_tables > > + WHERE tablename IN ('t', 'tp_0_2') AND schemaname = > > 'partitions_merge_schema' > > + ORDER BY tablename, tablespace; > > + tablename |tablespace > > +---+-- > > + t | > > + tp_0_2| regress_tblspace > > +(2 rows) > > + > > +SELECT tablename, indexname, tablespace FROM pg_indexes > > + WHERE tablename IN ('t', 'tp_0_2') AND schemaname = > > 'partitions_merge_schema' > > + ORDER BY tablename, indexname, tablespace; > > + tablename | indexname | tablespace > > +---+-+ > > + t | t_pkey | > > + tp_0_2| tp_0_2_pkey | > > +(2 rows) > > + > > > > This seems problematic to me. The index should be in the same > > tablespace as the table. > > I'm not sure about this, it seems to me that partition index will alway > inherit the tablespaceId of its parent(see generateClonedIndexStmt), > do you think we should change the signature of this function? > > One thing worth mentioning is that for UNIQUE/PRIMARY KEY, > it allows setting *USING INDEX TABLESPACE tablespace_name*, > I don't think we should change the index tablespace in this case, > what do you think? > I think you are correct; my understanding is a bit hazy. > > I have added you as a reviewer, hope you don't mind. Thank you. Regards, Amul
Vectored IO in XLogWrite()
Hi hackers, I was looking into XLogWrite() and saw the below comment. It cannot really circle back in wal buffers without needing to call pg_pwrite() since the next pages wouldn't be contiguous in memory. So it needs to write whenever it reaches the end of wal buffers. /* > * Dump the set if this will be the last loop iteration, or if we are > * at the last page of the cache area (since the next page won't be > * contiguous in memory), or if we are at the end of the logfile > * segment. > */ I think that we don't have the "contiguous pages" constraint when writing anymore as we can do vectored IO. It seems unnecessary to write just because XLogWrite() is at the end of wal buffers. The attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write pages in one call even if they're not contiguous in memory, until it reaches the page at startidx. After quickly experimenting with the patch and comparing the number of write calls, the patch's effect is more visible when wal_buffers is quite low as it's more likely to circle back to the beginning. When wal_buffers is set to a decent amount, the patch only saves a few write calls. But I wouldn't expect any regression introduced by the patch (I may be wrong here), so I thought it may be worth considering. I appreciate any feedback on the proposed change. I'd also be glad to benchmark the patch if you want to see how it performs in some specific cases since I've been struggling to come up with a good test case. Regards, -- Melih Mutlu Microsoft v1-0001-Use-pg_pwritev-in-XlogWrite.patch
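To illustrate the idea (a rough sketch of the approach, not the attached patch itself, and ignoring the log-segment-boundary handling that XLogWrite() also has to respect): when the run of pages to be written wraps around the end of the WAL buffer array, both memory ranges can still be submitted in a single system call by building an iovec for pg_pwritev():

```c
/* Rough sketch, not the attached patch: gather the pages into at most two
 * iovecs -- the tail of the buffer array and, if the run wrapped around,
 * its beginning -- and issue a single pg_pwritev() call. */
struct iovec iov[2];
int			iovcnt = 0;
char	   *from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
Size		tail_len = (Size) (XLogCtl->XLogCacheBlck + 1 - startidx) * XLOG_BLCKSZ;
Size		total_len = (Size) npages * XLOG_BLCKSZ;
ssize_t		written;

if (total_len <= tail_len)
{
	/* no wraparound: one contiguous range */
	iov[iovcnt].iov_base = from;
	iov[iovcnt].iov_len = total_len;
	iovcnt++;
}
else
{
	/* wrapped: tail of the cache first, then its beginning */
	iov[iovcnt].iov_base = from;
	iov[iovcnt].iov_len = tail_len;
	iovcnt++;
	iov[iovcnt].iov_base = XLogCtl->pages;
	iov[iovcnt].iov_len = total_len - tail_len;
	iovcnt++;
}

written = pg_pwritev(openLogFile, iov, iovcnt, startoffset);
```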
Remove hardcoded hash opclass function signature exceptions
hashvalidate(), which validates the signatures of support functions for the hash AM, contains several hardcoded exceptions. For example, hash/date_ops support function 1 is hashint4(), which would ordinarily fail validation because the function argument is int4, not date. But this works internally because int4 and date are of the same size. There are several more exceptions like this that happen to work and were allowed historically but would now fail the function signature validation. AFAICT, these exceptions were just carried over from before the current index AM API and validation functions were added. The code contains comments like "For the moment, fix it by having a list of allowed cases.", so it probably wasn't meant as the ideal state. This patch removes those exceptions by providing new support functions that have the proper declared signatures. They internally share most of the C code with the "wrong" functions they replace, so the behavior is still the same. With the exceptions gone, hashvalidate() is now simplified and relies fully on check_amproc_signature(), similar to other index AMs. I'm also fixing one case where a brin opclass used hashvarlena() for bytea, even though in that case, there is no function signature validation done, so it doesn't matter that much. Not done here, but maybe hashvarlena() and hashvarlenaextended() should be removed from pg_proc.dat, since their use as opclass support functions is now dubious. They could continue to exist in the C code as internal support functions.From c154f024a58f96e8e65db5c621ae34f6d630b84b Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Tue, 6 Aug 2024 12:08:35 +0200 Subject: [PATCH] Remove hardcoded hash opclass function signature exceptions hashvalidate(), which validates the signatures of support functions for the hash AM, contained several hardcoded exceptions. For example, hash/date_ops support function 1 was hashint4(), which would ordinarily fail validation because the function argument is int4, not date. But this works internally because int4 and date are of the same size. There are several more exceptions like this that happen to work and were allowed historically but would now fail the function signature validation. This patch removes those exceptions by providing new support functions that have the proper declared signatures. They internally share most of the code with the "wrong" functions they replace, so the behavior is still the same. With the exceptions gone, hashvalidate() is now simplified and relies fully on check_amproc_signature(). TODO: Maybe hashvarlena() and hashvarlenaextended() should be removed from pg_proc.dat, since their use as opclass support functions is dubious. They can continue to exist in the C code as internal support functions. 
TODO: catversion --- src/backend/access/hash/hashfunc.c | 12 +++ src/backend/access/hash/hashvalidate.c | 129 + src/backend/utils/adt/bool.c | 13 +++ src/backend/utils/adt/date.c | 12 +++ src/backend/utils/adt/timestamp.c | 12 +++ src/backend/utils/adt/xid.c| 37 +++ src/include/catalog/pg_amproc.dat | 30 +++--- src/include/catalog/pg_proc.dat| 42 8 files changed, 169 insertions(+), 118 deletions(-) diff --git a/src/backend/access/hash/hashfunc.c b/src/backend/access/hash/hashfunc.c index 86f1adecb75..577ca470145 100644 --- a/src/backend/access/hash/hashfunc.c +++ b/src/backend/access/hash/hashfunc.c @@ -406,3 +406,15 @@ hashvarlenaextended(PG_FUNCTION_ARGS) return result; } + +Datum +hashbytea(PG_FUNCTION_ARGS) +{ + return hashvarlena(fcinfo); +} + +Datum +hashbyteaextended(PG_FUNCTION_ARGS) +{ + return hashvarlenaextended(fcinfo); +} diff --git a/src/backend/access/hash/hashvalidate.c b/src/backend/access/hash/hashvalidate.c index 40164e2ea2b..330355d4c5b 100644 --- a/src/backend/access/hash/hashvalidate.c +++ b/src/backend/access/hash/hashvalidate.c @@ -22,19 +22,13 @@ #include "catalog/pg_amproc.h" #include "catalog/pg_opclass.h" #include "catalog/pg_opfamily.h" -#include "catalog/pg_proc.h" #include "catalog/pg_type.h" -#include "parser/parse_coerce.h" #include "utils/builtins.h" -#include "utils/fmgroids.h" #include "utils/lsyscache.h" #include "utils/regproc.h" #include "utils/syscache.h" -static bool check_hash_func_signature(Oid funcid, int16 amprocnum, Oid argtype); - - /* * Validator for a hash opclass. * @@ -90,6 +84,7 @@ hashvalidate(Oid opclassoid) { HeapTuple proctup = &proclist->members[i]->tuple; Form_pg_amproc procform = (Form_pg_amproc) GETSTRUCT(proctup); + boolok; /* * All hash functions should be registered with matching left/right @@ -109,30 +104,17 @@ hashvalidate(Oid opclassoid) switch (procform->amprocnum) {
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Tue, Aug 6, 2024 at 11:18 AM Alexander Korotkov wrote: > On Tue, Aug 6, 2024 at 8:36 AM Michael Paquier wrote: > > On Tue, Aug 06, 2024 at 05:17:10AM +0300, Alexander Korotkov wrote: > > > The 0001 patch is intended to improve this situation. Actually, it's > > > not right to just put RecoveryInProgress() after > > > GetXLogReplayRecPtr(), because more wal could be replayed between > > > these calls. Instead we need to recheck GetXLogReplayRecPtr() after > > > getting negative result of RecoveryInProgress() because WAL replay > > > position couldn't get updated after. > > > 0002 patch comprises fix for the header comment of WaitLSNSetLatches() function > > > 0003 patch comprises tests for pg_wal_replay_wait() errors. > > > > Before adding more tests, could it be possible to stabilize what's in > > the tree? drongo has reported one failure with the recovery test > > 043_wal_replay_wait.pl introduced recently by 3c5db1d6b016: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-08-05%2004%3A24%3A54 > > Thank you for pointing! > Surely, I'll fix this before. Something breaks in these lines during the second iteration of the loop. "SELECT pg_current_wal_insert_lsn()" has been queried from the primary, but the standby didn't receive "CALL pg_wal_replay_wait('...');" for (my $i = 0; $i < 5; $i++) { print($i); $node_primary->safe_psql('postgres', "INSERT INTO wait_test VALUES (${i});"); my $lsn = $node_primary->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn()"); $psql_sessions[$i] = $node_standby1->background_psql('postgres'); $psql_sessions[$i]->query_until( qr/start/, qq[ \\echo start CALL pg_wal_replay_wait('${lsn}'); SELECT log_count(${i}); ]); } I wonder what it could be. Probably something hangs while launching the background psql... I'll investigate this more. -- Regards, Alexander Korotkov Supabase
Re: Support specify tablespace for each merged/split partition
On Tue, Aug 6, 2024 at 5:34 PM Amul Sul wrote: > > On Mon, Aug 5, 2024 at 9:05 PM Junwang Zhao wrote: > > > > Hi Amul, > > > > Thanks for your review. > > > > On Mon, Aug 5, 2024 at 8:38 PM Amul Sul wrote: > > > > > > On Mon, Jul 15, 2024 at 11:19 AM Junwang Zhao wrote: > > > > > > >[...] > > > static Relation > > > -createPartitionTable(RangeVar *newPartName, Relation modelRel, > > > -AlterTableUtilityContext *context) > > > +createPartitionTable(RangeVar *newPartName, char *tablespacename, > > > + > > > > > > The comment should mention the tablespace setting in the same way it > > > mentions the access method. > > > > I'm not good at wording, can you give some advice? > > My suggestion is to rewrite the third paragraph as follows, but > someone else might have a better version: > --- > The new partitions will also be created in the same tablespace as the parent > if not specified. Also, this function sets the new partition access method > same as parent table access methods (similarly to CREATE TABLE ... PARTITION > OF). It checks that parent and child tables have compatible persistence. > --- I adopted this wording with minor changes. > > > > > > +SELECT tablename, tablespace FROM pg_tables > > > + WHERE tablename IN ('t', 'tp_0_2') AND schemaname = > > > 'partitions_merge_schema' > > > + ORDER BY tablename, tablespace; > > > + tablename |tablespace > > > +---+-- > > > + t | > > > + tp_0_2| regress_tblspace > > > +(2 rows) > > > + > > > +SELECT tablename, indexname, tablespace FROM pg_indexes > > > + WHERE tablename IN ('t', 'tp_0_2') AND schemaname = > > > 'partitions_merge_schema' > > > + ORDER BY tablename, indexname, tablespace; > > > + tablename | indexname | tablespace > > > +---+-+ > > > + t | t_pkey | > > > + tp_0_2| tp_0_2_pkey | > > > +(2 rows) > > > + > > > > > > This seems problematic to me. The index should be in the same > > > tablespace as the table. > > > > I'm not sure about this, it seems to me that partition index will alway > > inherit the tablespaceId of its parent(see generateClonedIndexStmt), > > do you think we should change the signature of this function? > > > > One thing worth mentioning is that for UNIQUE/PRIMARY KEY, > > it allows setting *USING INDEX TABLESPACE tablespace_name*, > > I don't think we should change the index tablespace in this case, > > what do you think? > > > > I think you are correct; my understanding is a bit hazy. Thanks for your confirmation. The attached v2 addresses all the problems you mentioned, thanks. > > > > > I have added you as a reviewer, hope you don't mind. > > Thank you. > > Regards, > Amul -- Regards Junwang Zhao v2-0001-support-specify-tablespace-for-each-merged-split.patch v2-0002-fix-stale-comments.patch
Re: Memory growth observed with C++ application consuming libpq.dll on Windows
On Tue, Aug 6, 2024 at 5:33 AM Rajesh Kokkonda < rajeshk.kokko...@gmail.com> wrote: > I ran a trial version of a memory leak detector called Deleaker on windows > and found some modules that are listed as having leaks. I ran the program > on Linux under valgrind and I do not see any leaks reported there. I have > attached the reported leaks on windows as windows_leaks.txt and valgrind > summary report as valgrind.txt. > None of these sources are Postgres. https://www.gnu.org/software/libiconv/ https://gnuwin32.sourceforge.net/packages/libintl.htm best regards, Ranier Vilela
Re: Fix comments in instr_time.h and remove an unneeded cast to int64
On 06/08/2024 11:54, Bertrand Drouvot wrote: Hi hackers, While working on [1], I came across what seems to be incorrect comments in instr_time.h and an unneeded cast to int64. Indeed, 03023a2664 represented time as an int64 on all platforms but forgot to update the comment related to INSTR_TIME_GET_MICROSEC() and provided an incorrect comment for INSTR_TIME_GET_NANOSEC(). Please find attached a tiny patch to correct those and, in passing, remove what I think is an unneeded cast to int64. Applied, thanks! -- Heikki Linnakangas Neon (https://neon.tech)
Re: Logical Replication of sequences
On Mon, 5 Aug 2024 at 18:05, shveta malik wrote: > > On Mon, Aug 5, 2024 at 11:04 AM vignesh C wrote: > > > > On Wed, 31 Jul 2024 at 14:39, shveta malik wrote: > > > > > > On Mon, Jun 10, 2024 at 5:00 PM vignesh C wrote: > > > > > > > > On Mon, 10 Jun 2024 at 12:24, Amul Sul wrote: > > > > > > > > > > > > > > > > > > > > On Sat, Jun 8, 2024 at 6:43 PM vignesh C wrote: > > > > >> > > > > >> On Wed, 5 Jun 2024 at 14:11, Amit Kapila > > > > >> wrote: > > > > >> [...] > > > > >> A new catalog table, pg_subscription_seq, has been introduced for > > > > >> mapping subscriptions to sequences. Additionally, the sequence LSN > > > > >> (Log Sequence Number) is stored, facilitating determination of > > > > >> sequence changes occurring before or after the returned sequence > > > > >> state. > > > > > > > > > > > > > > > Can't it be done using pg_depend? It seems a bit excessive unless I'm > > > > > missing > > > > > something. > > > > > > > > We'll require the lsn because the sequence LSN informs the user that > > > > it has been synchronized up to the LSN in pg_subscription_seq. Since > > > > we are not supporting incremental sync, the user will be able to > > > > identify if he should run refresh sequences or not by checking the lsn > > > > of the pg_subscription_seq and the lsn of the sequence(using > > > > pg_sequence_state added) in the publisher. > > > > > > How the user will know from seq's lsn that he needs to run refresh. > > > lsn indicates page_lsn and thus the sequence might advance on pub > > > without changing lsn and thus lsn may look the same on subscriber even > > > though a sequence-refresh is needed. Am I missing something here? > > > > When a sequence is synchronized to the subscriber, the page LSN of the > > sequence from the publisher is also retrieved and stored in > > pg_subscriber_rel as shown below: > > --- Publisher page lsn > > publisher=# select pg_sequence_state('seq1'); > > pg_sequence_state > > > > (0/1510E38,65,1,t) > > (1 row) > > > > --- Subscriber stores the publisher's page lsn for the sequence > > subscriber=# select * from pg_subscription_rel where srrelid = 16384; > > srsubid | srrelid | srsubstate | srsublsn > > -+-++--- > >16389 | 16384 | r | 0/1510E38 > > (1 row) > > > > If changes are made to the sequence, such as performing many nextvals, > > the page LSN will be updated. Currently the sequence values are > > prefetched for SEQ_LOG_VALS 32, so the lsn will not get updated for > > the prefetched values, once the prefetched values are consumed the lsn > > will get updated. > > For example: > > --- Updated LSN on the publisher (old lsn - 0/1510E38, new lsn - 0/1558CA8) > > publisher=# select pg_sequence_state('seq1'); > > pg_sequence_state > > -- > > (0/1558CA8,143,22,t) > > (1 row) > > > > The user can then compare this updated value with the sequence's LSN > > in pg_subscription_rel to determine when to re-synchronize the > > sequence. > > Thanks for the details. But I was referring to the case where we are > in between pre-fetched values on publisher (say at 25th value), while > on subscriber we are slightly behind (say at 15th value), but page-lsn > will be the same on both. Since the subscriber is behind, a > sequence-refresh is needed on sub, but by looking at lsn (which is > same), one can not say that for sure. Let me know if I have > misunderstood it. Yes, at present, if the value is within the pre-fetched range, we cannot distinguish it solely using the page_lsn. 
However, the pg_sequence_state function also provides last_value and log_cnt, which can be used to handle these specific cases. Regards, Vignesh
Re: Optimize mul_var() for var1ndigits >= 8
On Mon, 5 Aug 2024 at 13:34, Joel Jacobson wrote: > > Noted from 5e1f3b9eb: > "While it adds some space on 32-bit machines, we aren't optimizing for that > case anymore." > > In this case, the extra 32-bit numeric_mul code seems to be 89 lines of code, > excluding comments. > To me, this seems like quite a lot, so I lean on thinking we should omit that > code for now. > We can always add it later if we get pushback. > OK, I guess that's reasonable. There is no clear-cut right answer here, but I don't really want to have a lot of 32-bit-specific code that significantly complicates this function, making it harder to maintain. Without that code, the patch becomes much simpler, which seems like a decent justification for any performance tradeoffs on 32-bit machines that are unlikely to affect many people anyway. Regards, Dean From 6c1820257997facfe8e74fac8b574c8f683bbebc Mon Sep 17 00:00:00 2001 From: Dean Rasheed Date: Thu, 18 Jul 2024 17:38:59 +0100 Subject: [PATCH v4 1/2] Extend mul_var_short() to 5 and 6-digit inputs. Commit ca481d3c9a introduced mul_var_short(), which is used by mul_var() whenever the shorter input has 1-4 NBASE digits and the exact product is requested. As speculated on in that commit, it can be extended to work for more digits in the shorter input. This commit extends it up to 6 NBASE digits (21-24 decimal digits), for which it also gives a significant speedup. To avoid excessive code bloat and duplication, refactor it a bit using macros and exploiting the fact that some portions of the code are shared between the different cases. --- src/backend/utils/adt/numeric.c | 175 ++-- 1 file changed, 123 insertions(+), 52 deletions(-) diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index d0f0923710..ca28d0e3b3 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -8720,10 +8720,10 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result, } /* - * If var1 has 1-4 digits and the exact result was requested, delegate to + * If var1 has 1-6 digits and the exact result was requested, delegate to * mul_var_short() which uses a faster direct multiplication algorithm. */ - if (var1ndigits <= 4 && rscale == var1->dscale + var2->dscale) + if (var1ndigits <= 6 && rscale == var1->dscale + var2->dscale) { mul_var_short(var1, var2, result); return; @@ -8882,7 +8882,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result, /* * mul_var_short() - * - * Special-case multiplication function used when var1 has 1-4 digits, var2 + * Special-case multiplication function used when var1 has 1-6 digits, var2 * has at least as many digits as var1, and the exact product var1 * var2 is * requested. */ @@ -8904,7 +8904,7 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2, /* Check preconditions */ Assert(var1ndigits >= 1); - Assert(var1ndigits <= 4); + Assert(var1ndigits <= 6); Assert(var2ndigits >= var1ndigits); /* @@ -8931,6 +8931,13 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2, * carry up as we go. The i'th result digit consists of the sum of the * products var1digits[i1] * var2digits[i2] for which i = i1 + i2 + 1. 
*/ +#define PRODSUM1(v1,i1,v2,i2) ((v1)[i1] * (v2)[i2]) +#define PRODSUM2(v1,i1,v2,i2) (PRODSUM1(v1,i1,v2,i2) + (v1)[i1+1] * (v2)[i2-1]) +#define PRODSUM3(v1,i1,v2,i2) (PRODSUM2(v1,i1,v2,i2) + (v1)[i1+2] * (v2)[i2-2]) +#define PRODSUM4(v1,i1,v2,i2) (PRODSUM3(v1,i1,v2,i2) + (v1)[i1+3] * (v2)[i2-3]) +#define PRODSUM5(v1,i1,v2,i2) (PRODSUM4(v1,i1,v2,i2) + (v1)[i1+4] * (v2)[i2-4]) +#define PRODSUM6(v1,i1,v2,i2) (PRODSUM5(v1,i1,v2,i2) + (v1)[i1+5] * (v2)[i2-5]) + switch (var1ndigits) { case 1: @@ -8942,9 +8949,9 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2, * -- */ carry = 0; - for (int i = res_ndigits - 2; i >= 0; i--) + for (int i = var2ndigits - 1; i >= 0; i--) { -term = (uint32) var1digits[0] * var2digits[i] + carry; +term = PRODSUM1(var1digits, 0, var2digits, i) + carry; res_digits[i + 1] = (NumericDigit) (term % NBASE); carry = term / NBASE; } @@ -8960,23 +8967,17 @@ mul_var_short(const NumericVar *var1, const NumericVar *var2, * -- */ /* last result digit and carry */ - term = (uint32) var1digits[1] * var2digits[res_ndigits - 3]; + term = PRODSUM1(var1digits, 1, var2digits, var2ndigits - 1); res_digits[res_ndigits - 1] = (NumericDigit) (term % NBASE); carry = term / NBASE; /* remaining digits, except for the first two */ - for (int i = res_ndigits - 3; i >= 1; i--) + for (int i = var2ndigits - 1; i >= 1; i--) { -term = (uint32) var1digits[0] * var2digits[i] + - (uint32) var1digits[1] * var2digits[i - 1] + carry; +term = PRODSUM2(var1digits, 0, var2digits, i) + carry; res_digits[i + 1] = (NumericDigit) (term % NBASE); carry = term / N
Thread-unsafe MD5 on big-endian systems with no OpenSSL
While browsing through all our global variables for the multithreading effort, I noticed that our MD5 implementation in src/common/md5.c uses a static buffer on big-endian systems, which makes it not thread-safe. That's a bug because that function is also used in libpq. This was introduced in commit b67b57a966, which replaced the old MD5 fallback implementation with the one from pgcrypto. The thread-safety didn't matter for pgcrypto, but for libpq it does. This only affects big-endian systems that are compiled without OpenSSL. -- Heikki Linnakangas Neon (https://neon.tech)From 5ec1ec90ba4b1828b4ecc3d6907489f879e1af12 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 6 Aug 2024 15:20:06 +0300 Subject: [PATCH 1/1] Make fallback MD5 implementation thread-safe on big-endian systems Replace a static scratch buffer with a local variable, because a static buffer makes the function not thread-safe. This function is used in client-code in libpq, so it needs to be thread-safe. It was until commit b67b57a966, which replaced the implementation with the one from pgcrypto. Backpatch to v14, where we switched to the new implementation. --- src/common/md5.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/src/common/md5.c b/src/common/md5.c index f1d47722f8..32b696a66d 100644 --- a/src/common/md5.c +++ b/src/common/md5.c @@ -150,10 +150,6 @@ static const uint8 md5_paddat[MD5_BUFLEN] = { 0, 0, 0, 0, 0, 0, 0, 0, }; -#ifdef WORDS_BIGENDIAN -static uint32 X[16]; -#endif - static void md5_calc(const uint8 *b64, pg_md5_ctx *ctx) { @@ -167,6 +163,7 @@ md5_calc(const uint8 *b64, pg_md5_ctx *ctx) #else /* 4 byte words */ /* what a brute force but fast! */ + uint32 X[16]; uint8 *y = (uint8 *) X; y[0] = b64[3]; -- 2.39.2
Re: pg_combinebackup does not detect missing files
On Fri, Aug 2, 2024 at 7:07 PM Robert Haas wrote: > > On Fri, Apr 19, 2024 at 11:47 AM Robert Haas wrote: > > > [...] > Here is a rebased version of the patch. No other changes since v1. > Here are two minor comments on this: $ pg_combinebackup /tmp/backup_full/ /tmp/backup_incr2/ /tmp/backup_incr3/ -o /tmp/backup_comb pg_combinebackup: warning: "pg_wal/00010020" is present on disk but not in the manifest This warning shouldn’t be reported, since we don’t include WAL in the backup manifest ? Also, I found that the final resultant manifest includes this WAL entry: $ head /tmp/backup_comb/backup_manifest | grep pg_wal { "Path": "pg_wal/00010020", "Size": 16777216, "Last-Modified": "2024-08-06 11:54:16 GMT" }, --- +# Set up another new database instance. force_initdb is used because +# we want it to be a separate cluster with a different system ID. +my $node2 = PostgreSQL::Test::Cluster->new('node2'); +$node2->init(force_initdb => 1, has_archiving => 1, allows_streaming => 1); +$node2->append_conf('postgresql.conf', 'summarize_wal = on'); +$node2->start; + Unused cluster-node in the test. Regards, Amul
Re: Instability with incremental backup tests (pg_combinebackup, 003_timeline.pl)
On 8/6/24 07:48, Michael Paquier wrote: > Hi all, > > dikkop has reported a failure with the regression tests of pg_combinebackup: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-08-04%2010%3A04%3A51 > > That's in the test 003_timeline.pl, from dc212340058b: > # Failed test 'incremental backup from node1' > # at t/003_timeline.pl line 43. > > The node is extremely slow, so perhaps bumping up the timeout would be > fine enough in this case (did not spend time analyzing it). I don't > think that this has been discussed, but perhaps I just missed a > reference to it and the incremental backup thread is quite large. > Yeah, it's FreeBSD running on an rpi4, from a USB flash disk, and in my experience it's much slower than rpi4 running Linux. I'm not sure why that is; I never found a way to make it faster. The machine already has: export PGCTLTIMEOUT=600 export PG_TEST_TIMEOUT_DEFAULT=600 I doubt increasing it further will do the trick. Maybe there's some other timeout that I should increase? FWIW I just moved the buildfarm stuff to a proper SSD disk (still USB, but hopefully better than the crappy flash disk). regards -- Tomas Vondra
Re: Detailed release notes
On Tue, Aug 6, 2024 at 04:30, jian he wrote: > > you can also preview the attached screenshot to see the rendered effect. > Loved, except that the commit id does not help too much, so I don't think we need it. I think a numbered link would be better. - Change functions to use a safe search_path during maintenance operations (Jeff Davis) [1, 2] Also, your patch has an extra space before the comma for the second and third links, [1 , 2] instead of [1, 2]. regards Marcos
Support multi-column renaming?
I have noticed that ALTER TABLE supports multiple column actions (ADD/DROP column etc), but does not support multiple column renaming. See [1]. Here is a small example of what I'm talking about: ``` db2=# create table tt(); CREATE TABLE -- multiple column altering ok db2=# alter table tt add column i int, add column j int; ALTER TABLE -- single column renaming ok db2=# alter table tt rename column i to i2; ALTER TABLE -- multiple column renaming not allowed db2=# alter table tt rename column i2 to i3, rename column j to j2; ERROR: syntax error at or near "," LINE 1: alter table tt rename column i2 to i3, rename column j to j2... ^ db2=# ``` Looking closely on gram.y, the only reason for this is that RenameStmt is defined less flexible than alter_table_cmds (which is a list). All other core infrastructure is ready to allow $subj. So is it worth a patch? [1] https://www.postgresql.org/docs/current/sql-altertable.html -- Best regards, Kirill Reshke
Re: Rename C23 keyword
On Tue, Aug 6, 2024 at 4:04 AM Peter Eisentraut wrote: > constexpr is a keyword in C23. Rename a conflicting identifier for > future-proofing. > > Obviously, C23 is way in the future, but this is a hard error that > prevents any further exploration. (To be clear: This only happens if > you explicitly select C23 mode. I'm not aware of a compiler where this > is the default yet.) This seems fine. -- Robert Haas EDB: http://www.enterprisedb.com
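For context, the reason this is a hard error rather than a warning: constexpr is a keyword in C23, so any pre-existing identifier with that name simply stops compiling. A minimal standalone illustration (a made-up example, not the identifier that was actually renamed):
```
/*
 * Made-up example, not the code that was renamed: this compiles as C17,
 * but fails under -std=c23 because "constexpr" is a keyword there and
 * can no longer be used as an identifier.
 */
#include <stdio.h>

int
main(void)
{
	int		constexpr = 42;		/* C23: syntax error, reserved word */

	printf("%d\n", constexpr);
	return 0;
}
```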
Re: Detailed release notes
On Tue, Aug 6, 2024 at 9:57 AM Marcos Pegoraro wrote: > Loved, except that the commit id does not help too much, so I don't think we > need it. > I think a numbered link would be better. I think the commit ID is quite useful. If you're using git, you can do "git show $COMMITID". If you're using the web, you can go to https://git.postgresql.org/pg/commitdiff/$COMMITID Big -1 for removing the commit ID. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Thread-unsafe MD5 on big-endian systems with no OpenSSL
On Tue, Aug 6, 2024 at 8:23 AM Heikki Linnakangas wrote: > While browsing through all our global variables for the multithreading > effort, I noticed that our MD5 implementation in src/common/md5.c uses a > static buffer on big-endian systems, which makes it not thread-safe. > That's a bug because that function is also used in libpq. > > This was introduced in commit b67b57a966, which replaced the old MD5 > fallback implementation with the one from pgcrypto. The thread-safety > didn't matter for pgcrypto, but for libpq it does. > > This only affects big-endian systems that are compiled without OpenSSL. LGTM. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Fix comments in instr_time.h and remove an unneeded cast to int64
Heikki Linnakangas writes: > On 06/08/2024 11:54, Bertrand Drouvot wrote: >> Please find attached a tiny patch to correct those and, in passing, remove >> what >> I think is an unneeded cast to int64. > Applied, thanks! I think this comment change is a dis-improvement. It's removed the documentation of the important fact that INSTR_TIME_GET_MICROSEC and INSTR_TIME_GET_NANOSEC return a different data type from INSTR_TIME_GET_MILLISEC (ie, integer versus float). Also, the expectation is that users of these APIs do not know the actual data type of instr_time, and instead we tell them what the output of those macros is. This patch just blew a hole in that abstraction. regards, tom lane
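To make the abstraction concrete, the intended usage pattern looks roughly like this (a sketch with made-up variable names; callers rely only on the macros and their documented result types, never on the representation of instr_time):
```
/*
 * Sketch of the intended instr_time.h usage (made-up variable names).
 * Callers never peek inside instr_time; they depend only on the documented
 * result types of the GET macros: MILLISEC is a double, while
 * MICROSEC/NANOSEC are 64-bit integers.
 */
#include "postgres.h"
#include "portability/instr_time.h"

static void
time_some_work(void)
{
	instr_time	start;
	instr_time	elapsed;

	INSTR_TIME_SET_CURRENT(start);
	/* ... the work being measured ... */
	INSTR_TIME_SET_CURRENT(elapsed);
	INSTR_TIME_SUBTRACT(elapsed, start);

	elog(DEBUG1, "took %.3f ms", INSTR_TIME_GET_MILLISEC(elapsed));
	elog(DEBUG1, "took %lld us",
		 (long long) INSTR_TIME_GET_MICROSEC(elapsed));
}
```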
Re: Incremental View Maintenance, take 2
On Thu, 11 Jul 2024 at 09:24, Yugo NAGATA wrote: > > On Tue, 2 Jul 2024 17:03:11 +0900 > Yugo NAGATA wrote: > > > On Sun, 31 Mar 2024 22:59:31 +0900 > > Yugo NAGATA wrote: > > > > > > > > Also, I added a comment on RelationIsIVM() macro persuggestion from > > > > jian he. > > > > In addition, I fixed a failure reported from cfbot on FreeBSD build > > > > caused by; > > > > > > > > WARNING: outfuncs/readfuncs failed to produce an equal rewritten > > > > parse tree > > > > > > > > This warning was raised since I missed to modify outfuncs.c for a new > > > > field. > > > > > > I found cfbot on FreeBSD still reported a failure due to > > > ENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS because the regression test used > > > wrong role names. Attached is a fixed version, v32. > > > > Attached is a rebased version, v33. > > I updated the patch to bump up the version numbers in psql and pg_dump codes > from 17 to 18. > > Regards, > Yugo Nagata > > > > > Regards, > > Yugo Nagata > > > > > > -- > > Yugo NAGATA > > > -- > Yugo NAGATA Small update with some things I found in recent days: ``` db2=# create incremental materialized view v2 as select * from v1; ERROR: VIEW or MATERIALIZED VIEW is not supported on incrementally maintainable materialized view ``` The error message is not accurate; create view v2 as select * from v1; works fine. ``` db2=# create incremental materialized view vv2 as select i,j2, i / j2 from t1 join t2 on true; db2=# insert into t2 values(1,0); ERROR: division by zero ``` It is very strange to receive `division by zero` while inserting into a relation, isn't it? Can we add some hints/CONTEXT here? Regular triggers do it: ``` db2=# insert into ttt values(10,0); ERROR: division by zero CONTEXT: PL/pgSQL function f1() line 3 at IF ``` -- Best regards, Kirill Reshke
Re: Support multi-column renaming?
On 2024-Aug-06, Kirill Reshke wrote: > I have noticed that ALTER TABLE supports multiple column actions > (ADD/DROP column etc), but does not support multiple column renaming. > Looking closely on gram.y, the only reason for this is that RenameStmt > is defined less flexible than alter_table_cmds (which is a list). All > other core infrastructure is ready to allow $subj. > > So is it worth a patch? Hmm, yeah, maybe it'd be better if ALTER TABLE RENAME is not part of RenameStmt but instead part of AlterTableStmt. Probably not a super trivial code change, but it should be doable. The interactions with different subcommands types in the same command should be considered carefully (as well as with ALTER {VIEW,SEQUENCE,etc} RENAME, which I bet we don't want changed due to the implications). -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Java is clearly an example of money oriented programming" (A. Stepanov)
Re: Fix comments in instr_time.h and remove an unneeded cast to int64
On 06/08/2024 17:20, Tom Lane wrote: Heikki Linnakangas writes: On 06/08/2024 11:54, Bertrand Drouvot wrote: Please find attached a tiny patch to correct those and, in passing, remove what I think is an unneeded cast to int64. Applied, thanks! I think this comment change is a dis-improvement. It's removed the documentation of the important fact that INSTR_TIME_GET_MICROSEC and INSTR_TIME_GET_NANOSEC return a different data type from INSTR_TIME_GET_MILLISEC (ie, integer versus float). Also, the expectation is that users of these APIs do not know the actual data type of instr_time, and instead we tell them what the output of those macros is. This patch just blew a hole in that abstraction. Hmm, ok I see. Then I propose: 1. Revert 2. Just fix the comment to say int64 instead of uint64. -- Heikki Linnakangas Neon (https://neon.tech) From 47a209a1840c9f66b584fb03a99baf56c5abe69f Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 6 Aug 2024 17:46:58 +0300 Subject: [PATCH 1/2] Revert "Fix comments in instr_time.h and remove an unneeded cast to int64" This reverts commit 3dcb09de7b. Tom Lane pointed out that it broke the abstraction provided by the macros. The callers should not need to know what the internal type is. This commit is an exact revert, the next commit will fix the comments on the macros that incorrectly claim that they return uint64. Discussion: https://www.postgresql.org/message-id/zrhkv3maqfwns...@ip-10-97-1-34.eu-west-3.compute.internal --- src/include/portability/instr_time.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 8f9ba2f151..a6fc1922f2 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -32,9 +32,9 @@ * * INSTR_TIME_GET_MILLISEC(t) convert t to double (in milliseconds) * - * INSTR_TIME_GET_MICROSEC(t) get t in microseconds + * INSTR_TIME_GET_MICROSEC(t) convert t to uint64 (in microseconds) * - * INSTR_TIME_GET_NANOSEC(t) get t in nanoseconds + * INSTR_TIME_GET_NANOSEC(t) convert t to uint64 (in nanoseconds) * * Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert * absolute times to intervals. The INSTR_TIME_GET_xxx operations are @@ -123,7 +123,7 @@ pg_clock_gettime_ns(void) ((t) = pg_clock_gettime_ns()) #define INSTR_TIME_GET_NANOSEC(t) \ - ((t).ticks) + ((int64) (t).ticks) #else /* WIN32 */ -- 2.39.2 From a78416110d0d0746f668c7d133d40d7c7f982d49 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 6 Aug 2024 17:47:21 +0300 Subject: [PATCH 2/2] Fix datatypes in comments in instr_time.h The INSTR_TIME_GET_NANOSEC(t) and INSTR_TIME_GET_MICROSEC(t) macros return a signed int64. 
Discussion: https://www.postgresql.org/message-id/zrhkv3maqfwns...@ip-10-97-1-34.eu-west-3.compute.internal --- src/include/portability/instr_time.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index a6fc1922f2..e66ecf34cd 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -32,9 +32,9 @@ * * INSTR_TIME_GET_MILLISEC(t) convert t to double (in milliseconds) * - * INSTR_TIME_GET_MICROSEC(t) convert t to uint64 (in microseconds) + * INSTR_TIME_GET_MICROSEC(t) convert t to int64 (in microseconds) * - * INSTR_TIME_GET_NANOSEC(t) convert t to uint64 (in nanoseconds) + * INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds) * * Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert * absolute times to intervals. The INSTR_TIME_GET_xxx operations are -- 2.39.2
Re: SQL:2011 application time
On Tue, Aug 6, 2024 at 10:02 AM jian he wrote: > > On Fri, Aug 2, 2024 at 1:09 AM Paul Jungwirth > wrote: > > > > On 7/25/24 08:52, Paul Jungwirth wrote: > > > Here is a patch moving the not-empty check into > > > check_exclusion_or_unique_constraint. That is a more > > > logical place for it than ExecConstraints, since WITHOUT OVERLAPS is part > > > of the index constraint > > > (not a CHECK constraint). At that point we've already looked up all the > > > information we need. So > > > there is no extra cost for non-temporal tables, and no need to change > > > pg_class or add to the > > > relcache. Also putting it there means we don't need any extra code to > > > enforce non-empties when we > > > build the index or do anything else with it. > > > > > > I think this is the nicest solution we can expect. It is even cleaner > > > than the &&& ideas. So > > > hopefully this gets us back to where we were when we decided to commit > > > PKs & FKs to v17. > > > > > > As before, I've left the nonempty check as a separate patch to make > > > reviewing easier, but when > > > committing I would squash it with the PK patch. > > > > Hello, > > > > Here is an updated set of patches, rebased because the old patches no > > longer applied. > > hi. some minor issues. in generateClonedIndexStmt index->iswithoutoverlaps = (idxrec->indisprimary || idxrec->indisunique) && idxrec->indisexclusion; this case, the index accessMethod will be "gist" only? do you think it's necessary to: index->iswithoutoverlaps = (idxrec->indisprimary || idxrec->indisunique) && idxrec->indisexclusion && strcmp(index->accessMethod, "gist") == 0); src/bin/pg_dump/pg_dump.c and src/bin/psql/describe.c should be "if (pset.sversion >= 18)"? + (This is sometimes called a + temporal key, if the column is a range of dates or timestamps, but + PostgreSQL allows ranges over any base type.) PostgreSQL should be decorated as PostgreSQL ? in DefineIndex we have: if (stmt->unique && !stmt->iswithoutoverlaps && !amRoutine->amcanunique) if (stmt->indexIncludingParams != NIL && !amRoutine->amcaninclude) if (numberOfKeyAttributes > 1 && !amRoutine->amcanmulticol) if (exclusion && amRoutine->amgettuple == NULL) maybe we can add: if (stmt->iswithoutoverlaps && strcmp(accessMethodName, "gist") != 0) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("access method \"%s\" does not support WITHOUT OVERLAPS constraints", accessMethodName))); + /* exclusionOpNames can be non-NIL if we are creating a partition */ + if (iswithoutoverlaps && exclusionOpNames == NIL) + { + indexInfo->ii_ExclusionOps = palloc_array(Oid, nkeycols); + indexInfo->ii_ExclusionProcs = palloc_array(Oid, nkeycols); + indexInfo->ii_ExclusionStrats = palloc_array(uint16, nkeycols); + } the comment is not 100% correct, i think. creating a partition, "create table like INCLUDING ALL", both will go through generateClonedIndexStmt. generateClonedIndexStmt will produce exclusionOpNames if this index supports exclusion constraint.
Re: Fix comments in instr_time.h and remove an unneeded cast to int64
Hi, On Tue, Aug 06, 2024 at 05:49:32PM +0300, Heikki Linnakangas wrote: > On 06/08/2024 17:20, Tom Lane wrote: > > Heikki Linnakangas writes: > > > On 06/08/2024 11:54, Bertrand Drouvot wrote: > > > > Please find attached a tiny patch to correct those and, in passing, > > > > remove what > > > > I think is an unneeded cast to int64. > > > > > Applied, thanks! > > > > I think this comment change is a dis-improvement. It's removed the > > documentation of the important fact that INSTR_TIME_GET_MICROSEC and > > INSTR_TIME_GET_NANOSEC return a different data type from > > INSTR_TIME_GET_MILLISEC (ie, integer versus float). Also, the > > expectation is that users of these APIs do not know the actual data > > type of instr_time, and instead we tell them what the output of those > > macros is. This patch just blew a hole in that abstraction. Oh ok, did not think about it that way, thanks for the feedback! > > Hmm, ok I see. Then I propose: > > 1. Revert > 2. Just fix the comment to say int64 instead of uint64. LGTM, thanks! Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Vacuum ERRORs out considering freezing dead tuples from before OldestXmin
On Sat, Jul 27, 2024 at 4:04 AM Melanie Plageman wrote: > > On Wed, Jul 24, 2024 at 8:19 AM John Naylor wrote: > I ran my test with your patch (on my 64-bit system, non-assert build) > and the result is great: > > master with my test (slightly modified to now use DELETE instead of > UPDATE as mentioned upthread) > 3.09s > > master with your patch applied, MWM set to 64kB and 9000 rows instead of > 80 > 1.06s Glad to hear it! > I took a look at the patch, but I can't say I know enough about the > memory allocation subsystems and how TIDStore works to meaningfully > review it -- nor enough about DSM to comment about the interactions. I tried using parallel vacuum with 64kB and it succeeded, but needed to perform an index scan for every heap page pruned. It's not hard to imagine some code moving around so that it doesn't work anymore, but since this is for testing only, it seems a warning comment is enough. > I suspect 256kB would also be fast enough to avoid my test timing out > on the buildfarm, but it is appealing to have a minimum for > maintenance_work_mem that is the same as work_mem. Agreed on both counts: I came up with a simple ctid expression to make the bitmap arrays larger: delete from test where ctid::text like '%,2__)'; With that, it still takes between 250k and 300k tuples to force a second index scan with 256kB m_w_m, default fillfactor, and without asserts. (It may need a few more pages for 32-bit but not many more) The table is around 1300 pages, where on v16 it's about 900. But with fewer tuples deleted, the WAL for deletes should be lower. So it might be comparable to v16's test. It also turns out that to support 64kB memory settings, we actually wouldn't need to change radix tree to lazily create memory contexts -- at least currently, SlabCreate doesn't allocate a keeper block, so a newly created slab context reports 0 for "mem_allocated". So I'm inclined to go ahead change the minimum m_w_m on v17 and master to 64kB. It's the quickest and (I think) most future-proof way to make this test work. Any objections?
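To illustrate the slab point above, a small self-contained sketch (context name and sizes are made up) of checking that a freshly created slab context reports zero allocated memory until the first chunk is requested:
```
/*
 * Sketch (names and sizes made up) of the behavior described above: a
 * freshly created slab context does not allocate a keeper block, so it
 * reports 0 allocated bytes until the first allocation.
 */
#include "postgres.h"
#include "utils/memutils.h"

static void
show_slab_accounting(void)
{
	MemoryContext slab = SlabContextCreate(CurrentMemoryContext,
										   "sketch slab",
										   SLAB_DEFAULT_BLOCK_SIZE,
										   64);

	/* Reports 0 here: no keeper block was allocated at creation time. */
	elog(DEBUG1, "before: %zu", MemoryContextMemAllocated(slab, false));

	(void) MemoryContextAlloc(slab, 64);

	/* Now the first block has been allocated, so this is non-zero. */
	elog(DEBUG1, "after: %zu", MemoryContextMemAllocated(slab, false));

	MemoryContextDelete(slab);
}
```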
Re: Detailed release notes
On Tue, Aug 6, 2024, at 11:02 AM, Robert Haas wrote: > On Tue, Aug 6, 2024 at 9:57 AM Marcos Pegoraro wrote: > > Loved, except that the commit id does not help too much, so I don't think > > we need it. > > I think a numbered link would be better. > > I think the commit ID is quite useful. If you're using git, you can do > "git show $COMMITID". If you're using the web, you can go to > https://git.postgresql.org/pg/commitdiff/$COMMITID > > Big -1 for removing the commit ID. Agreed. Numbers mean nothing in this context. You are searching for detailed information about the referred feature, and the commit hash is a visual cue that you will find the source code modifications for that feature. Talking about the patch, do we want to rely on an external resource? I suggest that we use a postgresql.org subdomain. It can point to https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=$COMMIT or, even better, use a rewrite rule to define a user-friendly URL (and probably a mechanism to hide the gitweb URL) like https://www.postgresql.org/commit/da4017a694d as well as the short version that we usually use on the mailing list: https://postgr.es/c/da4017a694d -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Thread-unsafe MD5 on big-endian systems with no OpenSSL
> On Aug 6, 2024, at 23:05, Robert Haas wrote: > On Tue, Aug 6, 2024 at 8:23 AM Heikki Linnakangas wrote: >> >> This only affects big-endian systems that are compiled without OpenSSL. > > LGTM. Nice catch, looks fine to me as well. -- Michael
Re: Detailed release notes
Hi, On 2024-08-06 12:02:59 -0300, Euler Taveira wrote: > Talking about the patch, do we want to rely on an external resource? I suggest > that we use a postgresql.org subdomain. It can point to > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=$COMMIT I wonder if we should make that a configurable base domain? We have a few other variables in the sgml that can optionally be set. Greetings, Andres Freund
Re: Minor refactorings to eliminate some static buffers
Here's another batch of little changes in the same vein. Mostly converting static bufs that are never modified to consts. -- Heikki Linnakangas Neon (https://neon.tech) From 6dd5a4a413212a61d9a4f5b9db73e812c8b5dcbd Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 6 Aug 2024 17:58:29 +0300 Subject: [PATCH 1/5] Turn a few 'validnsps' static variables into locals There was no need for these to be static buffers, a local variable works just as well. I think they were marked as 'static' to imply that they are read-only, but 'const' is more appropriate for that, so change them to const. To make it possible to mark the variables as 'const', also add 'const' decorations to the transformRelOptions() signature. Discussion: https://www.postgresql.org/message-id/7f86e06a-98c5-4ce3-8ec9-3885c8de0...@iki.fi --- src/backend/access/common/reloptions.c | 2 +- src/backend/commands/createas.c| 2 +- src/backend/commands/tablecmds.c | 4 ++-- src/backend/tcop/utility.c | 2 +- src/include/access/reloptions.h| 2 +- 5 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c index d6eb5d8559..49fd35bfc5 100644 --- a/src/backend/access/common/reloptions.c +++ b/src/backend/access/common/reloptions.c @@ -1154,7 +1154,7 @@ add_local_string_reloption(local_relopts *relopts, const char *name, */ Datum transformRelOptions(Datum oldOptions, List *defList, const char *namspace, - char *validnsps[], bool acceptOidsOff, bool isReset) + const char *const validnsps[], bool acceptOidsOff, bool isReset) { Datum result; ArrayBuildState *astate; diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c index c71ff80188..0b629b1f79 100644 --- a/src/backend/commands/createas.c +++ b/src/backend/commands/createas.c @@ -83,7 +83,7 @@ create_ctas_internal(List *attrList, IntoClause *into) bool is_matview; char relkind; Datum toast_options; - static char *validnsps[] = HEAP_RELOPT_NAMESPACES; + const char *const validnsps[] = HEAP_RELOPT_NAMESPACES; ObjectAddress intoRelationAddr; /* This code supports both CREATE TABLE AS and CREATE MATERIALIZED VIEW */ diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 0b2a52463f..1f94f4fdbb 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -700,7 +700,7 @@ DefineRelation(CreateStmt *stmt, char relkind, Oid ownerId, ListCell *listptr; AttrNumber attnum; bool partitioned; - static char *validnsps[] = HEAP_RELOPT_NAMESPACES; + const char *const validnsps[] = HEAP_RELOPT_NAMESPACES; Oid ofTypeId; ObjectAddress address; LOCKMODE parentLockmode; @@ -14897,7 +14897,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation, Datum repl_val[Natts_pg_class]; bool repl_null[Natts_pg_class]; bool repl_repl[Natts_pg_class]; - static char *validnsps[] = HEAP_RELOPT_NAMESPACES; + const char *const validnsps[] = HEAP_RELOPT_NAMESPACES; if (defList == NIL && operation != AT_ReplaceRelOptions) return; /* nothing to do */ diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 702a6c3a0b..2732d6bfc9 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -1155,7 +1155,7 @@ ProcessUtilitySlow(ParseState *pstate, { CreateStmt *cstmt = (CreateStmt *) stmt; Datum toast_options; - static char *validnsps[] = HEAP_RELOPT_NAMESPACES; + const char *validnsps[] = HEAP_RELOPT_NAMESPACES; /* Remember transformed RangeVar for LIKE */ table_rv = cstmt->relation; diff --git 
a/src/include/access/reloptions.h b/src/include/access/reloptions.h index 81829b8270..df6923c9d5 100644 --- a/src/include/access/reloptions.h +++ b/src/include/access/reloptions.h @@ -220,7 +220,7 @@ extern void add_local_string_reloption(local_relopts *relopts, const char *name, fill_string_relopt filler, int offset); extern Datum transformRelOptions(Datum oldOptions, List *defList, - const char *namspace, char *validnsps[], + const char *namspace, const char *const validnsps[], bool acceptOidsOff, bool isReset); extern List *untransformRelOptions(Datum options); extern bytea *extractRelOptions(HeapTuple tuple, TupleDesc tupdesc, -- 2.39.2 From f108ae4c2ddfa6ca77e9736dc3fb20e6bda7b67c Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 6 Aug 2024 17:59:33 +0300 Subject: [PATCH 2/5] Make nullSemAction const, add 'const' decorators to related functions To make it more clear that these should never be modified. Discussion: https://www.postgresql.org/message-id/7f86e06a-98c5-4ce3-8ec9-3885c8de0...@iki.fi --- src/backend/utils/adt/jsonfuncs.c | 2 +- src/common/jsonapi.c | 26 +-- src/include/common/jsonapi.h | 6 ++--- src/include/utils/json
Re: Fix comments in instr_time.h and remove an unneeded cast to int64
Heikki Linnakangas writes: > Hmm, ok I see. Then I propose: > 1. Revert > 2. Just fix the comment to say int64 instead of uint64. Yeah, it's probably reasonable to specify the output as int64 not uint64 (especially since it looks like that's what the macros actually produce). regards, tom lane
Re: Minor refactorings to eliminate some static buffers
Hi, On 2024-08-06 18:13:56 +0300, Heikki Linnakangas wrote: > From 6dd5a4a413212a61d9a4f5b9db73e812c8b5dcbd Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Tue, 6 Aug 2024 17:58:29 +0300 > Subject: [PATCH 1/5] Turn a few 'validnsps' static variables into locals > > There was no need for these to be static buffers, a local variable > works just as well. I think they were marked as 'static' to imply that > they are read-only, but 'const' is more appropriate for that, so > change them to const. I looked at these at some point in the past. Unfortunately compilers don't always generate better code with const than static :( (the static initialization can be done once in a global var, the const one not necessarily). Arguably what you'd want here is both. I doubt that matters here though. > diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c > index 702a6c3a0b..2732d6bfc9 100644 > --- a/src/backend/tcop/utility.c > +++ b/src/backend/tcop/utility.c > @@ -1155,7 +1155,7 @@ ProcessUtilitySlow(ParseState *pstate, > { > CreateStmt *cstmt = > (CreateStmt *) stmt; > Datum > toast_options; > - static char > *validnsps[] = HEAP_RELOPT_NAMESPACES; > + const char *validnsps[] > = HEAP_RELOPT_NAMESPACES; > > /* Remember transformed > RangeVar for LIKE */ > table_rv = > cstmt->relation; In the other places you used "const char * const", here just "const char *" - it doesn't look like that's a required difference? > From f108ae4c2ddfa6ca77e9736dc3fb20e6bda7b67c Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Tue, 6 Aug 2024 17:59:33 +0300 > Subject: [PATCH 2/5] Make nullSemAction const, add 'const' decorators to > related functions > To make it more clear that these should never be modified. Yep - and it reduces the size of writable mappings to boot. LGTM. > From da6f101b0ecc2d4e4b33bbcae316dbaf72e67d14 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Tue, 6 Aug 2024 17:59:45 +0300 > Subject: [PATCH 3/5] Mark misc static global variables as const LGTM > From 5d562f15aaba0bb082e714e844995705f0ca1368 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Tue, 6 Aug 2024 17:59:52 +0300 > Subject: [PATCH 4/5] Constify fields and parameters in spell.c > > I started by marking VoidString as const, and fixing the fallout by > marking more fields and function arguments as const. It proliferated > quite a lot, but all within spell.c and spell.h. > > A more narrow patch to get rid of the static VoidString buffer would > be to replace it with '#define VoidString ""', as C99 allows assigning > "" to a non-const pointer, even though you're not allowed to modify > it. But it seems like good hygiene to mark all these as const. In the > structs, the pointers can point to the constant VoidString, or a > buffer allocated with palloc(), or with compact_palloc(), so you > should not modify them. Looks reasonable to me. > From bb66efccf4f97d0001b730a1376845c0a19c7f27 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas > Date: Tue, 6 Aug 2024 18:00:01 +0300 > Subject: [PATCH 5/5] Use psprintf to simplify gtsvectorout() > > The buffer allocation was correct, but looked archaic and scary: > > - It was weird to calculate the buffer size before determining which > format string was used. With the same effort, we could've used the > right-sized buffer for each branch. > > - Commit aa0d3504560 added one more possible return string ("all true > bits"), but didn't adjust the code at the top of the function to > calculate the returned string's max size. 
It was not a live bug, > because the new string was smaller than the existing ones, but > seemed wrong in principle. > > - Use of sprintf() is generally eyebrow-raising these days > > Switch to psprintf(). psprintf() allocates a larger buffer than what > was allocated before, 128 bytes vs 80 bytes, which is acceptable as > this code is not performance or space critical. I find the undocumented EXTRALEN the most confusing :) LGTM. Greetings, Andres Freund
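To spell out the "both" variant mentioned above (static storage duration so the array only has to be materialized once, plus const qualifiers so it is read-only), here is a tiny standalone sketch, using a stand-in initializer rather than the real HEAP_RELOPT_NAMESPACES:
```
#include <stdio.h>

/*
 * Sketch of the "both" variant: static storage duration combined with
 * const qualifiers.  A stand-in initializer is used here instead of
 * HEAP_RELOPT_NAMESPACES.
 */
static void
print_valid_namespaces(void)
{
	static const char *const validnsps[] = {"toast", NULL};

	for (int i = 0; validnsps[i] != NULL; i++)
		printf("namespace: %s\n", validnsps[i]);
}

int
main(void)
{
	print_valid_namespaces();
	return 0;
}
```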
Re: Support multi-column renaming?
Hi, On 2024-08-06 10:45:48 -0400, Alvaro Herrera wrote: > On 2024-Aug-06, Kirill Reshke wrote: > > > I have noticed that ALTER TABLE supports multiple column actions > > (ADD/DROP column etc), but does not support multiple column renaming. > > > Looking closely on gram.y, the only reason for this is that RenameStmt > > is defined less flexible than alter_table_cmds (which is a list). All > > other core infrastructure is ready to allow $subj. > > > > So is it worth a patch? > > Hmm, yeah, maybe it'd be better if ALTER TABLE RENAME is not part of > RenameStmt but instead part of AlterTableStmt. Probably not a super > trivial code change, but it should be doable. The interactions with > different subcommands types in the same command should be considered > carefully (as well as with ALTER {VIEW,SEQUENCE,etc} RENAME, which I bet > we don't want changed due to the implications). I think you'd likely run into grammar ambiguity issues if you did it naively. I think I looked at something related at some point in the past and concluded that to avoid bison getting confused (afaict the grammar is still LALR(1), it's just that bison doesn't merge states well enough), we'd have to invent a RENAME_TO_P and inject that "manually" in base_yylex(). IIRC introducing RENAME_TO_P (as well as SET_SCHEMA_P, OWNER TO) did actually result in a decent size reduction of the grammar. Greetings, Andres Freund
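For anyone not familiar with that trick, the idea is a lookahead filter between the raw lexer and bison: when the filter sees RENAME immediately followed by TO, it hands the grammar a single merged token. A toy, self-contained sketch of the mechanism (token names and the canned token stream are made up; this is not the actual parser code):
```
/*
 * Toy illustration of the lookahead-merge trick: a filter between the raw
 * lexer and the grammar that replaces the two-token sequence RENAME TO
 * with a single RENAME_TO token.  Everything here is made up; it is not
 * the PostgreSQL implementation.
 */
#include <stdio.h>

enum token { END = 0, RENAME, TO, IDENT, RENAME_TO };

/* A canned token stream standing in for the real lexer. */
static const enum token stream[] = {RENAME, TO, IDENT, END};

static enum token
raw_lex(void)
{
	static int	pos = 0;

	return stream[pos++];
}

/* The filter the grammar would actually call. */
static enum token
filtered_lex(void)
{
	static enum token lookahead = END;
	static int	have_lookahead = 0;
	enum token	cur;

	if (have_lookahead)
	{
		have_lookahead = 0;
		return lookahead;
	}

	cur = raw_lex();
	if (cur == RENAME)
	{
		lookahead = raw_lex();
		if (lookahead == TO)
			return RENAME_TO;	/* merge the pair into one token */
		have_lookahead = 1;		/* otherwise hand the peeked token back later */
	}
	return cur;
}

int
main(void)
{
	enum token	t;

	while ((t = filtered_lex()) != END)
		printf("token %d\n", (int) t);
	return 0;
}
```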
Re: remove volatile qualifiers from pg_stat_statements
On Tue, Aug 06, 2024 at 04:04:01PM +0900, Michael Paquier wrote: > On Wed, Jul 31, 2024 at 07:01:38AM +, Bertrand Drouvot wrote: >> I share the same understanding and I think those can be removed. >> >> The patch LGTM. > > That sounds about right. All the volatile references we have here > have been kept under the assumption that a memory barrier is required. > As we hold spin locks in these areas, that should not be necessary > anyway. So LGTM as well. Committed, thanks. -- nathan
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Thu, 1 Aug 2024 23:41:18 +0500 Kirill Reshke wrote: > On Thu, 1 Aug 2024 at 23:27, Jeff Davis wrote: > > Relatedly, if we can EXPLAIN a CREATE MATERIALIZED VIEW, perhaps we > > should be able to EXPLAIN a REFRESH MATERIALIZED VIEW, too? > Sure REFRESH MATERIALIZED VIEW consists not only of the view query execution in refresh_matview_datafill but also of refresh_by_heap_swap or refresh_by_match_merge. The former doesn't execute any query, so it would not be a problem, but the latter executes additional queries including SELECT, some DDL, DELETE, and INSERT. If we made EXPLAIN support the REFRESH MATERIALIZED VIEW CONCURRENTLY command, how should we handle these additional queries? Regards, Yugo Nagata -- Yugo Nagata
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Tue, 6 Aug 2024 at 21:06, Yugo Nagata wrote: > > On Thu, 1 Aug 2024 23:41:18 +0500 > Kirill Reshke wrote: > > > On Thu, 1 Aug 2024 at 23:27, Jeff Davis wrote: > > > Relatedly, if we can EXPLAIN a CREATE MATERIALIZED VIEW, perhaps we > > > should be able to EXPLAIN a REFRESH MATERIALIZED VIEW, too? > > Sure > > REFRESH MATERIALIZED VIEW consists not only of the view query > execution in refresh_matview_datafill but also of refresh_by_heap_swap > or refresh_by_match_merge. The former doesn't execute any query, so > it would not be a problem, but the latter executes additional queries > including SELECT, some DDL, DELETE, and INSERT. > > If we made EXPLAIN support the REFRESH MATERIALIZED VIEW CONCURRENTLY > command, how should we handle these additional queries? > > Regards, > Yugo Nagata > > -- > Yugo Nagata Hmm, is it a big issue? Maybe we can just add them in the proper places of the output, the same way we do it with triggers? -- Best regards, Kirill Reshke
Re: [PATCH] Add crc32(text) & crc32(bytea)
On Tue, Aug 06, 2024 at 11:04:41AM +0300, Aleksander Alekseev wrote: > Perhaps we need get_int4() / get_int8() / get_numeric() as there seems > to be a demand [1][2] and it will allow us to easily cast a `bytea` > value to `integer` or `bigint`. This is probably another topic though. Yeah, I was surprised to learn there wasn't yet an easy way to do this. I'm not sure how much of a factor this should play in deciding the return value for the CRC functions, but IMHO it's a reason to reconsider returning text as you originally proposed. -- nathan
Re: pg_verifybackup: TAR format backup verification
On Thu, Aug 1, 2024 at 9:19 AM Amul Sul wrote: > > I think I would have made this pass context->show_progress to > > progress_report() instead of the whole verifier_context, but that's an > > arguable stylistic choice, so I'll defer to you if you prefer it the > > way you have it. Other than that, this LGTM. > > Additionally, I moved total_size and done_size to verifier_context > because done_size needs to be accessed in astreamer_verify.c. > With this change, verifier_context is now more suitable. But it seems like 0006 now changes the logic for computing total_size. Prepatch, the condition is: - if (context->show_progress && !context->skip_checksums && - should_verify_checksum(m)) - context->total_size += m->size; where should_verify_checksum(m) checks (((m)->matched) && !((m)->bad) && (((m)->checksum_type) != CHECKSUM_TYPE_NONE)). But post-patch the condition is: + if (!context.skip_checksums) ... + if (!should_ignore_relpath(context, m->pathname)) + total_size += m->size; The old logic was reached from verify_backup_directory() which does check should_ignore_relpath(), so the new condition hasn't added anything. But it seems to have lost the show_progress condition, and the m->checksum_type != CHECKSUM_TYPE_NONE condition. I think this means that we'll sum the sizes even when not displaying progress, and that if some of the files in the manifest had no checksums, our progress reporting would compute wrong percentages after the patch. > Understood. At the start of working on the v3 review, I thought of > completely discarding the 0007 patch and copying most of > verify_file_checksum() to a new function in astreamer_verify.c. > However, I later realized we could deduplicate some parts, so I split > verify_file_checksum() and moved the reusable part to a separate > function. Please have a look at v4-0007. Yeah, that seems OK. The fact that these patches don't have commit messages is making life more difficult for me than it needs to be. In particular, I'm looking at 0009 and there's no hint about why you want to do this. In fact that's the case for all of these refactoring patches. Instead of saying something like "tar format verification will want to verify the control file, but will not be able to read the file directly from disk, so separate the logic that reads the control file from the logic that verifies it" you just say what code you moved. Then I have to guess why you moved it, or flip back and forth between the refactoring patch and 0011 to try to figure it out. It would be nice if each of these refactoring patches contained a clear indication about the purpose of the refactoring in the commit message. > I had the same thought about checking for NULL inside > should_verify_control_data(), but I wanted to maintain the structure > similar to should_verify_checksum(). Making this change would have > also required altering should_verify_checksum(), I wasn’t sure if I > should make that change before. Now, I did that in the attached > version -- 0008 patch. I believe there is no reason for this change to be part of 0008 at all, and that this should be part of whatever later patch needs it. > > Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also. > > Done. OK, the formatting of 0011 looks much better now. It seems to me that 0011 is arranging to palloc the checksum context for every file and then pfree it at the end. 
It seems like it would be considerably more efficient if astreamer_verify contained a pg_checksum_context instead of a pointer to a pg_checksum_context. If you need a flag to indicate whether we've reinitialized the checksum for the current file, it's better to add that than to have all of these unnecessary allocate/free cycles. Existing astreamer code uses struct member names_like_this. For the new one, you mostly used namesLikeThis except when you used names_like_this or namesLkThs. It seems to me that instead of adding a global variable verify_backup_file_cb, it would be better to move the 'format' variable into verifier_context. Then you can do something like if (context->format == 'p') verify_plain_backup_file() else verify_tar_backup_file(). It's pretty common for .tar.gz to be abbreviated to .tgz. I think we should support that. Let's suppose that I have a backup which, for some reason, does not use the same compression for all files (base.tar, 16384.tgz, 16385.tar.gz, 16366.tar.lz4). With this patch, that will fail. Now, that's not really a problem, because having a backup with mixed compression algorithms like that is strange and you probably wouldn't try to do it. But on the other hand, it looks to me like making the code support that would be more elegant than what you have now. Because, right now, you have code to detect what type of backup you've got by looking at base.WHATEVER_EXTENSION ... but then you have to also have code that complains if some later file d
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Thu, 01 Aug 2024 13:34:51 -0700 Jeff Davis wrote: > On Fri, 2024-08-02 at 00:13 +0500, Kirill Reshke wrote: > > On Thu, 1 Aug 2024 at 23:27, Jeff Davis wrote: > > > > > > EXPLAIN ANALYZE CREATE MATERIALIZED VIEW doesn't go through > > > ExecCreateTableAs(), but does use CreateIntoRelDestReceiver(). > > > > EXPLAIN ANALYZE and regular query goes through create_ctas_internal > > (WITH NO DATA case too). Maybe we can simply move > > SetUserIdAndSecContext call in this function? > > We need to set up the SECURITY_RESTRICTED_OPERATION before planning, in > case the planner executes some functions. > > I believe we need to do some more refactoring to make this work. In > version 17, I already refactored so that CREATE MATERIALIZED VIEW and > REFRESH MATERIALIZED VIEW share the code. We can do something similar > to extend that to EXPLAIN ... CREATE MATERIALIZED VIEW. I think the most simple way to fix this is to set up the userid and the the SECURITY_RESTRICTED_OPERATION, and call RestrictSearchPath() before pg_plan_query in ExplainOneQuery(), as in the attached patch. However, this is similar to the old code of ExecCreateTableAs() before refactored. If we want to reuse REFRESH even in EXPLAIN as similar as the current CREATE MATERIALIZED code, I think we would have to refactor RefereshMatViewByOid to receive instrument_option and eflags as arguments and call it in ExplainOnePlan, for example. Do you think that is more preferred than setting up SECURITY_RESTRICTED_OPERATION in ExplainOneQuery? > As for the August release, the code freeze is on Saturday. Perhaps it > can be done by then, but is there a reason we should rush it? This > affects all supported versions, so we've lived with it for a while, and > I don't see a security problem here. I wouldn't expect it to be a > common use case, either. I agree that we don't have to rush it since it is not a security bug but just it could make a materialized view that cannot be refreshed. Regards, Yugo Nagata -- Yugo Nagata diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 5771aabf40..f8beb6f6a6 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -22,6 +22,7 @@ #include "jit/jit.h" #include "libpq/pqformat.h" #include "libpq/protocol.h" +#include "miscadmin.h" #include "nodes/extensible.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" @@ -467,6 +468,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions, MemoryContext planner_ctx = NULL; MemoryContext saved_ctx = NULL; + Oid save_userid = InvalidOid; + int save_sec_context = 0; + int save_nestlevel = 0; + if (es->memory) { /* @@ -487,6 +492,20 @@ standard_ExplainOneQuery(Query *query, int cursorOptions, bufusage_start = pgBufferUsage; INSTR_TIME_SET_CURRENT(planstart); + /* + * For CREATE MATERIALIZED VIEW command, switch to the owner's userid, so + * that any functions are run as that user. Also lock down security-restricted + * operations and arrange to make GUC variable changes local to this command. + */ + if (into && into->viewQuery) + { + GetUserIdAndSecContext(&save_userid, &save_sec_context); + SetUserIdAndSecContext(save_userid, + save_sec_context | SECURITY_RESTRICTED_OPERATION); + save_nestlevel = NewGUCNestLevel(); + RestrictSearchPath(); + } + /* plan the query */ plan = pg_plan_query(query, queryString, cursorOptions, params); @@ -510,6 +529,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions, ExplainOnePlan(plan, into, es, queryString, params, queryEnv, &planduration, (es->buffers ? 
&bufusage : NULL), es->memory ? &mem_counters : NULL); + + /* CREATE MATERIALIZED VIEW command */ + if (into && into->viewQuery) + { + /* Roll back any GUC changes */ + AtEOXact_GUC(false, save_nestlevel); + + /* Restore userid and security context */ + SetUserIdAndSecContext(save_userid, save_sec_context); + } } /*
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Tue, 6 Aug 2024 at 22:13, Yugo Nagata wrote: > > On Thu, 01 Aug 2024 13:34:51 -0700 > Jeff Davis wrote: > > > On Fri, 2024-08-02 at 00:13 +0500, Kirill Reshke wrote: > > > On Thu, 1 Aug 2024 at 23:27, Jeff Davis wrote: > > > > > > > > EXPLAIN ANALYZE CREATE MATERIALIZED VIEW doesn't go through > > > > ExecCreateTableAs(), but does use CreateIntoRelDestReceiver(). > > > > > > EXPLAIN ANALYZE and regular query goes through create_ctas_internal > > > (WITH NO DATA case too). Maybe we can simply move > > > SetUserIdAndSecContext call in this function? > > > > We need to set up the SECURITY_RESTRICTED_OPERATION before planning, in > > case the planner executes some functions. > > > > I believe we need to do some more refactoring to make this work. In > > version 17, I already refactored so that CREATE MATERIALIZED VIEW and > > REFRESH MATERIALIZED VIEW share the code. We can do something similar > > to extend that to EXPLAIN ... CREATE MATERIALIZED VIEW. > > I think the most simple way to fix this is to set up the userid and the > the SECURITY_RESTRICTED_OPERATION, and call RestrictSearchPath() > before pg_plan_query in ExplainOneQuery(), as in the attached patch. > > However, this is similar to the old code of ExecCreateTableAs() before > refactored. If we want to reuse REFRESH even in EXPLAIN as similar as the > current CREATE MATERIALIZED code, I think we would have to refactor > RefereshMatViewByOid to receive instrument_option and eflags as arguments > and call it in ExplainOnePlan, for example. > > Do you think that is more preferred than setting up > SECURITY_RESTRICTED_OPERATION in ExplainOneQuery? > > > As for the August release, the code freeze is on Saturday. Perhaps it > > can be done by then, but is there a reason we should rush it? This > > affects all supported versions, so we've lived with it for a while, and > > I don't see a security problem here. I wouldn't expect it to be a > > common use case, either. > > I agree that we don't have to rush it since it is not a security bug > but just it could make a materialized view that cannot be refreshed. > > Regards, > Yugo Nagata > > -- > Yugo Nagata > + /* > + * For CREATE MATERIALIZED VIEW command, switch to the owner's userid, so > + * that any functions are run as that user. Also lock down > security-restricted > + * operations and arrange to make GUC variable changes local to this command. > + */ > + if (into && into->viewQuery) > + { > + GetUserIdAndSecContext(&save_userid, &save_sec_context); > + SetUserIdAndSecContext(save_userid, > + save_sec_context | SECURITY_RESTRICTED_OPERATION); > + save_nestlevel = NewGUCNestLevel(); > + RestrictSearchPath(); > + } > + In general, doing this ad-hoc coding for MV inside standard_ExplainOneQuery function looks just out of place for me. standard_ExplainOneQuery is on another level of abstraction. -- Best regards, Kirill Reshke
Re: Restart pg_usleep when interrupted
v5-0001-vaccum_delay-with-absolute-time-nanosleep.patch Description: Binary data v5 takes care of the latest comments by Bertrand. Regards, Sami
Re: Vectored IO in XLogWrite()
On Tue, Aug 6, 2024 at 5:36 AM Melih Mutlu wrote: > I think that we don't have the "contiguous pages" constraint when writing > anymore as we can do vectored IO. It seems unnecessary to write just because > XLogWrite() is at the end of wal buffers. > Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to write > pages in one call even if they're not contiguous in memory, until it reaches > the page at startidx. Here are a few notes on this patch: - It's not pgindent-clean. In fact, it doesn't even pass git diff --check. - You added a new comment (/* Reaching the buffer... */) in the middle of a chunk of lines that were already covered by an existing comment (/* Dump the set ... */). This makes it look like the /* Dump the set... */ comment only covers the 3 lines of code that immediately follow it rather than everything in the "if" statement. You could fix this in a variety of ways, but in this case the easiest solution, to me, looks like just skipping the new comment. It seems like the point is pretty self-explanatory. - The patch removes the initialization of "from" but not the variable itself. You still increment the variable you haven't initialized. - I don't think the logic is correct after a partial write. Pre-patch, "from" advances while "nleft" goes down, but post-patch, what gets written is dependent on the contents of "iov" which is initialized outside the loop and never updated. Perhaps compute_remaining_iovec would be useful here? - I assume this is probably a good idea in principle, because fewer system calls are presumably better than more. The impact will probably be very small, though. -- Robert Haas EDB: http://www.enterprisedb.com
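For reference, the usual shape of a partial-write loop built on compute_remaining_iovec() looks roughly like this (an untested sketch modeled on pg_pwritev_with_retry() in src/common/file_utils.c; the fd/offset handling is simplified):
```
/*
 * Untested sketch of resuming after a partial write with
 * compute_remaining_iovec(), along the lines of pg_pwritev_with_retry().
 * The fd/offset variables are placeholders.
 */
#include "postgres.h"
#include "port/pg_iovec.h"
#include "common/file_utils.h"

static void
write_all(int fd, const struct iovec *source, int iovcnt, off_t offset)
{
	struct iovec iov_copy[PG_IOV_MAX];
	int			remaining;

	/* Copy the caller's iovec array so we can trim it as we go. */
	remaining = compute_remaining_iovec(iov_copy, source, iovcnt, 0);

	while (remaining > 0)
	{
		ssize_t		written = pg_pwritev(fd, iov_copy, remaining, offset);

		if (written < 0)
			elog(ERROR, "could not write: %m");

		offset += written;
		/* Drop fully written iovecs and trim the partially written one. */
		remaining = compute_remaining_iovec(iov_copy, iov_copy,
											remaining, written);
	}
}
```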
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Wed, 7 Aug 2024 02:13:04 +0900 Yugo Nagata wrote: > On Thu, 01 Aug 2024 13:34:51 -0700 > Jeff Davis wrote: > > > On Fri, 2024-08-02 at 00:13 +0500, Kirill Reshke wrote: > > > On Thu, 1 Aug 2024 at 23:27, Jeff Davis wrote: > > > > > > > > EXPLAIN ANALYZE CREATE MATERIALIZED VIEW doesn't go through > > > > ExecCreateTableAs(), but does use CreateIntoRelDestReceiver(). > > > > > > EXPLAIN ANALYZE and regular query goes through create_ctas_internal > > > (WITH NO DATA case too). Maybe we can simply move > > > SetUserIdAndSecContext call in this function? > > > > We need to set up the SECURITY_RESTRICTED_OPERATION before planning, in > > case the planner executes some functions. > > > > I believe we need to do some more refactoring to make this work. In > > version 17, I already refactored so that CREATE MATERIALIZED VIEW and > > REFRESH MATERIALIZED VIEW share the code. We can do something similar > > to extend that to EXPLAIN ... CREATE MATERIALIZED VIEW. > > I think the most simple way to fix this is to set up the userid and the > the SECURITY_RESTRICTED_OPERATION, and call RestrictSearchPath() > before pg_plan_query in ExplainOneQuery(), as in the attached patch. I'm sorry. After sending the mail, I found the patch didn't work. If we call RestrictSearchPath() before creating a relation, it tries to create the relation in pg_catalog and it causes an error. Perhaps, we would need to create the rel before RestrictSearchPath() by calling create_ctas_nodata or something like this... > > However, this is similar to the old code of ExecCreateTableAs() before > refactored. If we want to reuse REFRESH even in EXPLAIN as similar as the > current CREATE MATERIALIZED code, I think we would have to refactor > RefereshMatViewByOid to receive instrument_option and eflags as arguments > and call it in ExplainOnePlan, for example. > > Do you think that is more preferred than setting up > SECURITY_RESTRICTED_OPERATION in ExplainOneQuery? > > > As for the August release, the code freeze is on Saturday. Perhaps it > > can be done by then, but is there a reason we should rush it? This > > affects all supported versions, so we've lived with it for a while, and > > I don't see a security problem here. I wouldn't expect it to be a > > common use case, either. > > I agree that we don't have to rush it since it is not a security bug > but just it could make a materialized view that cannot be refreshed. > > Regards, > Yugo Nagata > > -- > Yugo Nagata -- Yugo Nagata
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Wed, 2024-08-07 at 03:06 +0900, Yugo Nagata wrote: > I'm sorry. After sending the mail, I found the patch didn't work. > If we call RestrictSearchPath() before creating a relation, it tries > to create the relation in pg_catalog and it causes an error. Yeah, that's exactly the problem I ran into in ExecCreateTableAs(), which was the reason I refactored it to use RefreshMatViewByOid(). > Perhaps, we would need to create the rel before RestrictSearchPath() > by calling > create_ctas_nodata or something like this... I haven't looked at the details, but that makes sense: break it into two parts so that the create (without data) happens first without doing the work of EXPLAIN, and the second part refreshes using the explain logic. That means RefreshMatViewByOid() would need to work with EXPLAIN. As you point out in the other email, it's not easy to make that all work with REFRESH ... CONCURRENTLY, but perhaps it could work with CREATE MATERIALIZED VIEW and REFRESH (without CONCURRENTLY). Regards, Jeff Davis
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
Jeff Davis writes: > As you point out in the other email, it's not easy to make that all > work with REFRESH ... CONCURRENTLY, but perhaps it could work with > CREATE MATERIALIZED VIEW and REFRESH (without CONCURRENTLY). I'm not really sure I see the point of this, if it doesn't "just work" with all variants of C.M.V. It's not like you can't easily EXPLAIN the view's SELECT. If REFRESH M. V. does something different than CREATE, there would certainly be value in being able to EXPLAIN what that does --- but that still isn't an argument for allowing EXPLAIN CREATE MATERIALIZED VIEW. regards, tom lane
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Tue, 06 Aug 2024 11:24:03 -0700 Jeff Davis wrote: > On Wed, 2024-08-07 at 03:06 +0900, Yugo Nagata wrote: > > I'm sorry. After sending the mail, I found the patch didn't work. > > If we call RestrictSearchPath() before creating a relation, it tries > > to create the relation in pg_catalog and it causes an error. > > Yeah, that's exactly the problem I ran into in ExecCreateTableAs(), > which was the reason I refactored it to use RefreshMatViewByOid(). I came up with an idea that we can rewrite into->rel to add the current schemaname before calling RestrictSearchPath() as in the attached patch. It seems a simpler fix at least, but I am not sure whether this design is better than using RefreshMatViewByOid from EXPLAIN considering to support EXPLAIN REFRESH ... Regards, Yugo Nagata -- Yugo NAGATA diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index 5771aabf40..1d06a7e96e 100644 --- a/src/backend/commands/explain.c +++ b/src/backend/commands/explain.c @@ -15,6 +15,7 @@ #include "access/xact.h" #include "catalog/pg_type.h" +#include "catalog/namespace.h" #include "commands/createas.h" #include "commands/defrem.h" #include "commands/prepare.h" @@ -22,6 +23,7 @@ #include "jit/jit.h" #include "libpq/pqformat.h" #include "libpq/protocol.h" +#include "miscadmin.h" #include "nodes/extensible.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" @@ -467,6 +469,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions, MemoryContext planner_ctx = NULL; MemoryContext saved_ctx = NULL; + Oid save_userid = InvalidOid; + int save_sec_context = 0; + int save_nestlevel = 0; + if (es->memory) { /* @@ -487,6 +493,24 @@ standard_ExplainOneQuery(Query *query, int cursorOptions, bufusage_start = pgBufferUsage; INSTR_TIME_SET_CURRENT(planstart); + /* + * For CREATE MATERIALIZED VIEW command, switch to the owner's userid, so + * that any functions are run as that user. Also lock down security-restricted + * operations and arrange to make GUC variable changes local to this command. + */ + if (into && into->viewQuery) + { + Oid nspid = RangeVarGetCreationNamespace(into->rel); + + into->rel = makeRangeVar(get_namespace_name(nspid), into->rel->relname, -1); + + GetUserIdAndSecContext(&save_userid, &save_sec_context); + SetUserIdAndSecContext(save_userid, + save_sec_context | SECURITY_RESTRICTED_OPERATION); + save_nestlevel = NewGUCNestLevel(); + RestrictSearchPath(); + } + /* plan the query */ plan = pg_plan_query(query, queryString, cursorOptions, params); @@ -510,6 +534,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions, ExplainOnePlan(plan, into, es, queryString, params, queryEnv, &planduration, (es->buffers ? &bufusage : NULL), es->memory ? &mem_counters : NULL); + + /* CREATE MATERIALIZED VIEW command */ + if (into && into->viewQuery) + { + /* Roll back any GUC changes */ + AtEOXact_GUC(false, save_nestlevel); + + /* Restore userid and security context */ + SetUserIdAndSecContext(save_userid, save_sec_context); + } } /*
Re: CI and test improvements
On Tue, Jun 18, 2024 at 02:25:46PM -0700, Andres Freund wrote: > > > If not, we can just install the binaries distributed by the project, it's > > > just more work to keep up2date that way. > > > > I guess you mean a separate line to copy choco's ccache.exe somewhere. > > I was thinking of alternatively just using the binaries upstream > distributes. But the mingw way seems easier. > > Perhaps we should add an environment variable pointing to ccache to the image, > or such? I guess you mean changing the OS image so there's a $CCACHE. That sounds fine... > > +CCACHE_MAXSIZE: "500M" > > Does it have to be this big to work? It's using 150MB for an initial compilation, and maxsize will need to be 20% larger for it to not evict its cache before it can be used. The other ccaches (except for mingw) seem to be several times larger than what's needed for a single compilation, which makes sense to cache across multiple branches (or even commits in a single branch), and for cfbot. > A comment explaining why /Z7 is necessary would be good. Sure > >build_script: | > > vcvarsall x64 > > -ninja -C build > > +ninja -C build |tee build.txt > > +REM Since pipes lose the exit status of the preceding command, rerun > > the compilation > > +REM without the pipe, exiting now if it fails, to avoid trying to run > > checks > > +ninja -C build > nul > > Perhaps we could do something like > (ninja -C build && touch success) | tee build.txt > cat success > ? I don't know -- a pipe alone seems more direct than a subshell+conditional+pipe written in cmd.exe, whose syntax is not well known here. Maybe you're suggesting to write sh -c "(ninja -C build && touch success) | tee build.txt ; cat ./success" But that's another layer of complexity .. for what ? > > Subject: [PATCH 4/7] WIP: cirrus: upload changed html docs as artifacts > > I still think that this runs the risk of increasing space utilization and thus > increase frequency of caches/artifacts being purged. Maybe it should run on the local macs where I think you can control that. > > + # The commit that this branch is rebased on. There's no easy way to > > find this. > > I don't think that's true, IIRC I've complained about it before. We can do > something like: cfbot now exposes it, so it'd be trivial for that case (although there was no response here to my inquiries about that). I'll revisit this in the future, once other patches have progressed. > major_num=$(grep PG_MAJORVERSION_NUM build/src/include/pg_config.h|awk > '{print $3}'); > echo major: $major_num; > if git rev-parse --quiet --verify origin/REL_${major_num}_STABLE > /dev/null > ; then > basebranch=origin/REL_${major_num}_STABLE; > else > basebranch=origin/master; > fi; > echo base branch: $basebranch > base_commit=$(git merge-base HEAD $basebranch) >From 0cf4fa490b19aea8e536506e975f1e9b61be6574 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 25 May 2022 21:53:22 -0500 Subject: [PATCH 1/6] cirrus/windows: add compiler_warnings_script The goal is to fail due to warnings only after running tests. (At least historically, it's too slow to run a separate windows VM to compile with -Werror.) https://www.postgresql.org/message-id/20220212212310.f645c6vw3njkgxka%40alap3.anarazel.de I'm not sure how to write this test in windows shell; it's also easy to write something that doesn't work in posix sh, since windows shell is interpretting && and ||... 
See also: 8a1ce5e54f6d144e4f8e19af7c767b026ee0c956 https://cirrus-ci.com/task/6241060062494720 https://cirrus-ci.com/task/6496366607204352 20221104161934.gb16...@telsasoft.com 20221031223744.gt16...@telsasoft.com 20221104161934.gb16...@telsasoft.com 20221123054329.gg11...@telsasoft.com 20230224002029.gq1...@telsasoft.com ci-os-only: windows --- .cirrus.tasks.yml | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml index 1ce6c443a8c..5dd37e07aae 100644 --- a/.cirrus.tasks.yml +++ b/.cirrus.tasks.yml @@ -576,12 +576,21 @@ task: build_script: | vcvarsall x64 -ninja -C build +ninja -C build |tee build.txt +REM Since pipes lose the exit status of the preceding command, rerun the compilation +REM without the pipe, exiting now if it fails, to avoid trying to run checks +ninja -C build > nul check_world_script: | vcvarsall x64 meson test %MTEST_ARGS% --num-processes %TEST_JOBS% + # This should be last, so check_world is run even if there are warnings + always: +compiler_warnings_script: + # this avoids using metachars which would be interpretted by the windows shell + - sh -c 'if grep ": warning " build.txt; then exit 1; fi; exit 0' + on_failure: <<: *on_failure_meson crashlog_artifacts: -- 2.42.0 >From 764987ff4093783e967c13f55dee60fb6d89b4a0 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 25 May 2022 22:05:13 -0500 Subject: [PATCH 2/6
Re: Fix comments in instr_time.h and remove an unneeded cast to int64
On 06/08/2024 18:16, Tom Lane wrote: Heikki Linnakangas writes: Hmm, ok I see. Then I propose: 1. Revert 2. Just fix the comment to say int64 instead of uint64. Yeah, it's probably reasonable to specify the output as int64 not uint64 (especially since it looks like that's what the macros actually produce). Committed -- Heikki Linnakangas Neon (https://neon.tech)
Re: optimizing pg_upgrade's once-in-each-database steps
On Sun, Aug 04, 2024 at 07:19:57PM +0100, Ilya Gladyshev wrote: > -- End of review -- Thanks for the review. I've attempted to address all your feedback in v8 of the patch set. I think the names could still use some work, but I wanted to get the main structure in place before trying to fix them. > Regarding keeping the connections, the way I envisioned it entailed passing > a list of connections from one check to the next one (or keeping a global > state with connections?). I didn't concretely look at the code to verify > this, so it's just an abstract idea. My main concern with doing something like that is it could require maintaining a huge number of connections when there are many databases. GUCs like max_connections would need to be set accordingly. I'm a little dubious that the gains would be enough to justify it. -- nathan >From 2536a3aac4cabe70c59feb8134d88ee54214e3a2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 28 Jun 2024 11:02:44 -0500 Subject: [PATCH v8 01/11] introduce framework for parallelizing pg_upgrade tasks --- src/bin/pg_upgrade/Makefile | 1 + src/bin/pg_upgrade/async.c | 466 +++ src/bin/pg_upgrade/meson.build | 1 + src/bin/pg_upgrade/pg_upgrade.h | 14 + src/tools/pgindent/typedefs.list | 4 + 5 files changed, 486 insertions(+) create mode 100644 src/bin/pg_upgrade/async.c diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile index bde91e2beb..3bc4f5d740 100644 --- a/src/bin/pg_upgrade/Makefile +++ b/src/bin/pg_upgrade/Makefile @@ -12,6 +12,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ $(WIN32RES) \ + async.o \ check.o \ controldata.o \ dump.o \ diff --git a/src/bin/pg_upgrade/async.c b/src/bin/pg_upgrade/async.c new file mode 100644 index 00..e279fdd676 --- /dev/null +++ b/src/bin/pg_upgrade/async.c @@ -0,0 +1,466 @@ +/* + * async.c + * parallelization via libpq's async APIs + * + * This framework provides an efficient way of running the various + * once-in-each-database tasks required by pg_upgrade. Specifically, it + * parallelizes these tasks by managing a simple state machine of + * user_opts.jobs slots and using libpq's asynchronous APIs to establish the + * connections and run the queries. Callers simply need to create a callback + * function and build/execute an AsyncTask. A simple example follows: + * + * static void + * my_process_cb(DbInfo *dbinfo, PGresult *res, void *arg) + * { + * for (int i = 0; i < PQntuples(res); i++) + * { + * ... process results ... + * } + * } + * + * void + * my_task(ClusterInfo *cluster) + * { + * AsyncTask *task = async_task_create(); + * + * async_task_add_step(task, + * "... query text ...", + * my_process_cb, + * true, // let the task free the PGresult + * NULL); // "arg" pointer for the callbacks + * async_task_run(task, cluster); + * async_task_free(task); + * } + * + * Note that multiple steps can be added to a given task. When there are + * multiple steps, the task will run all of the steps consecutively in the same + * database connection before freeing the connection and moving on. In other + * words, it only ever initiates one connection to each database in the + * cluster for a given run. + * + * Copyright (c) 2024, PostgreSQL Global Development Group + * src/bin/pg_upgrade/async.c + */ + +#include "postgres_fe.h" + +#include "common/connect.h" +#include "fe_utils/string_utils.h" +#include "pg_upgrade.h" + +/* + * dbs_complete stores the number of databases that we have completed + * processing. 
When this value equals the number of databases in the cluster, + * the task is finished. + */ +static int dbs_complete; + +/* + * dbs_processing stores the index of the next database in the cluster's array + * of databases that will be picked up for processing. It will always be + * greater than or equal to dbs_complete. + */ +static int dbs_processing; + +/* + * This struct stores all the information for a single step of a task. All + * steps in a task are run in a single connection before moving on to the next + * database (which requires a new connection). + */ +typedef struct AsyncTaskCallbacks +{ + AsyncTaskProcessCB process_cb; /* processes the results of the query */ + const char *query; /* query text */ + boolfree_result;/* should we free the result? */ + void *arg;
Re: Inval reliability, especially for inplace updates
On Tue, Jun 18, 2024 at 08:23:49AM -0700, Noah Misch wrote: > On Mon, Jun 17, 2024 at 06:57:30PM -0700, Andres Freund wrote: > > On 2024-06-17 16:58:54 -0700, Noah Misch wrote: > > > That inplace150 patch turned out to be unnecessary. Contrary to the > > > "noncritical resource releasing" comment some lines above > > > AtEOXact_Inval(true), the actual behavior is already to promote ERROR to > > > PANIC. An ERROR just before or after sending invals becomes PANIC, > > > "cannot > > > abort transaction %u, it was already committed". > > > > Relying on that, instead of explicit critical sections, seems fragile to me. > > IIRC some of the behaviour around errors around transaction commit/abort has > > changed a bunch of times. Tying correctness into something that could be > > changed for unrelated reasons doesn't seem great. > > Fair enough. It could still be a good idea for master, but given I missed a > bug in inplace150-inval-durability-atcommit-v1.patch far worse than the ones > $SUBJECT fixes, let's not risk it in back branches. What are your thoughts on whether a change to explicit critical sections should be master-only vs. back-patched? I have a feeling your comment pointed to something I'm still missing, but I don't know where to look next.
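For anyone skimming the thread, the "explicit critical section" alternative being weighed here is just the ordinary pattern sketched below. This is an illustration rather than code from the inplace patches, and the inval call is a stand-in for the real call site: inside a critical section the elog machinery itself promotes any ERROR to PANIC, so correctness would no longer hinge on the "cannot abort transaction %u, it was already committed" behavior.

START_CRIT_SECTION();
/* deliver the invalidation messages; an ERROR anywhere in here now PANICs */
SendSharedInvalidMessages(msgs, nmsgs);
END_CRIT_SECTION();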
Unused expression indexes
In a blog post [1], Bruce Momjian notes that expression indexes can help with planning even if they're not used directly. But the examples cited in that post are vague (i.e., they improve stats, but it's not clear how they could change plans), and Bruce's answer to a comment [2] suggests that this is not documented. Is there any more info on this mechanism? Specifically, if one has unused expression indexes (according to pg_stat_user_indexes), is it safe to drop them? Or could they be providing statistics that materially affect query planning even though the indexes themselves are unused? It looks like a very similar question was asked earlier this year on pgsql-general [3], but got only a vague answer. Any background or tips for where to look in the source regarding this behavior would be greatly appreciated. Thanks, Maciek [1]: https://momjian.us/main/blogs/pgblog/2017.html#February_20_2017 [2]: https://momjian.us/main/comment_item.html?/main/blogs/pgblog.html/February_20_2017#comment-3174376969 [3]: https://www.postgresql.org/message-id/flat/CAHg%3DPn%3DOZu7A3p%2B0Z-CDG4s2CHYe3UFQCTZp4RWGCEn2gmD35A%40mail.gmail.com
Re: Minor refactorings to eliminate some static buffers
On 06/08/2024 18:52, Andres Freund wrote: On 2024-08-06 18:13:56 +0300, Heikki Linnakangas wrote: diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 702a6c3a0b..2732d6bfc9 100644 --- a/src/backend/tcop/utility.c +++ b/src/backend/tcop/utility.c @@ -1155,7 +1155,7 @@ ProcessUtilitySlow(ParseState *pstate, { CreateStmt *cstmt = (CreateStmt *) stmt; Datum toast_options; - static char *validnsps[] = HEAP_RELOPT_NAMESPACES; + const char *validnsps[] = HEAP_RELOPT_NAMESPACES; /* Remember transformed RangeVar for LIKE */ table_rv = cstmt->relation; In the other places you used "const char * const", here just "const char *" - it doesn't look like that's a required difference? Just an oversight. I went back and forth on whether to use plain "char *validnps[]", "const char *validnsps[]" or "const char *const validnsps[]". The first form doesn't convey the fact it's read-only, like the "static" used to. The second form hints that, but only for the strings, not for the pointers in the array. The last form is what we want, but it's a bit verbose and ugly. I chose the last form in the end, but missed this one. Fixed that and pushed. Thanks! -- Heikki Linnakangas Neon (https://neon.tech)
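To spell out the difference between the three forms with a throwaway example (not code from the patch, though HEAP_RELOPT_NAMESPACES does expand to a similar string-pointer array); only the last assignment is rejected by the compiler:

char	   *a[] = {"toast", NULL};			/* nothing is read-only */
const char *b[] = {"toast", NULL};			/* pointed-to strings are read-only */
const char *const c[] = {"toast", NULL};	/* strings and array slots are read-only */

a[0][0] = 'x';		/* compiles, but writing a string literal is undefined behavior */
b[0] = "heap";		/* compiles: the array elements themselves are not const */
c[0] = "heap";		/* compile-time error: assignment of read-only location */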
Re: Unused expression indexes
Maciek Sakrejda writes: > In a blog post [1], Bruce Momjian notes that expression indexes can > help with planning even if they're not used directly. But the examples > cited in that post are vague (i.e., they improve stats, but it's not > clear how they could change plans), and Bruce's answer to a comment > [2] suggests that this is not documented. > Is there any more info on this mechanism? Specifically, if one has > unused expression indexes (according to pg_stat_user_indexes), is it > safe to drop them? Or could they be providing statistics that > materially affect query planning even though the indexes themselves > are unused? Expression indexes definitely can affect planning, because ANALYZE collects stats on the values of those expressions. As a trivial example, regression=# create table foo (x1 float8); CREATE TABLE regression=# insert into foo select 10 * random() from generate_series(1,1); INSERT 0 1 regression=# analyze foo; ANALYZE regression=# explain analyze select * from foo where sqrt(x1) < 1; QUERY PLAN - Seq Scan on foo (cost=0.00..195.00 rows= width=8) (actual time=0.009..0.546 rows=1028 loops=1) Filter: (sqrt(x1) < '1'::double precision) Rows Removed by Filter: 8972 Planning Time: 0.065 ms Execution Time: 0.572 ms (5 rows) The planner has no info about the values of sqrt(x1), so you get a default estimate (one-third) of the selectivity of the WHERE clause. But watch this: regression=# create index on foo (sqrt(x1)); CREATE INDEX regression=# analyze foo; ANALYZE regression=# explain analyze select * from foo where sqrt(x1) < 1; QUERY PLAN Bitmap Heap Scan on foo (cost=24.24..84.63 rows=1026 width=8) (actual time=0.078..0.229 rows=1028 loops=1) Recheck Cond: (sqrt(x1) < '1'::double precision) Heap Blocks: exact=45 -> Bitmap Index Scan on foo_sqrt_idx (cost=0.00..23.98 rows=1026 width=0) (actual time=0.068..0.068 rows=1028 loops=1) Index Cond: (sqrt(x1) < '1'::double precision) Planning Time: 0.113 ms Execution Time: 0.259 ms (7 rows) Now there are stats about the values of sqrt(x1), allowing a far more accurate selectivity estimate to be made. In this particular example there's no change of plan (it would have used the index anyway), but certainly a different rowcount estimate can make a big difference. This mechanism is quite ancient, and in principle it's now superseded by extended statistics. For example, I can drop this index and instead do regression=# drop index foo_sqrt_idx; DROP INDEX regression=# create statistics foostats on sqrt(x1) from foo; CREATE STATISTICS regression=# analyze foo; ANALYZE regression=# explain analyze select * from foo where sqrt(x1) < 1; QUERY PLAN - Seq Scan on foo (cost=0.00..195.00 rows=1026 width=8) (actual time=0.006..0.479 rows=1028 loops=1) Filter: (sqrt(x1) < '1'::double precision) Rows Removed by Filter: 8972 Planning Time: 0.079 ms Execution Time: 0.503 ms (5 rows) So the accurate rowcount estimate is still obtained in this example; and we're not incurring any index maintenance costs, only ANALYZE costs that are going to be roughly the same either way. However, I am not certain that extended statistics are plugged into all the places where the older mechanism applies. Tomas Vondra might have a better idea than I of where gaps remain in that. regards, tom lane
Re: Minor refactorings to eliminate some static buffers
On 06.08.24 17:13, Heikki Linnakangas wrote: > --- a/src/backend/access/transam/xlogprefetcher.c > +++ b/src/backend/access/transam/xlogprefetcher.c > @@ -362,7 +362,7 @@ XLogPrefetcher * > XLogPrefetcherAllocate(XLogReaderState *reader) > { > XLogPrefetcher *prefetcher; > - static HASHCTL hash_table_ctl = { > + const HASHCTL hash_table_ctl = { Is there a reason this is not changed to static const HASHCTL ... ? Most other places were changed in that way.
Re: Unused expression indexes
On Tue, Aug 6, 2024 at 1:25 PM Tom Lane wrote: > The planner has no info about the values of sqrt(x1), so you get a > default estimate (one-third) of the selectivity of the WHERE clause. > But watch this: > > regression=# create index on foo (sqrt(x1)); > CREATE INDEX > regression=# analyze foo; > ANALYZE > regression=# explain analyze select * from foo where sqrt(x1) < 1; > QUERY PLAN > > Bitmap Heap Scan on foo (cost=24.24..84.63 rows=1026 width=8) (actual > time=0.078..0.229 rows=1028 loops=1) >Recheck Cond: (sqrt(x1) < '1'::double precision) >Heap Blocks: exact=45 >-> Bitmap Index Scan on foo_sqrt_idx (cost=0.00..23.98 rows=1026 > width=0) (actual time=0.068..0.068 rows=1028 loops=1) > Index Cond: (sqrt(x1) < '1'::double precision) > Planning Time: 0.113 ms > Execution Time: 0.259 ms > (7 rows) > > Now there are stats about the values of sqrt(x1), allowing a far more > accurate selectivity estimate to be made. In this particular example > there's no change of plan (it would have used the index anyway), but > certainly a different rowcount estimate can make a big difference. Thanks, but I was asking specifically about _unused_ indexes (according to pg_stat_user_indexes). Bruce's blog post showed how they can still influence rowcount estimates, but can they do that (1) even if they don't end up being used by the query plan and (2) in a way that leads to a different plan? Basically, if I have some unused expression indexes, is it safe to drop them, or could they be used for planning optimizations even if they are not used directly. Thanks, Maciek
Re: Unused expression indexes
Maciek Sakrejda writes: > Thanks, but I was asking specifically about _unused_ indexes > (according to pg_stat_user_indexes). Bruce's blog post showed how they > can still influence rowcount estimates, but can they do that (1) even > if they don't end up being used by the query plan and (2) in a way > that leads to a different plan? Certainly. This example was too simple to illustrate that point perhaps, but we would have arrived at the same rowcount estimate whether it then chose to use the index or not. (You could prove that with the same example by modifying the comparison constant to make the rowcount estimate high enough to discourage use of the index.) In turn, a different rowcount estimate might change the plan for subsequent joins or other processing. I didn't spend enough time on the example to show that, but it's surely not hard to show. regards, tom lane
Re: Unused expression indexes
On 8/6/24 22:25, Tom Lane wrote: > Maciek Sakrejda writes: >> In a blog post [1], Bruce Momjian notes that expression indexes can >> help with planning even if they're not used directly. But the examples >> cited in that post are vague (i.e., they improve stats, but it's not >> clear how they could change plans), and Bruce's answer to a comment >> [2] suggests that this is not documented. > >> Is there any more info on this mechanism? Specifically, if one has >> unused expression indexes (according to pg_stat_user_indexes), is it >> safe to drop them? Or could they be providing statistics that >> materially affect query planning even though the indexes themselves >> are unused? > > Expression indexes definitely can affect planning, because ANALYZE > collects stats on the values of those expressions. As a trivial > example, > > regression=# create table foo (x1 float8); > CREATE TABLE > regression=# insert into foo select 10 * random() from > generate_series(1,1); > INSERT 0 1 > regression=# analyze foo; > ANALYZE > regression=# explain analyze select * from foo where sqrt(x1) < 1; > QUERY PLAN > - > Seq Scan on foo (cost=0.00..195.00 rows= width=8) (actual > time=0.009..0.546 rows=1028 loops=1) >Filter: (sqrt(x1) < '1'::double precision) >Rows Removed by Filter: 8972 > Planning Time: 0.065 ms > Execution Time: 0.572 ms > (5 rows) > > The planner has no info about the values of sqrt(x1), so you get a > default estimate (one-third) of the selectivity of the WHERE clause. > But watch this: > > regression=# create index on foo (sqrt(x1)); > CREATE INDEX > regression=# analyze foo; > ANALYZE > regression=# explain analyze select * from foo where sqrt(x1) < 1; > QUERY PLAN > > > Bitmap Heap Scan on foo (cost=24.24..84.63 rows=1026 width=8) (actual > time=0.078..0.229 rows=1028 loops=1) >Recheck Cond: (sqrt(x1) < '1'::double precision) >Heap Blocks: exact=45 >-> Bitmap Index Scan on foo_sqrt_idx (cost=0.00..23.98 rows=1026 > width=0) (actual time=0.068..0.068 rows=1028 loops=1) > Index Cond: (sqrt(x1) < '1'::double precision) > Planning Time: 0.113 ms > Execution Time: 0.259 ms > (7 rows) > > Now there are stats about the values of sqrt(x1), allowing a far more > accurate selectivity estimate to be made. In this particular example > there's no change of plan (it would have used the index anyway), but > certainly a different rowcount estimate can make a big difference. > > This mechanism is quite ancient, and in principle it's now superseded > by extended statistics. For example, I can drop this index and > instead do > > regression=# drop index foo_sqrt_idx; > DROP INDEX > regression=# create statistics foostats on sqrt(x1) from foo; > CREATE STATISTICS > regression=# analyze foo; > ANALYZE > regression=# explain analyze select * from foo where sqrt(x1) < 1; > QUERY PLAN > > - > Seq Scan on foo (cost=0.00..195.00 rows=1026 width=8) (actual > time=0.006..0.479 rows=1028 loops=1) >Filter: (sqrt(x1) < '1'::double precision) >Rows Removed by Filter: 8972 > Planning Time: 0.079 ms > Execution Time: 0.503 ms > (5 rows) > > So the accurate rowcount estimate is still obtained in this example; > and we're not incurring any index maintenance costs, only ANALYZE > costs that are going to be roughly the same either way. > > However, I am not certain that extended statistics are plugged into > all the places where the older mechanism applies. Tomas Vondra might > have a better idea than I of where gaps remain in that. > AFAIK it handles all / exactly the same cases. 
The magic happens in examine_variable() in selfuncs.c, where we look for stats for simple Vars, and then for stats for expressions. First we look at all indexes, and then (if there's no suitable index) at all extended stats. There might be a place doing something ad hoc, but I can't think of any. regards -- Tomas Vondra
Re: Inconsistency with EXPLAIN ANALYZE CREATE MATERIALIZED VIEW
On Tue, 2024-08-06 at 14:36 -0400, Tom Lane wrote: > I'm not really sure I see the point of this, if it doesn't "just > work" > with all variants of C.M.V. It's not like you can't easily EXPLAIN > the view's SELECT. > > If REFRESH M. V. does something different than CREATE, there would > certainly be value in being able to EXPLAIN what that does --- but > that still isn't an argument for allowing EXPLAIN CREATE MATERIALIZED > VIEW. We already allow EXPLAIN ANALYZE CREATE MATERIALIZED VIEW in all supported versions. That seems strange and it surprised me, but the parser structure is shared between SELECT ... INTO and CREATE MATERIALIZED VIEW, so I suppose it was supported out of convenience. The problem is that the implementation is split between the EXPLAIN ANALYZE path and the non-EXPLAIN path. The long-ago commit f3ab5d4696 missed the EXPLAIN path. NB: I do not believe this is a security concern, but it does create the inconsistency described in the email starting this thread. Options: 1. Do nothing on the grounds that EXPLAIN ANALYZE CREATE MATERIALIZED VIEW is not common enough to worry about, and the consequences of the inconsistency are not bad enough. 2. Refactor some more to make EXPLAIN ANALYZE CREATE MATERIALIZED VIEW share the query part of the code path with REFRESH so that it benefits from the SECURITY_RESTRICTED_OPERATION and RestrictSearchPath(). 3. Do #2 but also make it work for REFRESH, but not CONCURRENTLY. 4. Do #3 but also make it work for REFRESH ... CONCURRENTLY and provide new information that's not available by only explaining the query. And also figure out if any of this should be back-patched. Regards, Jeff Davis
Re: tiny step toward threading: reduce dependence on setlocale()
On Tue, 2024-07-30 at 12:13 -0700, Jeff Davis wrote: > I found a couple issues with the later patches: > > * There was still some confusion about the default collation vs. > datcollate/datctype for callers of wchar2char() and char2wchar() > (those > functions only work for libc). I introduced a new pg_locale_t > structure > to represent datcollate/datctype regardless of datlocprovider to > solve > this. I didn't quite like the API I introduced in 0001, so I skipped 0001. For 0002 I left char2wchar() and wchar2char() as-is, where they expect libc and accept a NULL pg_locale_t. I committed the rest of 0002. > * Another loose end relying on setlocale(): in selfuncs.c, there's > still a call directly to strxfrm(), which depends on setlocale(). I > changed this to lookup the collation and then use pg_strxfrm(). That > should improve histogram selectivity estimates because it uses the > correct provider, rather than relying on setlocale(), right? Committed 0003. With these changes, collations are no longer dependent on the environment locale (setlocale()) at all for either collation behavior (ORDER BY) or ctype behavior (LOWER(), etc.). Additionally, unless I missed something, nothing in the server is dependent on LC_COLLATE at all. There are still some things that depend on setlocale() in one way or another: - char2wchar() & wchar2char() - ts_locale.c - various places that depend on LC_CTYPE unrelated to the collation infrastructure - things that depend on other locale settings, like LC_NUMERIC We can address those as part of a separate thread. I'll count this as committed. Regards, Jeff Davis
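As a rough sketch of the selfuncs.c change described above (not the committed hunk; the variable names are placeholders and error handling is omitted), the idea is to look the collation up once and go through the provider-aware wrapper instead of raw strxfrm():

pg_locale_t	mylocale = pg_newlocale_from_collation(collid);
char		sortkey[128];

/* pg_strxfrm() honors the collation's own provider, not setlocale() */
(void) pg_strxfrm(sortkey, valstr, sizeof(sortkey), mylocale);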
Re: btree: implement dynamic prefix truncation (was: Improving btree performance through specializing by key shape, take 2)
On Fri, 1 Mar 2024 at 14:48, Matthias van de Meent wrote: > Attached is version 15 of this patch, with the above issues fixed. > It's also rebased on top of 655dc310 of this morning, so that should > keep good for some time again. Attached is version 16 now. Relevant changes from previous patch versions: - Move from 1-indexed AttributeNumber to 0-indexed ints for prefixes, and use "prefix" as naming scheme, rather than cmpcol. A lack of prefix, previously indicated with a cmpcol value of 1, is now a prefix value of 0. - Adjusted README - Improved the efficiency of the insertion path in some cases where we've previously compared the page's highkey. As always, why we need this: Currently, btrees are quite inefficient when they have many key attributes but low attribute cardinality in the prefix, e.g. an index on ("", "", "", uuid). This is not just inefficient use of disk space with the high repetition of duplicate prefix values in pages, but it is also a computational overhead when we're descending the tree in e.g. _bt_first() or btinsert(): The code we use to search a page currently compares the full key to the full searchkey, for a complexity of O(n_equal_attributes + 1) for every tuple on the page, for O(log(page_ntups) * (n_equal_attributes + 1)) attribute compares every page during descent. This patch fixes that part of the computational complexity by applying dynamic prefix compression, thus reducing the average computational complexity in random index lookups to O(3 * (n_equal_attributes) + log(page_ntups)) per page (assuming at least 4 non-hikey tuples on each page). In practice, this makes indexes with 3+ attributes and prefixes with low selectivity (such as the example above) much more viable computationally, as we have to spend much less effort on comparing known index attributes during descent. Note that this does _not_ reuse prefix bounds across pages - it re-establishes the left- and right prefixes every page during the binary search. See the README modified in the patch for specific implementation details and considerations. This patch synergizes with the highkey optimization used in [0]: When combined, the number of attribute compare function calls could be further reduced to O(2 * (n_equal_atts) + log(page_ntups)), a reduction by n_equal_atts every page, which in certain wide index types could be over 25% of all attribute compare function calls on the page after dynamic prefix truncation. However, both are separately useful and reduce the amount of work done on most pages. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://www.postgresql.org/message-id/flat/CAEze2WijWhCQy_nZVP4Ye5Hwj=YW=3rqv+hbmjgcohtryqm...@mail.gmail.com v16-0001-btree-Implement-dynamic-prefix-truncation.patch Description: Binary data
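To put rough numbers on those bounds: with three low-cardinality leading attributes (n_equal_attributes = 3) and roughly 256 tuples per page (log(page_ntups) ~ 8), the old bound works out to about 8 * (3 + 1) = 32 attribute compares per page during the descent, while the new bound is about 3 * 3 + 8 = 17 -- roughly half, and the gap widens as pages hold more tuples or as keys gain more attributes. (This is just illustrative arithmetic on the formulas above, not a measurement.)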
Re: Remove dependence on integer wrapping
On Wed, Jul 24, 2024 at 05:49:41PM -0400, Matthew Kim wrote: > Hi, I've attached a patch that fixes the make_date overflow. I've upgraded > the patches accordingly to reflect Joseph's committed fix. cfbot is unhappy with this one on Windows [0], and I think I see why. While the patch uses pg_mul_s32_overflow(), it doesn't check its return value, so we'll just proceed with a bogus value in the event of an overflow. Windows apparently doesn't have HAVE__BUILTIN_OP_OVERFLOW defined, so pg_mul_s32_overflow() sets the result to 0x5eed (24,301 in decimal), which is where I'm guessing the year -24300 in the ERROR is from. [0] https://api.cirrus-ci.com/v1/artifact/task/5872294914949120/testrun/build/testrun/regress/regress/regression.diffs -- nathan
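For reference, the pattern the patch needs is roughly the following -- a hedged sketch rather than the actual make_date fix, with "a", "b" and the error text as placeholders. pg_mul_s32_overflow() (common/int.h) returns true on overflow, and in that case the output variable must not be used:

int32		product;

if (pg_mul_s32_overflow(a, b, &product))
	ereport(ERROR,
			(errcode(ERRCODE_DATETIME_FIELD_OVERFLOW),
			 errmsg("date field value out of range")));
/* "product" is only meaningful when no overflow was reported */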
Remaining dependency on setlocale()
After some previous work here: https://postgr.es/m/89475ee5487d795124f4e25118ea8f1853edb8cb.ca...@j-davis.com we are less dependent on setlocale(), but it's still not completely gone. setlocale() counts as thread-unsafe, so it would be nice to eliminate it completely. The obvious answer is uselocale(), which sets the locale only for the calling thread, and takes precedence over whatever is set with setlocale(). But there are a couple problems: 1. I don't think it's supported on Windows. 2. I don't see a good way to canonicalize a locale name, like in check_locale(), which uses the result of setlocale(). Thoughts? Regards, Jeff Davis
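For what it's worth, the per-thread behavior looks like this in a minimal standalone program (assuming a POSIX-2008 libc and that the named locale is installed; this is only to illustrate the API, not proposed code):

#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <stdio.h>

int
main(void)
{
	locale_t	loc = newlocale(LC_ALL_MASK, "en_US.UTF-8", (locale_t) 0);
	locale_t	old;

	if (loc == (locale_t) 0)
		return 1;				/* locale not installed */

	old = uselocale(loc);		/* overrides setlocale() for this thread only */
	printf("%.2f\n", 1234.5);	/* formatted under the thread-local locale */
	uselocale(old);				/* restore the previous thread locale */
	freelocale(loc);
	return 0;
}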
Re: Minor refactorings to eliminate some static buffers
On 06/08/2024 23:41, Peter Eisentraut wrote: On 06.08.24 17:13, Heikki Linnakangas wrote: > --- a/src/backend/access/transam/xlogprefetcher.c > +++ b/src/backend/access/transam/xlogprefetcher.c > @@ -362,7 +362,7 @@ XLogPrefetcher * > XLogPrefetcherAllocate(XLogReaderState *reader) > { > XLogPrefetcher *prefetcher; > - static HASHCTL hash_table_ctl = { > + const HASHCTL hash_table_ctl = { Is there a reason this is not changed to static const HASHCTL ... ? Most other places where changed in that way. No particular reason. Grepping for HASHCTL's, this is actually different from all other uses of HASHCTL and hash_create. All others use a plain local variable, and fill the fields like this: HASHCTL hash_ctl; hash_ctl.keysize = sizeof(missing_cache_key); hash_ctl.entrysize = sizeof(missing_cache_key); hash_ctl.hcxt = TopMemoryContext; hash_ctl.hash = missing_hash; hash_ctl.match = missing_match; I think that's just because we haven't allowed C99 designated initializers for very long. -- Heikki Linnakangas Neon (https://neon.tech)
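For comparison, the designated-initializer form being alluded to looks roughly like this -- the field values here are made up, not the ones in xlogprefetcher.c:

static const HASHCTL hash_table_ctl = {
	.keysize = sizeof(MyKey),			/* placeholder key type */
	.entrysize = sizeof(MyEntry)		/* placeholder entry type */
};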
Re: Remaining dependency on setlocale()
Jeff Davis writes: > But there are a couple problems: > 1. I don't think it's supported on Windows. Can't help with that, but surely Windows has some thread-safe way. > 2. I don't see a good way to canonicalize a locale name, like in > check_locale(), which uses the result of setlocale(). What I can tell you about that is that check_locale's expectation that setlocale does any useful canonicalization is mostly wishful thinking [1]. On a lot of platforms you just get the input string back again. If that's the only thing keeping us on setlocale, I think we could drop it. (Perhaps we should do some canonicalization of our own instead?) regards, tom lane [1] https://www.postgresql.org/message-id/14856.1348497...@sss.pgh.pa.us
Re: Vectored IO in XLogWrite()
Hi Robert, Thanks for reviewing. Robert Haas , 6 Ağu 2024 Sal, 20:43 tarihinde şunu yazdı: > On Tue, Aug 6, 2024 at 5:36 AM Melih Mutlu wrote: > > I think that we don't have the "contiguous pages" constraint when > writing anymore as we can do vectored IO. It seems unnecessary to write > just because XLogWrite() is at the end of wal buffers. > > Attached patch uses pg_pwritev() instead of pg_pwrite() and tries to > write pages in one call even if they're not contiguous in memory, until it > reaches the page at startidx. > > Here are a few notes on this patch: > > - It's not pgindent-clean. In fact, it doesn't even pass git diff --check. > Fixed. > - You added a new comment (/* Reaching the buffer... */) in the middle > of a chunk of lines that were already covered by an existing comment > (/* Dump the set ... */). This makes it look like the /* Dump the > set... */ comment only covers the 3 lines of code that immediately > follow it rather than everything in the "if" statement. You could fix > this in a variety of ways, but in this case the easiest solution, to > me, looks like just skipping the new comment. It seems like the point > is pretty self-explanatory. > Removed the new comment. Only keeping the updated version of the /* Dump the set... */ comment. > - The patch removes the initialization of "from" but not the variable > itself. You still increment the variable you haven't initialized. > > - I don't think the logic is correct after a partial write. Pre-patch, > "from" advances while "nleft" goes down, but post-patch, what gets > written is dependent on the contents of "iov" which is initialized > outside the loop and never updated. Perhaps compute_remaining_iovec > would be useful here? > You're right. I should have thought about the partial write case. I now fixed it by looping and trying to write until compute_remaining_iovec() returns 0. Thanks, -- Melih Mutlu Microsoft v2-0001-Use-pg_pwritev-in-XlogWrite.patch Description: Binary data
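To sketch what that retry loop looks like (this is not the patch itself; fd, start_offset and the iovec contents are placeholders, and it assumes compute_remaining_iovec() from src/common, which as I understand it shifts the iovec array past the bytes already written and returns how many entries remain):

struct iovec iov[2];			/* filled in elsewhere with WAL-buffer ranges */
int			iovcnt = 2;
off_t		offset = start_offset;

do
{
	ssize_t		written = pg_pwritev(fd, iov, iovcnt, offset);

	if (written < 0)
	{
		if (errno == EINTR)
			continue;
		ereport(PANIC,
				(errcode_for_file_access(),
				 errmsg("could not write to WAL file: %m")));
	}
	offset += written;
	iovcnt = compute_remaining_iovec(iov, iov, iovcnt, written);
} while (iovcnt > 0);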
Re: Avoiding superfluous buffer locking during nbtree backwards scans
On Fri, 5 Jul 2024 at 18:48, Peter Geoghegan wrote: > > Attached patch teaches nbtree backwards scans to avoid needlessly > relocking a previously read page/buffer at the point where we need to > consider reading the next page (the page to the left). +1, LGTM. This changes the backward scan code in _bt_readpage to handle end-of-scan cases in approximately the same way as the forward scan case, which is an improvement IMO. Kind regards, Matthias van de Meent Neon (https://neon.tech/)
Enable data checksums by default
Please find attached a patch to enable data checksums by default. Currently, initdb only enables data checksums if passed the --data-checksums or -k argument. There was some hesitation years ago when this feature was first added, leading to the current situation where the default is off. However, many years later, there is wide consensus that this is an extraordinarily safe, desirable setting. Indeed, most (if not all) of the major commercial and open source Postgres systems currently turn this on by default. I posit you would be hard-pressed to find many systems these days in which it has NOT been turned on. So basically we have a de-facto standard, and I think it's time we flipped the switch to make it on by default. The patch is simple enough: literally flipping the boolean inside of initdb.c, and adding a new argument '--no-data-checksums' for those instances that truly want the old behavior. One place that still needs the old behavior is our internal tests for pg_checksums and pg_amcheck, so I added a new argument to init() in PostgreSQL/Test/Cluster.pm to allow those to still pass their tests. This is just the default - people are still more than welcome to turn it off with the new flag. The pg_checksums program is another option that actually argues for having the default "on", as switching to "off" once initdb has been run is trivial. Yes, I am aware of the previous discussions on this, but the world moves fast - wal compression is better than in the past, vacuum is better now, and data-checksums being on is such a complete default in the wild, it feels weird and a disservice that we are not running all our tests like that. Cheers, Greg From 12ce067f5ba64414d1d14c5f2e763d04cdfacd13 Mon Sep 17 00:00:00 2001 From: Greg Sabino Mullane Date: Tue, 6 Aug 2024 18:18:56 -0400 Subject: [PATCH] Make initdb enable data checksums by default --- doc/src/sgml/ref/initdb.sgml | 4 ++- src/bin/initdb/initdb.c | 6 - src/bin/initdb/t/001_initdb.pl| 31 +-- src/bin/pg_amcheck/t/003_check.pl | 2 +- src/bin/pg_amcheck/t/004_verify_heapam.pl | 2 +- src/bin/pg_checksums/t/002_actions.pl | 2 +- src/test/perl/PostgreSQL/Test/Cluster.pm | 7 + 7 files changed, 41 insertions(+), 13 deletions(-) diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml index bdd613e77f..511f489d34 100644 --- a/doc/src/sgml/ref/initdb.sgml +++ b/doc/src/sgml/ref/initdb.sgml @@ -267,12 +267,14 @@ PostgreSQL documentation Use checksums on data pages to help detect corruption by the I/O system that would otherwise be silent. Enabling checksums -may incur a noticeable performance penalty. If set, checksums +may incur a small performance penalty. If set, checksums are calculated for all objects, in all databases. All checksum failures will be reported in the pg_stat_database view. See for details. +As of version 18, checksums are enabled by default. They can be +disabled by use of --no-data-checksums. 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c index f00718a015..ce7d3e99e5 100644 --- a/src/bin/initdb/initdb.c +++ b/src/bin/initdb/initdb.c @@ -164,7 +164,7 @@ static bool noinstructions = false; static bool do_sync = true; static bool sync_only = false; static bool show_setting = false; -static bool data_checksums = false; +static bool data_checksums = true; static char *xlog_dir = NULL; static int wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024); static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC; @@ -3121,6 +3121,7 @@ main(int argc, char *argv[]) {"waldir", required_argument, NULL, 'X'}, {"wal-segsize", required_argument, NULL, 12}, {"data-checksums", no_argument, NULL, 'k'}, + {"no-data-checksums", no_argument, NULL, 20}, {"allow-group-access", no_argument, NULL, 'g'}, {"discard-caches", no_argument, NULL, 14}, {"locale-provider", required_argument, NULL, 15}, @@ -3319,6 +3320,9 @@ main(int argc, char *argv[]) if (!parse_sync_method(optarg, &sync_method)) exit(1); break; + case 20: +data_checksums = false; +break; default: /* getopt_long already emitted a complaint */ pg_log_error_hint("Try \"%s --help\" for more information.", progname); diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl index 06a35ac0b7..64395ec531 100644 --- a/src/bin/initdb/t/001_initdb.pl +++ b/src/bin/initdb/t/001_initdb.pl @@ -69,16 +69,11 @@ mkdir $datadir; } } -# Control file should tell that data checksums are disabled by default. +# Control file should tell that data checksums are enabled by default. command_like( [ 'pg_controldata', $datadir ], - qr/Data page checksum version:.*0/, - 'checksums are disabled in control file'); -# pg_checksums fails with checksums disabled by default.
Re: Instability with incremental backup tests (pg_combinebackup, 003_timeline.pl)
On 8/6/24 14:53, Tomas Vondra wrote: > On 8/6/24 07:48, Michael Paquier wrote: >> Hi all, >> >> dikkop has reported a failure with the regression tests of pg_combinebackup: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-08-04%2010%3A04%3A51 >> >> That's in the test 003_timeline.pl, from dc212340058b: >> # Failed test 'incremental backup from node1' >> # at t/003_timeline.pl line 43. >> >> The node is extremely slow, so perhaps bumping up the timeout would be >> fine enough in this case (did not spend time analyzing it). I don't >> think that this has been discussed, but perhaps I just missed a >> reference to it and the incremental backup thread is quite large. >> > > Yeah, it's a freebsd running on rpi4, from a USB flash disk, and in my > experience it's much slower than rpi4 running Linux. I'm not sure why is > that, never found a way to make it faster > > The machine already has: > > export PGCTLTIMEOUT=600 > export PG_TEST_TIMEOUT_DEFAULT=600 > > I doubt increasing it further will do the trick. Maybe there's some > other timeout that I should increase? > > FWIW I just moved the buildfarm stuff to a proper SSD disk (still USB, > but hopefully better than the crappy flash disk). > Seems the move to SSD helped a lot - the runs went from ~4h to ~40m. So chances are the instability won't be such a problem. regards -- Tomas Vondra
Re: Proposal for implementing OCSP Stapling in PostgreSQL
Thanks a lot Jacob for helping update the tests and sorry for the late reply. Based on previous discussion, I remove the document patch, and start to focus on the v1 simple OCSP logic by checking the leaf/Postgres server certificate's status only (0001-v1-WIP-OCSP-support-certificate-status-check.patch). I also merge your changes and simplify the test by testing the Postgres server certificate's status only (0002-v1-WIP-OCSP-simplify-.res-generation-and-regress-tes.patch). On 2024-07-18 10:18 a.m., Jacob Champion wrote: A question from ignorance: how does the client decide that the signature on the OCSP response is actually valid for the specific chain in use? If I understand correctly, the certificate used by the OCSP responder to sign the OCSP response must be valid for the specific chain in use, or the admins allow to load a new chain to validate the certificate used to sign the OCSP response. I think it would be better to make this part to be more open. Okay. That part needs more design work IMO, and solid testing. Based on the RFC here, https://datatracker.ietf.org/doc/html/rfc6960#section-4.2.2.2. My understanding is that the OCSP responder's certificate should be directly signed by the same CA which signed the Postgres Server's certificate. It looks the openssl 3.0.14 implements in this way. In other words, it the OCSP responder's certificate is signed by a different branch of the chain, then openssl will report some errors. Unless the end user explicitly provides the OCSP responder's certificate with trust_other option. In this case it will skip the some certificate verification. I think it should be simple enough for v1 by set `OCSP_basic_verify(basic_resp, NULL, trusted_store, 0)`. The key function OCSP_basic_verify is documented at here, https://docs.openssl.org/3.0/man3/OCSP_resp_find_status/ Thank you, DavidFrom 368c77885d7925334e8dabcce655b6a82f0a9a8f Mon Sep 17 00:00:00 2001 From: David Zhang Date: Tue, 6 Aug 2024 15:38:14 -0700 Subject: [PATCH 2/2] v1 WIP OCSP simplify .res generation and regress test --- src/test/ssl/conf/ocsp_ca.config | 39 ++ src/test/ssl/sslfiles.mk | 59 +-- src/test/ssl/t/001_ssltests.pl| 100 ++ src/test/ssl/t/SSL/Backend/OpenSSL.pm | 3 + src/test/ssl/t/SSL/Server.pm | 4 ++ 5 files changed, 201 insertions(+), 4 deletions(-) create mode 100644 src/test/ssl/conf/ocsp_ca.config diff --git a/src/test/ssl/conf/ocsp_ca.config b/src/test/ssl/conf/ocsp_ca.config new file mode 100644 index 00..04f78bacb8 --- /dev/null +++ b/src/test/ssl/conf/ocsp_ca.config @@ -0,0 +1,39 @@ +# An OpenSSL format CSR config file for creating the ocsp responder certificate. +# This configuration file is also used when operating the CA. +# +# This certificate is used to sign OCSP resonpose. + +[ req ] +distinguished_name = req_distinguished_name +prompt = no +req_extensions = v3_ocsp + +[ req_distinguished_name ] +CN = Test CA for PostgreSQL SSL regression test ocsp response + +# Extensions for OCSP responder certs +[ v3_ocsp ] +basicConstraints = CA:false +keyUsage = nonRepudiation, digitalSignature, keyEncipherment +extendedKeyUsage = OCSPSigning + +[ ca ] +default_ca = ocsp + +# A shell of a CA, mostly duplicating the server CA, which is used only during +# the OCSP index generation recipes. +[ ocsp ] +dir = ./ssl/ + +# The database (or "index") is the main thing we want. +database = ./ssl/ocsp-certindex + +# Everything else should all be unused, so we specify whatever's most +# convenient. In particular there's no need to have a unique cert/key pair for +# this. 
+certificate = ./ssl/server_ca.crt +private_key = ./ssl/server_ca.key +serial = ./ssl/ocsp_ca.srl +default_md = sha256 +policy = policy_match + diff --git a/src/test/ssl/sslfiles.mk b/src/test/ssl/sslfiles.mk index 88c93ec18d..6f314b7153 100644 --- a/src/test/ssl/sslfiles.mk +++ b/src/test/ssl/sslfiles.mk @@ -35,6 +35,10 @@ SERVERS := server-cn-and-alt-names \ server-revoked CLIENTS := client client-dn client-revoked client_ext client-long \ client-revoked-utf8 +OCSPS := server-ocsp-good \ + server-ocsp-revoked \ + server-ocsp-expired \ + server-ocsp-unknown # # To add a new non-standard certificate, add it to SPECIAL_CERTS and then add @@ -64,12 +68,13 @@ COMBINATIONS := \ ssl/client+client_ca.crt \ ssl/server-cn-only+server_ca.crt -CERTIFICATES := root_ca server_ca client_ca $(SERVERS) $(CLIENTS) +CERTIFICATES := root_ca server_ca client_ca ocsp_ca $(SERVERS) $(CLIENTS) STANDARD_CERTS := $(CERTIFICATES:%=ssl/%.crt) STANDARD_KEYS := $(CERTIFICATES:%=ssl/%.key) CRLS := ssl/root.crl \ ssl/client.crl \ ssl/server.crl +OCSPRES := $(OCSPS:%=ssl/%.res) SSLFILES := \ $(STANDARD_CERTS) \ @@ -77,7 +82,8 @@ SSLFILES := \ $(SPECIAL_CERTS) \ $(SPECIAL_KEYS) \
Re: Revisiting {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On Tue, 11 Jun 2024 at 10:58, Michail Nikolaev wrote: > > Hello. > > I did the POC (1) of the method described in the previous email, and it looks > promising. > > It doesn't block the VACUUM, indexes are built about 30% faster (22 mins vs > 15 mins). That's a nice improvement. > Additional index is lightweight and does not produce any WAL. That doesn't seem to be what I see in the current patchset: https://github.com/postgres/postgres/compare/master...michail-nikolaev:postgres:new_index_concurrently_approach#diff-cc3cb8968cf833c4b8498ad2c561c786099c910515c4bf397ba853ae60aa2bf7R311 > I'll continue the more stress testing for a while. Also, I need to > restructure the commits (my path was no direct) into some meaningful and > reviewable patches. While waiting for this, here are some initial comments on the github diffs: - I notice you've added a new argument to heapam_index_build_range_scan. I think this could just as well be implemented by reading the indexInfo->ii_Concurrent field, as the values should be equivalent, right? - In heapam_index_build_range_scan, it seems like you're popping the snapshot and registering a new one while holding a tuple from heap_getnext(), thus while holding a page lock. I'm not so sure that's OK, especially when catalogs are also involved (specifically for expression indexes, where functions could potentially be updated or dropped if we re-create the visibility snapshot) - In heapam_index_build_range_scan, you pop the snapshot before the returned heaptuple is processed and passed to the index-provided callback. I think that's incorrect, as it'll change the visibility of the returned tuple before it's passed to the index's callback. I think the snapshot manipulation is best added at the end of the loop, if we add it at all in that function. - The snapshot reset interval is quite high, at 500ms. Why did you configure it that low, and didn't you make this configurable? - You seem to be using WAL in the STIR index, while it doesn't seem that relevant for the use case of auxiliary indexes that won't return any data and are only used on the primary. It would imply that the data is being sent to replicas and more data being written than strictly necessary, which to me seems wasteful. - The locking in stirinsert can probably be improved significantly if we use things like atomic operations on STIR pages. We'd need an exclusive lock only for page initialization, while share locks are enough if the page's data is modified without WAL. That should improve concurrent insert performance significantly, as it would further reduce the length of the exclusively locked hot path. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: POC, WIP: OR-clause support for indexes
On Mon, Aug 5, 2024 at 11:24 PM Alena Rybakina wrote: > Ok, thank you for your work) > > I think we can leave only the two added libraries in the first patch, > others are superfluous. Thank you. I also have fixed some grammar issues. -- Regards, Alexander Korotkov Supabase v32-0001-Transform-OR-clauses-to-SAOP-s-during-index-matc.patch Description: Binary data v32-0002-Teach-bitmap-path-generation-about-transforming-.patch Description: Binary data
Re: Fix memory counter update in reorderbuffer
On Tue, Aug 6, 2024 at 1:12 PM Amit Kapila wrote: > > On Sat, Aug 3, 2024 at 1:21 AM Masahiko Sawada wrote: > > > > I found a bug in the memory counter update in reorderbuffer. It was > > introduced by commit 5bec1d6bc5e, so pg17 and master are affected. > > > > In ReorderBufferCleanupTXN() we zero the transaction size and then > > free the transaction entry as follows: > > > > /* Update the memory counter */ > > ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size); > > > > /* deallocate */ > > ReorderBufferReturnTXN(rb, txn); > > > > Why do we need to zero the transaction size explicitly? Shouldn't it > automatically become zero after freeing all the changes? It will become zero after freeing all the changes. However, since updating the max-heap when freeing each change could bring some overhead, we freed the changes without updating the memory counter, and then zeroed it. > > > However, if the transaction entry has toast changes we free them in > > ReorderBufferToastReset() called from ReorderBufferToastReset(), > > > > Here, you mean ReorderBufferToastReset() called from > ReorderBufferReturnTXN(), right? Right. Thank you for pointing it out. > BTW, commit 5bec1d6bc5e also introduced > ReorderBufferChangeMemoryUpdate() in ReorderBufferTruncateTXN() which > is also worth considering while fixing the reported problem. It may > not have the same problem as you have reported but we can consider > whether setting txn size as zero explicitly is required or not. The reason why it introduced ReorderBufferChangeMemoryUpdate() is the same as I mentioned above. And yes, as you mentioned, it doesn't have the same problem that I reported here. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Logical Replication of sequences
On Tue, Aug 6, 2024 at 5:13 PM vignesh C wrote: > > On Mon, 5 Aug 2024 at 18:05, shveta malik wrote: > > > > On Mon, Aug 5, 2024 at 11:04 AM vignesh C wrote: > > > > > > On Wed, 31 Jul 2024 at 14:39, shveta malik wrote: > > > > > > > > On Mon, Jun 10, 2024 at 5:00 PM vignesh C wrote: > > > > > > > > > > On Mon, 10 Jun 2024 at 12:24, Amul Sul wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Jun 8, 2024 at 6:43 PM vignesh C > > > > > > wrote: > > > > > >> > > > > > >> On Wed, 5 Jun 2024 at 14:11, Amit Kapila > > > > > >> wrote: > > > > > >> [...] > > > > > >> A new catalog table, pg_subscription_seq, has been introduced for > > > > > >> mapping subscriptions to sequences. Additionally, the sequence LSN > > > > > >> (Log Sequence Number) is stored, facilitating determination of > > > > > >> sequence changes occurring before or after the returned sequence > > > > > >> state. > > > > > > > > > > > > > > > > > > Can't it be done using pg_depend? It seems a bit excessive unless > > > > > > I'm missing > > > > > > something. > > > > > > > > > > We'll require the lsn because the sequence LSN informs the user that > > > > > it has been synchronized up to the LSN in pg_subscription_seq. Since > > > > > we are not supporting incremental sync, the user will be able to > > > > > identify if he should run refresh sequences or not by checking the lsn > > > > > of the pg_subscription_seq and the lsn of the sequence(using > > > > > pg_sequence_state added) in the publisher. > > > > > > > > How the user will know from seq's lsn that he needs to run refresh. > > > > lsn indicates page_lsn and thus the sequence might advance on pub > > > > without changing lsn and thus lsn may look the same on subscriber even > > > > though a sequence-refresh is needed. Am I missing something here? > > > > > > When a sequence is synchronized to the subscriber, the page LSN of the > > > sequence from the publisher is also retrieved and stored in > > > pg_subscriber_rel as shown below: > > > --- Publisher page lsn > > > publisher=# select pg_sequence_state('seq1'); > > > pg_sequence_state > > > > > > (0/1510E38,65,1,t) > > > (1 row) > > > > > > --- Subscriber stores the publisher's page lsn for the sequence > > > subscriber=# select * from pg_subscription_rel where srrelid = 16384; > > > srsubid | srrelid | srsubstate | srsublsn > > > -+-++--- > > >16389 | 16384 | r | 0/1510E38 > > > (1 row) > > > > > > If changes are made to the sequence, such as performing many nextvals, > > > the page LSN will be updated. Currently the sequence values are > > > prefetched for SEQ_LOG_VALS 32, so the lsn will not get updated for > > > the prefetched values, once the prefetched values are consumed the lsn > > > will get updated. > > > For example: > > > --- Updated LSN on the publisher (old lsn - 0/1510E38, new lsn - > > > 0/1558CA8) > > > publisher=# select pg_sequence_state('seq1'); > > > pg_sequence_state > > > -- > > > (0/1558CA8,143,22,t) > > > (1 row) > > > > > > The user can then compare this updated value with the sequence's LSN > > > in pg_subscription_rel to determine when to re-synchronize the > > > sequence. > > > > Thanks for the details. But I was referring to the case where we are > > in between pre-fetched values on publisher (say at 25th value), while > > on subscriber we are slightly behind (say at 15th value), but page-lsn > > will be the same on both. Since the subscriber is behind, a > > sequence-refresh is needed on sub, but by looking at lsn (which is > > same), one can not say that for sure. 
Let me know if I have > > misunderstood it. > > Yes, at present, if the value is within the pre-fetched range, we > cannot distinguish it solely using the page_lsn. > This makes sense to me. > > However, the > pg_sequence_state function also provides last_value and log_cnt, which > can be used to handle these specific cases. > BTW, can we document all these steps for users to know when to refresh the sequences, if not already documented? -- With Regards, Amit Kapila.
RE: Conflict detection and logging in logical replication
Dear Hou, > > Here is the V11 patch set which addressed above and Kuroda-san[1] comments. > Thanks for updating the patch. I read 0001 again and I don't have critical comments for now. I found some cosmetic issues (e.g., lines should be shorter than 80 columns) and attached the fix patch. Feel free to include in the next version. Best regards, Hayato Kuroda FUJITSU LIMITED fixes_for_v11.patch Description: fixes_for_v11.patch
Re: Windows default locale vs initdb
On Tue, Jul 23, 2024 at 11:19 AM Thomas Munro wrote: > On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan wrote: > > I have an environment I can use for testing. But what exactly am I > > testing? :-) Install a few "problem" language/region settings, switch > > the system and ensure initdb runs ok? I thought a bit more about what to do with the messy .UTF-8 situation on Windows, and I think I might see a way forward that harmonises the code and behaviour with Unix, and deletes a lot of special case code. But it's only theories + CI so far. 0001, 0002: As before, teach initdb.exe to choose eg "en-US" by default. 0003: Force people to choose locales that match the database encoding, as we do on Unix. That is, forbid contradictory combinations like --locale="English_United States.1252" --encoding=UTF8, which are currently allowed (and the world is full of such database clusters because that is how the EDB installer GUI makes them). The only allowed combinations for American English should now be: --locale="en-US" --encoding="WIN1252", and --locale="en-US.UTF-8" --encoding="UTF8". You can still use the old names if you like, by explicitly writing --locale="English_United States.1252", but the encoding then has to be WIN1252. It's crazy to mix them up, let's ban that. Obviously there is a pg_upgrade case to worry about there. We'd have to "fix" the now illegal combinations, and I don't know exactly how yet. 0004: Rip out the code that does extra wchar_t conversations for collations. If I've understood correctly, we don't need them: if you have a .UTF-8 locale then your encoding is UTF-8 and should be able to use strcoll_l() directly. Right? 0005: Something similar was being done for strftime(). And we might as well use strftime_l() instead while we're here (part of general movement to use _l functions and stop splattering setlocale() all over the place, for the multithreaded future). These patches pass on CI. Do they give the expected results when used on a real Windows system? There are a few more places where we do wchar_t conversions that could probably be stripped out too, if my assumptions are correct, and we could dig further if the basic idea can be validated and people think this is going in a good direction. From 886815244ab43092562ae3118cd5588a2fad5bb2 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Mon, 20 Nov 2023 14:24:35 +1300 Subject: [PATCH v6 1/5] MinGW has GetLocaleInfoEx(). To use BCP 47 locale names like "en-US" without a suffix ".encoding", we need to be able to call GetLocaleInfoEx() to look up the encoding. That was previously gated for MSVC only, but MinGW has had the function for many years. Remove that gating, because otherwise our MinGW build farm animals would fail when a later commit switches to using the new names by default. There are probably other places where _MSC_VER is being used as a proxy for detecting MinGW with an out-of-date idea about missing functions. Discussion: https://postgr.es/m/CA%2BhUKGLsV3vTjPp7bOZBr3JTKp3Brkr9V0Qfmc7UvpWcmAQL4A%40mail.gmail.com --- src/port/chklocale.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/src/port/chklocale.c b/src/port/chklocale.c index 8cb81c8640..a15b0d5349 100644 --- a/src/port/chklocale.c +++ b/src/port/chklocale.c @@ -204,7 +204,6 @@ win32_langinfo(const char *ctype) char *r = NULL; char *codepage; -#if defined(_MSC_VER) uint32 cp; WCHAR wctype[LOCALE_NAME_MAX_LENGTH]; @@ -229,7 +228,6 @@ win32_langinfo(const char *ctype) } } else -#endif { /* * Locale format on Win32 is _.. 
For -- 2.39.2 From 357751c04cdd3dc7dea1ee9409356d818af70d5d Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 19 Jul 2022 06:31:17 +1200 Subject: [PATCH v6 2/5] Default to IETF BCP 47 locale names in initdb on Windows. MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Avoid selecting traditional Windows locale names written with English words, because (1) they are unstable and explicitly not recommended for use in databases and (2) they may contain non-ASCII characters, which we can't put in our shared catalogs. Since setlocale() returns such names, on Windows use GetUserDefaultLocaleName() if the user didn't provide an explicit locale. It returns BCP 47 strings like "en-US". Also update the documentation to recommend BCP 47 over the traditional names when providing explicit values to initdb. Reviewed-by: Juan José Santamaría Flecha Reviewed-by: Discussion: https://postgr.es/m/CA%2BhUKGJ%3DXThErgAQRoqfCy1bKPxXVuF0%3D2zDbB%2BSxDs59pv7Fw%40mail.gmail.com --- doc/src/sgml/charset.sgml | 13 +++-- src/bin/initdb/initdb.c | 31 +-- 2 files changed, 40 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml index 834cb30c85..adb21eb079 100644 --- a/doc/src/sgml/charset.sgml +++ b/doc/src/sgml/charset.sgml @@ -83,8 +83,17 @@ initdb --locale=sv_SE system un
Re: Logical Replication of sequences
Hi Vignesh, This is mostly a repeat of my previous mail from a while ago [1] but includes some corrections, answers, and more examples. I'm going to try to persuade one last time because the current patch is becoming stable, so I wanted to revisit this syntax proposal before it gets too late to change anything. If there is some problem with the proposed idea please let me know because I can see only the advantages and no disadvantages of doing it this way. ~~~ The current patchset offers two forms of subscription refresh: 1. ALTER SUBSCRIPTION name REFRESH PUBLICATION [ WITH ( refresh_option [= value] [, ... ] ) ] 2. ALTER SUBSCRIPTION name REFRESH PUBLICATION SEQUENCES Since 'copy_data' is the only supported refresh_option, really it is more like: 1. ALTER SUBSCRIPTION name REFRESH PUBLICATION [ WITH ( copy_data [= true|false] ) ] 2. ALTER SUBSCRIPTION name REFRESH PUBLICATION SEQUENCES ~~~ I proposed previously that instead of having 2 commands for refreshing subscriptions we should have a single refresh command: ALTER SUBSCRIPTION name REFRESH PUBLICATION [TABLES|SEQUENCES] [ WITH ( copy_data [= true|false] ) ] Why? - IMO it is less confusing than having 2 commands that both refresh sequences in slightly different ways - It is more flexible because apart from refreshing everything, a user can choose to refresh only tables or only sequences if desired; IMO more flexibility is always good. - There is no loss of functionality from the current implementation AFAICT. You can still say "ALTER SUBSCRIPTION sub REFRESH PUBLICATION SEQUENCES" exactly the same as the patchset allows. - The implementation code will become simpler. For example, the current implementation of AlterSubscription_refresh(...) includes the (hacky?) 'resync_all_sequences' parameter and has an overcomplicated relationship with other parameters as demonstrated by the assertions below. IMO using the prosed syntax means this coding will become not only simpler, but shorter too. + /* resync_all_sequences cannot be specified with refresh_tables */ + Assert(!(resync_all_sequences && refresh_tables)); + + /* resync_all_sequences cannot be specified with copy_data as false */ + Assert(!(resync_all_sequences && !copy_data)); ~~~ So, to continue this proposal, let the meaning of 'copy_data' for SEQUENCES be as follows: - when copy_data == false: it means don't copy data (i.e. don't synchronize anything). Add/remove sequences from pg_subscriber_rel as needed. - when copy_data == true: it means to copy data (i.e. synchronize) for all sequences. Add/remove sequences from pg_subscriber_rel as needed) ~~~ EXAMPLES using the proposed syntax: Refreshing TABLES only... ex1. ALTER SUBSCRIPTION sub REFRESH PUBLICATION TABLES WITH (copy_data = false) - same as PG17 functionality for "ALTER SUBSCRIPTION sub REFRESH PUBLICATION WITH (copy_data = false)" ex2. ALTER SUBSCRIPTION sub REFRESH PUBLICATION TABLES WITH (copy_data = true) - same as PG17 functionality for "ALTER SUBSCRIPTION sub REFRESH PUBLICATION WITH (copy_data = true)" ex3. (using default copy_data) ALTER SUBSCRIPTION sub REFRESH PUBLICATION TABLES - same as ex2. ~ Refreshing SEQUENCES only... ex4. ALTER SUBSCRIPTION sub REFRESH PUBLICATION SEQUENCES WITH (copy data = false) - this adds/removes only sequences to pg_subscription_rel but doesn't update the sequence values ex5. ALTER SUBSCRIPTION sub REFRESH PUBLICATION SEQUENCES WITH (copy data = true) - this adds/removes only sequences to pg_subscription_rel and also updates (synchronizes) all sequence values. 
- same functionality as "ALTER SUBSCRIPTION sub REFRESH PUBLICATION SEQUENCES" in your current patchset ex6. (using default copy_data) ALTER SUBSCRIPTION sub REFRESH PUBLICATION SEQUENCES - same as ex5. - note, that this command has the same syntax and functionality as the current patchset ~~~ When no object_type is specified it has intuitive meaning to refresh both TABLES and SEQUENCES... ex7. ALTER SUBSCRIPTION sub REFRESH PUBLICATION WITH (copy_data = false) - For tables, it is the same as the PG17 functionality - For sequences it includes the same behaviour of ex4. ex8. ALTER SUBSCRIPTION sub REFRESH PUBLICATION WITH (copy_data = true) - For tables, it is the same as the PG17 functionality - For sequences it includes the same behaviour of ex5. - There is one subtle difference from the current patchset because this proposal will synchronize *all* sequences instead of only new ones. But, this is a good thing. The current documentation is complicated by having to explain the differences between REFRESH PUBLICATION and REFRESH PUBLICATION SEQUENCES. The current patchset also raises questions like how the user chooses whether to use "REFRESH PUBLICATION SEQUENCES" versus "REFRESH PUBLICATION WITH (copy_data=true)". OTHO, the proposed syntax eliminates ambiguity. ex9. (using default copy_data) ALTER SUBSCRIPTION sub REFRESH PUBLICATION - same as ex8 == [1] https://www.postgresql.org/message-id/CAHut%2BPuFH1O
Re: Logical Replication of sequences
On Mon, Aug 5, 2024 at 10:26 AM vignesh C wrote: > > On Thu, 1 Aug 2024 at 04:25, Peter Smith wrote: > > > > Hi Vignesh, > > > > I noticed that when replicating sequences (using the latest patches > > 0730_2*) the subscriber-side checks the *existence* of the sequence, > > but apparently it is not checking other sequence attributes. > > > > For example, consider: > > > > Publisher: "CREATE SEQUENCE s1 START 1 INCREMENT 2;" should be a > > sequence of only odd numbers. > > Subscriber: "CREATE SEQUENCE s1 START 2 INCREMENT 2;" should be a > > sequence of only even numbers. > > > > Because the names match, currently the patch allows replication of the > > s1 sequence. I think that might lead to unexpected results on the > > subscriber. IMO it might be safer to report ERROR unless the sequences > > match properly (i.e. not just a name check). > > > > Below is a demonstration the problem: > > > > == > > Publisher: > > == > > > > (publisher sequence is odd numbers) > > > > test_pub=# create sequence s1 start 1 increment 2; > > CREATE SEQUENCE > > test_pub=# select * from nextval('s1'); > > nextval > > - > >1 > > (1 row) > > > > test_pub=# select * from nextval('s1'); > > nextval > > - > >3 > > (1 row) > > > > test_pub=# select * from nextval('s1'); > > nextval > > - > >5 > > (1 row) > > > > test_pub=# CREATE PUBLICATION pub1 FOR ALL SEQUENCES; > > CREATE PUBLICATION > > test_pub=# > > > > == > > Subscriber: > > == > > > > (subscriber sequence is even numbers) > > > > test_sub=# create sequence s1 start 2 increment 2; > > CREATE SEQUENCE > > test_sub=# SELECT * FROM nextval('s1'); > > nextval > > - > >2 > > (1 row) > > > > test_sub=# SELECT * FROM nextval('s1'); > > nextval > > - > >4 > > (1 row) > > > > test_sub=# SELECT * FROM nextval('s1'); > > nextval > > - > >6 > > (1 row) > > > > test_sub=# CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=test_pub' > > PUBLICATION pub1; > > 2024-08-01 08:43:04.198 AEST [24325] WARNING: subscriptions created > > by regression test cases should have names starting with "regress_" > > WARNING: subscriptions created by regression test cases should have > > names starting with "regress_" > > NOTICE: created replication slot "sub1" on publisher > > CREATE SUBSCRIPTION > > test_sub=# 2024-08-01 08:43:04.294 AEST [26240] LOG: logical > > replication apply worker for subscription "sub1" has started > > 2024-08-01 08:43:04.309 AEST [26244] LOG: logical replication > > sequence synchronization worker for subscription "sub1" has started > > 2024-08-01 08:43:04.323 AEST [26244] LOG: logical replication > > synchronization for subscription "sub1", sequence "s1" has finished > > 2024-08-01 08:43:04.323 AEST [26244] LOG: logical replication > > sequence synchronization worker for subscription "sub1" has finished > > > > (after the CREATE SUBSCRIPTION we are getting replicated odd values > > from the publisher, even though the subscriber side sequence was > > supposed to be even numbers) > > > > test_sub=# SELECT * FROM nextval('s1'); > > nextval > > - > >7 > > (1 row) > > > > test_sub=# SELECT * FROM nextval('s1'); > > nextval > > - > >9 > > (1 row) > > > > test_sub=# SELECT * FROM nextval('s1'); > > nextval > > - > > 11 > > (1 row) > > > > (Looking at the description you would expect odd values for this > > sequence to be impossible) I see that for such even sequences, user can still do 'setval' to a odd number and then nextval will keep on returning odd value. 
postgres=# SELECT nextval('s1'); 6 postgres=SELECT setval('s1', 43); 43 postgres=# SELECT nextval('s1'); 45 > > test_sub=# \dS+ s1 > > Sequence "public.s1" > > Type | Start | Minimum | Maximum | Increment | Cycles? | > > Cache > > +---+-+-+---+-+--- > > bigint | 2 | 1 | 9223372036854775807 | 2 | no | > > 1 > > Even if we check the sequence definition during the CREATE > SUBSCRIPTION/ALTER SUBSCRIPTION ... REFRESH PUBLICATION or ALTER > SUBSCRIPTION ... REFRESH PUBLICATION SEQUENCES commands, there's still > a chance that the sequence definition might change after the command > has been executed. Currently, there's no mechanism to lock a sequence, > and we also permit replication of table data even if the table > structures differ, such as mismatched data types like int and > smallint. I have modified it to log a warning to inform users that the > sequence options on the publisher and subscriber are not the same and > advise them to ensure that the sequence definitions are consistent > between both. > The v20240805 version patch attached at [1] has the changes for the same. > [1] - > https://www.postgresql.org/message-id/CALDaNm1Y_ot-jFRfmtwDuwmFrgSSYHjVuy28RspSopTtwzXy8w%40mail.gmail.com The
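To make the intent of that warning concrete, here is a rough backend-style sketch of the comparison being discussed. The helper name, the message wording, and the way the remote definition would be obtained are assumptions for illustration, not the patch's actual code; only the Form_pg_sequence field names come from pg_sequence.h.

#include "postgres.h"
#include "catalog/pg_sequence.h"

/* Hypothetical helper, not the patch's code: warn when the local and remote
 * sequence definitions disagree, instead of erroring out. */
static void
warn_on_sequence_mismatch(const char *nspname, const char *seqname,
						  Form_pg_sequence local_seq,
						  Form_pg_sequence remote_seq)
{
	if (local_seq->seqtypid != remote_seq->seqtypid ||
		local_seq->seqstart != remote_seq->seqstart ||
		local_seq->seqincrement != remote_seq->seqincrement ||
		local_seq->seqmin != remote_seq->seqmin ||
		local_seq->seqmax != remote_seq->seqmax ||
		local_seq->seqcache != remote_seq->seqcache ||
		local_seq->seqcycle != remote_seq->seqcycle)
		ereport(WARNING,
				errmsg("definition of sequence \"%s.%s\" differs between publisher and subscriber",
					   nspname, seqname),
				errhint("Ensure that the sequence definitions are consistent on both nodes."));
}

The point is simply that a difference in any of these parameters is reported but not treated as an error, matching how mismatched table column types are already tolerated by logical replication.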
Re: Conflict detection and logging in logical replication
On Tue, Aug 6, 2024 at 1:45 PM Zhijie Hou (Fujitsu) wrote: > > On Monday, August 5, 2024 6:52 PM Amit Kapila wrote: > > > > On Sun, Aug 4, 2024 at 1:22 PM Zhijie Hou (Fujitsu) > > wrote: > > > > > > On Friday, August 2, 2024 7:03 PM Amit Kapila > > wrote: > > > > > > > > > > Here is the V11 patch set which addressed above and Kuroda-san[1] > > comments. > > > > > > > A few design-level points: > > > > * > > @@ -525,10 +602,33 @@ ExecSimpleRelationInsert(ResultRelInfo > > *resultRelInfo, > > /* OK, store the tuple and create index entries for it */ > > simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot); > > > > + conflictindexes = resultRelInfo->ri_onConflictArbiterIndexes; > > + > > if (resultRelInfo->ri_NumIndices > 0) > > recheckIndexes = ExecInsertIndexTuples(resultRelInfo, > > -slot, estate, false, false, > > -NULL, NIL, false); > > +slot, estate, false, > > +conflictindexes ? true : false, > > +&conflict, > > +conflictindexes, false); > > + > > + /* > > + * Checks the conflict indexes to fetch the conflicting local tuple > > + * and reports the conflict. We perform this check here, instead of > > + * performing an additional index scan before the actual insertion and > > + * reporting the conflict if any conflicting tuples are found. This is > > + * to avoid the overhead of executing the extra scan for each INSERT > > + * operation, even when no conflict arises, which could introduce > > + * significant overhead to replication, particularly in cases where > > + * conflicts are rare. > > + * > > + * XXX OTOH, this could lead to clean-up effort for dead tuples added > > + * in heap and index in case of conflicts. But as conflicts shouldn't > > + * be a frequent thing so we preferred to save the performance overhead > > + * of extra scan before each insertion. > > + */ > > + if (conflict) > > + CheckAndReportConflict(resultRelInfo, estate, CT_INSERT_EXISTS, > > +recheckIndexes, slot); > > > > I was thinking about this case where we have some pros and cons of doing > > additional scans only after we found the conflict. I was wondering how we > > will > > handle the resolution strategy for this when we have to remote_apply the > > tuple > > for insert_exists/update_exists cases. > > We would have already inserted the remote tuple in the heap and index before > > we found the conflict which means we have to roll back that change and then > > start a forest transaction to perform remote_apply which probably has to > > update the existing tuple. We may have to perform something like speculative > > insertion and then abort it. That doesn't sound cheap either. Do you have > > any > > better ideas? > > Since most of the codes of conflict detection can be reused in the later > resolution patch. I am thinking we can go for re-scan after insertion approach > for detection patch. Then in resolution patch we can probably have a check in > the patch that if the resolver is remote_apply/last_update_win we detect > conflict before, otherwise detect it after. This way we can save an > subscription option in the detection patch because we are not introducing > overhead > for the detection. > Sounds reasonable to me. > > > > > > * > > -ERROR: duplicate key value violates unique constraint "test_pkey" > > -DETAIL: Key (c)=(1) already exists. > > +ERROR: conflict insert_exists detected on relation "public.test" > > +DETAIL: Key (c)=(1) already exists in unique index "t_pkey", which > > was modified locally in transaction 740 at 2024-06-26 10:47:04.727375+08. 
> > > > I think the format to display conflicts is not very clear. The conflict > > should be > > apparent just by seeing the LOG/ERROR message. I am thinking of something > > like below: > > > > LOG: CONFLICT: ; > > DESCRIPTION: If any .. ; RESOLUTION: (This one can be added later) > > DETAIL: remote_tuple (tuple values); local_tuple (tuple values); > > > > With the above, one can easily identify the conflict's reason and action > > taken by > > apply worker. > > Thanks for the idea! I thought about a few styles based on the suggested format, > what do you think about the following? > > --- > Version 1 > --- > LOG: CONFLICT: insert_exists; DESCRIPTION: remote INSERT violates unique > constraint "uniqueindex" on relation "public.test". > DETAIL: Existing local tuple (a, b, c) = (2, 3, 4) > xid=123,origin="pub",timestamp=xxx; remote tuple (a, b, c) = (2, 4, 5). > Won't this case be ERROR? If so, the error message format like the above appears odd to me because in some cases, the user may want to add some filter based on the error message though that is not ideal. Also, the primary error message starts with a lowercase letter and should be short. > LOG: CONFLICT: update_differ; DESCRIPTION: updating a row with key (a, b) = > (2, 4) on relation "public.test" was modified by a different source. > DETAIL: Existing local tuple (a, b, c) = (2, 3, 4) > xid=123,origin="pub",timestamp=xxx; remote tuple (a, b,
Re: Pgoutput not capturing the generated columns
On Mon, Aug 5, 2024 at 9:15 AM Peter Smith wrote: > > Hi, > > Writing many new test case combinations has exposed a possible bug in > patch 0001. > > In my previous post [1] there was questionable behaviour when > replicating from a normal (not generated) column on the publisher side > to a generated column on the subscriber side. Initially, I thought the > test might have exposed a possible PG17 bug, but now I think it has > really found a bug in patch 0001. > > ~~~ > > Previously (PG17) this would fail consistently both during COPY and > during normal replication.Now, patch 0001 has changed this behaviour > -- it is not always failing anymore. > > The patch should not be impacting this existing behaviour. It only > introduces a new 'include_generated_columns', but since the publisher > side is not a generated column I do not expect there should be any > difference in behaviour for this test case. IMO the TAP test expected > results should be corrected for this scenario. And fix the bug. > > Below is an example demonstrating PG17 behaviour. > > == > > > Publisher: > -- > > (notice column "b" is not generated) > > test_pub=# CREATE TABLE tab_nogen_to_gen (a int, b int); > CREATE TABLE > test_pub=# INSERT INTO tab_nogen_to_gen VALUES (1,101),(2,102); > INSERT 0 2 > test_pub=# CREATE PUBLICATION pub1 for TABLE tab_nogen_to_gen; > CREATE PUBLICATION > test_pub=# > > Subscriber: > --- > > (notice corresponding column "b" is generated) > > test_sub=# CREATE TABLE tab_nogen_to_gen (a int, b int GENERATED > ALWAYS AS (a * 22) STORED); > CREATE TABLE > test_sub=# > > Try to create a subscription. Notice we get the error: ERROR: logical > replication target relation "public.tab_nogen_to_gen" is missing > replicated column: "b" > > test_sub=# CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=test_pub' > PUBLICATION pub1; > 2024-08-05 13:16:40.043 AEST [20957] WARNING: subscriptions created > by regression test cases should have names starting with "regress_" > WARNING: subscriptions created by regression test cases should have > names starting with "regress_" > NOTICE: created replication slot "sub1" on publisher > CREATE SUBSCRIPTION > test_sub=# 2024-08-05 13:16:40.105 AEST [29258] LOG: logical > replication apply worker for subscription "sub1" has started > 2024-08-05 13:16:40.117 AEST [29260] LOG: logical replication table > synchronization worker for subscription "sub1", table > "tab_nogen_to_gen" has started > 2024-08-05 13:16:40.172 AEST [29260] ERROR: logical replication > target relation "public.tab_nogen_to_gen" is missing replicated > column: "b" > 2024-08-05 13:16:40.173 AEST [20039] LOG: background worker "logical > replication tablesync worker" (PID 29260) exited with exit code 1 > 2024-08-05 13:16:45.187 AEST [29400] LOG: logical replication table > synchronization worker for subscription "sub1", table > "tab_nogen_to_gen" has started > 2024-08-05 13:16:45.285 AEST [29400] ERROR: logical replication > target relation "public.tab_nogen_to_gen" is missing replicated > column: "b" > 2024-08-05 13:16:45.286 AEST [20039] LOG: background worker "logical > replication tablesync worker" (PID 29400) exited with exit code 1 > ... 
> > Create the subscription again, but this time with copy_data = false > > test_sub=# CREATE SUBSCRIPTION sub1_nocopy CONNECTION > 'dbname=test_pub' PUBLICATION pub1 WITH (copy_data = false); > 2024-08-05 13:22:57.719 AEST [20957] WARNING: subscriptions created > by regression test cases should have names starting with "regress_" > WARNING: subscriptions created by regression test cases should have > names starting with "regress_" > NOTICE: created replication slot "sub1_nocopy" on publisher > CREATE SUBSCRIPTION > test_sub=# 2024-08-05 13:22:57.765 AEST [7012] LOG: logical > replication apply worker for subscription "sub1_nocopy" has started > > test_sub=# > > ~~~ > > Then insert data from the publisher to see what happens for normal > replication. > > test_pub=# > test_pub=# INSERT INTO tab_nogen_to_gen VALUES (3,103),(4,104); > INSERT 0 2 > > ~~~ > > Notice the subscriber gets the same error as before: ERROR: logical > replication target relation "public.tab_nogen_to_gen" is missing > replicated column: "b" > > 2024-08-05 13:25:14.897 AEST [20039] LOG: background worker "logical > replication apply worker" (PID 10957) exited with exit code 1 > 2024-08-05 13:25:19.933 AEST [11095] LOG: logical replication apply > worker for subscription "sub1_nocopy" has started > 2024-08-05 13:25:19.966 AEST [11095] ERROR: logical replication > target relation "public.tab_nogen_to_gen" is missing replicated > column: "b" > 2024-08-05 13:25:19.966 AEST [11095] CONTEXT: processing remote data > for replication origin "pg_16390" during message type "INSERT" in > transaction 742, finished at 0/1967BB0 > 2024-08-05 13:25:19.968 AEST [20039] LOG: background worker "logical > replication apply worker" (PID 11095) exited with exit code 1 > 2024-08-05 13:25:24.917 AEST [
Re: Restart pg_usleep when interrupted
Hi, On Tue, Aug 06, 2024 at 12:36:44PM -0500, Sami Imseih wrote: > > > v5 takes care of the latest comments by Bertrand. > Thanks! Please find attached a rebase version (due to 39a138fbef) and in passing I changed one comment: "add t in microseconds to a instr_time" -> "add t (in microseconds) to x" Does that make sense to you? That said, it looks good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com >From 619d4fea1eada6fb65fec673bc9600f8502f9b00 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Wed, 7 Aug 2024 05:04:10 +0000 Subject: [PATCH v6] vaccum_delay with absolute time nanosleep --- src/backend/commands/vacuum.c| 47 +++- src/include/portability/instr_time.h | 10 ++ 2 files changed, 56 insertions(+), 1 deletion(-) 81.6% src/backend/commands/ 18.3% src/include/portability/ diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 48f8eab202..0ed188a083 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -116,6 +116,51 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, static double compute_parallel_delay(void); static VacOptValue get_vacoptval_from_boolean(DefElem *def); static bool vac_tid_reaped(ItemPointer itemptr, void *state); +static void vacuum_sleep(double msec); + +static +void +vacuum_sleep(double msec) +{ + long microsec = msec * 1000; + + if (microsec > 0) + { +#ifndef WIN32 + /* + * We allow nanosleep to handle interrupts and retry with the + * remaining time. However, frequent interruptions and restarts of the + * nanosleep calls can substantially lead to drift in the time when + * the sleep finally completes. To deal with this, we break out of the + * loop whenever the current time is past the expected end time of the + * sleep. + */ + struct timespec delay; + struct timespec remain; + instr_time end_time; + + INSTR_TIME_SET_CURRENT(end_time); + INSTR_TIME_ADD_MICROSEC(end_time, microsec); + + delay.tv_sec = microsec / 1000000L; + delay.tv_nsec = (microsec % 1000000L) * 1000; + + while (nanosleep(&delay, &remain) == -1 && errno == EINTR) + { + instr_time current_time; + + INSTR_TIME_SET_CURRENT(current_time); + + if (INSTR_TIME_IS_GREATER(current_time, end_time)) +break; + + delay = remain; + } +#else + SleepEx((microsec < 500 ? 1 : (microsec + 500) / 1000), FALSE); +#endif + } +} /* * GUC check function to ensure GUC value specified is within the allowable @@ -2384,7 +2429,7 @@ vacuum_delay_point(void) msec = vacuum_cost_delay * 4; pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY); - pg_usleep(msec * 1000); + vacuum_sleep(msec); pgstat_report_wait_end(); /* diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index e66ecf34cd..754ea77c43 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -36,6 +36,10 @@ * * INSTR_TIME_GET_NANOSEC(t) convert t to int64 (in nanoseconds) * + * INSTR_TIME_ADD_MICROSEC(x,t) add t (in microseconds) to x + * + * INSTR_TIME_IS_GREATER(x,y) is x greater than y? + * * Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert * absolute times to intervals. The INSTR_TIME_GET_xxx operations are * only useful on intervals.
@@ -194,4 +198,10 @@ GetTimerFrequency(void) #define INSTR_TIME_GET_MICROSEC(t) \ (INSTR_TIME_GET_NANOSEC(t) / NS_PER_US) +#define INSTR_TIME_ADD_MICROSEC(x,t) \ + ((x).ticks += t * NS_PER_US) + +#define INSTR_TIME_IS_GREATER(x,y) \ + ((x).ticks > (y).ticks) + #endif /* INSTR_TIME_H */ -- 2.34.1
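For anyone who wants to see the sleep pattern in isolation, here is a self-contained sketch of the same idea without instr_time.h: retry nanosleep() after EINTR with the remaining time, but stop once an absolute deadline computed up front has passed, so repeated interruptions cannot stretch the total sleep. The function name and the direct use of CLOCK_MONOTONIC are assumptions for the example; the patch itself uses the INSTR_TIME_* macros shown above.

#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Sleep for roughly 'microsec' microseconds, retrying after signals but
 * never past the deadline established before the first nanosleep() call. */
void
sleep_until_deadline(long microsec)
{
	struct timespec delay;
	struct timespec remain;
	struct timespec now;
	int64_t		deadline_ns;

	clock_gettime(CLOCK_MONOTONIC, &now);
	deadline_ns = (int64_t) now.tv_sec * 1000000000 + now.tv_nsec +
		(int64_t) microsec * 1000;

	delay.tv_sec = microsec / 1000000L;
	delay.tv_nsec = (microsec % 1000000L) * 1000;

	while (nanosleep(&delay, &remain) == -1 && errno == EINTR)
	{
		clock_gettime(CLOCK_MONOTONIC, &now);
		if ((int64_t) now.tv_sec * 1000000000 + now.tv_nsec >= deadline_ns)
			break;				/* already past the intended end time */
		delay = remain;			/* otherwise sleep only for the leftover time */
	}
}

int
main(void)
{
	sleep_until_deadline(100 * 1000);	/* sleep roughly 100 ms */
	return 0;
}

Computing the deadline once, before the first nanosleep() call, is what bounds the drift: each retry only sleeps for the leftover time, and the loop gives up as soon as the deadline is reached.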
Re: Fix memory counter update in reorderbuffer
On Wed, Aug 7, 2024 at 7:42 AM Masahiko Sawada wrote: > > On Tue, Aug 6, 2024 at 1:12 PM Amit Kapila wrote: > > > > On Sat, Aug 3, 2024 at 1:21 AM Masahiko Sawada > > wrote: > > > > > > I found a bug in the memory counter update in reorderbuffer. It was > > > introduced by commit 5bec1d6bc5e, so pg17 and master are affected. > > > > > > In ReorderBufferCleanupTXN() we zero the transaction size and then > > > free the transaction entry as follows: > > > > > > /* Update the memory counter */ > > > ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size); > > > > > > /* deallocate */ > > > ReorderBufferReturnTXN(rb, txn); > > > > > > > Why do we need to zero the transaction size explicitly? Shouldn't it > > automatically become zero after freeing all the changes? > > It will become zero after freeing all the changes. However, since > updating the max-heap when freeing each change could bring some > overhead, we freed the changes without updating the memory counter, > and then zerod it. > I think this should be covered in comments as it is not apparent. > > > BTW, commit 5bec1d6bc5e also introduced > > ReorderBufferChangeMemoryUpdate() in ReorderBufferTruncateTXN() which > > is also worth considering while fixing the reported problem. It may > > not have the same problem as you have reported but we can consider > > whether setting txn size as zero explicitly is required or not. > > The reason why it introduced ReorderBufferChangeMemoryUpdate() is the > same as I mentioned above. And yes, as you mentioned, it doesn't have > the same problem that I reported here. > I checked again and found that ReorderBufferResetTXN() first calls ReorderBufferTruncateTXN() and then ReorderBufferToastReset(). After that, it also tries to free spec_insert change which should have the same problem. So, what saves this path from the same problem? * + /* + * Update the memory counter of the transaction, removing it from + * the max-heap. + */ + ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size); + Assert(txn->size == 0); + pfree(txn); Just before freeing the TXN, updating the size looks odd though I understand the idea is to remove TXN from max_heap. Anyway, let's first discuss whether the same problem exists in ReorderBufferResetTXN() code path, and if so, how we want to fix it. -- With Regards, Amit Kapila.
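As a side note for readers following the accounting argument, the pattern can be illustrated with a deliberately trivial sketch (hypothetical names, not the reorderbuffer code). The per-change counter updates are skipped during the bulk free because each one would also have to rebalance the max-heap, so the total is reset once at the end; that is why an explicit memory-counter update right before freeing the TXN exists at all.

#include <stdlib.h>

/* Hypothetical, simplified stand-in for a transaction's memory accounting. */
typedef struct TxnAccounting
{
	size_t		size;			/* total bytes currently tracked */
} TxnAccounting;

/*
 * Free all changes of a transaction.  Each free() deliberately skips the
 * per-change counter update (which in the real code would also have to
 * adjust the max-heap); the counter is reset once, in a single step.
 */
static void
bulk_free_changes(TxnAccounting *txn, void **changes, int nchanges)
{
	for (int i = 0; i < nchanges; i++)
		free(changes[i]);

	txn->size = 0;				/* one accounting update for the whole batch */
}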
Remove TRACE_SORT macro?
I think we could remove the TRACE_SORT macro. The TRACE_SORT macro has guarded the availability of the trace_sort GUC setting. But it has been enabled by default ever since it was introduced in PostgreSQL 8.1, and there have been no reports that someone wanted to disable it. So I think we could just remove the macro to simplify things.From 72cdf2044056b138e7f7a4a513d34e48d7f22961 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 7 Aug 2024 08:44:08 +0200 Subject: [PATCH] Remove TRACE_SORT macro The TRACE_SORT macro guarded the availability of the trace_sort GUC setting. But it has been enabled by default ever since it was introduced in PostgreSQL 8.1, and there have been no reports that someone wanted to disable it. So just remove the macro to simplify things. --- doc/src/sgml/config.sgml | 3 -- src/backend/utils/adt/mac.c| 6 src/backend/utils/adt/network.c| 6 src/backend/utils/adt/numeric.c| 6 src/backend/utils/adt/uuid.c | 6 src/backend/utils/adt/varlena.c| 4 --- src/backend/utils/misc/guc_tables.c| 2 -- src/backend/utils/sort/tuplesort.c | 39 -- src/backend/utils/sort/tuplesortvariants.c | 14 src/include/pg_config_manual.h | 6 src/include/utils/guc.h| 3 -- 11 files changed, 95 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a1a1d58a436..2937384b001 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -11711,9 +11711,6 @@ Developer Options If on, emit information about resource usage during sort operations. -This parameter is only available if the TRACE_SORT macro -was defined when PostgreSQL was compiled. -(However, TRACE_SORT is currently defined by default.) diff --git a/src/backend/utils/adt/mac.c b/src/backend/utils/adt/mac.c index ae4caedef50..ce0fbe7a49b 100644 --- a/src/backend/utils/adt/mac.c +++ b/src/backend/utils/adt/mac.c @@ -430,13 +430,11 @@ macaddr_abbrev_abort(int memtupcount, SortSupport ssup) */ if (abbr_card > 10.0) { -#ifdef TRACE_SORT if (trace_sort) elog(LOG, "macaddr_abbrev: estimation ends at cardinality %f" " after " INT64_FORMAT " values (%d rows)", abbr_card, uss->input_count, memtupcount); -#endif uss->estimating = false; return false; } @@ -449,23 +447,19 @@ macaddr_abbrev_abort(int memtupcount, SortSupport ssup) */ if (abbr_card < uss->input_count / 2000.0 + 0.5) { -#ifdef TRACE_SORT if (trace_sort) elog(LOG, "macaddr_abbrev: aborting abbreviation at cardinality %f" " below threshold %f after " INT64_FORMAT " values (%d rows)", abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count, memtupcount); -#endif return true; } -#ifdef TRACE_SORT if (trace_sort) elog(LOG, "macaddr_abbrev: cardinality %f after " INT64_FORMAT " values (%d rows)", abbr_card, uss->input_count, memtupcount); -#endif return false; } diff --git a/src/backend/utils/adt/network.c b/src/backend/utils/adt/network.c index 640fc37dc83..450dacd031c 100644 --- a/src/backend/utils/adt/network.c +++ b/src/backend/utils/adt/network.c @@ -503,13 +503,11 @@ network_abbrev_abort(int memtupcount, SortSupport ssup) */ if (abbr_card > 10.0) { -#ifdef TRACE_SORT if (trace_sort) elog(LOG, "network_abbrev: estimation ends at cardinality %f" " after " INT64_FORMAT " values (%d rows)", abbr_card, uss->input_count, memtupcount); -#endif uss->estimating = false; return false; } @@ -522,23 +520,19 @@ network_abbrev_abort(int memtupcount, SortSupport ssup) */ if (abbr_card < uss->input_count / 2000.0 + 0.5) { -#ifdef TRACE_SORT if (trace_sort) elog(LOG, "network_abbrev: aborting abbreviation at cardinality %f" " below threshold %f after " 
INT64_FORMAT " values (%d rows)", abbr_card, uss->input_count / 2000.0 + 0.5, uss->input_count, memtupcount); -#endif return true; } -#ifdef TRACE_SORT if (trace_sort) elog(LOG,