RE: Conflict detection and logging in logical replication
On Wednesday, August 14, 2024 7:02 PM Amit Kapila wrote:
> On Wed, Aug 14, 2024 at 8:05 AM Zhijie Hou (Fujitsu) wrote:
> >
> > Here is the V14 patch.
>
> Review comments:
> 1.
> ReportApplyConflict()
> {
> ...
> + ereport(elevel,
> +         errcode(ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION),
> +         errmsg("conflict detected on relation \"%s.%s\": conflict=%s",
> +                get_namespace_name(RelationGetNamespace(localrel)),
> ...
>
> Is it a good idea to use ERRCODE_INTEGRITY_CONSTRAINT_VIOLATION for all
> conflicts? I think it is okay to use for insert_exists and update_exists.
> The other error codes to consider for conflicts other than insert_exists
> and update_exists are ERRCODE_T_R_SERIALIZATION_FAILURE,
> ERRCODE_CARDINALITY_VIOLATION, ERRCODE_NO_DATA, ERRCODE_NO_DATA_FOUND,
> ERRCODE_TRIGGERED_DATA_CHANGE_VIOLATION,
> ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE.
>
> BTW, even for insert/update_exists, won't it be better to use
> ERRCODE_UNIQUE_VIOLATION?

Agreed. I changed the patch to use ERRCODE_UNIQUE_VIOLATION for insert_exists
and update_exists, and ERRCODE_T_R_SERIALIZATION_FAILURE for the other
conflicts.

> Apart from the above, the attached contains some cosmetic changes.

Thanks. I have checked and merged the changes. Here is the V15 patch, which
addresses the above comments.

Best Regards,
Hou zj

Attachment: v15-0001-Detect-and-log-conflicts-in-logical-replication.patch
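For illustration, the errcode mapping agreed above could be as simple as the
following sketch. This is only an example: ConflictType and the CT_* names
are assumed here and may not match the identifiers actually used in the
patch.

    /* Sketch only: map a conflict type to the agreed SQLSTATE. */
    static int
    errcode_for_apply_conflict(ConflictType type)
    {
        switch (type)
        {
            case CT_INSERT_EXISTS:
            case CT_UPDATE_EXISTS:
                return ERRCODE_UNIQUE_VIOLATION;
            default:
                /* all other conflict types (e.g. missing-tuple cases) */
                return ERRCODE_T_R_SERIALIZATION_FAILURE;
        }
    }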
RE: Conflict detection and logging in logical replication
On Wednesday, August 14, 2024 10:15 PM Michail Nikolaev wrote: > > This is as expected, and we have documented this in the code comments. We > > don't > > need to report a conflict if the conflicting tuple has been removed or > > updated > > due to concurrent transaction. The same is true if the transaction that > > inserted the conflicting tuple is rolled back before > > CheckAndReportConflict(). > > We don't consider such cases as a conflict. > > That seems a little bit strange to me. > > From the perspective of a user, I expect that if a change from publisher is > not > applied - I need to know about it from the logs. I think this is exactly the current behavior in the patch. In the race condition we discussed, the insert will be applied if the conflicting tuple is removed concurrently before CheckAndReportConflict(). > But in that case, I will not see any information about conflict in the logs > in SOME cases. But in OTHER cases I will see it. However, in both cases the > change from publisher was not applied. And these cases are just random and > depend on the timing of race conditions. It is not something I am expecting > from the database. I think you might misunderstand the behavior of CheckAndReportConflict(), even if it found a conflict, it still inserts the tuple into the index which means the change is anyway applied. Best Regards, Hou zj
Re: ECPG cleanup and fix for clang compile-time problem
On 15.08.24 02:43, Tom Lane wrote: I wrote: Here's a rebased but otherwise identical patchset. I also added an 0007 that removes check_rules.pl as threatened. I've done some more work on this and hope to post an updated patchset tomorrow. Before that though, is there any objection to going ahead with pushing the 0001 patch (pgindent'ing ecpg's lexer and parser files)? It's pretty bulky yet of no intellectual interest, so I'd like to stop carrying it forward. The indentation patch looks good to me and it would be good to get it out of the way.
Re: Improve error message for ICU libraries if pkg-config is absent
Hi, On Wed, Aug 14, 2024 at 06:05:19PM -0700, Jeff Davis wrote: > That looks good to me. Does anyone have a different opinion? If not, > I'll go ahead and commit (but not backport) this change. What is the rationale not to backpatch this? The error message changes, but we do not translate configure output, so is this a problem/project policy to never change configure output in the back-branches? If the change is too invasive, would something like the initial patch I suggested (i.e., in the absense of pkg-config, add a more targetted error message) be acceptable for the back-branches? Michael
CREATE SUBSCRIPTION - add missing test case
Hi Hackers,

While reviewing another logical replication thread [1], I found an ERROR
scenario that seems to be untested.

TEST CASE: Attempt CREATE SUBSCRIPTION where the subscriber table is missing
some expected column(s).

Attached is a patch to add the missing test for this error message.

==
[1] https://www.postgresql.org/message-id/CAHut%2BPt5FqV7J9GnnWFRNW_Z1KOMMAZXNTRcRNdtFrfMBz_GLA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment: v1-0001-Add-missing-test-case.patch
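For readers following along, the scenario being exercised is roughly the
following sketch (table, publication and subscription names are made up; the
error itself is raised when the worker starts replicating the table, not by
the CREATE SUBSCRIPTION command as such):

    -- publisher
    CREATE TABLE tab1 (a int, b int);
    CREATE PUBLICATION pub1 FOR TABLE tab1;

    -- subscriber: deliberately missing the replicated column "b"
    CREATE TABLE tab1 (a int);
    CREATE SUBSCRIPTION sub1 CONNECTION 'dbname=postgres port=...' PUBLICATION pub1;
    -- replication of tab1 then fails because column "b" does not exist here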
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
On Thu, 15 Aug 2024 at 01:19, Tom Lane wrote: > +1 for the basic idea. That's great to hear. > Not sure about your 3 & 4 points though, especially if that'd > mean some delay in sending the mail. For one thing, wouldn't those > links be stale as soon as the initial patch got replaced? I recently made those links predictable based on only the ID of the CF entry. So they won't get stale, and I would want to send them straight away (even if that means that they might return a 404 until cfbot actually pushes the patches to github). The links would be: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf/$CFENTRYID https://github.com/postgresql-cfbot/postgresql/compare/cf/$CFENTRYID~1...cf/$CFENTRYID > Those links seem like they ought to live in the commitfest entry. Agreed. I'll start with that, since that should be very easy.
Re: Pgoutput not capturing the generated columns
On Tue, Jul 23, 2024 at 9:23 AM Peter Smith wrote: > > On Fri, Jul 19, 2024 at 4:01 PM Shlok Kyal wrote: > > > > On Thu, 18 Jul 2024 at 13:55, Peter Smith wrote: > > > > > > Hi, here are some review comments for v19-0002 > > > == > > > src/test/subscription/t/004_sync.pl > > > > > > 1. > > > This new test is not related to generated columns. IIRC, this is just > > > some test that we discovered missing during review of this thread. As > > > such, I think this change can be posted/patched separately from this > > > thread. > > > > > I have removed the test for this thread. > > > > I have also addressed the remaining comments for v19-0002 patch. > > Hi, I have no more review comments for patch v20-0002 at this time. > > I saw that the above test was removed from this thread as suggested, > but I could not find that any new thread was started to propose this > valuable missing test. > I still did not find any new thread for adding the missing test case, so I started one myself [1]. == [1] https://www.postgresql.org/message-id/CAHut+PtX8P0EGhsk9p=hqguhrzxecszanxsmkovyilx-ejd...@mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: Enable data checksums by default
On Wed, Aug 7, 2024 at 4:18 PM Greg Sabino Mullane wrote: > > On Wed, Aug 7, 2024 at 4:43 AM Michael Banck wrote: >> >> I think the last time we dicussed this the consensus was that >> computational overhead of computing the checksums is pretty small for >> most systems (so the above change seems warranted regardless of whether >> we switch the default), but turning on wal_compression also turns on >> wal_log_hints, which can increase WAL by quite a lot. Maybe this is [..] > > > Yeah, that seems something beyond this patch? Certainly we should mention > wal_compression in the release notes if the default changes. I mean, I feel > wal_log_hints should probably default to on as well, but I've honestly never > really given it much thought because my fingers are trained to type "initdb > -k". I've been using data checksums for roughly a decade now. I think the > only time I've NOT used checksums was when I was doing checksum overhead > measurements, or hacking on the pg_checksums program. Maybe I don't understand something, but just to be clear: wal_compression (mentioned above) is not turning wal_log_hints on, just the wal_log_hints needs to be on when using data checksums (implicitly, by the XLogHintBitIsNeeded() macro). I suppose Michael was thinking about the wal_log_hints earlier (?) -J.
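For reference, the macro mentioned above boils down to roughly the following
(paraphrased from the backend headers), which is why enabling data checksums
implies hint-bit WAL logging even when wal_log_hints itself is off:

    /* hint-bit changes must be WAL-logged if either of these is true */
    #define XLogHintBitIsNeeded() (DataChecksumsEnabled() || wal_log_hints)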
Re: Enable data checksums by default
Hi Greg and others On Tue, Aug 13, 2024 at 4:42 PM Greg Sabino Mullane wrote: > > On Thu, Aug 8, 2024 at 6:11 AM Peter Eisentraut wrote: > >> >> My understanding was that the reason for some hesitation about adopting data >> checksums was the performance impact. Not the checksumming itself, but the >> overhead from hint bit logging. The last time I looked into that, you could >> get performance impacts on the order of 5% tps. Maybe that's acceptable, >> and you of course can turn it off if you want the extra performance. But I >> think this should be discussed in this thread. > > > Fair enough. I think the performance impact is acceptable, as evidenced by > the large number of people that turn it on. And it is easy enough to turn it > off again, either via --no-data-checksums or pg_checksums --disable. I've > come across people who have regretted not throwing a -k into their initial > initdb, but have not yet come across someone who has the opposite regret. > When I did some measurements some time ago, I found numbers much less than > 5%, but of course it depends on a lot of factors. Same here, and +1 to data_checksums=on by default for new installations. The best public measurement of the impact was posted in [1] in 2019 by Tomas to the best of my knowledge, where he explicitly mentioned the problem with more WAL with hints/checksums: SATA disks (low IOPS). My take: now we have 2024, and most people are using at least SSDs or slow-SATA (but in cloud they could just change the class of I/O if required to get IOPS to avoid too much throttling), therefore the price of IOPS dropped significantly. >> About the claim that it's already the de-facto standard. Maybe that is >> approximately true for "serious" installations. But AFAICT, the popular >> packagings don't enable checksums by default, so there is likely a >> significant middle tier between "just trying it out" and serious >> production use that don't have it turned on. > > > I would push back on that "significant" a good bit. The number of Postgres > installations in the cloud is very likely to dwarf the total package > installations. Maybe not 10 years ago, but now? Maybe someone from Amazon can > share some numbers. Not that we have any way to compare against package > installs :) But anecdotally the number of people who mention RDS etc. on the > various fora has exploded. Same here. If it helps the case the: 43% of all PostgreSQL DBs involved in any support case or incident in EDB within last year had data_checksums=on (at least if they had collected the data using our ) . That's a surprisingly high number (for something that's off by default), and it makes me think this is because plenty of customers are either managed by DBAs who care, or assisted by consultants when deploying, or simply using TPAexec [2] which has this on by default. Another thing is plenty of people run with wal_log_hints=on (without data_checksums=off) just to have pg_rewind working. As this is a strictly standby related tool it means they don't have WAL/network bandwidth problems, so the WAL rate is not that high in the wild to cause problems. I found 1 or 2 cases within last year where we would mention that high WAL generation was attributed to wal_log_hints=on/XLOG_FPI and they still didn't disable it apparently (we have plenty of cases related to too much WAL, but it's mostly due to other basic reasons) -J. [1] - https://www.postgresql.org/message-id/20190330192543.GH4719%40development [2] - https://www.enterprisedb.com/docs/pgd/4/deployments/tpaexec/
Re: format_datum debugging function
On 14.08.24 17:46, Paul Jungwirth wrote: On 8/14/24 02:16, Peter Eisentraut wrote: On 12.08.24 23:15, Paul Jungwirth wrote: On 8/12/24 04:32, Aleksander Alekseev wrote: [...] This function takes a Datum and the appropriate out function, and returns a char *. So you can do this: (gdb) call format_datum(range_out, $1) $2 = 0x59162692d938 "[1,4)" I assume a patch like this doesn't need documentation. Does it need a test? Anything else? I think you forgot to attach the patch. Or is it just a proposal? Sorry, patch attached here. I don't think it's safe to call output functions at arbitrary points from a debugger. But if you're okay with that during development, say, then I think you could just call OidOutputFunctionCall(F_RANGE_OUT, $1)? I assumed it wasn't safe everywhere (e.g. there is potentially a TOAST lookup), but for debugging a patch that's okay with me. Are you doing something to get macro expansion? I've never gotten my gdb to see #defines, although in theory this configure line should do it, right?: Oh I see, you don't have the F_* constants available then. Maybe just put in the OID manually then?
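Expanding on the suggestion to put in the OID manually: if the F_* constants
are not visible to the debugger, one workaround is to look the OID up once in
SQL and pass it literally to OidOutputFunctionCall(). A sketch (the OID value
is a placeholder to be filled in from the query):

    -- in psql: find the OID of the type's output function
    SELECT oid FROM pg_proc WHERE proname = 'range_out';

    (gdb) call OidOutputFunctionCall(<oid from the query above>, $1)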
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
On Thu, 15 Aug 2024 at 05:17, Euler Taveira wrote: > +1. Regarding the CI link, I would be good if the CF entry automatically adds > a > link to the CI run. It can be a separate field or even add it to "Links". I'm on it. I think this email should be a subset of the info on the CF entry webpage, so I'll first change the cf entry page to include all this info. > I'm not sure about 4, you can always check the latest patch in the CF entry > (it > is usually the "latest attachment") and that's what the cfbot uses to run. This is definitely a personal preference thing, but I like reading patches on GitHub much better than looking at raw patch files. It has syntax highlighting and has those little arrow buttons at the top of a diff, to show more context about the file. I realized a 5th thing that I would want in the email and cf entry page 5. A copy-pastable set of git command that checks out the patch by downloading it from the cfbot repo like this: git config branch.cf/5107.remote https://github.com/postgresql-cfbot/postgresql.git git config branch.cf/5107.merge refs/heads/cf/5107 git checkout -b cf/5107 git pull > If I understand your proposal correctly, there will be another email to the > thread if the previous CF was closed and someone opened a new CF entry. > Sometimes some CF entries are about the same thread. Yeah, that's correct. If a new CF entry is created for an existing thread a new email would be sent. But to be clear, if CF entry is pushed to the next commitfest, **no** new email would be sent.
Re: Enable data checksums by default
On Thu, Aug 15, 2024 at 09:49:04AM +0200, Jakub Wartak wrote: > On Wed, Aug 7, 2024 at 4:18 PM Greg Sabino Mullane wrote: > > > > On Wed, Aug 7, 2024 at 4:43 AM Michael Banck wrote: > >> > >> I think the last time we dicussed this the consensus was that > >> computational overhead of computing the checksums is pretty small for > >> most systems (so the above change seems warranted regardless of whether > >> we switch the default), but turning on wal_compression also turns on > >> wal_log_hints, which can increase WAL by quite a lot. Maybe this is > [..] > > > > > > Yeah, that seems something beyond this patch? Certainly we should > > mention wal_compression in the release notes if the default changes. > > I mean, I feel wal_log_hints should probably default to on as well, > > but I've honestly never really given it much thought because my > > fingers are trained to type "initdb -k". I've been using data > > checksums for roughly a decade now. I think the only time I've NOT > > used checksums was when I was doing checksum overhead measurements, > > or hacking on the pg_checksums program. > > Maybe I don't understand something, but just to be clear: > wal_compression (mentioned above) is not turning wal_log_hints on, > just the wal_log_hints needs to be on when using data checksums > (implicitly, by the XLogHintBitIsNeeded() macro). I suppose Michael > was thinking about the wal_log_hints earlier (?) Uh, I am pretty sure I meant to say "turning on data_checksums als turns on wal_log_hints", sorry about the confusion. I guess the connection is that if you turn on wal_lot_hints (either directly or via data_checksums) then the number FPIs goes up (possibly signficantly), and enabling wal_compression could (partly) remedy that. But I agree with Greg that such a discussion is probably out-of-scope for this default change. Michael
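For anyone wanting to experiment with the trade-off being discussed here, the
relevant knobs can be checked and toggled like this (pg_checksums requires
the cluster to be cleanly shut down):

    # does this cluster have data checksums?
    psql -Atc "SHOW data_checksums"

    # enable/disable them offline on an existing cluster
    pg_checksums --enable -D /path/to/datadir
    pg_checksums --disable -D /path/to/datadir

    # or decide at initdb time
    initdb -k -D /path/to/newdir        # -k / --data-checksums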
Re: Thread-safe nl_langinfo() and localeconv()
Here's a new patch to add to this pile, this time for check_locale(). I also improved the error handling and comments in the other patches. From 4831ff4373b9d713c78e303d9758de347aadfc2f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 13 Aug 2024 14:15:54 +1200 Subject: [PATCH v3 1/3] Provide thread-safe pg_localeconv_r(). This involves four different implementation strategies: 1. For Windows, we now require _configthreadlocale() to be available and work, and the documentation says that the object returned by localeconv() is in thread-local memory. 2. For glibc, we translate to nl_langinfo_l() calls, because it offers the same information that way as an extension, and that API is thread-safe. 3. For macOS/*BSD, use localeconv_l(), which is thread-safe. 4. For everything else, use uselocale() to set the locale for the thread, and use a big ugly lock to defend against the returned object being concurrently clobbered. In practice this currently means only Solaris. The new call is used in pg_locale.c, replacing calls to setlocale() and localeconv(). This patch adds a hard requirement on Windows' _configthreadlocale(). In the past there were said to be MinGW systems that didn't have it, or had it but it didn't work. As far as I can tell, CI (optional MinGW task + mingw cross build warning task) and build farm (fairywren msys) should be happy with this. Fingers crossed. (There are places that use configure checks for that in ECPG; other proposed patches would remove those later.) Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/CA%2BhUKGJqVe0%2BPv9dvC9dSums_PXxGo9SWcxYAMBguWJUGbWz-A%40mail.gmail.com --- configure | 2 +- configure.ac | 1 + meson.build | 1 + src/backend/utils/adt/pg_locale.c | 128 ++- src/include/pg_config.h.in| 3 + src/include/port.h| 6 + src/port/Makefile | 1 + src/port/meson.build | 1 + src/port/pg_localeconv_r.c| 367 ++ 9 files changed, 402 insertions(+), 108 deletions(-) create mode 100644 src/port/pg_localeconv_r.c diff --git a/configure b/configure index 2abbeb27944..60dcf1e436e 100755 --- a/configure +++ b/configure @@ -15232,7 +15232,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l +for ac_func in backtrace_symbols copyfile copy_file_range getifaddrs getpeerucred inet_pton localeconv_l kqueue mbstowcs_l memset_s posix_fallocate ppoll pthread_is_threaded_np setproctitle setproctitle_fast strchrnul strsignal syncfs sync_file_range uselocale wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.ac b/configure.ac index c46ed2c591a..59e51a74629 100644 --- a/configure.ac +++ b/configure.ac @@ -1735,6 +1735,7 @@ AC_CHECK_FUNCS(m4_normalize([ getifaddrs getpeerucred inet_pton + localeconv_l kqueue mbstowcs_l memset_s diff --git a/meson.build b/meson.build index cd711c6d018..028a14547aa 100644 --- a/meson.build +++ b/meson.build @@ -2675,6 +2675,7 @@ func_checks = [ ['inet_aton'], ['inet_pton'], ['kqueue'], + ['localeconv_l'], ['mbstowcs_l'], ['memset_s'], ['mkdtemp'], diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c index cd3661e7279..dd4ba9e0e89 100644 --- a/src/backend/utils/adt/pg_locale.c +++ 
b/src/backend/utils/adt/pg_locale.c @@ -543,12 +543,8 @@ PGLC_localeconv(void) static struct lconv CurrentLocaleConv; static bool CurrentLocaleConvAllocated = false; struct lconv *extlconv; - struct lconv worklconv; - char *save_lc_monetary; - char *save_lc_numeric; -#ifdef WIN32 - char *save_lc_ctype; -#endif + struct lconv tmp; + struct lconv worklconv = {0}; /* Did we do it already? */ if (CurrentLocaleConvValid) @@ -562,77 +558,21 @@ PGLC_localeconv(void) } /* - * This is tricky because we really don't want to risk throwing error - * while the locale is set to other than our usual settings. Therefore, - * the process is: collect the usual settings, set locale to special - * setting, copy relevant data into worklconv using strdup(), restore - * normal settings, convert data to desired encoding, and finally stash - * the collected data in CurrentLocaleConv. This makes it safe if we - * throw an error during encoding conversion or run out of memory anywhere - * in the process. All data pointed to by struct lconv members is - * allocated with strdup, to avoid premature elog(ERROR) and to allow - * using a single cleanup routine. + * Use thread-safe method of obtainin
RE: [BUG?] check_exclusion_or_unique_constraint false negative
On Monday, August 12, 2024 7:11 PM Michail Nikolaev wrote: > > In my test, if the tuple is updated and new tuple is in the same page, > > heapam_index_fetch_tuple should find the new tuple using HOT chain. So, > > it's a > > bit unclear to me how the updated tuple is missing. Maybe I missed some > > other > > conditions for this issue. > > Yeah, I think the pgbench-based reproducer may also cause page splits in > btree. > But we may add an index to the table to disable HOT. > > I have attached a reproducer for this case using a spec and injection points. > > I hope it helps, check the attached files. Thanks a lot for the steps! I successfully reproduced the issue you mentioned in the context of logical replication[1]. As you said, it could increase the possibility of tuple missing when applying updates or deletes in the logical apply worker. I think this is a long-standing issue and I will investigate the fix you proposed. In addition, I think the bug is not a blocker for the conflict detection feature. As the feature simply reports the current behavior of the logical apply worker (either unique violation or tuple missing) without introducing any new functionality. Furthermore, I think that the new ExecCheckIndexConstraints call after ExecInsertIndexTuples() is not affected by the dirty snapshot bug. This is because a tuple has already been inserted into the btree before the dirty snapshot scan, which means that a concurrent non-HOT update would not be possible (it would be blocked after finding the just inserted tuple and wait for the apply worker to commit the current transaction). It would be good if others could also share their opinion on this. [1] The steps to reproduce the tuple missing in logical replication. 1. setup pub/sub env, and publish a table with 1 row. pub: CREATE TABLE t(a int primary key, b int); INSERT INTO t VALUES(1,1); CREATE PUBLICATION pub FOR TABLE t; sub: CREATE TABLE t (a int primary key, b int check (b < 5)); CREATE INDEX t_b_idx ON t(b); CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=$port_publisher' PUBLICATION pub; 2. Execute an UPDATE(UPDATE t set b = b + 1) on the publisher and use gdb to stop the apply worker at the point after index_getnext_tid() and before index_fetch_heap(). 3. execute a concurrent update(UPDATE t set b = b + 100) on the subscriber to update a non-key column value and commit the update. 4. release the apply worker and it would report the update_missing conflict. Best Regards, Hou zj
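To make step 2 above concrete, the gdb part could look roughly like this (the
PID is that of the subscriber's logical replication apply worker;
index_fetch_heap() is the call that follows index_getnext_tid() in the scan
loop):

    $ gdb -p <apply worker PID>
    (gdb) break index_fetch_heap
    (gdb) continue
    # now run the UPDATE on the publisher; when the breakpoint fires, the
    # apply worker is stopped after index_getnext_tid() but before the heap
    # fetch, which is the window needed for step 3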
Re: Remaining dependency on setlocale()
On Thu, Aug 15, 2024 at 11:00 AM Jeff Davis wrote: > On Thu, 2024-08-15 at 10:43 +1200, Thomas Munro wrote: > > So I think the solution could perhaps be something like: in some > > early > > startup phase before there are any threads, we nail down all the > > locale categories to "C" (or whatever we decide on for the permanent > > global locale), and also query the "" categories and make a copy of > > them in case anyone wants them later, and then never call setlocale() > > again. > > +1. We currently nail down these categories: /* We keep these set to "C" always. See pg_locale.c for explanation. */ init_locale("LC_MONETARY", LC_MONETARY, "C"); init_locale("LC_NUMERIC", LC_NUMERIC, "C"); init_locale("LC_TIME", LC_TIME, "C"); CF #5170 has patches to make it so that we stop changing them even transiently, using locale_t interfaces to feed our caches of stuff needed to work with those categories, so they really stay truly nailed down. It sounds like someone needs to investigate doing the same thing for these two, from CheckMyDatabase(): if (pg_perm_setlocale(LC_COLLATE, collate) == NULL) ereport(FATAL, (errmsg("database locale is incompatible with operating system"), errdetail("The database was initialized with LC_COLLATE \"%s\", " " which is not recognized by setlocale().", collate), errhint("Recreate the database with another locale or install the missing locale."))); if (pg_perm_setlocale(LC_CTYPE, ctype) == NULL) ereport(FATAL, (errmsg("database locale is incompatible with operating system"), errdetail("The database was initialized with LC_CTYPE \"%s\", " " which is not recognized by setlocale().", ctype), errhint("Recreate the database with another locale or install the missing locale."))); How should that work? Maybe we could imagine something like MyDatabaseLocale, a locale_t with LC_COLLATE and LC_CTYPE categories set appropriately. Or should it be a pg_locale_t instead (if your database default provider is ICU, then you don't even need a locale_t, right?). Then I think there is one quite gnarly category, from assign_locale_messages() (a GUC assignment function): (void) pg_perm_setlocale(LC_MESSAGES, newval); I have never really studied gettext(), but I know it was just standardised in POSIX 2024, and the standardised interface has _l() variants of all functions. Current implementations don't have them yet. Clearly we absolutely couldn't call pg_perm_setlocale() after early startup -- but if gettext() is relying on the current locale to affect far away code, then maybe this is one place where we'd just have to use uselocale(). Perhaps we could plan some transitional strategy where NetBSD users lose the ability to change the GUC without restarting the server and it has to be the same for all sessions, or something like that, until they produce either gettext_l() or uselocale(), but I haven't thought hard about this part at all yet...
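To sketch the MyDatabaseLocale idea in code (purely illustrative; the
variable name, error handling and placement are assumptions, not existing
code), the database default could be pinned once with newlocale() instead of
pg_perm_setlocale():

    /* Illustrative sketch only -- not existing PostgreSQL code. */
    static locale_t MyDatabaseLocale;

    static void
    init_database_locale(const char *collate, const char *ctype)
    {
        locale_t    loc;

        loc = newlocale(LC_COLLATE_MASK, collate, (locale_t) 0);
        if (loc == (locale_t) 0)
            ereport(FATAL,
                    (errmsg("could not create locale \"%s\"", collate)));
        loc = newlocale(LC_CTYPE_MASK, ctype, loc);
        if (loc == (locale_t) 0)
            ereport(FATAL,
                    (errmsg("could not create locale \"%s\"", ctype)));
        MyDatabaseLocale = loc;
    }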
Re: On non-Windows, hard depend on uselocale(3)
On Wed, Aug 14, 2024 at 11:17 AM Tristan Partin wrote:
> Thanks for picking this up. I think your patch looks really good.

Thanks for looking!

> Are you familiar with gcc's function poisoning?
>
> #include <stdio.h>
> #pragma GCC poison puts
>
> int main(){
> #pragma GCC bless begin puts
>     puts("a");
> #pragma GCC bless end puts
> }
>
> I wonder if we could use function poisoning to our advantage. For
> instance in ecpg, it looks like you got all of the strtod() invocations
> and replaced them with strtod_l(). Here is a patch with an example of
> what I'm talking about.

Thanks, this looks very useful.
Re: Vacuum statistics
On 15.8.24 11:49, Alena Rybakina wrote: I have some suggestions: 1. pgstatfuncs.c in functions tuplestore_put_for_database() and tuplestore_put_for_relation you can remove 'nulls' array if you're sure that columns cannot be NULL. We need to use this for tuplestore_putvalues function. With this function, we fill the table with the values of the statistics. Ah, right! I'm sorry. 1. 2. These functions are almost the same and I would think of writing one function depending of type 'ExtVacReportType' I'm not sure that I fully understand what you mean. Can you explain it more clearly, please? Ah, I didn't notice that the size of all three tables is different. Therefore, it won't be possible to write one function instead of two to avoid code duplication. My mistake.
Re: Optimize mul_var() for var1ndigits >= 8
On Wed, 14 Aug 2024 at 07:31, Joel Jacobson wrote: > > I think this is acceptable, since it produces more correct results. > Thanks for checking. I did a bit more testing myself and didn't see any problems, so I have committed both these patches. > In addition, I've traced the rscale_adjustment -15 mul_var() calls to > originate > from numeric_exp() and numeric_power(), so I thought it would be good to > brute-force test those as well, to get an idea of the probability of different > results from those functions. > > Brute-force testing of course doesn't prove it's impossible to happen, > but millions of inputs didn't cause any observable differences in the > returned results, so I think it's at least very improbable to > happen in practice. > Indeed, there certainly will be cases where the result changes. I saw some with ln(), for which HEAD rounded the final digit the wrong way, and the result is now correct, but the opposite cannot be ruled out either, since these functions are inherently inexact. The aim is to have them generate the correctly rounded result in the vast majority of cases, while accepting an occasional off-by-one error in the final digit. Having them generate the correct result in all cases is certainly possible, but would require a fair bit of additional code that probably isn't worth the effort. In my testing, exp() rounded the final digit incorrectly with a probability of roughly 1 in 50-100 million when computing results with a handful of digits (consistent with the "+8" digits added to "sig_digits"), rising to roughly 1 in 5-10 million when computing around 1000 digits (presumably because we don't take into account the number of Taylor series terms when deciding on the local rscale). That wasn't affected significantly by the patch, and it's not surprising that you saw nothing with brute-force testing. In any case, I'm pretty content with those results. Regards, Dean
Re: [Patch] remove duplicated smgrclose
On Wed, Aug 14, 2024 at 2:35 PM Steven Niu wrote: > > Junwang, Kirill, > > The split work has been done. I created a new patch for removing redundant > smgrclose() function as attached. > Please help review it. Patch looks good, actually you can make the refactoring code as v3-0002-xxx by using: git format-patch -2 -v 3 Another kind reminder: Do not top-post when you reply ;) > > Thanks, > Steven > > Steven Niu 于2024年8月12日周一 18:11写道: >> >> Kirill, >> >> Good catch! >> I will split the patch into two to cover both cases. >> >> Thanks, >> Steven >> >> >> Junwang Zhao 于2024年8月9日周五 18:19写道: >>> >>> On Fri, Aug 9, 2024 at 5:20 PM Kirill Reshke wrote: >>> > >>> > On Thu, 1 Aug 2024 at 17:32, Junwang Zhao wrote: >>> > > >>> > > Hi Steven, >>> > > >>> > > On Wed, Jul 31, 2024 at 11:16 AM Steven Niu wrote: >>> > > > >>> > > > Hello, hackers, >>> > > > >>> > > > I think there may be some duplicated codes. >>> > > > Function smgrDoPendingDeletes() calls both smgrdounlinkall() and >>> > > > smgrclose(). >>> > > > But both functions would close SMgrRelation object, it's dupliacted >>> > > > behavior? >>> > > > >>> > > > So I make this patch. Could someone take a look at it? >>> > > > >>> > > > Thanks for your help, >>> > > > Steven >>> > > > >>> > > > From Highgo.com >>> > > > >>> > > > >>> > > You change LGTM, but the patch seems not to be applied to HEAD, >>> > > I generate the attached v2 using `git format` with some commit message. >>> > > >>> > > -- >>> > > Regards >>> > > Junwang Zhao >>> > >>> > Hi all! >>> > This change looks good to me. However, i have an objection to these >>> > lines from v2: >>> > >>> > > /* Close the forks at smgr level */ >>> > > - for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) >>> > > - smgrsw[which].smgr_close(rels[i], forknum); >>> > > + smgrclose(rels[i]); >>> > >>> > Why do we do this? This seems to be an unrelated change given thread >>> > $subj. This is just a pure refactoring job, which deserves a separate >>> > patch. There is similar coding in >>> > smgrdestroy function: >>> > >>> > ``` >>> > for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) >>> > smgrsw[reln->smgr_which].smgr_close(reln, forknum); >>> > ``` >>> > >>> > So, I feel like these two places should be either changed together or >>> > not be altered at all. And is it definitely a separate change. >>> >>> Yeah, I tend to agree with you, maybe we should split the patch >>> into two. >>> >>> Steven, could you do this? >>> >>> > >>> > -- >>> > Best regards, >>> > Kirill Reshke >>> >>> >>> >>> -- >>> Regards >>> Junwang Zhao -- Regards Junwang Zhao
Re: [PATCH] Add get_bytes() and set_bytes() functions
On Thu, 15 Aug 2024 at 05:20, Joel Jacobson wrote: > > On Wed, Aug 14, 2024, at 19:25, Joel Jacobson wrote: > > What do we want to happen if passing a numeric with decimal digits, > > to decimal_to_bytes()? It must be an error, right? > > > > Example: SELECT decimal_to_bytes(1.23); > > Hmm, an error feels quite ugly on second thought. > Would be nicer if all numerics could be represented, > > But then what about Inf,-Inf,NaN? > Perhaps we should also add casts between bytea and the integer/numeric types. That might be easier to use than functions in some circumstances. When casting a numeric to an integer, the result is rounded to the nearest integer, and NaN/Inf generate errors, so we should probably do the same here. Regards, Dean
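For reference, the existing numeric-to-integer cast behavior that the
proposed bytea casts would presumably mirror looks like this:

    SELECT 1.7::numeric::int;         -- 2 (rounded to nearest)
    SELECT 'NaN'::numeric::int;       -- ERROR: cannot convert NaN to integer
    SELECT 'Infinity'::numeric::int;  -- ERROR: cannot convert infinity to integer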
replace magic num in struct cachedesc with CATCACHE_MAXKEYS
Hi hackers,

I noticed that there is a magic number which can be replaced by
CATCACHE_MAXKEYS in struct cachedesc. I checked some other structs like
CatCache and CatCTup, and they all use CATCACHE_MAXKEYS.

I did some searching on pgsql-hackers and found an old thread [0] in which
Robert proposed to change the maximum number of keys for a syscache from 4 to
5. It seems to me that the *five-key syscaches* feature is not necessary,
since the idea is 14 years old and we still use 4 keys without any problems
(I might be wrong). However, that patch contains a change that seems
reasonable:

--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -95,7 +95,7 @@ struct cachedesc
 	Oid			reloid;			/* OID of the relation being cached */
 	Oid			indoid;			/* OID of index relation for this cache */
 	int			nkeys;			/* # of keys needed for cache lookup */
-	int			key[4];			/* attribute numbers of key attrs */
+	int			key[CATCACHE_MAXKEYS];	/* attribute numbers of key attrs */
 	int			nbuckets;		/* number of hash buckets for this cache */
 };

[0]: https://www.postgresql.org/message-id/flat/603c8f071003281532t5e6c68eex458825485d4fcd98%40mail.gmail.com

--
Regards
Junwang Zhao

Attachment: v1-0001-replace-magic-num-with-CATCACHE_MAXKEYS.patch
Re: [PATCH] Add get_bytes() and set_bytes() functions
Hi, > Perhaps we should also add casts between bytea and the integer/numeric > types. That might be easier to use than functions in some > circumstances. > > When casting a numeric to an integer, the result is rounded to the > nearest integer, and NaN/Inf generate errors, so we should probably do > the same here. Yes, I was also thinking about adding NUMERIC versions of get_bytes() / set_bytes(). This would allow converting more than 8 bytes to/from an integer. I dropped this idea because I thought there would be not much practical use for it. On the flip side you never know who uses Postgres and for what purpose. I will add corresponding casts unless the idea will get a push-back from the community. IMO the existence of these casts will at least not make things worse. -- Best regards, Aleksander Alekseev
Re: Thread-safe nl_langinfo() and localeconv()
On 15/08/2024 11:03, Thomas Munro wrote:
> Here's a new patch to add to this pile, this time for check_locale().
> I also improved the error handling and comments in the other patches.

There's a similar function in initdb, check_locale_name. I wonder if that
could reuse the same code.

I wonder if it would be more clear to split this into three functions:

/*
 * Get the name of the locale in "native environment",
 * like setlocale(category, NULL) does
 */
char *get_native_locale(int category);

/*
 * Return true if 'locale' is valid locale name for 'category'
 */
bool check_locale(int category, const char *locale);

/*
 * Return a canonical name of given locale
 */
char *canonicalize_locale(int category, const char *locale);

    result = malloc(strlen(canonical) + 1);
    if (!result)
        goto exit;
    strcpy(result, canonical);

Could use "result = strdup(canonical)" here. Or even better, could we use
palloc()/pstrdup() here, and save the caller the trouble to copy it?

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Skip adding row-marks for non target tables when result relation is foreign table.
On Thu, Aug 15, 2024 at 9:56 AM Jeff Davis wrote: > Is there any sample code that implements late locking for a FDW? I'm > not quite clear on how it's supposed to work. See the patch in [1]. It would not apply to HEAD anymore, though. Best regards, Etsuro Fujita [1] https://www.postgresql.org/message-id/16016.1431455...@sss.pgh.pa.us
Re: Enable data checksums by default
Hi all,

On Tue, Aug 13, 2024 at 10:08 PM Robert Haas wrote:
> And it's not like we have statistics anywhere that you can look at to
> see how much CPU time you spent computing checksums, so if a user DOES
> have a performance problem that would not have occurred if checksums
> had been disabled, they'll probably never know it.

In the worst case, per-second and per-pid CPU time consumption could be
quantified with eBPF, which is the standard on distros now (requires kernel
headers and bpfcc-tools installed). E.g. here 7918 was the PID doing a
pgbench-related -c 4 workload with checksums=on (sorry for the formatting,
but I don't want to use HTML here):

# funclatency-bpfcc --microseconds -i 1 -p 7918 /usr/lib/postgresql/16/bin/postgres:pg_checksum_page
Tracing 1 functions for "/usr/lib/postgresql/16/bin/postgres:pg_checksum_page"... Hit Ctrl-C to end.

     usecs       : count
     0 -> 1      : 0
     2 -> 3      : 238
     4 -> 7      : 714
     8 -> 15     : 2
     16 -> 31    : 5
     32 -> 63    : 0
     64 -> 127   : 1
     128 -> 255  : 0
     256 -> 511  : 1
     512 -> 1023 : 1

avg = 6 usecs, total: 6617 usecs, count: 962

     usecs       : count
     0 -> 1      : 0
     2 -> 3      : 241
     4 -> 7      : 706
     8 -> 15     : 11
     16 -> 31    : 10
     32 -> 63    : 1

avg = 5 usecs, total: 5639 usecs, count: 969

[..refreshes every 1s here..]

So the above can tell us e.g. that this pg_checksum_page() took 5639 us out
of a 1 s full sample time (and with a 100% CPU pegged core that gives again
~5% CPU util for this routine; I'm ignoring the WAL/log hint impact for
sure). One could also write a small script using bpftrace instead, too.
Disassembly on the Debian version and stock PGDG is telling me it's full SSE2
instruction set, so that's nice and optimal too.

> > > For those uses, this change would render pg_upgrade useless for upgrades
> > > from an old instance with default settings to a new instance with default
> > > settings. And then users would either need to re-initdb with checksums
> > > turned back off, or I suppose run pg_checksums on the old instance before
> > > upgrading? This is significant additional complication.
> >
> > Meh, re-running initdb with --no-data-checksums seems a fairly low hurdle.
>
> I tend to agree with that, but I would also like to see the sort of
> improvements that Peter mentions. [..]
> None of that is to say that I'm totally hostile to this change. [..]
> Whether that's worth the overhead for everyone, I'm not quite sure.

Without data checksums there's a risk that someone receives silent bit
corruption and no one will notice. Shouldn't integrity of data stand above
performance by default, in this case? (And performance could be opt-in, if
someone really wants it.)

-J.
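As a follow-up to the bpftrace remark above, an equivalent ad-hoc latency
histogram could be collected with something like the following one-liner (a
sketch only; binary path and PID as in the funclatency example above):

    bpftrace -p 7918 -e '
    uprobe:/usr/lib/postgresql/16/bin/postgres:pg_checksum_page { @start[tid] = nsecs; }
    uretprobe:/usr/lib/postgresql/16/bin/postgres:pg_checksum_page /@start[tid]/ {
        @usecs = hist((nsecs - @start[tid]) / 1000);
        delete(@start[tid]);
    }'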
Re: Improve error message for ICU libraries if pkg-config is absent
On 15.08.24 09:20, Michael Banck wrote: On Wed, Aug 14, 2024 at 06:05:19PM -0700, Jeff Davis wrote: That looks good to me. Does anyone have a different opinion? If not, I'll go ahead and commit (but not backport) this change. What is the rationale not to backpatch this? The error message changes, but we do not translate configure output, so is this a problem/project policy to never change configure output in the back-branches? If the change is too invasive, would something like the initial patch I suggested (i.e., in the absense of pkg-config, add a more targetted error message) be acceptable for the back-branches? But it's not just changing an error message, it's changing the logic that leads to the error message. Have we really thought through all the combinations here? I don't know. So committing for master and then seeing if there is any further feedback seems most prudent. (I'm not endorsing either patch version here, just commenting on the process.)
Re: Remaining dependency on setlocale()
On 15.08.24 00:43, Thomas Munro wrote: "" is a problem however... the special value for "native environment" is returned as a real locale name, which we probably still need in places. We could change that to newlocale("") + query instead, but Where do we need that in the server? It should just be initdb doing that and then initializing the server with concrete values based on that. I guess technically some of these GUC settings default to the environment? But I think we could consider getting rid of that.
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
On 15.08.24 00:40, Jelte Fennema-Nio wrote: I'd like to send an automatic mail to a thread whenever it gets added to a commitfest. Since this would impact everyone that's subscribed to the mailinglist I'd like some feedback on this. This mail would include: 1. A very short blurb like: "This thread was added to the commitfest with ID 1234" How would you attach such an email to a thread? Where in the thread would you attach it? I'm not quite sure how well that would work. The main reason I'm proposing this is that currently it's not trivial to go from an email in my inbox to the commitfest entry. Maybe there could be a feature in the commitfest app to enter a message ID and get the commitfest entries that contain that message.
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
On 15.08.24 09:59, Jelte Fennema-Nio wrote: I realized a 5th thing that I would want in the email and cf entry page 5. A copy-pastable set of git command that checks out the patch by downloading it from the cfbot repo like this: git config branch.cf/5107.remote https://github.com/postgresql-cfbot/postgresql.git git config branch.cf/5107.merge refs/heads/cf/5107 git checkout -b cf/5107 git pull Maybe this kind of thing should rather be on the linked-to web page, not in every email. But a more serious concern here is that the patches created by the cfbot are not canonical. There are various heuristics when they get applied. I would prefer that people work with the actual patches sent by email, at least unless they know exactly what they are doing. We don't want to create parallel worlds of patches that are like 90% similar but not really identical.
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
On Thu, 15 Aug 2024 at 15:28, Peter Eisentraut wrote: > How would you attach such an email to a thread? Where in the thread > would you attach it? I'm not quite sure how well that would work. My idea would be to have the commitfest app send it in reply to the message id that was entered in the "New patch" page: https://commitfest.postgresql.org/open/new/ > Maybe there could be a feature in the commitfest app to enter a message > ID and get the commitfest entries that contain that message. That's a good idea.
Re: Parallel CREATE INDEX for BRIN indexes
On 13.04.24 23:04, Tomas Vondra wrote:
>> While preparing a differential code coverage report between 16 and HEAD,
>> one thing that stands out is the parallel brin build code. Neither on
>> coverage.postgresql.org nor locally is that code reached during our tests.
>
> Thanks for pointing this out, it's definitely something that I need to
> improve (admittedly, should have been part of the patch). I'll also look
> into eliminating the difference between BTREE and BRIN parallel builds,
> mentioned in my last message in this thread.
>
> Here's a couple patches adding a test for the parallel CREATE INDEX with
> BRIN. The actual test is 0003/0004 - I added the test to pageinspect,
> because that allows cross-checking the index to one built without
> parallelism, which I think is better than just doing CREATE INDEX without
> properly testing it produces correct results.

These pageinspect tests added a new use of the md5() function. We got rid of
those in the tests for PG17. You could write the test case with something
like

     SELECT (CASE WHEN (mod(i,231) = 0) OR (i BETWEEN 3500 AND 4000) THEN NULL ELSE i END),
-           (CASE WHEN (mod(i,233) = 0) OR (i BETWEEN 3750 AND 4250) THEN NULL ELSE md5(i::text) END),
+           (CASE WHEN (mod(i,233) = 0) OR (i BETWEEN 3750 AND 4250) THEN NULL ELSE encode(sha256(i::text::bytea), 'hex') END),
            (CASE WHEN (mod(i,233) = 0) OR (i BETWEEN 3850 AND 4500) THEN NULL ELSE (i/100) + mod(i,8) END)

But this changes the test output slightly and I'm not sure if this gives you
the data distribution that you need for your test. Could you check this
please?
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
On Thu, 15 Aug 2024 at 15:33, Peter Eisentraut wrote: > Maybe this kind of thing should rather be on the linked-to web page, not > in every email. Yeah, I'll first put a code snippet on the page for the commitfest entry. > But a more serious concern here is that the patches created by the cfbot > are not canonical. There are various heuristics when they get applied. > I would prefer that people work with the actual patches sent by email, > at least unless they know exactly what they are doing. We don't want to > create parallel worlds of patches that are like 90% similar but not > really identical. I'm not really sure what kind of heuristics and resulting differences you're worried about here. The heuristics it uses are very simple and are good enough for our CI. Basically they are: 1. Unzip/untar based on file extension 2. Apply patches using "patch" in alphabetic order Also, when I apply patches myself, I use heuristics too. And my heuristics are probably different from yours. So I'd expect that many people using the exact same heuristic would only make the situation better. Especially because if people don't know exactly what they are doing, then their heuristics are probably not as good as the one of our cfbot. I know I've struggled a lot the first few times when I was manually applying patches.
Re: format_datum debugging function
Peter Eisentraut writes: > On 14.08.24 17:46, Paul Jungwirth wrote: >> Are you doing something to get macro expansion? I've never gotten my gdb >> to see #defines, although in theory this configure line should do it, >> right?: > Oh I see, you don't have the F_* constants available then. Maybe just > put in the OID manually then? That's pretty illegible and error-prone. I agree that writing the output function's C name is noticeably better. However, I would call the result DirectOutputFunctionCall and put it near OidOutputFunctionCall in fmgr.c. It's not like we don't already have a naming convention and many instances of this. (Also, now that I look at the code, I wonder why it looks so little like any of the existing DirectFunctionCall functions.) regards, tom lane
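A hedged sketch of what the suggested function might look like, following the
existing OidOutputFunctionCall()/DirectFunctionCall conventions in fmgr.c
(this is not committed code, just an illustration of the naming proposal):

    /* Sketch only: call a datatype output function by its C symbol. */
    char *
    DirectOutputFunctionCall(PGFunction func, Datum datum)
    {
        return DatumGetCString(DirectFunctionCall1(func, datum));
    }

    /* usage from gdb, e.g.:  call DirectOutputFunctionCall(range_out, $1) */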
Re: Alias of VALUES RTE in explain plan
On Mon, Jul 1, 2024 at 3:17 PM Ashutosh Bapat wrote:
> Hi All,
> While reviewing Richard's patch for grouping sets, I stumbled upon
> following explain output
>
> explain (costs off)
> select distinct on (a, b) a, b
> from (values (1, 1), (2, 2)) as t (a, b) where a = b
> group by grouping sets((a, b), (a))
> order by a, b;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Unique
>    ->  Sort
>          Sort Key: "*VALUES*".column1, "*VALUES*".column2
>          ->  HashAggregate
>                Hash Key: "*VALUES*".column1, "*VALUES*".column2
>                Hash Key: "*VALUES*".column1
>                ->  Values Scan on "*VALUES*"
>                      Filter: (column1 = column2)
> (8 rows)
>
> There is no VALUES.column1 and VALUES.column2 in the query. The alias t.a
> and t.b do not appear anywhere in the explain output. I think explain
> output should look like
>
> explain (costs off)
> select distinct on (a, b) a, b
> from (values (1, 1), (2, 2)) as t (a, b) where a = b
> group by grouping sets((a, b), (a))
> order by a, b;
>                         QUERY PLAN
> ----------------------------------------------------------
>  Unique
>    ->  Sort
>          Sort Key: t.a, t.b
>          ->  HashAggregate
>                Hash Key: t.a, t.b
>                Hash Key: t.a
>                ->  Values Scan on "*VALUES*" t
>                      Filter: (a = b)
> (8 rows)
>
> I didn't get time to figure out the reason behind this, nor the history.
> But I thought I would report it nonetheless.

I have looked into the issue and found that when subqueries are pulled up, a
modifiable copy of the subquery is created for modification in the
pull_up_simple_subquery() function. During this process,
flatten_join_alias_vars() is called to flatten any join alias variables in
the subquery's target list. However, at this point we lose the subquery's
alias.

If you/hackers agree with my findings, I can provide a working patch soon.

> --
> Best Wishes,
> Ashutosh Bapat
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
On Thu, Aug 15, 2024 at 12:40:29AM +0200, Jelte Fennema-Nio wrote: > I'd like to send an automatic mail to a thread whenever it gets added > to a commitfest. Since this would impact everyone that's subscribed to > the mailinglist I'd like some feedback on this. This mail would > include: I haven't thought too much about the details, but in general, +1 for sending links to cfest/cfbot entries to the thread. -- nathan
Re: Adding clarification to description of IPC wait events XactGroupUpdate and ProcArrayGroupUpdate
On Thu, Aug 15, 2024 at 11:25:25AM +0800, SAMEER KUMAR wrote: > I think it is important to indicate that the group leader is responsible > for clearing the transaction ID/transaction status of other backends > (including this one). Your proposal is Waiting for the group leader process to clear the transaction ID of this backend at the end of a transaction. Waiting for the group leader process to update the transaction status for this backend. Mine is Waiting for the group leader to clear the transaction ID at transaction end. Waiting for the group leader to update transaction status at transaction end. IMHO the latter doesn't convey substantially less information, and it fits a little better with the terse style of the other wait events nearby. But I'll yield to majority opinion here. -- nathan
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Wed, Aug 14, 2024 at 2:04 PM Jelte Fennema-Nio wrote: > Applied all 0002 feedback. Although I used Min(proto, > PG_PROTOCOL_LATEST) because Max results in the wrong value. Picky, picky. :-) Committed. > Makes sense. I'm not in too much of a hurry with this specific one. So > I'll leave it like this for now and hopefully someone else responds. > If this becomes close to being the final unmerged patch of this > patchset, I'll probably cut my losses and create a patch that adds a > function instead. Maybe reorder the series to put that one later then. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Remove dependence on integer wrapping
Hello Joe, 05.08.2024 02:55, Joseph Koshakow wrote: On Fri, Jun 14, 2024 at 8:00 AM Alexander Lakhin wrote: > > And the most interesting case to me: > SET temp_buffers TO 10; > > CREATE TEMP TABLE t(i int PRIMARY KEY); > INSERT INTO t VALUES(1); > ... Alex, are you able to get a full stack trace for this panic? I'm unable to reproduce this because I don't have enough memory in my system. I've tried reducing `BLCKSZ` to 1024, which is the lowest value allowed per my understanding, and I still don't have enough memory. Yes, please take a look at it (sorry for the late reply): (gdb) bt #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140438687430464) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=6, threadid=140438687430464) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140438687430464, signo=signo@entry=6) at ./nptl/pthread_kill.c:89 #3 0x7fba70025476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #4 0x7fba7000b7f3 in __GI_abort () at ./stdlib/abort.c:79 #5 0x563945aed511 in __addvsi3 () #6 0x563945a6c106 in init_htab (hashp=0x563947700980, nelem=10) at dynahash.c:720 #7 0x563945a6bd22 in hash_create (tabname=0x563945c591d9 "Local Buffer Lookup Table", nelem=10, info=0x7ffd4d394620, flags=40) at dynahash.c:567 #8 0x5639457f2760 in el () at localbuf.c:635 #9 0x5639457f19e3 in ExtendBufferedRelLocal (bmr=..., fork=MAIN_FORKNUM, flags=8, extend_by=1, extend_upto=4294967295, buffers=0x7ffd4d3948e0, extended_by=0x7ffd4d3947ac) at localbuf.c:326 #10 0x5639457e8851 in ExtendBufferedRelCommon (bmr=..., fork=MAIN_FORKNUM, strategy=0x0, flags=8, extend_by=1, extend_upto=4294967295, buffers=0x7ffd4d3948e0, extended_by=0x7ffd4d39488c) at bufmgr.c:2175 #11 0x5639457e6850 in ExtendBufferedRelBy (bmr=..., fork=MAIN_FORKNUM, strategy=0x0, flags=8, extend_by=1, buffers=0x7ffd4d3948e0, extended_by=0x7ffd4d39488c) at bufmgr.c:923 #12 0x5639452d8ae6 in RelationAddBlocks (relation=0x7fba650abd78, bistate=0x0, num_pages=1, use_fsm=true, did_unlock=0x7ffd4d394a3d) at hio.c:341 #13 0x5639452d944a in RelationGetBufferForTuple (relation=0x7fba650abd78, len=32, otherBuffer=0, options=0, bistate=0x0, vmbuffer=0x7ffd4d394ac4, vmbuffer_other=0x0, num_pages=1) at hio.c:767 #14 0x5639452be996 in heap_insert (relation=0x7fba650abd78, tup=0x5639476ecfc0, cid=0, options=0, bistate=0x0) at heapam.c:2019 #15 0x5639452cee84 in heapam_tuple_insert (relation=0x7fba650abd78, slot=0x5639476ecf30, cid=0, options=0, bistate=0x0) at heapam_handler.c:251 #16 0x5639455b3b07 in table_tuple_insert (rel=0x7fba650abd78, slot=0x5639476ecf30, cid=0, options=0, bistate=0x0) at ../../../src/include/access/tableam.h:1405 #17 0x5639455b5c60 in ExecInsert (context=0x7ffd4d394d20, resultRelInfo=0x5639476ec390, slot=0x5639476ecf30, canSetTag=true, inserted_tuple=0x0, insert_destrel=0x0) at nodeModifyTable.c:1139 #18 0x5639455ba942 in ExecModifyTable (pstate=0x5639476ec180) at nodeModifyTable.c:4077 #19 0x563945575425 in ExecProcNodeFirst (node=0x5639476ec180) at execProcnode.c:469 #20 0x563945568095 in ExecProcNode (node=0x5639476ec180) at ../../../src/include/executor/executor.h:274 #21 0x56394556af65 in ExecutePlan (estate=0x5639476ebf00, planstate=0x5639476ec180, use_parallel_mode=false, operation=CMD_INSERT, sendTuples=false, numberTuples=0, direction=ForwardScanDirection, dest=0x5639476f5470, execute_once=true) at execMain.c:1646 #22 0x5639455687e3 in standard_ExecutorRun (queryDesc=0x5639476f3e70, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:363 #23 
0x5639455685b9 in ExecutorRun (queryDesc=0x5639476f3e70, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:304 #24 0x56394584986e in ProcessQuery (plan=0x5639476f5310, sourceText=0x56394760d610 "INSERT INTO t VALUES(1);", params=0x0, queryEnv=0x0, dest=0x5639476f5470, qc=0x7ffd4d395180) at pquery.c:160 #25 0x56394584b445 in PortalRunMulti (portal=0x56394768ab20, isTopLevel=true, setHoldSnapshot=false, dest=0x5639476f5470, altdest=0x5639476f5470, qc=0x7ffd4d395180) at pquery.c:1278 #26 0x56394584a93c in PortalRun (portal=0x56394768ab20, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x5639476f5470, altdest=0x5639476f5470, qc=0x7ffd4d395180) at pquery.c:791 #27 0x563945842fd9 in exec_simple_query (query_string=0x56394760d610 "INSERT INTO t VALUES(1);") at postgres.c:1284 #28 0x563945848536 in PostgresMain (dbname=0x563947644900 "regression", username=0x5639476448e8 "law") at postgres.c:4766 #29 0x56394583eb67 in BackendMain (startup_data=0x7ffd4d395404 "", startup_data_len=4) at backend_startup.c:107 #30 0x56394574e00e in postmaster_child_launch (child_type=B_BACKEND, startup_data=0x7ffd4d395404 "", startup_data_len=4, client_sock=0x7ffd4d395450) at launch_backend.c:274 #31 0x563945753f7
Re: generic plans and "initial" pruning
On Thu, Aug 15, 2024 at 8:57 AM Amit Langote wrote: > TBH, it's more of a hunch that people who are not involved in this > development might find the new reality, whereby the execution is not > racefree until ExecutorRun(), hard to reason about. I'm confused by what you mean here by "racefree". A race means multiple sessions are doing stuff at the same time and the result depends on who does what first, but the executor stuff is all backend-private. Heavyweight locks are not backend-private, but those would be taken in ExectorStart(), not ExecutorRun(), IIUC. > With the patch, CreateQueryDesc() and ExecutorStart() are moved to > PortalStart() so that QueryDescs including the PlanState trees for all > queries are built before any is run. Why? So that if ExecutorStart() > fails for any query in the list, we can simply throw out the QueryDesc > and the PlanState trees of the previous queries (NOT run them) and ask > plancache for a new CachedPlan for the list of queries. We don't have > a way to ask plancache.c to replan only a given query in the list. I agree that moving this from PortalRun() to PortalStart() seems like a bad idea, especially in view of what you write below. > * There's no longer CCI() between queries in PortalRunMulti() because > the snapshots in each query's QueryDesc must have been adjusted to > reflect the correct command counter. I've checked but can't really be > sure if the value in the snapshot is all anyone ever uses if they want > to know the current value of the command counter. I don't think anything stops somebody wanting to look at the current value of the command counter. I also don't think you can remove the CommandCounterIncrement() calls between successive queries, because then they won't see the effects of earlier calls. So this sounds broken to me. Also keep in mind that one of the queries could call a function which does something that bumps the command counter again. I'm not sure if that creates its own hazzard separate from the lack of CCIs, or whether it's just another part of that same issue. But you can't assume that each query's snapshot should have a command counter value one more than the previous query. While this all seems bad for the partially-initialized-execution-tree approach, I wonder if you don't have problems here with the other design, too. Let's say you've the multi-query case and there are 2 queries. The first one (Q1) is SELECT mysterious_function() and the second one (Q2) is SELECT * FROM range_partitioned_table WHERE key_column = 42. What if mysterious_function() performs DDL on range_partitioned_table? I haven't tested this so maybe there are things going on here that prevent trouble, but it seems like executing Q1 can easily invalidate the plan for Q2. And then it seems like you're basically back to the same problem. > > > 3. The need to add *back* the fields to store the RT indexes of > > > relations that are not looked at by ExecInitNode() traversal such as > > > root partitioned tables and non-leaf partitions. > > > > I don't remember exactly why we removed those or what the benefit was, > > so I'm not sure how big of a problem it is if we have to put them > > back. > > We removed those in commit 52ed730d511b after commit f2343653f5b2 > removed redundant execution-time locking of non-leaf relations. So we > removed them because we realized that execution time locking is > unnecessary given that AcquireExecutorLocks() exists and now we want > to add them back because we'd like to get rid of > AcquireExecutorLocks(). 
:-) My bias is to believe that getting rid of AcquireExecutorLocks() is probably the right thing to do, but that's not a strongly-held position and I could be totally wrong about it. The thing is, though, that AcquireExecutorLocks() is fundamentally stupid, and it's hard to see how it can ever be any smarter. If we want to make smarter decisions about what to lock, it seems reasonable to me to think that the locking code needs to be closer to code that can evaluate expressions and prune partitions and stuff like that. -- Robert Haas EDB: http://www.enterprisedb.com
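To make the Q1/Q2 scenario discussed above concrete, here is a minimal SQL sketch (object names are hypothetical, borrowed from the example above; the plpgsql body is invented for illustration) of a setup where executing the first query of a multi-query portal would invalidate the plan of the second:

CREATE TABLE range_partitioned_table (key_column int) PARTITION BY RANGE (key_column);
CREATE TABLE rpt_p1 PARTITION OF range_partitioned_table FOR VALUES FROM (0) TO (100);

CREATE FUNCTION mysterious_function() RETURNS int LANGUAGE plpgsql AS $$
BEGIN
    -- DDL on the table used by the next query in the portal list; this sends
    -- invalidations that would apply to any plan already built for Q2
    EXECUTE 'ALTER TABLE range_partitioned_table DETACH PARTITION rpt_p1';
    RETURN 1;
END;
$$;

-- Sent as a single simple-query message (two statements, one portal list):
--   SELECT mysterious_function(); SELECT * FROM range_partitioned_table WHERE key_column = 42;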
Re: long-standing data loss bug in initial sync of logical replication
On Thu, 8 Aug 2024 at 16:24, Shlok Kyal wrote: > > On Wed, 31 Jul 2024 at 11:17, Shlok Kyal wrote: > > > > On Wed, 31 Jul 2024 at 09:36, Amit Kapila wrote: > > > > > > On Wed, Jul 31, 2024 at 3:27 AM Masahiko Sawada > > > wrote: > > > > > > > > On Wed, Jul 24, 2024 at 9:53 PM Amit Kapila > > > > wrote: > > > > > > > > > > On Wed, Jul 17, 2024 at 5:25 PM vignesh C wrote: > > > > > > > > > > > > On Wed, 17 Jul 2024 at 11:54, Amit Kapila > > > > > > wrote: > > > > > > > > > > > > > > On Tue, Jul 16, 2024 at 6:54 PM vignesh C > > > > > > > wrote: > > > > > > > > > > > > > > BTW, I noticed that we don't take any table-level locks for Create > > > > > > > Publication .. For ALL TABLES (and Drop Publication). Can that > > > > > > > create > > > > > > > a similar problem? I haven't tested so not sure but even if there > > > > > > > is a > > > > > > > problem for the Create case, it should lead to some ERROR like > > > > > > > missing > > > > > > > publication. > > > > > > > > > > > > I tested these scenarios, and as you expected, it throws an error > > > > > > for > > > > > > the create publication case: > > > > > > 2024-07-17 14:50:01.145 IST [481526] 481526 ERROR: could not > > > > > > receive > > > > > > data from WAL stream: ERROR: publication "pub1" does not exist > > > > > > CONTEXT: slot "sub1", output plugin "pgoutput", in the > > > > > > change > > > > > > callback, associated LSN 0/1510CD8 > > > > > > 2024-07-17 14:50:01.147 IST [481450] 481450 LOG: background worker > > > > > > "logical replication apply worker" (PID 481526) exited with exit > > > > > > code > > > > > > 1 > > > > > > > > > > > > The steps for this process are as follows: > > > > > > 1) Create tables in both the publisher and subscriber. > > > > > > 2) On the publisher: Create a replication slot. > > > > > > 3) On the subscriber: Create a subscription using the slot created > > > > > > by > > > > > > the publisher. > > > > > > 4) On the publisher: > > > > > > 4.a) Session 1: BEGIN; INSERT INTO T1; > > > > > > 4.b) Session 2: CREATE PUBLICATION FOR ALL TABLES > > > > > > 4.c) Session 1: COMMIT; > > > > > > > > > > > > Since we are throwing out a "publication does not exist" error, > > > > > > there > > > > > > is no inconsistency issue here. > > > > > > > > > > > > However, an issue persists with DROP ALL TABLES publication, where > > > > > > data continues to replicate even after the publication is dropped. > > > > > > This happens because the open transaction consumes the invalidation, > > > > > > causing the publications to be revalidated using old snapshot. As a > > > > > > result, both the open transactions and the subsequent transactions > > > > > > are > > > > > > getting replicated. > > > > > > > > > > > > We can reproduce this issue by following these steps in a logical > > > > > > replication setup with an "ALL TABLES" publication: > > > > > > On the publisher: > > > > > > Session 1: BEGIN; INSERT INTO T1 VALUES (val1); > > > > > > In another session on the publisher: > > > > > > Session 2: DROP PUBLICATION > > > > > > Back in Session 1 on the publisher: > > > > > > COMMIT; > > > > > > Finally, in Session 1 on the publisher: > > > > > > INSERT INTO T1 VALUES (val2); > > > > > > > > > > > > Even after dropping the publication, both val1 and val2 are still > > > > > > being replicated to the subscriber. This means that both the > > > > > > in-progress concurrent transaction and the subsequent transactions > > > > > > are > > > > > > being replicated. 
> > > > > > > > > > > > I don't think locking all tables is a viable solution in this case, > > > > > > as > > > > > > it would require asking the user to refrain from performing any > > > > > > operations on any of the tables in the database while creating a > > > > > > publication. > > > > > > > > > > > > > > > > Indeed, locking all tables in the database to prevent concurrent DMLs > > > > > for this scenario also looks odd to me. The other alternative > > > > > previously suggested by Andres is to distribute catalog modifying > > > > > transactions to all concurrent in-progress transactions [1] but as > > > > > mentioned this could add an overhead. One possibility to reduce > > > > > overhead is that we selectively distribute invalidations for > > > > > catalogs-related publications but I haven't analyzed the feasibility. > > > > > > > > > > We need more opinions to decide here, so let me summarize the problem > > > > > and solutions discussed. As explained with an example in an email [1], > > > > > the problem related to logical decoding is that it doesn't process > > > > > invalidations corresponding to DDLs for the already in-progress > > > > > transactions. We discussed preventing DMLs in the first place when > > > > > concurrent DDLs like ALTER PUBLICATION ... ADD TABLE ... are in > > > > > progress. The solution discussed was to acquire > > > > > ShareUpdateExclusiveLock for all the tables being added via such > > > > > commands. Further
Re: optimizing pg_upgrade's once-in-each-database steps
On Sat, Aug 10, 2024 at 10:35:46AM -0500, Nathan Bossart wrote: > Another option might be to combine all the queries for a task into a single > string and then send that in one PQsendQuery() call. That may be a simpler > way to eliminate the time between queries. I tried this out and didn't see any difference in my tests. However, the idea seems sound, and I could remove ~40 lines of code by doing this and by making the search_path query an implicit first step (instead of its own state). So, here's a v9 of the patch set with those changes. -- nathan >From 8d8577b3dd1bbdbe584816dfb9045a5988862412 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Fri, 28 Jun 2024 11:02:44 -0500 Subject: [PATCH v9 01/11] introduce framework for parallelizing pg_upgrade tasks --- src/bin/pg_upgrade/Makefile | 1 + src/bin/pg_upgrade/async.c | 425 +++ src/bin/pg_upgrade/meson.build | 1 + src/bin/pg_upgrade/pg_upgrade.h | 14 + src/tools/pgindent/typedefs.list | 4 + 5 files changed, 445 insertions(+) create mode 100644 src/bin/pg_upgrade/async.c diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile index bde91e2beb..3bc4f5d740 100644 --- a/src/bin/pg_upgrade/Makefile +++ b/src/bin/pg_upgrade/Makefile @@ -12,6 +12,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ $(WIN32RES) \ + async.o \ check.o \ controldata.o \ dump.o \ diff --git a/src/bin/pg_upgrade/async.c b/src/bin/pg_upgrade/async.c new file mode 100644 index 00..18a0b01f79 --- /dev/null +++ b/src/bin/pg_upgrade/async.c @@ -0,0 +1,425 @@ +/* + * async.c + * parallelization via libpq's async APIs + * + * This framework provides an efficient way of running the various + * once-in-each-database tasks required by pg_upgrade. Specifically, it + * parallelizes these tasks by managing a simple state machine of + * user_opts.jobs slots and using libpq's asynchronous APIs to establish the + * connections and run the queries. Callers simply need to create a callback + * function and build/execute an AsyncTask. A simple example follows: + * + * static void + * my_process_cb(DbInfo *dbinfo, PGresult *res, void *arg) + * { + * for (int i = 0; i < PQntuples(res); i++) + * { + * ... process results ... + * } + * } + * + * void + * my_task(ClusterInfo *cluster) + * { + * AsyncTask *task = async_task_create(); + * + * async_task_add_step(task, + * "... query text ...", + * my_process_cb, + * true, // let the task free the PGresult + * NULL); // "arg" pointer for the callbacks + * async_task_run(task, cluster); + * async_task_free(task); + * } + * + * Note that multiple steps can be added to a given task. When there are + * multiple steps, the task will run all of the steps consecutively in the same + * database connection before freeing the connection and moving on. In other + * words, it only ever initiates one connection to each database in the + * cluster for a given run. + * + * Copyright (c) 2024, PostgreSQL Global Development Group + * src/bin/pg_upgrade/async.c + */ + +#include "postgres_fe.h" + +#include "common/connect.h" +#include "fe_utils/string_utils.h" +#include "pg_upgrade.h" + +/* + * dbs_complete stores the number of databases that we have completed + * processing. When this value equals the number of databases in the cluster, + * the task is finished. + */ +static int dbs_complete; + +/* + * dbs_processing stores the index of the next database in the cluster's array + * of databases that will be picked up for processing. It will always be + * greater than or equal to dbs_complete. 
+ */ +static int dbs_processing; + +/* + * This struct stores all the information for a single step of a task. All + * steps in a task are run in a single connection before moving on to the next + * database (which requires a new connection). + */ +typedef struct AsyncTaskCallbacks +{ + AsyncTaskProcessCB process_cb; /* processes the results of the query */ + const char *query; /* query text */ + boolfree_result;/* should we free the result? */ + void *arg;/* pointer passed to each callback */ +} AsyncTaskCallbacks; + +/* + * This struct is a thin wrapper around an array of steps, i.e., + * AsyncTaskCallbacks. + */ +typedef struct AsyncTask +{ + AsyncTaskCallbacks *cbs; + int num_cb_sets; +} AsyncTask; + +
Re: Make query cancellation keys longer
I'm back to working on the main patch here, to make cancellation keys longer. New rebased version attached, with all the FIXMEs and TODOs from the earlier version fixed. There was a lot of bitrot, too. The first patch now introduces timingsafe_bcmp(), a function borrowed from OpenBSD to perform a constant-time comparison. There's a configure check to use the function from the OS if it's available, and the patch includes a copy of OpenBSD's implementation otherwise. Similar functions exist with different names in OpenSSL (CRYPTO_memcmp) and NetBSD (consttime_memequal), but it's a pretty simple function so I don't think we need to work too hard to pick up those other native implementations. I used it for checking if the cancellation key matches, now that it's not a single word anymore. It feels paranoid to worry about timing attacks here; a few instructions are unlikely to give enough signal to an attacker, and query cancellation is not a very interesting target anyway. But better safe than sorry. You can still get information about whether a backend with the given PID exists at all; the constant-time comparison only applies to comparing the key. We probably should be using this in some other places in the backend, but I haven't gone around looking for them. Hmm, looking at the current code, I do agree that not introducing a new message would simplify both client and server implementation. Now clients need to do different things depending on if the server supports 3.1 or 3.0 (sending CancelRequestExtended instead of CancelRequest and having to parse BackendKeyData differently). And I also agree that the extra length field doesn't add much in regards to sanity checking (for the CancelRequest and BackendKeyData message types at least). So, sorry for the back and forth on this, but I now agree with you that we should not add the length field. I think one reason I didn't see the benefit before was because the initial patch 0002 was still introducing a CancelRequestExtended message type. If we can get rid of this message type by not adding a length, then I think that's worth losing the sanity checking. Ok, I went back to the original scheme that just redefines the secret key in the CancelRequest message to be variable length, with the length deduced from the message length. We could teach libpq to disconnect and reconnect with minor protocol version 3.0, if connecting with 3.1 fails, but IMHO it's better to not complicate this and accept the break in backwards-compatibility. Yeah, implementing automatic reconnection seems a bit overkill to me too. But it might be nice to add a connection option that causes libpq to use protocol 3.0. Having to install an old libpq to connect to an old server seems quite annoying. Added a "protocol_version" libpq option for that. It defaults to "auto", but you can set it to "3.1" or "3.0" to force the version. It makes it easier to test that the backwards-compatibility works, too. Especially since I think that many other types of servers that implement the postgres protocol have not implemented the minor version negotiation. I at least know PgBouncer[1] and pgcat[2] have not, but probably other server implementations like CockroachDB and Google Spanner have this problem too. Good point.
-- Heikki Linnakangas Neon (https://neon.tech) From 5fb2a442efca75dbc384bafee486b090f91806ba Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Thu, 15 Aug 2024 15:49:46 +0300 Subject: [PATCH v5 1/2] Add timingsafe_bcmp(), for constant-time memory comparison timingsafe_bcmp() should be used instead of memcmp() or a naive for-loop, when comparing passwords or secret tokens, to avoid leaking information about the secret token by timing. This commit just introduces the function but does not change any existing code to use it yet. --- configure | 23 +++ configure.ac | 3 ++- meson.build| 2 ++ src/include/port.h | 4 src/port/meson.build | 1 + src/port/timingsafe_bcmp.c | 35 +++ 6 files changed, 67 insertions(+), 1 deletion(-) create mode 100644 src/port/timingsafe_bcmp.c diff --git a/configure b/configure index 2abbeb2794..1fd832bd5b 100755 --- a/configure +++ b/configure @@ -15758,6 +15758,16 @@ fi cat >>confdefs.h <<_ACEOF #define HAVE_DECL_STRSEP $ac_have_decl _ACEOF +ac_fn_c_check_decl "$LINENO" "timingsafe_bcmp" "ac_cv_have_decl_timingsafe_bcmp" "$ac_includes_default" +if test "x$ac_cv_have_decl_timingsafe_bcmp" = xyes; then : + ac_have_decl=1 +else + ac_have_decl=0 +fi + +cat >>confdefs.h <<_ACEOF +#define HAVE_DECL_TIMINGSAFE_BCMP $ac_have_decl +_ACEOF # We can't use AC_CHECK_FUNCS to detect these functions, because it @@ -15918,6 +15928,19 @@ esac fi +ac_fn_c_check_func "$LINENO" "timingsafe_bcmp" "ac_cv_func_timingsafe_bcmp" +if test "x$ac_cv_func_timingsafe_bcmp" = xyes; then : + $as_echo "#define HAVE_TIMINGSAFE_
Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest
(sorry for the formatting, my mobile phone doesn't have the capabilities I usually get when using my laptop) On Thu, 15 Aug 2024, 16:02 Jelte Fennema-Nio, wrote: > On Thu, 15 Aug 2024 at 15:33, Peter Eisentraut > wrote: > > Maybe this kind of thing should rather be on the linked-to web page, not > > in every email. > > Yeah, I'll first put a code snippet on the page for the commitfest entry. > > > But a more serious concern here is that the patches created by the cfbot > > are not canonical. There are various heuristics when they get applied. > > I would prefer that people work with the actual patches sent by email, > > at least unless they know exactly what they are doing. We don't want to > > create parallel worlds of patches that are like 90% similar but not > > really identical. > > I'm not really sure what kind of heuristics and resulting differences > you're worried about here. The heuristics it uses are very simple and > are good enough for our CI. Basically they are: > 1. Unzip/untar based on file extension > 2. Apply patches using "patch" in alphabetic order > > Also, when I apply patches myself, I use heuristics too. And my > heuristics are probably different from yours. So I'd expect that many > people using the exact same heuristic would only make the situation > better. Especially because if people don't know exactly what they are > doing, then their heuristics are probably not as good as the one of > our cfbot. I know I've struggled a lot the first few times when I was > manually applying patches. One serious issue with this is that in cases of apply failures, CFBot delays, or other issues, the CFBot repo won't contain the latest version of the series' patchsets. E.g. a hacker can accidentally send an incremental patch, or an unrelated patch to fix an issue mentioned in the thread without splitting into a different thread, etc. This can easily cause users (and CFBot) to test and review the wrong patch, esp. when the mail thread proper is not looked at by the reviewer, which would be somewhat promoted by a CFA+github-centric workflow. Apart from the above issue, I'm -0.5 on what to me equates with automated spam to -hackers: the volume of mails would put this around the 16th most common sender on -hackers, with about 400 mails/year (based on 80 new patches for next CF, and 5 CFs/year, combined with Robert's 2023 statistics at [0]). I also don't quite like the suggested contents of such a mail: (1) and (2) are essentially duplicative information, and because CF entries' IDs are not shown in the app the "with ID " part of (1) is practically useless (better use the CFE's title), (3) would best be stored and/or integrated in the CFA, as would (4). Additionally, (4) isn't canonical/guaranteed to be up-to-date, see above. As for the "copy-pastable git commands" suggestion, I'm not sure that's applicable, for the same reasons that (4) won't work reliably. CFBot's repo to me seems more like an internal implementation detail of CFBot than an authoritative source of patchset diffs. Maybe we could instead automate CF mail thread registration by allowing registration of threadless CF entries (as 'preliminary'), and detecting (and subsequently linking) new threads containing references to those CF entries, with e.g. a "CF: https://commitfest.postgresql.org/49/4980/" directive in the new thread's initial mail's text. This would give the benefits of requiring no second mail for CF referencing purposes, be it automated or manual.
Alternatively, we could allow threads for new entries to be started through the CF app (which would automatically insert the right form data into the mail), providing an alternative avenue to registering patches that doesn't have the chicken-and-egg problem you're trying to address here. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://rhaas.blogspot.com/2024/01/who-contributed-to-postgresql.html
Re: Reducing the log spam
On Thu, 25 Jul 2024 at 18:03, Laurenz Albe wrote: > Thanks for the review! > > On Wed, 2024-07-24 at 15:27 +0200, Rafia Sabih wrote: > > I liked the idea for this patch. I will also go for the default being > > an empty string. > > I went through this patch and have some comments on the code, > > > > 1. In general, I don't like the idea of goto, maybe we can have a > > free_something function to call here. > > The PostgreSQL code base has over 3000 goto's... > > Sure, that can be factored out to a function (except the final "return"), > but I feel that a function for three "free" calls is code bloat. > > On a detailed look over this, you are right Laurenz about this. > Do you think that avoiding the goto and using a function would make the > code simpler and clearer? > > > 2. > > if (!SplitIdentifierString(new_copy, ',', &states)) > > { > > GUC_check_errdetail("List syntax is invalid."); > > goto failed; > > } > > Here, we don't need all that free-ing, we can just return false here. > > I am OK with changing that; I had thought it was more clearer and more > foolproof to use the same pattern everywhere. > > > 3. > > /* > > * Check the the values are alphanumeric and convert them to upper case > > * (SplitIdentifierString converted them to lower case). > > */ > > for (p = state; *p != '\0'; p++) > > if (*p >= 'a' && *p <= 'z') > > *p += 'A' - 'a'; > > else if (*p < '0' || *p > '9') > > { > > GUC_check_errdetail("error codes can only contain digits and ASCII > letters."); > > goto failed; > > } > > I was thinking, maybe we can use tolower() function here. > > That is a good idea, but these C library respect the current locale. > I would have to explicitly specify the C locale or switch the locale > temporarily. > Hmm. actually I don't have any good answers to this locale issue. > > Switching the locale seems clumsy, and I have no idea what I would have > to feed as second argument to toupper_l() or isalnum_l(). > Do you have an idea? > > > 4. > > list_free(states); > > pfree(new_copy); > > > > *extra = statelist; > > return true; > > > > failed: > > list_free(states); > > pfree(new_copy); > > guc_free(statelist); > > return false; > > > > This looks like duplication that can be easily avoided. > > You may have free_somthing function to do all free-ing stuff only and > > its callee can then have a return statement. > > e.g for here, > > free_states(states, new_copy, statelist); > > return true; > > That free_states() function would just contain two function calls. > I think that defining a special function for that is somewhat out of > proportion. > > > 5. Also, for alphanumeric check, maybe we can have something like, > > if (isalnum(*state) == 0) > > { > > GUC_check_errdetail("error codes can only contain digits and ASCII > letters."); > > goto failed; > > } > > and we can do this in the beginning after the len check. > > isalnum() operates on a single character and depends on the current locale. > See my comments to 3. above. > > > Please let me know what you think, particularly about the locale problem. > > Yours, > Laurenz Albe > -- Regards, Rafia Sabih
Re: Remove dependence on integer wrapping
On Thu, Aug 15, 2024 at 02:56:00PM +0800, jian he wrote: > i am confused with > " > +#elif defined(HAVE_INT128) > + uint128 res = -((int128) a); > " > I thought "unsigned" means non-negative, therefore uint128 means non-negative. > therefore "int128 res = -((int128) a);" makes sense to me. Ah, that's a typo, thanks for pointing it out. > also in HAVE_INT128 branch > do we need cast int128 to int64, like > > *result = (int64) res; I don't think we need an explicit cast here since *result is known to be an int64. But it certainly wouldn't hurt anything... -- nathan
Re: Partial aggregates pushdown
On Sun, Jul 7, 2024 at 09:52:27PM +, fujii.y...@df.mitsubishielectric.co.jp wrote: > Hi Bruce. > > > From: Bruce Momjian > > Is there a reason the documentation is no longer a part of this patch? > > Can I help you keep it current? > > Here are the reasons: > Reason1. The approach differs significantly from the previous patch that > included documentation, the below. > > https://www.postgresql.org/message-id/attachment/152086/0001-Partial-aggregates-push-down-v34.patch > Reason2. I have two options for transmitting the state value and I'm > evaluating which one is optimal. >One is what I presented you in PGConf.dev2024. The other is Jelte's one. >He listened to my talk at the conference and gave me some useful comments > on hackers. I'm very grateful that. > Reason3. The implementation and test have been not finished yet. > Regarding Reason 2, I provided my conclusion in the previous message. > > My plan for advancing the patch involves the following steps: > Step1. Decide the approach on transmitting state value. > Step2. Implement code (including comments) and tests to support a subset of > aggregate functions. > Specifically, I plan to support avg, sum, and other aggregate functions > like min and max which don't need export/import functions. > Step3. Add documentations. > > To clarify my approach, should I proceed with Step 3 before Step2? > I would appreciate your feedback on this. Thanks, I now understand why the docs were removed, and I agree. I will post about the options now in a new email. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
Re: POC, WIP: OR-clause support for indexes
Hi! On 07.08.2024 04:11, Alexander Korotkov wrote: On Mon, Aug 5, 2024 at 11:24 PM Alena Rybakina wrote: Ok, thank you for your work) I think we can leave only the two added libraries in the first patch, the others are superfluous. Thank you. I also have fixed some grammar issues. While reviewing the patch, I couldn't understand one part of the code where we check the comparability of restrictinfos. /* RestrictInfo parameters must match parent */ if (subRinfo->is_pushed_down != rinfo->is_pushed_down || subRinfo->is_clone != rinfo->is_clone || subRinfo->security_level != rinfo->security_level || !bms_equal(subRinfo->required_relids, rinfo->required_relids) || !bms_equal(subRinfo->incompatible_relids, rinfo->incompatible_relids) || !bms_equal(subRinfo->outer_relids, rinfo->outer_relids)) return NULL; I didn't find a place in the optimizer where required_relids, incompatible_relids and outer_relids become different. Each make_restrictinfo function takes arguments from parent data. I disabled this check and the regression tests passed. Is this code needed for security verification? Could you clarify? In the last patch I corrected the libraries - one of them was not in alphabetical order. -- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company From 0d4e4d89f32496c692c43ec6b2d5a33b906f6697 Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Mon, 5 Aug 2024 21:27:02 +0300 Subject: [PATCH 1/2] Transform OR-clauses to SAOP's during index matching Replace "(indexkey op C1) OR (indexkey op C2) ... (indexkey op CN)" with "indexkey op ANY(ARRAY[C1, C2, ...])" (ScalarArrayOpExpr node) during matching a clause to index. Here Ci is an i-th constant or parameters expression, 'expr' is non-constant expression, 'op' is an operator which returns boolean result and has a commuter (for the case of reverse order of constant and non-constant parts of the expression, like 'Cn op expr'). This transformation allows handling long OR-clauses with single IndexScan avoiding slower bitmap scans. 
Discussion: https://postgr.es/m/567ED6CA.2040504%40sigaev.ru Author: Alena Rybakina Author: Andrey Lepikhov Reviewed-by: Peter Geoghegan Reviewed-by: Ranier Vilela Reviewed-by: Alexander Korotkov Reviewed-by: Robert Haas Reviewed-by: Jian He Reviewed-by: Tom Lane Reviewed-by: Nikolay Shaplov --- src/backend/optimizer/path/indxpath.c | 253 + src/test/regress/expected/create_index.out | 183 +-- src/test/regress/expected/join.out | 57 - src/test/regress/sql/create_index.sql | 42 src/test/regress/sql/join.sql | 9 + 5 files changed, 522 insertions(+), 22 deletions(-) diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index c0fcc7d78df..cbfb0fdb3c8 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -32,8 +32,10 @@ #include "optimizer/paths.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" +#include "utils/array.h" #include "utils/lsyscache.h" #include "utils/selfuncs.h" +#include "utils/syscache.h" /* XXX see PartCollMatchesExprColl */ @@ -177,6 +179,10 @@ static IndexClause *match_rowcompare_to_indexcol(PlannerInfo *root, RestrictInfo *rinfo, int indexcol, IndexOptInfo *index); +static IndexClause *match_orclause_to_indexcol(PlannerInfo *root, + RestrictInfo *rinfo, + int indexcol, + IndexOptInfo *index); static IndexClause *expand_indexqual_rowcompare(PlannerInfo *root, RestrictInfo *rinfo, int indexcol, @@ -2248,6 +2254,10 @@ match_clause_to_indexcol(PlannerInfo *root, { return match_rowcompare_to_indexcol(root, rinfo, indexcol, index); } + else if (restriction_is_or_clause(rinfo)) + { + return match_orclause_to_indexcol(root, rinfo, indexcol, index); + } else if (index->amsearchnulls && IsA(clause, NullTest)) { NullTest *nt = (NullTest *) clause; @@ -2771,6 +2781,249 @@ match_rowcompare_to_indexcol(PlannerInfo *root, return NULL; } +/* + * match_orclause_to_indexcol() + * Handles the OR-expr case for match_clause_to_indexcol() in the case + * when it could be transformed to ScalarArrayOpExpr. + */ +static IndexClause * +match_orclause_to_indexcol(PlannerInfo *root, + RestrictInfo *rinfo, + int indexcol, + IndexOptInfo *index) +{ + ListCell *lc; + BoolExpr *orclause = (BoolExpr *) rinfo->orclause; + Node *indexExpr = NULL; + List *consts = NIL; + Node *arrayNode = NULL; + ScalarArrayOpExpr *saopexpr = NULL; + HeapTuple opertup; + Form_pg_operator operform; + Oid matchOpno = InvalidOid; + IndexClause *iclause; + Oid consttype = InvalidOid; + Oid arraytype = InvalidOid; + Oid inputcollid = InvalidOid; + bool firstTime = t
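For readers skimming the thread, the transformation described in the commit message above amounts to something like the following (hypothetical table and index; whether the planner actually picks an index scan depends on table contents and statistics, and the expected clause shape is paraphrased from the commit message rather than copied from the patch's regression output):

CREATE TABLE tbl (a int);
CREATE INDEX tbl_a_idx ON tbl (a);

-- With the patch, index matching can treat the OR chain as
--   a = ANY ('{1,2,3}'::integer[])
-- allowing a single index scan instead of a BitmapOr of several scans.
EXPLAIN (COSTS OFF)
SELECT * FROM tbl WHERE a = 1 OR a = 2 OR a = 3;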
Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
Attached patch has EXPLAIN ANALYZE display the total number of primitive index scans for all 3 kinds of index scan node. This is useful for index scans that happen to use SAOP arrays. It also seems almost essential to offer this kind of instrumentation for the skip scan patch [1]. Skip scan works by reusing all of the Postgres 17 work (see commit 5bf748b8) to skip over irrelevant sections of a composite index with a low cardinality leading column, so it has all the same issues. One reason to have this patch is to differentiate between similar cases involving simple SAOP arrays. The user will have some reasonable way of determining how a query such as this: pg@regression:5432 [2070325]=# explain (analyze, buffers, costs off, summary off) select abalance from pgbench_accounts where aid in (1, 2, 3, 4, 5); ┌──┐ │ QUERY PLAN │ ├──┤ │ Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual time=0.007..0.008 rows=5 loops=1) │ │ Index Cond: (aid = ANY ('{1,2,3,4,5}'::integer[])) │ │ Primitive Index Scans: 1 │ │ Buffers: shared hit=4 │ └──┘ (4 rows) ...differs from a similar query, such as this: pg@regression:5432 [2070325]=# explain (analyze, buffers, costs off, summary off) select abalance from pgbench_accounts where aid in (1000, 2000, 3000, 4000, 5000); ┌──┐ │ QUERY PLAN │ ├──┤ │ Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual time=0.006..0.012 rows=5 loops=1) │ │ Index Cond: (aid = ANY ('{1000,2000,3000,4000,5000}'::integer[])) │ │ Primitive Index Scans: 5 │ │ Buffers: shared hit=20 │ └──┘ (4 rows) Another reason to have this patch is consistency. We're only showing the user the number of times we've incremented pg_stat_user_tables.idx_scan in each case. The fact that pg_stat_user_tables.idx_scan counts primitive index scans like this is nothing new. That issue was only documented quite recently, as part of the Postgres 17 work, and it seems quite misleading. It's consistent, but not necessarily nonsurprising. Making it readily apparent that there is more than one primitive index scan involved here makes the issue less surprising. Skip scan - Here's an example with this EXPLAIN ANALYZE patch applied on top of my skip scan patch [1], using the tenk1 table left behind when the standard regression tests are run: pg@regression:5432 [2070865]=# create index on tenk1 (four, stringu1); CREATE INDEX pg@regression:5432 [2070865]=# explain (analyze, buffers, costs off, summary off) select stringu1 from tenk1 where -- Omitted: the leading column on "four" stringu1 = 'YG'; ┌───┐ │QUERY PLAN │ ├───┤ │ Index Only Scan using tenk1_four_stringu1_idx on tenk1 (actual time=0.011..0.017 rows=15 loops=1) │ │ Index Cond: (stringu1 = 'YG'::name) │ │ Heap Fetches: 0 │ │ Primitive Index Scans: 5 │ │ Buffers: shared hit=11 │ └───┘ (5 rows) Notice that there are 5 primitive index scans here. That's what I'd expect, given that there are exactly 4 distinct "logical subindexes" implied by our use of a leading column on "four" as the scan's skip column. Under the hood, an initial primitive index scan locates the lowest "four" value. There are then 4 additional primitive index scans to locate the next "four" value (needed when the current "four" value gets past the value's "stringu1 = 'YG'" tuples). Obviously, the cardinality of the leading column predicts the number of primitive index scans at runtime. But it can be much more complicated of a relationship than what I've shown here may suggest. Skewness matters, too. Small clusters of index tuple
Re: Make query cancellation keys longer
On Thu, Aug 15, 2024 at 1:13 PM Heikki Linnakangas wrote: > Added a "protocol_version" libpq option for that. It defaults to "auto", > but you can set it to "3.1" or "3.0" to force the version. It makes it > easier to test that the backwards-compatibility works, too. Over on the "Add new protocol message to change GUCs for usage with future protocol-only GUCs" thread there is a lot of relevant discussion about how bumping the protocol version should work. This thread shouldn't ignore all that discussion. Just to take one example, Jelte wants to bump the protocol version to 3.2, not 3.1, for reasons that are explained in the commit message for the relevant patch over there. -- Robert Haas EDB: http://www.enterprisedb.com
Re: ECPG cleanup and fix for clang compile-time problem
I wrote: > Thanks, done. Here's a revised patchset. The cfbot points out that I should probably have marked progname as "static" in 0008. I'm not going to repost the patchset just for that, though. regards, tom lane
Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
On Thu, 15 Aug 2024 at 21:23, Peter Geoghegan wrote: > > Attached patch has EXPLAIN ANALYZE display the total number of > primitive index scans for all 3 kinds of index scan node. This is > useful for index scans that happen to use SAOP arrays. It also seems > almost essential to offer this kind of instrumentation for the skip > scan patch [1]. Skip scan works by reusing all of the Postgres 17 work > (see commit 5bf748b8) to skip over irrelevant sections of a composite > index with a low cardinality leading column, so it has all the same > issues. Did you notice the patch over at [0], where additional diagnostic EXPLAIN output for btrees is being discussed, too? I'm asking, because I'm not very convinced that 'primitive scans' are a useful metric across all (or even: most) index AMs (e.g. BRIN probably never will have a 'primitive scans' metric that differs from the loop count), so maybe this would better be implemented in that framework? Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://www.postgresql.org/message-id/flat/TYWPR01MB10982D24AFA7CDC273445BFF0B1DC2%40TYWPR01MB10982.jpnprd01.prod.outlook.com#9c64cf75179da8d657a5eab7c75be480
Re: [Patch] add new parameter to pg_replication_origin_session_setup
Hello again, On Tue, Aug 13, 2024 at 12:48 AM Euler Taveira wrote: > I'm curious about your use case. Is it just because the internal function has > a > different signature or your tool is capable of apply logical replication > changes > in parallel using the SQL API? The latter is correct, it applies logical replication changes in parallel. Since multiple connections may commit, we need all of them to be able to advance the replication origin. > * no documentation changes. Since the function you are changing has a new > signature, this change should be reflected in the documentation. > * no need for a new internal function. The second parameter (PID) can be > optional and defaults to 0 in this case. See how we changed the > pg_create_logical_replication_slot along the years add some IN parameters like > twophase and failover in the recent versions. I updated/rewrote the patch to reflect these suggestions. It now has the same DEFAULT 0 style used in pg_create_logical_replication_slot. I also updated the documentation. > * add a CF entry [1] for this patch so we don't forget it. Another advantage > is > that this patch is covered by CI [2][3]. Sadly I still can't log in to the Commitfest due to the cool-off period. I will create an entry as soon as this period ends. Thanks for all the feedback, Doruk Yılmaz From b9c54f3d217f67c24ce74ffa7c1f2812d784333e Mon Sep 17 00:00:00 2001 From: Doruk Date: Thu, 15 Aug 2024 23:34:26 +0300 Subject: [PATCH] add new parameter to pg_replication_origin_session_setup --- doc/src/sgml/func.sgml | 2 +- src/backend/catalog/system_functions.sql | 9 - src/backend/replication/logical/origin.c | 8 +--- src/include/catalog/pg_proc.dat | 2 +- 4 files changed, 15 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5dd95d73a1..7db5a8ed52 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -29486,7 +29486,7 @@ DETAIL: Make sure pg_wal_replay_wait() isn't called within a transaction with a pg_replication_origin_session_setup -pg_replication_origin_session_setup ( node_name text ) +pg_replication_origin_session_setup ( node_name text, acquired_by integer DEFAULT 0 ) void diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql index 623b9539b1..4aae06e06d 100644 --- a/src/backend/catalog/system_functions.sql +++ b/src/backend/catalog/system_functions.sql @@ -639,6 +639,13 @@ LANGUAGE INTERNAL CALLED ON NULL INPUT VOLATILE PARALLEL SAFE AS 'pg_stat_reset_slru'; +CREATE OR REPLACE FUNCTION + pg_replication_origin_session_setup(node_name text, acquired_by integer DEFAULT 0) +RETURNS void +LANGUAGE INTERNAL +STRICT VOLATILE +AS 'pg_replication_origin_session_setup'; + -- -- The default permissions for functions mean that anyone can execute them. 
-- A number of functions shouldn't be executable by just anyone, but rather @@ -736,7 +743,7 @@ REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_progress(boolean) FROM REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_reset() FROM public; -REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_setup(text) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_setup(text,integer) FROM public; REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_reset() FROM public; diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c index 419e4814f0..e50bcc8466 100644 --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -1351,13 +1351,15 @@ pg_replication_origin_session_setup(PG_FUNCTION_ARGS) { char *name; RepOriginId origin; - + int pid; + replorigin_check_prerequisites(true, false); name = text_to_cstring((text *) DatumGetPointer(PG_GETARG_DATUM(0))); origin = replorigin_by_name(name, false); - replorigin_session_setup(origin, 0); - + pid = PG_GETARG_INT32(1); + replorigin_session_setup(origin, pid); + replorigin_session_origin = origin; pfree(name); diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 4abc6d9526..a490d4fc6e 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -11948,7 +11948,7 @@ { oid => '6006', descr => 'configure session to maintain replication progress tracking for the passed in origin', proname => 'pg_replication_origin_session_setup', provolatile => 'v', - proparallel => 'u', prorettype => 'void', proargtypes => 'text', + proparallel => 'u', prorettype => 'void', proargtypes => 'text int4', prosrc => 'pg_replication_origin_session_setup' }, { oid => '6007', descr => 'teardown configured replication progress tracking', -- 2.39.2
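To illustrate the signature the patch adds, here is a usage sketch (origin name and PID are made up; in the parallel-apply use case described above, acquired_by would be the PID of the session that already acquired the origin):

SELECT pg_replication_origin_create('my_origin');
-- the first session acquires the origin as before
SELECT pg_replication_origin_session_setup('my_origin');
-- a cooperating session passes that first session's PID as acquired_by
SELECT pg_replication_origin_session_setup('my_origin', 12345);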
Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
Hi! Thank you for your work on this subject! On 15.08.2024 22:22, Peter Geoghegan wrote: Attached patch has EXPLAIN ANALYZE display the total number of primitive index scans for all 3 kinds of index scan node. This is useful for index scans that happen to use SAOP arrays. It also seems almost essential to offer this kind of instrumentation for the skip scan patch [1]. Skip scan works by reusing all of the Postgres 17 work (see commit 5bf748b8) to skip over irrelevant sections of a composite index with a low cardinality leading column, so it has all the same issues. I think that it is enough to pass the IndexScanDesc parameter to the function - this saves us from having to define the planstate type twice. For this reason, I suggest some changes that I think may improve your patch. One reason to have this patch is to differentiate between similar cases involving simple SAOP arrays. The user will have some reasonable way of determining how a query such as this: pg@regression:5432 [2070325]=# explain (analyze, buffers, costs off, summary off) select abalance from pgbench_accounts where aid in (1, 2, 3, 4, 5); ┌──┐ │ QUERY PLAN │ ├──┤ │ Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual time=0.007..0.008 rows=5 loops=1) │ │ Index Cond: (aid = ANY ('{1,2,3,4,5}'::integer[])) │ │ Primitive Index Scans: 1 │ │ Buffers: shared hit=4 │ └──┘ (4 rows) ...differs from a similar query, such as this: pg@regression:5432 [2070325]=# explain (analyze, buffers, costs off, summary off) select abalance from pgbench_accounts where aid in (1000, 2000, 3000, 4000, 5000); ┌──┐ │ QUERY PLAN │ ├──┤ │ Index Scan using pgbench_accounts_pkey on pgbench_accounts (actual time=0.006..0.012 rows=5 loops=1) │ │ Index Cond: (aid = ANY ('{1000,2000,3000,4000,5000}'::integer[])) │ │ Primitive Index Scans: 5 │ │ Buffers: shared hit=20 │ └──┘ (4 rows) Another reason to have this patch is consistency. We're only showing the user the number of times we've incremented pg_stat_user_tables.idx_scan in each case. The fact that pg_stat_user_tables.idx_scan counts primitive index scans like this is nothing new. That issue was only documented quite recently, as part of the Postgres 17 work, and it seems quite misleading. It's consistent, but not necessarily nonsurprising. Making it readily apparent that there is more than one primitive index scan involved here makes the issue less surprising. Skip scan - Here's an example with this EXPLAIN ANALYZE patch applied on top of my skip scan patch [1], using the tenk1 table left behind when the standard regression tests are run: pg@regression:5432 [2070865]=# create index on tenk1 (four, stringu1); CREATE INDEX pg@regression:5432 [2070865]=# explain (analyze, buffers, costs off, summary off) select stringu1 from tenk1 where -- Omitted: the leading column on "four" stringu1 = 'YG'; ┌───┐ │QUERY PLAN │ ├───┤ │ Index Only Scan using tenk1_four_stringu1_idx on tenk1 (actual time=0.011..0.017 rows=15 loops=1) │ │ Index Cond: (stringu1 = 'YG'::name) │ │ Heap Fetches: 0 │ │ Primitive Index Scans: 5 │ │ Buffers: shared hit=11 │ └───┘ (5 rows) Notice that there are 5 primitive index scans here. That's what I'd expect, given that there are exactly 4 distinct "logical subindexes" implied by our use of a leading column on "four" as the scan's skip column. Under the hood, an initial primitive index scan locates the lowest "four" value. There are then 4 additional primitive index scans to locate the next "four" value (need
Re: pgsql: Introduce hash_search_with_hash_value() function
On Wed, Aug 07, 2024 at 02:01:30PM -0400, Robert Haas wrote: > Also, for the notes to be useful, we'd probably need some conventions > about how we, as a project, want to use them. If everyone does > something different, the result isn't likely to be all that great. What did you have in mind? Would it be sufficient to propose a template that, once ratified, would be available in the committing checklist [0]? [0] https://wiki.postgresql.org/wiki/Committing_checklist -- nathan
Re: [PoC] XMLCast (SQL/XML X025)
On 05.07.24 16:18, Jim Jones wrote: > On 02.07.24 18:02, Jim Jones wrote: >> It basically does the following: >> >> * When casting an XML value to a SQL data type, XML values containing >> XSD literals will be converted to their equivalent SQL data type. >> * When casting from a SQL data type to XML, the cast operand will be >> translated to its corresponding XSD data type. >> > v2 attached adds missing return for NO_XML_SUPPORT control path in > unescape_xml > v3 adds the missing XML passing mechanism BY VALUE and BY REF, as described in the XMLCast specification: XMLCAST ( AS [ ]) Tests and documentation were updated accordingly. -- Jim From 0679e82e9653183190a6af6c97de1887f567ef72 Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Thu, 15 Aug 2024 20:27:36 +0200 Subject: [PATCH v3] Add XMLCast function (SQL/XML X025) This function implements the SQL/XML function xmlcast, which enables conversions between SQL data types and the XML data type. XMLCAST ( expression AS type [ BY REF | BY VALUE ] ) When casting an XML value to a SQL data type, XML values containing XSD literals will be converted to their equivalent SQL data type. When casting from a SQL data type to XML, the cast operand will be translated to its corresponding XSD data type. This patch also includes documentation and regression tests. --- doc/src/sgml/datatype.sgml| 78 - src/backend/catalog/sql_features.txt | 2 +- src/backend/executor/execExprInterp.c | 83 - src/backend/nodes/nodeFuncs.c | 13 + src/backend/parser/gram.y | 22 +- src/backend/parser/parse_expr.c | 81 + src/backend/parser/parse_target.c | 7 + src/backend/utils/adt/ruleutils.c | 4 + src/backend/utils/adt/xml.c | 29 ++ src/include/nodes/parsenodes.h| 8 + src/include/nodes/primnodes.h | 3 + src/include/parser/kwlist.h | 1 + src/include/utils/xml.h | 1 + src/test/regress/expected/xml.out | 429 ++ src/test/regress/expected/xml_1.out | 346 + src/test/regress/expected/xml_2.out | 429 ++ src/test/regress/sql/xml.sql | 210 + src/tools/pgindent/typedefs.list | 1 + 18 files changed, 1739 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index e0d33f12e1..28c93460a5 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -4472,14 +4472,84 @@ XMLPARSE ( { DOCUMENT | CONTENT } value) XMLPARSE (DOCUMENT 'Manual...') XMLPARSE (CONTENT 'abcbarfoo') ]]> -While this is the only way to convert character strings into XML -values according to the SQL standard, the PostgreSQL-specific -syntaxes: + +Another option to convert character strings into xml is the function xmlcast, +which is designed to cast SQL data types into xml, and vice versa. + +XMLCAST ( expression AS type [ BY REF | BY VALUE ] ) + +Similar to the SQL function CAST, this function converts an expression +into the specified type. This can be useful for creating XML +documents using SQL or when parsing the contents of XML documents. The function xmlcast works with the +following criteria: + + + + + Either expression or type must be of type xml. + + + + + It supports casting between xml and character, numeric, date/time, and boolean data types. + + + + + Similar to the function xmltext, expressions containing XML predifined entities + will be escaped (see examples below). + + + + + The expressions of type date, time [with time zone], timestamp [with time zone], + and interval will be converted to their XSD equivalents, xs:date, xs:time, + xs:dateTime, and xs:duration, respectively. 
+ + + + + The BY REF and BY VALUE clauses + are accepted but ignored, as discussed in + . + + + + + Examples: + + +Alternatively, it is also possible to convert character strings into XML using PostgreSQL-specific cast syntaxes: -can also be used. + diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt index c002f37202..a4d8f7a2ac 100644 --- a/src/backend/catalog/sql_features.txt +++ b/src/backend/catalog/sql_features.txt @@ -624,7 +624,7 @@ X014 Attributes of XML type YES X015 Fields of XML type NO X016 Persistent XML values YES X020 XMLConcat YES -X025 XMLCast NO +X025 XMLCast YES X030 XMLDocument NO X031 XMLElement YES X032 XMLForest YES diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c index ea47c4d6f9..1eb2231aa1 100644 --- a/src/backend/executor/execExprInterp.c +++ b/src/backend/executor/execExprInterp.c @@ -69,6 +69,7 @@
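To make the documented behaviour above easier to picture, a few usage sketches derived from the criteria listed in the patch's documentation (illustrative values only, not taken from the patch's regression tests):

-- character strings cast to xml have predefined entities escaped, as with xmltext()
SELECT XMLCAST('foo & <bar>' AS xml);
-- date/time values come out as their XSD lexical forms (here xs:dateTime)
SELECT XMLCAST(now() AS xml);
-- an XSD literal held in xml converts back to the corresponding SQL type
SELECT XMLCAST(XMLCAST('42' AS xml) AS integer);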
Re: Remaining dependency on setlocale()
On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut wrote: > On 15.08.24 00:43, Thomas Munro wrote: > > "" is a problem however... the special value for "native environment" > > is returned as a real locale name, which we probably still need in > > places. We could change that to newlocale("") + query instead, but > > Where do we need that in the server? Hmm. Yeah, right, the only way I've found so far to even reach that code and capture that result is: create database db2 locale = ''; That puts 'en_NZ.UTF-8' or whatever in pg_database. In contrast, create collation will accept '' but just store it verbatim, and the GUCs for changing time, monetary, numeric accept it too and keep it verbatim. We could simply ban '' in all user commands. I doubt they're documented as acceptable values, once you get past initdb and have a running system. Looking into that... > It should just be initdb doing that and then initializing the server > with concrete values based on that. Right. > I guess technically some of these GUC settings default to the > environment? But I think we could consider getting rid of that. Yeah.
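For reference, the places mentioned above that currently accept the empty string look like this (a sketch, not an exhaustive list):

CREATE DATABASE db2 LOCALE = '';    -- resolved to a concrete name such as 'en_NZ.UTF-8' and stored in pg_database
CREATE COLLATION c2 (locale = '');  -- accepted, but stored verbatim
SET lc_monetary = '';               -- accepted and kept verbatim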
Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
On Thu, Aug 15, 2024 at 4:34 PM Matthias van de Meent wrote: > > Attached patch has EXPLAIN ANALYZE display the total number of > > primitive index scans for all 3 kinds of index scan node. This is > > useful for index scans that happen to use SAOP arrays. It also seems > > almost essential to offer this kind of instrumentation for the skip > > scan patch [1]. Skip scan works by reusing all of the Postgres 17 work > > (see commit 5bf748b8) to skip over irrelevant sections of a composite > > index with a low cardinality leading column, so it has all the same > > issues. > > Did you notice the patch over at [0], where additional diagnostic > EXPLAIN output for btrees is being discussed, too? To be clear, for those that haven't been paying attention to that other thread: that other EXPLAIN patch (the one authored by Masahiro Ikeda) surfaces information about a distinction that the skip scan patch renders obsolete. That is, the skip scan patch makes all "Non Key Filter" quals into quals that can relocate the scan to a later leaf page by starting a new primitive index scan. Technically, skip scan removes the concept that that patch calls "Non Key Filter" altogether. Note that this isn't the same thing as making that other patch obsolete. Skip scan renders the whole concept of "Non Key Filter" obsolete *in name only*. You might prefer to think of it as making that whole concept squishy. Just because we can theoretically use the leading column to skip doesn't mean we actually will. It isn't an either/or thing. We might skip during some parts of a scan, but not during other parts. It's just not clear how to handle those sorts of fuzzy distinctions right now. It does seem worth pursuing, but I see no conflict. > I'm asking, because > I'm not very convinced that 'primitive scans' are a useful metric > across all (or even: most) index AMs (e.g. BRIN probably never will > have a 'primitive scans' metric that differs from the loop count), so > maybe this would better be implemented in that framework? What do you mean by "within that framework"? They seem orthogonal? It's true that BRIN index scans will probably never show more than a single primitive index scan. I don't think that the same is true of any other index AM, though. Don't they all support SAOPs, albeit non-natively? The important question is: what do you want to do about cases like the BRIN case? Our choices are all fairly obvious choices. We can be selective, and *not* show this information when a set of heuristics indicate that it's not relevant. This is fairly straightforward to implement. Which do you prefer: overall consistency, or less verbosity? Personally I think that the consistency argument works in favor of displaying this information for every kind of index scan. That's a hopelessly subjective position, though. -- Peter Geoghegan
Re: Partial aggregates pushdown
On Thu, Aug 8, 2024 at 01:48:49PM +0200, Jelte Fennema-Nio wrote: > SUMMARY OF THREAD > > The design of patch 0001 is agreed upon by everyone on the thread (so > far). This adds the PARTIAL_AGGREGATE label for aggregates, which will > cause the finalfunc not to run. It also starts using PARTIAL_AGGREGATE > for pushdown of aggregates in postgres_fdw. In 0001 PARTIAL_AGGREGATE > is only supported for aggregates with a non-internal/pseudo type as > the stype. > > The design for patch 0002 is still under debate. This would expand on > the functionality added by adding support for PARTIAL_AGGREGATE for > aggregates with an internal stype. This is done by returning a byte > array containing the bytes that the serialfunc of the aggregate > returns. > > A competing proposal for 0002 is to instead change aggregates to not > use an internal stype anymore, and create dedicated types. The main > downside here is that infunc and outfunc would need to be added for > text serialization, in addition to the binary serialization. An open > question is: Can we change the requirements for CREATE TYPE, so that > types can be created without infunc and outfunc. > > WHAT IS NEEDED? > > The things needed for this patch are that docs need to be added, and > detailed codereview needs to be done. > > Feedback from more people on the two competing proposals for 0002 > would be very helpful in making a decision. First, I am sorry to be replying so late --- I have been traveling for the past four weeks. Second, I consider this feature a big part of sharding, and I think sharding is Postgres's biggest missing feature. I talk about this patch often when asked about what Postgres is working on. Third, I would like to show a more specific example to clarify what is being considered above. If we look at MAX(), we can have FDWs return the max for each FDW, and the coordinator can chose the highest value. This is the patch 1 listed above. These can return the pg_aggregate.aggtranstype data type using the pg_type.typoutput text output. The second case is for something like AVG(), which must return the SUM() and COUNT(), and we currently have no way to return multiple text values on the wire. For patch 0002, we have the option of creating functions that can do this and record them in new pg_attribute columns, or we can create a data type with these functions, and assign the data type to pg_aggregate.aggtranstype. Is that accurate? -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com Only you can decide what is important to you.
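To spell out the MAX()/AVG() difference with a sketch (hypothetical shard tables shard1_t and shard2_t; this is what the coordinator effectively has to compute, independent of how the partial values travel on the wire):

-- MAX(): each shard's partial result is a single, directly comparable value
SELECT max(shard_max)
FROM (SELECT max(x) AS shard_max FROM shard1_t
      UNION ALL
      SELECT max(x) FROM shard2_t) s;

-- AVG(): the partial state is (sum, count), so two values are needed per shard
SELECT sum(shard_sum)::numeric / sum(shard_count)
FROM (SELECT sum(x) AS shard_sum, count(x) AS shard_count FROM shard1_t
      UNION ALL
      SELECT sum(x), count(x) FROM shard2_t) s;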
Re: Restart pg_usleep when interrupted
On Wed, Aug 14, 2024 at 06:00:06AM +, Bertrand Drouvot wrote: > I gave it more thoughts and I don't think we have to choose between the two. > The 1 Hz approach reduces the number of interrupts and Sami's patch provides a > way to get "accurate" delay in case of interrupts. I think both have their own > benefit. Is it really that important to delay with that level of accuracy? In most cases, the chances of actually interrupting a given vacuum delay point are pretty small. Even in the extreme scenario you tested with ~350K interrupts in a 19 minute vacuum, you only saw a 10-15% difference in total time. I wouldn't say I'm diametrically opposed to this patch, but I do think we need to carefully consider whether it's worth the extra code. Separately, I've been wondering whether it's worth allowing the sleep to be interrupted in certain cases, such as SIGINT and SIGTERM. That should address one of Heikki's points. -- nathan
Re: Remove dependence on integer wrapping
I've committed 0001. Now to 0002... - if (-element > nelements) + if (element == PG_INT32_MIN || -element > nelements) This seems like a good opportunity to use our new pg_abs_s32() function, and godbolt.org [0] seems to indicate that it might produce better code, too (at least to my eye). I've attached an updated version of the patch with this change. Barring additional feedback, I plan to commit this one shortly. [0] https://godbolt.org/z/57P4vvGYf -- nathan >From 62677b351cf18dee37dc0d9253bd694fe6fbf26a Mon Sep 17 00:00:00 2001 From: Joseph Koshakow Date: Sat, 6 Jul 2024 15:41:09 -0400 Subject: [PATCH v23 1/1] Remove dependence on integer wrapping for jsonb This commit updates various jsonb operators and functions to no longer rely on integer wrapping for correctness. Not all compilers support -fwrapv, so it's best not to rely on it. --- src/backend/utils/adt/jsonfuncs.c | 7 --- src/test/regress/expected/jsonb.out | 12 src/test/regress/sql/jsonb.sql | 2 ++ 3 files changed, 18 insertions(+), 3 deletions(-) diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c index 5ecb9fffae..1f8ea51e6a 100644 --- a/src/backend/utils/adt/jsonfuncs.c +++ b/src/backend/utils/adt/jsonfuncs.c @@ -18,6 +18,7 @@ #include "access/htup_details.h" #include "catalog/pg_type.h" +#include "common/int.h" #include "common/jsonapi.h" #include "common/string.h" #include "fmgr.h" @@ -946,7 +947,7 @@ jsonb_array_element(PG_FUNCTION_ARGS) { uint32 nelements = JB_ROOT_COUNT(jb); - if (-element > nelements) + if (pg_abs_s32(element) > nelements) PG_RETURN_NULL(); else element += nelements; @@ -5426,7 +5427,7 @@ setPathArray(JsonbIterator **it, Datum *path_elems, bool *path_nulls, if (idx < 0) { - if (-idx > nelems) + if (pg_abs_s32(idx) > nelems) { /* * If asked to keep elements position consistent, it's not allowed @@ -5438,7 +5439,7 @@ setPathArray(JsonbIterator **it, Datum *path_elems, bool *path_nulls, errmsg("path element at position %d is out of range: %d", level + 1, idx))); else - idx = INT_MIN; + idx = PG_INT32_MIN; } else idx = nelems + idx; diff --git a/src/test/regress/expected/jsonb.out b/src/test/regress/expected/jsonb.out index e66d760189..a9d93052fc 100644 --- a/src/test/regress/expected/jsonb.out +++ b/src/test/regress/expected/jsonb.out @@ -680,6 +680,18 @@ select '"foo"'::jsonb -> 'z'; (1 row) +select '[]'::jsonb -> -2147483648; + ?column? +-- + +(1 row) + +select jsonb_delete_path('{"a":[]}', '{"a",-2147483648}'); + jsonb_delete_path +--- + {"a": []} +(1 row) + select '{"a": [{"b": "c"}, {"b": "cc"}]}'::jsonb ->> null::text; ?column? -- diff --git a/src/test/regress/sql/jsonb.sql b/src/test/regress/sql/jsonb.sql index 97bc2242a1..6a18577ead 100644 --- a/src/test/regress/sql/jsonb.sql +++ b/src/test/regress/sql/jsonb.sql @@ -204,6 +204,8 @@ select '[{"b": "c"}, {"b": "cc"}]'::jsonb -> 'z'; select '{"a": "c", "b": null}'::jsonb -> 'b'; select '"foo"'::jsonb -> 1; select '"foo"'::jsonb -> 'z'; +select '[]'::jsonb -> -2147483648; +select jsonb_delete_path('{"a":[]}', '{"a",-2147483648}'); select '{"a": [{"b": "c"}, {"b": "cc"}]}'::jsonb ->> null::text; select '{"a": [{"b": "c"}, {"b": "cc"}]}'::jsonb ->> null::int; -- 2.39.3 (Apple Git-146)
Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
On Thu, Aug 15, 2024 at 4:58 PM Alena Rybakina wrote: > I think that it is enough to pass the IndexScanDesc parameter to the function > - this saves us from having to define the planstate type twice. > > For this reason, I suggest some changes that I think may improve your patch. Perhaps it's a little better that way. I'll consider it. > To be honest, I don't quite understand how information in explain analyze > about the number of used primitive indexes > will help me improve my database system as a user. Perhaps I'm missing > something. There is probably no typical case. The patch shows implementation details, which need to be interpreted in the context of a particular problem. Maybe the problem is that some of the heuristics added by one of my nbtree patches interact relatively badly with some real world query. It would be presumptuous of me to say that that will never happen. > Maybe it can tell me which columns are best to create an index on or > something like that? That's definitely going to be important in the case of skip scan. Simply showing the user that the index scan skips at all will make them aware that there are missing index columns. That could be a sign that they'd be better off not using skip scan at all, by creating a new index that suits the particular query (by not having the extra skipped column). It's almost always possible to beat skip scan by creating a new index -- whether or not it's worth the trouble/expense of maintaining a whole new index is the important question. Is this particular query the most important query *to the business*, for whatever reason? Or is having merely adequate performance acceptable? Your OR-to-SAOP-rewrite patch effectively makes two or more bitmap index scans into one single continuous index scan. Or...does it? The true number of (primitive) index scans might be "the same" as it was before (without your patch), or there might really only be one (primitive) index scan with your patch. Or it might be anywhere in between those two extremes. Users will benefit from knowing where on this continuum a particular index scan falls. It's just useful to know where time is spent. Knowing this information might even allow the user to create a new multicolumn index, with columns in an order better suited to an affected query. It's not so much the cost of descending the index multiple times that we need to worry about here, even though that's what we're talking about counting here. Varying index column order could make an index scan faster by increasing locality. Locality is usually very important. Few index scans is a good proxy for greater locality. It's easiest to understand what I mean about locality with an example. An index on (a, b) is good for queries with quals such as "where a = 42 and b in (1,2,3,4,5,6,7,8,9)" if it allows such a query to only access one or two leaf pages, where all of the "b" values of interest live side by side. Obviously that won't be true if it's the other way around -- if the typical qual looks more like "where b = 7 and a in (1,2,3,4,5,6,7,8,9)". This is the difference between 1 primitive index scan and 9 primitive index scans -- quite a big difference. Note that the main cost we need to worry about here *isn't* the cost of descending the index. It's mostly the cost of reading the leaf pages. -- Peter Geoghegan
Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
On Thu, 15 Aug 2024 at 23:10, Peter Geoghegan wrote: > > On Thu, Aug 15, 2024 at 4:34 PM Matthias van de Meent > wrote: > > > Attached patch has EXPLAIN ANALYZE display the total number of > > > primitive index scans for all 3 kinds of index scan node. This is > > > useful for index scans that happen to use SAOP arrays. It also seems > > > almost essential to offer this kind of instrumentation for the skip > > > scan patch [1]. Skip scan works by reusing all of the Postgres 17 work > > > (see commit 5bf748b8) to skip over irrelevant sections of a composite > > > index with a low cardinality leading column, so it has all the same > > > issues. > > > > Did you notice the patch over at [0], where additional diagnostic > > EXPLAIN output for btrees is being discussed, too? > > To be clear, for those that haven't been paying attention to that > other thread: that other EXPLAIN patch (the one authored by Masahiro > Ikeda) surfaces information about a distinction that the skip scan > patch renders obsolete. That is, the skip scan patch makes all "Non > Key Filter" quals into quals that can relocate the scan to a later > leaf page by starting a new primitive index scan. Technically, skip > scan removes the concept that that patch calls "Non Key Filter" > altogether. > > Note that this isn't the same thing as making that other patch > obsolete. Skip scan renders the whole concept of "Non Key Filter" > obsolete *in name only*. You might prefer to think of it as making > that whole concept squishy. Just because we can theoretically use the > leading column to skip doesn't mean we actually will. It isn't an > either/or thing. We might skip during some parts of a scan, but not > during other parts. Yes. > It's just not clear how to handle those sorts of fuzzy distinctions > right now. It does seem worth pursuing, but I see no conflict. > > > I'm asking, because > > I'm not very convinced that 'primitive scans' are a useful metric > > across all (or even: most) index AMs (e.g. BRIN probably never will > > have a 'primitive scans' metric that differs from the loop count), so > > maybe this would better be implemented in that framework? > > What do you mean by "within that framework"? They seem orthogonal? What I meant was putting this 'primitive scans' info into the AM-specific explain callback as seen in the latest patch version. > It's true that BRIN index scans will probably never show more than a > single primitive index scan. I don't think that the same is true of > any other index AM, though. Don't they all support SAOPs, albeit > non-natively? Not always. For Bitmap Index Scan the node's functions can allow non-native SAOP support (it ORs the bitmaps), but normal indexes without SAOP support won't get SAOP-functionality from the IS/IOS node's infrastructure, it'll need to be added as Filter. > The important question is: what do you want to do about cases like the > BRIN case? Our choices are all fairly obvious choices. We can be > selective, and *not* show this information when a set of heuristics > indicate that it's not relevant. This is fairly straightforward to > implement. Which do you prefer: overall consistency, or less > verbosity? Consistency, I suppose. But adding explain attributes left and right in Index Scan's explain output when and where every index type needs them doesn't scale, so I'd put index-specific output into it's own system (see the linked thread for more rationale). And, in this case, the use case seems quite index-specific, at least for IS/IOS nodes. 
> Personally I think that the consistency argument works in favor of > displaying this information for every kind of index scan. Agreed, assuming "this information" is indeed shared (and useful) across all AMs. This made me notice that you add a new metric that should generally be exactly the same as pg_stat_all_indexes.idx_scan (you mention the same). Can't you pull that data, instead of inventing a new place every AM needs to touch for its metrics? Kind regards, Matthias van de Meent Neon (https://neon.tech)
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On 14/08/2024 21:04, Jelte Fennema-Nio wrote: On Wed, 7 Aug 2024 at 22:10, Robert Haas wrote: I respect that, but I don't want to get flamed for doing something that might be controversial without anybody else endorsing it. I'll commit this if it gets some support, but not otherwise. I'm willing to commit a patch adding a new function even if nobody else votes, but not this. Makes sense. I'm not in too much of a hurry with this specific one. So I'll leave it like this for now and hopefully someone else responds. If this becomes close to being the final unmerged patch of this patchset, I'll probably cut my losses and create a patch that adds a function instead. I think Jelte's proposal on PQprotocolVersion() is OK. As he pointed out, the function is pretty useless as it is, so I doubt anyone is doing anything interesting with it. Perhaps we should even change it to return 30 for protocol version 3.0, and just leave a note in the docs like "in older versions of libpq, this returned 3 for protocol version 3.0". On Wed, 7 Aug 2024 at 22:10, Robert Haas wrote: > > Patch 7: Bump the protocol version to 3.2 (see commit message for why > > not bumping to 3.1) > > Good commit message. The point is arguable, so putting forth your best > argument is important. Just to clarify: do you agree with the point now, after that argument? Well, here again, I would like to know what other people think. It doesn't intrinsically matter to me that much what we do here, but it matters to me a lot that extensive recriminations don't ensue afterwards. Makes sense to me. It's sad that pgbouncer had such a bug, but it makes sense to accommodate it. We're not going to run out of integers. This deserves some more commentary in the docs I think. If I understand the plan correctly, if the client requests version 3.1, the server accepts it, but behaves exactly the same as 3.0. Or should 3.1 be forbidden altogether? On the default for "max_protocol_version": I'm pretty disappointed if we cannot change the default to "latest". I realize that that won't work with poolers that don't implement NegotiateProtocolVersion. But I'm afraid if we make the new protocol version opt-in, no one will use it, and the poolers etc still won't bother to implement NegotiateProtocolVersion for years to come, if ever. We can revisit this decision later in the release cycle though. But I'd suggest changing the default to "latest" for now, so that more hackers working with connection poolers will notice, and we get more testing of the new protocol and the negotiation. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Make query cancellation keys longer
On 15/08/2024 23:20, Robert Haas wrote: On Thu, Aug 15, 2024 at 1:13 PM Heikki Linnakangas wrote: Added a "protocol_version" libpq option for that. It defaults to "auto", but you can set it to "3.1" or "3.0" to force the version. It makes it easier to test that the backwards-compatibility works, too. Over on the "Add new protocol message to change GUCs for usage with future protocol-only GUCs" there is a lot of relevant discussion about how bumping the protocol version should work. This thread shouldn't ignore all that discussion. Just to take one example, Jelte wants to bump the protocol version to 3.2, not 3.1, for some reasons that are in the commit message for the relevant patch over there. Ok, I've read through that thread now, and opined there too. One difference is with libpq option name: My patch adds "protocol_version", while Jelte proposes "max_protocol_version". I don't have strong opinions on that. I hope the ecosystem catches up to support NegotiateProtocolVersion quickly, so that only few people will need to set this option. In particular, I hope that there will never be need to use "max_protocol_version=3.2", because by the time we introduce version 3.3, all the connection poolers that support 3.2 will also implement NegotiateProtocolVersion. -- Heikki Linnakangas Neon (https://neon.tech)
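For what it's worth, forcing the old protocol from a client would then look something like this. This is only a sketch: it assumes the libpq option keeps the name "protocol_version" used in the patch described above, which may still change given the naming discussion.

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* "protocol_version" here is the not-yet-committed option from the patch */
    PGconn *conn = PQconnectdb("host=localhost dbname=test protocol_version=3.0");

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));

    PQfinish(conn);
    return 0;
}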
Re: Showing primitive index scan count in EXPLAIN ANALYZE (for skip scan and SAOP scans)
On Thu, Aug 15, 2024 at 5:47 PM Matthias van de Meent wrote: > > > I'm asking, because > > > I'm not very convinced that 'primitive scans' are a useful metric > > > across all (or even: most) index AMs (e.g. BRIN probably never will > > > have a 'primitive scans' metric that differs from the loop count), so > > > maybe this would better be implemented in that framework? > > > > What do you mean by "within that framework"? They seem orthogonal? > > What I meant was putting this 'primitive scans' info into the > AM-specific explain callback as seen in the latest patch version. I don't see how that could work. This is fundamentally information that is only known when the query has fully finished execution. Again, this is already something that we track at the whole-table level, within pg_stat_user_tables.idx_scan. It's already considered index AM agnostic information, in that sense. > > It's true that BRIN index scans will probably never show more than a > > single primitive index scan. I don't think that the same is true of > > any other index AM, though. Don't they all support SAOPs, albeit > > non-natively? > > Not always. For Bitmap Index Scan the node's functions can allow > non-native SAOP support (it ORs the bitmaps), but normal indexes > without SAOP support won't get SAOP-functionality from the IS/IOS > node's infrastructure, it'll need to be added as Filter. Again, what do you want me to do about it? Almost anything is possible in principle, and can be implemented without great difficulty. But you have to clearly say what you want, and why you want it. Yeah, non-native SAOP index scans are always bitmap scans. In the case of GIN, there are only lossy/bitmap index scans, anyway -- can't see that ever changing. In the case of GiST, we could in the future add native SAOP support, so do we really want to be inconsistent in what we show now? (Tom said something about that recently, in fact.) I don't hate the idea of selectively not showing this information (for BRIN, say). Just as I don't hate the idea of totally omitting "loops=1" in the common case where we couldn't possibly be more than one loop in practice. It's just that I don't think that it's worth it, on balance. Not all redundancy is bad. > > The important question is: what do you want to do about cases like the > > BRIN case? Our choices are all fairly obvious choices. We can be > > selective, and *not* show this information when a set of heuristics > > indicate that it's not relevant. This is fairly straightforward to > > implement. Which do you prefer: overall consistency, or less > > verbosity? > > Consistency, I suppose. But adding explain attributes left and right > in Index Scan's explain output when and where every index type needs > them doesn't scale, so I'd put index-specific output into it's own > system (see the linked thread for more rationale). I can't argue with that. I just don't think it's directly relevant. > And, in this case, > the use case seems quite index-specific, at least for IS/IOS nodes. I disagree. It's an existing concept, exposed in system views, and now in EXPLAIN ANALYZE. It's precisely that -- nothing more, nothing less. The fact that it tends to be much more useful in the case of nbtree (at least for now) makes this no less true. > This made me notice that you add a new metric that should generally be > exactly the same as pg_stat_all_indexes.idx_scan (you mention the > same). I didn't imagine that that part was subtle. 
> Can't you pull that data, instead of inventing a new place > every AM needs to touch for its metrics? No. At least not in a way that's scoped to a particular index scan. -- Peter Geoghegan
Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs
On Thu, Aug 15, 2024 at 3:04 PM Heikki Linnakangas wrote: > Perhaps we should even change it to return > 30 for protocol version 3.0, and just leave a note in the docs like > "in older versions of libpq, this returned 3 for protocol version 3.0". I think that would absolutely break current code. It's not uncommon (IME) for hand-built clients wrapping libpq to make sure they're not talking v2 before turning on some feature, and they're allowed to do that with a PQprotocolVersion() == 3 check. A GitHub code search brings up examples. As for 30001: I don't see the value in modifying an exported API in this way. Especially since we can add a new entry point that will be guaranteed not to break anyone, as Robert suggested. I think it's a POLA violation at minimum; my understanding was that up until this point, the value was incremented during major (incompatible) version bumps. And I think other users will have had the same understanding. --Jacob
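For reference, this is the sort of existing check meant above -- a minimal sketch of client code wrapping libpq:

#include <libpq-fe.h>

void
enable_v3_features(PGconn *conn)
{
    /*
     * Hand-built clients commonly gate protocol-3-only behaviour like this
     * today; returning 30001 instead of 3 would break the equality test.
     */
    if (PQstatus(conn) == CONNECTION_OK && PQprotocolVersion(conn) == 3)
    {
        /* safe to rely on v3 features (extended query protocol, etc.) */
    }
}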
Re: libpq: Fix lots of discrepancies in PQtrace
Hello, On 2024-Aug-14, Jelte Fennema-Nio wrote: > The following removed comments seems useful to keep (I realize I > already removed them in a previous version of the patch, but I don't > think I did that on purpose) > [...] Ah, yeah, I agree. I put them back, and pushed 0005, 6 and 7 as a single commit. It didn't seem worth pushing each separately, really. I added two lines for the CopyData message as well, since otherwise the output shows the "mismatched length" error when getting COPY data. I'm leaving 0008 to whoever is doing the NegotiateProtocolVersion stuff; maybe post that one in that thread you mentioned. I'll mark this CF entry committed. Many thanks! -- Álvaro HerreraBreisgau, Deutschland — https://www.EnterpriseDB.com/
Re: PATCH: Add hooks for pg_total_relation_size and pg_indexes_size
On Fri, 9 Aug 2024 at 18:10, Andreas Karlsson wrote: > On 8/8/24 2:18 PM, Abdoulaye Ba wrote: > > I am submitting a patch to add hooks for the functions > > pg_total_relation_size and pg_indexes_size. These hooks allow for custom > > behaviour to be injected into these functions, which can be useful for > > extensions and other custom PostgreSQL modifications. > > What kind of extensions do you imagine would use this hook? If it is for > custom index AMs I think adding this to the index AM interface would > make more sense than just adding a generic callback hook. > > Andreas > > The primary use case for this hook is to allow extensions to account for > additional storage mechanisms that are not covered by the default > PostgreSQL relation size calculations. For instance, in our project, we are > working with an external indexing system (Tantivy) that maintains > additional data structures outside the standard PostgreSQL storage. This > hook allows us to include the size of these additional structures in the > total relation size calculations. > > While I understand your suggestion about custom index AMs, the intent > behind this hook is broader. It is not limited to custom index types but > can also be used for other forms of external storage that need to be > accounted for in relation size calculations. This is why a generic callback > hook was chosen over extending the index AM interface. > > However, if there is a consensus that such a hook would be better suited > within the index AM interface for cases involving custom index storage, I'm > open to discussing this further and exploring how it could be integrated > more tightly with the existing PostgreSQL AM framework. >
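To illustrate the kind of extension usage being described, here is a purely hypothetical sketch following the usual PostgreSQL hook pattern. The hook name, its signature, and the external_index_bytes() helper are all invented for this example and are not taken from the actual patch:

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* Hypothetical hook shape; the real one is whatever the patch defines. */
typedef int64 (*total_relation_size_hook_type) (Oid relid, int64 size);
extern total_relation_size_hook_type total_relation_size_hook;     /* invented */

/* Invented helper: sums the on-disk files the external indexing system keeps. */
extern int64 external_index_bytes(Oid relid);

static total_relation_size_hook_type prev_total_relation_size_hook = NULL;

static int64
tantivy_total_relation_size(Oid relid, int64 size)
{
    /* chain to any previously installed hook first */
    if (prev_total_relation_size_hook)
        size = prev_total_relation_size_hook(relid, size);

    /* add the storage maintained outside PostgreSQL for this relation */
    return size + external_index_bytes(relid);
}

void
_PG_init(void)
{
    prev_total_relation_size_hook = total_relation_size_hook;
    total_relation_size_hook = tantivy_total_relation_size;
}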
Re: Statistics Import and Export
On Thu, 2024-08-15 at 01:57 -0700, Jeff Davis wrote: > On Sun, 2024-08-04 at 01:09 -0400, Corey Huinker wrote: > > > > > I believe 0001 and 0002 are in good shape API-wise, and I can > > > start > > > getting those committed. I will try to clean up the code in the > > > process. > > Attached v26j. I did a lot of refactoring, and it's starting to take the shape I had in mind. Some of it is surely just style preference, but I think it reads more nicely and I caught a couple bugs along the way. The function attribute_statsitics_update() is significantly shorter. (Thank you for a good set of tests, by the way, which sped up the refactoring process.) Attached v27j. Questions: * Remind me why the new stats completely replace the new row, rather than updating only the statistic kinds that are specified? * I'm not sure what the type_is_scalar() function was doing before, but I just removed it. If it can't find the element type, then it skips over the kinds that require it. * I introduced some hard errors. These happen when it can't find the table, or the attribute, or doesn't have permissions. I don't see any reason to demote those to a WARNING. Even for the restore case, analagous errors happen for COPY, etc. * I'm still sorting through some of the type info derivations. I think we need better explanations about why it's doing exactly the things it's doing, e.g. for tsvector and multiranges. Regards, Jeff Davis From 93bb8e2dfeab17d5291273e6343f61e40e501633 Mon Sep 17 00:00:00 2001 From: Corey Huinker Date: Wed, 24 Jul 2024 23:45:26 -0400 Subject: [PATCH v27j 1/2] Create function pg_set_relation_stats. This function is used to tweak statistics on any relation that the user owns. The first parameter, relation, is used to identify the the relation to be modified. The remaining parameters correspond to the statistics attributes in pg_class: relpages, reltuples, and relallisvible. This function allows the user to tweak pg_class statistics values, allowing the user to inflate rowcounts, table size, and visibility to see what effect those changes will have on the the query planner. The function has no return value. Author: Corey Huinker Discussion: https://postgr.es/m/CADkLM=e0jM7m0GFV44nNtvL22CtnFUu+pekppCVE=dobe58...@mail.gmail.com --- doc/src/sgml/func.sgml | 65 +++ src/backend/statistics/Makefile| 1 + src/backend/statistics/import_stats.c | 207 + src/backend/statistics/meson.build | 1 + src/include/catalog/pg_proc.dat| 9 + src/test/regress/expected/stats_import.out | 99 ++ src/test/regress/parallel_schedule | 2 +- src/test/regress/sql/stats_import.sql | 79 8 files changed, 462 insertions(+), 1 deletion(-) create mode 100644 src/backend/statistics/import_stats.c create mode 100644 src/test/regress/expected/stats_import.out create mode 100644 src/test/regress/sql/stats_import.sql diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 5dd95d73a1a..02d93f6c925 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -29938,6 +29938,71 @@ DETAIL: Make sure pg_wal_replay_wait() isn't called within a transaction with a in identifying the specific disk files associated with database objects. + +Database Object Statistics Import Functions + + + + +Function + + +Description + + + + + + + + + + pg_set_relation_stats + + pg_set_relation_stats ( relation regclass, relpages integer, reltuples real, relallvisible integer ) + void + + + Updates table-level statistics for the given relation to the + specified values. The parameters correspond to columns in pg_class. 
+ + + Ordinarily, these statistics are collected automatically or updated + as a part of or , so it's not necessary to call this + function. However, it may be useful when testing the effects of + statistics on the planner to understand or anticipate plan changes. + + + The caller must have the MAINTAIN privilege on + the table or be the owner of the database. + + + The value of relpages must be greater than + or equal to 0, + reltuples must be greater than or equal to + -1.0, and relallvisible + must be greater than or equal to 0. + + + + Changes made by this function are likely to be overwritten by autovacuum (or manual + VACUUM or ANALYZE) and should + be considered temporary. + + + The signature of this function may change in new major releases if + there are changes to the statistics being tracked. + + + + + + + + Databas
Re: define PG_REPLSLOT_DIR
On 2024-Aug-14, Bertrand Drouvot wrote: > Out of curiosity, checking the sizes of affected files (O2, no debug): > > with patch: > >textdata bss dec hex filename > 30304 0 8 303127668 > src/backend/replication/logical/reorderbuffer.o > without patch: > 30286 0 8 302947656 > src/backend/replication/logical/reorderbuffer.o Hmm, I suppose this delta can be explained because because the literal string is used inside larger snprintf() format strings or similar, so things like snprintf(path, sizeof(path), -"pg_replslot/%s/%s", slotname, +"%s/%s/%s", PG_REPLSLOT_DIR, slotname, spill_de->d_name); and ereport(ERROR, (errcode_for_file_access(), -errmsg("could not remove file \"%s\" during removal of pg_replslot/%s/xid*: %m", - path, slotname))); +errmsg("could not remove file \"%s\" during removal of %s/%s/xid*: %m", + PG_REPLSLOT_DIR, path, slotname))); I don't disagree with making this change, but changing some of the other directory names you suggest, such as > pg_notify > pg_serial > pg_subtrans > pg_multixact > pg_wal would probably make no difference, since there are no literal strings that contain that as a substring(*). It might some sense to handle pg_tblspc similarly. As for pg_logical, I think you'll want separate defines for pg_logical/mappings, pg_logical/snapshots and so on. (*) I hope you're not going to suggest this kind of change (word-diff): if (GetOldestSnapshot()) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("[-pg_wal-]{+%s+}_replay_wait() must be only called without an active or registered snapshot"{+, PG_WAL_DIR+}), errdetail("Make sure [-pg_wal-]{+%s+}_replay_wait() isn't called within a transaction with an isolation level higher than READ COMMITTED, another procedure, or a function."{+, PG_WAL_DIR+}))); -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Los trabajadores menos efectivos son sistematicamente llevados al lugar donde pueden hacer el menor daño posible: gerencia." (El principio Dilbert)
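Concretely, something along these lines for the pg_logical case (the names are only a sketch):

#define PG_LOGICAL_DIR            "pg_logical"
#define PG_LOGICAL_MAPPINGS_DIR   PG_LOGICAL_DIR "/mappings"
#define PG_LOGICAL_SNAPSHOTS_DIR  PG_LOGICAL_DIR "/snapshots"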
Re: Thread-safe nl_langinfo() and localeconv()
On Thu, Aug 15, 2024 at 11:11 PM Heikki Linnakangas wrote: > There's a similar function in initdb, check_locale_name. I wonder if > that could reuse the same code. Thanks, between this comment and some observations from Peter E and Tom, I think I have a better plan now. I think they should *not* match, and a comment saying so should be deleted. In the backend, we should do neither ""-expansion (ie getenv("LC_...") whether direct or indirect) nor canonicalisation (of Windows' deprecated pre-BCP 47 locale names), making this v4 patch extremely simple. 1. CREATE DATABASE doesn't really need to accept LOCALE = ''. What is the point? It's not documented or desirable behavior AFAIK. If you like defaults you can just not provide a locale at all and get the template database's (which often comes from initdb, which often uses the server environment). That behavior was already inconsistent with CREATE COLLATION. So I think we should just reject "" in the backend check_locale(). 2. A similar argument applies to Windows canonicalisation. CREATE COLLATION isn't doing it. CREATE DATABASE is, but again, what is the point? See previous. (I also think that initdb should use a different mechanism for finding the native locale on Windows, but I already have a CF #3772 for that, ie migration plan for BCP 47 and native UTF-8 on Windows, but I don't want *this* thread to get blocked by our absence of Windows reviewers/testers, so let's not tangle that up with this thread-safety expedition.) To show a concrete example of commands no longer accepted with this version, because they call check_locale(): postgres=# set lc_monetary = ''; ERROR: invalid value for parameter "lc_monetary": "" postgres=# create database db2 locale = ''; ERROR: invalid LC_COLLATE locale name: "" HINT: If the locale name is specific to ICU, use ICU_LOCALE. Does anyone see a problem with that? I do see a complication for CREATE COLLATION, though. It doesn't call check_locale(), is not changed in this patch, and still accepts ''. Reasoning: There may be systems with '' in their pg_collation catalog in the wild, since we never canonicalised with setlocale(), so it might create some kind of unforeseen dump/restore/upgrade hazard if we just ban '' outright, I just don't know what yet. There is no immediate problem, ie there is no setlocale() to excise, for *this* project. Longer term, we can't actually continue to allow '' in COLLATION objects, though: that tells newlocale() to call getenv(), which may be technically OK in a multi-threaded program (that never calls setenv()), but is hardly desirable. But it will also give the wrong result, if we pursue the plan that Jeff and I discussed: we'll stop doing setenv("LC_COLLATE", datacollate) and setenv("LC_CTYPE", datctype) in postinit.c (see pg_perm_setlocale() calls). So I think any pg_collation catalog entries holding '' need to be translated to datcollate/datctype... somewhere. I just don't know where yet and don't want to tackle that in the same patch. From 4831ff4373b9d713c78e303d9758de347aadfc2f Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Tue, 13 Aug 2024 14:15:54 +1200 Subject: [PATCH v4 1/3] Provide thread-safe pg_localeconv_r(). This involves four different implementation strategies: 1. For Windows, we now require _configthreadlocale() to be available and work, and the documentation says that the object returned by localeconv() is in thread-local memory. 2. 
For glibc, we translate to nl_langinfo_l() calls, because it offers the same information that way as an extension, and that API is thread-safe. 3. For macOS/*BSD, use localeconv_l(), which is thread-safe. 4. For everything else, use uselocale() to set the locale for the thread, and use a big ugly lock to defend against the returned object being concurrently clobbered. In practice this currently means only Solaris. The new call is used in pg_locale.c, replacing calls to setlocale() and localeconv(). This patch adds a hard requirement on Windows' _configthreadlocale(). In the past there were said to be MinGW systems that didn't have it, or had it but it didn't work. As far as I can tell, CI (optional MinGW task + mingw cross build warning task) and build farm (fairywren msys) should be happy with this. Fingers crossed. (There are places that use configure checks for that in ECPG; other proposed patches would remove those later.) Reviewed-by: Heikki Linnakangas Discussion: https://postgr.es/m/CA%2BhUKGJqVe0%2BPv9dvC9dSums_PXxGo9SWcxYAMBguWJUGbWz-A%40mail.gmail.com --- configure | 2 +- configure.ac | 1 + meson.build | 1 + src/backend/utils/adt/pg_locale.c | 128 ++- src/include/pg_config.h.in| 3 + src/include/port.h| 6 + src/port/Makefile | 1 + src/port/meson.build | 1 + src/port/pg_localeconv_r.c| 367 ++ 9 files chang
Re: Statistics Import and Export
> > > function attribute_statsitics_update() is significantly shorter. (Thank > you for a good set of tests, by the way, which sped up the refactoring > process.) > yw > * Remind me why the new stats completely replace the new row, rather > than updating only the statistic kinds that are specified? > because: - complexity - we would then need a mechanism to then tell it to *delete* a stakind - we'd have to figure out how to reorder the remaining stakinds, or spend effort finding a matching stakind in the existing row to know to replace it - "do what analyze does" was an initial goal and as a result many test cases directly compared pg_statistic rows from an original table to an empty clone table to see if the "copy" had fidelity. > * I'm not sure what the type_is_scalar() function was doing before, > but I just removed it. If it can't find the element type, then it skips > over the kinds that require it. > that may be sufficient, > * I introduced some hard errors. These happen when it can't find the > table, or the attribute, or doesn't have permissions. I don't see any > reason to demote those to a WARNING. Even for the restore case, > analagous errors happen for COPY, etc. > I can accept that reasoning. > * I'm still sorting through some of the type info derivations. I > think we need better explanations about why it's doing exactly the > things it's doing, e.g. for tsvector and multiranges. I don't have the specifics of each, but any such cases were derived from similar behaviors in the custom typanalyze functions, and the lack of a custom typanalyze function for a given type was taken as evidence that the type was adequately handled by the default rules. I can see that this is an argument for having a second stats-specific custom typanalyze function for datatypes that need them, but I wasn't ready to go that far myself.
Re: Remaining dependency on setlocale()
On Fri, Aug 16, 2024 at 9:09 AM Thomas Munro wrote: > On Fri, Aug 16, 2024 at 1:25 AM Peter Eisentraut wrote: > > It should just be initdb doing that and then initializing the server > > with concrete values based on that. > > Right. > > > I guess technically some of these GUC settings default to the > > environment? But I think we could consider getting rid of that. > > Yeah. Seems to make a lot of sense. I tried that out over in CF #5170. (In case it's not clear why I'm splitting discussion between threads: I was thinking of this thread as the one where we're discussing what needs to be done, with other threads being spun off to become CF entry with concrete patches. I realised re-reading some discussion that that might not be obvious...)
Re: SQL:2011 application time
On Thu, Aug 8, 2024 at 4:54 AM Paul Jungwirth wrote: > > Rebased to e56ccc8e42. I only applied to 0001-0003. in create_table.sgml, I saw the WITHOUT OVERLAPS change is mainly in table_constraint. but we didn't touch alter_table.sgml. Do we also need to change alter_table.sgml correspondingly? + if (constraint->without_overlaps) + { + /* + * This enforces that there is at least one equality column + * besides the WITHOUT OVERLAPS columns. This is per SQL + * standard. XXX Do we need this? + */ + if (list_length(constraint->keys) < 2) + ereport(ERROR, + errcode(ERRCODE_SYNTAX_ERROR), + errmsg("constraint using WITHOUT OVERLAPS needs at least two columns")); + + /* WITHOUT OVERLAPS requires a GiST index */ + index->accessMethod = "gist"; + } if Constraint->conname is not NULL, we can + errmsg("constraint \"%s\" using WITHOUT OVERLAPS needs at least two columns")); "XXX Do we need this?" I think currently we need this, otherwise the following create_table synopsis will not be correct. UNIQUE [ NULLS [ NOT ] DISTINCT ] ( column_name [, ... ] [, column_name WITHOUT OVERLAPS ] ) PRIMARY KEY ( column_name [, ... ] [, column_name WITHOUT OVERLAPS ] ) we add a column in catalog-pg-constraint. do we need change column conexclop, "If an exclusion constraint, list of the per-column exclusion operators " but currently, primary key, unique constraint both have valid conexclop. +static void +ExecWithoutOverlapsNotEmpty(Relation rel, NameData attname, Datum attval, char typtype, Oid atttypid) +{ + bool isempty; + RangeType *r; + MultirangeType *mr; + + switch (typtype) + { + case TYPTYPE_RANGE: + r = DatumGetRangeTypeP(attval); + isempty = RangeIsEmpty(r); + break; + case TYPTYPE_MULTIRANGE: + mr = DatumGetMultirangeTypeP(attval); + isempty = MultirangeIsEmpty(mr); + break; + default: + elog(ERROR, "WITHOUT OVERLAPS column \"%s\" is not a range or multirange", + NameStr(attname)); + } + + /* Report a CHECK_VIOLATION */ + if (isempty) + ereport(ERROR, + (errcode(ERRCODE_CHECK_VIOLATION), + errmsg("empty WITHOUT OVERLAPS value found in column \"%s\" in relation \"%s\"", + NameStr(attname), RelationGetRelationName(rel; +} I think in the default branch, you need at least set the isempty value, otherwise maybe there will be a compiler warning because later your use isempty, but via default branch is value undefined? + /* + * If this is a WITHOUT OVERLAPS constraint, + * we must also forbid empty ranges/multiranges. + * This must happen before we look for NULLs below, + * or a UNIQUE constraint could insert an empty + * range along with a NULL scalar part. + */ + if (indexInfo->ii_WithoutOverlaps) + { + ExecWithoutOverlapsNotEmpty(heap, att->attname, + } previously we found out that if this happens later, then it won't work. but this comment didn't explain why this must have happened earlier. I didn't dig deep enough to find out why. but explaining it would be very helpful. I think some tests are duplicated, so I did the refactoring. v39-0001-refactor-tests.no-cfbot Description: Binary data
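On the uninitialized-variable point, one way to keep the compiler quiet while still reporting the error -- just a sketch of the suggestion above, dropped into the quoted switch:

    default:
        /* unreachable, but silences "isempty may be used uninitialized" warnings */
        isempty = false;
        elog(ERROR, "WITHOUT OVERLAPS column \"%s\" is not a range or multirange",
             NameStr(attname));
        break;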
Re: Remove dependence on integer wrapping
On Thu, Aug 15, 2024 at 5:34 PM Nathan Bossart wrote: > Now to 0002... > > - if (-element > nelements) > + if (element == PG_INT32_MIN || -element > nelements) > > This seems like a good opportunity to use our new pg_abs_s32() function, > and godbolt.org [0] seems to indicate that it might produce better code, > too (at least to my eye). This updated version LGTM, I agree it's a good use of pg_abs_s32(). Thanks, Joseph Koshakow
Re: CREATE SUBSCRIPTION - add missing test case
On Thu, 15 Aug 2024 at 12:55, Peter Smith wrote: > > Hi Hackers, > > While reviewing another logical replication thread [1], I found an > ERROR scenario that seems to be untested. > > TEST CASE: Attempt CREATE SUBSCRIPTION where the subscriber table is > missing some expected column(s). > > Attached is a patch to add the missing test for this error message. I agree currently there is no test to hit this code. I'm not sure if this is the correct location for the test, should it be included in the 008_diff_schema.pl file? Additionally, the commenting style for this test appears quite different from the others. Could we use a commenting style consistent with the earlier tests? Regards, Vignesh
Re: Conflict detection and logging in logical replication
On Wed, Aug 14, 2024 at 7:45 PM Michail Nikolaev wrote: > > > This is as expected, and we have documented this in the code comments. We > > don't > > need to report a conflict if the conflicting tuple has been removed or > > updated > > due to concurrent transaction. The same is true if the transaction that > > inserted the conflicting tuple is rolled back before > > CheckAndReportConflict(). > > We don't consider such cases as a conflict. > > That seems a little bit strange to me. > > From the perspective of a user, I expect that if a change from publisher is > not applied - I need to know about it from the logs. > In the above conditions where a concurrent tuple insertion is removed or rolled back before CheckAndReportConflict, the tuple inserted by apply will remain. There is no need to report anything in such cases as apply was successful. > But in that case, I will not see any information about conflict in the logs > in SOME cases. But in OTHER cases I will see it. > However, in both cases the change from publisher was not applied. > And these cases are just random and depend on the timing of race conditions. > It is not something I am expecting from the database. > > Maybe it is better to report about the fact that event from publisher was not > applied because of conflict and then try to > provide additional information about the conflict itself? > > Or possibly in case we were unable to apply the event and not able to find > the conflict, we should retry the event processing? > Per my understanding, we will apply or the conflict will be logged and retried where required (unique key violation). -- With Regards, Amit Kapila.
Re: define PG_REPLSLOT_DIR
On Wed, 14 Aug 2024 11:32:14 + Bertrand Drouvot wrote: > Hi hackers, > > while working on a replication slot tool (idea is to put it in contrib, not > shared yet), I realized that "pg_replslot" is being used > 25 times in > .c files. > > I think it would make sense to replace those occurrences with a $SUBJECT, > attached > a patch doing so. > > There is 2 places where it is not done: > > src/bin/initdb/initdb.c > src/bin/pg_rewind/filemap.c > > for consistency with the existing PG_STAT_TMP_DIR define. > > Out of curiosity, checking the sizes of affected files (O2, no debug): > > with patch: > >textdata bss dec hex filename > 20315 224 17 20556504c src/backend/backup/basebackup.o > 30304 0 8 303127668 > src/backend/replication/logical/reorderbuffer.o > 23812 36 40 238885d50 src/backend/replication/slot.o >6367 0 0636718df src/backend/utils/adt/genfile.o > 4099725742528 46099b413 src/bin/initdb/initdb.o >6963 224 871951c1b src/bin/pg_rewind/filemap.o > > without patch: > >textdata bss dec hex filename > 20315 224 17 20556504c src/backend/backup/basebackup.o > 30286 0 8 302947656 > src/backend/replication/logical/reorderbuffer.o > 23766 36 40 238425d22 src/backend/replication/slot.o >6363 0 0636318db src/backend/utils/adt/genfile.o > 4099725742528 46099b413 src/bin/initdb/initdb.o >6963 224 871951c1b src/bin/pg_rewind/filemap.o > > Also, I think we could do the same for: > > pg_notify > pg_serial > pg_subtrans > pg_wal > pg_multixact > pg_tblspc > pg_logical > > And I volunteer to do so if you think that makes sense. > > Looking forward to your feedback, /* restore all slots by iterating over all on-disk entries */ - replication_dir = AllocateDir("pg_replslot"); - while ((replication_de = ReadDir(replication_dir, "pg_replslot")) != NULL) + replication_dir = AllocateDir(PG_REPLSLOT_DIR); + while ((replication_de = ReadDir(replication_dir, PG_REPLSLOT_DIR)) != NULL) { charpath[MAXPGPATH + 12]; PGFileType de_type; I think the size of path can be rewritten to "MAXPGPATH + sizeof(PG_REPLSLOT_DIR)" and it seems easier to understand the reason why this size is used. (That said, I wonder the path would never longer than MAXPGPATH...) Regards, Yugo Nagata > > Regards, > > -- > Bertrand Drouvot > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com -- Yugo Nagata
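In other words, roughly the following (slotname and spill_de come from the surrounding code quoted above; sizeof(PG_REPLSLOT_DIR) includes the terminating NUL, which conveniently covers the path separator, so it equals the old "+ 12"):

    char        path[MAXPGPATH + sizeof(PG_REPLSLOT_DIR)];

    snprintf(path, sizeof(path), "%s/%s/%s",
             PG_REPLSLOT_DIR, slotname, spill_de->d_name);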
Re: libpq minor TOCTOU violation
On 14.08.24 03:12, Andreas Karlsson wrote: On 8/10/24 9:10 AM, Peter Eisentraut wrote: Thoughts? I like it. Not because of the security issue but mainly because it is more correct to do it this way. Plus the old code running stat() on Windows also made little sense. I think this simple fix can be committed. committed
Re: Logical Replication of sequences
Hi Vignesh. I looked at the latest v20240815* patch set. I have only the following few comments for patch v20240815-0004, below. == Commit message. Please see the attachment for some suggested updates. == src/backend/commands/subscriptioncmds.c CreateSubscription: nit - fix wording in one of the XXX comments == .../replication/logical/sequencesync.c report_mismatched_sequences: nit - param name /warning_sequences/mismatched_seqs/ append_mismatched_sequences: nit - param name /warning_sequences/mismatched_seqs/ LogicalRepSyncSequences: nit - var name /warning_sequences/mismatched_seqs/ == Kind Regards, Peter Smith. Fujitsu Australia diff --git a/src/backend/commands/subscriptioncmds.c b/src/backend/commands/subscriptioncmds.c index 534d385..e799a41 100644 --- a/src/backend/commands/subscriptioncmds.c +++ b/src/backend/commands/subscriptioncmds.c @@ -800,9 +800,9 @@ CreateSubscription(ParseState *pstate, CreateSubscriptionStmt *stmt, * export it. * * XXX: If the subscription is for a sequence-only publication, -* creating this slot. It can be created later during the ALTER -* SUBSCRIPTION ... REFRESH command, if the publication is updated -* to include tables or tables in schema. +* creating this slot is unnecessary. It can be created later +* during the ALTER SUBSCRIPTION ... REFRESH command, if the +* publication is updated to include tables or tables in schema. */ if (opts.create_slot) { diff --git a/src/backend/replication/logical/sequencesync.c b/src/backend/replication/logical/sequencesync.c index 1aa5ab8..934646f 100644 --- a/src/backend/replication/logical/sequencesync.c +++ b/src/backend/replication/logical/sequencesync.c @@ -264,17 +264,17 @@ copy_sequence(WalReceiverConn *conn, Relation rel, * Report any sequence mismatches as a single warning log. */ static void -report_mismatched_sequences(StringInfo warning_sequences) +report_mismatched_sequences(StringInfo mismatched_seqs) { - if (warning_sequences->len) + if (mismatched_seqs->len) { ereport(WARNING, errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), errmsg("parameters differ for the remote and local sequences (%s) for subscription \"%s\"", - warning_sequences->data, MySubscription->name), + mismatched_seqs->data, MySubscription->name), errhint("Alter/Re-create local sequences to have the same parameters as the remote sequences.")); - resetStringInfo(warning_sequences); + resetStringInfo(mismatched_seqs); } } @@ -282,15 +282,15 @@ report_mismatched_sequences(StringInfo warning_sequences) * append_mismatched_sequences * * Appends details of sequences that have discrepancies between the publisher - * and subscriber to the warning_sequences string. + * and subscriber to the mismatched_seqs string. 
*/ static void -append_mismatched_sequences(StringInfo warning_sequences, Relation seqrel) +append_mismatched_sequences(StringInfo mismatched_seqs, Relation seqrel) { - if (warning_sequences->len) - appendStringInfoString(warning_sequences, ", "); + if (mismatched_seqs->len) + appendStringInfoString(mismatched_seqs, ", "); - appendStringInfo(warning_sequences, "\"%s.%s\"", + appendStringInfo(mismatched_seqs, "\"%s.%s\"", get_namespace_name(RelationGetNamespace(seqrel)), RelationGetRelationName(seqrel)); } @@ -314,7 +314,7 @@ LogicalRepSyncSequences(void) boolstart_txn = true; Oid subid = MyLogicalRepWorker->subid; MemoryContext oldctx; - StringInfo warning_sequences = makeStringInfo(); + StringInfo mismatched_seqs = makeStringInfo(); /* * Synchronizing each sequence individually incurs overhead from starting @@ -425,15 +425,15 @@ LogicalRepSyncSequences(void) PG_CATCH(); { if (sequence_mismatch) - append_mismatched_sequences(warning_sequences, sequence_rel); + append_mismatched_sequences(mismatched_seqs, sequence_rel); - report_mismatched_sequences(warning_sequences); + report_mismatched_sequences(mismatched_seqs); PG_RE_THROW(); } PG_END_TRY(); if (sequence_mismatch) - append_mismatched_se
Re: [Patch] remove duplicated smgrclose
Junwang Zhao wrote on Thu, Aug 15, 2024 at 18:03: > On Wed, Aug 14, 2024 at 2:35 PM Steven Niu wrote: > > > > Junwang, Kirill, > > > > The split work has been done. I created a new patch for removing > redundant smgrclose() function as attached. > > Please help review it. > > Patch looks good, actually you can make the refactoring code as v3-0002-xxx > by using: > > git format-patch -2 -v 3 > > Another kind reminder: Do not top-post when you reply ;) > > > > > Thanks, > > Steven > > > > Steven Niu wrote on Mon, Aug 12, 2024 at 18:11: > >> > >> Kirill, > >> > >> Good catch! > >> I will split the patch into two to cover both cases. > >> > >> Thanks, > >> Steven > >> > >> > >> Junwang Zhao wrote on Fri, Aug 9, 2024 at 18:19: > >>> > >>> On Fri, Aug 9, 2024 at 5:20 PM Kirill Reshke > wrote: > >>> > > >>> > On Thu, 1 Aug 2024 at 17:32, Junwang Zhao wrote: > >>> > > > >>> > > Hi Steven, > >>> > > > >>> > > On Wed, Jul 31, 2024 at 11:16 AM Steven Niu > wrote: > >>> > > > > >>> > > > Hello, hackers, > >>> > > > > >>> > > > I think there may be some duplicated code. > >>> > > > Function smgrDoPendingDeletes() calls both smgrdounlinkall() and > smgrclose(). > >>> > > > But both functions would close the SMgrRelation object, so it's > duplicated behavior? > >>> > > > > >>> > > > So I make this patch. Could someone take a look at it? > >>> > > > > >>> > > > Thanks for your help, > >>> > > > Steven > >>> > > > > >>> > > > From Highgo.com > >>> > > > > >>> > > > > >>> > > Your change LGTM, but the patch seems not to be applied to HEAD, > >>> > > I generate the attached v2 using `git format` with some commit > message. > >>> > > > >>> > > -- > >>> > > Regards > >>> > > Junwang Zhao > >>> > > >>> > Hi all! > >>> > This change looks good to me. However, I have an objection to these > >>> > lines from v2: > >>> > > >>> > > /* Close the forks at smgr level */ > >>> > > - for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) > >>> > > - smgrsw[which].smgr_close(rels[i], forknum); > >>> > > + smgrclose(rels[i]); > >>> > > >>> > Why do we do this? This seems to be an unrelated change given thread > >>> > $subj. This is just a pure refactoring job, which deserves a separate > >>> > patch. There is similar coding in > >>> > smgrdestroy function: > >>> > > >>> > ``` > >>> > for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) > >>> > smgrsw[reln->smgr_which].smgr_close(reln, forknum); > >>> > ``` > >>> > > >>> > So, I feel like these two places should be either changed together or > >>> > not be altered at all. And it is definitely a separate change. > >>> > >>> Yeah, I tend to agree with you, maybe we should split the patch > >>> into two. > >>> > >>> Steven, could you do this? > >>> > > >>> > -- > >>> > Best regards, > >>> > Kirill Reshke > >>> > >>> > >>> > >>> -- > >>> Regards > >>> Junwang Zhao > > > > -- > Regards > Junwang Zhao > OK, thanks for your kind help. Steven
Re: Conflict detection and logging in logical replication
On Thu, Aug 15, 2024 at 12:47 PM Zhijie Hou (Fujitsu) wrote: > > Thanks. I have checked and merged the changes. Here is the V15 patch > which addressed above comments. Thanks for the patch. Please find a few comments and queries: 1) For various conflicts, we have these in the logs: Replica identity (val1)=(30). (for RI on 1 column) Replica identity (pk, val1)=(200, 20). (for RI on 2 columns) Replica identity (40, 40, 11). (for RI full) Shall we have the column list in the last case as well, or can we simply have the *full* keyword, i.e. Replica identity full (40, 40, 11)? 2) For a toast column, we dump null in the remote tuple. I know that the toast column is not sent in the new tuple from the publisher and thus the behaviour, but it could be misleading for users. Perhaps document this? See 'null' in all these examples in the remote tuple: update_differ with PK: LOG: conflict detected on relation "public.t1": conflict=update_differ DETAIL: Updating the row that was modified locally in transaction 831 at 2024-08-16 09:59:26.566012+05:30. Existing local tuple (30, 30, 30, ...); remote tuple (30, 30, 300, null); replica identity (pk, val1)=(30, 30). update_missing with PK: LOG: conflict detected on relation "public.t1": conflict=update_missing DETAIL: Could not find the row to be updated. Remote tuple (10, 10, 100, null); replica identity (pk, val1)=(10, 10). update_missing with RI full: LOG: conflict detected on relation "public.t1": conflict=update_missing DETAIL: Could not find the row to be updated. Remote tuple (20, 20, 2000, null); replica identity (20, 20, 10, ...). 3) For update_exists(), we dump: Key (a, b)=(2, 1) For delete_missing, update_missing, update_differ, we dump: Replica identity (a, b)=(2, 1). For update_exists as well, shouldn't we dump 'Replica identity'? Only for the insert case should it be referred to as 'Key'. 4) Why does delete_missing not show the remote tuple? Is it because we dump the new tuple as 'remote tuple', which is not relevant for delete_missing? 2024-08-16 09:13:33.174 IST [419839] LOG: conflict detected on relation "public.t1": conflict=delete_missing 2024-08-16 09:13:33.174 IST [419839] DETAIL: Could not find the row to be deleted. Replica identity (val1)=(30). thanks Shveta
Re: Conflict detection and logging in logical replication
On Fri, Aug 16, 2024 at 10:46 AM shveta malik wrote: > > 3) > For update_exists(), we dump: > Key (a, b)=(2, 1) > > For delete_missing, update_missing, update_differ, we dump: > Replica identity (a, b)=(2, 1). > > For update_exists as well, shouldn't we dump 'Replica identity'? Only > for insert case, it should be referred as 'Key'. > On rethinking, is it because for update_exists case 'Key' dumped is not the one used to search the row to be updated? Instead it is the one used to search the conflicting row. Unlike update_differ, the row to be updated and the row currently conflicting will be different for update_exists case. I earlier thought that 'KEY' and 'Existing local tuple' dumped always belong to the row currently being updated/deleted/inserted. But for 'update_eixsts', that is not the case. We are dumping 'Existing local tuple' and 'Key' for the row which is conflicting and not the one being updated. Example: ERROR: conflict detected on relation "public.tab_1": conflict=update_exists Key (a, b)=(2, 1); existing local tuple (2, 1); remote tuple (2, 1). Operations performed were: Pub: insert into tab values (1,1); Sub: insert into tab values (2,1); Pub: update tab set a=2 where a=1; Here Key and local tuple are both 2,1 instead of 1,1. While replica identity value (used to search original row) will be 1,1 only. It may be slightly confusing or say tricky to understand when compared to other conflicts' LOGs. But not sure what better we can do here. One more comment: 5) For insert/update_exists, the sequence is: Key .. ; existing local tuple .. ; remote tuple ... For rest of the conflicts, sequence is: Existing local tuple .. ; remote tuple .. ; replica identity .. Is it intentional? Shall the 'Key' or 'Replica Identity' be the first one to come in all conflicts? thanks Shveta
Re: Conflict detection and logging in logical replication
On Fri, Aug 16, 2024 at 10:46 AM shveta malik wrote: > > On Thu, Aug 15, 2024 at 12:47 PM Zhijie Hou (Fujitsu) > wrote: > > > > Thanks. I have checked and merged the changes. Here is the V15 patch > > which addressed above comments. > > Thanks for the patch. Please find few comments and queries: > > 1) > For various conflicts , we have these in Logs: > Replica identity (val1)=(30).(for RI on 1 column) > Replica identity (pk, val1)=(200, 20). (for RI on 2 columns) > Replica identity (40, 40, 11).(for RI full) > > Shall we have have column list in last case as well, or can simply > have *full* keyword i.e. Replica identity full (40, 40, 11) > I would prefer 'full' instead of the entire column list as the complete column list could be long and may not much sense. > > 2) > For toast column, we dump null in remote-tuple. I know that the toast > column is not sent in new-tuple from the publisher and thus the > behaviour, but it could be misleading for users. Perhaps document > this? > Agreed that we should document this. I suggest that we can have a doc patch that explains the conflict logging format and in that, we can mention this behavior as well. > 3) > For update_exists(), we dump: > Key (a, b)=(2, 1) > > For delete_missing, update_missing, update_differ, we dump: > Replica identity (a, b)=(2, 1). > > For update_exists as well, shouldn't we dump 'Replica identity'? Only > for insert case, it should be referred as 'Key'. > I think update_exists is quite similar to insert_exists and both happen due to unique key violation. So, it seems okay to display the Key for update_exists. > > 4) > Why delete_missing is not having remote_tuple. Is it because we dump > new tuple as 'remote tuple', which is not relevant for delete_missing? > Right. -- With Regards, Amit Kapila.
Re: Conflict detection and logging in logical replication
On Fri, Aug 16, 2024 at 11:48 AM shveta malik wrote: > > On Fri, Aug 16, 2024 at 10:46 AM shveta malik wrote: > > > > 3) > > For update_exists(), we dump: > > Key (a, b)=(2, 1) > > > > For delete_missing, update_missing, update_differ, we dump: > > Replica identity (a, b)=(2, 1). > > > > For update_exists as well, shouldn't we dump 'Replica identity'? Only > > for insert case, it should be referred as 'Key'. > > > > On rethinking, is it because for update_exists case 'Key' dumped is > not the one used to search the row to be updated? Instead it is the > one used to search the conflicting row. Unlike update_differ, the row > to be updated and the row currently conflicting will be different for > update_exists case. I earlier thought that 'KEY' and 'Existing local > tuple' dumped always belong to the row currently being > updated/deleted/inserted. But for 'update_eixsts', that is not the > case. We are dumping 'Existing local tuple' and 'Key' for the row > which is conflicting and not the one being updated. Example: > > ERROR: conflict detected on relation "public.tab_1": conflict=update_exists > Key (a, b)=(2, 1); existing local tuple (2, 1); remote tuple (2, 1). > > Operations performed were: > Pub: insert into tab values (1,1); > Sub: insert into tab values (2,1); > Pub: update tab set a=2 where a=1; > > Here Key and local tuple are both 2,1 instead of 1,1. While replica > identity value (used to search original row) will be 1,1 only. > > It may be slightly confusing or say tricky to understand when compared > to other conflicts' LOGs. But not sure what better we can do here. > The update_exists behaves more like insert_exists as we detect that only while inserting into index. It is also not clear to me if we can do better than to clarify this in docs. > > > One more comment: > > 5) > For insert/update_exists, the sequence is: > Key .. ; existing local tuple .. ; remote tuple ... > > For rest of the conflicts, sequence is: > Existing local tuple .. ; remote tuple .. ; replica identity .. > > Is it intentional? Shall the 'Key' or 'Replica Identity' be the first > one to come in all conflicts? > This is worth considering but Replica Identity signifies the old tuple values, that is why it is probably kept at the end. But let's see what Hou-San or others think about this. -- With Regards, Amit Kapila.