Re: Proposal to introduce a shuffle function to intarray extension
On 17.07.22 at 05:37 Tom Lane wrote: Actually ... is there a reason to bother with an intarray version at all, rather than going straight for an in-core anyarray function? It's not obvious to me that an int4-only version would have major performance advantages. regards, tom lane Hi Tom, thank you for your thoughts. There are two reasons for choosing an int4-only version. I am not familiar with Postgres development (yet) and I was not sure how open you are to such changes to core, and whether the proposed feature is considered valuable enough to go into core. The second reason was ease of implementation. The intarray extension does not allow any NULL elements in arrays and treats multidimensional arrays as though they were linear, which makes the implementation straightforward, because there are fewer cases to consider. However, I will take a look at an implementation for anyarray in core. Martin
Re: Proposal to introduce a shuffle function to intarray extension
On 17.07.22 at 08:00 Thomas Munro wrote: I went to see what Professor Lemire would have to say about all this, expecting to find a SIMD rabbit hole to fall down for some Sunday evening reading, but the main thing that jumped out was this article about the modulo operation required by textbook Fisher-Yates to get a bounded random number, the random() % n that appears in the patch. He talks about shuffling twice as fast by using a no-division trick to get bounded random numbers[1]. I guess you might need to use our pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might introduce bias. Anyway, file that under go-faster ideas for later. [1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/ Hi Thomas, the small bias of random() % n is not a problem for my use case, but might be for others. It's easily replaceable with (int) pg_prng_uint64_range(&pg_global_prng_state, 0, n-1). Unfortunately it is a bit slower (on my machine), but that's negligible. Martin
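[Editor's note] For illustration, a minimal sketch of the swap loop using the bias-free range call quoted above instead of random() % n might look like this (the buffer name "dat" and the helper are made up for illustration; this is not the patch itself):

#include "common/pg_prng.h"   /* pg_prng_uint64_range, pg_global_prng_state */

/* Shuffle an int32 buffer "dat" of length "nitems" with Fisher-Yates,
 * drawing each index from an inclusive range without the modulo bias. */
static void
shuffle_int32(int32 *dat, int nitems)
{
    for (int i = nitems - 1; i > 0; i--)
    {
        int     j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, i);
        int32   tmp = dat[i];

        dat[i] = dat[j];
        dat[j] = tmp;
    }
}

Because pg_prng_uint64_range takes inclusive bounds, passing (0, i) yields exactly the i+1 candidate positions the algorithm needs.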
Re: Proposal to introduce a shuffle function to intarray extension
On 17.07.22 at 05:32 David G. Johnston wrote: +SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) != sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6); + ?column? +-- + t +(1 row) + While small, there is a non-zero chance for both samples to be equal. This test should probably just go, I don't see what it tests that isn't covered by other tests or even trivial usage. Hey David, you are right. There is a small chance for this test to fail. I wanted to test that two invocations produce different results (after all, that is the main feature of the function). But it can probably go. Martin
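[Editor's note] An invariant-based test would avoid the flakiness discussed here; a hypothetical sketch (assuming the proposed sample() function) checks properties that must always hold instead of comparing two random runs:

-- the sampled array always has the requested length
SELECT array_length(sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6), 1) = 6;
-- and every sampled element comes from the input array
SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) <@ '{1,2,3,4,5,6,7,8,9,10,11,12}';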
Re: Allowing REINDEX to have an optional name
On Sun, 17 Jul 2022 at 07:19, Michael Paquier wrote: > > On Fri, Jul 15, 2022 at 06:21:22PM +0100, Simon Riggs wrote: > > That's fixed it on the CFbot. Over to you, Michael. Thanks. > > Sure. I have looked over that, and this looks fine overall. I have > made two changes though. > > if (objectKind == REINDEX_OBJECT_SYSTEM && > - !IsSystemClass(relid, classtuple)) > + !IsCatalogRelationOid(relid)) > + continue; > + else if (objectKind == REINDEX_OBJECT_DATABASE && > +IsCatalogRelationOid(relid)) > > The patch originally relied on IsSystemClass() to decide if a relation > is a catalog table or not. This is not wrong in itself because > ReindexMultipleTables() discards RELKIND_TOAST a couple of lines > above, but I think that we should switch to IsCatalogRelationOid() as > that's the line drawn to check for the catalog-ness of a relation. > > The second thing is test coverage. Using a REINDEX DATABASE/SYSTEM > within the main regression test suite is not a good idea, but we > already have those commands running in the reindexdb suite so I could > not resist expanding the test section to track and check relfilenode > changes through four relations for these cases: > - Catalog index. > - Catalog toast index. > - User table index. > - User toast index. > The relfilenodes of those relations are saved in a table and > cross-checked with the contents of pg_class after each REINDEX, on > SYSTEM or DATABASE. There are no new heavy commands, so it does not > make the test longer. > > With all that, I finish with the attached. Does that look fine to > you? Sounds great, looks fine. Thanks for your review. -- Simon Riggshttp://www.EnterpriseDB.com/
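[Editor's note] For readers following along, the relfilenode-tracking idea described above can be sketched roughly like this (illustrative SQL, not the actual reindexdb test; REINDEX SYSTEM/DATABASE without a name is the optional-name form this patch adds, and mapped catalogs report relfilenode 0 in pg_class, so the real test would use pg_relation_filenode() for those):

-- remember the current storage file of every index
CREATE TABLE prior_filenodes AS
    SELECT oid, relname, relfilenode FROM pg_class WHERE relkind = 'i';

REINDEX SYSTEM;   -- or REINDEX DATABASE

-- indexes whose relfilenode changed were rewritten by the REINDEX
SELECT p.relname
  FROM prior_filenodes p
  JOIN pg_class c ON c.oid = p.oid
 WHERE c.relfilenode <> p.relfilenode;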
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Fri, Jul 15, 2022 at 3:32 PM shiy.f...@fujitsu.com wrote: > > On Mon, Jul 11, 2022 9:54 PM Masahiko Sawada wrote: > > > > I've attached an updated patch, please review it. > > > > Thanks for your patch. Here are some comments for the REL14-v1 patch. > > 1. > + Sizesz = sizeof(TransactionId) * nxacts;; > > There is a redundant semicolon at the end. > > 2. > + workspace = MemoryContextAlloc(rb->context, > rb->n_initial_running_xacts); > > Should it be: > + workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * > rb->n_initial_running_xacts); > > 3. > + /* bound check if there is at least one transaction to be removed */ > + if (NormalTransactionIdPrecedes(rb->initial_running_xacts[0], > + > running->oldestRunningXid)) > + return; > + > > Here, I think it should return if rb->initial_running_xacts[0] is older than > oldestRunningXid, right? Should it be changed to: > > + if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0], > + > running->oldestRunningXid)) > + return; > > 4. > + if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0) > > Maybe we can change it like the following, to be consistent with other places > in > this file. It's also fine if you don't change it. > > + if (parsed->xinfo & XACT_XINFO_HAS_INVALS) Thank you for the comments! I've attached patches for all supported branches including the master. Regards, -- Masahiko Sawada EDB: https://www.enterprisedb.com/ REL13-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch Description: Binary data REL12-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch Description: Binary data REL14-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch Description: Binary data REL11-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch Description: Binary data REL10-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch Description: Binary data master-v6-0001-Add-catalog-modifying-transactions-to-logical-dec.patch Description: Binary data
Re: proposal: possibility to read dumped table's name from file
Thanks for updating the patch. This failed to build on windows. http://cfbot.cputube.org/pavel-stehule.html Some more comments inline. On Sun, Jul 17, 2022 at 08:20:47AM +0200, Pavel Stehule wrote: > The attached patch implements the --filter option for pg_dumpall and for > pg_restore too. > diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml > index 5efb442b44..ba2920dbee 100644 > --- a/doc/src/sgml/ref/pg_dump.sgml > +++ b/doc/src/sgml/ref/pg_dump.sgml > @@ -779,6 +779,80 @@ PostgreSQL documentation > > > > + > + --filter= class="parameter">filename > + > + > +Specify a filename from which to read patterns for objects to include > +or exclude from the dump. The patterns are interpreted according to > the > +same rules as the corresponding options: > +-t/--table for tables, > +-n/--schema for schemas, > +--include-foreign-data for data on foreign servers > and > +--exclude-table-data for table data. > +To read from STDIN use - as > the STDIN comma > + > +Lines starting with # are considered comments and > +are ignored. Comments can be placed after filter as well. Blank lines change "are ignored" to "ignored", I think. > diff --git a/doc/src/sgml/ref/pg_dumpall.sgml > b/doc/src/sgml/ref/pg_dumpall.sgml > index 8a081f0080..137491340c 100644 > --- a/doc/src/sgml/ref/pg_dumpall.sgml > +++ b/doc/src/sgml/ref/pg_dumpall.sgml > @@ -122,6 +122,29 @@ PostgreSQL documentation > > > > + > + --filter= class="parameter">filename > + > + > +Specify a filename from which to read patterns for databases excluded > +from dump. The patterns are interpretted according to the same rules > +like --exclude-database. same rules *as* > +To read from STDIN use - as > the comma > +filename. The --filter option can be specified in > +conjunction with the above listed options for including or excluding For dumpall, remove "for including or" change "above listed options" to "exclude-database" ? > diff --git a/doc/src/sgml/ref/pg_restore.sgml > b/doc/src/sgml/ref/pg_restore.sgml > index 526986eadb..5f16c4a333 100644 > --- a/doc/src/sgml/ref/pg_restore.sgml > +++ b/doc/src/sgml/ref/pg_restore.sgml > @@ -188,6 +188,31 @@ PostgreSQL documentation > > > > + > + --filter= class="parameter">filename > + > + > +Specify a filename from which to read patterns for objects excluded > +or included from restore. The patterns are interpretted according to > the > +same rules like --schema, > --exclude-schema, s/like/as/ > +--function, --index, > --table > +or --trigger. > +To read from STDIN use - as > the STDIN comma > +/* > + * filter_get_keyword - read the next filter keyword from buffer > + * > + * Search for keywords (limited to containing ascii alphabetic characters) in remove "containing" > + /* > + * If the object name pattern has been quoted we must take care parse > out > + * the entire quoted pattern, which may contain whitespace and can span > + * over many lines. quoted comma *to parse remove "over" > + * The pattern is either simple without any whitespace, or properly quoted double space > + * in case there is whitespace in the object name. The pattern handling > follows s/is/may be/ > + if (size == 7 && pg_strncasecmp(keyword, "include", 7) > == 0) > + *is_include = true; > + else if (size == 7 && pg_strncasecmp(keyword, > "exclude", 7) == 0) > + *is_include = false; Can't you write strncasecmp(keyword, "include", size) to avoid hardcoding "7" ? 
> + > + if (size == 4 && pg_strncasecmp(keyword, "data", 4) == > 0) > + *objtype = FILTER_OBJECT_TYPE_DATA; > + else if (size == 8 && pg_strncasecmp(keyword, > "database", 8) == 0) > + *objtype = FILTER_OBJECT_TYPE_DATABASE; > + else if (size == 12 && pg_strncasecmp(keyword, > "foreign_data", 12) == 0) > + *objtype = FILTER_OBJECT_TYPE_FOREIGN_DATA; > + else if (size == 8 && pg_strncasecmp(keyword, > "function", 8) == 0) > + *objtype = FILTER_OBJECT_TYPE_FUNCTION; > + else if (size == 5 && pg_strncasecmp(keyword, "index", > 5) == 0) > + *objtype = FILTER_OBJECT_TYPE_INDEX; > + else if (size == 6 && pg_strncasecmp(keyword, "schema", > 6) == 0) > + *objtype = FILTER_OBJECT_TYPE_SCHEMA; > + else if (size == 5 && pg_strncasecmp(keyword, "table", > 5) == 0) > + *objtype = FILTER_OBJECT_
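[Editor's note] For orientation, a hypothetical filter file following the keyword grammar visible in the quoted code (include/exclude, an object-type keyword, then a pattern; '#' starts a comment) might look like this — object names are purely illustrative:

# dump only the orders table, skip audit data and the archive schema
include table public.orders
exclude data public.audit_log
exclude schema archive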
Re: Commitfest Update
On Fri, Jul 15, 2022 at 09:37:14PM -0500, Justin Pryzby wrote: > I'm not suggesting to give the community regulars special treatment, but you > could reasonably assume that when they added themselves and then "didn't > remove > themself", it was on purpose and not by omission. I think most of those > people > would be surprised to learn that they're no longer considered to be reviewing > the patch. Yeah, I happened to look in my commitfest update folder this morning and was surprised to learn that I was no longer reviewing 3612. I spent a good amount of time getting that patch into a state where I felt it was ready-for-committer, and I intended to follow up on any further developments in the thread. It's not uncommon for me to use the filter functionality in the app to keep track of patches I'm reviewing. Of course, there are probably patches where I could be removed from the reviewers field. I can try to stay on top of that better. If you think I shouldn't be marked as a reviewer and that it's hindering further review for a patch, feel free to message me publicly or privately about it. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Reducing the chunk header sizes on all memory context types
On Tue, 12/07/2022 at 22:41 -0700, Andres Freund wrote: > Hi, > > On 2022-07-12 10:42:07 -0700, Andres Freund wrote: > > On 2022-07-12 17:01:18 +1200, David Rowley wrote: > > > There is at least one. It might be major; to reduce the AllocSet chunk > > > header from 16 bytes down to 8 bytes I had to get rid of the freelist > > > pointer that was reusing the "aset" field in the chunk header struct. > > > This works now by storing that pointer in the actual palloc'd memory. > > > This could lead to pretty hard-to-trace bugs if we have any code that > > > accidentally writes to memory after pfree. > > > > Can't we use the same trick for allocations in the freelist as we do for the > > header in a live allocation? I.e. split the 8 byte header into two and use > > part of it to point to the next element in the list using the offset from > > the > > start of the block, and part of it to indicate the size? > > So that doesn't work because the members in the freelist can be in different > blocks and those can be further away from each other. > > > Perhaps that could still be made to work somehow: To point to a block we don't > actually need 64bit pointers, they're always at least of some certain size - > assuming we can allocate them suitably aligned. And chunks are always 8 byte > aligned. Unfortunately that doesn't quite get us far enough - assuming a 4kB > minimum block size (larger than now, but potentially sensible as a common OS > page size), we still only get to 2^12*8 = 32kB. Well, we actually have freelists for 11 size classes. That is just 11 pointers. We could embed these 88 bytes in every ASet block and then link blocks. And then in every block have 44 bytes for in-block free lists. Total overhead is 132 bytes per block, or 110 if we limit block size to 65k*8b=512kB. With doubly-linked block lists (176 bytes per block + 44 bytes for in-block lists = 220 bytes), we could track block fullness and deallocate a block when it no longer contains any live allocation. That would make the "generational" and "slab" allocators less useful. But the CPU overhead would be noticeable. > It'd easily work if we made each context have an array of allocated non-huge > blocks, so that the blocks can be addressed with a small index. The overhead > of that could be reduced in the common case by embedding a small constant > sized array in the Aset. That might actually be worth trying out. > > Greetings, > > Andres Freund
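[Editor's note] To make the "small block index plus in-block offset" idea above concrete, a hedged sketch (field widths purely illustrative, not the committed design) could pack a freelist link into 32 bits like this:

#include <stdint.h>
#include <stddef.h>

/* Pack (block index, chunk offset) into 32 bits; 8-byte chunk alignment
 * means the low 3 bits of the offset need not be stored. */
typedef uint32_t FreeChunkRef;

static inline FreeChunkRef
make_ref(uint32_t block_index, uint32_t byte_offset)
{
    return (block_index << 20) | (byte_offset >> 3);
}

static inline char *
resolve_ref(char **blocks, FreeChunkRef ref)
{
    return blocks[ref >> 20] + ((size_t) (ref & 0xFFFFF) << 3);
}

With this split, 20 offset bits address up to 8MB within a block and 12 bits index up to 4096 blocks per context, which is why the discussion gravitates toward a per-context array of blocks addressable by a small index.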
Re: Proposal to introduce a shuffle function to intarray extension
Am 17.07.22 um 08:00 schrieb Thomas Munro: Actually ... is there a reason to bother with an intarray version at all, rather than going straight for an in-core anyarray function? It's not obvious to me that an int4-only version would have major performance advantages. Yeah, that seems like a good direction. If there is a performance advantage to specialising, then perhaps we only have to specialise on size, not type. Perhaps there could be a general function that internally looks out for typbyval && typlen == 4, and dispatches to a specialised 4-byte, and likewise for 8, if it can, and that'd already be enough to cover int, bigint, float etc, without needing specialisations for each type. I played around with the idea of an anyarray shuffle(). The hard part was to deal with arrays with variable length elements, as they can not be swapped easily in place. I solved it by creating an intermediate array of references to the elements. I'll attach a patch with the proof of concept. Unfortunatly it is already about 5 times slower than the specialised version and i am not sure if it is worth going down that road. MartinFrom 85cb90c79459132c4220b8d78586dffe6e590ba9 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] introduce array_shuffle() --- src/backend/utils/adt/arrayfuncs.c | 73 ++ src/include/catalog/pg_proc.dat| 3 ++ 2 files changed, 76 insertions(+) diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index fb167f226a..b6576ce178 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -6872,3 +6872,76 @@ trim_array(PG_FUNCTION_ARGS) PG_RETURN_DATUM(result); } + +Datum +array_shuffle(PG_FUNCTION_ARGS) +{ + ArrayType *array, + *result; + char *adat, + *rdat; + int nitems, + i; + Oid elemtype; + TypeCacheEntry typentry; + struct { char *dat; size_t size; } *refs; + + array = PG_GETARG_ARRAYTYPE_P(0); + + if (ARR_HASNULL(array)) + ereport(ERROR, +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("shuffling arrays with null elements is not supported"))); + + if (ARR_NDIM(array) > 1) + ereport(ERROR, +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("shuffling of multidimensional arrays is not supported"))); + + elemtype = ARR_ELEMTYPE(array); + typentry = *lookup_type_cache(elemtype, 0); + nitems = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array)); + refs = palloc(nitems * (sizeof(char *) + sizeof(size_t))); + adat = ARR_DATA_PTR(array); + + /* + * To allow shuffling of arrays with variable length elements, we are going + * to shuffle references to the elements, because swapping variable length + * elements in an array is complicated. In a second step we will iterate the + * shuffled references and copy the elements to the destination array. + * We are using the "inside-out" variant of the Fisher-Yates algorithm to + * produce a shuffled array of references: Append each reference than swap it + * with a radomly-chosen refence. 
+ */ + for (i = 0; i < nitems; i++) + { + int j = random() % (i + 1); + char *next = att_addlength_pointer(adat, typentry.typlen, adat); + next = (char *) att_align_nominal(next, typentry.typalign); + refs[i] = refs[j]; + refs[j].dat = adat; + refs[j].size = next - adat; + adat = next; + } + + result = create_array_envelope(ARR_NDIM(array), + ARR_DIMS(array), + ARR_LBOUND(array), + ARR_SIZE(array), + ARR_ELEMTYPE(array), + array->dataoffset); + + rdat = ARR_DATA_PTR(result); + + /* Collect all references and copy elements to the destination array */ + for (i = 0; i < nitems; i++) + { + memcpy(rdat, refs[i].dat, refs[i].size); + rdat += refs[i].size; + } + + pfree(refs); + PG_FREE_IF_COPY(array, 0); + + PG_RETURN_ARRAYTYPE_P(result); +} diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 2e41f4d9e8..56aff551d3 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -1681,6 +1681,9 @@ proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8', proargtypes => 'internal oid internal int2 internal', prosrc => 'arraycontjoinsel' }, +{ oid => '', descr => 'shuffle array', + proname => 'array_shuffle', proisstrict => 'f', prorettype => 'anyarray', + proargtypes => 'anyarray', prosrc => 'array_shuffle' }, { oid => '764', descr => 'large object import', proname => 'lo_import', provolatile => 'v', proparallel => 'u', -- 2.37.1
Re: [PATCH] Compression dictionaries for JSONB
Hi hackers! Aleksander, I've carefully gone over the discussion and still have some questions to ask: 1) Is there any means of measuring the overhead of dictionaries over the vanilla implementation? IMO it is a must, because JSON is widely used functionality. Also, as was mentioned before, to check the dictionary the value must be detoasted; 2) Storing dictionaries in one table. As I wrote before, this will surely lead to locks and waits while inserting and updating dictionaries, and could cause serious performance issues. And vacuuming this table will lead to locks for all tables using dictionaries until the vacuum is complete; 3) JSON documents in production environments can be very complex and use thousands of keys, so creating a dictionary directly in an SQL statement is not a very good approach, which is another reason to have means for creating dictionaries as separate tables and/or passing them as files or so; 4) The suggested mechanics, if put on top of TOAST, cannot benefit from knowledge of the internal JSON structure, which is seen as an important drawback in spite of the extensive research work done on working with JSON schema (storing, validating, etc.), and it also cannot recognize and help to compress duplicated parts of a JSON document; 5) A small test issue: what if a dictionaried JSON document has a key that is equal to an OID used in the dictionary for some other key? In Pluggable TOAST we suggest that, as an improvement, compression should be put inside the Toaster as an option, so the Toaster could get maximum benefit from knowledge of the data's internal structure (and in the future use JSON Schema). For use in a special Toaster for the JSON datatype, compression dictionaries seem to be a very valuable addition, but for now I have to agree that this feature in its current state is competing with Pluggable TOAST. Thank you! Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/ On Tue, Jul 12, 2022 at 3:15 PM Aleksander Alekseev < aleksan...@timescale.com> wrote: > Hi Nikita, > > > Aleksander, please point me in the right direction if it was mentioned > before, I have a few questions: > > Thanks for your feedback. These are good questions indeed. > > > 1) It is not clear for me, how do you see the life cycle of such a > dictionary? If it is meant to keep growing without > > cleaning up/rebuilding it could affect performance in an undesirable > way, along with keeping unused data without > > any means to get rid of them. > > 2) From (1) follows another question - I haven't seen any means for > getting rid of unused keys (or any other means > > for dictionary cleanup). How could it be done? > > Good point. This was not a problem for ZSON since the dictionary size > was limited to 2**16 entries, the dictionary was immutable, and the > dictionaries had versions. For compression dictionaries we removed the > 2**16 entries limit and also decided to get rid of versions. The idea > was that you can simply continue adding new entries, but no one > thought about the fact that this will consume the memory required to > decompress the document indefinitely. > > Maybe we should return to the idea of limited dictionary size and > versions. Objections? > > > 4) If one dictionary is used by several tables - I see future issues in > concurrent dictionary updates. This will for sure > > affect performance and can cause unpredictable behavior for queries. > > You are right. Another reason to return to the idea of dictionary versions.
> > > Also, I agree with Simon Riggs, using OIDs from the general pool for > dictionary entries is a bad idea. > > Yep, we agreed to stop using OIDs for this, however this was not > changed in the patch at this point. Please don't hesitate joining the > effort if you want to. I wouldn't mind taking a short break from this > patch. > > > 3) Is the possible scenario legal - by some means a dictionary does not > contain some keys for entries? What happens then? > > No, we should either forbid removing dictionary entries or check that > all the existing documents are not using the entries being removed. > > > If you have any questions on Pluggable TOAST don't hesitate to ask me > and on JSONB Toaster you can ask Nikita Glukhov. > > Will do! Thanks for working on this and I'm looking forward to the > next version of the patch for the next round of review. > > -- > Best regards, > Aleksander Alekseev >
Re: Making pg_rewind faster
Hi everyone! Here's the attached patch submission to optimize pg_rewind performance when many WAL files are retained on server. This patch avoids replaying (copying over) older WAL segment files that fall before the point of divergence between the source and target servers. Thanks, Justin From: Justin Kwan Sent: July 15, 2022 6:13 PM To: vignesh ravichandran Cc: pgsql-hackers ; vignesh ; justinpk...@outlook.com Subject: Re: Making pg_rewind faster Looping in my other email. On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran mailto:ad...@viggy28.dev>> wrote: Hi Hackers, I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target. However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot. 1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point? Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned. Thanks, Vignesh v1-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch Description: v1-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch
Re: Making pg_rewind faster
Hi everyone! I've also attached the pg_rewind optimization patch file for Postgres version 14.4. The previous patch file targets version Postgres version 15 Beta 1/2. Thanks, Justin From: Justin Kwan Sent: July 15, 2022 6:13 PM To: vignesh ravichandran Cc: pgsql-hackers ; vignesh ; justinpk...@outlook.com Subject: Re: Making pg_rewind faster Looping in my other email. On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran mailto:ad...@viggy28.dev>> wrote: Hi Hackers, I have been using pg_rewind in production for 2 years. One of the things that I noticed in pg_rewind is if it doesn't know what to do with a file "it copies". I understand it's the more safer option. After all, the alternative, pg_basebackup copies all the files from source to target. However, this is making pg_rewind inefficient when we have a high number of WAL files. Majority of the data (in most of my cases 95%+) that it copies are WAL files which are anyway same between the source and target. Skipping those same WAL files from copying will improve the speed of pg_rewind a lot. 1. Does pg_rewind need to copy WAL files before the WAL that contains the last common check point? Heikki's presentation https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me a good overview and also explained the behavior what I mentioned. Thanks, Vignesh v1-pg14.4-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch Description: v1-pg14.4-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch
Re: Making pg_rewind faster
Justin Kwan writes: > I've also attached the pg_rewind optimization patch file for Postgres version > 14.4. The previous patch file targets version Postgres version 15 Beta 1/2. It's very unlikely that we would consider committing such changes into released branches. In fact, it's too late even for v15. You should be submitting non-bug-fix patches against master (v16-to-be). regards, tom lane
Re: Use -fvisibility=hidden for shared libraries
Andres Freund writes: > On 2022-03-24 16:13:31 +0100, Peter Eisentraut wrote: >> The easiest solution would be to change worker_spi's Makefile to use >> MODULE_big. > If it were just worker_spi that might be tenable, but there's other extension > modules - even if they don't fail to fail right now, we shouldn't change the > symbol export rules based on MODULES vs MODULE_big. Agreed, that doesn't seem nice. > Is there any reason an extension module would need to directly link against > pgport/pgcommon? I don't think so, right? Shouldn't --- we want it to use the backend's own copy. (Hmm ... but what if there's some file in one of those libraries that is not used by the core backend, but is used by the extension?) In any case, why would that matter for this patch? If an extension does link in such a file, for sure we don't want that exposed outside the extension. > What I'm not sure about is what to do about pg_config - we can't just add > -fvisibility=hidden to pg_config --cflags_sl. We could add --cflags_sl_mod - > but I'm not sure it's worth it? Don't have a strong opinion here. But if we're forcing -fvisibility=hidden on modules built with pgxs, I'm not sure why we can't do so for modules relying on pg_config. > There are a few symbols in plpython that don't need to be exported right now > but are. But it seems better to export a few too many there, as the > alternative is breaking out-of-core transforms. Most of the symbols are used > by the various transform extensions. Concur. I wanted to test this on a compiler lacking -fvisibility, and was somewhat amused to discover that I haven't got one --- even prairiedog's ancient gcc 4.0.1 knows it. Maybe some of the vendor-specific compilers in the buildfarm will be able to verify that aspect for us. I have a couple of very minor nitpicks: 1. The comment for the shared _PG_init/_PG_fini declarations could be worded better, perhaps like * Declare _PG_init/_PG_fini centrally. Historically each shared library had * its own declaration; but now that we want to mark these PGDLLEXPORT, * using central declarations avoids each extension having to add that. * Any existing declarations in extensions will continue to work if fmgr.h * is included before them, otherwise compilation for Windows will fail. 2. I'm not really thrilled with the symbol names C[XX]FLAGS_SL_MOD; it's not apparent what the MOD means, nor is that documented anywhere. Maybe C[XX]FLAGS_SL_VISIBILITY? In any case a comment explaining when these are meant to be used might help extension developers. (Although now that I look, there seems noplace documenting what CFLAG_SL is either :-(.) Beyond that, I think this is committable. We're not likely to learn much more about any potential issues except by exposing it to the buildfarm and developer usage. regards, tom lane
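[Editor's note] For context, the centralized declarations being discussed amount to something like the following (shown here only for illustration; the exact placement and comment wording are the patch author's):

extern PGDLLEXPORT void _PG_init(void);
extern PGDLLEXPORT void _PG_fini(void);

With these declared centrally and marked PGDLLEXPORT, individual extensions no longer need their own declarations, and any existing ones keep compiling as long as fmgr.h is included first.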
Re: Costing elided SubqueryScans more nearly correctly
I wrote: > I also notice that setrefs.c can elide Append and MergeAppend nodes > too in some cases, but AFAICS costing of those node types doesn't > take that into account. I took a closer look at this point and realized that in fact, create_append_path and create_merge_append_path do attempt to account for this. But they get it wrong! Somebody changed the rules in setrefs.c to account for parallelism, and did not update the costing side of things. The attached v2 is the same as v1 plus adding a fix for this point. No regression test results change from that AFAICT. regards, tom lane diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index 358bb2aed6..028d9e1680 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -2451,6 +2451,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel, { Query *parse = root->parse; Query *subquery = rte->subquery; + bool trivial_pathtarget; Relids required_outer; pushdown_safety_info safetyInfo; double tuple_fraction; @@ -2613,6 +2614,36 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel, */ set_subquery_size_estimates(root, rel); + /* + * Also detect whether the reltarget is trivial, so that we can pass that + * info to cost_subqueryscan (rather than re-deriving it multiple times). + * It's trivial if it fetches all the subplan output columns in order. + */ + if (list_length(rel->reltarget->exprs) != list_length(subquery->targetList)) + trivial_pathtarget = false; + else + { + trivial_pathtarget = true; + foreach(lc, rel->reltarget->exprs) + { + Node *node = (Node *) lfirst(lc); + Var *var; + + if (!IsA(node, Var)) + { +trivial_pathtarget = false; +break; + } + var = (Var *) node; + if (var->varno != rti || +var->varattno != foreach_current_index(lc) + 1) + { +trivial_pathtarget = false; +break; + } + } + } + /* * For each Path that subquery_planner produced, make a SubqueryScanPath * in the outer query. @@ -2631,6 +2662,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel, /* Generate outer path using this subpath */ add_path(rel, (Path *) create_subqueryscan_path(root, rel, subpath, + trivial_pathtarget, pathkeys, required_outer)); } @@ -2656,6 +2688,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel, /* Generate outer path using this subpath */ add_partial_path(rel, (Path *) create_subqueryscan_path(root, rel, subpath, + trivial_pathtarget, pathkeys, required_outer)); } diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c index 5e5732f6e1..fb28e6411a 100644 --- a/src/backend/optimizer/path/costsize.c +++ b/src/backend/optimizer/path/costsize.c @@ -1415,10 +1415,12 @@ cost_tidrangescan(Path *path, PlannerInfo *root, * * 'baserel' is the relation to be scanned * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL + * 'trivial_pathtarget' is true if the pathtarget is believed to be trivial. 
*/ void cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root, - RelOptInfo *baserel, ParamPathInfo *param_info) + RelOptInfo *baserel, ParamPathInfo *param_info, + bool trivial_pathtarget) { Cost startup_cost; Cost run_cost; @@ -1458,6 +1460,22 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root, path->path.startup_cost = path->subpath->startup_cost; path->path.total_cost = path->subpath->total_cost; + /* + * However, if there are no relevant restriction clauses and the + * pathtarget is trivial, then we expect that setrefs.c will optimize away + * the SubqueryScan plan node altogether, so we should just make its cost + * and rowcount equal to the input path's. + * + * Note: there are some edge cases where createplan.c will apply a + * different targetlist to the SubqueryScan node, thus falsifying our + * current estimate of whether the target is trivial, and making the cost + * estimate (though not the rowcount) wrong. It does not seem worth the + * extra complication to try to account for that exactly, especially since + * that behavior falsifies other cost estimates as well. + */ + if (qpquals == NIL && trivial_pathtarget) + return; + get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost); startup_cost = qpqual_cost.startup; diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c index 9cef92cab2..eb0ba6cb1c 100644 --- a/src/backend/optimizer/plan/setrefs.c +++ b/src/backend/optimizer/plan/setrefs.c @@ -1636,10 +1636,10 @@ set_append_references(PlannerInfo *root, /* * See if it's safe to get rid of the Append entirely. For this to be * safe, there must be only one child plan and that child plan's parallel - * awareness must match that of the Append's. The
Re: Proposal: allow database-specific role memberships
Rebased yet again... On Mon, Jul 4, 2022 at 1:17 PM Kenaniah Cerny wrote: > Hi Antonin, > > First of all, thank you so much for taking the time to review my patch. > I'll answer your questions in reverse order: > > The "unsafe_tests" directory is where the pre-existing role tests were > located. According to the readme of the "unsafe_tests" directory, the tests > contained within are not run during "make installcheck" because they could > have side-effects that seem undesirable for a production installation. This > seemed like a reasonable location as the new tests that this patch > introduces also modifies the "state" of the database cluster by adding, > modifying, and removing roles & databases (including template1). > > Regarding roles_is_member_of(), the nuance is that role "A" in your > example would only be considered a member of role "B" (and by extension > role "C") when connected to the database in which "A" was granted > database-specific membership to "B". Conversely, when connected to any > other database, "A" would not be considered to be a member of "B". > > This patch is designed to solve the scenarios in which one may want to > grant constrained access to a broader set of privileges. For example, > membership in "pg_read_all_data" effectively grants SELECT and USAGE rights > on everything (implicitly cluster-wide in today's implementation). By > granting a role membership to "pg_read_all_data" within the context of a > specific database, the grantee's read-everything privilege is effectively > constrained to just that specific database (as membership within > "pg_read_all_data" would not otherwise be held). > > A rebased version is attached. > > Thanks again! > > - Kenaniah > > On Wed, Jun 29, 2022 at 6:45 AM Antonin Houska wrote: > >> Kenaniah Cerny wrote: >> >> > Attached is a newly-rebased patch -- would love to get a review from >> someone whenever possible. >> >> I've picked this patch for a review. The patch currently does not apply >> to the >> master branch, so I could only read the diff. Following are my comments: >> >> * I think that roles_is_member_of() deserves a comment explaining why the >> code >> that you moved into append_role_memberships() needs to be called twice, >> i.e. once for global memberships and once for the database-specific >> ones. >> >> I think the reason is that if, for example, role "A" is a >> database-specific >> member of role "B" and "B" is a "global" member of role "C", then "A" >> should >> not be considered a member of "C", unless "A" is granted "C" >> explicitly. Is >> this behavior intended? >> >> Note that in this example, the "C" members are a superset of "B" >> members, >> and thus "C" should have weaker permissions on database objects than >> "B". What's then the reason to not consider "A" a member of "C"? If "C" >> gives its members some permissions of "B" (e.g. "pg_write_all_data"), >> then I >> think the roles hierarchy is poorly designed. >> >> A counter-example might help me to understand. >> >> * Why do you think that "unsafe_tests" is the appropriate name for the >> directory that contains regression tests? >> >> I can spend more time on the review if the patch gets rebased. >> >> -- >> Antonin Houska >> Web: https://www.cybertec-postgresql.com >> > database-role-memberships-v10.patch Description: Binary data
Re: Use -fvisibility=hidden for shared libraries
Hi, Thanks for the quick review. On 2022-07-17 14:54:48 -0400, Tom Lane wrote: > Andres Freund writes: > > Is there any reason an extension module would need to directly link against > > pgport/pgcommon? I don't think so, right? > > Shouldn't --- we want it to use the backend's own copy. (Hmm ... but > what if there's some file in one of those libraries that is not used > by the core backend, but is used by the extension?) In any case, why > would that matter for this patch? If an extension does link in such > a file, for sure we don't want that exposed outside the extension. I was wondering whether there is a reason {pgport,pgcommon}_srv should ever be built with -fno-visibility. Right now there's no reason to do so, but if an extension .so would directly link to them, they'd have to built with that. But until / unless we add visibility information to all backend functions declarations, we can't do that for the versions the backend uses. > > What I'm not sure about is what to do about pg_config - we can't just add > > -fvisibility=hidden to pg_config --cflags_sl. We could add --cflags_sl_mod - > > but I'm not sure it's worth it? > > Don't have a strong opinion here. But if we're forcing > -fvisibility=hidden on modules built with pgxs, I'm not sure why > we can't do so for modules relying on pg_config. If an extension were to use pg_config to build a "proper" shared library (rather than our more plugin like extension libraries), they might not want the -fvisibility=hidden, e.g. if they use a mechanism like our exports.txt. That's also the reason -fvisibility=hidden isn't added to libpq and the ecpg libs - their symbol visility is done via exports.txt. Depending on the platform that'd not work if symbols were already hidden via -fvisibility=hidden (unless explicitly exported via PGDLLEXPORT, of course). It might or might not be worth switching to PGDLLEXPORT for those in the future, but that's imo a separate discussion. > I wanted to test this on a compiler lacking -fvisibility, and was somewhat > amused to discover that I haven't got one --- even prairiedog's ancient > gcc 4.0.1 knows it. Maybe some of the vendor-specific compilers in the > buildfarm will be able to verify that aspect for us. Heh. Even AIX's compiler knows it, and I think sun studio as well. MSVC obviously doesn't, but we have declspec support to deal with that. It's possible that HP-UX's acc doesn't, but that's gone now... It's possible that there's ABIs targeted by gcc that don't have symbol visibility support - but I can't think of any we support of the top of my head. IOW, we could likely rely on symbol visibility support, if there's advantage in that. > I have a couple of very minor nitpicks: > > 1. The comment for the shared _PG_init/_PG_fini declarations could be > worded better, perhaps like > > * Declare _PG_init/_PG_fini centrally. Historically each shared library had > * its own declaration; but now that we want to mark these PGDLLEXPORT, > * using central declarations avoids each extension having to add that. > * Any existing declarations in extensions will continue to work if fmgr.h > * is included before them, otherwise compilation for Windows will fail. WFM. > 2. I'm not really thrilled with the symbol names C[XX]FLAGS_SL_MOD; > it's not apparent what the MOD means, nor is that documented anywhere. It's for module, which I got from pgxs' MODULES/MODULE_big (and incidentally that's also what meson calls it). Definitely should be explained, and perhaps be expanded to MODULE. > Maybe C[XX]FLAGS_SL_VISIBILITY? 
In any case a comment explaining > when these are meant to be used might help extension developers. > (Although now that I look, there seems noplace documenting what > CFLAG_SL is either :-(.) I was thinking that it might be nicer to bundle the flags that should be applied to extension libraries together. I don't think we have others right now, but I think that might change (e.g. -fno-plt -fno-semantic-interposition, -Wl,-Bdynamic would make sense). I'm not wedded to this, but I think it makes some sense? > Beyond that, I think this is committable. We're not likely to learn much > more about any potential issues except by exposing it to the buildfarm > and developer usage. Cool. I'll work on committing the first two then and see what comes out of the CFLAGS_SL_* discussion. Greetings, Andres Freund
Re: Use -fvisibility=hidden for shared libraries
Andres Freund writes: > That's also the reason -fvisibility=hidden isn't added to libpq and the ecpg > libs - their symbol visility is done via exports.txt. Depending on the > platform that'd not work if symbols were already hidden via > -fvisibility=hidden (unless explicitly exported via PGDLLEXPORT, of > course). Got it. Yeah, let's not break those cases. (I don't think we dare put PGDLLEXPORT into libpq-fe.h, for example, so it might be hard to solve it any other way anyhow.) >> 2. I'm not really thrilled with the symbol names C[XX]FLAGS_SL_MOD; >> it's not apparent what the MOD means, nor is that documented anywhere. > It's for module, which I got from pgxs' MODULES/MODULE_big (and incidentally > that's also what meson calls it). Definitely should be explained, and perhaps > be expanded to MODULE. I guessed as much, but +1 for spelling it out as MODULE. regards, tom lane
Re: PATCH: Add Table Access Method option to pgbench
On Wed, Jul 13, 2022 at 12:33 AM Michel Pelletier wrote: > On Thu, 30 Jun 2022 at 18:09, Michael Paquier wrote: > >> On Fri, Jul 01, 2022 at 10:06:49AM +0900, Michael Paquier wrote: >> > And the conclusion back then is that one can already achieve this by >> > using PGOPTIONS: >> > PGOPTIONS='-c default_table_access_method=wuzza' pgbench [...] >> > >> > So there is no need to complicate more pgbench, particularly when it >> > comes to partitioned tables where USING is not supported. Your patch >> > touches this area of the client code to bypass the backend error. >> >> Actually, it could be a good thing to mention that directly in the >> docs of pgbench. >> > > I've attached a documentation patch that mentions and links to the > PGOPTIONS documentation per your suggestion. I'll keep the other patch on > the back burner, perhaps in the future there will be demand for a command > line option as more TAMs are created. > >> >> The documentation change looks good to me
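[Editor's note] Spelling out the PGOPTIONS approach mentioned above (the access method name and database name are illustrative; "heap" is the in-core default, and the setting matters at initialization time, when pgbench creates its tables):

PGOPTIONS='-c default_table_access_method=heap' pgbench -i -s 10 bench_db
PGOPTIONS='-c default_table_access_method=heap' pgbench -c 8 -T 60 bench_db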
Re: postgres_fdw versus regconfig and similar constants
Robert Haas writes: > On Mon, May 16, 2022 at 1:33 PM Tom Lane wrote: >> 0002 tightens deparse.c's rules to only consider an OID alias constant >> as shippable if the object it refers to is shippable. This seems >> obvious in hindsight; I wonder how come we've not realized it before? >> However, this has one rather nasty problem for regconfig in particular: >> with our standard shippability rules none of the built-in text search >> configurations would be treated as shippable, because initdb gives them >> non-fixed OIDs above . That seems like a performance hit we don't >> want to take. In the attached, I hacked around that by making a special >> exception for OIDs up to 16383, but that seems like a fairly ugly kluge. >> Anybody have a better idea? > No. It feels to me like there are not likely to be any really > satisfying answers here. Yeah. Hearing no better ideas from anyone else either, pushed that way. I noted one interesting factoid while trying to make a test case for the missing-schema-qualification issue. I thought of making a foreign table that maps to pg_class and checking what is shipped for select oid, relname from remote_pg_class where oid = 'information_schema.key_column_usage'::regclass; (In hindsight, this wouldn't have worked anyway after patch 0002, because that OID would have been above .) But what I got was Foreign Scan on public.remote_pg_class (cost=100.00..121.21 rows=4 width=68) Output: oid, relname Remote SQL: SELECT oid, relname FROM pg_catalog.pg_class WHERE ((oid = 13527::oid)) The reason for that is that the constant is smashed to type OID so hard that we can no longer tell that it ever was regclass, thus there's no hope of deparsing it in a more-symbolic fashion. I'm not sure if there's anything we could do about that that wouldn't break more things than it fixes (e.g. by making things that should look equal() not be so). But anyway, this effect may help explain the lack of previous complaints in this area. regconfig arguments to text search functions might be pretty nearly the only realistic use-case for shipping symbolic reg* values to the remote. regards, tom lane
Re: Proposal to introduce a shuffle function to intarray extension
On Mon, Jul 18, 2022 at 4:15 AM Martin Kalcher wrote: > Am 17.07.22 um 08:00 schrieb Thomas Munro: > >> Actually ... is there a reason to bother with an intarray version > >> at all, rather than going straight for an in-core anyarray function? > >> It's not obvious to me that an int4-only version would have > >> major performance advantages. > > > > Yeah, that seems like a good direction. If there is a performance > > advantage to specialising, then perhaps we only have to specialise on > > size, not type. Perhaps there could be a general function that > > internally looks out for typbyval && typlen == 4, and dispatches to a > > specialised 4-byte, and likewise for 8, if it can, and that'd already > > be enough to cover int, bigint, float etc, without needing > > specialisations for each type. > > I played around with the idea of an anyarray shuffle(). The hard part > was to deal with arrays with variable length elements, as they can not > be swapped easily in place. I solved it by creating an intermediate > array of references to the elements. I'll attach a patch with the proof > of concept. Unfortunatly it is already about 5 times slower than the > specialised version and i am not sure if it is worth going down that road. Seems OK for a worst case. It must still be a lot faster than doing it in SQL. Now I wonder what the exact requirements would be to dispatch to a faster version that would handle int4. I haven't studied this in detail but perhaps to dispatch to a fast shuffle for objects of size X, the requirement would be something like typlen == X && align_bytes <= typlen && typlen % align_bytes == 0, where align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}? Or in English, 'the data consists of densely packed objects of fixed size X, no padding'. Or perhaps you can work out the padded size and use that, to catch a few more types. Then you call array_shuffle_{2,4,8}() as appropriate, which should be as fast as your original int[] proposal, but work also for float, date, ...? About your experimental patch, I haven't reviewed it properly or tried it but I wonder if uint32 dat_offset, uint32 size (= half size elements) would be enough due to limitations on varlenas.
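[Editor's note] A hedged transcription of that dispatch test into C (the alignment byte values shown are the usual ones; the helper and its name are made up for illustration):

/* true if array elements of this type are densely packed, fixed-size
 * objects of 'width' bytes with no padding, per the criteria above */
static bool
densely_packed(int16 typlen, char typalign, int width)
{
    int     align_bytes;

    switch (typalign)
    {
        case 'c': align_bytes = 1; break;   /* TYPALIGN_CHAR */
        case 's': align_bytes = 2; break;   /* TYPALIGN_SHORT, typically 2 */
        case 'i': align_bytes = 4; break;   /* TYPALIGN_INT, typically 4 */
        case 'd': align_bytes = 8; break;   /* TYPALIGN_DOUBLE, often 8 */
        default:  return false;
    }

    return typlen == width &&
           align_bytes <= typlen &&
           typlen % align_bytes == 0;
}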
Re: Proposal to introduce a shuffle function to intarray extension
Martin Kalcher writes: > Am 17.07.22 um 08:00 schrieb Thomas Munro: >>> Actually ... is there a reason to bother with an intarray version >>> at all, rather than going straight for an in-core anyarray function? > I played around with the idea of an anyarray shuffle(). The hard part > was to deal with arrays with variable length elements, as they can not > be swapped easily in place. I solved it by creating an intermediate > array of references to the elements. I'll attach a patch with the proof > of concept. This does not look particularly idiomatic, or even type-safe. What you should have done was use deconstruct_array to get an array of Datums and isnull flags, then shuffled those, then used construct_array to build the output. (Or, perhaps, use construct_md_array to replicate the input's precise dimensionality. Not sure if anyone would care.) regards, tom lane
Re: Proposal to introduce a shuffle function to intarray extension
Thomas Munro writes: > Seems OK for a worst case. It must still be a lot faster than doing > it in SQL. Now I wonder what the exact requirements would be to > dispatch to a faster version that would handle int4. I find it impossible to believe that it's worth micro-optimizing shuffle() to that extent. Now, maybe doing something in that line in deconstruct_array and construct_array would be worth our time, as that'd benefit a pretty wide group of functions. regards, tom lane
Re: Proposal to introduce a shuffle function to intarray extension
On 18.07.22 at 00:46 Tom Lane wrote: This does not look particularly idiomatic, or even type-safe. What you should have done was use deconstruct_array to get an array of Datums and isnull flags, then shuffled those, then used construct_array to build the output. (Or, perhaps, use construct_md_array to replicate the input's precise dimensionality. Not sure if anyone would care.) regards, tom lane deconstruct_array() would destroy the array's dimensions. I would expect that shuffle() only shuffles the first dimension and keeps the inner arrays intact. Martin
Re: doc: Make selectivity example match wording
On Sat Jul 16, 2022 at 11:23 PM EDT, David G. Johnston wrote: > Thanks for the review. I generally like everything you said but it made me > realize that I still didn't really understand the intent behind the > formula. I spent way too much time working that out for myself, then > turned what I found useful into this v2 patch. > > It may need some semantic markup still but figured I'd see if the idea > makes sense. > > I basically rewrote, in a bit different style, the same material into the > code comments, then proceeded to rework the proof that was already present > there. > > I did do this in somewhat of a vacuum. I'm not inclined to learn this all > start-to-end though. If the abrupt style change is unwanted so be it. I'm > not really sure how much benefit the proof really provides. The comments > in the docs are probably sufficient for the code as well - just define why > the three pieces of the formula exist and are packaged into a single > multiplier called selectivity as an API choice. I suspect once someone > gets to that comment it is fair to assume some prior knowledge. > Admittedly, I didn't really come into this that way... Fair enough, I only know what I can glean from the comments in eqjoinsel_inner and friends myself. I do think even this smaller change is valuable because the current example talks about using an algorithm based on the number of distinct values immediately after showing n_distinct == -1, so making it clear that this case uses num_rows instead is helpful. "This value does get scaled in the non-unique case" again could be more specific ("since here all values are unique, otherwise the calculation uses num_distinct" perhaps?). But past that quibble I'm good.
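[Editor's note] For readers without the docs open, the formula under discussion in the row estimation examples is, roughly:

    selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)

where an n_distinct of -1 in pg_stats means all values in that column are unique, so the corresponding 1/num_distinct term is taken as 1/num_rows for that relation — which is the "scaled in the non-unique case" point above.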
Re: Proposal to introduce a shuffle function to intarray extension
Am 18.07.22 um 00:37 schrieb Thomas Munro: Seems OK for a worst case. It must still be a lot faster than doing it in SQL. Now I wonder what the exact requirements would be to dispatch to a faster version that would handle int4. I haven't studied this in detail but perhaps to dispatch to a fast shuffle for objects of size X, the requirement would be something like typlen == X && align_bytes <= typlen && typlen % align_bytes == 0, where align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}? Or in English, 'the data consists of densely packed objects of fixed size X, no padding'. Or perhaps you can work out the padded size and use that, to catch a few more types. Then you call array_shuffle_{2,4,8}() as appropriate, which should be as fast as your original int[] proposal, but work also for float, date, ...? About your experimental patch, I haven't reviewed it properly or tried it but I wonder if uint32 dat_offset, uint32 size (= half size elements) would be enough due to limitations on varlenas. I made another experimental patch with fast tracks for typelen4 and typelen8. alignments are not yet considered.From 37269598a1dace99c9059514e0fdcb90b9240ed3 Mon Sep 17 00:00:00 2001 From: Martin Kalcher Date: Sun, 17 Jul 2022 18:06:04 +0200 Subject: [PATCH] introduce array_shuffle() --- src/backend/utils/adt/arrayfuncs.c | 178 + src/include/catalog/pg_proc.dat| 3 + 2 files changed, 181 insertions(+) diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c index fb167f226a..919cb2bfd6 100644 --- a/src/backend/utils/adt/arrayfuncs.c +++ b/src/backend/utils/adt/arrayfuncs.c @@ -6872,3 +6872,181 @@ trim_array(PG_FUNCTION_ARGS) PG_RETURN_DATUM(result); } + +static ArrayType * +array_shuffle_4(ArrayType *array) +{ + ArrayType *result; + int32 *adat, +*rdat; + int i, +j, +nitems; + + result = create_array_envelope(ARR_NDIM(array), + ARR_DIMS(array), + ARR_LBOUND(array), + ARR_SIZE(array), + ARR_ELEMTYPE(array), + array->dataoffset); + + adat = (int32 *) ARR_DATA_PTR(array); + rdat = (int32 *) ARR_DATA_PTR(result); + nitems = ARR_DIMS(array)[0]; + + for (i = 0; i < nitems; i++) + { + j = random() % (i + 1); + rdat[i] = rdat[j]; + rdat[j] = adat[i]; + } + return result; +} + +static ArrayType * +array_shuffle_8(ArrayType *array) +{ + ArrayType *result; + int64 *adat, +*rdat; + int i, +j, +nitems; + + result = create_array_envelope(ARR_NDIM(array), + ARR_DIMS(array), + ARR_LBOUND(array), + ARR_SIZE(array), + ARR_ELEMTYPE(array), + array->dataoffset); + + adat = (int64 *) ARR_DATA_PTR(array); + rdat = (int64 *) ARR_DATA_PTR(result); + nitems = ARR_DIMS(array)[0]; + + for (i = 0; i < nitems; i++) + { + j = random() % (i + 1); + rdat[i] = rdat[j]; + rdat[j] = adat[i]; + } + return result; +} + +static ArrayType * +array_shuffle_generic(ArrayType *array, + int16 elmlen, bool elmbyval, char elmalign) +{ + ArrayType *result; + char *adat, +*rdat; + int i, +ndim, +*dims, +nitems, +nelms_per_item; + + struct { + char *dat; + size_t size; + } *refs; + + ndim = ARR_NDIM(array); + dims = ARR_DIMS(array); + + nelms_per_item = 1; + + for (i = 1; i < ndim; i++) + nelms_per_item *= dims[i]; + + nitems = dims[0]; + + refs = palloc(nitems * (sizeof(char *) + sizeof(size_t))); + adat = ARR_DATA_PTR(array); + + /* + * To allow shuffling of arrays with variable length elements, we are going + * to shuffle references to the elements, because swapping variable length + * elements in an array is complicated. 
In a second step we will iterate the + * shuffled references and copy the elements to the destination array. + * We are using the "inside-out" variant of the Fisher-Yates algorithm to + * produce a shuffled array of references: Append each reference than swap it + * with a radomly-chosen refence. + */ + for (i = 0; i < nitems; i++) + { + char *next = array_seek(adat, 0, NULL, nelms_per_item, + elmlen, elmbyval, elmalign); + int j = random() % (i + 1); + refs[i] = refs[j]; + refs[j].dat = adat; + refs[j].size = next - adat; + adat = next; + } + + result = create_array_envelope(ARR_NDIM(array), + ARR_DIMS(array), + ARR_LBOUND(array), + ARR_SIZE(array), + ARR_ELEMTYPE(array), + array->dataoffset); + + rdat = ARR_DATA_PTR(result); + + /* Collect all references and copy elements to the destination array */ + for (i = 0; i < nitems; i++) + { + memcpy(rdat, refs[i].dat, refs[i].size); + rdat += refs[i].size; + } + + pfree(refs); + + return result; +} + +Datum +array_shuffle(PG_FUNCTION_ARGS) +{ + ArrayType *array, + *result; + int16 elmlen; + bool elmbyval; + char elmalign; + Oid elmtyp; + TypeCacheEntry *typentry; + + array = PG_GETARG_ARRAYTYPE_P(0); + + if (ARR_HASNULL(array)) + ereport(ER
Re: Proposal to introduce a shuffle function to intarray extension
Martin Kalcher writes: > Am 18.07.22 um 00:46 schrieb Tom Lane: >> This does not look particularly idiomatic, or even type-safe. What you >> should have done was use deconstruct_array to get an array of Datums and >> isnull flags, then shuffled those, then used construct_array to build the >> output. >> >> (Or, perhaps, use construct_md_array to replicate the input's >> precise dimensionality. Not sure if anyone would care.) > deconstruct_array() would destroy the arrays dimensions. As I said, you can use construct_md_array if you want to preserve the array shape. > I would expect that shuffle() only shuffles the first dimension and > keeps the inner arrays intact. This argument is based on a false premise, ie that Postgres thinks multidimensional arrays are arrays-of-arrays. They aren't, and we're not going to start making them so by defining shuffle() at variance with every other array-manipulating function. Shuffling the individual elements regardless of array shape is the definition that's consistent with our existing functionality. (Having said that, even if we were going to implement it with that definition, I should think that it'd be easiest to do so on the array-of-Datums representation produced by deconstruct_array. That way you don't need to do different things for different element types.) regards, tom lane
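For concreteness, a minimal sketch of the deconstruct_array()/construct_md_array() approach Tom describes is shown below. The function name is illustrative, NULL elements are simply shuffled along with everything else, and the PRNG call is the same random() % n used in the experimental patch; this is a sketch of the technique, not a finished submission.

#include "postgres.h"

#include "fmgr.h"
#include "utils/array.h"
#include "utils/lsyscache.h"

PG_FUNCTION_INFO_V1(array_shuffle_datums);

/*
 * Shuffle an array of any element type by deconstructing it into
 * Datums + null flags, running Fisher-Yates over those, and rebuilding
 * the array with its original dimensions via construct_md_array().
 */
Datum
array_shuffle_datums(PG_FUNCTION_ARGS)
{
    ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
    Oid         elmtyp = ARR_ELEMTYPE(array);
    int16       elmlen;
    bool        elmbyval;
    char        elmalign;
    Datum      *elems;
    bool       *nulls;
    int         nelems;

    get_typlenbyvalalign(elmtyp, &elmlen, &elmbyval, &elmalign);

    deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
                      &elems, &nulls, &nelems);

    /* Fisher-Yates over the Datum/null pairs; element type is irrelevant */
    for (int i = nelems - 1; i > 0; i--)
    {
        int         j = random() % (i + 1);
        Datum       dtmp = elems[i];
        bool        ntmp = nulls[i];

        elems[i] = elems[j];
        nulls[i] = nulls[j];
        elems[j] = dtmp;
        nulls[j] = ntmp;
    }

    /* Preserve the input's dimensionality and lower bounds */
    PG_RETURN_ARRAYTYPE_P(construct_md_array(elems, nulls,
                                             ARR_NDIM(array),
                                             ARR_DIMS(array),
                                             ARR_LBOUND(array),
                                             elmtyp, elmlen,
                                             elmbyval, elmalign));
}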
System column support for partitioned tables using heap
I've run into an existing behavior where xmax(), and various other system columns, return an error when included in the RETURNING list on a partitioned table. ERROR: cannot retrieve a system column in this context This issue got a fair airing back in 2020: AW: posgres 12 bug (partitioned table) https://www.postgresql.org/message-id/flat/GVAP278MB006939B1D7DFDD650E383FBFEACE0%40GVAP278MB0069.CHEP278.PROD.OUTLOOK.COM#908f2604081699e7f41fa20d352e1b79 I'm using 14.4, and just ran into this behavior today. I'm wondering if there has been any new work on this subject, or anything to take into account moving forward? I'm not a C coder, and do not know the Postgres internals, but here's what I gleaned from the thread:
* Available system columns depend on the underlying table access method, and may/will vary across AMs. For example, the columns implemented by heap are what the docs describe, an FDW could be anything, and Postgres has no control of what, if any, system column-like attributes they support, and future and hypothetical AMs may have different sets.
* Rather than return garbage results, or a default of 0, etc., the system throws the error I ran into.
I'd be happier working with a NULL result than garbage, ambiguous results, or errors...but an error is the current behavior. Agreed on that, I'd rather have an error than a bad/meaningless result. Postgres' consistent emphasis on correctness is easily one of its greatest qualities. In my case, I'm upgrading a lot of existing code to try and capture a more complete profile of what an UPSERT did. Right now, I grab a count(*) of the rows and return that. Works fine. A revised snippet looks a bit like this:
...UPSERT code returning xmax as inserted_transaction_id),
status_data AS (
    select count(*) FILTER (where inserted_transaction_id = 0) AS insert_count,
           count(*) FILTER (where inserted_transaction_id != 0) AS estimated_update_count,
           pg_current_xact_id_if_assigned()::text AS transaction_id
    from inserted_rows),
...custom logging code
-- Final output/result.
select insert_count, estimated_update_count, transaction_id from status_data;
This fails on a partitioned table because xmax() may not exist. In fact, it does exist in all of those tables, but the system doesn't know how to guarantee that. I know which tables are partitioned, and can downgrade the result on partitioned tables to the count(*) I've been using to date. But now I'm wondering if working with xmax() like this is a poor idea going forward. I don't want to lean on a feature/behavior that's likely to change. For example, I noticed the other day that MERGE does not support RETURNING. I'd appreciate any insight or advice you can offer.
Re: NAMEDATALEN increase because of non-latin languages
Hi, On 2022-07-18 09:46:44 +0700, John Naylor wrote: > I've made a small step in this direction. Thanks for working on this! > 0001 is just boilerplate, same as v1 If we were to go for this, I wonder if we should backpatch the cast-containing version of GETSTRUCT for less pain when backpatching bugfixes. That'd likely require using a different name for the cast-containing one. > 0002 teaches Catalog.pm to export both C attr name and SQL attr name, so we > can use "sizeof". > 0003 generates static inline functions that work the same as the current > GETSTRUCT macro, i.e. just cast to the right pointer and return it. It seems likely that inline functions are going to be too large for this. There's a lot of GETSTRUCTs in a lot of files, and emitting a copy of the function every time doesn't seem great. > current offset is the previous offset plus the previous type length, plus > any alignment padding suitable for the current type (there is none here, so > the alignment aspect is not tested). I'm hoping something like this will be > sufficient for what's in the current structs, but of course will need > additional work when expanding those to include pointers to varlen > attributes. I've not yet inspected the emitted assembly language, but > regression tests pass. Hm. Wouldn't it make sense to just use the normal tuple deforming routines and then map the results to the structs? Greetings, Andres Freund
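To make the offset rule concrete, here is a small, self-contained illustration of the kind of code such a generator could emit. The struct, field names, and accessor are hypothetical (neither the real catalog layout nor the patch's actual output), and a real generator would derive the lengths and alignments from Catalog.pm metadata rather than hard-coding them.

#include <stdint.h>
#include <string.h>

/* Round len up to the next multiple of alignval (a power of two). */
#define TYPEALIGN_UP(alignval, len) \
    (((uintptr_t) (len) + ((alignval) - 1)) & ~((uintptr_t) ((alignval) - 1)))

/* Stand-in for a catalog row's fixed-width prefix (hypothetical). */
typedef struct HypotheticalCatalogStruct
{
    uint32_t    oid;        /* 4 bytes, 4-byte alignment */
    int16_t     relnatts;   /* 2 bytes, 2-byte alignment */
    char        relkind;    /* 1 byte */
    int64_t     relpages;   /* 8 bytes, 8-byte alignment */
} HypotheticalCatalogStruct;

/*
 * The shape of code a generator could emit instead of a GETSTRUCT cast:
 * each attribute's offset is "previous offset + previous length, aligned
 * for the current type".  With only fixed-width attributes the result
 * equals offsetof(HypotheticalCatalogStruct, relpages) on typical
 * platforms, so the accessor and the C struct agree.
 */
static inline int64_t
hypothetical_get_relpages(const char *tuple_data)
{
    uintptr_t   off = 0;
    int64_t     result;

    off = TYPEALIGN_UP(4, off) + sizeof(uint32_t);  /* skip oid */
    off = TYPEALIGN_UP(2, off) + sizeof(int16_t);   /* skip relnatts */
    off = TYPEALIGN_UP(1, off) + sizeof(char);      /* skip relkind */
    off = TYPEALIGN_UP(8, off);                     /* align for int64_t */

    memcpy(&result, tuple_data + off, sizeof(result));
    return result;
}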
Re: Windows now has fdatasync()
On Fri, Apr 8, 2022 at 7:56 PM Dave Page wrote: > Windows 8 was a pretty unpopular release, so I would expect shifting to > 10/2016+ for PG 16 would be unlikely to be a major problem. Thanks to Michael for making that happen. That removes the main thing I didn't know how to deal with in this patch. Here's a rebase with some cleanup. With my garbage collector hat on, I see that all systems we target have fdatasync(), except: 1. Windows, but this patch supplies src/port/fdatasync.c. 2. DragonflyBSD before 6.1. We have 6.0 in the build farm. 3. Ancient macOS. Current releases have it, though we have to cope with a missing declaration. >From a standards point of view, fdatasync() is issue 5 POSIX like fsync(). Both are optional, but, being a database, we require fsync(), and they're both covered by the same POSIX option "Synchronized Input and Output". My plan now is to commit this patch so that problem #1 is solved, prod conchuela's owner to upgrade to solve #2, and wait until Tom shuts down prairiedog to solve #3. Then we could consider removing the HAVE_FDATASYNC probe and associated #ifdefs when convenient. For that reason, I'm not too bothered about the slight weirdness of defining HAVE_FDATASYNC on Windows even though that doesn't come from configure; it'd hopefully be short-lived. Better ideas welcome, though. Does that make sense? From a985d52d1870ccfb92fda3316158242ce2ba3fe8 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 12 Dec 2021 14:13:35 +1300 Subject: [PATCH v3 1/2] Fix treatment of direct I/O in pg_test_fsync. pg_test_fsync traditionally enabled O_DIRECT for the open_datasync and open_sync tests. In fact current releases of PostgreSQL only do that if wal_level=minimal (less commonly used). So, run the test both ways. Discussion: https://postgr.es/m/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com --- src/bin/pg_test_fsync/pg_test_fsync.c | 75 --- 1 file changed, 69 insertions(+), 6 deletions(-) diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c index f7bc199a30..109ccba819 100644 --- a/src/bin/pg_test_fsync/pg_test_fsync.c +++ b/src/bin/pg_test_fsync/pg_test_fsync.c @@ -294,14 +294,45 @@ test_sync(int writes_per_op) printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K); printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n")); +/* + * Test open_datasync with O_DIRECT if available + */ + printf(LABEL_FORMAT, "open_datasync (direct)"); + fflush(stdout); + +#if defined(OPEN_DATASYNC_FLAG) && (defined(O_DIRECT) || defined(F_NOCACHE)) + if ((tmpfile = open_direct(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1) + { + printf(NA_FORMAT, _("n/a*")); + fs_warning = true; + } + else + { + START_TIMER; + for (ops = 0; alarm_triggered == false; ops++) + { + for (writes = 0; writes < writes_per_op; writes++) +if (pg_pwrite(tmpfile, + buf, + XLOG_BLCKSZ, + writes * XLOG_BLCKSZ) != XLOG_BLCKSZ) + die("write failed"); + } + STOP_TIMER; + close(tmpfile); + } +#else + printf(NA_FORMAT, _("n/a")); +#endif + /* - * Test open_datasync if available + * Test open_datasync buffered if available */ - printf(LABEL_FORMAT, "open_datasync"); + printf(LABEL_FORMAT, "open_datasync (buffered)"); fflush(stdout); #ifdef OPEN_DATASYNC_FLAG - if ((tmpfile = open_direct(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1) + if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1) { printf(NA_FORMAT, _("n/a*")); fs_warning = true; @@ -402,12 +433,12 @@ test_sync(int 
writes_per_op) #endif /* - * Test open_sync if available + * Test open_sync with O_DIRECT if available */ - printf(LABEL_FORMAT, "open_sync"); + printf(LABEL_FORMAT, "open_sync (direct)"); fflush(stdout); -#ifdef OPEN_SYNC_FLAG +#if defined(OPEN_SYNC_FLAG) && (defined(O_DIRECT) || defined(F_NOCACHE)) if ((tmpfile = open_direct(filename, O_RDWR | OPEN_SYNC_FLAG | PG_BINARY, 0)) == -1) { printf(NA_FORMAT, _("n/a*")); @@ -439,6 +470,38 @@ test_sync(int writes_per_op) printf(NA_FORMAT, _("n/a")); #endif +/* + * Test open_sync if available + */ + printf(LABEL_FORMAT, "open_sync (buffered)"); + fflush(stdout); + +#ifdef OPEN_SYNC_FLAG + if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_BINARY, 0)) == -1) + { + printf(NA_FORMAT, _("n/a*")); + fs_warning = true; + } + else + { + START_TIMER; + for (ops = 0; alarm_triggered == false; ops++) + { + for (writes = 0; writes < writes_per_op; writes++) +if (pg_pwrite(tmpfile, + buf, + XLOG_BLCKSZ, + writes * XLOG_BLCKSZ) != XLOG_BLCKSZ) + die("write failed"); + } + STOP_TIMER; + close(tmpfile); + } +#else + printf(NA_FORMAT, _("n/a")); +#endif + + if (fs_warning) { printf(_("* This file system and its mount options do not support direct\n" -- 2.35.1 From 56882f910e58f72987433c72c32fc6eedc1cb
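For reference, the Windows replacement in src/port/ can be quite small. A sketch along these lines (which may or may not match the committed file exactly) simply maps fdatasync() onto the CRT's _commit(), which asks the OS to flush the descriptor's file to disk:

#ifdef _WIN32

#include <io.h>

/*
 * Minimal fdatasync() shim for Windows: _commit() flushes the file
 * backing the descriptor to disk.  Windows has no data-only variant at
 * this level, so metadata gets flushed too -- still correct, if
 * potentially a little slower than a true fdatasync().
 */
int
fdatasync(int fd)
{
    return _commit(fd);
}

#endif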
RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Fri, Jul 15, 2022 10:39 PM Masahiko Sawada wrote: > > This patch should have the fix for the issue that Shi yu reported. Shi > yu, could you please test it again with this patch? > Thanks for updating the patch! I have tested and confirmed that the problem I found has been fixed. Regards, Shi yu
Re: Windows now has fdatasync()
Thomas Munro writes: > With my garbage collector hat on, I see that all systems we target > have fdatasync(), except: > 1. Windows, but this patch supplies src/port/fdatasync.c. > 2. DragonflyBSD before 6.1. We have 6.0 in the build farm. > 3. Ancient macOS. Current releases have it, though we have to cope > with a missing declaration. Hmmm ... according to [1], while current macOS has an undocumented fdatasync function, it doesn't seem to do anything as useful as, say, sync data to disk. I'm not sure what direction you're headed in here, but it probably shouldn't include assuming that fdatasync is actually useful on macOS. But maybe that's not your point? > My plan now is to commit this patch so that problem #1 is solved, prod > conchuela's owner to upgrade to solve #2, and wait until Tom shuts > down prairiedog to solve #3. You could force my hand by pushing something that requires this ;-). I'm not feeling particularly wedded to prairiedog per se. As with my once-and-future HPPA machine, I'd prefer to wait until NetBSD 10 is a thing before spinning up an official buildfarm animal, but I suppose that that's not far away. regards, tom lane [1] https://www.postgresql.org/message-id/1673109.1610733352%40sss.pgh.pa.us
Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns
On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada wrote: > > This patch should have the fix for the issue that Shi yu reported. Shi > yu, could you please test it again with this patch? > Can you explain the cause of the failure and your fix for the same? -- With Regards, Amit Kapila.
Re: Windows now has fdatasync()
On Mon, Jul 18, 2022 at 3:43 PM Tom Lane wrote: > Thomas Munro writes: > > With my garbage collector hat on, I see that all systems we target > > have fdatasync(), except: > > > 1. Windows, but this patch supplies src/port/fdatasync.c. > > 2. DragonflyBSD before 6.1. We have 6.0 in the build farm. > > 3. Ancient macOS. Current releases have it, though we have to cope > > with a missing declaration. > > Hmmm ... according to [1], while current macOS has an undocumented > fdatasync function, it doesn't seem to do anything as useful as, > say, sync data to disk. I'm not sure what direction you're headed > in here, but it probably shouldn't include assuming that fdatasync > is actually useful on macOS. But maybe that's not your point? Oh, I'm not planning to change the default choice on macOS (or Windows). I *am* assuming we're not going to take it away as an option, that cat being out of the bag ever since configure found Apple's secret fdatasync (note that O_DSYNC, our default, is also undocumented and also known not to flush caches, but at least it's present in an Apple header!). I was just noting an upcoming opportunity to remove the configure/meson probes for fdatasync, which made me feel better about the slightly kludgy way this patch is defining HAVE_FDATASYNC explicitly on Windows. > > My plan now is to commit this patch so that problem #1 is solved, prod > > conchuela's owner to upgrade to solve #2, and wait until Tom shuts > > down prairiedog to solve #3. > > You could force my hand by pushing something that requires this ;-). Heh. Let me ask about the DragonFlyBSD thing first.
Re: Windows now has fdatasync()
Thomas Munro writes: > ... I was just noting an upcoming > opportunity to remove the configure/meson probes for fdatasync, which > made me feel better about the slightly kludgy way this patch is > defining HAVE_FDATASYNC explicitly on Windows. Hm. There is certainly not any harm in the meson infrastructure skipping that test, because prairiedog is not able to run meson anyway. Can we do that and still leave it in place on the autoconf side? Maybe not, because I suppose you want to remove #ifdefs in the code itself. I see that fdatasync goes back as far as SUS v2, which we've long taken as our minimum POSIX infrastructure. So there's not a lot of room to insist that we should support allegedly-Unix platforms without it. regards, tom lane
Re: Handle infinite recursion in logical replication setup
On Sat, Jul 16, 2022 at 10:29 AM Dilip Kumar wrote: > > I think giving two options would be really confusing from the > usability perspective. I think what we should be doing here is to > keep these three names 'none', 'any' and 'local' as reserved names for > the origin name so that those are not allowed to be set by the user > and they have some internal meaning. > This makes sense to me. I think we can avoid reserving 'local' for now, until we agree on its use case and implementation. One similar point about slots: we treat a slot_name of 'none' in subscription commands as a special value indicating no slot name, yet we do allow creating a slot with the name 'none' via pg_create_logical_replication_slot(). So, if we want to follow a similar convention here, we may not need to add any restriction for origin names. Personally, though, I think it is better to add such a restriction to avoid confusion, and in fact, as a separate patch, we should even disallow creating a slot named 'none'. -- With Regards, Amit Kapila.
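If we go the restriction route, a minimal sketch of the check could look like the function below. The function name, the exact reserved set ('local' may be dropped per the above), the case-insensitive comparison, and the error wording are all placeholders for whatever the patch settles on.

#include "postgres.h"

/*
 * Hypothetical guard for user-facing origin creation: reject names that
 * the subscription 'origin' option treats as keywords.
 */
static void
check_origin_name_not_reserved(const char *originname)
{
    static const char *const reserved_names[] = {"none", "any", "local"};

    for (int i = 0; i < lengthof(reserved_names); i++)
    {
        if (pg_strcasecmp(originname, reserved_names[i]) == 0)
            ereport(ERROR,
                    (errcode(ERRCODE_RESERVED_NAME),
                     errmsg("replication origin name \"%s\" is reserved",
                            originname)));
    }
}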
Re: Costing elided SubqueryScans more nearly correctly
On Mon, Jul 18, 2022 at 3:16 AM Tom Lane wrote: > I wrote: > > I also notice that setrefs.c can elide Append and MergeAppend nodes > > too in some cases, but AFAICS costing of those node types doesn't > > take that into account. > > I took a closer look at this point and realized that in fact, > create_append_path and create_merge_append_path do attempt to account > for this. But they get it wrong! Somebody changed the rules in > setrefs.c to account for parallelism, and did not update the costing > side of things. > > The attached v2 is the same as v1 plus adding a fix for this point. > No regression test results change from that AFAICT. The new fix looks good to me. It seems setrefs.c added new logic in f9a74c14 to check parallel_aware when removing single-child Appends/MergeAppends, but neglected to update the related costing logic, and I can see this patch fixes the costing for that. BTW, not related to this patch, the new lines for the parallel_aware check in setrefs.c are very wide. How about wrapping them to keep them consistent with the surrounding code? Thanks Richard
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Jul 12, 2022 at 7:07 PM Önder Kalacı wrote: > > Hi hackers, > > > It is often not feasible to use `REPLICA IDENTITY FULL` on the publication, > because it leads to full table scan > > per tuple change on the subscription. This makes `REPLICA IDENTITY FULL` > impracticable -- probably other > > than some small number of use cases. > IIUC, this proposal is to optimize cases where users can't have a unique/primary key for a relation on the subscriber and those relations receive lots of updates or deletes? > With this patch, I'm proposing the following change: If there is an index on > the subscriber, use the index > > as long as the planner sub-modules picks any index over sequential scan. > > Majority of the logic on the subscriber side has already existed in the code. > The subscriber is already > > capable of doing (unique) index scans. With this patch, we are allowing the > index to iterate over the > > tuples fetched and only act when tuples are equal. The ones familiar with > this part of the code could > > realize that the sequential scan code on the subscriber already implements > the `tuples_equal()` function. > > In short, the changes on the subscriber are mostly combining parts of > (unique) index scan and > > sequential scan codes. > > The decision on whether to use an index (or which index) is mostly derived > from planner infrastructure. > > The idea is that on the subscriber we have all the columns. So, construct all > the `Path`s with the > > restrictions on all columns, such as `col_1 = $1 AND col_2 = $2 ... AND col_n > = $N`. Finally, let > > the planner sub-module -- `create_index_paths()` -- to give us the relevant > index `Path`s. On top of > > that adds the sequential scan `Path` as well. Finally, pick the cheapest > `Path` among. > > From the performance point of view, there are few things to note. First, the > patch aims not to > change the behavior when PRIMARY KEY or UNIQUE INDEX is used. Second, when > REPLICA IDENTITY > IS FULL on the publisher and an index is used on the subscriber, the > difference mostly comes down > to `index scan` vs `sequential scan`. That's why it is hard to claim a > certain number of improvements. > It mostly depends on the data size, index and the data distribution. > It seems that in favorable cases it will improve performance but we should consider unfavorable cases as well. Two things that come to mind in that regard are (a) while choosing index/seq. scan paths, the patch doesn't account for cost for tuples_equal() which needs to be performed for index scans, (b) it appears to me that the patch decides which index to use the first time it opens the rel (or if the rel gets invalidated) on subscriber and then for all consecutive operations it uses the same index. It is quite possible that after some more operations on the table, using the same index will actually be costlier than a sequence scan or some other index scan. -- With Regards, Amit Kapila.