Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Martin Kalcher

Am 17.07.22 um 05:37 schrieb Tom Lane:


Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.

regards, tom lane


Hi Tom,

thank you for your thoughts. There are two reasons for choosing an
int4-only version. First, I am not familiar with Postgres development (yet)
and was not sure how open you are to such changes to core, or whether the
proposed feature would be considered valuable enough to go into core. The
second reason was ease of implementation: the intarray extension does not
allow any NULL elements in arrays and treats multidimensional arrays as
though they were linear, which makes the implementation straightforward,
because there are fewer cases to consider.


However, I will take a look at an implementation for anyarray in core.

Martin




Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Martin Kalcher

Am 17.07.22 um 08:00 schrieb Thomas Munro:


I went to see what Professor Lemire would have to say about all this,
expecting to find a SIMD rabbit hole to fall down for some Sunday
evening reading, but the main thing that jumped out was this article
about the modulo operation required by textbook Fisher-Yates to get a
bounded random number, the random() % n that appears in the patch.  He
talks about shuffling twice as fast by using a no-division trick to
get bounded random numbers[1].  I guess you might need to use our
pg_prng_uint32() for that trick because random()'s 0..RAND_MAX might
introduce bias.  Anyway, file that under go-faster ideas for later.

[1] https://lemire.me/blog/2016/06/30/fast-random-shuffling/
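
For anyone who does not follow the link: the no-division trick is essentially
a multiply-and-shift. A minimal sketch, assuming pg_prng_uint32() as the
generator and leaving out the rejection step the article adds to remove the
residual bias:

static inline uint32
bounded_rand(uint32 n)
{
	/* the high 32 bits of the 64-bit product fall in [0, n) with no division */
	uint64		m = (uint64) pg_prng_uint32(&pg_global_prng_state) * (uint64) n;

	return (uint32) (m >> 32);
}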


Hi Thomas,

the small bias of random() % n is not a problem for my use case, but
might be for others. It is easily replaceable with


  (int) pg_prng_uint64_range(&pg_global_prng_state, 0, n-1)

Unfortunately it is a bit slower (on my machine), but that's negligible.

Martin




Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Martin Kalcher

Am 17.07.22 um 05:32 schrieb David G. Johnston:


  +SELECT sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6) !=
sample('{1,2,3,4,5,6,7,8,9,10,11,12}', 6);
+ ?column?
+--
+ t
+(1 row)
+

While small, there is a non-zero chance for both samples to be equal.  This
test should probably just go, I don't see what it tests that isn't covered
by other tests or even trivial usage.
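
For scale: assuming sample() draws 6 of the 12 elements uniformly at random
and returns them in random order, the chance of two independent calls
producing identical output is

  1 / (12 * 11 * 10 * 9 * 8 * 7) = 1 / 665280, i.e. about 1.5e-6

and even if order were ignored it would still be only 1 / C(12,6) = 1 / 924.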



Hey David,

you are right. There is a small chance for this test to fail. I wanted
to test that two invocations produce different results (which is, after all,
the main feature of the function). But it can probably go.


Martin




Re: Allowing REINDEX to have an optional name

2022-07-17 Thread Simon Riggs
On Sun, 17 Jul 2022 at 07:19, Michael Paquier  wrote:
>
> On Fri, Jul 15, 2022 at 06:21:22PM +0100, Simon Riggs wrote:
> > That's fixed it on the CFbot. Over to you, Michael. Thanks.
>
> Sure.  I have looked over that, and this looks fine overall.  I have
> made two changes though.
>
> if (objectKind == REINDEX_OBJECT_SYSTEM &&
> -   !IsSystemClass(relid, classtuple))
> +   !IsCatalogRelationOid(relid))
> +   continue;
> +   else if (objectKind == REINDEX_OBJECT_DATABASE &&
> +IsCatalogRelationOid(relid))
>
> The patch originally relied on IsSystemClass() to decide if a relation
> is a catalog table or not.  This is not wrong in itself because
> ReindexMultipleTables() discards RELKIND_TOAST a couple of lines
> above, but I think that we should switch to IsCatalogRelationOid() as
> that's the line drawn to check for the catalog-ness of a relation.
>
> The second thing is test coverage.  Using a REINDEX DATABASE/SYSTEM
> within the main regression test suite is not a good idea, but we
> already have those commands running in the reindexdb suite so I could
> not resist expanding the test section to track and check relfilenode
> changes through four relations for these cases:
> - Catalog index.
> - Catalog toast index.
> - User table index.
> - User toast index.
> The relfilenodes of those relations are saved in a table and
> cross-checked with the contents of pg_class after each REINDEX, on
> SYSTEM or DATABASE.  There are no new heavy commands, so it does not
> make the test longer.
>
> With all that, I finish with the attached.  Does that look fine to
> you?
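
A sketch of that relfilenode cross-check, with illustrative relation names
rather than the ones used in the actual reindexdb test:

CREATE TABLE prereindex_state AS
  SELECT oid, relname, relfilenode
  FROM pg_class
  WHERE relname IN ('pg_class_oid_index',   -- catalog index
                    'my_table_pkey');       -- user index

-- reindexdb runs REINDEX SYSTEM or REINDEX DATABASE here

SELECT p.relname,
       p.relfilenode <> c.relfilenode AS relfilenode_changed
  FROM prereindex_state p
  JOIN pg_class c USING (oid);

After REINDEX SYSTEM only the catalog index should show a changed relfilenode;
after REINDEX DATABASE only the user index should (toast indexes likewise).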

Sounds great, looks fine. Thanks for your review.

-- 
Simon Riggs    http://www.EnterpriseDB.com/




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-17 Thread Masahiko Sawada
On Fri, Jul 15, 2022 at 3:32 PM shiy.f...@fujitsu.com
 wrote:
>
> On Mon, Jul 11, 2022 9:54 PM Masahiko Sawada  wrote:
> >
> > I've attached an updated patch, please review it.
> >
>
> Thanks for your patch. Here are some comments for the REL14-v1 patch.
>
> 1.
> +   Sizesz = sizeof(TransactionId) * nxacts;;
>
> There is a redundant semicolon at the end.
>
> 2.
> +   workspace = MemoryContextAlloc(rb->context, 
> rb->n_initial_running_xacts);
>
> Should it be:
> +   workspace = MemoryContextAlloc(rb->context, sizeof(TransactionId) * 
> rb->n_initial_running_xacts);
>
> 3.
> +   /* bound check if there is at least one transaction to be removed */
> +   if (NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
> +   
> running->oldestRunningXid))
> +   return;
> +
>
> Here, I think it should return if rb->initial_running_xacts[0] is older than
> oldestRunningXid, right? Should it be changed to:
>
> +   if (!NormalTransactionIdPrecedes(rb->initial_running_xacts[0],
> +   
> running->oldestRunningXid))
> +   return;
>
> 4.
> +   if ((parsed->xinfo & XACT_XINFO_HAS_INVALS) != 0)
>
> Maybe we can change it like the following, to be consistent with other places 
> in
> this file. It's also fine if you don't change it.
>
> +   if (parsed->xinfo & XACT_XINFO_HAS_INVALS)

Thank you for the comments!

I've attached patches for all supported branches including the master.

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/


REL13-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL12-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL14-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL11-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


REL10-v6-0001-Fix-catalog-lookup-with-the-wrong-snapshot-during.patch
Description: Binary data


master-v6-0001-Add-catalog-modifying-transactions-to-logical-dec.patch
Description: Binary data


Re: proposal: possibility to read dumped table's name from file

2022-07-17 Thread Justin Pryzby
Thanks for updating the patch.

This failed to build on windows.
http://cfbot.cputube.org/pavel-stehule.html

Some more comments inline.

On Sun, Jul 17, 2022 at 08:20:47AM +0200, Pavel Stehule wrote:
> The attached patch implements the --filter option for pg_dumpall and for
> pg_restore too.

> diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
> index 5efb442b44..ba2920dbee 100644
> --- a/doc/src/sgml/ref/pg_dump.sgml
> +++ b/doc/src/sgml/ref/pg_dump.sgml
> @@ -779,6 +779,80 @@ PostgreSQL documentation
>
>   
>  
> + 
> +  --filter= class="parameter">filename
> +  
> +   
> +Specify a filename from which to read patterns for objects to include
> +or exclude from the dump. The patterns are interpreted according to 
> the
> +same rules as the corresponding options:
> +-t/--table for tables,
> +-n/--schema for schemas,
> +--include-foreign-data for data on foreign servers 
> and
> +--exclude-table-data for table data.
> +To read from STDIN use - as 
> the

STDIN comma

> +   
> +Lines starting with # are considered comments and
> +are ignored. Comments can be placed after filter as well. Blank lines

change "are ignored" to "ignored", I think.

> diff --git a/doc/src/sgml/ref/pg_dumpall.sgml 
> b/doc/src/sgml/ref/pg_dumpall.sgml
> index 8a081f0080..137491340c 100644
> --- a/doc/src/sgml/ref/pg_dumpall.sgml
> +++ b/doc/src/sgml/ref/pg_dumpall.sgml
> @@ -122,6 +122,29 @@ PostgreSQL documentation
>
>   
>  
> + 
> +  --filter= class="parameter">filename
> +  
> +   
> +Specify a filename from which to read patterns for databases excluded
> +from dump. The patterns are interpretted according to the same rules
> +like --exclude-database.

same rules *as*

> +To read from STDIN use - as 
> the

comma

> +filename.  The --filter option can be specified in
> +conjunction with the above listed options for including or excluding

For dumpall, remove "for including or"
change "above listed options" to "exclude-database" ?

> diff --git a/doc/src/sgml/ref/pg_restore.sgml 
> b/doc/src/sgml/ref/pg_restore.sgml
> index 526986eadb..5f16c4a333 100644
> --- a/doc/src/sgml/ref/pg_restore.sgml
> +++ b/doc/src/sgml/ref/pg_restore.sgml
> @@ -188,6 +188,31 @@ PostgreSQL documentation
>
>   
>  
> + 
> +  --filter= class="parameter">filename
> +  
> +   
> +Specify a filename from which to read patterns for objects excluded
> +or included from restore. The patterns are interpretted according to 
> the
> +same rules like --schema, 
> --exclude-schema,

s/like/as/

> +--function, --index, 
> --table
> +or --trigger.
> +To read from STDIN use - as 
> the

STDIN comma

> +/*
> + * filter_get_keyword - read the next filter keyword from buffer
> + *
> + * Search for keywords (limited to containing ascii alphabetic characters) in

remove "containing"

> + /*
> +  * If the object name pattern has been quoted we must take care parse 
> out
> +  * the entire quoted pattern, which may contain whitespace and can span
> +  * over many lines.

quoted comma
*to parse
remove "over"

> + * The pattern is either simple without any  whitespace, or properly quoted

double space

> + * in case there is whitespace in the object name. The pattern handling 
> follows

s/is/may be/

> + if (size == 7 && pg_strncasecmp(keyword, "include", 7) 
> == 0)
> + *is_include = true;
> + else if (size == 7 && pg_strncasecmp(keyword, 
> "exclude", 7) == 0)
> + *is_include = false;

Can't you write strncasecmp(keyword, "include", size) to avoid hardcoding "7" ?

> +
> + if (size == 4 && pg_strncasecmp(keyword, "data", 4) == 
> 0)
> + *objtype = FILTER_OBJECT_TYPE_DATA;
> + else if (size == 8 && pg_strncasecmp(keyword, 
> "database", 8) == 0)
> + *objtype = FILTER_OBJECT_TYPE_DATABASE;
> + else if (size == 12 && pg_strncasecmp(keyword, 
> "foreign_data", 12) == 0)
> + *objtype = FILTER_OBJECT_TYPE_FOREIGN_DATA;
> + else if (size == 8 && pg_strncasecmp(keyword, 
> "function", 8) == 0)
> + *objtype = FILTER_OBJECT_TYPE_FUNCTION;
> + else if (size == 5 && pg_strncasecmp(keyword, "index", 
> 5) == 0)
> + *objtype = FILTER_OBJECT_TYPE_INDEX;
> + else if (size == 6 && pg_strncasecmp(keyword, "schema", 
> 6) == 0)
> + *objtype = FILTER_OBJECT_TYPE_SCHEMA;
> + else if (size == 5 && pg_strncasecmp(keyword, "table", 
> 5) == 0)
> + *objtype = FILTER_OBJECT_

Re: Commitfest Update

2022-07-17 Thread Nathan Bossart
On Fri, Jul 15, 2022 at 09:37:14PM -0500, Justin Pryzby wrote:
> I'm not suggesting to give the community regulars special treatment, but you
> could reasonably assume that when they added themselves and then "didn't 
> remove
> themself", it was on purpose and not by omission.  I think most of those 
> people
> would be surprised to learn that they're no longer considered to be reviewing
> the patch.

Yeah, I happened to look in my commitfest update folder this morning and
was surprised to learn that I was no longer reviewing 3612.  I spent a good
amount of time getting that patch into a state where I felt it was
ready-for-committer, and I intended to follow up on any further
developments in the thread.  It's not uncommon for me to use the filter
functionality in the app to keep track of patches I'm reviewing.

Of course, there are probably patches where I could be removed from the
reviewers field.  I can try to stay on top of that better.  If you think I
shouldn't be marked as a reviewer and that it's hindering further review
for a patch, feel free to message me publicly or privately about it.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Reducing the chunk header sizes on all memory context types

2022-07-17 Thread Yura Sokolov
On Tue, 12/07/2022 at 22:41 -0700, Andres Freund wrote:
> Hi,
> 
> On 2022-07-12 10:42:07 -0700, Andres Freund wrote:
> > On 2022-07-12 17:01:18 +1200, David Rowley wrote:
> > > There is at least one. It might be major; to reduce the AllocSet chunk
> > > header from 16 bytes down to 8 bytes I had to get rid of the freelist
> > > pointer that was reusing the "aset" field in the chunk header struct.
> > > This works now by storing that pointer in the actual palloc'd memory.
> > > This could lead to pretty hard-to-trace bugs if we have any code that
> > > accidentally writes to memory after pfree.
> > 
> > Can't we use the same trick for allcations in the freelist as we do for the
> > header in a live allocation? I.e. split the 8 byte header into two and use
> > part of it to point to the next element in the list using the offset from 
> > the
> > start of the block, and part of it to indicate the size?
> 
> So that doesn't work because the members in the freelist can be in different
> blocks and those can be further away from each other.
> 
> 
> Perhaps that could still made work somehow: To point to a block we don't
> actually need 64bit pointers, they're always at least of some certain size -
> assuming we can allocate them suitably aligned. And chunks are always 8 byte
> aligned. Unfortunately that doesn't quite get us far enough - assuming a 4kB
> minimum block size (larger than now, but potentially sensible as a common OS
> page size), we still only get to 2^12*8 = 32kB.

Well, we actually have freelists for 11 size classes.
That is just 11 pointers.
We could embed these 88 bytes in every ASet block and then link the blocks,
and then in every block have 44 bytes for in-block free lists.
Total overhead is 132 bytes per block,
or 110 if we limit the block size to 65k*8b=512kb.
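
A rough sketch of that layout (field names invented for illustration only):

typedef struct AllocBlockFreeInfo
{
	/* 11 * 8 = 88 bytes: per-size-class links to the next block with free chunks */
	struct AllocBlockData *next_block_with_free[11];
	/* 11 * 4 = 44 bytes: per-size-class heads of the in-block free lists,
	 * stored as offsets from the start of the block */
	uint32		free_chunk_offset[11];
} AllocBlockFreeInfo;			/* 132 bytes of per-block overhead */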

With doubly-linked block lists (176 bytes per block + 44 bytes for in-block lists
= 220 bytes), we could track block fullness and deallocate a block if it doesn't
contain any live allocation. The "generational" and "slab" allocators would
then be less useful.

But CPU overhead will be noticeable.

> It'd easily work if we made each context have an array of allocated non-huge
> blocks, so that the blocks can be addressed with a small index. The overhead
> of that could be reduced in the common case by embedding a small constant
> sized array in the Aset.  That might actually be worth trying out.
> 
> Greetings,
> 
> Andres Freund
> 
> 





Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Martin Kalcher

Am 17.07.22 um 08:00 schrieb Thomas Munro:

Actually ... is there a reason to bother with an intarray version
at all, rather than going straight for an in-core anyarray function?
It's not obvious to me that an int4-only version would have
major performance advantages.


Yeah, that seems like a good direction.  If there is a performance
advantage to specialising, then perhaps we only have to specialise on
size, not type.  Perhaps there could be a general function that
internally looks out for typbyval && typlen == 4, and dispatches to a
specialised 4-byte, and likewise for 8, if it can, and that'd already
be enough to cover int, bigint, float etc, without needing
specialisations for each type.


I played around with the idea of an anyarray shuffle(). The hard part
was dealing with arrays of variable-length elements, as they cannot be
swapped easily in place. I solved it by creating an intermediate array of
references to the elements. I'll attach a patch with the proof of concept.
Unfortunately it is already about 5 times slower than the specialised
version, and I am not sure if it is worth going down that road.


Martin

From 85cb90c79459132c4220b8d78586dffe6e590ba9 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] introduce array_shuffle()

---
 src/backend/utils/adt/arrayfuncs.c | 73 ++
 src/include/catalog/pg_proc.dat|  3 ++
 2 files changed, 76 insertions(+)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..b6576ce178 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,76 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType  *array,
+			   *result;
+	char	   *adat,
+			   *rdat;
+	int		   nitems,
+			   i;
+	Oid 	   elemtype;
+	TypeCacheEntry typentry;
+	struct { char *dat; size_t size; } *refs;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	if (ARR_HASNULL(array))
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("shuffling arrays with null elements is not supported")));
+
+	if (ARR_NDIM(array) > 1)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("shuffling of multidimensional arrays is not supported")));
+
+	elemtype = ARR_ELEMTYPE(array);
+	typentry = *lookup_type_cache(elemtype, 0);
+	nitems = ArrayGetNItems(ARR_NDIM(array), ARR_DIMS(array));
+	refs = palloc(nitems * (sizeof(char *) + sizeof(size_t)));
+	adat = ARR_DATA_PTR(array);
+
+	/*
+	 * To allow shuffling of arrays with variable length elements, we are going
+	 * to shuffle references to the elements, because swapping variable length
+	 * elements in an array is complicated. In a second step we will iterate the
+	 * shuffled references and copy the elements to the destination array.
+	 * We are using the "inside-out" variant of the Fisher-Yates algorithm to
+	 * produce a shuffled array of references: Append each reference, then swap it
+	 * with a randomly-chosen reference.
+	 */
+	for (i = 0; i < nitems; i++)
+	{
+		int j = random() % (i + 1);
+		char *next = att_addlength_pointer(adat, typentry.typlen, adat);
+		next = (char *) att_align_nominal(next, typentry.typalign);
+		refs[i] = refs[j];
+		refs[j].dat = adat;
+		refs[j].size = next - adat;
+		adat = next;
+	}
+
+	result = create_array_envelope(ARR_NDIM(array),
+   ARR_DIMS(array),
+   ARR_LBOUND(array),
+   ARR_SIZE(array),
+   ARR_ELEMTYPE(array),
+   array->dataoffset);
+
+	rdat = ARR_DATA_PTR(result);
+
+	/* Collect all references and copy elements to the destination array */
+	for (i = 0; i < nitems; i++)
+	{
+		memcpy(rdat, refs[i].dat, refs[i].size);
+		rdat += refs[i].size;
+	}
+
+	pfree(refs);
+	PG_FREE_IF_COPY(array, 0);
+
+	PG_RETURN_ARRAYTYPE_P(result);
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 2e41f4d9e8..56aff551d3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -1681,6 +1681,9 @@
   proname => 'arraycontjoinsel', provolatile => 's', prorettype => 'float8',
   proargtypes => 'internal oid internal int2 internal',
   prosrc => 'arraycontjoinsel' },
+{ oid => '', descr => 'shuffle array',
+  proname => 'array_shuffle', proisstrict => 'f', prorettype => 'anyarray',
+  proargtypes => 'anyarray', prosrc => 'array_shuffle' },
 
 { oid => '764', descr => 'large object import',
   proname => 'lo_import', provolatile => 'v', proparallel => 'u',
-- 
2.37.1



Re: [PATCH] Compression dictionaries for JSONB

2022-07-17 Thread Nikita Malakhov
Hi hackers!

Aleksander, I've carefully gone over the discussion and still have some
questions to ask -

1) Is there any means of measuring the overhead of dictionaries over the
vanilla implementation? IMO it is a must, because JSON is widely used
functionality. Also, as mentioned before, to check the dictionary value must
be detoasted;

2) Storing dictionaries in one table. As I wrote before, this will surely
lead to locks and waits while inserting and updating
dictionaries, and could cause serious performance issues. And vacuuming
this table will lead to locks for all tables using
dictionaries until vacuum is complete;

3) JSON documents in production environments can be very complex and use
thousands of keys, so creating a dictionary directly in an SQL statement is
not a very good approach; that is another reason to have means for creating
dictionaries as separate tables and/or passing them in as files or so;

4) The suggested mechanics, if put on top of TOAST, cannot benefit from
knowledge of the internal JSON structure, which is seen as an important
drawback despite the extensive research work done on working with JSON schema
(storing, validating, etc.), and it also cannot recognize and help to compress
duplicated parts of a JSON document;

5) A small test issue - what if a dictionaried JSON document has a key which is
equal to an OID used in the dictionary for some other key?

In Pluggable TOAST we suggest that, as an improvement, compression should be
put inside the Toaster as an option, so that the Toaster can get maximum
benefit from knowledge of the data's internal structure (and in the future use
JSON Schema). For use in a special Toaster for the JSON datatype, compression
dictionaries seem to be a very valuable addition, but for now I have to agree
that this feature in its current state is competing with Pluggable TOAST.

Thank you!

Regards,
Nikita Malakhov
Postgres Professional
https://postgrespro.ru/

On Tue, Jul 12, 2022 at 3:15 PM Aleksander Alekseev <
aleksan...@timescale.com> wrote:

> Hi Nikita,
>
> > Aleksander, please point me in the right direction if it was mentioned
> before, I have a few questions:
>
> Thanks for your feedback. These are good questions indeed.
>
> > 1) It is not clear for me, how do you see the life cycle of such a
> dictionary? If it is meant to keep growing without
> > cleaning up/rebuilding it could affect performance in an undesirable
> way, along with keeping unused data without
> > any means to get rid of them.
> > 2) From (1) follows another question - I haven't seen any means for
> getting rid of unused keys (or any other means
> > for dictionary cleanup). How could it be done?
>
> Good point. This was not a problem for ZSON since the dictionary size
> was limited to 2**16 entries, the dictionary was immutable, and the
> dictionaries had versions. For compression dictionaries we removed the
> 2**16 entries limit and also decided to get rid of versions. The idea
> was that you can simply continue adding new entries, but no one
> thought about the fact that this will consume the memory required to
> decompress the document indefinitely.
>
> Maybe we should return to the idea of limited dictionary size and
> versions. Objections?
>
> > 4) If one dictionary is used by several tables - I see future issues in
> concurrent dictionary updates. This will for sure
> > affect performance and can cause unpredictable behavior for queries.
>
> You are right. Another reason to return to the idea of dictionary versions.
>
> > Also, I agree with Simon Riggs, using OIDs from the general pool for
> dictionary entries is a bad idea.
>
> Yep, we agreed to stop using OIDs for this, however this was not
> changed in the patch at this point. Please don't hesitate joining the
> effort if you want to. I wouldn't mind taking a short break from this
> patch.
>
> > 3) Is the possible scenario legal - by some means a dictionary does not
> contain some keys for entries? What happens then?
>
> No, we should either forbid removing dictionary entries or check that
> all the existing documents are not using the entries being removed.
>
> > If you have any questions on Pluggable TOAST don't hesitate to ask me
> and on JSONB Toaster you can ask Nikita Glukhov.
>
> Will do! Thanks for working on this and I'm looking forward to the
> next version of the patch for the next round of review.
>
> --
> Best regards,
> Aleksander Alekseev
>


Re: Making pg_rewind faster

2022-07-17 Thread Justin Kwan
Hi everyone!

Here's the attached patch submission to optimize pg_rewind performance when
many WAL files are retained on the server. This patch avoids replaying (copying
over) older WAL segment files that fall before the point of divergence between
the source and target servers.
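
A rough sketch of the idea only, using the existing WAL file-name helpers
(names here are hypothetical; the real logic is in the attached patch):

static bool
wal_segment_predates_divergence(const char *fname, XLogRecPtr last_common_chkpt)
{
	XLogSegNo	chkpt_segno;
	XLogSegNo	file_segno;
	TimeLineID	file_tli;

	if (!IsXLogFileName(fname))
		return false;		/* not a plain WAL segment, decide elsewhere */

	XLByteToSeg(last_common_chkpt, chkpt_segno, WalSegSz);
	XLogFromFileName(fname, &file_tli, &file_segno, WalSegSz);

	/*
	 * Segments entirely before the last common checkpoint are identical on
	 * source and target and need not be copied.
	 */
	return file_segno < chkpt_segno;
}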

Thanks,
Justin

From: Justin Kwan 
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran 
Cc: pgsql-hackers ; vignesh 
; justinpk...@outlook.com 
Subject: Re: Making pg_rewind faster

Looping in my other email.

On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran 
mailto:ad...@viggy28.dev>> wrote:
Hi Hackers,

I have been using pg_rewind in production for 2 years. One of the things that I
noticed about pg_rewind is that if it doesn't know what to do with a file, "it
copies". I understand it's the safer option. After all, the alternative,
pg_basebackup, copies all the files from source to target.

However, this makes pg_rewind inefficient when we have a high number of WAL
files. The majority of the data (in most of my cases 95%+) that it copies is WAL
files, which are the same between the source and target anyway. Skipping those
same WAL files from copying will improve the speed of pg_rewind a lot.

1. Does pg_rewind need to copy WAL files before the WAL that contains the last
common checkpoint?

Heikki's presentation 
https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me 
a good overview and also explained the behavior what I mentioned.

Thanks,
Vignesh



v1-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch
Description:  v1-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch


Re: Making pg_rewind faster

2022-07-17 Thread Justin Kwan
Hi everyone!

I've also attached the pg_rewind optimization patch file for Postgres version
14.4. The previous patch file targets Postgres version 15 Beta 1/2.

Thanks,
Justin

From: Justin Kwan 
Sent: July 15, 2022 6:13 PM
To: vignesh ravichandran 
Cc: pgsql-hackers ; vignesh 
; justinpk...@outlook.com 
Subject: Re: Making pg_rewind faster

Looping in my other email.

On Thu, Jun 30, 2022 at 6:22 AM vignesh ravichandran 
mailto:ad...@viggy28.dev>> wrote:
Hi Hackers,

I have been using pg_rewind in production for 2 years. One of the things that I
noticed about pg_rewind is that if it doesn't know what to do with a file, "it
copies". I understand it's the safer option. After all, the alternative,
pg_basebackup, copies all the files from source to target.

However, this makes pg_rewind inefficient when we have a high number of WAL
files. The majority of the data (in most of my cases 95%+) that it copies is WAL
files, which are the same between the source and target anyway. Skipping those
same WAL files from copying will improve the speed of pg_rewind a lot.

1. Does pg_rewind need to copy WAL files before the WAL that contains the last
common checkpoint?

Heikki's presentation 
https://pgsessions.com/assets/archives/pg_rewind-presentation-paris.pdf gave me 
a good overview and also explained the behavior what I mentioned.

Thanks,
Vignesh



v1-pg14.4-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch
Description:  v1-pg14.4-0001-Avoid-copying-WAL-segments-before-divergence-to-spee.patch


Re: Making pg_rewind faster

2022-07-17 Thread Tom Lane
Justin Kwan  writes:
> I've also attached the pg_rewind optimization patch file for Postgres version
> 14.4. The previous patch file targets Postgres version 15 Beta 1/2.

It's very unlikely that we would consider committing such changes into
released branches.  In fact, it's too late even for v15.  You should
be submitting non-bug-fix patches against master (v16-to-be).

regards, tom lane




Re: Use -fvisibility=hidden for shared libraries

2022-07-17 Thread Tom Lane
Andres Freund  writes:
> On 2022-03-24 16:13:31 +0100, Peter Eisentraut wrote:
>> The easiest solution would be to change worker_spi's Makefile to use
>> MODULE_big.

> If it were just worker_spi that might be tenable, but there's other extension
> modules - even if they don't fail to fail right now, we shouldn't change the
> symbol export rules based on MODULES vs MODULE_big.

Agreed, that doesn't seem nice.

> Is there any reason an extension module would need to directly link against
> pgport/pgcommon? I don't think so, right?

Shouldn't --- we want it to use the backend's own copy.  (Hmm ... but
what if there's some file in one of those libraries that is not used
by the core backend, but is used by the extension?)  In any case, why
would that matter for this patch?  If an extension does link in such
a file, for sure we don't want that exposed outside the extension.

> What I'm not sure about is what to do about pg_config - we can't just add
> -fvisibility=hidden to pg_config --cflags_sl. We could add --cflags_sl_mod -
> but I'm not sure it's worth it?

Don't have a strong opinion here.  But if we're forcing
-fvisibility=hidden on modules built with pgxs, I'm not sure why
we can't do so for modules relying on pg_config.

> There are a few symbols in plpython that don't need to be exported right now
> but are. But it seems better to export a few too many there, as the
> alternative is breaking out-of-core transforms. Most of the symbols are used
> by the various transform extensions.

Concur.

I wanted to test this on a compiler lacking -fvisibility, and was somewhat
amused to discover that I haven't got one --- even prairiedog's ancient
gcc 4.0.1 knows it.  Maybe some of the vendor-specific compilers in the
buildfarm will be able to verify that aspect for us.

I have a couple of very minor nitpicks:

1. The comment for the shared _PG_init/_PG_fini declarations could be
worded better, perhaps like

 * Declare _PG_init/_PG_fini centrally. Historically each shared library had
 * its own declaration; but now that we want to mark these PGDLLEXPORT,
 * using central declarations avoids each extension having to add that.
 * Any existing declarations in extensions will continue to work if fmgr.h
 * is included before them, otherwise compilation for Windows will fail.

2. I'm not really thrilled with the symbol names C[XX]FLAGS_SL_MOD;
it's not apparent what the MOD means, nor is that documented anywhere.
Maybe C[XX]FLAGS_SL_VISIBILITY?  In any case a comment explaining
when these are meant to be used might help extension developers.
(Although now that I look, there seems noplace documenting what
CFLAG_SL is either :-(.)

Beyond that, I think this is committable.  We're not likely to learn much
more about any potential issues except by exposing it to the buildfarm
and developer usage.

regards, tom lane




Re: Costing elided SubqueryScans more nearly correctly

2022-07-17 Thread Tom Lane
I wrote:
> I also notice that setrefs.c can elide Append and MergeAppend nodes
> too in some cases, but AFAICS costing of those node types doesn't
> take that into account.

I took a closer look at this point and realized that in fact,
create_append_path and create_merge_append_path do attempt to account
for this.  But they get it wrong!  Somebody changed the rules in
setrefs.c to account for parallelism, and did not update the costing
side of things.

The attached v2 is the same as v1 plus adding a fix for this point.
No regression test results change from that AFAICT.

regards, tom lane

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 358bb2aed6..028d9e1680 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2451,6 +2451,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 {
 	Query	   *parse = root->parse;
 	Query	   *subquery = rte->subquery;
+	bool		trivial_pathtarget;
 	Relids		required_outer;
 	pushdown_safety_info safetyInfo;
 	double		tuple_fraction;
@@ -2613,6 +2614,36 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 	 */
 	set_subquery_size_estimates(root, rel);
 
+	/*
+	 * Also detect whether the reltarget is trivial, so that we can pass that
+	 * info to cost_subqueryscan (rather than re-deriving it multiple times).
+	 * It's trivial if it fetches all the subplan output columns in order.
+	 */
+	if (list_length(rel->reltarget->exprs) != list_length(subquery->targetList))
+		trivial_pathtarget = false;
+	else
+	{
+		trivial_pathtarget = true;
+		foreach(lc, rel->reltarget->exprs)
+		{
+			Node	   *node = (Node *) lfirst(lc);
+			Var		   *var;
+
+			if (!IsA(node, Var))
+			{
+trivial_pathtarget = false;
+break;
+			}
+			var = (Var *) node;
+			if (var->varno != rti ||
+var->varattno != foreach_current_index(lc) + 1)
+			{
+trivial_pathtarget = false;
+break;
+			}
+		}
+	}
+
 	/*
 	 * For each Path that subquery_planner produced, make a SubqueryScanPath
 	 * in the outer query.
@@ -2631,6 +2662,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 		/* Generate outer path using this subpath */
 		add_path(rel, (Path *)
  create_subqueryscan_path(root, rel, subpath,
+		  trivial_pathtarget,
 		  pathkeys, required_outer));
 	}
 
@@ -2656,6 +2688,7 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 			/* Generate outer path using this subpath */
 			add_partial_path(rel, (Path *)
 			 create_subqueryscan_path(root, rel, subpath,
+	  trivial_pathtarget,
 	  pathkeys,
 	  required_outer));
 		}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 5e5732f6e1..fb28e6411a 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1415,10 +1415,12 @@ cost_tidrangescan(Path *path, PlannerInfo *root,
  *
  * 'baserel' is the relation to be scanned
  * 'param_info' is the ParamPathInfo if this is a parameterized path, else NULL
+ * 'trivial_pathtarget' is true if the pathtarget is believed to be trivial.
  */
 void
 cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
-  RelOptInfo *baserel, ParamPathInfo *param_info)
+  RelOptInfo *baserel, ParamPathInfo *param_info,
+  bool trivial_pathtarget)
 {
 	Cost		startup_cost;
 	Cost		run_cost;
@@ -1458,6 +1460,22 @@ cost_subqueryscan(SubqueryScanPath *path, PlannerInfo *root,
 	path->path.startup_cost = path->subpath->startup_cost;
 	path->path.total_cost = path->subpath->total_cost;
 
+	/*
+	 * However, if there are no relevant restriction clauses and the
+	 * pathtarget is trivial, then we expect that setrefs.c will optimize away
+	 * the SubqueryScan plan node altogether, so we should just make its cost
+	 * and rowcount equal to the input path's.
+	 *
+	 * Note: there are some edge cases where createplan.c will apply a
+	 * different targetlist to the SubqueryScan node, thus falsifying our
+	 * current estimate of whether the target is trivial, and making the cost
+	 * estimate (though not the rowcount) wrong.  It does not seem worth the
+	 * extra complication to try to account for that exactly, especially since
+	 * that behavior falsifies other cost estimates as well.
+	 */
+	if (qpquals == NIL && trivial_pathtarget)
+		return;
+
 	get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
 
 	startup_cost = qpqual_cost.startup;
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 9cef92cab2..eb0ba6cb1c 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -1636,10 +1636,10 @@ set_append_references(PlannerInfo *root,
 	/*
 	 * See if it's safe to get rid of the Append entirely.  For this to be
 	 * safe, there must be only one child plan and that child plan's parallel
-	 * awareness must match that of the Append's.  The

Re: Proposal: allow database-specific role memberships

2022-07-17 Thread Kenaniah Cerny
Rebased yet again...

On Mon, Jul 4, 2022 at 1:17 PM Kenaniah Cerny  wrote:

> Hi Antonin,
>
> First of all, thank you so much for taking the time to review my patch.
> I'll answer your questions in reverse order:
>
> The "unsafe_tests" directory is where the pre-existing role tests were
> located. According to the readme of the "unsafe_tests" directory, the tests
> contained within are not run during "make installcheck" because they could
> have side-effects that seem undesirable for a production installation. This
> seemed like a reasonable location as the new tests that this patch
> introduces also modifies the "state" of the database cluster by adding,
> modifying, and removing roles & databases (including template1).
>
> Regarding roles_is_member_of(), the nuance is that role "A" in your
> example would only be considered a member of role "B" (and by extension
> role "C") when connected to the database in which "A" was granted
> database-specific membership to "B". Conversely, when connected to any
> other database, "A" would not be considered to be a member of "B".
>
> This patch is designed to solve the scenarios in which one may want to
> grant constrained access to a broader set of privileges. For example,
> membership in "pg_read_all_data" effectively grants SELECT and USAGE rights
> on everything (implicitly cluster-wide in today's implementation). By
> granting a role membership to "pg_read_all_data" within the context of a
> specific database, the grantee's read-everything privilege is effectively
> constrained to just that specific database (as membership within
> "pg_read_all_data" would not otherwise be held).
>
> A rebased version is attached.
>
> Thanks again!
>
> - Kenaniah
>
> On Wed, Jun 29, 2022 at 6:45 AM Antonin Houska  wrote:
>
>> Kenaniah Cerny  wrote:
>>
>> > Attached is a newly-rebased patch -- would love to get a review from
>> someone whenever possible.
>>
>> I've picked this patch for a review. The patch currently does not apply
>> to the
>> master branch, so I could only read the diff. Following are my comments:
>>
>> * I think that roles_is_member_of() deserves a comment explaining why the
>> code
>>   that you moved into append_role_memberships() needs to be called twice,
>>   i.e. once for global memberships and once for the database-specific
>> ones.
>>
>>   I think the reason is that if, for example, role "A" is a
>> database-specific
>>   member of role "B" and "B" is a "global" member of role "C", then "A"
>> should
>>   not be considered a member of "C", unless "A" is granted "C"
>> explicitly. Is
>>   this behavior intended?
>>
>>   Note that in this example, the "C" members are a superset of "B"
>> members,
>>   and thus "C" should have weaker permissions on database objects than
>>   "B". What's then the reason to not consider "A" a member of "C"? If "C"
>>   gives its members some permissions of "B" (e.g. "pg_write_all_data"),
>> then I
>>   think the roles hierarchy is poorly designed.
>>
>>   A counter-example might help me to understand.
>>
>> * Why do you think that "unsafe_tests" is the appropriate name for the
>>   directory that contains regression tests?
>>
>> I can spend more time on the review if the patch gets rebased.
>>
>> --
>> Antonin Houska
>> Web: https://www.cybertec-postgresql.com
>>
>


database-role-memberships-v10.patch
Description: Binary data


Re: Use -fvisibility=hidden for shared libraries

2022-07-17 Thread Andres Freund
Hi,

Thanks for the quick review.


On 2022-07-17 14:54:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Is there any reason an extension module would need to directly link against
> > pgport/pgcommon? I don't think so, right?
> 
> Shouldn't --- we want it to use the backend's own copy.  (Hmm ... but
> what if there's some file in one of those libraries that is not used
> by the core backend, but is used by the extension?)  In any case, why
> would that matter for this patch?  If an extension does link in such
> a file, for sure we don't want that exposed outside the extension.

I was wondering whether there is a reason {pgport,pgcommon}_srv should ever be
built with -fno-visibility. Right now there's no reason to do so, but if an
extension .so would directly link to them, they'd have to built with that. But
until / unless we add visibility information to all backend functions
declarations, we can't do that for the versions the backend uses.


> > What I'm not sure about is what to do about pg_config - we can't just add
> > -fvisibility=hidden to pg_config --cflags_sl. We could add --cflags_sl_mod -
> > but I'm not sure it's worth it?
> 
> Don't have a strong opinion here.  But if we're forcing
> -fvisibility=hidden on modules built with pgxs, I'm not sure why
> we can't do so for modules relying on pg_config.

If an extension were to use pg_config to build a "proper" shared library
(rather than our more plugin-like extension libraries), they might not want
the -fvisibility=hidden, e.g. if they use a mechanism like our exports.txt.

That's also the reason -fvisibility=hidden isn't added to libpq and the ecpg
libs - their symbol visibility is done via exports.txt. Depending on the
platform that'd not work if symbols were already hidden via
-fvisibility=hidden (unless explicitly exported via PGDLLEXPORT, of
course).

It might or might not be worth switching to PGDLLEXPORT for those in the
future, but that's imo a separate discussion.


> I wanted to test this on a compiler lacking -fvisibility, and was somewhat
> amused to discover that I haven't got one --- even prairiedog's ancient
> gcc 4.0.1 knows it.  Maybe some of the vendor-specific compilers in the
> buildfarm will be able to verify that aspect for us.

Heh. Even AIX's compiler knows it, and I think sun studio as well.  MSVC
obviously doesn't, but we have declspec support to deal with that. It's
possible that HP-UX's acc doesn't, but that's gone now...  It's possible that
there are ABIs targeted by gcc that don't have symbol visibility support - but I
can't think of any we support off the top of my head.

IOW, we could likely rely on symbol visibility support, if there's advantage
in that.


> I have a couple of very minor nitpicks:
> 
> 1. The comment for the shared _PG_init/_PG_fini declarations could be
> worded better, perhaps like
> 
>  * Declare _PG_init/_PG_fini centrally. Historically each shared library had
>  * its own declaration; but now that we want to mark these PGDLLEXPORT,
>  * using central declarations avoids each extension having to add that.
>  * Any existing declarations in extensions will continue to work if fmgr.h
>  * is included before them, otherwise compilation for Windows will fail.

WFM.


> 2. I'm not really thrilled with the symbol names C[XX]FLAGS_SL_MOD;
> it's not apparent what the MOD means, nor is that documented anywhere.

It's for module, which I got from pgxs' MODULES/MODULE_big (and incidentally
that's also what meson calls it). Definitely should be explained, and perhaps
be expanded to MODULE.


> Maybe C[XX]FLAGS_SL_VISIBILITY?  In any case a comment explaining
> when these are meant to be used might help extension developers.
> (Although now that I look, there seems noplace documenting what
> CFLAG_SL is either :-(.)

I was thinking that it might be nicer to bundle the flags that should be
applied to extension libraries together. I don't think we have others right
now, but I think that might change (e.g. -fno-plt -fno-semantic-interposition,
-Wl,-Bdynamic would make sense).

I'm not wedded to this, but I think it makes some sense?


> Beyond that, I think this is committable.  We're not likely to learn much
> more about any potential issues except by exposing it to the buildfarm
> and developer usage.

Cool. I'll work on committing the first two then and see what comes out of the
CFLAGS_SL_* discussion.

Greetings,

Andres Freund




Re: Use -fvisibility=hidden for shared libraries

2022-07-17 Thread Tom Lane
Andres Freund  writes:
> That's also the reason -fvisibility=hidden isn't added to libpq and the ecpg
> libs - their symbol visibility is done via exports.txt. Depending on the
> platform that'd not work if symbols were already hidden via
> -fvisibility=hidden (unless explicitly exported via PGDLLEXPORT, of
> course).

Got it.  Yeah, let's not break those cases.  (I don't think we dare put
PGDLLEXPORT into libpq-fe.h, for example, so it might be hard to solve
it any other way anyhow.)

>> 2. I'm not really thrilled with the symbol names C[XX]FLAGS_SL_MOD;
>> it's not apparent what the MOD means, nor is that documented anywhere.

> It's for module, which I got from pgxs' MODULES/MODULE_big (and incidentally
> that's also what meson calls it). Definitely should be explained, and perhaps
> be expanded to MODULE.

I guessed as much, but +1 for spelling it out as MODULE.

regards, tom lane




Re: PATCH: Add Table Access Method option to pgbench

2022-07-17 Thread Mason Sharp
On Wed, Jul 13, 2022 at 12:33 AM Michel Pelletier 
wrote:

> On Thu, 30 Jun 2022 at 18:09, Michael Paquier  wrote:
>
>> On Fri, Jul 01, 2022 at 10:06:49AM +0900, Michael Paquier wrote:
>> > And the conclusion back then is that one can already achieve this by
>> > using PGOPTIONS:
>> > PGOPTIONS='-c default_table_access_method=wuzza' pgbench [...]
>> >
>> > So there is no need to complicate more pgbench, particularly when it
>> > comes to partitioned tables where USING is not supported.  Your patch
>> > touches this area of the client code to bypass the backend error.
>>
>> Actually, it could be a good thing to mention that directly in the
>> docs of pgbench.
>>
>
> I've attached a documentation patch that mentions and links to the
> PGOPTIONS documentation per your suggestion.  I'll keep the other patch on
> the back burner, perhaps in the future there will be demand for a command
> line option as more TAMs are created.
>
>>
>>
The documentation change looks good to me


Re: postgres_fdw versus regconfig and similar constants

2022-07-17 Thread Tom Lane
Robert Haas  writes:
> On Mon, May 16, 2022 at 1:33 PM Tom Lane  wrote:
>> 0002 tightens deparse.c's rules to only consider an OID alias constant
>> as shippable if the object it refers to is shippable.  This seems
>> obvious in hindsight; I wonder how come we've not realized it before?
>> However, this has one rather nasty problem for regconfig in particular:
>> with our standard shippability rules none of the built-in text search
>> configurations would be treated as shippable, because initdb gives them
>> non-fixed OIDs above .  That seems like a performance hit we don't
>> want to take.  In the attached, I hacked around that by making a special
>> exception for OIDs up to 16383, but that seems like a fairly ugly kluge.
>> Anybody have a better idea?

> No. It feels to me like there are not likely to be any really
> satisfying answers here.

Yeah.  Hearing no better ideas from anyone else either, pushed that way.

I noted one interesting factoid while trying to make a test case for the
missing-schema-qualification issue.  I thought of making a foreign table
that maps to pg_class and checking what is shipped for

select oid, relname from remote_pg_class where oid =
'information_schema.key_column_usage'::regclass;

(In hindsight, this wouldn't have worked anyway after patch 0002,
because that OID would have been above .)  But what I got was

 Foreign Scan on public.remote_pg_class  (cost=100.00..121.21 rows=4 width=68)
   Output: oid, relname
   Remote SQL: SELECT oid, relname FROM pg_catalog.pg_class WHERE ((oid = 
13527::oid))

The reason for that is that the constant is smashed to type OID so hard
that we can no longer tell that it ever was regclass, thus there's no
hope of deparsing it in a more-symbolic fashion.  I'm not sure if there's
anything we could do about that that wouldn't break more things than
it fixes (e.g. by making things that should look equal() not be so).
But anyway, this effect may help explain the lack of previous complaints
in this area.  regconfig arguments to text search functions might be
pretty nearly the only realistic use-case for shipping symbolic reg*
values to the remote.

regards, tom lane




Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Thomas Munro
On Mon, Jul 18, 2022 at 4:15 AM Martin Kalcher
 wrote:
> Am 17.07.22 um 08:00 schrieb Thomas Munro:
> >> Actually ... is there a reason to bother with an intarray version
> >> at all, rather than going straight for an in-core anyarray function?
> >> It's not obvious to me that an int4-only version would have
> >> major performance advantages.
> >
> > Yeah, that seems like a good direction.  If there is a performance
> > advantage to specialising, then perhaps we only have to specialise on
> > size, not type.  Perhaps there could be a general function that
> > internally looks out for typbyval && typlen == 4, and dispatches to a
> > specialised 4-byte, and likewise for 8, if it can, and that'd already
> > be enough to cover int, bigint, float etc, without needing
> > specialisations for each type.
>
> I played around with the idea of an anyarray shuffle(). The hard part
> was to deal with arrays with variable length elements, as they can not
> be swapped easily in place. I solved it by creating an intermediate
> array of references to the elements. I'll attach a patch with the proof
> of concept. Unfortunatly it is already about 5 times slower than the
> specialised version and i am not sure if it is worth going down that road.

Seems OK for a worst case.  It must still be a lot faster than doing
it in SQL.  Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4.  I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'.  Or perhaps you can work out the padded size and
use that, to catch a few more types.  Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?
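
Spelled out as code, the dispatch might look roughly like this (the
typalign-to-bytes helper is hypothetical; the array_shuffle_{4,8,generic}
names match Martin's follow-up patch later in the thread):

	int		align_bytes = typalign_to_bytes(typentry->typalign);	/* 'c'=1, 's'=2, 'i'=4, 'd'=8 */

	if (typentry->typlen == 4 && align_bytes <= 4 && 4 % align_bytes == 0)
		result = array_shuffle_4(array);		/* densely packed 4-byte elements */
	else if (typentry->typlen == 8 && align_bytes <= 8 && 8 % align_bytes == 0)
		result = array_shuffle_8(array);		/* densely packed 8-byte elements */
	else
		result = array_shuffle_generic(array, typentry->typlen,
									   typentry->typbyval, typentry->typalign);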

About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.




Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Tom Lane
Martin Kalcher  writes:
> Am 17.07.22 um 08:00 schrieb Thomas Munro:
>>> Actually ... is there a reason to bother with an intarray version
>>> at all, rather than going straight for an in-core anyarray function?

> I played around with the idea of an anyarray shuffle(). The hard part 
> was to deal with arrays with variable length elements, as they can not 
> be swapped easily in place. I solved it by creating an intermediate 
> array of references to the elements. I'll attach a patch with the proof 
> of concept.

This does not look particularly idiomatic, or even type-safe.  What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality.  Not sure if anyone would care.)
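
A minimal sketch of that idiom, assuming the usual arrayfuncs.c includes plus
common/pg_prng.h (this is not the patch, just an illustration of the
deconstruct/construct approach):

Datum
array_shuffle(PG_FUNCTION_ARGS)
{
	ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
	Oid			elmtyp = ARR_ELEMTYPE(array);
	TypeCacheEntry *typentry = lookup_type_cache(elmtyp, 0);
	Datum	   *elems;
	bool	   *nulls;
	int			nelems;

	deconstruct_array(array, elmtyp,
					  typentry->typlen, typentry->typbyval, typentry->typalign,
					  &elems, &nulls, &nelems);

	/* Fisher-Yates shuffle over the Datum/isnull pairs */
	for (int i = nelems - 1; i > 0; i--)
	{
		int			j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, i);
		Datum		dtmp = elems[i];
		bool		ntmp = nulls[i];

		elems[i] = elems[j];
		nulls[i] = nulls[j];
		elems[j] = dtmp;
		nulls[j] = ntmp;
	}

	/* construct_md_array() keeps the input's dimensionality */
	PG_RETURN_ARRAYTYPE_P(construct_md_array(elems, nulls,
											 ARR_NDIM(array), ARR_DIMS(array),
											 ARR_LBOUND(array), elmtyp,
											 typentry->typlen, typentry->typbyval,
											 typentry->typalign));
}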

regards, tom lane




Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Tom Lane
Thomas Munro  writes:
> Seems OK for a worst case.  It must still be a lot faster than doing
> it in SQL.  Now I wonder what the exact requirements would be to
> dispatch to a faster version that would handle int4.

I find it impossible to believe that it's worth micro-optimizing
shuffle() to that extent.  Now, maybe doing something in that line
in deconstruct_array and construct_array would be worth our time,
as that'd benefit a pretty wide group of functions.

regards, tom lane




Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Martin Kalcher

Am 18.07.22 um 00:46 schrieb Tom Lane:


This does not look particularly idiomatic, or even type-safe.  What you
should have done was use deconstruct_array to get an array of Datums and
isnull flags, then shuffled those, then used construct_array to build the
output.

(Or, perhaps, use construct_md_array to replicate the input's
precise dimensionality.  Not sure if anyone would care.)

regards, tom lane


deconstruct_array() would destroy the array's dimensions. I would expect
shuffle() to only shuffle the first dimension and keep the inner arrays
intact.


Martin




Re: doc: Make selectivity example match wording

2022-07-17 Thread Dian M Fay
On Sat Jul 16, 2022 at 11:23 PM EDT, David G. Johnston wrote:
> Thanks for the review.  I generally like everything you said but it made me
> realize that I still didn't really understand the intent behind the
> formula.  I spent way too much time working that out for myself, then
> turned what I found useful into this v2 patch.
>
> It may need some semantic markup still but figured I'd see if the idea
> makes sense.
>
> I basically rewrote, in a bit different style, the same material into the
> code comments, then proceeded to rework the proof that was already present
> there.
>
> I did do this in somewhat of a vacuum.  I'm not inclined to learn this all
> start-to-end though.  If the abrupt style change is unwanted so be it.  I'm
> not really sure how much benefit the proof really provides.  The comments
> in the docs are probably sufficient for the code as well - just define why
> the three pieces of the formula exist and are packaged into a single
> multiplier called selectivity as an API choice.  I suspect once someone
> gets to that comment it is fair to assume some prior knowledge.
> Admittedly, I didn't really come into this that way...

Fair enough, I only know what I can glean from the comments in
eqjoinsel_inner and friends myself. I do think even this smaller change
is valuable because the current example talks about using an algorithm
based on the number of distinct values immediately after showing
n_distinct == -1, so making it clear that this case uses num_rows
instead is helpful.

"This value does get scaled in the non-unique case" again could be more
specific ("since here all values are unique, otherwise the calculation
uses num_distinct" perhaps?). But past that quibble I'm good.




Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Martin Kalcher

Am 18.07.22 um 00:37 schrieb Thomas Munro:

Seems OK for a worst case.  It must still be a lot faster than doing
it in SQL.  Now I wonder what the exact requirements would be to
dispatch to a faster version that would handle int4.  I haven't
studied this in detail but perhaps to dispatch to a fast shuffle for
objects of size X, the requirement would be something like typlen == X
&& align_bytes <= typlen && typlen % align_bytes == 0, where
align_bytes is typalign converted to ALIGNOF_{CHAR,SHORT,INT,DOUBLE}?
Or in English, 'the data consists of densely packed objects of fixed
size X, no padding'.  Or perhaps you can work out the padded size and
use that, to catch a few more types.  Then you call
array_shuffle_{2,4,8}() as appropriate, which should be as fast as
your original int[] proposal, but work also for float, date, ...?

About your experimental patch, I haven't reviewed it properly or tried
it but I wonder if uint32 dat_offset, uint32 size (= half size
elements) would be enough due to limitations on varlenas.


I made another experimental patch with fast tracks for typlen 4 and
typlen 8. Alignments are not yet considered.

From 37269598a1dace99c9059514e0fdcb90b9240ed3 Mon Sep 17 00:00:00 2001
From: Martin Kalcher 
Date: Sun, 17 Jul 2022 18:06:04 +0200
Subject: [PATCH] introduce array_shuffle()

---
 src/backend/utils/adt/arrayfuncs.c | 178 +
 src/include/catalog/pg_proc.dat|   3 +
 2 files changed, 181 insertions(+)

diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index fb167f226a..919cb2bfd6 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -6872,3 +6872,181 @@ trim_array(PG_FUNCTION_ARGS)
 
 	PG_RETURN_DATUM(result);
 }
+
+static ArrayType *
+array_shuffle_4(ArrayType *array)
+{
+	ArrayType	*result;
+	int32		*adat,
+*rdat;
+	int			i,
+j,
+nitems;
+
+	result = create_array_envelope(ARR_NDIM(array),
+   ARR_DIMS(array),
+   ARR_LBOUND(array),
+   ARR_SIZE(array),
+   ARR_ELEMTYPE(array),
+   array->dataoffset);
+
+	adat = (int32 *) ARR_DATA_PTR(array);
+	rdat = (int32 *) ARR_DATA_PTR(result);
+	nitems = ARR_DIMS(array)[0];
+
+	for (i = 0; i < nitems; i++)
+	{
+		j = random() % (i + 1);
+		rdat[i] = rdat[j];
+		rdat[j] = adat[i];
+	}
+	return result;
+}
+
+static ArrayType *
+array_shuffle_8(ArrayType *array)
+{
+	ArrayType	*result;
+	int64		*adat,
+*rdat;
+	int			i,
+j,
+nitems;
+
+	result = create_array_envelope(ARR_NDIM(array),
+   ARR_DIMS(array),
+   ARR_LBOUND(array),
+   ARR_SIZE(array),
+   ARR_ELEMTYPE(array),
+   array->dataoffset);
+
+	adat = (int64 *) ARR_DATA_PTR(array);
+	rdat = (int64 *) ARR_DATA_PTR(result);
+	nitems = ARR_DIMS(array)[0];
+
+	for (i = 0; i < nitems; i++)
+	{
+		j = random() % (i + 1);
+		rdat[i] = rdat[j];
+		rdat[j] = adat[i];
+	}
+	return result;
+}
+
+static ArrayType *
+array_shuffle_generic(ArrayType *array,
+	  int16 elmlen, bool elmbyval, char elmalign)
+{
+	ArrayType	*result;
+	char		*adat,
+*rdat;
+	int 		i,
+ndim,
+*dims,
+nitems,
+nelms_per_item;
+
+	struct {
+		char *dat;
+		size_t size;
+	} *refs;
+
+	ndim = ARR_NDIM(array);
+	dims = ARR_DIMS(array);
+
+	nelms_per_item = 1;
+
+	for (i = 1; i < ndim; i++)
+		nelms_per_item *= dims[i];
+
+	nitems = dims[0];
+
+	refs = palloc(nitems * (sizeof(char *) + sizeof(size_t)));
+	adat = ARR_DATA_PTR(array);
+
+	/*
+	 * To allow shuffling of arrays with variable length elements, we are going
+	 * to shuffle references to the elements, because swapping variable length
+	 * elements in an array is complicated. In a second step we will iterate the
+	 * shuffled references and copy the elements to the destination array.
+	 * We are using the "inside-out" variant of the Fisher-Yates algorithm to
+	 * produce a shuffled array of references: append each reference, then swap it
+	 * with a randomly-chosen reference.
+	 */
+	for (i = 0; i < nitems; i++)
+	{
+		char *next = array_seek(adat, 0, NULL, nelms_per_item,
+			elmlen, elmbyval, elmalign);
+		int j = random() % (i + 1);
+		refs[i] = refs[j];
+		refs[j].dat = adat;
+		refs[j].size = next - adat;
+		adat = next;
+	}
+
+	result = create_array_envelope(ARR_NDIM(array),
+   ARR_DIMS(array),
+   ARR_LBOUND(array),
+   ARR_SIZE(array),
+   ARR_ELEMTYPE(array),
+   array->dataoffset);
+
+	rdat = ARR_DATA_PTR(result);
+
+	/* Collect all references and copy elements to the destination array */
+	for (i = 0; i < nitems; i++)
+	{
+		memcpy(rdat, refs[i].dat, refs[i].size);
+		rdat += refs[i].size;
+	}
+
+	pfree(refs);
+
+	return result;
+}
+
+Datum
+array_shuffle(PG_FUNCTION_ARGS)
+{
+	ArrayType		*array,
+	*result;
+	int16			elmlen;
+	bool			elmbyval;
+	char			elmalign;
+	Oid elmtyp;
+	TypeCacheEntry	*typentry;
+
+	array = PG_GETARG_ARRAYTYPE_P(0);
+
+	if (ARR_HASNULL(array))
+		ereport(ER

Re: Proposal to introduce a shuffle function to intarray extension

2022-07-17 Thread Tom Lane
Martin Kalcher  writes:
> Am 18.07.22 um 00:46 schrieb Tom Lane:
>> This does not look particularly idiomatic, or even type-safe.  What you
>> should have done was use deconstruct_array to get an array of Datums and
>> isnull flags, then shuffled those, then used construct_array to build the
>> output.
>> 
>> (Or, perhaps, use construct_md_array to replicate the input's
>> precise dimensionality.  Not sure if anyone would care.)

> deconstruct_array() would destroy the array's dimensions.

As I said, you can use construct_md_array if you want to preserve the
array shape.

> I would expect that shuffle() only shuffles the first dimension and
> keeps the inner arrays intact.

This argument is based on a false premise, ie that Postgres thinks
multidimensional arrays are arrays-of-arrays.  They aren't, and
we're not going to start making them so by defining shuffle()
at variance with every other array-manipulating function.  Shuffling
the individual elements regardless of array shape is the definition
that's consistent with our existing functionality.

(Having said that, even if we were going to implement it with that
definition, I should think that it'd be easiest to do so on the
array-of-Datums representation produced by deconstruct_array.
That way you don't need to do different things for different element
types.)
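
For illustration, an untested sketch of that Datum-based approach might look
about like this (element-wise shuffle, array shape preserved; not meant as
the final code):

#include "postgres.h"
#include "common/pg_prng.h"
#include "fmgr.h"
#include "utils/array.h"
#include "utils/lsyscache.h"

Datum
array_shuffle(PG_FUNCTION_ARGS)
{
	ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
	Oid			elmtyp = ARR_ELEMTYPE(array);
	int16		elmlen;
	bool		elmbyval;
	char		elmalign;
	Datum	   *elms;
	bool	   *nuls;
	int			nelms;

	get_typlenbyvalalign(elmtyp, &elmlen, &elmbyval, &elmalign);
	deconstruct_array(array, elmtyp, elmlen, elmbyval, elmalign,
					  &elms, &nuls, &nelms);

	/* Fisher-Yates over the flat Datum/null arrays, regardless of shape */
	for (int i = nelms - 1; i > 0; i--)
	{
		int			j = (int) pg_prng_uint64_range(&pg_global_prng_state, 0, i);
		Datum		dtmp = elms[i];
		bool		ntmp = nuls[i];

		elms[i] = elms[j];
		nuls[i] = nuls[j];
		elms[j] = dtmp;
		nuls[j] = ntmp;
	}

	/* Rebuild with the input's dimensionality, lower bounds and element type */
	PG_RETURN_ARRAYTYPE_P(construct_md_array(elms, nuls,
											 ARR_NDIM(array), ARR_DIMS(array),
											 ARR_LBOUND(array), elmtyp,
											 elmlen, elmbyval, elmalign));
}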

regards, tom lane




System column support for partitioned tables using heap

2022-07-17 Thread Morris de Oryx
I've run into an existing behavior where xmax(), and various other system
columns, return an error when included in the RETURNING list on a
partitioned table.

ERROR:  cannot retrieve a system column in this context
This issue got a fair airing back in 2020:

AW: posgres 12 bug (partitioned table)
https://www.postgresql.org/message-id/flat/GVAP278MB006939B1D7DFDD650E383FBFEACE0%40GVAP278MB0069.CHEP278.PROD.OUTLOOK.COM#908f2604081699e7f41fa20d352e1b79

I'm using 14.4, and just ran into this behavior today. I'm wondering if
there has been any new work on this subject, or anything to take into
account moving forward?

I'm not a C coder, and do not know the Postgres internals, but here's what
I gleaned from the thread:

* Available system columns depend on the underlying table access method,
and may/will vary across AMs. For example, the columns implemented by heap
are what the docs describe, an FDW could be anything, and Postgres has no
control of what, if any, system column-like attributes they support, and
future and hypothetical AMs may have different sets.

* Rather than return garbage results, or a default of 0, etc., the system
throws the error I ran into.

I'd be happier working with a NULL result than garbage, ambiguous results,
or errors...but an error is the current behavior. Agreed on that, I'd
rather an error than a bad/meaningless result. Postgres' consistent
emphasis on correctness is easily one of its greatest qualities.

In my case, I'm upgrading a lot of existing code to try and capture a more
complete profile of what an UPSERT did. Right now, I grab a count(*) of the
rows and return that. Works fine. A revised snippet looks a bit like this:


...UPSERT code
returning xmax as inserted_transaction_id),

status_data AS (
    select count(*) FILTER (where inserted_transaction_id = 0)  AS insert_count,
           count(*) FILTER (where inserted_transaction_id != 0) AS estimated_update_count,
           pg_current_xact_id_if_assigned()::text               AS transaction_id
      from inserted_rows),

...custom logging code

-- Final output/result.
select insert_count,
       estimated_update_count,
       transaction_id
  from status_data;


This fails on a partitioned table because xmax() may not exist. In fact, it
does exist in all of those tables, but the system doesn't know how to
guarantee that. I know which tables are partitioned, and can downgrade the
result on partitioned tables to the count(*) I've been using to date. But
now I'm wondering if working with xmax() like this is a poor idea going
forward. I don't want to lean on a feature/behavior that's likely to
change. For example, I noticed the other day that MERGE does not support
RETURNING.

I'd appreciate any insight or advice you can offer.


Re: NAMEDATALEN increase because of non-latin languages

2022-07-17 Thread Andres Freund
Hi,

On 2022-07-18 09:46:44 +0700, John Naylor wrote:
> I've made a small step in this direction.

Thanks for working on this!


> 0001 is just boilerplate, same as v1

If we were to go for this, I wonder if we should backpatch the cast-containing
version of GETSTRUCT for less pain when backpatching bugfixes. That'd likely
require using a different name for the cast-containing one.


> 0002 teaches Catalog.pm to export both C attr name and SQL attr name, so we
> can use "sizeof".
> 0003 generates static inline functions that work the same as the current
> GETSTRUCT macro, i.e. just cast to the right pointer and return it.

It seems likely that inline functions are going to be too large for
this. There are a lot of GETSTRUCTs in a lot of files, and emitting a copy
of the function every time doesn't seem great.
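
For context, my understanding is that each generated accessor in 0003 is
just a typed wrapper around the existing macro, roughly along these lines
(illustrative sketch only; the name is assumed from the usual catalog
conventions):

#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/pg_class.h"

static inline Form_pg_class
GETSTRUCT_PG_CLASS(HeapTuple tuple)
{
	/* same cast the GETSTRUCT macro performs today, but per-catalog typed */
	return (Form_pg_class) GETSTRUCT(tuple);
}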


> current offset is the previous offset plus the previous type length, plus
> any alignment padding suitable for the current type (there is none here, so
> the alignment aspect is not tested). I'm hoping something like this will be
> sufficient for what's in the current structs, but of course will need
> additional work when expanding those to include pointers to varlen
> attributes. I've not yet inspected the emitted assembly language, but
> regression tests pass.

Hm. Wouldn't it make sense to just use the normal tuple deforming routines and
then map the results to the structs?

Greetings,

Andres Freund




Re: Windows now has fdatasync()

2022-07-17 Thread Thomas Munro
On Fri, Apr 8, 2022 at 7:56 PM Dave Page  wrote:
> Windows 8 was a pretty unpopular release, so I would expect shifting to 
> 10/2016+ for PG 16 would be unlikely to be a major problem.

Thanks to Michael for making that happen.  That removes the main thing
I didn't know how to deal with in this patch.  Here's a rebase with
some cleanup.

With my garbage collector hat on, I see that all systems we target
have fdatasync(), except:

1.  Windows, but this patch supplies src/port/fdatasync.c.
2.  DragonflyBSD before 6.1.  We have 6.0 in the build farm.
3.  Ancient macOS.  Current releases have it, though we have to cope
with a missing declaration.

From a standards point of view, fdatasync() is Issue 5 POSIX, like
fsync().  Both are optional, but, being a database, we require
fsync(), and they're both covered by the same POSIX option
"Synchronized Input and Output".

My plan now is to commit this patch so that problem #1 is solved, prod
conchuela's owner to upgrade to solve #2, and wait until Tom shuts
down prairiedog to solve #3.  Then we could consider removing the
HAVE_FDATASYNC probe and associated #ifdefs when convenient.  For that
reason, I'm not too bothered about the slight weirdness of defining
HAVE_FDATASYNC on Windows even though that doesn't come from
configure; it'd hopefully be short-lived.  Better ideas welcome,
though.  Does that make sense?
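
For anyone wondering what the Windows shim looks like, here is a rough
sketch of the general shape only -- not the patch itself -- assuming
NtFlushBuffersFileEx() with FLUSH_FLAGS_FILE_DATA_SYNC_ONLY is the
underlying primitive, resolved from ntdll at runtime (error mapping and
initialization differ in the real src/port/fdatasync.c):

#include <windows.h>
#include <winternl.h>
#include <io.h>
#include <errno.h>
#include <string.h>

#ifndef FLUSH_FLAGS_FILE_DATA_SYNC_ONLY
#define FLUSH_FLAGS_FILE_DATA_SYNC_ONLY 0x00000004	/* assumed SDK value */
#endif
#ifndef NT_SUCCESS
#define NT_SUCCESS(s) (((NTSTATUS) (s)) >= 0)
#endif

typedef NTSTATUS (NTAPI *pfn_ntflushbuffersfileex)
			(HANDLE FileHandle, ULONG Flags, PVOID Parameters,
			 ULONG ParametersSize, PIO_STATUS_BLOCK IoStatusBlock);

int
fdatasync(int fd)
{
	static pfn_ntflushbuffersfileex ntflush = NULL;
	HANDLE		handle = (HANDLE) _get_osfhandle(fd);
	IO_STATUS_BLOCK iosb;

	if (handle == INVALID_HANDLE_VALUE)
	{
		errno = EBADF;
		return -1;
	}

	/* Resolve the NT routine once; it exists on Windows 8/2012 and later. */
	if (ntflush == NULL)
		ntflush = (pfn_ntflushbuffersfileex)
			GetProcAddress(GetModuleHandle(TEXT("ntdll.dll")),
						   "NtFlushBuffersFileEx");

	memset(&iosb, 0, sizeof(iosb));
	if (ntflush == NULL ||
		!NT_SUCCESS(ntflush(handle, FLUSH_FLAGS_FILE_DATA_SYNC_ONLY,
							NULL, 0, &iosb)))
	{
		errno = EIO;
		return -1;
	}
	return 0;
}
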
From a985d52d1870ccfb92fda3316158242ce2ba3fe8 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 12 Dec 2021 14:13:35 +1300
Subject: [PATCH v3 1/2] Fix treatment of direct I/O in pg_test_fsync.

pg_test_fsync traditionally enabled O_DIRECT for the open_datasync and
open_sync tests.  In fact current releases of PostgreSQL only do that if
wal_level=minimal (less commonly used).  So, run the test both ways.

Discussion: https://postgr.es/m/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com
---
 src/bin/pg_test_fsync/pg_test_fsync.c | 75 ---
 1 file changed, 69 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_test_fsync/pg_test_fsync.c b/src/bin/pg_test_fsync/pg_test_fsync.c
index f7bc199a30..109ccba819 100644
--- a/src/bin/pg_test_fsync/pg_test_fsync.c
+++ b/src/bin/pg_test_fsync/pg_test_fsync.c
@@ -294,14 +294,45 @@ test_sync(int writes_per_op)
 		printf(_("\nCompare file sync methods using two %dkB writes:\n"), XLOG_BLCKSZ_K);
 	printf(_("(in wal_sync_method preference order, except fdatasync is Linux's default)\n"));
 
+/*
+ * Test open_datasync with O_DIRECT if available
+ */
+	printf(LABEL_FORMAT, "open_datasync (direct)");
+	fflush(stdout);
+
+#if defined(OPEN_DATASYNC_FLAG) && (defined(O_DIRECT) || defined(F_NOCACHE))
+	if ((tmpfile = open_direct(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1)
+	{
+		printf(NA_FORMAT, _("n/a*"));
+		fs_warning = true;
+	}
+	else
+	{
+		START_TIMER;
+		for (ops = 0; alarm_triggered == false; ops++)
+		{
+			for (writes = 0; writes < writes_per_op; writes++)
+if (pg_pwrite(tmpfile,
+			  buf,
+			  XLOG_BLCKSZ,
+			  writes * XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	die("write failed");
+		}
+		STOP_TIMER;
+		close(tmpfile);
+	}
+#else
+	printf(NA_FORMAT, _("n/a"));
+#endif
+
 	/*
-	 * Test open_datasync if available
+	 * Test open_datasync buffered if available
 	 */
-	printf(LABEL_FORMAT, "open_datasync");
+	printf(LABEL_FORMAT, "open_datasync (buffered)");
 	fflush(stdout);
 
 #ifdef OPEN_DATASYNC_FLAG
-	if ((tmpfile = open_direct(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1)
+	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_BINARY, 0)) == -1)
 	{
 		printf(NA_FORMAT, _("n/a*"));
 		fs_warning = true;
@@ -402,12 +433,12 @@ test_sync(int writes_per_op)
 #endif
 
 /*
- * Test open_sync if available
+ * Test open_sync with O_DIRECT if available
  */
-	printf(LABEL_FORMAT, "open_sync");
+	printf(LABEL_FORMAT, "open_sync (direct)");
 	fflush(stdout);
 
-#ifdef OPEN_SYNC_FLAG
+#if defined(OPEN_SYNC_FLAG) && (defined(O_DIRECT) || defined(F_NOCACHE))
 	if ((tmpfile = open_direct(filename, O_RDWR | OPEN_SYNC_FLAG | PG_BINARY, 0)) == -1)
 	{
 		printf(NA_FORMAT, _("n/a*"));
@@ -439,6 +470,38 @@ test_sync(int writes_per_op)
 	printf(NA_FORMAT, _("n/a"));
 #endif
 
+/*
+ * Test open_sync if available
+ */
+	printf(LABEL_FORMAT, "open_sync (buffered)");
+	fflush(stdout);
+
+#ifdef OPEN_SYNC_FLAG
+	if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_BINARY, 0)) == -1)
+	{
+		printf(NA_FORMAT, _("n/a*"));
+		fs_warning = true;
+	}
+	else
+	{
+		START_TIMER;
+		for (ops = 0; alarm_triggered == false; ops++)
+		{
+			for (writes = 0; writes < writes_per_op; writes++)
+if (pg_pwrite(tmpfile,
+			  buf,
+			  XLOG_BLCKSZ,
+			  writes * XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	die("write failed");
+		}
+		STOP_TIMER;
+		close(tmpfile);
+	}
+#else
+	printf(NA_FORMAT, _("n/a"));
+#endif
+
+
 	if (fs_warning)
 	{
 		printf(_("* This file system and its mount options do not support direct\n"
-- 
2.35.1

From 56882f910e58f72987433c72c32fc6eedc1cb

RE: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-17 Thread shiy.f...@fujitsu.com
On Fri, Jul 15, 2022 10:39 PM Masahiko Sawada  wrote:
> 
> This patch should have the fix for the issue that Shi yu reported. Shi
> yu, could you please test it again with this patch?
> 

Thanks for updating the patch!
I have tested and confirmed that the problem I found has been fixed.

Regards,
Shi yu


Re: Windows now has fdatasync()

2022-07-17 Thread Tom Lane
Thomas Munro  writes:
> With my garbage collector hat on, I see that all systems we target
> have fdatasync(), except:

> 1.  Windows, but this patch supplies src/port/fdatasync.c.
> 2.  DragonflyBSD before 6.1.  We have 6.0 in the build farm.
> 3.  Ancient macOS.  Current releases have it, though we have to cope
> with a missing declaration.

Hmmm ... according to [1], while current macOS has an undocumented
fdatasync function, it doesn't seem to do anything as useful as,
say, sync data to disk.  I'm not sure what direction you're headed
in here, but it probably shouldn't include assuming that fdatasync
is actually useful on macOS.  But maybe that's not your point?

> My plan now is to commit this patch so that problem #1 is solved, prod
> conchuela's owner to upgrade to solve #2, and wait until Tom shuts
> down prairiedog to solve #3.

You could force my hand by pushing something that requires this ;-).
I'm not feeling particularly wedded to prairiedog per se.  As with
my once-and-future HPPA machine, I'd prefer to wait until NetBSD 10
is a thing before spinning up an official buildfarm animal, but
I suppose that that's not far away.

regards, tom lane

[1] https://www.postgresql.org/message-id/1673109.1610733352%40sss.pgh.pa.us




Re: [BUG] Logical replication failure "ERROR: could not map filenode "base/13237/442428" to relation OID" with catalog modifying txns

2022-07-17 Thread Amit Kapila
On Fri, Jul 15, 2022 at 8:09 PM Masahiko Sawada  wrote:
>
> This patch should have the fix for the issue that Shi yu reported. Shi
> yu, could you please test it again with this patch?
>

Can you explain the cause of the failure and your fix for the same?

-- 
With Regards,
Amit Kapila.




Re: Windows now has fdatasync()

2022-07-17 Thread Thomas Munro
On Mon, Jul 18, 2022 at 3:43 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > With my garbage collector hat on, I see that all systems we target
> > have fdatasync(), except:
>
> > 1.  Windows, but this patch supplies src/port/fdatasync.c.
> > 2.  DragonflyBSD before 6.1.  We have 6.0 in the build farm.
> > 3.  Ancient macOS.  Current releases have it, though we have to cope
> > with a missing declaration.
>
> Hmmm ... according to [1], while current macOS has an undocumented
> fdatasync function, it doesn't seem to do anything as useful as,
> say, sync data to disk.  I'm not sure what direction you're headed
> in here, but it probably shouldn't include assuming that fdatasync
> is actually useful on macOS.  But maybe that's not your point?

Oh, I'm not planning to change the default choice on macOS (or
Windows).  I *am* assuming we're not going to take it away as an
option, that cat being out of the bag ever since configure found
Apple's secret fdatasync (note that O_DSYNC, our default, is also
undocumented and also known not to flush caches, but at least it's
present in an Apple header!).  I was just noting an upcoming
opportunity to remove the configure/meson probes for fdatasync, which
made me feel better about the slightly kludgy way this patch is
defining HAVE_FDATASYNC explicitly on Windows.

> > My plan now is to commit this patch so that problem #1 is solved, prod
> > conchuela's owner to upgrade to solve #2, and wait until Tom shuts
> > down prairiedog to solve #3.
>
> You could force my hand by pushing something that requires this ;-).

Heh.  Let me ask about the DragonFlyBSD thing first.




Re: Windows now has fdatasync()

2022-07-17 Thread Tom Lane
Thomas Munro  writes:
> ... I was just noting an upcoming
> opportunity to remove the configure/meson probes for fdatasync, which
> made me feel better about the slightly kludgy way this patch is
> defining HAVE_FDATASYNC explicitly on Windows.

Hm.  There is certainly not any harm in the meson infrastructure
skipping that test, because prairiedog is not able to run meson
anyway.  Can we do that and still leave it in place on the autoconf
side?  Maybe not, because I suppose you want to remove #ifdefs in
the code itself.

I see that fdatasync goes back as far as SUS v2, which we've long
taken as our minimum POSIX infrastructure.  So there's not a lot
of room to insist that we should support allegedly-Unix platforms
without it.

regards, tom lane




Re: Handle infinite recursion in logical replication setup

2022-07-17 Thread Amit Kapila
On Sat, Jul 16, 2022 at 10:29 AM Dilip Kumar  wrote:
>
> I think giving two options would be really confusing from the
> usability perspective.  I think what we should be doing here is to
> keep these three names 'none', 'any' and 'local' as reserved names for
> the origin name so that those are not allowed to be set by the user
> and they have some internal meaning.
>

This makes sense to me. I think we can avoid reserving 'local' for now,
till we agree on its use case and implementation. One similar point
about slots is that we treat a 'none' slot_name in subscription commands
as a special value indicating no slot name, whereas we do allow
creating a slot with the name 'none' with
pg_create_logical_replication_slot(). So, if we want to follow a
similar convention here, we may not need to add any restriction for
origin names. But personally, I think it is better to add such a
restriction to avoid confusion, and in fact, as a separate patch we
should even disallow creating a slot named 'none'.

-- 
With Regards,
Amit Kapila.




Re: Costing elided SubqueryScans more nearly correctly

2022-07-17 Thread Richard Guo
On Mon, Jul 18, 2022 at 3:16 AM Tom Lane  wrote:

> I wrote:
> > I also notice that setrefs.c can elide Append and MergeAppend nodes
> > too in some cases, but AFAICS costing of those node types doesn't
> > take that into account.
>
> I took a closer look at this point and realized that in fact,
> create_append_path and create_merge_append_path do attempt to account
> for this.  But they get it wrong!  Somebody changed the rules in
> setrefs.c to account for parallelism, and did not update the costing
> side of things.
>
> The attached v2 is the same as v1 plus adding a fix for this point.
> No regression test results change from that AFAICT.


The new fix looks good to me. It seems setrefs.c added new logic to check
parallel_aware when removing single-child Appends/MergeAppends in
f9a74c14, but neglected to update the related costing logic. And I can
see this patch fixes the costing for that.

BTW, not related to this patch, but the new lines for the parallel_aware
check in setrefs.c are very wide. How about wrapping them to keep
consistent with the surrounding code?

Thanks
Richard


Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher

2022-07-17 Thread Amit Kapila
On Tue, Jul 12, 2022 at 7:07 PM Önder Kalacı  wrote:
>
> Hi hackers,
>
>
> It is often not feasible to use `REPLICA IDENTITY FULL` on the publication,
> because it leads to full table scan per tuple change on the subscription.
> This makes `REPLICA IDENTITY FULL` impracticable -- probably other than
> some small number of use cases.
>

IIUC, this proposal is to optimize cases where users can't have a
unique/primary key for a relation on the subscriber and those
relations receive lots of updates or deletes?

> With this patch, I'm proposing the following change: If there is an index on
> the subscriber, use the index as long as the planner sub-modules pick any
> index over sequential scan.
>
> Majority of the logic on the subscriber side has already existed in the
> code. The subscriber is already capable of doing (unique) index scans. With
> this patch, we are allowing the index to iterate over the tuples fetched and
> only act when tuples are equal. The ones familiar with this part of the code
> could realize that the sequential scan code on the subscriber already
> implements the `tuples_equal()` function. In short, the changes on the
> subscriber are mostly combining parts of (unique) index scan and sequential
> scan codes.
>
> The decision on whether to use an index (or which index) is mostly derived
> from planner infrastructure. The idea is that on the subscriber we have all
> the columns. So, construct all the `Path`s with the restrictions on all
> columns, such as `col_1 = $1 AND col_2 = $2 ... AND col_n = $N`. Finally,
> let the planner sub-module -- `create_index_paths()` -- give us the relevant
> index `Path`s. On top of that, add the sequential scan `Path` as well.
> Finally, pick the cheapest `Path` among them.
>
> From the performance point of view, there are a few things to note. First,
> the patch aims not to change the behavior when PRIMARY KEY or UNIQUE INDEX
> is used. Second, when REPLICA IDENTITY IS FULL on the publisher and an index
> is used on the subscriber, the difference mostly comes down to `index scan`
> vs `sequential scan`. That's why it is hard to claim a certain number of
> improvements. It mostly depends on the data size, index and the data
> distribution.
>

It seems that in favorable cases it will improve performance, but we
should consider unfavorable cases as well. Two things that come to
mind in that regard are (a) while choosing index/seq. scan paths, the
patch doesn't account for the cost of tuples_equal(), which needs to be
performed for index scans, and (b) it appears to me that the patch decides
which index to use the first time it opens the rel (or when the rel gets
invalidated) on the subscriber and then uses the same index for all
subsequent operations. It is quite possible that after some more
operations on the table, using the same index will actually be
costlier than a sequential scan or some other index scan.

-- 
With Regards,
Amit Kapila.