Re: pg_rewind: Skip log directory for file type check like pg_wal
On Tue, 7 Mar 2023 at 08:52, Soumyadeep Chakraborty < soumyadeep2...@gmail.com> wrote: > > > It couldn't be achieved just by introducing a static string "log". The > "log_directory" GUC must be examined on both, source and target. > > Trouble with doing that is if pg_rewind is run in non-libpq (offline) > mode. Then we would have to parse it out of the conf file(s)? > Is there a standard way of doing that? > pg_rewind is already doing something similar for "restore_command": /* * Get value of GUC parameter restore_command from the target cluster. * * This uses a logic based on "postgres -C" to get the value from the * cluster. */ static void getRestoreCommand(const char *argv0) For the running source cluster one could just use "SHOW log_directory" Regards, -- Alexander Kukushkin
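For the online case, reading the setting from the running source cluster is straightforward, for example:

    SHOW log_directory;
    -- or equivalently:
    SELECT setting FROM pg_settings WHERE name = 'log_directory';

For the offline target, the same "postgres -C"-style logic already used for restore_command would presumably have to be reused to resolve log_directory from the configuration files.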
Re: using memoize in in paralel query decreases performance
po 6. 3. 2023 v 22:52 odesílatel David Rowley napsal: > On Mon, 6 Mar 2023 at 21:55, Pavel Stehule > wrote: > > default https://explain.depesz.com/s/fnBe > > It looks like the slowness is coming from the Bitmap Index scan and > Bitmap heap scan rather than Memoize. > >-> Bitmap Heap Scan on public.item i (cost=285.69..41952.12 > rows=29021 width=16) (actual time=20.395..591.606 rows=20471 > loops=784) > Output: i.id, i.item_category_id > Recheck Cond: (i.item_category_id = ictc.sub_category_id) > Heap Blocks: exact=1590348 > Worker 0: actual time=20.128..591.426 rows=20471 loops=112 > Worker 1: actual time=20.243..591.627 rows=20471 loops=112 > Worker 2: actual time=20.318..591.660 rows=20471 loops=112 > Worker 3: actual time=21.180..591.644 rows=20471 loops=112 > Worker 4: actual time=20.226..591.357 rows=20471 loops=112 > Worker 5: actual time=20.597..591.418 rows=20471 loops=112 > -> Bitmap Index Scan on ixfk_ite_itemcategoryid > (cost=0.00..278.43 rows=29021 width=0) (actual time=14.851..14.851 > rows=25362 loops=784) >Index Cond: (i.item_category_id = ictc.sub_category_id) >Worker 0: actual time=14.863..14.863 rows=25362 loops=112 >Worker 1: actual time=14.854..14.854 rows=25362 loops=112 >Worker 2: actual time=14.611..14.611 rows=25362 loops=112 >Worker 3: actual time=15.245..15.245 rows=25362 loops=112 >Worker 4: actual time=14.909..14.909 rows=25362 loops=112 >Worker 5: actual time=14.841..14.841 rows=25362 loops=112 > > > disabled memoize https://explain.depesz.com/s/P2rP > > -> Bitmap Heap Scan on public.item i (cost=285.69..41952.12 > rows=29021 width=16) (actual time=9.256..57.503 rows=20471 loops=784) >Output: i.id, i.item_category_id >Recheck Cond: (i.item_category_id = ictc.sub_category_id) >Heap Blocks: exact=1590349 >Worker 0: actual time=9.422..58.420 rows=20471 loops=112 >Worker 1: actual time=9.449..57.539 rows=20471 loops=112 >Worker 2: actual time=9.751..58.129 rows=20471 loops=112 >Worker 3: actual time=9.620..57.484 rows=20471 loops=112 >Worker 4: actual time=8.940..57.911 rows=20471 loops=112 >Worker 5: actual time=9.454..57.488 rows=20471 loops=112 >-> Bitmap Index Scan on ixfk_ite_itemcategoryid > (cost=0.00..278.43 rows=29021 width=0) (actual time=4.581..4.581 > rows=25363 loops=784) > Index Cond: (i.item_category_id = ictc.sub_category_id) > Worker 0: actual time=4.846..4.846 rows=25363 loops=112 > Worker 1: actual time=4.734..4.734 rows=25363 loops=112 > Worker 2: actual time=4.803..4.803 rows=25363 loops=112 > Worker 3: actual time=4.959..4.959 rows=25363 loops=112 > Worker 4: actual time=4.402..4.402 rows=25363 loops=112 > Worker 5: actual time=4.778..4.778 rows=25363 loops=112 > > I wonder if the additional work_mem required for Memoize is just doing > something like causing kernel page cache evictions and leading to > fewer buffers for ixfk_ite_itemcategoryid and the item table being > cached in the kernel page cache. > > Maybe you could get an idea of that if you SET track_io_timing = on; > and EXPLAIN (ANALYZE, BUFFERS) both queries. > https://explain.depesz.com/s/vhk0 https://explain.depesz.com/s/R5ju Regards Pavel > David >
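For reference, a minimal sketch of the measurement being suggested above; the final query is only a stand-in for the actual query under test:

    SET track_io_timing = on;
    SET enable_memoize = on;          -- repeat the run with this set to off for comparison
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM pg_class;    -- placeholder for the real query

With track_io_timing enabled, the BUFFERS output includes an "I/O Timings" line, which is what the per-read analysis further down the thread relies on.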
Re: Raising the SCRAM iteration count
> On 7 Mar 2023, at 05:53, Michael Paquier wrote: > > On Fri, Mar 03, 2023 at 11:13:36PM +0100, Daniel Gustafsson wrote: >> That would indeed be nice, but is there a way to do this without a >> complicated >> pump TAP expression? I was unable to think of a way but I might be missing >> something? > > A SET command refreshes immediately the cache information of the > connection in pqSaveParameterStatus()@libpq, so a test in password.sql > with \password would be enough to check the computation happens in > pg_fe_scram_build_secret() with the correct iteration number. Say > like: > =# SET scram_iterations = 234; > SET > =# \password > Enter new password for user "postgres": TYPEME > Enter it again: TYPEME > =# select substr(rolpassword, 1, 18) from pg_authid > where oid::regrole::name = current_role; > substr > > SCRAM-SHA-256$234: > (1 row) > > Or perhaps I am missing something? Right, what I meant was: can a pg_regress sql/expected test drive a psql interactive prompt? Your comments suggested using password.sql so I was curious if I was missing a neat trick for doing this. -- Daniel Gustafsson
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
> > FWIW, Greenplum has a similar construct (but which also logs the errors > in the > db) where data type errors are skipped as long as the number of errors > don't > exceed a reject limit. If the reject limit is reached then the COPY > fails: > > > > LOG ERRORS [ SEGMENT REJECT LIMIT [ ROWS | PERCENT ]] > > > IIRC the gist of this was to catch then the user copies the wrong input > data or > plain has a broken file. Rather than finding out after copying n rows > which > are likely to be garbage the process can be restarted. > I think this is a matter for discussion. The same question is: "Where to log errors to separate files or to the system logfile?". IMO it's better for users to log short-detailed error message to system logfile and not output errors to the terminal. This version of the patch has a compiler error in the error message: > Yes, corrected it. Changed "ignored_errors" to int64 because "processed" (used for counting copy rows) is int64. I felt just logging "Error: %ld" would make people wonder the meaning of > the %ld. Logging something like ""Error: %ld data type errors were > found" might be clearer. > Thanks. For more clearance change the message to: "Errors were found: %". Regards, Damir Belyalov Postgres Professional diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml index c25b52d0cb..706b929947 100644 --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -34,6 +34,7 @@ COPY { table_name [ ( format_name FREEZE [ boolean ] +IGNORE_DATATYPE_ERRORS [ boolean ] DELIMITER 'delimiter_character' NULL 'null_string' HEADER [ boolean | MATCH ] @@ -233,6 +234,17 @@ COPY { table_name [ ( + +IGNORE_DATATYPE_ERRORS + + + Drops rows that contain malformed data while copying. These are rows + with columns where the data type's input-function raises an error. + Outputs warnings about rows with incorrect data to system logfile. 
+ + + + DELIMITER diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index e34f583ea7..0334894014 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -410,6 +410,7 @@ ProcessCopyOptions(ParseState *pstate, bool format_specified = false; bool freeze_specified = false; bool header_specified = false; + bool ignore_datatype_errors_specified= false; ListCell *option; /* Support external use for option sanity checking */ @@ -449,6 +450,13 @@ ProcessCopyOptions(ParseState *pstate, freeze_specified = true; opts_out->freeze = defGetBoolean(defel); } + else if (strcmp(defel->defname, "ignore_datatype_errors") == 0) + { + if (ignore_datatype_errors_specified) +errorConflictingDefElem(defel, pstate); + ignore_datatype_errors_specified = true; + opts_out->ignore_datatype_errors = defGetBoolean(defel); + } else if (strcmp(defel->defname, "delimiter") == 0) { if (opts_out->delim) diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c index 29cd1cf4a6..facfc44def 100644 --- a/src/backend/commands/copyfrom.c +++ b/src/backend/commands/copyfrom.c @@ -949,10 +949,14 @@ CopyFrom(CopyFromState cstate) errcallback.previous = error_context_stack; error_context_stack = &errcallback; + if (cstate->opts.ignore_datatype_errors) + cstate->ignored_errors = 0; + for (;;) { TupleTableSlot *myslot; bool skip_tuple; + ErrorSaveContext escontext = {T_ErrorSaveContext}; CHECK_FOR_INTERRUPTS(); @@ -985,9 +989,26 @@ CopyFrom(CopyFromState cstate) ExecClearTuple(myslot); + if (cstate->opts.ignore_datatype_errors) + { + escontext.details_wanted = true; + cstate->escontext = escontext; + } + /* Directly store the values/nulls array in the slot */ if (!NextCopyFrom(cstate, econtext, myslot->tts_values, myslot->tts_isnull)) + { + if (cstate->opts.ignore_datatype_errors && cstate->ignored_errors > 0) +ereport(WARNING, errmsg("Errors were found: %lld", (long long) cstate->ignored_errors)); break; + } + + /* Soft error occured, skip this tuple */ + if (cstate->escontext.error_occurred) + { + ExecClearTuple(myslot); + continue; + } ExecStoreVirtualTuple(myslot); diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c index 91b564c2bc..9c36b0dc8b 100644 --- a/src/backend/commands/copyfromparse.c +++ b/src/backend/commands/copyfromparse.c @@ -70,6 +70,7 @@ #include "libpq/pqformat.h" #include "mb/pg_wchar.h" #include "miscadmin.h" +#include "nodes/miscnodes.h" #include "pgstat.h" #include "port/pg_bswap.h" #include "utils/builtins.h" @@ -938,10 +939,23 @@ NextCopyFrom(CopyFromState cstate, ExprContext *econtext, cstate->cur_attname = NameStr(att->attname); cstate->cur_attval = string; - values[m] = InputFunctionCall(&in_functions[m], - string, - typioparams[m], - att->atttypmod); + + /* If IGNORE_DATATYPE_ERRORS i
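For context, a minimal usage sketch of the option this patch proposes, following the syntax in the copy.sgml hunk above; the table and file names are made up, and the exact WARNING wording is still under discussion downthread:

    CREATE TABLE copy_demo (a int, b date);
    -- Rows whose values make a data type input function fail are skipped
    -- instead of aborting the whole COPY; a WARNING reports how many.
    COPY copy_demo FROM '/tmp/copy_demo.csv'
        WITH (FORMAT csv, IGNORE_DATATYPE_ERRORS true);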
Re: using memoize in in paralel query decreases performance
/On Tue, 7 Mar 2023 at 21:09, Pavel Stehule wrote: > > po 6. 3. 2023 v 22:52 odesílatel David Rowley napsal: >> I wonder if the additional work_mem required for Memoize is just doing >> something like causing kernel page cache evictions and leading to >> fewer buffers for ixfk_ite_itemcategoryid and the item table being >> cached in the kernel page cache. >> >> Maybe you could get an idea of that if you SET track_io_timing = on; >> and EXPLAIN (ANALYZE, BUFFERS) both queries. > > > https://explain.depesz.com/s/vhk0 This is the enable_memoize=on one. The I/O looks like: Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863 I/O Timings: shared/local read=2671836.341 write=1286.869 2671836.341 / 15274264 = ~0.175 ms per read. > https://explain.depesz.com/s/R5ju This is the faster enable_memoize = off one. The I/O looks like: Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625 I/O Timings: shared/local read=1554838.583 write=821.477 1554838.583 / 18541899 = ~0.084 ms per read. That indicates that the enable_memoize=off version is just finding more pages in the kernel's page cache than the slower query. The slower query just appears to be under more memory pressure causing the kernel to have less free memory to cache useful pages. I don't see anything here that indicates any problems with Memoize. Sure the statistics could be better as, ideally, the Memoize wouldn't have happened for the i_2 relation. I don't see anything that indicates any bugs with this, however. It's pretty well known that Memoize puts quite a bit of faith into ndistinct estimates. If it causes too many issues the enable_memoize switch can be turned to off. You might want to consider experimenting with smaller values of work_mem and/or hash_mem_multiplier for this query or just disabling memoize altogether. David
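As an aside, a similar per-read average can be pulled from the cumulative statistics once track_io_timing is on; a rough sketch (blk_read_time is reported in milliseconds):

    SELECT blk_read_time / NULLIF(blks_read, 0) AS avg_ms_per_read
    FROM pg_stat_database
    WHERE datname = current_database();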
Re: [PATCH] Support % wildcard in extension upgrade filenames
On Mon, Mar 06, 2023 at 02:54:35PM -0500, Gregory Stark (as CFM) wrote: > This patch too is conflicting on meson.build. I'm attaching a rebased version to this email. > Maybe it can just be solved with multiple one-line scripts > that call to a master script? Not really, as the problem we are trying to solve is the need to install many files, and the need to foresee any possible future bug-fix release we might want to support upgrades from. PostGIS is already installing zero-line scripts to upgrade from to a virtual "ANY" version which we then use to have a single ANY-- upgrade path, but we are still REQUIRED to install a file for any we want to allow upgrade from, which is what this patch is aimed at fixing. --strk; >From ffefd039f24e3def8d2216b3ac7875d0b6feb8fb Mon Sep 17 00:00:00 2001 From: Sandro Santilli Date: Wed, 14 Sep 2022 11:10:10 +0200 Subject: [PATCH v1] Allow wildcard (%) in extension upgrade paths A wildcard character "%" will be accepted in the "source" side of the upgrade script and be considered usable to upgrade any version to the "target" side. Includes regression test and documentation. --- doc/src/sgml/extend.sgml | 8 src/backend/commands/extension.c | 42 --- src/test/modules/test_extensions/Makefile | 6 ++- .../expected/test_extensions.out | 15 +++ src/test/modules/test_extensions/meson.build | 3 ++ .../test_extensions/sql/test_extensions.sql | 7 .../test_ext_wildcard1--%--2.0.sql| 6 +++ .../test_ext_wildcard1--1.0.sql | 6 +++ .../test_ext_wildcard1.control| 3 ++ 9 files changed, 88 insertions(+), 8 deletions(-) create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--%--2.0.sql create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1--1.0.sql create mode 100644 src/test/modules/test_extensions/test_ext_wildcard1.control diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml index b70cbe83ae..f1f0ae1244 100644 --- a/doc/src/sgml/extend.sgml +++ b/doc/src/sgml/extend.sgml @@ -1081,6 +1081,14 @@ SELECT pg_catalog.pg_extension_config_dump('my_config', 'WHERE NOT standard_entr 1.1). + + The literal value % can be used as the + old_version component in an extension + update script for it to match any version. Such wildcard update + scripts will only be used when no explicit path is found from + old to target version. + + Given that a suitable update script is available, the command ALTER EXTENSION UPDATE will update an installed extension diff --git a/src/backend/commands/extension.c b/src/backend/commands/extension.c index 02ff4a9a7f..6df0fd403a 100644 --- a/src/backend/commands/extension.c +++ b/src/backend/commands/extension.c @@ -130,6 +130,7 @@ static void ApplyExtensionUpdates(Oid extensionOid, bool cascade, bool is_create); static char *read_whole_file(const char *filename, int *length); +static bool file_exists(const char *name); /* @@ -893,7 +894,14 @@ execute_extension_script(Oid extensionOid, ExtensionControlFile *control, if (from_version == NULL) elog(DEBUG1, "executing extension script for \"%s\" version '%s'", control->name, version); else + { + if ( ! 
file_exists(filename) ) + { + /* if filename does not exist, try wildcard */ + filename = get_extension_script_filename(control, "%", version); + } elog(DEBUG1, "executing extension script for \"%s\" update from version '%s' to '%s'", control->name, from_version, version); + } /* * If installing a trusted extension on behalf of a non-superuser, become @@ -1217,14 +1225,19 @@ identify_update_path(ExtensionControlFile *control, /* Find shortest path */ result = find_update_path(evi_list, evi_start, evi_target, false, false); + if (result != NIL) + return result; - if (result == NIL) - ereport(ERROR, -(errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"", - control->name, oldVersion, newVersion))); + /* Find wildcard path, if no explicit path was found */ + evi_start = get_ext_ver_info("%", &evi_list); + result = find_update_path(evi_list, evi_start, evi_target, false, false); + if (result != NIL) + return result; - return result; + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("extension \"%s\" has no update path from version \"%s\" to version \"%s\"", + control->name, oldVersion, newVersion))); } /* @@ -3395,3 +3408,20 @@ read_whole_file(const char *filename, int *length) buf[*length] = '\0'; return buf; } + +static bool +file_exists(const char *name) +{ + struct stat st; + + Assert(name != NULL); + + if (stat(name, &st) == 0) + return !S_ISDIR(st.st_mode); + else if (!(errno == ENOENT || errno == ENOTDIR || errno == EACCES)) + ereport(ERROR, +(errcode_for_file_access(), + errm
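Based on the regression test files the patch adds (test_ext_wildcard1--1.0.sql and test_ext_wildcard1--%--2.0.sql), the behaviour it enables looks roughly like this:

    CREATE EXTENSION test_ext_wildcard1 VERSION '1.0';
    -- No test_ext_wildcard1--1.0--2.0.sql script is installed, so the wildcard
    -- script test_ext_wildcard1--%--2.0.sql is used as the upgrade path.
    ALTER EXTENSION test_ext_wildcard1 UPDATE TO '2.0';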
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 1:19 PM Peter Smith wrote: > > Let me give an example to demonstrate why I thought something is fishy here: > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > Imagine the same rel has a PRIMARY KEY with Oid=. > > --- > > +/* > + * Get replica identity index or if it is not defined a primary key. > + * > + * If neither is defined, returns InvalidOid > + */ > +Oid > +GetRelationIdentityOrPK(Relation rel) > +{ > + Oid idxoid; > + > + idxoid = RelationGetReplicaIndex(rel); > + > + if (!OidIsValid(idxoid)) > + idxoid = RelationGetPrimaryKeyIndex(rel); > + > + return idxoid; > +} > + > +/* > + * Given a relation and OID of an index, returns true if the > + * index is relation's replica identity index or relation's > + * primary key's index. > + * > + * Returns false otherwise. > + */ > +bool > +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid) > +{ > + Assert(OidIsValid(idxoid)); > + > + return GetRelationIdentityOrPK(rel) == idxoid; > +} > + > > > So, according to the above function comment/name, I will expect > calling IdxIsRelationIdentityOrPK passing Oid= or Oid- will > both return true, right? > > But AFAICT > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > returns (the valid oid of the RI) --> == --> true; > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > returns (the valid oid of the RI) --> == --> false; > > ~ > > Now two people are telling me this is OK, but I've been staring at it > for too long and I just don't see how it can be. (??) > The difference is that you are misunderstanding the intent of this function. GetRelationIdentityOrPK() returns a "replica identity index oid" if the same is defined, else return PK, if that is defined, otherwise, return invalidOid. This is what is expected by its callers. Now, one can argue to have a different function name and that may be a valid argument but as far as I can see the function does what is expected from it. -- With Regards, Amit Kapila.
Re: Add pg_walinspect function with block info columns
At Tue, 7 Mar 2023 16:18:21 +0900, Michael Paquier wrote in > On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote: > > Ah. Yes, that expansion sounds sensible. > > Okay, so, based on this idea, I have hacked on this stuff and finish > with the attached that shows block data if it exists, as well as FPI > stuff if any. bimg_info is showed as a text[] for its flags. # The naming convention looks inconsistent between # pg_get_wal_records_info and pg_get_wal_block_info but it's not an # issue of this patch.. > I guess that I'd better add a test that shows correctly a record with > some block data attached to it, on top of the existing one for FPIs.. > Any suggestions? Perhaps just a heap/heap2 record? > > Thoughts? I thought that we needed a test for block data when I saw the patch. I don't have a great idea, but a single insert should work. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: using memoize in in paralel query decreases performance
út 7. 3. 2023 v 9:58 odesílatel David Rowley napsal: > /On Tue, 7 Mar 2023 at 21:09, Pavel Stehule > wrote: > > > > po 6. 3. 2023 v 22:52 odesílatel David Rowley > napsal: > >> I wonder if the additional work_mem required for Memoize is just doing > >> something like causing kernel page cache evictions and leading to > >> fewer buffers for ixfk_ite_itemcategoryid and the item table being > >> cached in the kernel page cache. > >> > >> Maybe you could get an idea of that if you SET track_io_timing = on; > >> and EXPLAIN (ANALYZE, BUFFERS) both queries. > > > > > > https://explain.depesz.com/s/vhk0 > > This is the enable_memoize=on one. The I/O looks like: > > Buffers: shared hit=105661309 read=15274264 dirtied=15707 written=34863 > I/O Timings: shared/local read=2671836.341 write=1286.869 > > 2671836.341 / 15274264 = ~0.175 ms per read. > > > https://explain.depesz.com/s/R5ju > > This is the faster enable_memoize = off one. The I/O looks like: > > Buffers: shared hit=44542473 read=18541899 dirtied=11988 written=18625 > I/O Timings: shared/local read=1554838.583 write=821.477 > > 1554838.583 / 18541899 = ~0.084 ms per read. > > That indicates that the enable_memoize=off version is just finding > more pages in the kernel's page cache than the slower query. The > slower query just appears to be under more memory pressure causing the > kernel to have less free memory to cache useful pages. I don't see > anything here that indicates any problems with Memoize. Sure the > statistics could be better as, ideally, the Memoize wouldn't have > happened for the i_2 relation. I don't see anything that indicates any > bugs with this, however. It's pretty well known that Memoize puts > quite a bit of faith into ndistinct estimates. If it causes too many > issues the enable_memoize switch can be turned to off. > > You might want to consider experimenting with smaller values of > work_mem and/or hash_mem_multiplier for this query or just disabling > memoize altogether. > I can live with it. This is an analytical query and the performance is not too important for us. I was surprised that the performance was about 25% worse, and so the hit ratio was almost zero. I am thinking, but I am not sure if the estimation of the effectiveness of memoization can depend (or should depend) on the number of workers? In this case the number of workers is high. > David >
Re: POC PATCH: copy from ... exceptions to: (was Re: VLDB Features)
> On 7 Mar 2023, at 09:35, Damir Belyalov wrote: > I felt just logging "Error: %ld" would make people wonder the meaning of > the %ld. Logging something like ""Error: %ld data type errors were > found" might be clearer. > > Thanks. For more clearance change the message to: "Errors were found: %". I'm not convinced that this adds enough clarity to assist the user. We also shouldn't use "error" in a WARNING log since the user has explicitly asked to skip rows on error, so it's not an error per se. How about something like: ereport(WARNING, (errmsg("%ld rows were skipped due to data type incompatibility", cstate->ignored_errors), errhint("Skipped rows can be inspected in the database log for reprocessing."))); -- Daniel Gustafsson
Re: pg_rewind: Skip log directory for file type check like pg_wal
> On 7 Mar 2023, at 08:33, Alexander Kukushkin wrote: > The "log_directory" GUC must be examined on both, source and target. Agreed, log_directory must be resolved to the configured values. Teaching pg_rewind about those in case they are stored in $PGDATA sounds like a good idea though. -- Daniel Gustafsson
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 8:01 PM Amit Kapila wrote: > > On Tue, Mar 7, 2023 at 1:19 PM Peter Smith wrote: > > > > Let me give an example to demonstrate why I thought something is fishy here: > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > --- > > > > +/* > > + * Get replica identity index or if it is not defined a primary key. > > + * > > + * If neither is defined, returns InvalidOid > > + */ > > +Oid > > +GetRelationIdentityOrPK(Relation rel) > > +{ > > + Oid idxoid; > > + > > + idxoid = RelationGetReplicaIndex(rel); > > + > > + if (!OidIsValid(idxoid)) > > + idxoid = RelationGetPrimaryKeyIndex(rel); > > + > > + return idxoid; > > +} > > + > > +/* > > + * Given a relation and OID of an index, returns true if the > > + * index is relation's replica identity index or relation's > > + * primary key's index. > > + * > > + * Returns false otherwise. > > + */ > > +bool > > +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid) > > +{ > > + Assert(OidIsValid(idxoid)); > > + > > + return GetRelationIdentityOrPK(rel) == idxoid; > > +} > > + > > > > > > So, according to the above function comment/name, I will expect > > calling IdxIsRelationIdentityOrPK passing Oid= or Oid- will > > both return true, right? > > > > But AFAICT > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > returns (the valid oid of the RI) --> == --> true; > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > returns (the valid oid of the RI) --> == --> false; > > > > ~ > > > > Now two people are telling me this is OK, but I've been staring at it > > for too long and I just don't see how it can be. (??) > > > > The difference is that you are misunderstanding the intent of this > function. GetRelationIdentityOrPK() returns a "replica identity index > oid" if the same is defined, else return PK, if that is defined, > otherwise, return invalidOid. This is what is expected by its callers. > Now, one can argue to have a different function name and that may be a > valid argument but as far as I can see the function does what is > expected from it. > Sure, but I am questioning the function IdxIsRelationIdentityOrPK, not GetRelationIdentityOrPK. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: using memoize in in paralel query decreases performance
On Tue, 7 Mar 2023 at 22:09, Pavel Stehule wrote: > I can live with it. This is an analytical query and the performance is not > too important for us. I was surprised that the performance was about 25% > worse, and so the hit ratio was almost zero. I am thinking, but I am not sure > if the estimation of the effectiveness of memoization can depend (or should > depend) on the number of workers? In this case the number of workers is high. The costing for Memoize takes the number of workers into account by way of the change in expected input rows. The number of estimated input rows is effectively just divided by the number of parallel workers, so if we expect 1 million rows from the outer side of the join and 4 workers, then we'll assume the memorize will deal with 250,000 rows per worker. If the n_distinct estimate for the cache key is 500,000, then it's not going to look very attractive to Memoize that. In reality, estimate_num_groups() won't say the number of groups is higher than the input rows, but Memoize, with all the other overheads factored into the costs, it would never look favourable if the planner thought there was never going to be any repeated values. The expected cache hit ratio there would be zero. David
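To see what ndistinct estimate the planner is working from for the cache key in this plan, something like the following can help; the table and column names are taken from the plans upthread:

    SELECT n_distinct
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'item'
      AND attname = 'item_category_id';
    -- Negative values are a fraction of the row count. If the estimate is badly
    -- off, it can be overridden with
    -- ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...).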
Re: using memoize in in paralel query decreases performance
út 7. 3. 2023 v 10:46 odesílatel David Rowley napsal: > On Tue, 7 Mar 2023 at 22:09, Pavel Stehule > wrote: > > I can live with it. This is an analytical query and the performance is > not too important for us. I was surprised that the performance was about > 25% worse, and so the hit ratio was almost zero. I am thinking, but I am > not sure if the estimation of the effectiveness of memoization can depend > (or should depend) on the number of workers? In this case the number of > workers is high. > > The costing for Memoize takes the number of workers into account by > way of the change in expected input rows. The number of estimated > input rows is effectively just divided by the number of parallel > workers, so if we expect 1 million rows from the outer side of the > join and 4 workers, then we'll assume the memorize will deal with > 250,000 rows per worker. If the n_distinct estimate for the cache key > is 500,000, then it's not going to look very attractive to Memoize > that. In reality, estimate_num_groups() won't say the number of > groups is higher than the input rows, but Memoize, with all the other > overheads factored into the costs, it would never look favourable if > the planner thought there was never going to be any repeated values. > The expected cache hit ratio there would be zero. > Thanks for the explanation. Pavel > David >
Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
On Sun, 5 Mar 2023 at 13:21, Lukas Fittl wrote: > Alternatively (or in addition) we could consider showing the "ndistinct" > value that is calculated in cost_memoize_rescan - since that's the most > significant contributor to the cache hit ratio (and you can influence that > directly by improving the ndistinct statistics). I think the ndistinct estimate plus the est_entries together would be useful. I think showing just the hit ratio number might often just raise too many questions about how that's calculated. To calculate the hit ratio we need to estimate the number of entries that can be kept in the cache at once and also the number of input rows and the number of distinct values. We can see the input rows by looking at the outer side of the join in EXPLAIN, but we've no idea about the ndistinct or how many items the planner thought could be kept in the cache at once. The plan node already has est_entries, so it should just be a matter of storing the ndistinct estimate in the Path and putting it into the Plan node so the executor has access to it during EXPLAIN. David
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On 2023-Mar-07, Julien Rouhaud wrote: > I registered lapwing as a 32b Debian 7 so I thought it would be expected to > keep it as-is rather than upgrading to all newer major Debian versions, > especially since there were newer debian animal registered (no 32b though > AFAICS). I'm not opposed to upgrading it but I think there's still value in > having somewhat old packages versions being tested, especially since there > isn't much 32b coverage of those. I would be happy to register a newer 32b > version, or even sid, if needed but the -m32 part on the CI makes me think > there isn't much value doing that now. I think a pure 32bit animal running contemporary Debian would be better than just ditching the animal completely, as would appear to be the alternative, precisely because we have no other 32bit machine running x86 Linux. Maybe you can have *two* animals on the same machine: one running the old Debian without -Werror, and the one with new Debian and that flag kept. I think CI is not a replacement for the buildfarm. It helps catch some problems earlier, but we shouldn't think that we no longer need some buildfarm animals because CI runs those configs. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Add pg_walinspect function with block info columns
On Tue, Mar 7, 2023 at 12:48 PM Michael Paquier wrote: > > On Tue, Mar 07, 2023 at 03:49:02PM +0900, Kyotaro Horiguchi wrote: > > Ah. Yes, that expansion sounds sensible. > > Okay, so, based on this idea, I have hacked on this stuff and finish > with the attached that shows block data if it exists, as well as FPI > stuff if any. bimg_info is showed as a text[] for its flags. +1. > I guess that I'd better add a test that shows correctly a record with > some block data attached to it, on top of the existing one for FPIs.. > Any suggestions? Perhaps just a heap/heap2 record? > > Thoughts? That would be a lot better. Not just the test, but also the documentation can have it. Simple way to generate such a record (both block data and FPI) is to just change the wal_level to logical in walinspect.conf [1], see code around REGBUF_KEEP_DATA and RelationIsLogicallyLogged in heapam.c I had the following comments and fixed them in the attached v2 patch: 1. Still a trace of pg_get_wal_fpi_info in docs, removed it. 2. Used int4 instead of int for fpilen just to be in sync with fpi_length of pg_get_wal_record_info. 3. Changed to be consistent and use just FPI or "F/full page". /* FPI flags */ /* No full page image, so store NULLs for all its fields */ /* Full-page image */ /* Full page exists, so let's save it. */ * and end LSNs. This produces information about the full page images with * to a record. Decompression is applied to the full-page images, if 4. I think we need to free raw_data, raw_page and flags as we loop over multiple blocks (XLR_MAX_BLOCK_ID) and will leak memory which can be a problem if we have many blocks assocated with a single WAL record. flags = (Datum *) palloc0(sizeof(Datum) * bitcnt); Also, we will leak all CStringGetTextDatum memory in the block_id for loop. Another way is to use and reset temp memory context in the for loop over block_ids. I prefer this approach over multiple pfree()s in block_id for loop. 5. I think it'd be good to say if the FPI is for WAL_VERIFICATION, so I changed it to the following. Feel free to ignore this if you think it's not required. if (blk->apply_image) flags[cnt++] = CStringGetTextDatum("APPLY"); else flags[cnt++] = CStringGetTextDatum("WAL_VERIFICATION"); 6. Did minor wordsmithing. 7. Added test case which shows both block data and fpi in the documentation. 8. Changed wal_level to logical in walinspect.conf to test case with block data. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v2-0001-Rework-pg_walinspect-to-retrieve-more-block-infor.patch Description: Binary data
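A sketch of how such a record could be generated for the test or the docs, assuming wal_level = logical as in point 8 (so the INSERT carries block data) and a preceding CHECKPOINT (so it also carries a full-page image); the function is assumed to take the same (start_lsn, end_lsn) arguments as pg_get_wal_records_info:

    CREATE TABLE sample_tbl (col1 int);
    CHECKPOINT;                                    -- the next page change gets an FPI
    SELECT pg_current_wal_lsn() AS wal_lsn1 \gset
    INSERT INTO sample_tbl VALUES (1);             -- heap INSERT record with block data
    SELECT * FROM pg_get_wal_block_info(:'wal_lsn1', pg_current_wal_lsn());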
Re: RFC: logical publication via inheritance root?
Hi Jacob, > I'm going to register this in CF for feedback. Many thanks for the updated patch. Despite the fact that the patch is still work in progress all in all it looks very good to me. So far I only have a couple of nitpicks, mostly regarding the code coverage [1]: ``` +tablename = get_rel_name(tableoid); +if (tablename == NULL) +ereport(ERROR, +(errcode(ERRCODE_UNDEFINED_TABLE), + errmsg("OID %u does not refer to a table", tableoid))); +rootname = get_rel_name(rootoid); +if (rootname == NULL) +ereport(ERROR, +(errcode(ERRCODE_UNDEFINED_TABLE), + errmsg("OID %u does not refer to a table", rootoid))); ``` ``` +res = walrcv_exec(LogRepWorkerWalRcvConn, cmd.data, + lengthof(descRow), descRow); + +if (res->status != WALRCV_OK_TUPLES) +ereport(ERROR, +(errmsg("could not fetch logical descendants for table \"%s.%s\" from publisher: %s", +nspname, relname, res->err))); ``` ``` +res = walrcv_exec(LogRepWorkerWalRcvConn, cmd.data, 0, NULL); +pfree(cmd.data); +if (res->status != WALRCV_OK_COPY_OUT) +ereport(ERROR, +(errcode(ERRCODE_CONNECTION_FAILURE), + errmsg("could not start initial contents copy for table \"%s.%s\" from remote %s: %s", +lrel.nspname, lrel.relname, quoted_name, res->err))); ``` These new ereport() paths are never executed when we run the tests. I'm not 100% sure if they are "should never happen in practice" cases or not. If they are, I suggest adding corresponding comments. Otherwise we have to test these paths. ``` +else +{ +/* For older servers, we only COPY the table itself. */ +char *quoted = quote_qualified_identifier(lrel->nspname, +lrel->relname); +*to_copy = lappend(*to_copy, quoted); +} ``` Also we have to be extra careful with this code path because it is not test-covered too. ``` +Datum +pg_get_publication_rels_to_sync(PG_FUNCTION_ARGS) +{ +#define NUM_SYNC_TABLES_ELEM 1 ``` What is this macro for? ``` +{ oid => '8137', descr => 'get list of tables to copy during initial sync', + proname => 'pg_get_publication_rels_to_sync', prorows => '10', proretset => 't', + provolatile => 's', prorettype => 'regclass', proargtypes => 'regclass text', + proargnames => '{rootid,pubname}', + prosrc => 'pg_get_publication_rels_to_sync' }, ``` Something seems odd here. Is there a chance that it can return different results even within one statement, especially considering the fact that pg_set_logical_root() is VOLATILE? Maybe pg_get_publication_rels_to_sync() should be VOLATILE too [2]. ``` +{ oid => '8136', descr => 'mark a table root for logical replication', + proname => 'pg_set_logical_root', provolatile => 'v', proparallel => 'u', + prorettype => 'void', proargtypes => 'regclass regclass', + prosrc => 'pg_set_logical_root' }, ``` Shouldn't we also have pg_unset(reset?)_logical_root? [1]: https://github.com/afiskon/pgscripts/blob/master/code-coverage.sh [2]: https://www.postgresql.org/docs/current/xfunc-volatility.html -- Best regards, Aleksander Alekseev
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Andres, Amit, all > I think the case in which the patch regresses performance in is irrelevant > in practice. This is similar to what I think in this context. I appreciate the effort from Shi Yu, so that we have a clear understanding of the overhead. But the tests we do in [1], where we observe the regression, are largely synthetic test cases that aim to spot the overhead. > And having an index over thousands > of non-differing values will generally perform badly, not just in this > context. Similarly, maybe there are some eccentric use patterns that might follow this. But even if there are such patterns, I suspect they could hardly be performance sensitive. Thanks, Onder KALACI [1] https://www.postgresql.org/message-id/OSZPR01MB63103A4AFBBA56BAF8AE7FAAFDA39%40OSZPR01MB6310.jpnprd01.prod.outlook.com
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Shi Yu, all > Thanks for updating the patch. Here are some comments on v33-0001 patch. > > 1. > + if (RelationReplicaIdentityFullIndexScanEnabled(localrel) && > + remoterel->replident == REPLICA_IDENTITY_FULL) > > RelationReplicaIdentityFullIndexScanEnabled() is introduced in 0002 patch, > so > the call to it should be moved to 0002 patch I think. > Ah, sure, a rebase issue, fixed in v34 > > 2. > +#include "optimizer/cost.h" > > Do we need this in the latest patch? I tried and it looks it could be > removed > from src/backend/replication/logical/relation.c. > Hmm, probably an artifact of the initial versions of the patch where we needed some costing functionality. > > 3. > +# now, create a unique index and set the replica > +$node_publisher->safe_psql('postgres', > + "CREATE UNIQUE INDEX test_replica_id_full_unique ON > test_replica_id_full(x);"); > +$node_subscriber->safe_psql('postgres', > + "CREATE UNIQUE INDEX test_replica_id_full_unique ON > test_replica_id_full(x);"); > + > > Should the comment be "now, create a unique index and set the replica > identity"? > yes, fixed > > 4. > +$node_publisher->safe_psql('postgres', > + "ALTER TABLE test_replica_id_full REPLICA IDENTITY USING INDEX > test_replica_id_full_unique;"); > +$node_subscriber->safe_psql('postgres', > + "ALTER TABLE test_replica_id_full REPLICA IDENTITY USING INDEX > test_replica_id_full_unique;"); > + > +# wait for the synchronization to finish > +$node_subscriber->wait_for_subscription_sync; > > There's no new tables to need to be synchronized here, should we remove > the call > to wait_for_subscription_sync? > right, probably a copy & paste typo, thanks for spotting. I'll attach v34 with the next e-mail given the comments here only touch small parts of the patch. Thanks, Onder KALACI
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Amit, all > > Few comments: > = > 1. > +get_usable_indexoid(ApplyExecutionData *edata, ResultRelInfo *relinfo) > { > ... > + if (targetrelkind == RELKIND_PARTITIONED_TABLE) > + { > + /* Target is a partitioned table, so find relmapentry of the partition */ > + TupleConversionMap *map = ExecGetRootToChildMap(relinfo, edata->estate); > + AttrMap*attrmap = map ? map->attrMap : NULL; > + > + relmapentry = > + logicalrep_partition_open(relmapentry, relinfo->ri_RelationDesc, > + attrmap); > > > When will we hit this part of the code? As per my understanding, for > partitioned tables, we always follow apply_handle_tuple_routing() > which should call logicalrep_partition_open(), and do the necessary > work for us. > > Looking closer, there is really no need for that. I changed the code such that we pass usableLocalIndexOid. It looks simpler to me. Do you agree? > 2. In logicalrep_partition_open(), it would be better to set > localrelvalid after finding the required index. The entry should be > marked valid after initializing/updating all the required members. I > have changed this in the attached. > > makes sense > 3. > @@ -31,6 +32,7 @@ typedef struct LogicalRepRelMapEntry > Relation localrel; /* relcache entry (NULL when closed) */ > AttrMap*attrmap; /* map of local attributes to remote ones */ > bool updatable; /* Can apply updates/deletes? */ > + Oid usableIndexOid; /* which index to use, or InvalidOid if none */ > > Would it be better to name this new variable as localindexoid to match > it with the existing variable localreloid? Also, the camel case for > this variable appears odd. > yes, both makes sense > > 4. If you agree with the above, then we should probably change the > name of functions get_usable_indexoid() and > FindLogicalRepUsableIndex() accordingly. > I dropped get_usable_indexoid() as noted in (1). Changed FindLogicalRepUsableIndex->FindLogicalRepLocalIndex > 5. > + { > + /* > + * If we had a primary key or relation identity with a unique index, > + * we would have already found and returned that oid. At this point, > + * the remote relation has replica identity full. > > These comments are not required as this just states what the code just > above is doing. > I don't have any strong opinions on adding this comment, applied your suggestion. > > Apart from the above, I have made some modifications in the other > comments. If you are convinced with those, then kindly include them in > the next version. > > Sure, they all look good. I think I have lost (and caused the reviewers to lose) quite a bit of time on the comment reviews. Next time, I'll try to be more prepared for the comments. Attached v34 Thanks, Onder KALACI v34_0001_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data v34_0002_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data
RE: The order of queues in row lock is changed (not FIFO)
From: Tom Lane > I don't see a bug here, or at least I'm not willing to move the goalposts to > where you want them to be. > I believe that we do guarantee arrival-order locking of individual tuple > versions. However, in the > example you show, a single row is being updated over and over. So, initially > we have a single "winner" > transaction that got the tuple lock first and updated the row. When it > commits, each other transaction > serially comes off the wait queue for that tuple lock and discovers that it > now needs a lock on a > different tuple version than it has got. > So it tries to get lock on whichever is the latest tuple version. > That might still appear serial as far as the original 100 sessions go, > because they were all queued on the > same tuple lock to start with. > But when the new sessions come in, they effectively line-jump because they > will initially try to lock > whichever tuple version is committed live at that instant, and thus they get > ahead of whichever remain of > the original 100 sessions for the lock on that tuple version (since those are > all still blocked on some older > tuple version, whose lock is held by whichever session is performing the > next-to-commit update). > I don't see any way to make that more stable that doesn't involve requiring > sessions to take locks on > already-dead-to-them tuples; which sure seems like a nonstarter, not least > because we don't even have a > way to find such tuples. The update chains only link forward not back. Thank you for your reply. When I was doing this test, I confirmed the following two actions. (1) The first 100 sessions are overtaken by the last 10. (2) the order of the preceding 100 sessions changes (1) I was concerned from the user's point of view that the lock order for the same tuple was not preserved. However, as you pointed out, in many cases the order of arrival is guaranteed from the perspective of the tuple. You understand the PostgreSQL architecture and understand that you need to use it. (2) This behavior is rare. Typically, the first session gets AccessExclusiveLock to the tuple and ShareLock to the transaction ID. Subsequent sessions will wait for AccessExclusiveLock to the tuple. However, we ignored AccessExclusiveLock in the tuple from the log and observed multiple sessions waiting for ShareLock to the transaction ID. The log shows that the order of the original 100 sessions has been changed due to the above movement. At first, I thought both (1) and (2) were obstacles. However, I understood from your indication that (1) is not a bug. I would be grateful if you could also give me your opinion on (2). Share the following logs: [Log] 1. ShareLock has one wait, the rest is in AccessExclusiveLock 1-1. Only 1369555 is aligned with ShareLock, the transaction ID obtained by 1369547, and the rest with AccessExclusiveLock, the tuple obtained by 1369555. This is similar to a pattern in which no updates have occurred to the tuple. -- 2022-10-26 01:20:08.881 EDT [1369555:19:0] LOG: process 1369555 still waiting for ShareLock on transaction 2501 after 10.072 ms 2022-10-26 01:20:08.881 EDT [1369555:20:0] DETAIL: Process holding the lock: 1369547. Wait queue: 1369555. 〜 2022-10-26 01:21:58.918 EDT [1369898:17:0] LOG: process 1369898 acquired AccessExclusiveLock on tuple (1, 0) of relation 16546 of database 13779 after 10.321 ms 2022-10-26 01:21:58.918 EDT [1369898:18:0] DETAIL: Process holding the lock: 1369555. Wait queue: 1369558, 1369561, 1369564, 1369567, 1369570, 1369573, 1369576, ... -- 2. 
All processes wait with ShareLock 2-1. With 1369558 holding the t1 (0, 4) lock, the queue head is 1369561. -- 2022-10-26 01:22:27.230 EDT [1369623:46:2525] LOG: process 1369623 still waiting for ShareLock on transaction 2504 after 10.133 msprocess 1369623 still waiting for ShareLock on transaction 2504 after 10.133 ms 2022-10-26 01:22:27.242 EDT [1369877:47:2604] DETAIL: Process holding the lock: 1369558. Wait queue: 1369561, 1369623, 1369626, ... -- 2-2. When 1369558 locks are released, the first 1369561 in the Wait queue was expected to acquire the lock, but the process actually acquired 1369787 -- 2022-10-26 01:22:28.237 EDT [1369623:63:2525] LOG: process 1369623 still waiting for ShareLock on transaction 2577 after 10.028 ms 2022-10-26 01:22:28.237 EDT [1369623:64:2525] DETAIL: Process holding the lock: 1369787. Wait queue: 1369623, 1369610, 1369614, 1369617, 1369620. -- 2-3. Checking that the 1369561 is rearranging. -- 2022-10-26 01:22:28.237 E
some problem explicit_bzero with building PostgreSQL on linux
Hi, we have a problem building PostgreSQL (version 15.1) on Linux. ldd --version returns ldd (Debian GLIBC 2.31-13+deb11u5.tmw1) 2.31. We can build it all right; however, we want to use the binaries on a different glibc version, so we detect usage of glibc > 2.17 and need to prevent the use of symbols (like explicit_bzero) that did not exist in glibc 2.17. What we see is that even if I comment out the line $as_echo "#define HAVE_EXPLICIT_BZERO 1" >>confdefs.h in configure, we still have a problem: the symbol explicit_bzero is leaked into lib/libpq.so.5.15, bin/postgres, and bin/pg_verifybackup. I was able to verify that HAVE_EXPLICIT_BZERO was not defined in all C files that use explicit_bzero:
./src/interfaces/libpq/fe-connect.c
./src/backend/libpq/be-secure-common.c
./src/common/hmac_openssl.c
./src/common/cryptohash.c
./src/common/cryptohash_openssl.c
./src/common/hmac.c
How can we guarantee that, if HAVE_EXPLICIT_BZERO is not defined, the explicit_bzero function implemented in port/explicit_bzero.c will be used (just like on Darwin or Windows)? Thanks in advance, dm
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On Tue, 17 Jan 2023 at 16:33, Aleksander Alekseev wrote: > Hi hackers, > > Maxim, perhaps you could share with us what your reasoning was here? > > I'm really sorry for the late response, but better late than never. Yes, we cannot access shared memory without a lock. In this particular case, we use XidGenLock. That is why we use the lock argument: to take it if it was not taken previously. Actually, we may place assertion in this insist. As for the xid compare: we do not compare xids here, we are checking for wraparound, so, AFAICS, this code is correct. On Mon, 6 Mar 2023 at 22:48, Heikki Linnakangas wrote: > > 1. Use 64 bit page numbers in SLRUs (this patch) > > 2. Use the larger segment file names in async.c, to lift the current 8 > GB limit on the max number of pending notifications. > > 3. Extend pg_multixact so that pg_multixact/members is addressed by > 64-bit offsets. > > 4. Extend pg_subtrans to 64-bits. > > 5. Extend pg_xact to 64-bits. > > > 6. (a bonus thing that I noticed while thinking of pg_xact.) Extend > pg_twophase.c, to use FullTransactionIds. > > Currently, the twophase state files in pg_twophase are named according > to the 32 bit Xid of the transaction. Let's switch to FullTransactionId > there. > > ... > > I propose that we try to finish 1 and 2 for v16. And maybe 6. I think > that's doable. It doesn't have any great user-visible benefits yet, but > we need to start somewhere. > > - Heikki > > Yes, this is a great idea! My only concern is that we're going in circles here. You see, patch 1 is what was proposed at the beginning of this thread. Anyway, I will be happy if we are able to push this topic forward. As for making pg_multixact 64 bit, I spent the last couple of days making proper pg_upgrade for pg_multixact's and for pg_xact's with wraparound, and I've understood that it is not a simple task compared to pg_xact's. The problem is, we do not have an epoch for multixacts, so we do not have the ability to "overcome" wraparound. The solution may be adding some kind of epoch for multixacts or making them 64 bit in the "main" 64-xid patch, but from the perspective of this thread, in my view, this should be last in line here. In pg_xact we do not have such a problem, we do have an epoch for transactions, so the conversion should be pretty obvious:
0000 -> 00000000
0001 -> 00000001
...
0FFE -> 00000FFE
0FFF -> 00000FFF
0000 -> 00010000
0001 -> 00010001
So, in my view, the plan should be:
1. Use internal 64 bit page numbers in SLRUs without changing segment naming.
2. Use the larger segment file names in async.c, to lift the current 8 GB limit on the max number of pending notifications.
3. Extend pg_xact to 64-bits.
4. Extend pg_subtrans to 64-bits.
5. Extend pg_multixact so that pg_multixact/members is addressed by 64-bit offsets.
6. Extend pg_twophase.c, to use FullTransactionIds. (a bonus thing)
Thoughts? -- Best regards, Maxim Orlov. On Mon, 6 Mar 2023 at 22:48, Heikki Linnakangas wrote: > On 01/03/2023 12:21, Aleksander Alekseev wrote: > > Hi, > > > >> I'm surprised that these patches extend the page numbering to 64 bits, > >> but never actually uses the high bits. The XID "epoch" is not used, and > >> pg_xact still wraps around and the segment names are still reused. I > >> thought we could stop doing that. > > > > To clarify, the idea is to let CLOG grow indefinitely and simply store > > FullTransactionId -> TransactionStatus (two bits). Correct? > > Correct. 
> > > I didn't investigate this in much detail but it may affect quite some > > amount of code since TransactionIdDidCommit() and > > TransactionIdDidCommit() currently both deal with TransactionId, not > > FullTransactionId. IMO, this would be a nice change however, assuming > > we are ready for it. > > Yep, it's a lot of code churn.. > > > In the previous version of the patch there was an attempt to derive > > FullTransactionId from TransactionId but it was wrong for the reasons > > named above in the thread. Thus is was removed and the patch > > simplified. > > Yeah, it's tricky to get it right. Clearly we need to do it at some > point though. > > All in all, this is a big effort. I spent some more time reviewing this > in the last few days, and thought a lot about what the path forward here > could be. And I haven't looked at the actual 64-bit XIDs patch set yet, > just this patch to use 64-bit addressing in SLRUs. > > This patch is the first step, but we have a bit of a chicken and egg > problem, because this patch on its own isn't very interesting, but on > the other hand, we need it to work on the follow up items. Here's how I > see the development path for this (and again, this is just for the > 64-bit SLRUs work, not the bigger 64-bit-XIDs-in-heapam effort): > > 1. Use 64 bit page numbers in SLRUs (this patch) > > I would like to make one change here: I'd prefer to keep the old 4-digit > segment names, until we actually start to use the wider address space. > Let's allow each SLRU to specify how many dig
Re: Add pg_walinspect function with block info columns
On Tue, 7 Mar 2023 at 01:34, Michael Paquier wrote: > > On Mon, Mar 06, 2023 at 04:08:28PM +0100, Matthias van de Meent wrote: > > On Mon, 6 Mar 2023 at 15:40, Bharath Rupireddy > >> IMO, pg_get_wal_records_extended_info as proposed doesn't look good to > >> me as it outputs most of the columns that are already given by > >> pg_get_wal_records_info.What I think the best way at this point is to > >> make it return the following: > >> lsn pg_lsn > >> block_id int8 > >> spcOid oid > >> dbOid oid > >> relNumber oid > >> forkNames text > >> fpi bytea > >> fpi_info text > > I would add the length of the block data (without the hole and > compressed, as the FPI data should always be presented as > uncompressed), and the block data if any (without the block data > length as one can guess it based on the bytea data length). Note that > a block can have both a FPI and some data assigned to it, as far as I > recall. > > > The basic idea is to create a single entrypoint to all relevant data > > from DecodedXLogRecord in SQL, not multiple. > > While I would agree with this principle on simplicity's ground in > terms of minimizing the SQL interface and the pg_wal/ lookups, I > disagree about it on unsability ground, because we can avoid extra SQL > tweaks with more functions. One recent example I have in mind is > partitionfuncs.c, which can actually be achieved with a WITH RECURSIVE > on the catalogs. Correct, but in that case the user would build the same query (or at least with the same complexity) as what we're executing under the hood, right? > There are of course various degrees of complexity, > and perhaps unnest() cannot qualify as one, but having two functions > returning normalized records (one for the record information, and a > second for the block information), is a rather good balance between > usability and interface complexity, in my experience. I would agree, if it weren't for the reasons written below. > If you have two > functions, a JOIN is enough to cross-check the block data and the > record data, Joins are expensive on large datasets; and because WAL is one of the largest datasets in the system, why would we want to force the user to JOIN them if we can produce the data in one pre-baked data structure without a need to join? > while an unnest() heavily bloats the main function output > (aka byteas of FPIs in a single array). I don't see how that would be bad. You can select a subset of columns without much issue, which can allow you to ignore any and all bloat. It is also not easy to imagine that we'd have arguments in the function that determine whether it includes the largest fields (main data, blocks, block data, and block images) or leaves them NULL so that we need to pass less data around if the user doesn't want the data. Matthias van de Meent
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 3:00 PM Peter Smith wrote: > > On Tue, Mar 7, 2023 at 8:01 PM Amit Kapila wrote: > > > > On Tue, Mar 7, 2023 at 1:19 PM Peter Smith wrote: > > > > > > Let me give an example to demonstrate why I thought something is fishy > > > here: > > > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > > > --- > > > > > > +/* > > > + * Get replica identity index or if it is not defined a primary key. > > > + * > > > + * If neither is defined, returns InvalidOid > > > + */ > > > +Oid > > > +GetRelationIdentityOrPK(Relation rel) > > > +{ > > > + Oid idxoid; > > > + > > > + idxoid = RelationGetReplicaIndex(rel); > > > + > > > + if (!OidIsValid(idxoid)) > > > + idxoid = RelationGetPrimaryKeyIndex(rel); > > > + > > > + return idxoid; > > > +} > > > + > > > +/* > > > + * Given a relation and OID of an index, returns true if the > > > + * index is relation's replica identity index or relation's > > > + * primary key's index. > > > + * > > > + * Returns false otherwise. > > > + */ > > > +bool > > > +IdxIsRelationIdentityOrPK(Relation rel, Oid idxoid) > > > +{ > > > + Assert(OidIsValid(idxoid)); > > > + > > > + return GetRelationIdentityOrPK(rel) == idxoid; > > > +} > > > + > > > > > > > > > So, according to the above function comment/name, I will expect > > > calling IdxIsRelationIdentityOrPK passing Oid= or Oid- will > > > both return true, right? > > > > > > But AFAICT > > > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > > returns (the valid oid of the RI) --> == --> true; > > > > > > IdxIsRelationIdentityOrPK(rel, ) --> GetRelationIdentityOrPK(rel) > > > returns (the valid oid of the RI) --> == --> false; > > > > > > ~ > > > > > > Now two people are telling me this is OK, but I've been staring at it > > > for too long and I just don't see how it can be. (??) > > > > > > > The difference is that you are misunderstanding the intent of this > > function. GetRelationIdentityOrPK() returns a "replica identity index > > oid" if the same is defined, else return PK, if that is defined, > > otherwise, return invalidOid. This is what is expected by its callers. > > Now, one can argue to have a different function name and that may be a > > valid argument but as far as I can see the function does what is > > expected from it. > > > > Sure, but I am questioning the function IdxIsRelationIdentityOrPK, not > GetRelationIdentityOrPK. > The intent of IdxIsRelationIdentityOrPK() is as follows: Returns true for the following conditions (a) if the given index OID is the same as replica identity (when the same is defined); else (if replica identity is not defined then (b)) (b) if the given OID is the same as PK. Returns false otherwise. Feel free to propose any better function name or if you think comments can be changed which makes it easier to understand. -- With Regards, Amit Kapila.
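To make the intent concrete, a small sketch of the situation being debated; the names are made up:

    CREATE TABLE rel (a int PRIMARY KEY, b int NOT NULL);
    CREATE UNIQUE INDEX rel_b_idx ON rel (b);
    ALTER TABLE rel REPLICA IDENTITY USING INDEX rel_b_idx;
    -- GetRelationIdentityOrPK(rel) now returns rel_b_idx's OID, because the
    -- replica identity index takes precedence over the primary key. Hence
    -- IdxIsRelationIdentityOrPK(rel, <rel_b_idx OID>) returns true, while
    -- IdxIsRelationIdentityOrPK(rel, <PK index OID>) returns false, which is
    -- the behaviour described as intended above.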
Re: pg_dump/pg_restore: Fix stdin/stdout handling of custom format on Win32
On Thu, Mar 2, 2023 at 8:01 AM Michael Paquier wrote: > > We had better make sure that this does not break again 10260c7, and > these could not be reproduced with automated tests as they needed a > Windows terminal. Isn't this issue like the other commit, where the > automated testing cannot reproduce any of that because it requires a > terminal? If not, could it be possible to add some tests to have some > coverage? The tests of pg_dump in src/bin/pg_dump/t/ invoke the > custom format in a few scenarios already, and these are tested in the > buildfarm for a couple of years now, without failing, but perhaps we'd > need a small tweak to have a reproducible test case for automation? > I've been able to manually reproduce the problem with: pg_dump --format=custom > custom.dump pg_restore --file=toc.txt --list custom.dump pg_dump --format=custom | pg_restore --file=toc.dump --use-list=toc.txt The error I get is: pg_restore: error: unsupported version (0.7) in file header I'm not really sure how to integrate this in a tap test. > The internal implementation of _pgstat64() is used in quite a few > places, so we'd better update this part first, IMO, and then focus on > the pg_dump part. Anyway, it looks like you are right here: there is > nothing for FILE_TYPE_PIPE or FILE_TYPE_CHAR in this WIN32 > implementation of fstat(). > > I am amazed to hear that both ftello64() and fseek64() actually > succeed if you use a pipe: > https://pubs.opengroup.org/onlinepubs/9699919799/functions/fseek.html > Could it be something we should try to make more portable by ourselves > with a wrapper for these on WIN32? That would not be the first one to > accomodate our code with POSIX, and who knows what code could be broken > because of that, like external extensions that use fseek64() without > knowing it. > The error is reproducible in versions previous to win32stat.c, so that might work as bug fix. > - if (hFile == INVALID_HANDLE_VALUE || buf == NULL) > + if (hFile == INVALID_HANDLE_VALUE || hFile == (HANDLE)-2 || buf == > NULL) > What's the -2 for? Perhaps this should have a comment? > There's a note on _get_osfhandle() [1] about when -2 is returned, but a comment seems appropriate. + fileType = GetFileType(hFile); > + lastError = GetLastError(); > [...] >if (fileType == FILE_TYPE_UNKNOWN && lastError != NO_ERROR) { > + _dosmaperr(lastError); > + return -1; > } > So, the patched code assumes that all the file types classified as > FILE_TYPE_UNKNOWN when GetFileType() does not fail refer to fileno > being either stdin, stderr or stdout. Perhaps we had better > cross-check that fileno points to one of these three cases in the > switch under FILE_TYPE_UNKNOWN? Could there be other cases where we > have FILE_TYPE_UNKNOWN but GetFileType() does not fail? > I don't think we should set st_mode for FILE_TYPE_UNKNOWN. > Per the documentation of GetFileType, FILE_TYPE_REMOTE is unused: > > https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getfiletype > Perhaps it would be safer to fail in this case? > +1, we don't know what that might involve. [1] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/get-osfhandle?view=msvc-170 Regards, Juan José Santamaría Flecha
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On 07/03/2023 13:38, Maxim Orlov wrote:
> As for making pg_multixact 64-bit, I spent the last couple of days making a
> proper pg_upgrade path for pg_multixacts and pg_xacts with wraparound, and
> I've come to understand that it is not a simple task compared to pg_xact.
> The problem is, we do not have an epoch for multixacts, so we have no way to
> "overcome" wraparound. The solution may be adding some kind of epoch for
> multixacts, or making them 64-bit in the "main" 64-xid patch, but from the
> perspective of this thread, in my view, this should be last in line here.

That is true for pg_multixact/offsets. We will indeed need to add an epoch and introduce the concept of FullMultiXactIds for that. However, we can change pg_multixact/members independently of that. We can extend MultiXactOffset from 32 to 64 bits, and eliminate pg_multixact/members wraparound, while keeping multi-xids 32 bits wide.

- Heikki
Re: Raising the SCRAM iteration count
> On 7 Mar 2023, at 09:26, Daniel Gustafsson wrote:
> Right, what I meant was: can a pg_regress sql/expected test drive a psql
> interactive prompt? Your comments suggested using password.sql so I was
> curious if I was missing a neat trick for doing this.

The attached v7 adds a TAP test verifying that \password uses the changed SCRAM iteration count setting, and dials back the other added test to use fewer iterations than the default setting in order to shave (barely noticeable amounts of) CPU cycles. Running interactive tests against psql adds a fair bit of complexity and isn't all that pleasing to the eye, but it can be cleaned up and refactored when https://commitfest.postgresql.org/42/4228/ is committed.

-- Daniel Gustafsson

v7-0001-Make-SCRAM-iteration-count-configurable.patch
Description: Binary data
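A rough psql sketch of what the new TAP test exercises, assuming the GUC name used in this patch series (scram_iterations) and a placeholder role name:

-- Set a non-default iteration count for this session, then let \password
-- build the SCRAM secret with it.
SET scram_iterations = 1024;
\password alice
-- A superuser can then eyeball the iteration count stored in the verifier:
SELECT left(rolpassword, 30) FROM pg_authid WHERE rolname = 'alice';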
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
Hi,

> Here's how I see the development path for this [...]
> So, in my view, the plan should be [...]
> Thoughts?

The plan looks great! I would also explicitly include this:

> As we start to refactor these things, I also think it would be good to
> have more explicit tracking of the valid range of SLRU pages in each
> SLRU. Take pg_subtrans for example: it's not very clear what pages have
> been initialized, especially during different stages of startup. It
> would be good to have clear start and end page numbers, and throw an
> error if you try to look up anything outside those bounds. Same for all
> other SLRUs.

So the plan becomes:

1. Use internal 64-bit page numbers in SLRUs without changing segment naming.
2. Use the larger segment file names in async.c, to lift the current 8 GB limit on the max number of pending notifications.
3. Extend pg_xact to 64 bits.
4. Extend pg_subtrans to 64 bits.
5. Extend pg_multixact so that pg_multixact/members is addressed by 64-bit offsets.
6. Extend pg_twophase.c to use FullTransactionIds. (a bonus thing)
7. More explicit tracking of the valid range of SLRU pages in each SLRU.

Our main focus for PG16 is going to be 1 and 2, and we may try to deliver 6 and 7 too if time permits. Maxim and I agreed (offlist) that I'm going to submit 1 and 2. Patch 1 is ready, please see the attachment. I'm currently working on 2 and going to submit it in a bit. It seems to be relatively straightforward, but I don't want to rush things and want to make sure I didn't miss something.

> I wonder if we should actually add an artificial limit, as a GUC.

Yes, I think we need some sort of limit. Using a GUC would be the most straightforward approach. Alternatively we could derive the limit from the existing GUCs, similarly to how we derive the default value of wal_buffers from shared_buffers [1]. However, off the top of my head we only have max_wal_size, and it doesn't strike me as a good candidate for deriving something for NOTIFY / LISTEN. So I'm going to add a max_notify_segments GUC with a default value of 65536, as it is now. If the new GUC receives pushback from the community we can always use a hard-coded value instead, or no limit at all.

[1]: https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-BUFFERS

--
Best regards,
Aleksander Alekseev

v56-0001-Index-SLRUs-by-64-bit-integers-rather-than-by-32.patch
Description: Binary data
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi Amit, Peter > > > Let me give an example to demonstrate why I thought something is fishy > here: > > > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > Hmm, alright, this is syntactically possible, but not sure if any user would do this. Still thanks for catching this. And, you are right, if a user has created such a schema, IdxIsRelationIdentityOrPK() would return the wrong result and we'd use sequential scan instead of index scan. This would be a regression. I think we should change the function. Here is the example: DROP TABLE tab1; CREATE TABLE tab1 (a int NOT NULL); CREATE UNIQUE INDEX replica_unique ON tab1(a); ALTER TABLE tab1 REPLICA IDENTITY USING INDEX replica_unique; ALTER TABLE tab1 ADD CONSTRAINT pkey PRIMARY KEY (a); I'm attaching v35. Does that make sense to you Amit? v35_0002_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data v35_0001_use_index_on_subs_when_pub_rep_ident_full.patch Description: Binary data
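For what it's worth, after running the statements above the catalogs show both a replica identity index and a separate primary key index on tab1, which is exactly the shape that trips up IdxIsRelationIdentityOrPK(); a quick way to see it:

-- replica_unique should show indisreplident = true, pkey indisprimary = true.
SELECT indexrelid::regclass AS index_name, indisreplident, indisprimary
FROM pg_index
WHERE indrelid = 'tab1'::regclass;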
Re: some problem explicit_bzero with building PostgreSQL on linux
Dimitry Markman writes: > how we can guaranty that if HAVE_EXPLICIT_BZERO is not defined then > explicit_bzero function implemented in port/explicit_bzero.c will be used > (just like in Darwin or windows) Did you remember to add explicit_bzero.o to LIBOBJS in the configured Makefile.global? If it still doesn't work, then evidently your toolchain is selecting the system's built-in definition of explicit_bzero over the one in src/port/. This is not terribly surprising given that there has to be some amount of compiler magic involved in that function. You may have to resort to actually building Postgres on a platform without explicit_bzero. regards, tom lane
Re: XID formatting and SLRU refactorings (was: Add 64-bit XIDs into PostgreSQL 15)
On Tue, 7 Mar 2023 at 15:38, Heikki Linnakangas wrote: > > That is true for pg_multixact/offsets. We will indeed need to add an > epoch and introduce the concept of FullMultiXactIds for that. However, > we can change pg_multixact/members independently of that. We can extend > MultiXactOffset from 32 to 64 bits, and eliminate pg_multixact/members > wraparound, while keeping multi-xids 32 bits wide. > > Yes, you're totally correct. If it will be committable that way, I'm all for that. -- Best regards, Maxim Orlov.
Re: Add a hook to allow modification of the ldapbindpasswd
On 2023-03-06 Mo 15:16, Gregory Stark (as CFM) wrote: The CFBot says this patch is failing but I find it hard to believe this is related to this patch... 2023-03-05 20:56:58.705 UTC [33902][client backend] [pg_regress/btree_index][18/750:0] STATEMENT: ALTER INDEX btree_part_idx ALTER COLUMN id SET (n_distinct=100); 2023-03-05 20:56:58.709 UTC [33902][client backend] [pg_regress/btree_index][:0] LOG: disconnection: session time: 0:00:02.287 user=postgres database=regression host=[local] 2023-03-05 20:56:58.710 UTC [33889][client backend] [pg_regress/join][:0] LOG: disconnection: session time: 0:00:02.289 user=postgres database=regression host=[local] 2023-03-05 20:56:58.749 UTC [33045][postmaster] LOG: server process (PID 33898) was terminated by signal 6: Abort trap 2023-03-05 20:56:58.749 UTC [33045][postmaster] DETAIL: Failed process was running: SELECT * FROM writetest; 2023-03-05 20:56:58.749 UTC [33045][postmaster] LOG: terminating any other active server processes Yeah. It says it's fine now. Neither of the two recent failures look like they have anything to do with this. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: GUC for temporarily disabling event triggers
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

I like it now.
* The patch does what it intends to do;
* The implementation is clear;
* All tests pass;
* No additional problems caught - at least by my eyes.

I think it can be marked as Ready for Committer.

N.B. In fact I've encountered a couple of failed tests during installcheck-world, although the same failures are there even on the master branch, so they don't seem to be an issue with this patch.

The new status of this patch is: Ready for Committer
Re: Add shared buffer hits to pg_stat_io
Hi, On 3/6/23 4:38 PM, Melanie Plageman wrote: Thanks for the review! On Tue, Feb 28, 2023 at 7:36 AM Drouvot, Bertrand wrote: BufferDesc * LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, -bool *foundPtr, IOContext *io_context) +bool *foundPtr, IOContext io_context) { BufferTag newTag; /* identity of requested block */ LocalBufferLookupEnt *hresult; @@ -128,14 +128,6 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum, hresult = (LocalBufferLookupEnt *) hash_search(LocalBufHash, &newTag, HASH_FIND, NULL); - /* -* IO Operations on local buffers are only done in IOCONTEXT_NORMAL. Set -* io_context here (instead of after a buffer hit would have returned) for -* convenience since we don't have to worry about the overhead of calling -* IOContextForStrategy(). -*/ - *io_context = IOCONTEXT_NORMAL; It looks like that io_context is not used in LocalBufferAlloc() anymore and then can be removed as an argument. Good catch. Updated patchset attached. Thanks for the update! While adding this, I noticed that I had made all of the IOOP columns int8 in the view, and I was wondering if this is sufficient for hits (I imagine you could end up with quite a lot of those). I think that's ok and bigint is what is already used for pg_statio_user_tables.heap_blks_hit for example. Ah, I was silly and didn't understand that the SQL type int8 is eight bytes and not 1. That makes a lot of things make more sense :) Oh, I see ;-) I may give it another review but currently V2 looks good to me. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
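For readers following along, a sketch of how the new column would typically be consumed once it lands, with column names as in the patch under review:

-- Share of shared-buffer hits versus actual reads per backend type, for
-- relation data in the normal IO context (hits is the column this patch adds).
SELECT backend_type, hits, reads,
       round(hits::numeric / nullif(hits + reads, 0), 2) AS hit_ratio
FROM pg_stat_io
WHERE object = 'relation' AND context = 'normal';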
Re: Track IO times in pg_stat_io
Hi,

On 3/6/23 5:30 PM, Melanie Plageman wrote:
> Thanks for the review!
>
> On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand wrote:
>> On 2/26/23 5:03 PM, Melanie Plageman wrote:
>>> The timings will only be non-zero when track_io_timing is on
>>
>> That could lead to incorrect interpretation if one wants to divide the
>> timing per operation, say:
>> - track_io_timing is set to on while there are already operations
>> - or set to off while it was on (and the number of operations keeps growing)
>> Might be worth warning/highlighting this in the "track_io_timing" doc?
>
> This is a good point. I've added a note to the docs for pg_stat_io.

Thanks!

Now I have a second thought: what do you think about resetting the related number of operations and *_time fields when enabling/disabling track_io_timing? (And mentioning it in the doc.)

That way it would prevent bad interpretation (at least as far as the time per operation metrics are concerned).

I'm thinking that way because we'd lose some (most?) of the benefits of the new *_time columns if one can't "trust" their related operations and/or one is not sampling pg_stat_io frequently enough (to discard the samples where the track_io_timing changes occur).

But then, resetting the operations could also lead to bad interpretation about the operations... Not sure which approach I like the most yet, what do you think?

>> That leads to pgstat_count_io_time() being called before
>> pgstat_count_io_op() (for the IOOP_EXTEND case) and after
>> pgstat_count_io_op() (for the IOOP_READ case). What about calling them in
>> the same order, so that pgstat_count_io_time() is called before
>> pgstat_count_io_op()? If so, the ordering would also need to be changed in:
>> - FlushRelationBuffers()
>> - register_dirty_segment()
>
> Yes, good point. I've updated the code to use this suggested ordering in
> attached v3.

Thanks, this looks good to me.

>> As pgstat_bktype_io_stats_valid() is called only in Assert(), I think it
>> would be a good idea to also check that if counts are not zero then times
>> are not zero.
>
> Yes, I think adding some validation around the relationship between
> counts and timing should help prevent developers from forgetting to call
> pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> However, I think that we cannot check that if IO counts are non-zero then
> IO times are non-zero, because the user may not have track_io_timing
> enabled.

Yeah, right.

> We can check that if IO times are not zero, IO counts are not zero.
> I've done this in the attached v3.

Thanks, looks good to me.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
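To illustrate the interpretation hazard being discussed, this is the kind of derived metric a monitoring query would compute; the *_time columns are the ones proposed in this patch series, so the column names are provisional:

-- Average write time per write, per backend type.  write_time only
-- accumulates while track_io_timing is on, so the ratio is misleading if the
-- GUC was toggled after writes had already been counted.
SELECT backend_type, context, writes, write_time,
       write_time / nullif(writes, 0) AS avg_write_time_ms
FROM pg_stat_io
WHERE object = 'relation'
ORDER BY writes DESC;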
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Hi, Andres and Alexander! On Tue, 7 Mar 2023, 10:10 Alexander Korotkov, wrote: > On Tue, Mar 7, 2023 at 4:50 AM Andres Freund wrote: > > On 2023-03-02 14:28:56 +0400, Pavel Borisov wrote: > > > 2. Heap updates with low tuple concurrency: > > > Prepare with pkeys (pgbench -d postgres -i -I dtGvp -s 300 > --unlogged-tables) > > > Update 3*10^7 rows, 50 conns (pgbench postgres -f > > > ./update-only-account.sql -s 300 -P10 -M prepared -T 600 -j 5 -c 50) > > > > > > Result: Both patches and master are the same within a tolerance of > > > less than 0.7%. > > > > What exactly does that mean? I would definitely not want to accept a 0.7% > > regression of the uncontended case to benefit the contended case here... > > I don't know what exactly Pavel meant, but average overall numbers for > low concurrency are. > master: 420401 (stddev of average 233) > patchset v11: 420111 (stddev of average 199) > The difference is less than 0.1% and that is very safely within the error. > Yes, the only thing that I meant is that for low-concurrency case the results between patch and master are within the difference between repeated series of measurements. So I concluded that the test can not prove any difference between patch and master. I haven't meant or written there is some performance degradation. Alexander, I suppose did an extra step and calculated overall average and stddev, from raw data provided. Thanks! Regards, Pavel. >
Re: Timeline ID hexadecimal format
On 06/03/2023 18:04, Peter Eisentraut wrote: On 03.03.23 16:52, Sébastien Lardière wrote: On 02/03/2023 09:12, Peter Eisentraut wrote: I think here it would be more helpful to show actual examples. Like, here is a possible file name, this is what the different parts mean. So you mean explain the WAL filename and the history filename ? Is it the good place for it ? Well, your patch says, by the way, the timeline ID in the file is hexadecimal. Then one might ask, what file, what is a timeline, what are the other numbers in the file, etc. It seems very specific in this context. I don't know if the format of these file names is actually documented somewhere. Well, in the context of this patch, the usage both filename are explained juste before, so it seems understandable to me Timelines are explained in this place : https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-TIMELINES so the patch explains the format there This applies to all configuration parameters, so it doesn't need to be mentioned explicitly for individual ones. Probably, but is there another parameter with the same consequence ? worth it to document this point globally ? It's ok to mention it again. We do something similar for example at unix_socket_permissions. But maybe with more context, like "If you want to specify a timeline ID hexadecimal (for example, if extracted from a WAL file name), then prefix it with a 0x". Ok, I've improved the message Maybe this could be fixed instead? Indeed, and strtoul is probably a better option than sscanf, don't you think ? Yeah, the use of sscanf() is kind of weird here. We have been moving the option parsing to use option_parse_int(). Maybe hex support could be added there. Or just use strtoul(). I've made the change with strtoul About option_parse_int(), actually, strtoint() is used, do we need a option_parse_ul() fonction ? patch attached, best regards, -- Sébastien diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml index be05a33205..fb86a3fec5 100644 --- a/doc/src/sgml/backup.sgml +++ b/doc/src/sgml/backup.sgml @@ -1332,7 +1332,8 @@ restore_command = 'cp /mnt/server/archivedir/%f %p' you like, add comments to a history file to record your own notes about how and why this particular timeline was created. Such comments will be especially valuable when you have a thicket of different timelines as -a result of experimentation. +a result of experimentation. The timeline identifier is an integer which +is used in hexadecimal format in both WAL segment file names and history files. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e5c41cc6c6..6c0d63b73d 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -4110,7 +4110,11 @@ restore_command = 'copy "C:\\server\\archivedir\\%f" "%p"' # Windows current when the base backup was taken. The value latest recovers to the latest timeline found in the archive, which is useful in -a standby server. latest is the default. +a standby server. A numerical value expressed in hexadecimal, by exemple +from WAL filename or history file, must be prefixed with 0x, +for example 0x11 if the WAL filename +is 001100A1004F. +latest is the default. diff --git a/doc/src/sgml/ref/pg_waldump.sgml b/doc/src/sgml/ref/pg_waldump.sgml index 343f0482a9..d92948c68a 100644 --- a/doc/src/sgml/ref/pg_waldump.sgml +++ b/doc/src/sgml/ref/pg_waldump.sgml @@ -215,7 +215,9 @@ PostgreSQL documentation Timeline from which to read WAL records. 
The default is to use the value in startseg, if that is specified; otherwise, the -default is 1. +default is 1. The value can be expressed in decimal or hexadecimal format. +The hexadecimal format, given by WAL filename, must be preceded by 0x, +by example 0x11. diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 44b5c8726e..b29d5223ce 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -770,6 +770,7 @@ usage(void) printf(_(" -R, --relation=T/D/R only show records that modify blocks in relation T/D/R\n")); printf(_(" -s, --start=RECPTR start reading at WAL location RECPTR\n")); printf(_(" -t, --timeline=TLI timeline from which to read WAL records\n" + " hexadecimal value, from WAL filename, must be preceded by 0x\n" " (default: 1 or the value used in STARTSEG)\n")); printf(_(" -V, --version output version information, then exit\n")); printf(_(" -w, --fullpage only show records with a full page write\n")); @@ -1007,7 +1008,7 @@ main(int argc, char **argv) private.startptr = (uint64) xlogid << 32 | xrecoff; break; c
Re: some problem explicit_bzero with building PostgreSQL on linux
Hi Tom, thanks a lot Adding explicit_bzero.o did the job Thanks a lot dm From: Tom Lane Date: Tuesday, March 7, 2023 at 9:14 AM To: Dimitry Markman Cc: pgsql-hackers@lists.postgresql.org , Bhavya Dabas Subject: Re: some problem explicit_bzero with building PostgreSQL on linux Dimitry Markman writes: > how we can guaranty that if HAVE_EXPLICIT_BZERO is not defined then > explicit_bzero function implemented in port/explicit_bzero.c will be used > (just like in Darwin or windows) Did you remember to add explicit_bzero.o to LIBOBJS in the configured Makefile.global? If it still doesn't work, then evidently your toolchain is selecting the system's built-in definition of explicit_bzero over the one in src/port/. This is not terribly surprising given that there has to be some amount of compiler magic involved in that function. You may have to resort to actually building Postgres on a platform without explicit_bzero. regards, tom lane
Re: add PROCESS_MAIN to VACUUM
On Mon, Mar 6, 2023 at 10:45 PM Michael Paquier wrote: > > On Mon, Mar 06, 2023 at 04:59:49PM -0800, Nathan Bossart wrote: > > That did cross my mind, but I was worried that trying to explain all that > > here could cause confusion. > > > > If PROCESS_MAIN is set (the default), it's time to vacuum the main > > relation. Otherwise, we can skip this part. If processing the TOAST > > table is required (e.g., PROCESS_TOAST is set), we'll force > > PROCESS_MAIN to be set when we recurse to the TOAST table so that it > > gets processed here. > > > > How does that sound? > > Sounds clear to me, thanks! Melanie, what do you think? Yes, sounds clear to me also! - Melanie
Re: [PATCH] Support % wildcard in extension upgrade filenames
On Tue, Jan 10, 2023 at 6:50 PM Tom Lane wrote: > The script-file-per-upgrade-path aspect solves a problem that you > have, whether you admit it or not; I think you simply aren't realizing > that because you have not had to deal with the consequences of > your proposed feature. Namely that you won't have any control > over what the backend will try to do in terms of upgrade paths. > > As an example, suppose that a database has foo 4.0 installed, and > the DBA decides to try to downgrade to 3.0. With the system as it > stands, if you've provided foo--4.0--3.0.sql then the conversion > will go through, and presumably it will work because you tested that > that script does what it is intended to. If you haven't provided > any such downgrade script, then ALTER EXTENSION UPDATE will say > "Sorry Dave, I'm afraid I can't do that" and no harm is done. I mean, there is nothing to prevent the extension author from writing a script which ensures that the extension membership after the downgrade is exactly what it should be for version 3.0, regardless of what it was before. SQL is Turing-complete. The thing that confuses me here is why the PostGIS folks are ending up with so many files. We certainly don't have that problem with the extension that are being maintained in contrib, and I guess there is some difference in versioning practice that is making it an issue for them but not for us. I wish I understood what was going on there. But that to one side, there is clearly a problem here, and I think PostgreSQL ought to provide some infrastructure to help solve it, even if the ultimate cause of that problem is that the PostGIS folks managed their extension versions in some less-than-ideal way. I can't shake the feeling that you're just holding your breath here and hoping the problem goes away by itself, and judging by the responses, that doesn't seem like it's going to happen. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Improve WALRead() to suck data directly from WAL buffers when possible
On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote: > On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart > wrote: >> Is it possible to memcpy more than a page at a time? > > It would complicate things a lot there; the logic to figure out the > last page bytes that may or may not fit in the whole page gets > complicated. Also, the logic to verify each page's header gets > complicated. We might lose out if we memcpy all the pages at once and > start verifying each page's header in another loop. Doesn't the complicated logic you describe already exist to some extent in the patch? You are copying a page at a time, which involves calculating various addresses and byte counts. >> +elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for >> given LSN %X/%X, Timeline ID %u", >> + *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli); >> >> I definitely don't think we should put an elog() in this code path. >> Perhaps this should be guarded behind WAL_DEBUG. > > Placing it behind WAL_DEBUG doesn't help users/developers. My > intention was to let users know that the WAL read hit the buffers, > it'll help them report if any issue occurs and also help developers to > debug that issue. I still think an elog() is mighty expensive for this code path, even when it doesn't actually produce any messages. And when it does, I think it has the potential to be incredibly noisy. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: HOT chain validation in verify_heapam()
On Mon, Mar 6, 2023 at 12:36 PM Robert Haas wrote: > So it seems that we still don't have a patch where the > value of a variable called lp_valid corresponds to whether or not the > L.P. is valid. Here's a worked-over version of this patch. Changes: - I got rid of the code that sets lp_valid in funny places and instead arranged to have check_tuple_visibility() pass up the information on the XID status. That's important, because we can't casually apply operations like TransactionIdIsCommitted() to XIDs that, for all we know, might not even be in the range covered by CLOG. In such cases, we should not perform any HOT chain validation because we can't do it sensibly; the new code accomplishes this, and also reduces the number of CLOG lookups as compared with your version. - I moved most of the HOT chain checks from the loop over the predecessor[] array to the loop over the successor[] array. It didn't seem to have any value to put them in the third loop; it forces us to expend extra code to distinguish between redirects and tuples, information that we already had in the second loop. The only check that seems to make sense to do in that last loop is the one for a HOT chain that starts with a HOT tuple, which can't be done any earlier. - I realized that your patch had a guard against setting the predecessor[] when it was set already only for tuples, not for redirects. That means if a redirect pointed into the middle of a HOT chain we might not report corruption appropriately. I fixed this and reworded the associated messages a bit. - Assorted cosmetic and comment changes. I think this is easier to follow and more nearly correct, but what do you (and others) think? -- Robert Haas EDB: http://www.enterprisedb.com 0001-Implement-HOT-chain-validation-in-verify_heapam.patch Description: Binary data
Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Hi, On 2023-03-06 15:21:14 -0500, Melanie Plageman wrote: > Good point. Attached is what you suggested. I committed the transaction > before the drop table so that the statistics would be visible when we > queried pg_stat_io. Pushed, thanks for the report, analysis and fix, Tom, Horiguchi-san, Melanie. Greetings, Andres Freund
Re: HOT chain validation in verify_heapam()
> On Mar 7, 2023, at 10:16 AM, Robert Haas wrote: > > On Mon, Mar 6, 2023 at 12:36 PM Robert Haas wrote: >> So it seems that we still don't have a patch where the >> value of a variable called lp_valid corresponds to whether or not the >> L.P. is valid. > > Here's a worked-over version of this patch. Changes: > > - I got rid of the code that sets lp_valid in funny places and instead > arranged to have check_tuple_visibility() pass up the information on > the XID status. That's important, because we can't casually apply > operations like TransactionIdIsCommitted() to XIDs that, for all we > know, might not even be in the range covered by CLOG. In such cases, > we should not perform any HOT chain validation because we can't do it > sensibly; the new code accomplishes this, and also reduces the number > of CLOG lookups as compared with your version. > > - I moved most of the HOT chain checks from the loop over the > predecessor[] array to the loop over the successor[] array. It didn't > seem to have any value to put them in the third loop; it forces us to > expend extra code to distinguish between redirects and tuples, > information that we already had in the second loop. The only check > that seems to make sense to do in that last loop is the one for a HOT > chain that starts with a HOT tuple, which can't be done any earlier. > > - I realized that your patch had a guard against setting the > predecessor[] when it was set already only for tuples, not for > redirects. That means if a redirect pointed into the middle of a HOT > chain we might not report corruption appropriately. I fixed this and > reworded the associated messages a bit. > > - Assorted cosmetic and comment changes. > > I think this is easier to follow and more nearly correct, but what do > you (and others) think? Thanks, Robert. Quickly skimming over this patch, it looks like something reviewable. Your changes to t/004_verify_heapam.pl appear to be consistent with how that test was intended to function. Note that I have not tried any of this yet. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote: > On the other hand, we already have a similar problem with > recovery_min_apply_delay combined with hot_standby_feedback [1]. > So, that probably is an acceptable trade-off for the pgsql-hackers. > If you use this feature, you should be even more careful. Yes, but it's possible to turn off hot_standby_feedback so that you don't incur bloat on the primary. And you don't need to store hours or days of WAL on the primary. I'm very late to this thread, but IIUC you cannot avoid blocking VACUUM with the proposed feature. IMO the current set of trade-offs (e.g., unavoidable bloat and WAL buildup) would make this feature virtually unusable for a lot of workloads, so it's probably worth exploring an alternative approach. In any case, we probably shouldn't rush this into v16 in its current form. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
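For context, the proposal is roughly the logical-replication analogue of recovery_min_apply_delay; a sketch of the intended usage, assuming the subscription parameter keeps the name used in the patch series (min_apply_delay) and with placeholder connection/publication names:

-- Hypothetical usage per the proposed patch: hold back applying changes on
-- the subscriber by two hours.  While changes are held back, the publisher
-- must keep the WAL and xmin horizon the slot depends on, which is the
-- bloat/WAL-buildup concern raised above.
CREATE SUBSCRIPTION delayed_sub
    CONNECTION 'host=primary dbname=app'
    PUBLICATION app_pub
    WITH (min_apply_delay = '2h');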
Re: Track IO times in pg_stat_io
Hi, On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote: > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think that > > would be a good idea > > to also check that if counts are not Zero then times are not Zero. > > Yes, I think adding some validation around the relationship between > counts and timing should help prevent developers from forgetting to call > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant). > > However, I think that we cannot check that if IO counts are non-zero > that IO times are non-zero, because the user may not have > track_io_timing enabled. We can check that if IO times are not zero, IO > counts are not zero. I've done this in the attached v3. And even if track_io_timing is enabled, the timer granularity might be so low that we *still* get zeroes. I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime, > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char > relpersistence, ForkNumber forkNum, > > if (isExtend) > { > + instr_time io_start, > + io_time; > + > /* new buffers are zero-filled */ > MemSet((char *) bufBlock, 0, BLCKSZ); > + > + if (track_io_timing) > + INSTR_TIME_SET_CURRENT(io_start); > + else > + INSTR_TIME_SET_ZERO(io_start); > + I wonder if there's an argument for tracking this in the existing IO stats as well. But I guess we've lived with this for a long time... > @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, > IOObject io_object, >* When a strategy is not in use, the write can only be a "regular" > write >* of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE). >*/ > - pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE); > - > if (track_io_timing) > { > INSTR_TIME_SET_CURRENT(io_time); > INSTR_TIME_SUBTRACT(io_time, io_start); > > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); > INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); > + pgstat_count_io_time(IOOBJECT_RELATION, io_context, IOOP_WRITE, > io_time); > } I think this needs a bit of cleanup - pgstat_count_buffer_write_time(), pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We might not be able to reduce the whole duplication at this point, but at least it should be a bit more centralized. > + pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE); > pgBufferUsage.shared_blks_written++; > > /* > @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel) > > if (RelationUsesLocalBuffers(rel)) > { > + instr_time io_start, > + io_time; > + > for (i = 0; i < NLocBuffer; i++) > { > uint32 buf_state; > @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel) > > PageSetChecksumInplace(localpage, > bufHdr->tag.blockNum); > > + if (track_io_timing) > + INSTR_TIME_SET_CURRENT(io_start); > + else > + INSTR_TIME_SET_ZERO(io_start); > + > smgrwrite(RelationGetSmgr(rel), > > BufTagGetForkNum(&bufHdr->tag), > bufHdr->tag.blockNum, I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to silence the compiler warnings you can do it one level up. > @@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, > BlockNumber blockNum, > > PageSetChecksumInplace(localpage, bufHdr->tag.blockNum); > > + if (track_io_timing) > + INSTR_TIME_SET_CURRENT(io_start); > + else > + INSTR_TIME_SET_ZERO(io_start); > + > /* And write... 
*/ > smgrwrite(oreln, > BufTagGetForkNum(&bufHdr->tag), > @@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, > BlockNumber blockNum, > buf_state &= ~BM_DIRTY; > pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state); > > + if (track_io_timing) > + { > + INSTR_TIME_SET_CURRENT(io_time); > + INSTR_TIME_SUBTRACT(io_time, io_start); > + pgstat_count_io_time(IOOBJECT_TEMP_RELATION, > IOCONTEXT_NORMAL, IOOP_WRITE, io_time); > + } > + > pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, > IOOP_WRITE); > pgBufferUsage.local_blks_written++; > } Perhaps we can inste
Re: Track IO times in pg_stat_io
Thanks for taking another look! On Tue, Mar 7, 2023 at 10:52 AM Drouvot, Bertrand wrote: > On 3/6/23 5:30 PM, Melanie Plageman wrote: > > Thanks for the review! > > > > On Tue, Feb 28, 2023 at 4:49 AM Drouvot, Bertrand > > wrote: > >> On 2/26/23 5:03 PM, Melanie Plageman wrote: > >>> The timings will only be non-zero when track_io_timing is on > >> > >> That could lead to incorrect interpretation if one wants to divide the > >> timing per operations, say: > >> > >> - track_io_timing is set to on while there is already operations > >> - or set to off while it was on (and the number of operations keeps > >> growing) > >> > >> Might be worth to warn/highlight in the "track_io_timing" doc? > > > > This is a good point. I've added a note to the docs for pg_stat_io. > > Thanks! > > Now I've a second thought: what do you think about resetting the related > number > of operations and *_time fields when enabling/disabling track_io_timing? (And > mention it in the doc). > > That way it'd prevent bad interpretation (at least as far the time per > operation metrics are concerned). > > Thinking that way as we'd loose some (most?) benefits of the new *_time > columns > if one can't "trust" their related operations and/or one is not sampling > pg_stat_io frequently enough (to discard the samples > where the track_io_timing changes occur). > > But well, resetting the operations could also lead to bad interpretation > about the operations... > > Not sure about which approach I like the most yet, what do you think? Oh, this is an interesting idea. I think you are right about the synchronization issues making the statistics untrustworthy and, thus, unuseable. Building on your idea, what if we had the times be NULL instead of zero when track_io_timing is disabled? Then as you suggested, when you enable track_io_timing, it resets the IOOp counts and starts the times off at zero. However, disabling track_io_timing would only NULL out the times and not zero out the counts. We could also, as you say, log these events. - Melanie
Re: Track IO times in pg_stat_io
On 2023-03-07 13:43:28 -0500, Melanie Plageman wrote:
> > Now I have a second thought: what do you think about resetting the related
> > number of operations and *_time fields when enabling/disabling
> > track_io_timing? (And mentioning it in the doc.)
> >
> > That way it would prevent bad interpretation (at least as far as the time
> > per operation metrics are concerned).
> >
> > I'm thinking that way because we'd lose some (most?) of the benefits of
> > the new *_time columns if one can't "trust" their related operations
> > and/or one is not sampling pg_stat_io frequently enough (to discard the
> > samples where the track_io_timing changes occur).
> >
> > But then, resetting the operations could also lead to bad interpretation
> > about the operations...
> >
> > Not sure which approach I like the most yet, what do you think?
>
> Oh, this is an interesting idea. I think you are right about the
> synchronization issues making the statistics untrustworthy and, thus,
> unusable.

No, I don't think we can do that. It can be enabled on a per-session basis. I think we simply shouldn't do anything here. This is a pre-existing issue. I also think that losing stats when turning track_io_timing on/off would not be helpful.

Greetings,

Andres Freund
Re: postgres_fdw, dblink, and CREATE SUBSCRIPTION security
On Thu, Feb 9, 2023 at 4:46 PM Jacob Champion wrote: > On 2/6/23 08:22, Robert Haas wrote: > > I don't think that's quite the right concept. It seems to me that the > > client is responsible for informing the server of what the situation > > is, and the server is responsible for deciding whether to allow the > > connection. In your scenario, the client is not only communicating > > information ("here's the password I have got") but also making demands > > on the server ("DO NOT authenticate using anything else"). I like the > > first part fine, but not the second part. > > For what it's worth, making a negative demand during authentication is > pretty standard: if you visit example.com and it tells you "I need your > OS login password and Social Security Number to authenticate you," you > have the option of saying "no thanks" and closing the tab. No, that's the opposite, and exactly the point I'm trying to make. In that case, the SERVER says what it's willing to accept, and the CLIENT decides whether or not to provide that. In your proposal, the client tells the server which authentication methods to accept. > In a hypothetical world where the server presented the client with a > list of authentication options before allowing any access, this would > maybe be a little less convoluted to solve. For example, a proxy seeing > a SASL list of > > - ANONYMOUS > - EXTERNAL > > could understand that both methods allow the client to assume the > authority of the proxy itself. So if its client isn't allowed to do > that, the proxy realizes something is wrong (either it, or its target > server, has been misconfigured or is under attack), and it can close the > connection *before* the server runs login triggers. Yep, that totally makes sense to me, but I don't think it's what you proposed. > This sounds like a reasonable separation of responsibilities on the > surface, but I think it's subtly off. The entire confused-deputy problem > space revolves around the proxy being unable to correctly decide which > connections to allow unless it also knows why the connections are being > authorized. I agree. > You've constructed an example where that's not a concern: everything's > symmetrical, all proxies operate with the same authority, and internal > users are identical to external users. But the CVE that led to the > password requirement, as far as I can tell, dealt with asymmetry. The > proxy had the authority to connect locally to a user, and the clients > had the authority to connect to other machines' users, but those users > weren't the same and were not mutually trusting. Yeah, agreed. So, I think the point here is that the proxy configuration (and pg_hba.conf) need to be sufficiently powerful that each user can permit the things that make sense in their environment and block the things that don't. I don't think we're really very far apart here, but for some reason the terminology seems to be giving us some trouble. Of course, there's also the small problem of actually finding the time to do some meaningful work on this stuff, rather than just talking -- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] Support % wildcard in extension upgrade filenames
Robert Haas writes: > On Tue, Jan 10, 2023 at 6:50 PM Tom Lane wrote: >> As an example, suppose that a database has foo 4.0 installed, and >> the DBA decides to try to downgrade to 3.0. With the system as it >> stands, if you've provided foo--4.0--3.0.sql then the conversion >> will go through, and presumably it will work because you tested that >> that script does what it is intended to. If you haven't provided >> any such downgrade script, then ALTER EXTENSION UPDATE will say >> "Sorry Dave, I'm afraid I can't do that" and no harm is done. > I mean, there is nothing to prevent the extension author from writing > a script which ensures that the extension membership after the > downgrade is exactly what it should be for version 3.0, regardless of > what it was before. SQL is Turing-complete. It may be Turing-complete, but I seriously doubt that that's sufficient for the problem I'm thinking of, which is to downgrade from an extension version that you never heard of at the time the script was written. In the worst case, that version might even contain objects of types you never heard of and don't know how to drop. You can imagine various approaches to deal with that; for instance, maybe we could provide some kind of command to deal with dropping an object identified by classid and objid, which the upgrade script could scrape out of pg_depend. After writing even more code to issue those drops in dependency-aware order, you could get on with modifying the objects you do know about ... but maybe those now have properties you never heard of and don't know how to adjust. Whether this is all theoretically possible is sort of moot. What I am maintaining is that no extension author is actually going to write such a script, indeed they probably won't trouble to write any downgrade-like actions at all. Which makes the proposed design mostly a foot-gun. > But that to one side, there is clearly a problem here, and I think > PostgreSQL ought to provide some infrastructure to help solve it, even > if the ultimate cause of that problem is that the PostGIS folks > managed their extension versions in some less-than-ideal way. I can't > shake the feeling that you're just holding your breath here and hoping > the problem goes away by itself, and judging by the responses, that > doesn't seem like it's going to happen. I'm not unsympathetic to the idea of trying to support multiple upgrade paths in one script. I just don't like this particular design for that, because it requires the extension author to make promises that nobody is actually going to deliver on. regards, tom lane
RE: [PATCH] Support % wildcard in extension upgrade filenames
> The thing that confuses me here is why the PostGIS folks are ending up with
> so many files. We certainly don't have that problem with the extensions that
> are being maintained in contrib, and I guess there is some difference in
> versioning practice that is making it an issue for them but not for us. I
> wish I understood what was going on there.

The contrib files are minor versioned. PostGIS is micro versioned. So we have, for example, postgis--3.3.0--3.3.1.sql.

Also, we have 5 extensions for which we ship all micro versions, so multiply that issue by 5:

postgis
postgis_raster
postgis_topology
postgis_tiger_geocoder
postgis_sfcgal

The other wrinkle we have that I don't think postgresql's contrib has is that we support each version across multiple PostgreSQL versions. So for example our 3.3 series supports PostgreSQL 12-15 (with plans to also support 16).
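To spell out why micro versioning multiplies into so many files: the server satisfies ALTER EXTENSION ... UPDATE by chaining whatever foo--X--Y.sql scripts exist on disk, so every hop between consecutive micro versions needs its own script, for each of the five extensions (version numbers below are only illustrative):

-- Requires a chain such as postgis--3.3.0--3.3.1.sql and
-- postgis--3.3.1--3.3.2.sql to be present; the same applies to
-- postgis_raster, postgis_topology, postgis_tiger_geocoder and
-- postgis_sfcgal, hence the multiplier described above.
ALTER EXTENSION postgis UPDATE TO '3.3.2';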
Re: buildfarm + meson
Hi, On 2023-03-01 13:32:58 -0800, Andres Freund wrote: > On 2023-03-01 16:21:32 -0500, Andrew Dunstan wrote: > > Perhaps the latest version will be more to your taste. > > I'll check it out. A simple conversion from an existing config failed with: Can't use an undefined value as an ARRAY reference at /home/bf/src/pgbuildfarm-client-meson/PGBuild/Modules/TestICU.pm line 37. I disabled TestICU and was able to progress past that. ... piculet-meson:HEAD [19:12:48] setting up db cluster (C)... piculet-meson:HEAD [19:12:48] starting db (C)... piculet-meson:HEAD [19:12:48] running installcheck (C)... piculet-meson:HEAD [19:12:57] restarting db (C)... piculet-meson:HEAD [19:12:59] running meson misc installchecks (C) ... Branch: HEAD Stage delay_executionInstallCheck-C failed with status 1 The failures are like this: +ERROR: extension "dummy_index_am" is not available +DETAIL: Could not open extension control file "/home/bf/bf-build/piculet-meson/HEAD/inst/share/postgresql/extension/dummy_index_am.control": No such file or directory. +HINT: The extension must first be installed on the system where PostgreSQL is running. I assume this is in an interaction with b6a0d469cae. I think we need a install-test-modules or such that installs into the normal directory. Greetings, Andres Freund
RE: [PATCH] Support % wildcard in extension upgrade filenames
> I'm not unsympathetic to the idea of trying to support multiple upgrade paths
> in one script. I just don't like this particular design for that, because it
> requires the extension author to make promises that nobody is actually going
> to deliver on.
>
> regards, tom lane

How about the idea I mentioned, of revising the patch to read versioned upgrade paths from the control file rather than relying on said file to exist?

https://www.postgresql.org/message-id/000201d92572%247dcd8260%2479688720%24%40pcorp.us

Even better, we could have an additional control file, something like postgis--paths.control, that has separate lines to itemize those paths. It would be nice if we could allow wild-cards in that, but I could live without that if we can stop shipping 300 empty files.

Thanks,
Regina
Re: Add shared buffer hits to pg_stat_io
Hi, LGTM. The only comment I have is that a small test wouldn't hurt... Compared to the other things it should be fairly easy... Greetings, Andres Freund
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Hi, On 2023-03-07 08:22:45 +0530, Amit Kapila wrote: > On Tue, Mar 7, 2023 at 1:34 AM Andres Freund wrote: > > I think even as-is it's reasonable to just use it. The sequential scan > > approach is O(N^2), which, uh, is not good. And having an index over > > thousands > > of non-differing values will generally perform badly, not just in this > > context. > > > Yes, it is true that generally also index scan with a lot of > duplicates may not perform well but during the scan, we do costing to > ensure such cases and may prefer other index or sequence scan. Then we > have "enable_indexscan" GUC that the user can use if required. So, I > feel it is better to have a knob to disallow usage of such indexes and > the default would be to use an index, if available. It just feels like we're optimizing for an irrelevant case here. If we add GUCs for irrelevant things like this we'll explode the number of GUCs even faster than we already are. Greetings, Andres Freund
Re: Add support for unit "B" to pg_size_pretty()
On 06.03.23 09:27, David Rowley wrote: On Mon, 6 Mar 2023 at 21:13, Peter Eisentraut wrote: On 02.03.23 20:58, David Rowley wrote: I think I'd prefer to see the size_bytes_unit_alias struct have an index into size_pretty_units[] array. i.e: Ok, done that way. (I had thought about that, but I was worried that that would be too error-prone to maintain. But I suppose the tables don't change that often, and test cases would easily catch mistakes.) Patch looks pretty good. I just see a small spelling mistake in: +/* Additional unit aliases acceted by pg_size_bytes */ I also updated the documentation a bit more. I see I must have forgotten to add PB to the docs when pg_size_pretty had that unit added. I guess you added the "etc" to fix that? I'm wondering if that's the right choice. You modified the comment above size_pretty_units[] to remind us to update the docs when adding units, but the docs now say "etc", so do we need to? I'd likely have gone with just adding "PB" to the docs, that way it's pretty clear that new units need to be mentioned in the docs. Ok, I have fixed the original documentation to that effect and backpatched it. The remaining patch has been updated accordingly and committed also.
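A minimal illustration of the alias handling discussed above; per the quoted hunk, "B" becomes an accepted input alias for pg_size_bytes():

-- Both spellings should now parse to the same value.
SELECT pg_size_bytes('100 B');
SELECT pg_size_bytes('100 bytes');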
Re: buildfarm + meson
On 2023-03-07 Tu 14:37, Andres Freund wrote: Hi, On 2023-03-01 13:32:58 -0800, Andres Freund wrote: On 2023-03-01 16:21:32 -0500, Andrew Dunstan wrote: Perhaps the latest version will be more to your taste. I'll check it out. A simple conversion from an existing config failed with: Can't use an undefined value as an ARRAY reference at /home/bf/src/pgbuildfarm-client-meson/PGBuild/Modules/TestICU.pm line 37. I disabled TestICU and was able to progress past that. Pushed a fix for that. ... piculet-meson:HEAD [19:12:48] setting up db cluster (C)... piculet-meson:HEAD [19:12:48] starting db (C)... piculet-meson:HEAD [19:12:48] running installcheck (C)... piculet-meson:HEAD [19:12:57] restarting db (C)... piculet-meson:HEAD [19:12:59] running meson misc installchecks (C) ... Branch: HEAD Stage delay_executionInstallCheck-C failed with status 1 The failures are like this: +ERROR: extension "dummy_index_am" is not available +DETAIL: Could not open extension control file "/home/bf/bf-build/piculet-meson/HEAD/inst/share/postgresql/extension/dummy_index_am.control": No such file or directory. +HINT: The extension must first be installed on the system where PostgreSQL is running. I assume this is in an interaction with b6a0d469cae. I think we need a install-test-modules or such that installs into the normal directory. Exactly. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: add PROCESS_MAIN to VACUUM
On Tue, Mar 07, 2023 at 12:39:29PM -0500, Melanie Plageman wrote: > Yes, sounds clear to me also! Here is an updated patch. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 580f966499..0acc42af2b 100644 --- a/src/backend/commands/vacuum.c +++ b/src/backend/commands/vacuum.c @@ -2058,25 +2058,34 @@ vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params, bool skip_privs) save_nestlevel = NewGUCNestLevel(); /* - * Do the actual work --- either FULL or "lazy" vacuum + * If PROCESS_MAIN is set (the default), it's time to vacuum the main + * relation. Otherwise, we can skip this part. If processing the TOAST + * table is required (e.g., PROCESS_TOAST is set), we'll force PROCESS_MAIN + * to be set when we recurse to the TOAST table so that it gets processed + * here. */ - if ((params->options & VACOPT_FULL) && - (params->options & VACOPT_PROCESS_MAIN)) + if (params->options & VACOPT_PROCESS_MAIN) { - ClusterParams cluster_params = {0}; + /* + * Do the actual work --- either FULL or "lazy" vacuum + */ + if (params->options & VACOPT_FULL) + { + ClusterParams cluster_params = {0}; - /* close relation before vacuuming, but hold lock until commit */ - relation_close(rel, NoLock); - rel = NULL; + /* close relation before vacuuming, but hold lock until commit */ + relation_close(rel, NoLock); + rel = NULL; - if ((params->options & VACOPT_VERBOSE) != 0) - cluster_params.options |= CLUOPT_VERBOSE; + if ((params->options & VACOPT_VERBOSE) != 0) +cluster_params.options |= CLUOPT_VERBOSE; - /* VACUUM FULL is now a variant of CLUSTER; see cluster.c */ - cluster_rel(relid, InvalidOid, &cluster_params); + /* VACUUM FULL is now a variant of CLUSTER; see cluster.c */ + cluster_rel(relid, InvalidOid, &cluster_params); + } + else + table_relation_vacuum(rel, params, vac_strategy); } - else if (params->options & VACOPT_PROCESS_MAIN) - table_relation_vacuum(rel, params, vac_strategy); /* Roll back any GUC changes executed by index functions */ AtEOXact_GUC(false, save_nestlevel);
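As a usage-level summary of what the reworked block implements (mytable is a placeholder): turning PROCESS_MAIN off skips the main relation while the TOAST table can still be vacuumed, because PROCESS_MAIN is forced back on when recursing to TOAST.

-- Vacuum only the TOAST table of mytable.
VACUUM (PROCESS_MAIN FALSE) mytable;

-- Vacuum only the main relation, skipping its TOAST table.
VACUUM (PROCESS_TOAST FALSE) mytable;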
Re: psql: Add role's membership options to the \du+ command
On Mon, Mar 6, 2023 at 12:43 AM Pavel Luzanov wrote: > Indeed, adding ADMIN to pg_has_role looks logical. The function will show > whether one role can manage another directly or indirectly (via SET ROLE). > FWIW I've finally gotten to publishing my beta version of the Role Graph for PostgreSQL pseudo-extension I'd been talking about: https://github.com/polobo/RoleGraphForPostgreSQL The administration column basically determines all this via a recursive CTE. I'm pondering how to incorporate some of this core material into it now for both cross-validation purposes and ease-of-use. I handle EMPTY explicitly in the Role Graph but as I noted somewhere in my comments, it really shouldn't be possible to leave the database in that state. Do we need to bug Robert on this directly or do you plan to have a go at it? Adding ADMIN will lead to the question of naming other values. It is more > reasonable to have INHERIT instead of USAGE. > And it is not very clear whether (except for backward compatibility) a > separate MEMBER value is needed at all. > There is an appeal to breaking things thoroughly here and removing both MEMBER and USAGE terms while adding ADMIN, SET, INHERIT. However, I am against that. Most everyday usage of this would only likely care about SET and INHERIT even going forward - administration of roles is distinct from how those roles generally behave at runtime. Breaking the later because we improved upon the former doesn't seem defensible. Thus, while adding ADMIN makes sense, keeping MEMBER (a.k.a., SET) and USAGE (a.k.a., INHERIT) is my suggested way forward. I'll be looking over your v3 patch sometime this week, if not today. David J.
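To anchor the naming discussion, this is the existing surface being extended; MEMBER and USAGE are what pg_has_role() accepts today, while ADMIN is the proposed addition (shown commented out because it does not exist yet). Role names are placeholders.

-- Existing membership/inheritance checks:
SELECT pg_has_role('alice', 'staff', 'MEMBER') AS member,
       pg_has_role('alice', 'staff', 'USAGE')  AS usage;

-- Proposed addition discussed above (not available today):
-- SELECT pg_has_role('alice', 'staff', 'ADMIN');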
Re: Add support for unit "B" to pg_size_pretty()
On Wed, 8 Mar 2023 at 09:22, Peter Eisentraut wrote: > Ok, I have fixed the original documentation to that effect and > backpatched it. Thanks for fixing that. David
pipe_read_line for reading arbitrary strings
When skimming through pg_rewind during a small review I noticed the use of pipe_read_line for reading arbitrary data from a pipe, the mechanics of which seemed odd. Commit 5b2f4afffe6 refactored find_other_exec() and broke out pipe_read_line() as a static convenience routine for reading a single line of output to catch a version number. Many years later, commit a7e8ece41 exposed it externally in order to read a GUC from postgresql.conf using "postgres -C ..". f06b1c598 also makes use of it for reading a version string much like find_other_exec(). Funnily enough, while now used for arbitrary string reading, the variable is still "pgver". Since the function requires passing a buffer/size, and at most size - 1 bytes will be read via fgets(), there is a truncation risk when using this for reading GUCs (like how pg_rewind does, though the risk there is slim to none). If we are going to continue using this for reading $stuff from pipes, maybe we should think about presenting a nicer API which removes that risk? Returning an allocated buffer which contains all the output, along the lines of the recent pg_get_line work, seems a lot nicer and safer IMO. The attached POC diff replaces fgets() with pg_get_line(), which may not be an OK way to cross the streams (it's clearly not a great fit), but as a POC it provided a neater interface for reading one-off lines from a pipe IMO. Does anyone else think this is worth fixing before too many callsites use it, or is this another case of my fear of silent subtle truncation bugs? =) -- Daniel Gustafsson pipe_read_line.diff Description: Binary data
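To make the proposed interface shape concrete, here is a rough standalone sketch of returning an allocated buffer for one line of pipe output. It is written against plain POSIX popen()/getline() rather than the actual libpq/common routines, and the function name and error handling are made up purely for illustration:

#define _POSIX_C_SOURCE 200809L

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Hypothetical sketch: read one line of output from a command and return it
 * in a buffer allocated to whatever size is needed, so no caller-supplied
 * size can silently truncate the result.
 */
static char *
pipe_read_line_alloc(const char *cmd)
{
	FILE	   *pipe;
	char	   *line = NULL;
	size_t		cap = 0;
	ssize_t		len;

	if ((pipe = popen(cmd, "r")) == NULL)
		return NULL;

	len = getline(&line, &cap, pipe);	/* getline() grows the buffer */
	pclose(pipe);

	if (len < 0)
	{
		free(line);				/* getline() may have allocated anyway */
		return NULL;
	}
	if (len > 0 && line[len - 1] == '\n')
		line[len - 1] = '\0';	/* strip the trailing newline, as today */
	return line;
}

int
main(void)
{
	char	   *str = pipe_read_line_alloc("postgres --version");

	if (str != NULL)
	{
		printf("read: %s\n", str);
		free(str);
	}
	return 0;
}

Because getline() grows the buffer itself, there is no caller-supplied size to get wrong, which is essentially the property the pg_get_line-based POC is after.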
PGDOCS - Replica Identity quotes
Hi, Here are some trivial quote changes to a paragraph describing REPLICA IDENTITY. These changes were previously made in another ongoing R.I. patch v28-0001 [1], but it was decided that since they are not strictly related to that patch they should be done separately. == logical-replication.sgml Section 31.1 Publication A published table must have a “replica identity” configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side. By default, this is the primary key, if there is one. Another unique index (with certain additional requirements) can also be set to be the replica identity. If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. This, however, is very inefficient and should only be used as a fallback if no other solution is possible. If a replica identity other than “full” is set on the publisher side, a replica identity comprising the same or fewer columns must also be set on the subscriber side. See REPLICA IDENTITY for details on how to set the replica identity. If a table without a replica identity is added to a publication that replicates UPDATE or DELETE operations then subsequent UPDATE or DELETE operations will cause an error on the publisher. INSERT operations can proceed regardless of any replica identity. ~~ Suggested changes: 1. The quoted "replica identity" should not be quoted -- this is the first time this term is used on this page, so I think it should be using the firstterm SGML tag, just the same as how publication looks at the top of this section. 2. The quoted "full" should also not be quoted. Replica identities are not specified using the text string "full" - they are specified as FULL (see [2]), so IMO these instances should be changed to FULL to eliminate that ambiguity. ~~~ PSA patch v1 which implements the above changes. -- [1] https://www.postgresql.org/message-id/CAA4eK1J8R-qS97cu27sF2%3DqzjhuQNkv%2BZvgaTzFv7rs-LA4c2w%40mail.gmail.com [2] https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY Kind Regards, Peter Smith. Fujitsu Australia v1-0001-PGDOCS-replica-identity-quotes.patch Description: Binary data
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Mon, Mar 6, 2023 at 7:40 PM Amit Kapila wrote: > > On Mon, Mar 6, 2023 at 1:40 PM Peter Smith wrote: > > ... > > > > Anyhow, if you feel those firstterm and FULL changes ought to be kept > > separate from this RI patch, please let me know and I will propose > > those changes in a new thread, > > > > Personally, I would prefer to keep those separate. So, feel free to > propose them in a new thread. > Done. Those suggested pg docs changes now have their own new thread [1]. -- [1] RI quotes - https://www.postgresql.org/message-id/CAHut%2BPst11ac2hcmePt1%3DoTmBwTT%3DDAssRR1nsdoy4BT%2B68%3DMg%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
On Tue, Mar 7, 2023 at 7:26 PM Pavel Borisov wrote: > On Tue, 7 Mar 2023, 10:10 Alexander Korotkov, wrote: >> I don't know what exactly Pavel meant, but average overall numbers for >> low concurrency are. >> master: 420401 (stddev of average 233) >> patchset v11: 420111 (stddev of average 199) >> The difference is less than 0.1% and that is very safely within the error. > > > Yes, the only thing that I meant is that for low-concurrency case the results > between patch and master are within the difference between repeated series of > measurements. So I concluded that the test can not prove any difference > between patch and master. > > I haven't meant or written there is some performance degradation. > > Alexander, I suppose did an extra step and calculated overall average and > stddev, from raw data provided. Thanks! Pavel, thank you for verifying this. Could you, please, rerun performance benchmarks for the v13? It introduces LazyTupleTableSlot, which shouldn't do any measurable impact on performance. But still. -- Regards, Alexander Korotkov
optimize several list functions with SIMD intrinsics
I noticed that several of the List functions do simple linear searches that can be optimized with SIMD intrinsics (as was done for XidInMVCCSnapshot in 37a6e5d). The following table shows the time spent iterating over a list of n elements (via list_member_int) one billion times on my x86 laptop. n | head (ms) | patched (ms) --+---+-- 2 | 3884 | 3001 4 | 5506 | 4092 8 | 6209 | 3026 16 | 8797 | 4458 32 | 25051 | 7032 64 | 37611 |12763 128 | 61886 |22770 256 |70 |59885 512 |209612 | 103378 1024 |407462 | 189484 I've attached a work-in-progress patch that implements these optimizations for both x86 and arm, and I will register this in the July commitfest. I'm posting this a little early in order to gauge interest. Presumably you shouldn't be using a List if you have many elements that must be routinely searched, but it might be nice to speed up these functions, anyway. This was mostly a fun weekend project, and I don't presently have any concrete examples of workloads where this might help. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 4e04f84766d98f9ba6bb6fdd03bcb431c8aad1d3 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Sat, 4 Mar 2023 23:16:07 -0800 Subject: [PATCH v1 1/1] speed up several list functions with SIMD --- src/backend/nodes/list.c| 254 +++- src/include/port/pg_lfind.h | 189 +++ src/include/port/simd.h | 103 +++ 3 files changed, 513 insertions(+), 33 deletions(-) diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index a709d23ef1..acc56dddb7 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -19,6 +19,7 @@ #include "nodes/pg_list.h" #include "port/pg_bitutils.h" +#include "port/pg_lfind.h" #include "utils/memdebug.h" #include "utils/memutils.h" @@ -680,11 +681,25 @@ list_member(const List *list, const void *datum) bool list_member_ptr(const List *list, const void *datum) { +#ifdef USE_NO_SIMD const ListCell *cell; +#endif Assert(IsPointerList(list)); check_list_invariants(list); +#ifndef USE_NO_SIMD + + Assert(sizeof(ListCell) == 8); + Assert(sizeof(void *) == 8); + + if (list == NIL) + return false; + + return pg_lfind64((uint64) datum, (uint64 *) list->elements, list->length); + +#else + foreach(cell, list) { if (lfirst(cell) == datum) @@ -692,46 +707,121 @@ list_member_ptr(const List *list, const void *datum) } return false; + +#endif } /* - * Return true iff the integer 'datum' is a member of the list. + * Optimized linear search routine (using SIMD intrinsics where available) for + * lists with inline 32-bit data. */ -bool -list_member_int(const List *list, int datum) +static inline bool +list_member_inline_internal(const List *list, uint32 datum) { + uint32 i = 0; const ListCell *cell; - Assert(IsIntegerList(list)); - check_list_invariants(list); +#ifndef USE_NO_SIMD - foreach(cell, list) + /* + * For better instruction-level parallelism, each loop iteration operates + * on a block of four registers. 
+ */ + const Vector32 keys = vector32_broadcast(datum); /* load copies of key */ + const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32); + const uint32 nelem_per_iteration = 4 * nelem_per_vector; + const Vector32 mask = (Vector32) vector64_broadcast(UINT64CONST(0x)); + const uint32 *elements = (const uint32 *) list->elements; + + /* round down to multiple of elements per iteration */ + const uint32 tail_idx = (list->length * 2) & ~(nelem_per_iteration - 1); + + /* + * The SIMD-optimized portion of this routine is written with the + * expectation that the 32-bit datum we are searching for only takes up + * half of a ListCell. If that changes, this routine must change, too. + */ + Assert(sizeof(ListCell) == 8); + + for (i = 0; i < tail_idx; i += nelem_per_iteration) + { + Vector32 vals1, + vals2, + vals3, + vals4, + result1, + result2, + result3, + result4, + tmp1, + tmp2, + result, + masked; + + /* load the next block into 4 registers */ + vector32_load(&vals1, &elements[i]); + vector32_load(&vals2, &elements[i + nelem_per_vector]); + vector32_load(&vals3, &elements[i + nelem_per_vector * 2]); + vector32_load(&vals4, &elements[i + nelem_per_vector * 3]); + + /* compare each value to the key */ + result1 = vector32_eq(keys, vals1); + result2 = vector32_eq(keys, vals2); + result3 = vector32_eq(keys, vals3); + result4 = vector32_eq(keys, vals4); + + /* combine the results into a single variable */ + tmp1 = vector32_or(result1, result2); + tmp2 = vector32_or(result3, result4); + result = vector32_or(tmp1, tmp2); + + /* filter out matches in space between data */ + masked = vector32_and(re
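As a rough illustration of the blocked-compare idea described above — and not the patch itself, which goes through simd.h and additionally has to mask off the unused half of each ListCell — a plain SSE2 search over a simple uint32 array might look like the following sketch (x86-64 assumed):

/*
 * Illustration only: blocked SIMD linear search over a plain uint32 array,
 * written directly against SSE2 intrinsics.  Four 4-element registers are
 * compared per iteration and the results OR'd together, mirroring the
 * structure of the patch without its ListCell masking.
 */
#include <emmintrin.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool
contains_u32(const uint32_t *vals, size_t n, uint32_t key)
{
	const __m128i keys = _mm_set1_epi32((int) key); /* broadcast the key */
	size_t		i = 0;

	/* process 4 x 4 = 16 elements per iteration */
	for (; i + 16 <= n; i += 16)
	{
		__m128i		v1 = _mm_loadu_si128((const __m128i *) &vals[i]);
		__m128i		v2 = _mm_loadu_si128((const __m128i *) &vals[i + 4]);
		__m128i		v3 = _mm_loadu_si128((const __m128i *) &vals[i + 8]);
		__m128i		v4 = _mm_loadu_si128((const __m128i *) &vals[i + 12]);
		__m128i		r1 = _mm_cmpeq_epi32(v1, keys);
		__m128i		r2 = _mm_cmpeq_epi32(v2, keys);
		__m128i		r3 = _mm_cmpeq_epi32(v3, keys);
		__m128i		r4 = _mm_cmpeq_epi32(v4, keys);
		__m128i		any = _mm_or_si128(_mm_or_si128(r1, r2),
									   _mm_or_si128(r3, r4));

		if (_mm_movemask_epi8(any) != 0)
			return true;		/* at least one lane compared equal */
	}

	/* scalar tail for whatever is left over */
	for (; i < n; i++)
	{
		if (vals[i] == key)
			return true;
	}
	return false;
}

int
main(void)
{
	uint32_t	a[100];

	for (uint32_t j = 0; j < 100; j++)
		a[j] = j * 3;
	return contains_u32(a, 100, 63) ? 0 : 1;	/* 63 = 21 * 3, so found */
}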
Re: optimize several list functions with SIMD intrinsics
On Wed, 8 Mar 2023 at 13:25, Nathan Bossart wrote: > I've attached a work-in-progress patch that implements these optimizations > for both x86 and arm, and I will register this in the July commitfest. I'm > posting this a little early in order to gauge interest. Interesting and quite impressive performance numbers. From having a quick glance at the patch, it looks like you'll need to take some extra time to make it work on 32-bit builds. David
Re: add PROCESS_MAIN to VACUUM
On Tue, Mar 07, 2023 at 12:55:08PM -0800, Nathan Bossart wrote: > On Tue, Mar 07, 2023 at 12:39:29PM -0500, Melanie Plageman wrote: >> Yes, sounds clear to me also! > > Here is an updated patch. Fine by me, so done. (I have cut a few words from the comment, without changing its meaning.) -- Michael signature.asc Description: PGP signature
Re: shoud be get_extension_schema visible?
On Mon, Mar 06, 2023 at 04:44:59PM +0900, Michael Paquier wrote: > I can see why you'd want that, so OK from here to provide this routine > for external consumption. Let's first wait a bit and see if others > have any kind of objections or comments. Done this one as of e20b1ea. -- Michael signature.asc Description: PGP signature
Re: POC: Lock updated tuples in tuple_update() and tuple_delete()
Hi, On 2023-03-07 04:45:32 +0300, Alexander Korotkov wrote: > The second patch now implements a concept of LazyTupleTableSlot, a slot > which gets allocated only when needed. Also, there is more minor > refactoring and more comments. This patch already is pretty big for what it actually improves. Introducing even infrastructure to get a not that big win, in a not particularly interesting, extreme, workload... What is motivating this? Greetings, Andres Freund
Re: buildfarm + meson
Hi, On 2023-03-07 15:47:54 -0500, Andrew Dunstan wrote: > On 2023-03-07 Tu 14:37, Andres Freund wrote: > > The failures are like this: > > > > +ERROR: extension "dummy_index_am" is not available > > +DETAIL: Could not open extension control file > > "/home/bf/bf-build/piculet-meson/HEAD/inst/share/postgresql/extension/dummy_index_am.control": > > No such file or directory. > > +HINT: The extension must first be installed on the system where > > PostgreSQL is running. > > > > I assume this is in an interaction with b6a0d469cae. > > > > > > I think we need a install-test-modules or such that installs into the normal > > directory. > > > > Exactly. Here's a prototype for that. It adds an install-test-files target, Because we want to install into a normal directory, I removed the necessary munging of the target paths from meson.build and moved it into install-test-files. I also added DESTDIR support, so that installing can redirect the directory if desired. That's used for the tmp_install/ installation now. I didn't like the number of arguments necessary for install_test_files, so I changed it to use --install target list of files which makes it easier to use for further directories, if/when we need them. Greetings, Andres Freund >From fe6c77903696ccf03b618fb6f91119a12bea4a2f Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 7 Mar 2023 16:14:18 -0800 Subject: [PATCH v1] meson: Add target for installing test files & improve install_test_files --- meson.build | 32 ++-- src/tools/install_test_files | 25 +++-- 2 files changed, 33 insertions(+), 24 deletions(-) diff --git a/meson.build b/meson.build index 87cb974ad7c..6ce24207fb5 100644 --- a/meson.build +++ b/meson.build @@ -2830,6 +2830,22 @@ generated_sources_ac += {'': ['GNUmakefile']} testprep_targets += test_install_libs +# command to install files used for tests, which aren't installed by default +install_test_files = files('src/tools/install_test_files') +install_test_files_args = [ + install_test_files, + '--prefix', dir_prefix, + '--install', contrib_data_dir, test_install_data, + '--install', dir_lib_pkg, test_install_libs, +] + +# Target installing files required for installcheck of various modules +run_target('install-test-files', + command: [python] + install_test_files_args, + depends: testprep_targets, +) + + # If there are any files in the source directory that we also generate in the # build directory, they might get preferred over the newly generated files, # e.g. 
because of a #include "file", which always will search in the current @@ -2922,21 +2938,9 @@ test('tmp_install', is_parallel: false, suite: ['setup']) -# get full paths of test_install_libs to copy them -test_install_libs_fp = [] -foreach lib: test_install_libs - test_install_libs_fp += lib.full_path() -endforeach - -install_test_files = files('src/tools/install_test_files') test('install_test_files', -python, args: [ - install_test_files, - '--datadir', test_install_location / contrib_data_args['install_dir'], - '--libdir', test_install_location / dir_lib_pkg, - '--install-data', test_install_data, - '--install-libs', test_install_libs_fp, -], +python, +args: install_test_files_args + ['--destdir', test_install_destdir], priority: setup_tests_priority, is_parallel: false, suite: ['setup']) diff --git a/src/tools/install_test_files b/src/tools/install_test_files index e6ecdae10f8..8e0b36a74d1 100644 --- a/src/tools/install_test_files +++ b/src/tools/install_test_files @@ -6,23 +6,28 @@ import argparse import shutil import os +from pathlib import PurePath parser = argparse.ArgumentParser() -parser.add_argument('--datadir', type=str) -parser.add_argument('--libdir', type=str) -parser.add_argument('--install-data', type=str, nargs='*') -parser.add_argument('--install-libs', type=str, nargs='*') +parser.add_argument('--destdir', type=str, default=os.environ.get('DESTDIR', None)) +parser.add_argument('--prefix', type=str) +parser.add_argument('--install', type=str, nargs='+', action='append') args = parser.parse_args() +def copy_files(prefix: str, destdir: str, targetdir: str, src_list: list): +if not os.path.isabs(targetdir): +targetdir = os.path.join(prefix, targetdir) -def copy_files(src_list: list, dest: str): -os.makedirs(dest, exist_ok=True) +if destdir is not None: +# copy of meson's logic for joining destdir and install paths +targetdir = str(PurePath(destdir, *PurePath(targetdir).parts[1:])) + +os.makedirs(targetdir, exist_ok=True) for src in src_list: -shutil.copy2(src, dest) +shutil.copy2(src, targetdir) - -copy_files(args.install_data, args.datadir) -copy_files(args.install_libs, args.libdir) +for installs in args.install: +copy_files(args.prefix, args.destdir, installs[0], installs[1:]) -- 2.38.0
Re: Add LZ4 compression in pg_dump
On Wed, Mar 01, 2023 at 04:52:49PM +0100, Tomas Vondra wrote: > Thanks. That seems correct to me, but I find it somewhat confusing, > because we now have > > DeflateCompressorInit vs. InitCompressorGzip > > DeflateCompressorEnd vs. EndCompressorGzip > > DeflateCompressorData - The name doesn't really say what it does (would > be better to have a verb in there, I think). > > I wonder if we can make this somehow clearer? To move things along, I updated Georgios' patch: Rename DeflateCompressorData() to DeflateCompressorCommon(); Rearrange functions to their original order allowing a cleaner diff to the prior code; Change pg_fatal() to an assertion+comment; Update the commit message and fix a few typos; > Also, InitCompressorGzip says this: > >/* > * If the caller has defined a write function, prepare the necessary > * state. Avoid initializing during the first write call, because End > * may be called without ever writing any data. > */ > if (cs->writeF) > DeflateCompressorInit(cs); > > Does it actually make sense to not have writeF defined in some cases? InitCompressor is being called for either reading or writing, either of which could be null: src/bin/pg_dump/pg_backup_custom.c: ctx->cs = AllocateCompressor(AH->compression_spec, src/bin/pg_dump/pg_backup_custom.c- NULL, src/bin/pg_dump/pg_backup_custom.c- _CustomWriteFunc); -- src/bin/pg_dump/pg_backup_custom.c: cs = AllocateCompressor(AH->compression_spec, src/bin/pg_dump/pg_backup_custom.c- _CustomReadFunc, NULL); It's confusing that the comment says "Avoid initializing...". What it really means is "Initialize eagerly...". But that makes more sense in the context of the commit message for this bugfix than in a comment. So I changed that too. + /* If deflation was initialized, finalize it */ + if (cs->private_data) + DeflateCompressorEnd(AH, cs); Maybe it'd be more clear if this used "if (cs->writeF)", like in the init function ? -- Justin >From 5c027aa86e8591db140093c48a58aafba7a6c28c Mon Sep 17 00:00:00 2001 From: Georgios Kokolatos Date: Wed, 1 Mar 2023 12:42:32 + Subject: [PATCH] pg_dump: fix gzip compression of empty data When creating dumps with the Compressor API, it is possible to only call the Allocate and End compressor functions without ever writing any data. Since e9960732a, the gzip implementation wrongly assumed that the write function would always be called and deferred the initialization of the internal compression system until the first write call. The problem with that approach is that the End call would not finalize the internal compression system if it hadn't been initialized. This commit initializes the internal compression system during the Allocate call, whenever a write function was provided by the caller. Given that decompression does not need to keep track of any state, the compressor's private_data member is now populated only during compression. In passing, rearrange the functions to their original order, to allow usefully comparing with the previous code, like: git diff --diff-algorithm=minimal -w e9960732a~:src/bin/pg_dump/compress_io.c src/bin/pg_dump/compress_gzip.c Also replace an unreachable pg_fatal() with an assert+comment. I (Justin) argued that the new fatal shouldn't have been introduced in a refactoring commit, so this is a compromise. Report and initial patch by Justin Pryzby, test case by Georgios Kokolatos. 
https://www.postgresql.org/message-id/20230228235834.GC30529%40telsasoft.com --- src/bin/pg_dump/compress_gzip.c | 137 ++- src/bin/pg_dump/t/002_pg_dump.pl | 23 ++ 2 files changed, 101 insertions(+), 59 deletions(-) diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c index 0af65afeb4e..3c9ac55c266 100644 --- a/src/bin/pg_dump/compress_gzip.c +++ b/src/bin/pg_dump/compress_gzip.c @@ -32,9 +32,75 @@ typedef struct GzipCompressorState size_t outsize; } GzipCompressorState; + /* Private routines that support gzip compressed data I/O */ +static void DeflateCompressorInit(CompressorState *cs); +static void DeflateCompressorEnd(ArchiveHandle *AH, CompressorState *cs); +static void DeflateCompressorCommon(ArchiveHandle *AH, CompressorState *cs, + bool flush); +static void EndCompressorGzip(ArchiveHandle *AH, CompressorState *cs); +static void WriteDataToArchiveGzip(ArchiveHandle *AH, CompressorState *cs, + const void *data, size_t dLen);
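For readers not steeped in zlib, the shape of the bug is easiest to see in a standalone sketch: if the deflate stream is initialized eagerly, finishing it with zero bytes of input still yields a valid (empty) gzip stream, whereas an End call without a prior Init has nothing to finalize. The calls below are plain zlib, not pg_dump's wrappers:

/*
 * Standalone zlib sketch: initialize the compressor up front, feed it no
 * input at all, and finish.  The result is still a valid, empty gzip
 * stream.  If deflateInit2() were deferred until the first write -- which
 * never happens here -- the finalization step would have nothing to work on.
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(void)
{
	z_stream	zs;
	unsigned char out[64];

	memset(&zs, 0, sizeof(zs));
	/* windowBits 15 + 16 asks zlib to emit a gzip header and trailer */
	if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
					 15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
		return 1;

	zs.next_in = NULL;			/* no input at all */
	zs.avail_in = 0;
	zs.next_out = out;
	zs.avail_out = sizeof(out);
	if (deflate(&zs, Z_FINISH) != Z_STREAM_END)
		return 1;

	printf("empty gzip stream: %lu bytes\n",
		   (unsigned long) (sizeof(out) - zs.avail_out));
	return (deflateEnd(&zs) == Z_OK) ? 0 : 1;
}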
Re: buildfarm + meson
Hi, On 2023-02-23 06:27:23 -0500, Andrew Dunstan wrote: > Yeah. For touch I think we can probably just get rid of this line in the > root meson.build: > > touch = find_program('touch', native: true) Yep. > For cp there doesn't seem to be a formal requirement, but there is a recipe > in src/common/unicode/meson.build that uses it, maybe that's what caused the > failure. On Windows/msvc we could just use copy instead, I think. I don't know about using copy, it's very easy to get into trouble due to interpreting forward slashes as options etc. I propose that for now we just don't support update-unicode if cp isn't available - just as already not available when wget isn't available. Planning to apply something like the attached soon, unless somebody opposes that plan. Other unix tools we have a hard requirement on right now: - sed - would be pretty easy to replace with something else - tar, gzip - just for tests I'm not sure it's worth working on not requiring those. There's also flex, bison, perl, but those will stay a hard requirement for a while longer... :) Greetings, Andres Freund >From 564092bfb4c108c387cac3562a7dbad8c3126fea Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Tue, 7 Mar 2023 18:24:18 -0800 Subject: [PATCH] meson: don't require 'touch' binary, make use of 'cp' optional --- meson.build| 2 +- src/common/unicode/meson.build | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/meson.build b/meson.build index 87cb974ad7c..f1ce4cb8e03 100644 --- a/meson.build +++ b/meson.build @@ -333,11 +333,11 @@ prove = find_program(get_option('PROVE'), native: true, required: false) tar = find_program(get_option('TAR'), native: true) gzip = find_program(get_option('GZIP'), native: true) program_lz4 = find_program(get_option('LZ4'), native: true, required: false) -touch = find_program('touch', native: true) openssl = find_program(get_option('OPENSSL'), native: true, required: false) program_zstd = find_program(get_option('ZSTD'), native: true, required: false) dtrace = find_program(get_option('DTRACE'), native: true, required: get_option('dtrace')) missing = find_program('config/missing', native: true) +cp = find_program('cp', required: false, native: true) bison_flags = [] if bison.found() diff --git a/src/common/unicode/meson.build b/src/common/unicode/meson.build index 63c81664de3..1ffece1550a 100644 --- a/src/common/unicode/meson.build +++ b/src/common/unicode/meson.build @@ -5,7 +5,7 @@ UNICODE_VERSION = '15.0.0' unicode_data = {} unicode_baseurl = 'https://www.unicode.org/Public/@0@/ucd/@1@' -if not wget.found() +if not wget.found() or not cp.found() subdir_done() endif @@ -100,7 +100,7 @@ update_unicode = custom_target('update-unicode', depends: update_unicode_dep, output: ['dont-exist'], input: update_unicode_targets, - command: ['cp', '@INPUT@', '@SOURCE_ROOT@/src/include/common/'], + command: [cp, '@INPUT@', '@SOURCE_ROOT@/src/include/common/'], build_by_default: false, build_always_stale: true, ) -- 2.38.0
Re: psql: Add role's membership options to the \du+ command
On Tue, Mar 7, 2023 at 2:02 PM David G. Johnston wrote: > > I'll be looking over your v3 patch sometime this week, if not today. > > Moving the goal posts for this meta-command to >= 9.5 seems like it should be done as a separate patch and thread. The documentation presently states we are targeting 9.2 and newer. My suggestion for the docs is below. I find saying "additional information is shown...currently this adds the comment". Repeating that "+" means (show more) everywhere seems excessive, just state what those "more" things are. I consider \dFp and \dl to be good examples in this regard. I also think that "Wall of text" doesn't serve us well. See \dP for permission to use paragraphs. I didn't modify \du to match; keeping those in sync (as opposed to having \du just say "see \dg") seems acceptable. You had the direction of membership wrong in your copy: "For each membership in the role" describes the reverse of "Member of" which is what the column is. The actual format template is constructed properly. --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1727,15 +1727,18 @@ INSERT INTO tbl1 VALUES ($1, $2) \bind 'first value' 'second value' \g S modifier to include system roles. If pattern is specified, only those roles whose names match the pattern are listed. -For each membership in the role, the membership options and -the role that granted the membership are displayed. -Оne-letter abbreviations are used for membership options: -a — admin option, i — inherit option, -s — set option and empty if no one is set. -See GRANT command for their meaning. -If the form \dg+ is used, additional information -is shown about each role; currently this adds the comment for each -role. + + +Shown within each row, in newline-separated format, are the memberships granted to +the role. The presentation includes both the name of the grantor +as well as the membership permissions (in an abbreviated format: +a for admin option, i for inherit option, +s for set option.) The word empty is printed in +the case that none of those permissions are granted. +See the GRANT command for their meaning. + + +If the form \dg+ is used the comment attached to the role is shown. I would suggest tweaking the test output to include regress_du_admin and also to make regress_du_admin a CREATEROLE role with LOGIN. I'll need to update the Role Graph View to add the spaces and swap the order of the "s" and "i" symbols. David J.
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 9:47 PM Önder Kalacı wrote: > > I'm attaching v35. > I noticed that if the index column only exists on the subscriber side, this index can also be chosen. This seems a bit odd because the index column isn't sent from the publisher. e.g. -- pub CREATE TABLE test_replica_id_full (x int, y int); ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL; CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full; -- sub CREATE TABLE test_replica_id_full (x int, y int, z int); CREATE INDEX test_replica_id_full_idx ON test_replica_id_full(z); CREATE SUBSCRIPTION tap_sub_rep_full_0 CONNECTION 'dbname=postgres port=5432' PUBLICATION tap_pub_rep_full; I didn't see the behavior change in any of the cases I tried after applying the patch, which looks good. Besides, I tested the performance for such a case. Steps: 1. create tables, index, publication, and subscription -- pub create table tbl (a int); alter table tbl replica identity full; create publication pub for table tbl; -- sub create table tbl (a int, b int); create index idx_b on tbl(b); create subscription sub connection 'dbname=postgres port=5432' publication pub; 2. setup synchronous replication 3. execute SQL: truncate tbl; insert into tbl select i from generate_series(0,1)i; update tbl set a=a+1; The time of UPDATE (take the average of 10 runs): master: 1356.06 ms patched: 3968.14 ms For the case where all values of the extra columns on the subscriber are NULL, an index scan can't do better than a sequential scan. This is not a realistic scenario, and I think it only degrades when there are many NULL values in the index column, so this is probably not a case to worry about. I am just sharing this case so that we can discuss whether we should pick an index which only contains the extra columns on the subscriber. Regards, Shi Yu
RE: Rework LogicalOutputPluginWriterUpdateProgress
On Tue, Mar 7, 2023 15:55 PM Kuroda, Hayato/黒田 隼人 wrote: > Dear Wang, > > Thank you for updating the patch! Followings are my comments. Thanks for your comments. > --- > 01. missing comments > You might miss the comment from Peter[1]. Or could you pin the related one? Since I think the functions WalSndPrepareWrite and WalSndWriteData have similar parameters and the HEAD has no related comments, I'm not sure whether we should add them in this patch, or in a separate patch to comment atop these callback functions or where they are called. > --- > 02. LogicalDecodingProcessRecord() > > Don't we have to call UpdateDecodingProgressAndKeepalive() when there is no > decoding function? Assuming that the timeout parameter does not have enough > time > period and there are so many sequential operations in the transaction. At that > time > there may be a possibility that timeout is occurred while calling > ReorderBufferProcessXid() > several times. It may be a bad example, but I meant to say that we may have > to > consider the case that decoding function has not implemented yet. I think it's ok in this function. If the decoding function has not been implemented for a record, I think we quickly return to the loop in the function WalSndLoop, where it will try to send the keepalive message. BTW, in the previous discussion [1], we decided to ignore some paths, because the gain from modifying them may not be so great. > --- > 03. stream_*_cb_wrapper > > Only stream_*_cb_wrapper have comments "don't call update progress, we > didn't really make any", but > there are more functions that does not send updates. Do you have any reasons > why only they have? Added this comment to more functions. I think the following six functions don't call the function UpdateProgressAndKeepalive in v5 patch: - begin_cb_wrapper - begin_prepare_cb_wrapper - startup_cb_wrapper - shutdown_cb_wrapper - filter_prepare_cb_wrapper - filter_by_origin_cb_wrapper I think the comment you mentioned means that no new progress needs to be updated in this *_cb_wrapper. Also, I think we don't need to update the progress at the beginning of a transaction, just like in HEAD. So, I added the same comment only in the 4 functions below: - startup_cb_wrapper - shutdown_cb_wrapper - filter_prepare_cb_wrapper - filter_by_origin_cb_wrapper Attach the new patch. [1] - https://www.postgresql.org/message-id/20230213180302.u5sqosteflr3zkiz%40awork3.anarazel.de Regards, Wang wei v6-0001-Rework-LogicalOutputPluginWriterUpdateProgress.patch Description: v6-0001-Rework-LogicalOutputPluginWriterUpdateProgress.patch
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tue, Mar 7, 2023 at 7:17 PM Önder Kalacı wrote: > >> >> > > Let me give an example to demonstrate why I thought something is fishy >> > > here: >> > > >> > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. >> > > Imagine the same rel has a PRIMARY KEY with Oid=. >> > > > > > Hmm, alright, this is syntactically possible, but not sure if any user > would do this. Still thanks for catching this. > > And, you are right, if a user has created such a schema, > IdxIsRelationIdentityOrPK() > would return the wrong result and we'd use sequential scan instead of index > scan. > This would be a regression. I think we should change the function. > > > Here is the example: > DROP TABLE tab1; > CREATE TABLE tab1 (a int NOT NULL); > CREATE UNIQUE INDEX replica_unique ON tab1(a); > ALTER TABLE tab1 REPLICA IDENTITY USING INDEX replica_unique; > ALTER TABLE tab1 ADD CONSTRAINT pkey PRIMARY KEY (a); > You have not given complete steps to reproduce the problem where instead of the index scan, a sequential scan would be picked. I have tried to reproduce by extending your steps but didn't see the problem. Let me know if I am missing something. Publisher postgres=# CREATE TABLE tab1 (a int NOT NULL); CREATE TABLE postgres=# Alter Table tab1 replica identity full; ALTER TABLE postgres=# create publication pub2 for table tab1; CREATE PUBLICATION postgres=# insert into tab1 values(1); INSERT 0 1 postgres=# update tab1 set a=2; Subscriber - postgres=# CREATE TABLE tab1 (a int NOT NULL); CREATE TABLE postgres=# CREATE UNIQUE INDEX replica_unique ON tab1(a); CREATE INDEX postgres=# ALTER TABLE tab1 REPLICA IDENTITY USING INDEX replica_unique; ALTER TABLE postgres=# ALTER TABLE tab1 ADD CONSTRAINT pkey PRIMARY KEY (a); ALTER TABLE postgres=# create subscription sub2 connection 'dbname=postgres' publication pub2; NOTICE: created replication slot "sub2" on publisher CREATE SUBSCRIPTION postgres=# select * from tab1; a --- 2 (1 row) I have debugged the above example and it uses an index scan during apply without your latest change which is what I expected. AFAICS, the use of IdxIsRelationIdentityOrPK() is to decide whether we will do tuples_equal() or not during the index scan and I see it gives the correct results with the example you provided. -- With Regards, Amit Kapila.
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Here are some review comments for v35-0001 == General 1. Saying the index "should" or "should not" do this or that sounds like it is still OK but just not recommended. TO remove this ambigity IMO most of the "should" ought to be changed to "must" because IIUC this patch will simply not consider indexes which do not obey all your rules. This comment applies to a few places (search for "should") e.g.1. - Commit Message e.g.2. - /* There should always be at least one attribute for the index scan. */ e.g.3. - The function comment for FindUsableIndexForReplicaIdentityFull(Relation localrel) == doc/src/sgml/logical-replication.sgml 2. A published table must have a “replica identity” configured in order to be able to replicate UPDATE and DELETE operations, so that appropriate rows to update or delete can be identified on the subscriber side. By default, this is the primary key, if there is one. Another unique index (with certain additional requirements) can also be set to be the replica identity. If the table does not have any suitable key, then it can be set to replica identity “full”, which means the entire row becomes the key. When replica identity “full” is specified, indexes can be used on the subscriber side for searching the rows. Candidate indexes must be btree, non-partial, and have at least one column reference (i.e. cannot consist of only expressions). These restrictions on the non-unique index properties adheres some of the restrictions that are enforced for primary keys. Internally, we follow a similar approach for supporting index scans within logical replication scope. If there are no such suitable indexes, the search on the subscriber side can be very inefficient, therefore replica identity “full” should only be used as a fallback if no other solution is possible. If a replica identity other than “full” is set on the publisher side, a replica identity comprising the same or fewer columns must also be set on the subscriber side. See REPLICA IDENTITY for details on how to set the replica identity. If a table without a replica identity is added to a publication that replicates UPDATE or DELETE operations then subsequent UPDATE or DELETE operations will cause an error on the publisher. INSERT operations can proceed regardless of any replica identity. ~ "adheres some of" --> "adhere to some of" ? == src/backend/executor/execReplication.c 3. build_replindex_scan_key { Oid operator; Oid opfamily; RegProcedure regop; - int pkattno = attoff + 1; - int mainattno = indkey->values[attoff]; - Oid optype = get_opclass_input_type(opclass->values[attoff]); + int table_attno = indkey->values[index_attoff]; + Oid optype = get_opclass_input_type(opclass->values[index_attoff]); These variable declarations might look tidier if you kept all the Oids together. == src/backend/replication/logical/relation.c 4. logicalrep_partition_open + /* + * Finding a usable index is an infrequent task. It occurs when an + * operation is first performed on the relation, or after invalidation of + * the relation cache entry (such as ANALYZE or CREATE/DROP index on the + * relation). + * + * We also prefer to run this code on the oldctx such that we do not + * leak anything in the LogicalRepPartMapContext (hence + * CacheMemoryContext). + */ + entry->localindexoid = FindLogicalRepLocalIndex(partrel, remoterel) "such that" --> "so that" ? ~~~ 5. 
IsIndexUsableForReplicaIdentityFull +bool +IsIndexUsableForReplicaIdentityFull(IndexInfo *indexInfo) +{ + bool is_btree = (indexInfo->ii_Am == BTREE_AM_OID); + bool is_partial = (indexInfo->ii_Predicate != NIL); + bool is_only_on_expression = IsIndexOnlyOnExpression(indexInfo); + + if (is_btree && !is_partial && !is_only_on_expression) + { + return true; + } + + return false; +} SUGGESTION (no need for 2 returns here) return is_btree && !is_partial && !is_only_on_expression; == src/backend/replication/logical/worker.c 6. FindReplTupleInLocalRel static bool FindReplTupleInLocalRel(EState *estate, Relation localrel, LogicalRepRelation *remoterel, Oid localidxoid, TupleTableSlot *remoteslot, TupleTableSlot **localslot) { bool found; /* * Regardless of the top-level operation, we're performing a read here, so * check for SELECT privileges. */ TargetPrivilegesCheck(localrel, ACL_SELECT); *localslot = table_slot_create(localrel, &estate->es_tupleTable); Assert(OidIsValid(localidxoid) || (remoterel->replident == REPLICA_IDENTITY_FULL)); if (OidIsValid(localidxoid)) found = RelationFindReplTupleByIndex(localrel, localidxoid, LockTupleExclusive, remoteslot, *localslot); else found = RelationFindReplTupleSeq(localrel, LockTupleExclusive, remoteslot, *localslot); return found; } ~ Since that 'found' variable is not used, you might as well remove the if/else and simplify the code. SUGGESTION static bool FindReplTupleInLocalRel(EState *estate, Relation localrel, LogicalRepRelation *remoterel, Oid localidxoid, TupleTableSlot *remoteslot, TupleTableSl
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
Hi, Thomas! On 04.03.2023 00:39, Thomas Munro wrote: It seems a good topic for a separate thread patch. Would you provide a link to the thread you mentioned please? https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com Thanks! The important words there: But I fail to see how full_page_writes prevents this since it only act on writes It ensures the damage is later repaired during WAL replay. Which can only happen if the WAL contains the necessary information to do so - the full page writes. Together with the docs about restoring corrupted pg_control in the pg_resetwal utility description these words made me think about whether to save the contents of pg_control at the beginning and end of checkpoint into WAL and teach pg_resetwal to read them? It would be like a periodic backup of the pg_control to a safe place. This thought has nothing to do with this patch and this thread, and, in general, i'm not sure if it has any practical meaning, and whether, on the contrary, it may lead to some difficulties. If it seems that there is a sense, then it will be possible to think further about it. For example, see subscription/011_generated which failed like this: 2023-02-16 06:57:25.724 GMT [5736][not initialized] PANIC: could not read file "global/pg_control": Permission denied That was fixed after I changed it to also do locking in xlog.c ReadControlFile(), in the version you tested. There must be something I don't understand going on, because that cluster wasn't even running before: it had just been created by initdb. But, anyway, I have a new idea that makes that whole problem go away (though I'd still like to understand what happened there): Seems to be it's a race between the first reading of the pg_control in PostmasterMain() in LocalProcessControlFile(false) and the second one in SubPostmasterMain() here: /* * (re-)read control file, as it contains config. The postmaster will * already have read this, but this process doesn't know about that. */ LocalProcessControlFile(false); which crashes according to the crash log: crashlog-postgres.exe_19a0_2023-02-16_06-57-26-675.txt With the "pseudo-advisory lock for Windows" idea from the last email, we don't need to bother with file system level locking in many places. Just in controldata_utils.c, for FE/BE interlocking (that is, we don't need to use that for interlocking of concurrent reads and writes that are entirely in the backend, because we already have an LWLock that we could use more consistently). Changes: 1. xlog.c mostly uses no locking 2. basebackup.c now acquires ControlFileLock 3. only controldata_utils.c uses the new file locking, for FE-vs-BE interlocking 4. lock past the end (pseudo-advisory locking for Windows) Although the changes in 1. contributes to the problem described above again, but 4. fixes this. And i did not find any other places where ReadControlFile() can be called in different processes.That's all ok. Thanks to 4., now it is not necessary to use VSS to copy the pg_control file, it can be copied in a common way even during the torture test. This is very good. I really like the idea with LWLock where possible. In general, i think that these changes make the patch more lightweight and fast. Also i ran tests for more than a day in stress mode with fsync=off under windows and Linux and found no problems. Patch-tester also passes without errors. 
I would like to move this patch to RFC, since I don’t see any problems either in the code or in the tests, but the pg_backup_start() subproblem confuses me. Maybe move it to a separate patch in a distinct thread? There are a number of suggestions and questions to discuss there, such as: Note that when we recover from a basebackup or pg_backup_start() backup, we use the backup label to find a redo start location in the WAL (see read_backup_label()), BUT we still read the copied pg_control file (one that might be too "new", so we don't use its redo pointer). So it had better not be torn, or the recovery will fail. So, in this version I protected that sendFile() with ControlFileLock. But... Isn't that a bit strange? To go down this path we would also need to document the need to copy the control file with the file locked to avoid a rare failure, in the pg_backup_start() documentation. That's annoying (I don't even know how to do it with easy shell commands; maybe we should provide a program that locks and cats the file...?). The variant with a separate utility looks good, with a recommendation in the docs to use it for copying pg_control. It also seems possible to make a function available in psql, such as export_pg_control('dst_path') with the destination path as an argument, and to call it before pg_backup_stop(). Or pass the pg_control destination path to pg_backup_stop() as an extra argument. Or save pg_control to a predetermined location during pg_backup_stop() and specify in the do
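The "program that locks and cats the file" idea could, on POSIX systems at least, be roughly as small as the following sketch. It is purely illustrative: the real helper would have to follow whatever locking protocol the backend settles on, and Windows would need LockFileEx rather than fcntl():

/*
 * Sketch of a "lock and cat" helper for pg_control, POSIX-only: take a
 * shared fcntl() lock on the whole file before copying it to stdout, so a
 * writer holding the corresponding exclusive lock cannot hand us a torn
 * read.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	int			fd;
	struct flock lk = {0};		/* l_start = 0, l_len = 0: whole file */
	char		buf[8192];
	ssize_t		n;

	if (argc != 2)
	{
		fprintf(stderr, "usage: %s <path to global/pg_control>\n", argv[0]);
		return 1;
	}
	if ((fd = open(argv[1], O_RDONLY)) < 0)
		return 1;

	lk.l_type = F_RDLCK;		/* shared (read) lock */
	lk.l_whence = SEEK_SET;
	if (fcntl(fd, F_SETLKW, &lk) < 0)	/* block until any writer is done */
		return 1;

	while ((n = read(fd, buf, sizeof(buf))) > 0)
	{
		if (write(STDOUT_FILENO, buf, n) != n)
			return 1;
	}
	close(fd);					/* closing the descriptor drops the lock */
	return (n < 0) ? 1 : 0;
}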
Re: [Proposal] Add foreign-server health checks infrastructure
On Tue, 7 Mar 2023 at 09:53, Hayato Kuroda (Fujitsu) wrote: > > Dear Katsuragi-san, > > Thank you for reviewing! PSA new version patches. > > > I think we can update the status to ready for committer after > > this fix, if there is no objection. > > That's a very good news for me! How about other people? > > > >> 7. the document of postgres_fdw_verify_connection_states_all > > >> NULL > > >> + is returned if the local session does not have connection > > >> caches, > > >> or this > > >> + function is not available on this platform. > > >> > > >> I think there is a case where a connection cache exists but valid > > >> connections do not exist and NULL is returned (disconnection case). > > >> Almost the same document as the postgres_fdw_verify_connection_states > > >> case (comment no.5) seems better. > > > > > > Yes, but completely same statement cannot be used because these is not > > > specified foreign server. How about: > > > NULL is returned if there are no established connections or this > > > function ... > > > > Yes, to align with the postgres_fdw_verify_connection_states() > > case, how about writing the connection is not valid. Like the > > following? > > 'NULL is returned if no valid connections are established or > > this function...' > > Prefer yours, fixed. > > > This is my comment for v35. Please check. > > 0002: > > 1. the comment of verify_cached_connections (I missed one minor point.) > > + * This function emits warnings if a disconnection is found. This > > returns true > > + * if disconnections cannot be found, otherwise returns false. > > > > I think false is returned only if disconnections are found and > > true is returned in all other cases. So, modifying the description > > like the following seems better. > > 'This returns false if disconnections are found, otherwise > > returns true.' > > Fixed. Few comments: 1) There is no handling of forConnCheck in #else HAVE_POLL, if this is intentional we could add some comments for the same: static int -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time) +pqSocketPoll(int sock, int forRead, +int forWrite, int forConnCheck, time_t end_time) { /* We use poll(2) if available, otherwise select(2) */ #ifdef HAVE_POLL @@ -1092,7 +1133,11 @@ pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time) int timeout_ms; if (!forRead && !forWrite) - return 0; 2) Can this condition be added to the parent if condition: if (!forRead && !forWrite) - return 0; + { + /* Connection check can be available on some limted platforms */ + if (!(forConnCheck && PQconnCheckable())) + return 0; + } 3) Can we add a comment for PQconnCheckable: +/* Check whether the postgres server is still alive or not */ +extern int PQconnCheck(PGconn *conn); +extern int PQconnCheckable(void); + 4) "Note that if 0 is returned and forConnCheck is requested" doesn't sound right, it can be changed to "Note that if 0 is returned when forConnCheck is requested" + * If neither forRead, forWrite nor forConnCheck are set, immediately return a + * timeout condition (without waiting). Return >0 if condition is met, 0 if a + * timeout occurred, -1 if an error or interrupt occurred. Note that if 0 is + * returned and forConnCheck is requested, it means that the socket has not + * matched POLLRDHUP event and the connection is not closed by the socket peer. Regards, Vignesh
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Wed, Mar 8, 2023 at 3:30 AM Nathan Bossart wrote: > > On Mon, Mar 06, 2023 at 07:27:59PM +0300, Önder Kalacı wrote: > > On the other hand, we already have a similar problem with > > recovery_min_apply_delay combined with hot_standby_feedback [1]. > > So, that probably is an acceptable trade-off for the pgsql-hackers. > > If you use this feature, you should be even more careful. > > Yes, but it's possible to turn off hot_standby_feedback so that you don't > incur bloat on the primary. And you don't need to store hours or days of > WAL on the primary. Right. This side effect belongs to the combination of recovery_min_apply_delay and hot_standby_feedback/replication slot. recovery_min_apply_delay itself can be used even without this side effect if we accept other trade-offs. When it comes to this time-delayed logical replication feature, there is no choice to avoid the side effect for users who want to use this feature. > I'm very late to this thread, but IIUC you cannot > avoid blocking VACUUM with the proposed feature. Right. > IMO the current set of > trade-offs (e.g., unavoidable bloat and WAL buildup) would make this > feature virtually unusable for a lot of workloads, so it's probably worth > exploring an alternative approach. It might require more engineering effort for alternative approaches such as one I proposed but the feature could become better from the user perspective. I also think it would be worth exploring it if we've not yet. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: buildfarm + meson
On 2023-03-07 18:26:21 -0800, Andres Freund wrote: > On 2023-02-23 06:27:23 -0500, Andrew Dunstan wrote: > > Yeah. For touch I think we can probably just get rid of this line in the > > root meson.build: > > > > touch = find_program('touch', native: true) > > Yep. > > > For cp there doesn't seem to be a formal requirement, but there is a recipe > > in src/common/unicode/meson.build that uses it, maybe that's what caused the > > failure. On Windows/msvc we could just use copy instead, I think. > > I don't know about using copy, it's very easy to get into trouble due to > interpreting forward slashes as options etc. I propose that for now we just > don't support update-unicode if cp isn't available - just as already not > available when wget isn't available. > > Planning to apply something like the attached soon, unless somebody opposes > that plan. Done.
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Wed, Mar 8, 2023 at 9:09 AM Peter Smith wrote: > > > == > src/backend/executor/execReplication.c > > 3. build_replindex_scan_key > > { > Oid operator; > Oid opfamily; > RegProcedure regop; > - int pkattno = attoff + 1; > - int mainattno = indkey->values[attoff]; > - Oid optype = get_opclass_input_type(opclass->values[attoff]); > + int table_attno = indkey->values[index_attoff]; > + Oid optype = get_opclass_input_type(opclass->values[index_attoff]); > > These variable declarations might look tidier if you kept all the Oids > together. > I am not sure how much that would be an improvement over the current but that will lead to an additional churn in the patch. > == > src/backend/replication/logical/worker.c > > 6. FindReplTupleInLocalRel > > static bool > FindReplTupleInLocalRel(EState *estate, Relation localrel, > LogicalRepRelation *remoterel, > Oid localidxoid, > TupleTableSlot *remoteslot, > TupleTableSlot **localslot) > { > bool found; > > /* > * Regardless of the top-level operation, we're performing a read here, so > * check for SELECT privileges. > */ > TargetPrivilegesCheck(localrel, ACL_SELECT); > > *localslot = table_slot_create(localrel, &estate->es_tupleTable); > > Assert(OidIsValid(localidxoid) || >(remoterel->replident == REPLICA_IDENTITY_FULL)); > > if (OidIsValid(localidxoid)) > found = RelationFindReplTupleByIndex(localrel, localidxoid, > LockTupleExclusive, > remoteslot, *localslot); > else > found = RelationFindReplTupleSeq(localrel, LockTupleExclusive, > remoteslot, *localslot); > > return found; > } > > ~ > > Since that 'found' variable is not used, you might as well remove the > if/else and simplify the code. > Hmm, but that is an existing style/code, and this patch has done nothing which requires that change. Personally, I find the current style better for the readability purpose. -- With Regards, Amit Kapila.
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
On Wed, Mar 8, 2023 at 4:43 PM Anton A. Melnikov wrote: > On 04.03.2023 00:39, Thomas Munro wrote: > > Could we make better use of the safe copy that we have in the log? > > Then the pg_backup_start() subproblem would disappear. Conceptually, > > that'd be just like the way we use FPI for data pages copied during a > > backup. I'm not sure about any of that, though, it's just an idea, > > not tested. > > Sorry, i didn't understand the question about log. Would you explain me > please what kind of log did you mention and where can i look this > safe copy creation in the code? Sorry, I was confused; please ignore that part. We don't have a copy of the control file anywhere else. (Perhaps we should, but that could be a separate topic.)
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Wed, Mar 8, 2023 at 3:03 PM Amit Kapila wrote: > > On Wed, Mar 8, 2023 at 9:09 AM Peter Smith wrote: > > > > > > == > > src/backend/executor/execReplication.c > > > > 3. build_replindex_scan_key > > > > { > > Oid operator; > > Oid opfamily; > > RegProcedure regop; > > - int pkattno = attoff + 1; > > - int mainattno = indkey->values[attoff]; > > - Oid optype = get_opclass_input_type(opclass->values[attoff]); > > + int table_attno = indkey->values[index_attoff]; > > + Oid optype = get_opclass_input_type(opclass->values[index_attoff]); > > > > These variable declarations might look tidier if you kept all the Oids > > together. > > > > I am not sure how much that would be an improvement over the current > but that will lead to an additional churn in the patch. That suggestion was because IMO the 'optype' and 'opfamily' belong together. TBH, really think the assignment of 'opttype' should happen later with the 'opfamilly' assignment too because then it will be *after* the (!AttributeNumberIsValid(table_attno)) check. > > > == > > src/backend/replication/logical/worker.c > > > > 6. FindReplTupleInLocalRel > > > > static bool > > FindReplTupleInLocalRel(EState *estate, Relation localrel, > > LogicalRepRelation *remoterel, > > Oid localidxoid, > > TupleTableSlot *remoteslot, > > TupleTableSlot **localslot) > > { > > bool found; > > > > /* > > * Regardless of the top-level operation, we're performing a read here, so > > * check for SELECT privileges. > > */ > > TargetPrivilegesCheck(localrel, ACL_SELECT); > > > > *localslot = table_slot_create(localrel, &estate->es_tupleTable); > > > > Assert(OidIsValid(localidxoid) || > >(remoterel->replident == REPLICA_IDENTITY_FULL)); > > > > if (OidIsValid(localidxoid)) > > found = RelationFindReplTupleByIndex(localrel, localidxoid, > > LockTupleExclusive, > > remoteslot, *localslot); > > else > > found = RelationFindReplTupleSeq(localrel, LockTupleExclusive, > > remoteslot, *localslot); > > > > return found; > > } > > > > ~ > > > > Since that 'found' variable is not used, you might as well remove the > > if/else and simplify the code. > > > > Hmm, but that is an existing style/code, and this patch has done > nothing which requires that change. Personally, I find the current > style better for the readability purpose. > OK. I failed to notice that was same as the existing code. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada wrote: > > 1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead. > > 2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further. > > I prefer option (1) as it's straight forward. I mentioned a similar > idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is > defined. It might be worth checking if there is visible overhead of > tracking chunk memory space. IIRC we've not evaluated it yet. Ok, let's try this -- I can test and profile later this week. -- John Naylor EDB: http://www.enterprisedb.com
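For what it's worth, the "template parameter" option presumably comes down to the usual conditional-compilation pattern, roughly as in this self-contained sketch; all names here are invented (not the actual radix tree template), and a real implementation would also subtract on free:

/*
 * Illustration of making memory-usage tracking optional: the accounting
 * field, the bookkeeping, and the usage-reporting function exist only when
 * RT_TRACK_MEMORY is defined by the template's user.
 */
#include <stdio.h>
#include <stdlib.h>

#define RT_TRACK_MEMORY			/* comment out to compile the tracking away */

typedef struct rt_tree
{
	int			dummy;
#ifdef RT_TRACK_MEMORY
	size_t		total_chunk_space;
#endif
} rt_tree;

static void *
rt_alloc_chunk(rt_tree *tree, size_t size)
{
#ifdef RT_TRACK_MEMORY
	tree->total_chunk_space += size;
#else
	(void) tree;
#endif
	return malloc(size);
}

#ifdef RT_TRACK_MEMORY
static size_t
rt_memory_usage(const rt_tree *tree)
{
	return tree->total_chunk_space;
}
#endif

int
main(void)
{
	rt_tree		t = {0};

	free(rt_alloc_chunk(&t, 64));
	free(rt_alloc_chunk(&t, 128));
#ifdef RT_TRACK_MEMORY
	printf("tracked chunk space: %zu bytes\n", rt_memory_usage(&t));
#endif
	return 0;
}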
RE: [Proposal] Add foreign-server health checks infrastructure
Dear Vignesh, Thank you for reviewing! PSA new version. > > Few comments: > 1) There is no handling of forConnCheck in #else HAVE_POLL, if this is > intentional we could add some comments for the same: > static int > -pqSocketPoll(int sock, int forRead, int forWrite, time_t end_time) > +pqSocketPoll(int sock, int forRead, > +int forWrite, int forConnCheck, time_t end_time) > { > /* We use poll(2) if available, otherwise select(2) */ > #ifdef HAVE_POLL > @@ -1092,7 +1133,11 @@ pqSocketPoll(int sock, int forRead, int > forWrite, time_t end_time) > int timeout_ms; > > if (!forRead && !forWrite) > - return 0; Comments were added. This missing is intentional, because with the patch present we do not implement checking feature for kqueue(). If you want to check the detailed implementation of that, please see 0004 patch attached on [1]. > 2) Can this condition be added to the parent if condition: > if (!forRead && !forWrite) > - return 0; > + { > + /* Connection check can be available on some limted platforms > */ > + if (!(forConnCheck && PQconnCheckable())) > + return 0; > + } Changed, and added comments atop the if-statement. > 3) Can we add a comment for PQconnCheckable: > +/* Check whether the postgres server is still alive or not */ > +extern int PQconnCheck(PGconn *conn); > +extern int PQconnCheckable(void); > + Added. And I found the tab should be used to divide data type and name, so fixed. > 4) "Note that if 0 is returned and forConnCheck is requested" doesn't > sound right, it can be changed to "Note that if 0 is returned when > forConnCheck is requested" > + * If neither forRead, forWrite nor forConnCheck are set, immediately return > a > + * timeout condition (without waiting). Return >0 if condition is met, 0 if a > + * timeout occurred, -1 if an error or interrupt occurred. Note that if 0 is > + * returned and forConnCheck is requested, it means that the socket has not > + * matched POLLRDHUP event and the connection is not closed by the socket > peer. Fixed. [1]: https://www.postgresql.org/message-id/TYAPR01MB58669EAAC02493BFF9F39B06F5D99%40TYAPR01MB5866.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED v37-0001-Add-PQconnCheck-and-PQconnCheckable-to-libpq.patch Description: v37-0001-Add-PQconnCheck-and-PQconnCheckable-to-libpq.patch v37-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch Description: v37-0002-postgres_fdw-add-postgres_fdw_verify_connection_.patch v37-0003-add-test.patch Description: v37-0003-add-test.patch
Re: Testing autovacuum wraparound (including failsafe)
On Fri, Mar 3, 2023 at 8:34 PM Heikki Linnakangas wrote: > > On 16/11/2022 06:38, Ian Lawrence Barwick wrote: > > Thanks for the patch. While reviewing the patch backlog, we have determined > > that > > the latest version of this patch was submitted before meson support was > > implemented, so it should have a "meson.build" file added for consideration > > for > > inclusion in PostgreSQL 16. > > I wanted to do some XID wraparound testing again, to test the 64-bit > SLRUs patches [1], and revived this. Thank you for reviving this thread! > > I took a different approach to consuming the XIDs. Instead of setting > nextXID directly, bypassing GetNewTransactionId(), this patch introduces > a helper function to call GetNewTransactionId() repeatedly. But because > that's slow, it does include a shortcut to skip over "uninteresting" > XIDs. Whenever nextXid is close to an SLRU page boundary or XID > wraparound, it calls GetNewTransactionId(), and otherwise it bumps up > nextXid close to the next "interesting" value. That's still a lot slower > than just setting nextXid, but exercises the code more realistically. > > I've written some variant of this helper function many times over the > years, for ad hoc testing. I'd love to have it permanently in the git tree. These functions seem to be better than mine. > In addition to Masahiko's test for emergency vacuum, this includes two > other tests. 002_limits.pl tests the "warn limit" and "stop limit" in > GetNewTransactionId(), and 003_wraparound.pl burns through 10 billion > transactions in total, exercising XID wraparound in general. > Unfortunately these tests are pretty slow; the tests run for about 4 > minutes on my laptop in total, and use about 20 GB of disk space. So > perhaps these need to be put in a special test suite that's not run as > part of "check-world". Or perhaps leave out the 003_wraparounds.pl test, > that's the slowest of the tests. But I'd love to have these in the git > tree in some form. cfbot reports some failures. The main reason seems to be that meson.build in the xid_wraparound directory adds the regression tests, but the .sql and .out files are missing from the patch. Perhaps the patch intends to add only TAP tests, as the Makefile doesn't define REGRESS? Even after fixing this issue, CI tests (Cirrus CI) are not happy and report failures due to the disk being full. The size of the xid_wraparound test directory is 105MB out of 262MB: % du -sh testrun 262M testrun % du -sh testrun/xid_wraparound/ 105M testrun/xid_wraparound/ % du -sh testrun/xid_wraparound/* 460K testrun/xid_wraparound/001_emergency_vacuum 93M testrun/xid_wraparound/002_limits 12M testrun/xid_wraparound/003_wraparounds % ls -lh testrun/xid_wraparound/002_limits/log* total 93M -rw-------. 1 masahiko masahiko 93M Mar 7 17:34 002_limits_wraparound.log -rw-rw-r--. 1 masahiko masahiko 20K Mar 7 17:34 regress_log_002_limits The biggest file is the server log, since an autovacuum worker writes autovacuum logs for every table every second (autovacuum_naptime is 1s). Maybe we can set the log_autovacuum_min_duration reloption for the test tables instead of globally enabling it. The 001 test uses a 2PC transaction that holds locks on tables, but since we can consume xids while the server is running, we don't need that. Instead I think we can keep a transaction open in the background like the 002 test does. I'll try these ideas. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
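To make the per-table idea above concrete, log_autovacuum_min_duration is an ordinary storage parameter, so a sketch of scoping the extra logging to the test tables (the table name here is invented for illustration) could look like:

    -- log every autovacuum run on this table only, leaving the global
    -- log_autovacuum_min_duration setting untouched
    ALTER TABLE wraparound_tbl SET (log_autovacuum_min_duration = 0);
    -- and to drop the extra logging again afterwards:
    ALTER TABLE wraparound_tbl RESET (log_autovacuum_min_duration);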
Re: optimize several list functions with SIMD intrinsics
On Wed, Mar 08, 2023 at 01:54:15PM +1300, David Rowley wrote: > Interesting and quite impressive performance numbers. Thanks for taking a look. > From having a quick glance at the patch, it looks like you'll need to > take some extra time to make it work on 32-bit builds. At the moment, the support for SIMD intrinsics in Postgres is limited to 64-bit (simd.h has the details). But yes, if we want to make this work for 32-bit builds, additional work is required. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: shoud be get_extension_schema visible?
On Wed, Mar 8, 2023 at 2:04 AM Michael Paquier wrote: > On Mon, Mar 06, 2023 at 04:44:59PM +0900, Michael Paquier wrote: > > I can see why you'd want that, so OK from here to provide this routine > > for external consumption. Let's first wait a bit and see if others > > have any kind of objections or comments. > > Done this one as of e20b1ea. > Thank you very much Pavel > -- > Michael >
Re: Testing autovacuum wraparound (including failsafe)
On Fri, Mar 3, 2023 at 3:34 AM Heikki Linnakangas wrote: > I took a different approach to consuming the XIDs. Instead of setting > nextXID directly, bypassing GetNewTransactionId(), this patch introduces > a helper function to call GetNewTransactionId() repeatedly. But because > that's slow, it does include a shortcut to skip over "uninteresting" > XIDs. Whenever nextXid is close to an SLRU page boundary or XID > wraparound, it calls GetNewTransactionId(), and otherwise it bumps up > nextXid close to the next "interesting" value. That's still a lot slower > than just setting nextXid, but exercises the code more realistically. Surely your TAP test should be using single user mode? Perhaps you missed the obnoxious HINT that's part of the WARNING the test parses? ;-) This is a very useful patch. I certainly don't want to make life harder by (say) connecting it to the single user mode problem. But... the single user mode thing really needs to go away. It's just terrible advice, and actively harms users. -- Peter Geoghegan
RE: Allow logical replication to copy tables in binary format
Dear Melih, >> I think we should add a description to the docs that the initial copy is more likely to fail, and that the user should enable binary format after synchronization if tables have original datatypes. > > I tried to explain when binary copy can cause failures in the doc. What > exactly > do you think is missing? I assumed here that "copy_format" and "binary" were combined into one option. Currently the binary option has this description: ``` Even when this option is enabled, only data types having binary send and receive functions will be transferred in binary ``` But this is not suitable for initial data sync, as we have seen. I meant to say that it must be updated if the options are combined. Note that the following is not necessary for PG16; it is just an improvement for a newer version. Is it possible to automatically switch the binary option from 'true' to 'false' when data transfer fails? As we found, synchronizing the initial data in binary format may lead to another error, which can be solved by changing the option. When DBAs check the log after synchronization and find output like "binary option was changed and worker will restart..." or something similar, they can turn "binary" on again. Best Regards, Hayato Kuroda FUJITSU LIMITED
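As a rough sketch of the manual fallback described above (the subscription name is invented for illustration, and the quoted log message is hypothetical), the DBA-side switch would be something like:

    -- fall back to text-format copy if the binary initial sync keeps failing
    ALTER SUBSCRIPTION mysub SET (binary = false);
    -- once the initial table synchronization has finished, re-enable it
    ALTER SUBSCRIPTION mysub SET (binary = true);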
Re: Date-time extraneous fields with reserved keywords
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, passed Thank you for the response and new patch. The scope of impact is limited to 'epoch' and 'infinity'. Also, it is unlikely that these reserved words will be used in combination with time/date, so this patch is appropriate. The new status of this patch is: Ready for Committer
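For context, the reserved-word inputs in question are accepted on their own; the patch concerns inputs that append extra fields to them (the commented line is only a sketch of the kind of input affected, not the patch's exact test case):

    SELECT timestamp 'epoch';      -- 1970-01-01 00:00:00
    SELECT timestamp 'infinity';   -- infinity
    -- inputs along the lines of timestamp 'epoch 01:02:03', which combine a
    -- reserved keyword with extraneous date/time fields, are what the patch
    -- rejects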
Re: Move defaults toward ICU in 16?
On Fri, 2023-03-03 at 21:45 -0800, Jeff Davis wrote: > > > 0002: update template0 in new cluster (as described above) I think 0002 is about ready and I plan to commit it soon unless someone has more comments. I'm holding off on 0001 for now, because you objected. But I still think 0001 is a good idea so I'd like to hear more before I withdraw it. > > > 0003: default initdb to use ICU This is also about ready, and I plan to commit this soon after 0002. > A different issue: unaccent is calling t_isspace(), which is just not > properly handling locales. t_isspace() always passes NULL as the last > argument to char2wchar. There are TODO comments throughout that file. I posted a bug report and patch for this issue: https://www.postgresql.org/message-id/79e4354d9eccfdb00483146a6b9f6295202e7890.ca...@j-davis.com Regards, Jeff Davis
Re: Allow tailoring of ICU locales with custom rules
On Fri, 2023-03-03 at 13:45 +0100, Peter Eisentraut wrote: > You can mess with people by setting up your databases like this: > > initdb -D data --locale-provider=icu --icu-rules='&a < c < b < e < d' > > ;-) Would we be the first major database to support custom collation rules? This sounds useful for testing, experimentation, hacking, etc. What are some of the use cases? Is it helpful to comply with unusual or outdated standards or formats? Maybe there are people using special delimiters/terminators and they need them to be treated a certain way during comparisons? Regards, Jeff Davis
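To make the experimentation use case concrete, the same tailoring can be tried per collation rather than cluster-wide; a sketch (collation name invented, and assuming the patch's rules option for CREATE COLLATION) could be:

    CREATE COLLATION custom_rules (provider = icu, locale = 'und',
                                   rules = '&a < c < b');
    SELECT x
    FROM (VALUES ('a'), ('b'), ('c')) AS t(x)
    ORDER BY x COLLATE custom_rules;   -- sorts as a, c, b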
Re: Normalization of utility queries in pg_stat_statements
On Fri, Mar 03, 2023 at 09:37:27AM +0900, Michael Paquier wrote: > Thanks for double-checking, applied 0001 to finish this part of the > work. I am attaching the remaining bits as of the attached, combined > into a single patch. Doing so as a single patch was not feeling right as this actually fixes issues with the location calculations for the Const node, so I have split that into three commits and finally applied the whole. As a bonus, please see attached a patch to apply the normalization to CALL statements using the new automated infrastructure. OUT parameters can be passed to a procedure, hence I guess that these had better be silenced as well. This is not aimed at being integrated, just for reference. -- Michael From 7fff41b050b8da4b7a006071b51ae5fa43bc7c0b Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 8 Mar 2023 15:14:15 +0900 Subject: [PATCH] Apply normalization to CALL statements --- src/include/nodes/parsenodes.h| 6 ++-- src/backend/nodes/queryjumblefuncs.c | 12 +++ .../pg_stat_statements/expected/utility.out | 33 --- contrib/pg_stat_statements/sql/utility.sql| 9 + 4 files changed, 53 insertions(+), 7 deletions(-) diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index 259e814253..32e5f535c1 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -3227,12 +3227,14 @@ typedef struct InlineCodeBlock */ typedef struct CallStmt { + pg_node_attr(custom_query_jumble) + NodeTag type; FuncCall *funccall; /* from the parser */ /* transformed call, with only input args */ - FuncExpr *funcexpr pg_node_attr(query_jumble_ignore); + FuncExpr *funcexpr; /* transformed output-argument expressions */ - List *outargs pg_node_attr(query_jumble_ignore); + List *outargs; } CallStmt; typedef struct CallContext diff --git a/src/backend/nodes/queryjumblefuncs.c b/src/backend/nodes/queryjumblefuncs.c index d7fd72d70f..709f91ab0e 100644 --- a/src/backend/nodes/queryjumblefuncs.c +++ b/src/backend/nodes/queryjumblefuncs.c @@ -52,6 +52,7 @@ static void _jumbleNode(JumbleState *jstate, Node *node); static void _jumbleA_Const(JumbleState *jstate, Node *node); static void _jumbleList(JumbleState *jstate, Node *node); static void _jumbleRangeTblEntry(JumbleState *jstate, Node *node); +static void _jumbleCallStmt(JumbleState *jstate, Node *node); /* * Given a possibly multi-statement source string, confine our attention to the @@ -395,3 +396,14 @@ _jumbleRangeTblEntry(JumbleState *jstate, Node *node) break; } } + +static void +_jumbleCallStmt(JumbleState *jstate, Node *node) +{ + CallStmt *expr = (CallStmt *) node; + FuncExpr *func = expr->funcexpr; + + JUMBLE_FIELD_SINGLE(func->funcid); + _jumbleNode(jstate, (Node *) func->args); + _jumbleNode(jstate, (Node *) expr->outargs); +} diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out index 0047aba5d1..b5a8c6937c 100644 --- a/contrib/pg_stat_statements/expected/utility.out +++ b/contrib/pg_stat_statements/expected/utility.out @@ -240,6 +240,12 @@ DECLARE BEGIN SELECT (i + i)::int INTO r; END; $$ LANGUAGE plpgsql; +CREATE OR REPLACE PROCEDURE sum_out(IN i int, IN j int, OUT k int, OUT l int) AS $$ +DECLARE + r int; +BEGIN + SELECT (i + i)::int INTO r; +END; $$ LANGUAGE plpgsql; CREATE OR REPLACE PROCEDURE sum_two(i int, j int) AS $$ DECLARE r int; @@ -256,15 +262,32 @@ CALL sum_one(3); CALL sum_one(199); CALL sum_two(1,1); CALL sum_two(1,2); +CALL sum_out(1,1,1,1); + k | l +---+--- + | +(1 row) + +CALL sum_out(2,2,1,1); + k | l +---+--- + | 
+(1 row) + +CALL sum_out(2,3,4,5); + k | l +---+--- + | +(1 row) + SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; calls | rows | query ---+--+--- - 1 |0 | CALL sum_one(199) - 1 |0 | CALL sum_one(3) - 1 |0 | CALL sum_two(1,1) - 1 |0 | CALL sum_two(1,2) + 2 |0 | CALL sum_one($1) + 3 |0 | CALL sum_out($1,$2,$3,$4) + 2 |0 | CALL sum_two($1,$2) 1 |1 | SELECT pg_stat_statements_reset() -(5 rows) +(4 rows) -- COPY CREATE TABLE copy_stats (a int, b int); diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql index 225d30a62a..dbf38c31bc 100644 --- a/contrib/pg_stat_statements/sql/utility.sql +++ b/contrib/pg_stat_statements/sql/utility.sql @@ -131,6 +131,12 @@ DECLARE BEGIN SELECT (i + i)::int INTO r; END; $$ LANGUAGE plpgsql; +CREATE OR REPLACE PROCEDURE sum_out(IN i int, IN j int, OUT k int, OUT l int) AS $$ +DECLARE + r int; +BEGIN + SELECT (i + i)::int INTO r; +END; $$ LANGUAGE plpgsql; CREATE OR REPLACE PROCEDURE sum_two(i int, j int) AS $$ DECLARE r int; @@ -142,6 +148,9 @@ CALL sum_one(3); CALL sum_one(199); CALL sum_two(1,1); CALL sum_two(1,2); +CALL
Re: Add standard collation UNICODE
On 04.03.23 19:29, Jeff Davis wrote: It looks like the way you've handled this is by inserting the collation with collprovider=icu even if built without ICU support. I think that's a new case, so we need to make sure it throws reasonable user-facing errors. It would look like this: => select * from t1 order by b collate unicode; ERROR: 0A000: ICU is not supported in this build I do like your approach though because, if someone is using a standard collation, I think "not built with ICU" (feature not supported) is a better error than "collation doesn't exist". It also effectively reserves the name "unicode". right
RE: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
On Tuesday, March 7, 2023 9:47 PM Önder Kalacı wrote: Hi, > > > Let me give an example to demonstrate why I thought something is fishy > > > here: > > > > > > Imagine rel has a (non-default) REPLICA IDENTITY with Oid=. > > > Imagine the same rel has a PRIMARY KEY with Oid=. > > > > > Hmm, alright, this is syntactically possible, but not sure if any user > would do this. Still thanks for catching this. > > And, you are right, if a user has created such a schema, > IdxIsRelationIdentityOrPK() > would return the wrong result and we'd use sequential scan instead of index > scan. > This would be a regression. I think we should change the function. I am looking at the latest patch and have a question about the following code. /* Try to find the tuple */ - if (index_getnext_slot(scan, ForwardScanDirection, outslot)) + while (index_getnext_slot(scan, ForwardScanDirection, outslot)) { - found = true; + /* +* Avoid expensive equality check if the index is primary key or +* replica identity index. +*/ + if (!idxIsRelationIdentityOrPK) + { + if (eq == NULL) + { +#ifdef USE_ASSERT_CHECKING + /* apply assertions only once for the input idxoid */ + IndexInfo *indexInfo = BuildIndexInfo(idxrel); + Assert(IsIndexUsableForReplicaIdentityFull(indexInfo)); +#endif + + /* +* We only need to allocate once. This is allocated within per +* tuple context -- ApplyMessageContext -- hence no need to +* explicitly pfree(). +*/ + eq = palloc0(sizeof(*eq) * outslot->tts_tupleDescriptor->natts); + } + + if (!tuples_equal(outslot, searchslot, eq)) + continue; + } IIRC, it invokes tuples_equal for all cases unless we are using a replica identity key or primary key for the scan. But there seem to be some other cases where the tuples_equal call looks unnecessary. For example, if the table on the subscriber doesn't have a PK or RI key but has a not-null, non-deferrable, unique key. If the apply worker chooses this index to do the scan, it seems we can skip tuples_equal as well. --Example pub: CREATE TABLE test_replica_id_full (a int, b int not null); ALTER TABLE test_replica_id_full REPLICA IDENTITY FULL; CREATE PUBLICATION tap_pub_rep_full FOR TABLE test_replica_id_full; sub: CREATE TABLE test_replica_id_full (a int, b int not null); CREATE UNIQUE INDEX test_replica_id_full_idx ON test_replica_id_full(b); -- I am not 100% sure if it's worth optimizing this by complicating the check in idxIsRelationIdentityOrPK. What do you think? Best Regards, Hou zj
Re: pg_upgrade and logical replication
On Sat, 4 Mar 2023, 14:13 Amit Kapila, wrote: > > > For the publisher nodes, that may be something nice to support (I'm > assuming it > > could be useful for more complex replication setups) but I'm not > interested in > > that at the moment as my goal is to reduce downtime for major upgrade of > > physical replica, thus *not* doing pg_upgrade of the primary node, > whether > > physical or logical. I don't see why it couldn't be done later on, > if/when > > someone has a use case for it. > > > > I thought there is value if we provide a way to upgrade both publisher > and subscriber. It's still unclear to me whether it's actually achievable on the publisher side, as running pg_upgrade leaves a "hole" in the WAL stream and resets the timeline, among other possible difficulties. Now I don't know much about logical replication internals, so I'm clearly not the best person to answer those questions. Now, you came up with a use case linking it to a > physical replica where allowing an upgrade of only subscriber nodes is > useful. It is possible that users find your steps easy to perform and > didn't find them error-prone but it may be better to get some > authentication of the same. I haven't yet analyzed all the steps in > detail but let's see what others think. > It's been quite some time since then, and no one has chimed in or objected. IMO doing a major version upgrade with limited downtime (something faster than stopping postgres and running pg_upgrade) has always been difficult, and that never prevented anyone from doing it, so I don't think it should be a blocker for what I'm suggesting here, especially since the current behavior of pg_upgrade on a subscriber node is IMHO broken. Is there something that can be done for pg16? I was thinking that having a fix for the normal and easy case could be acceptable: only allowing pg_upgrade to optionally, and not by default, preserve the subscription relations IFF all subscriptions only have tables in ready state. States other than ready should be transient, and it's easy to check as a user beforehand and also easy to check during pg_upgrade, so it seems like an acceptable limitation (which I personally see as a good sanity check, but YMMV). It could be lifted in later releases if wanted anyway. It's unclear to me whether this limited scope would also require preserving the replication origins, but having looked at the code I don't think it would be much of a problem, as the local LSN doesn't have to be preserved. In both cases I would prefer a single option (e.g. --preserve-logical-subscription-state or something like that) to avoid too many complications. Similarly, I still don't see any sensible use case for allowing such an option in a normal pg_dump, so I'd rather not expose that.
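For what it's worth, the "all relations in ready state" precondition is cheap to verify from SQL on the subscriber before attempting the upgrade; a sketch of such a check could be:

    -- any rows returned mean some subscribed tables are still in a transient
    -- state (initialize / data copy / sync) rather than ready ('r')
    SELECT s.subname, sr.srrelid::regclass, sr.srsubstate
    FROM pg_subscription_rel sr
    JOIN pg_subscription s ON s.oid = sr.srsubid
    WHERE sr.srsubstate <> 'r';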