Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple

2021-01-18 Thread Bharath Rupireddy
On Mon, Jan 18, 2021 at 1:16 PM japin  wrote:
>
>
> Hi,
>
> I find that the outputstr variable in logicalrep_write_tuple() is only used
> in the `else` branch, so I think we can narrow its scope, just like the
> variable outputbytes in the `if` branch (for better readability).
>
> /*
>  * Send in binary if requested and type has suitable send function.
>  */
> if (binary && OidIsValid(typclass->typsend))
> {
> bytea  *outputbytes;
> int len;
>
> pq_sendbyte(out, LOGICALREP_COLUMN_BINARY);
> outputbytes = OidSendFunctionCall(typclass->typsend, values[i]);
> len = VARSIZE(outputbytes) - VARHDRSZ;
> pq_sendint(out, len, 4);/* length */
> pq_sendbytes(out, VARDATA(outputbytes), len);   /* data */
> pfree(outputbytes);
> }
> else
> {
> pq_sendbyte(out, LOGICALREP_COLUMN_TEXT);
> outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
> pq_sendcountedtext(out, outputstr, strlen(outputstr), false);
> pfree(outputstr);
> }
>
> Attached is a small patch to fix it.

+1. Binary mode uses the block-level variable outputbytes, so making
outputstr block-level is fine IMO.

Patch basically looks good to me, but it doesn't apply on my system.
Looks like it's not created with git commit. Please create the patch
with git commit command.

git apply 
/mnt/hgfs/Shared/narrow-the-scope-of-the-variable-in-logicalrep_write_tuple.patch
error: corrupt patch at line 10
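
For example, something like this (illustrative commands only):

git add -u
git commit -m "Narrow the scope of outputstr in logicalrep_write_tuple"
git format-patch -1

git format-patch emits the commit in a mailbox-style format that applies
cleanly with git apply or git am.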

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




[PATCH] ProcessInterrupts_hook

2021-01-18 Thread Craig Ringer
Hi folks

A few times lately I've been doing things in extensions that've made me
want to be able to run my own code whenever InterruptPending is true and
CHECK_FOR_INTERRUPTS() calls ProcessInterrupts().

So here's a simple patch to add ProcessInterrupts_hook. It follows the
usual pattern like ProcessUtility_hook and standard_ProcessUtility.

Why? Because sometimes I want most of the behaviour of die(), but with
the option to occasionally override it with some bgworker-specific
choices. HOLD_INTERRUPTS() is too big a hammer.

What I really want to go along with this is a way for any backend to
observe the postmaster's pmState and its "Shutdown" variable's value, so
any backend can tell if we're in FastShutdown, SmartShutdown, etc. Copies
in shmem only, obviously. But I'm not convinced it's right to just copy
these vars as-is to shmem, and I don't want to use the memory for a
ProcSignal slot for something that won't be relevant for most backends for
most of the postmaster lifetime. Ideas welcomed.
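
To make the intended usage concrete, here's a minimal sketch of how an
extension might install the hook (hypothetical extension code: it assumes
only the hook API added by the patch below, and follows the usual
prev-hook chaining convention):

#include "postgres.h"

#include <signal.h>

#include "fmgr.h"
#include "miscadmin.h"

PG_MODULE_MAGIC;

static ProcessInterrupts_hook_type prev_ProcessInterrupts_hook = NULL;

/* extension-defined flag, e.g. set from a custom signal handler */
static volatile sig_atomic_t MyCustomInterruptPending = false;

static void
my_ProcessInterrupts(void)
{
	if (MyCustomInterruptPending)
	{
		MyCustomInterruptPending = false;
		elog(LOG, "handling bgworker-specific interrupt");
	}

	/* fall through to the next hook, or to the standard behaviour */
	if (prev_ProcessInterrupts_hook)
		prev_ProcessInterrupts_hook();
	else
		standard_ProcessInterrupts();
}

void
_PG_init(void)
{
	prev_ProcessInterrupts_hook = ProcessInterrupts_hook;
	ProcessInterrupts_hook = my_ProcessInterrupts;
}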
From 1c8c477a5814420011fa034323e82d8eabc6bc5f Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 14:14:46 +0800
Subject: [PATCH v1 1/4] Provide a hook for ProcessInterrupts()

Code may now register a ProcessInterrupts_hook to fire its own
logic before and/or after the main ProcessInterrupts handler
for signal processing. The hook must call standard_ProcessInterrupts
to execute the normal interrupt handling, or set InterruptPending to
true to cause the interrupt to be re-processed at the next opportunity.

This follows the consistent pattern used by other PostgreSQL hooks like
the ProcessUtility_hook and standard_ProcessUtility() function.

The purpose of this hook is to allow extensions to react to their own
custom interrupt flags during CHECK_FOR_INTERRUPTS() calls invoked
in main PostgreSQL code paths.
---
 src/backend/tcop/postgres.c | 20 
 src/include/miscadmin.h |  3 +++
 2 files changed, 23 insertions(+)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 28055680aa..8cf1359e33 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -102,6 +102,8 @@ int			max_stack_depth = 100;
 /* wait N seconds to allow attach from a debugger */
 int			PostAuthDelay = 0;
 
+/* Intercept CHECK_FOR_INTERRUPTS() responses */
+ProcessInterrupts_hook_type ProcessInterrupts_hook = NULL;
 
 
 /* 
@@ -3064,8 +3066,26 @@ ProcessInterrupts(void)
 	/* OK to accept any interrupts now? */
 	if (InterruptHoldoffCount != 0 || CritSectionCount != 0)
 		return;
+
 	InterruptPending = false;
 
+	if (ProcessInterrupts_hook)
+		ProcessInterrupts_hook();
+	else
+		standard_ProcessInterrupts();
+}
+
+/*
+ * Implement the default signal handling behaviour for most backend types
+ * including user backends and bgworkers.
+ *
+ * This is where CHECK_FOR_INTERRUPTS() eventually lands up unless intercepted
+ * by ProcessInterrupts_hook.
+ */
+void
+standard_ProcessInterrupts(void)
+{
+
 	if (ProcDiePending)
 	{
 		ProcDiePending = false;
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bdc97e308..f082d04448 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -94,6 +94,9 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount;
 
 /* in tcop/postgres.c */
 extern void ProcessInterrupts(void);
+typedef void (*ProcessInterrupts_hook_type)(void);
+extern ProcessInterrupts_hook_type ProcessInterrupts_hook;
+extern void standard_ProcessInterrupts(void);
 
 #ifndef WIN32
 
-- 
2.29.2

From 58b9ac884ef5bd35a2aac9da85079e24097612be Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Mon, 18 Jan 2021 15:26:43 +0800
Subject: [PATCH v1 2/4] Test for ProcessInterrupts_hook

---
 src/test/modules/Makefile |   3 +-
 src/test/modules/test_hooks/.gitignore|   4 +
 src/test/modules/test_hooks/Makefile  |  28 
 .../expected/test_processinterrupt_hook.out   |   0
 src/test/modules/test_hooks/hooks.conf|   1 +
 .../sql/test_processinterrupt_hook.sql|  16 ++
 .../modules/test_hooks/test_hooks--1.0.sql|   8 +
 src/test/modules/test_hooks/test_hooks.c  |  37 +
 .../modules/test_hooks/test_hooks.control |   4 +
 src/test/modules/test_hooks/test_hooks.h  |  20 +++
 .../test_hooks/test_process_interrupts_hook.c | 140 ++
 11 files changed, 260 insertions(+), 1 deletion(-)
 create mode 100644 src/test/modules/test_hooks/.gitignore
 create mode 100644 src/test/modules/test_hooks/Makefile
 create mode 100644 src/test/modules/test_hooks/expected/test_processinterrupt_hook.out
 create mode 100644 src/test/modules/test_hooks/hooks.conf
 create mode 100644 src/test/modules/test_hooks/sql/test_processinterrupt_hook.sql
 create mode 100644 src/test/modules/test_hooks/test_hooks--1.0.sql
 create mode 100644 src/test/modules/test_hooks/test_hooks.c
 create mode 100644 src/test/modules/test_hooks/test_hooks.control
 create mode 100644 src/test/modules/test_hooks/tes

Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple

2021-01-18 Thread japin

On Mon, 18 Jan 2021 at 15:59, Bharath Rupireddy 
 wrote:
> On Mon, Jan 18, 2021 at 1:16 PM japin  wrote:
>>
>>
>> Hi,
>>
>> I find that the outputstr variable in logicalrep_write_tuple() is only used
>> in the `else` branch, so I think we can narrow its scope, just like the
>> variable outputbytes in the `if` branch (for better readability).
>>
>> /*
>>  * Send in binary if requested and type has suitable send function.
>>  */
>> if (binary && OidIsValid(typclass->typsend))
>> {
>> bytea  *outputbytes;
>> int len;
>>
>> pq_sendbyte(out, LOGICALREP_COLUMN_BINARY);
>> outputbytes = OidSendFunctionCall(typclass->typsend, values[i]);
>> len = VARSIZE(outputbytes) - VARHDRSZ;
>> pq_sendint(out, len, 4);/* length */
>> pq_sendbytes(out, VARDATA(outputbytes), len);   /* data */
>> pfree(outputbytes);
>> }
>> else
>> {
>> pq_sendbyte(out, LOGICALREP_COLUMN_TEXT);
>> outputstr = OidOutputFunctionCall(typclass->typoutput, 
>> values[i]);
>> pq_sendcountedtext(out, outputstr, strlen(outputstr), false);
>> pfree(outputstr);
>> }
>>
>> Attached is a small patch to fix it.
>
> +1. Binary mode uses the block-level variable outputbytes, so making
> outputstr block-level is fine IMO.
>
> Patch basically looks good to me, but it doesn't apply on my system.
> Looks like it's not created with git commit. Please create the patch
> with git commit command.
>
> git apply 
> /mnt/hgfs/Shared/narrow-the-scope-of-the-variable-in-logicalrep_write_tuple.patch
> error: corrupt patch at line 10
>

Thanks for reviewing! Attached v2 as you suggested.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.

From 89fb10efc9eccf11f531ab8675376b57a12a6cb1 Mon Sep 17 00:00:00 2001
From: Japin Li 
Date: Mon, 18 Jan 2021 16:04:54 +0800
Subject: [PATCH v2] Narrow the scope of the variable outputstr in
 logicalrep_write_tuple

---
 src/backend/replication/logical/proto.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c
index 62275ebabe..f2c85cabb5 100644
--- a/src/backend/replication/logical/proto.c
+++ b/src/backend/replication/logical/proto.c
@@ -493,7 +493,6 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple, bool binar
 		HeapTuple	typtup;
 		Form_pg_type typclass;
 		Form_pg_attribute att = TupleDescAttr(desc, i);
-		char	   *outputstr;
 
 		if (att->attisdropped || att->attgenerated)
 			continue;
@@ -537,6 +536,8 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple, bool binar
 		}
 		else
 		{
+			char	   *outputstr;
+
 			pq_sendbyte(out, LOGICALREP_COLUMN_TEXT);
 			outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
 			pq_sendcountedtext(out, outputstr, strlen(outputstr), false);
-- 
2.30.0



RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Tang, Haiying 
> (does this patch make some optimizations for serial inserts? I'm a little
> confused here, because the patched execution time is less than unpatched, but
> I didn't find information about it in the commit messages. If I missed
> something, please kindly let me know.)

I haven't thought of anything yet.  Could you show us the output of EXPLAIN 
(ANALYZE, BUFFERS, VERBOSE) of 1,000 partitions case for the patched and 
unpatched?  If it doesn't show any difference, the output of perf may be 
necessary next.

(BTW, were all the 1,000 rows stored in the target table?)


> I did another test which made the check overhead obvious. This case is not a
> good fit for partitioning, but I put it here as an extreme case too.
> The select table's total row count is 1,000 and the # of partitions is 2,000,
> so only the first 1,000 partitions have 1 row per partition; the last 1,000
> partitions have no data inserted.

Thank you, that's a good test.


Regards
Takayuki Tsunakawa



Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 1:02 PM Tang, Haiying
 wrote:
>
> > From: Amit Kapila 
> > > Can we test cases when we have few rows in the Select table (say
> > > 1000) and there 500 or 1000 partitions. In that case, we won't
> > > select parallelism but we have to pay the price of checking
> > > parallel-safety of all partitions. Can you check this with 100, 200,
> > > 500, 1000 partitions table?
> >
> > I also wanted to see such an extreme(?) case.  The 1,000 rows is not
> > the count per partition but the total count across all partitions, e.g.,
> > when the # of partitions is 100, the # of rows per partition is 10.
>
> The results below are for a serial plan where the select table's total row
> count is 1,000. The Execution Time + Planning Time is still less than
> unpatched.
> (does this patch make some optimizations for serial inserts? I'm a little
> confused here, because the patched execution time is less than unpatched, but
> I didn't find information about it in the commit messages. If I missed
> something, please kindly let me know.)
>

I don't think the patch should have any impact on the serial case. I
think you can try to repeat each test 3 times both with and without the
patch and take the median of the three.

-- 
With Regards,
Amit Kapila.




Re: Wrong usage of RelationNeedsWAL

2021-01-18 Thread Kyotaro Horiguchi
At Sun, 17 Jan 2021 23:02:18 -0800, Noah Misch  wrote in 
> On Sun, Jan 17, 2021 at 10:36:31PM -0800, Noah Misch wrote:
> > I wrote the above based on the "PageGetLSN(page) > (snapshot)->lsn" check in
> > TestForOldSnapshot().  If the LSN isn't important, what else explains
> > RelationAllowsEarlyPruning() checking RelationNeedsWAL()?
> 
> Thinking about it more, some RelationAllowsEarlyPruning() callers need
> different treatment.  Above, I was writing about the case of deciding whether
> to do early pruning.  The other RelationAllowsEarlyPruning() call sites are
> deciding whether the relation might be lacking old data.  For that case, we
> should check RELPERSISTENCE_PERMANENT, not RelationNeedsWAL().  Otherwise, we
> could fail to report an old-snapshot error in a case like this:
> 
> setup: CREATE TABLE t AS SELECT ...;
> xact1: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1;  -- take snapshot
> xact2: DELETE FROM t;
> (plenty of time passes)
> xact3: SELECT count(*) FROM t;  -- early pruning
> xact1: SAVEPOINT q; SELECT count(*) FROM t; ROLLBACK TO q;  -- "snapshot too 
> old"
> xact1: ALTER TABLE t SET TABLESPACE something;  -- start skipping WAL
> xact1: SELECT count(*) FROM t;  -- no error, wanted "snapshot too old"
> 
> Is that plausible?

Thank you for the consideration, and yes. But I get "snapshot too old"
from the last query with the patched version. Maybe I'm missing
something; I'm going to investigate the case.

Regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Determine parallel-safety of partition relations for Inserts

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 10:27 AM tsunakawa.ta...@fujitsu.com
 wrote:
>
> From: Amit Kapila 
> > We already allow users to specify the degree of parallelism for all
> > the parallel operations via guc's max_parallel_maintenance_workers,
> > max_parallel_workers_per_gather, then we have a reloption
> > parallel_workers and vacuum command has the parallel option where
> > users can specify the number of workers that can be used for
> > parallelism. The parallelism considers these as hints but decides
> > parallelism based on some other parameters like if there are that many
> > workers available, etc. Why the users would expect differently for
> > parallel DML?
>
> I agree that the user would want to specify the degree of parallelism of DML, 
> too.  My simple (probably silly) question was, in INSERT SELECT,
>
> * If the target table has 10 partitions and the source table has 100 
> partitions, how would the user want to specify parameters?
>
> * If the source and target tables have the same number of partitions, and the 
> user specified different values to parallel_workers and parallel_dml_workers, 
> how many parallel workers would run?
>

Good question. I think if we choose to have a separate parameter for
DML, it can probably be a boolean to just indicate whether to enable
parallel DML for a specified table, using the parallel_workers
specified for the table used in the SELECT.
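
In other words, something like this hypothetical syntax (illustrative
only, to make the idea concrete; the option name is invented):

ALTER TABLE target_tbl SET (parallel_dml_enabled = true);

with the worker count still coming from the parallel_workers reloption
of the table being scanned in the SELECT part.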

-- 
With Regards,
Amit Kapila.




Re: Improve handling of parameter differences in physical replication

2021-01-18 Thread Peter Eisentraut

On 2021-01-15 12:28, Sergei Kornilov wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

Hello
Looks good to me. I think the patch is ready for committer.

The new status of this patch is: Ready for Committer


committed

--
Peter Eisentraut
2ndQuadrant, an EDB company
https://www.2ndquadrant.com/




RE: Determine parallel-safety of partition relations for Inserts

2021-01-18 Thread tsunakawa.ta...@fujitsu.com
From: Amit Kapila 
> Good question. I think if we choose to have a separate parameter for
> DML, it can probably be a boolean to just indicate whether to enable
> parallel DML for a specified table, using the parallel_workers
> specified for the table used in the SELECT.

Agreed.


Regards
Takayuki Tsunakawa



Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-18 Thread Justin Pryzby
On Mon, Jan 18, 2021 at 02:18:44PM +0900, Michael Paquier wrote:
> On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote:
> > It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and
> > ReindexParams.  In my v31 patch, I moved ReindexOptions to a private
> > structure in indexcmds.c, with an "int options" bitmask which is passed
> > to reindex_index() et al.  Your patch keeps/puts ReindexOptions in
> > index.h, so it also applies to reindex_index, which I think is good.
> 
> a3dc926 is an equivalent of 0001~0003 merged together.  0008 had
> better be submitted into a separate thread if there is value to it.
> 0004~0007 are the pieces remaining.  Could it be possible to rebase
> things on HEAD and put the tablespace bits into the structures 
> {Vacuum,Reindex,Cluster}Params?

Attached.  I will re-review these myself tomorrow.

-- 
Justin
From 42991c90afda1b2a9fe4095418a87b5ead4b23d9 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 23 Mar 2020 21:10:29 +0300
Subject: [PATCH 1/4] Allow REINDEX to change tablespace

REINDEX already does a full relation rewrite; this patch adds the
possibility to specify a new tablespace where the new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  22 +
 src/backend/catalog/index.c   |  93 ++-
 src/backend/commands/indexcmds.c  | 103 +-
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   2 +
 src/test/regress/input/tablespace.source  |  53 +++
 src/test/regress/output/tablespace.source | 102 +
 7 files changed, 374 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..4f84060c4d 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,19 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  This specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+ 
+
+   
+

 VERBOSE
 
@@ -210,6 +224,14 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+new_tablespace
+
+ 
+  The tablespace where indexes will be rebuilt.
+ 
+
+   
   
  
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b8cd35e995..9aa9fdf291 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -57,6 +57,7 @@
 #include "commands/event_trigger.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -1394,9 +1395,13 @@ index_update_collation_versions(Oid relid, Oid coll)
  * Create concurrently an index based on the definition of the one provided by
  * caller.  The index is inserted into catalogs and needs to be built later
  * on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the new tablespace to use for this index.  If
+ * InvalidOid, use the tablespace in-use instead.
  */
 Oid
-index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName)
+index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
+			   Oid tablespaceOid, const char *newName)
 {
 	Relation	indexRelation;
 	IndexInfo  *oldInfo,
@@ -1526,7 +1531,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char
 			  newInfo,
 			  indexColNames,
 			  indexRelation->rd_rel->relam,
-			  indexRelation->rd_rel->reltablespace,
+			  OidIsValid(tablespaceOid) ?
+tablespaceOid : indexRelation->rd_rel->reltablespace,
 			  indexRelation->rd_indcollation,
 			  indclass->values,
 			  indcoloptions->values,
@@ -3591,6 +3597,8 @@ IndexGetRelation(Oid indexId, bool missing_ok)
 
 /*
  * reindex_index - This routine is used to recreate a single index
+ *
+ * See comments of reindex_relation() for details about "tablespaceOid".
  */
 void
 reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
@@ -3603,6 +3611,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	volatile bool skipped_constraint = false;
 	PGRUsage	ru0;
 	bool		progress = ((params->options & REINDEXOPT_REPORT_PROGRESS) != 0);
+	bool		set_tablespace = OidIsValid(params->tablespaceOid);
 
 	pg_rusage_init(&ru0);
 
@@ -3654,6 +3663,35 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 			 get_namespace_name(RelationGetNamespace(iRel)),
 			 RelationGetRelationName(iRel));

Re: evtfoid and evtowner missing in findoidjoins/README

2021-01-18 Thread Joel Jacobson
On Sun, Jan 17, 2021, at 21:32, Tom Lane wrote:
>Well, SGML is actually plenty easy to parse as long as you've got xml
>tooling at hand.  We'd never want to introduce such a dependency in the
>normal build process, but making something like findoidjoins depend on
>such tools seems within reason.  I recall having whipped up some one-off
>Perl scripts of that sort when I was doing that massive rewrite of the
>func.sgml tables last year.  I didn't save them though, and in any case
>I'm the world's worst Perl programmer, so it'd be better for someone
>with more Perl-fu to take point if we decide to go that route.

I went ahead and implemented the parser; it was indeed easy.

Patch attached.

Add catalog_oidjoins.pl -- parses catalog references out of catalogs.sgml

Since doc/src/sgml/catalogs.sgml is the master where catalog references
are to be documented, if someone needs machine-readable access to
such information, it should be extracted from this document.

The added script catalog_oidjoins.pl parses the SGML and extracts
the fields necessary to produce the same output as generated
by findoidjoins, which has historically been copy/pasted into the README.

This is to allow for easy comparison, to verify the correctness
of catalog_oidjoins.pl, and if necessary, to update the README
based on the information in catalogs.sgml.

Helper-files:

diff_oidjoins.sh
Runs ./catalog_oidjoins.pl and compares its output
with STDIN. Shows a diff of the result, without any context.

test_oidjoins.sh
Runs ./diff_oidjoins.sh for both the README
and the output from ./findoidjoins regression.

bogus_oidjoins.txt
List of bogus diff entries to ignore,
based on documentation in README.

After running make installcheck in src/test/regress,
test_oidjoins.sh can be run:

$ ./test_oidjoins.sh
README:
+ Join pg_catalog.pg_attrdef.adnum => pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_class.relrewrite => pg_catalog.pg_class.oid
+ Join pg_catalog.pg_constraint.confkey []=> pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_constraint.conkey []=> pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_db_role_setting.setrole => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_default_acl.defaclnamespace => pg_catalog.pg_namespace.oid
+ Join pg_catalog.pg_default_acl.defaclrole => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_event_trigger.evtfoid => pg_catalog.pg_proc.oid
+ Join pg_catalog.pg_event_trigger.evtowner => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_extension.extconfig []=> pg_catalog.pg_class.oid
+ Join pg_catalog.pg_foreign_data_wrapper.fdwhandler => pg_catalog.pg_proc.oid
+ Join pg_catalog.pg_foreign_data_wrapper.fdwvalidator => pg_catalog.pg_proc.oid
+ Join pg_catalog.pg_foreign_table.ftrelid => pg_catalog.pg_class.oid
+ Join pg_catalog.pg_foreign_table.ftserver => pg_catalog.pg_foreign_server.oid
+ Join pg_catalog.pg_index.indkey => pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_partitioned_table.partattrs => 
pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_policy.polroles []=> pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_publication.pubowner => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_publication_rel.prpubid => pg_catalog.pg_publication.oid
+ Join pg_catalog.pg_publication_rel.prrelid => pg_catalog.pg_class.oid
+ Join pg_catalog.pg_range.rngmultitypid => pg_catalog.pg_type.oid
+ Join pg_catalog.pg_seclabel.classoid => pg_catalog.pg_class.oid
+ Join pg_catalog.pg_shseclabel.classoid => pg_catalog.pg_class.oid
+ Join pg_catalog.pg_statistic.staattnum => pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_statistic_ext.stxkeys => pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_subscription.subdbid => pg_catalog.pg_database.oid
+ Join pg_catalog.pg_subscription.subowner => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_subscription_rel.srrelid => pg_catalog.pg_class.oid
+ Join pg_catalog.pg_subscription_rel.srsubid => pg_catalog.pg_subscription.oid
+ Join pg_catalog.pg_trigger.tgattr => pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_type.typsubscript => pg_catalog.pg_proc.oid
+ Join pg_catalog.pg_user_mapping.umserver => pg_catalog.pg_foreign_server.oid
+ Join pg_catalog.pg_user_mapping.umuser => pg_catalog.pg_authid.oid
findoidjoins:
+ Join pg_catalog.pg_attrdef.adnum => pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_class.relrewrite => pg_catalog.pg_class.oid
+ Join pg_catalog.pg_constraint.confkey []=> pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_constraint.conkey []=> pg_catalog.pg_attribute.attnum
+ Join pg_catalog.pg_db_role_setting.setrole => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_default_acl.defaclnamespace => pg_catalog.pg_namespace.oid
+ Join pg_catalog.pg_default_acl.defaclrole => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_event_trigger.evtfoid => pg_catalog.pg_proc.oid
+ Join pg_catalog.pg_event_trigger.evtowner => pg_catalog.pg_authid.oid
+ Join pg_catalog.pg_extension.extconfig []=>

Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 10:45 AM Greg Nancarrow  wrote:
>
> On Fri, Jan 15, 2021 at 7:39 PM Amit Kapila  wrote:
> >
> > Here is an additional review of
> > v11-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT. There are
> > quite a few comments raised on the V11-0001* patch. I suggest first
> > post a revised version of V11-0001* patch addressing those comments
> > and then you can separately post a revised version of
> > v11-0003-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO.
> >
>
> 1)
>
> >Here, it seems we are accessing the relation descriptor without any
> >lock on the table which is dangerous considering the table can be
> >modified in a parallel session. Is there a reason why you think this
> >is safe? Did you check anywhere else such a coding pattern?
>
> Yes, there's a very good reason and I certainly have checked for the
> same coding pattern elsewhere, and not just randomly decided that
> locking can be ignored.
> The table has ALREADY been locked (by the caller) during the
> parse/analyze phase.
>

Fair enough. I suggest adding a comment saying so, so that the reader
doesn't get confused.

> (This is not the case for a partition, in which case the patch code
> uses AccessShareLock, as you will see).

Okay, but I see you release the lock on partition rel immediately
after checking parallel-safety. What if a user added some
parallel-unsafe constraint (via Alter Table) after that check?

>
> 4)
>
> >domain_max_parallel_hazard_for_modify()
> >{
> >..
> >+ if (isnull)
> >+ {
> >+ /*
> >+ * This shouldn't ever happen, but if it does, log a WARNING
> >+ * and return UNSAFE, rather than erroring out.
> >+ */
> >+ elog(WARNING, "null conbin for constraint %u", con->oid);
> >+ context->max_hazard = PROPARALLEL_UNSAFE;
> >+ break;
> >+ }
> >..
> >}
> >index_expr_max_parallel_hazard_for_modify()
> >{
> >..
> >+ if (index_expr_item == NULL) /* shouldn't happen */
> >+ {
> >+ index_close(index_rel, lockmode);
> >+ context->max_hazard = PROPARALLEL_UNSAFE;
> >+ return context->max_hazard;
> >+ }
> >..
> >}
>
> >It is not clear why the above two are shouldn't happen cases and if so
> >why we want to treat them as unsafe. Ideally, there should be an
> >Assert if these can't happen but it is difficult to decide without
> >knowing why you have considered them unsafe?
>
> The checks being done here for "should never happen" cases are THE
> SAME as other parts of the Postgres code.
> For example, search Postgres code for "null conbin" and you'll find 6
> other places in the Postgres code which actually ERROR out if conbin
> (binary representation of the constraint) in a pg_constraint tuple is
> found to be null.
> The cases that you point out in the patch used to also error out in
> the same way, but Vignesh suggested changing them to just return
> parallel-unsafe instead of erroring-out, which I agree with.
>

You have not raised a WARNING for the second case. But in the first
place, what is the reasoning for making this different from other parts
of the code? If we don't have a solid reason, then I suggest keeping
these checks and errors the same as in other parts of the code.

-- 
With Regards,
Amit Kapila.




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Tang, Haiying
Hi Tsunakawa-san

> From: Tang, Haiying 
> > (does this patch make some optimizations for serial inserts? I'm a little
> > confused here, because the patched execution time is less than unpatched,
> > but I didn't find information about it in the commit messages. If I missed
> > something, please kindly let me know.)
> 
> I haven't thought of anything yet.  Could you show us the output of 
> EXPLAIN (ANALYZE, BUFFERS, VERBOSE) of 1,000 partitions case for the 
> patched and unpatched?  If it doesn't show any difference, the output 
> of perf may be necessary next.

Execute EXPLAIN on Patched:
postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * 
from test_data1;
   QUERY PLAN

 Insert on public.test_part  (cost=0.00..15.00 rows=0 width=0) (actual 
time=44.139..44.140 rows=0 loops=1)
   Buffers: shared hit=1005 read=1000 dirtied=3000 written=2000
   ->  Seq Scan on public.test_data1  (cost=0.00..15.00 rows=1000 width=8) 
(actual time=0.007..0.201 rows=1000 loops=1)
 Output: test_data1.a, test_data1.b
 Buffers: shared hit=5
 Planning:
   Buffers: shared hit=27011
 Planning Time: 24.526 ms
 Execution Time: 44.981 ms

Execute EXPLAIN on non-Patched:
postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * 
from test_data1;
   QUERY PLAN

 Insert on public.test_part  (cost=0.00..15.00 rows=0 width=0) (actual 
time=72.656..72.657 rows=0 loops=1)
   Buffers: shared hit=22075 read=1000 dirtied=3000 written=2000
   ->  Seq Scan on public.test_data1  (cost=0.00..15.00 rows=1000 width=8) 
(actual time=0.010..0.175 rows=1000 loops=1)
 Output: test_data1.a, test_data1.b
 Buffers: shared hit=5
 Planning:
   Buffers: shared hit=72
 Planning Time: 0.135 ms
 Execution Time: 79.058 ms

> (BTW, were all the 1,000 rows stored in the target table?)

Yes, I checked all rows stored in target table.
postgres=# select count(*) from test_part;
 count
-------
  1000

Regards,
Tang




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Tang, Haiying
Hi Amit

> I don't think the patch should have any impact on the serial case. I 
> think you can try to repeat each test 3 times both with and without a 
> patch and take the median of the three.

Actually, I repeated it about 10 times; the execution time is always less
than unpatched.

Regards,
Tang





Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 2:40 PM Tang, Haiying
 wrote:
>
> Hi Tsunakawa-san
>
> > From: Tang, Haiying 
> > > (does this patch make some optimizations for serial inserts? I'm a little
> > > confused here, because the patched execution time is less than unpatched,
> > > but I didn't find information about it in the commit messages. If I missed
> > > something, please kindly let me know.)
> >
> > I haven't thought of anything yet.  Could you show us the output of
> > EXPLAIN (ANALYZE, BUFFERS, VERBOSE) of 1,000 partitions case for the
> > patched and unpatched?  If it doesn't show any difference, the output
> > of perf may be necessary next.
>
> Execute EXPLAIN on Patched:
> postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * 
> from test_data1;
>QUERY PLAN
> 
>  Insert on public.test_part  (cost=0.00..15.00 rows=0 width=0) (actual 
> time=44.139..44.140 rows=0 loops=1)
>Buffers: shared hit=1005 read=1000 dirtied=3000 written=2000
>->  Seq Scan on public.test_data1  (cost=0.00..15.00 rows=1000 width=8) 
> (actual time=0.007..0.201 rows=1000 loops=1)
>  Output: test_data1.a, test_data1.b
>  Buffers: shared hit=5
>  Planning:
>Buffers: shared hit=27011
>  Planning Time: 24.526 ms
>  Execution Time: 44.981 ms
>
> Execute EXPLAIN on non-Patched:
> postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * 
> from test_data1;
>QUERY PLAN
> 
>  Insert on public.test_part  (cost=0.00..15.00 rows=0 width=0) (actual 
> time=72.656..72.657 rows=0 loops=1)
>Buffers: shared hit=22075 read=1000 dirtied=3000 written=2000
>->  Seq Scan on public.test_data1  (cost=0.00..15.00 rows=1000 width=8) 
> (actual time=0.010..0.175 rows=1000 loops=1)
>  Output: test_data1.a, test_data1.b
>  Buffers: shared hit=5
>  Planning:
>Buffers: shared hit=72
>  Planning Time: 0.135 ms
>  Execution Time: 79.058 ms
>

So, the results indicate that after the patch we touch more buffers
during planning, which I think is because of accessing the partition
information, and during execution the patch touches fewer buffers for
the same reason. But why would this reduce the time with the patch? I
think this needs some investigation.

-- 
With Regards,
Amit Kapila.




Re: popcount

2021-01-18 Thread Peter Eisentraut

On 2021-01-11 17:13, David Fetter wrote:

On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote:

On 2020-12-30 17:41, David Fetter wrote:

The input may have more than 2 billion bits set to 1. The biggest possible
result should be 8 billion for bytea (1 GB with all bits set to 1).
So shouldn't this function return an int8?

It does now, and thanks for looking at this.


The documentation still reflects the previous int4 return type (in two
different spellings, too).


Thanks for looking this over!

Please find attached the next version with corrected documentation.


The documentation entries should be placed in alphabetical order in 
their respective tables.


For the bytea function, maybe choose a simpler example that a reader can 
compute in their head.  Also for the test cases.


In varbit.c:

The comment says

+ * Returns the number of bits set in a bit array.

but it should be "bit string".

+   int8popcount;

should be int64.

Also, pg_popcount() returns uint64, not int64.  Overflow is probably not
possible here, but a small comment about why, or an assertion, could be
appropriate.


+   p = VARBITS(arg1);

Why not assign that in the declaration block as well?

This comment:

+   /*
+* ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+* done to minimize branches and instructions.
+*/

I don't know what that is supposed to mean or why that kind of tuning 
would be necessary for a user-callable function.


+   popcount = pg_popcount((const char *)p, len);

The "const" is probably not necessary.

byteapopcount() looks better, but also needs int8 -> int64.
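
For what it's worth, a sketch of the bit-string function with the above
points addressed might look roughly like this (illustrative only; I've
guessed the function name as bitpopcount, and details may differ from
the next version of the patch):

#include "postgres.h"

#include "fmgr.h"
#include "port/pg_bitutils.h"
#include "utils/varbit.h"

/*
 * Returns the number of bits set in a bit string.
 */
Datum
bitpopcount(PG_FUNCTION_ARGS)
{
	VarBit	   *arg1 = PG_GETARG_VARBIT_P(0);
	bits8	   *p = VARBITS(arg1);
	int			len = VARBITBYTES(arg1);	/* ceil(VARBITLEN(arg1) / BITS_PER_BYTE) */
	int64		popcount;

	/*
	 * pg_popcount() returns uint64, but the result cannot exceed the number
	 * of bits in a 1 GB varlena, so it fits comfortably in int64.  Counting
	 * whole bytes is safe because the padding bits of a bit string are
	 * always zero.
	 */
	popcount = (int64) pg_popcount((char *) p, len);

	PG_RETURN_INT64(popcount);
}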




Re: proposal: schema variables

2021-01-18 Thread Pavel Stehule
On Thu, 14 Jan 2021 at 10:24, Erik Rijkers wrote:

> On 2021-01-14 07:35, Pavel Stehule wrote:
> > [schema-variables-20210114.patch.gz]
>
>
> Build is fine. My (small) list of tests run OK.
>
> I did notice a few more documentation peculiarities:
>
>
> 'The PostgreSQL has schema variables'  should be
> 'PostgreSQL has schema variables'
>
>
fixed


> A link to the LET command should be added to the 'See Also' of the
> CREATE VARIABLE, ALTER VARIABLE, and DROP VARIABLE pages. (After all,
> the LET command is the most interesting)
> Similarly, an ALTER VARIABLE link should be added to the 'See Also'
> section of LET.
>

fixed


>
> Somehow, the sgml in the doc files causes too large spacing in the html,
> example:
> I copy from the LET html:
> 'if that is defined.  If no explicit'
> (6 spaces between 'defined.' and 'If')
> Can you have a look?  Sorry - I can't find the cause.  It occurs on a
> few more places in the newly added sgml/html.  The unwanted spaces are
> visible also in the pdf.
>

Should be fixed in the latest version. Probably there were some problems with
invisible whitespace characters.

Thank you for checking.

Pavel



> (firefox 78.6.1, debian)
>
>
> Thanks,
>
> Erik Rijkers
>
>
>


Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 2:42 PM Amit Kapila  wrote:
>
> On Mon, Jan 18, 2021 at 10:45 AM Greg Nancarrow  wrote:
> >
> > On Fri, Jan 15, 2021 at 7:39 PM Amit Kapila  wrote:
> > >
> > > Here is an additional review of
> > > v11-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT. There are
> > > quite a few comments raised on the V11-0001* patch. I suggest first
> > > post a revised version of V11-0001* patch addressing those comments
> > > and then you can separately post a revised version of
> > > v11-0003-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO.
> > >
> >
> > 1)
> >
> > >Here, it seems we are accessing the relation descriptor without any
> > >lock on the table which is dangerous considering the table can be
> > >modified in a parallel session. Is there a reason why you think this
> > >is safe? Did you check anywhere else such a coding pattern?
> >
> > Yes, there's a very good reason and I certainly have checked for the
> > same coding pattern elsewhere, and not just randomly decided that
> > locking can be ignored.
> > The table has ALREADY been locked (by the caller) during the
> > parse/analyze phase.
> >
>
> Fair enough. I suggest adding a comment saying so, so that the reader
> doesn't get confused.
>
> > (This is not the case for a partition, in which case the patch code
> > uses AccessShareLock, as you will see).
>
> Okay, but I see you release the lock on partition rel immediately
> after checking parallel-safety. What if a user added some
> parallel-unsafe constraint (via Alter Table) after that check?
>
> >
> > 4)
> >
> > >domain_max_parallel_hazard_for_modify()
> > >{
> > >..
> > >+ if (isnull)
> > >+ {
> > >+ /*
> > >+ * This shouldn't ever happen, but if it does, log a WARNING
> > >+ * and return UNSAFE, rather than erroring out.
> > >+ */
> > >+ elog(WARNING, "null conbin for constraint %u", con->oid);
> > >+ context->max_hazard = PROPARALLEL_UNSAFE;
> > >+ break;
> > >+ }
> > >..
> > >}
> > >index_expr_max_parallel_hazard_for_modify()
> > >{
> > >..
> > >+ if (index_expr_item == NULL) /* shouldn't happen */
> > >+ {
> > >+ index_close(index_rel, lockmode);
> > >+ context->max_hazard = PROPARALLEL_UNSAFE;
> > >+ return context->max_hazard;
> > >+ }
> > >..
> > >}
> >
> > >It is not clear why the above two are shouldn't happen cases and if so
> > >why we want to treat them as unsafe. Ideally, there should be an
> > >Assert if these can't happen but it is difficult to decide without
> > >knowing why you have considered them unsafe?
> >
> > The checks being done here for "should never happen" cases are THE
> > SAME as other parts of the Postgres code.
> > For example, search Postgres code for "null conbin" and you'll find 6
> > other places in the Postgres code which actually ERROR out if conbin
> > (binary representation of the constraint) in a pg_constraint tuple is
> > found to be null.
> > The cases that you point out in the patch used to also error out in
> > the same way, but Vignesh suggested changing them to just return
> > parallel-unsafe instead of erroring-out, which I agree with.
> >
>
> You have not raised a WARNING for the second case. But in the first
> place, what is the reasoning for making this different from other parts
> of the code?
>

Again, thinking about this, I see a reason why one would want to do it
as you have done currently in the patch. It helps us avoid giving such
errors when they are really not required, say when the problem occurred
while checking parallel-safety for a particular partition into which, in
reality, we will never insert; there are probably other similar cases. I
guess we should give the WARNING consistently in all such cases.

-- 
With Regards,
Amit Kapila.




Re: proposal: schema variables

2021-01-18 Thread Pavel Stehule
Hi

On Thu, 14 Jan 2021 at 11:31, Josef Šimánek wrote:

> I did some testing locally. All runs smoothly, compiled without warning.
>
> Later on (once merged) it would be nice to write down a documentation
> page for the whole feature as a set next to documented individual
> commands.
> It took me a few moments to understand how this works.
>
> I was looking how to create variable with non default value in one
> command, but if I understand it correctly, that's not the purpose.
> Variable lives in a schema with default value, but you can set it per
> session via LET.
>
> Thus "CREATE VARIABLE" statement should not be usually part of
> "classic" queries, but it should be threatened more like TABLE or
> other DDL statements.
>
> On the other side LET is there to use the variable in "classic" queries.
>
> Does that make sense? Feel free to ping me if any help with
> documentation is needed. I can try to prepare an initial
> variables guide once I'm sure I understand this feature well.
>

Any help with the docs is welcome. Maybe there could be a page in the
advanced features section:

https://www.postgresql.org/docs/current/tutorial-advanced.html
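
To illustrate the workflow you describe, a minimal sketch (the exact
option syntax follows the current patch and may still change):

-- DDL: the variable is a persistent schema object with a default value
CREATE VARIABLE public.current_user_id AS integer DEFAULT -1;

-- per session: LET assigns a session-local value to the variable
LET current_user_id = 42;

-- the variable can then be read in "classic" queries
SELECT current_user_id;

DROP VARIABLE public.current_user_id;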

Regards

Pavel


>
> PS: Now it is clear to me why it is called "schema variables", not
> "session variables".
>
> On Thu, 14 Jan 2021 at 7:36, Pavel Stehule wrote:
> >
> > Hi
> >
> > rebase
> >
> > Regards
> >
> > Pavel
> >
>


Re: proposal: schema variables

2021-01-18 Thread Pavel Stehule
On Mon, 18 Jan 2021 at 10:50, Pavel Stehule wrote:

>
>
> On Thu, 14 Jan 2021 at 10:24, Erik Rijkers wrote:
>
>> On 2021-01-14 07:35, Pavel Stehule wrote:
>> > [schema-variables-20210114.patch.gz]
>>
>>
>> Build is fine. My (small) list of tests run OK.
>>
>> I did notice a few more documentation peculiarities:
>>
>>
>> 'The PostgreSQL has schema variables'  should be
>> 'PostgreSQL has schema variables'
>>
>>
> fixed
>
>
>> A link to the LET command should be added to the 'See Also' of the
>> CREATE VARIABLE, ALTER VARIABLE, and DROP VARIABLE pages. (After all,
>> the LET command is the most interesting)
>> Similarly, an ALTER VARIABLE link should be added to the 'See Also'
>> section of LET.
>>
>
> fixed
>
>
>>
>> Somehow, the sgml in the doc files causes too large spacing in the html,
>> example:
>> I copy from the LET html:
>> 'if that is defined.  If no explicit'
>> (6 spaces between 'defined.' and 'If')
>> Can you have a look?  Sorry - I can't find the cause.  It occurs on a
>> few more places in the newly added sgml/html.  The unwanted spaces are
>> visible also in the pdf.
>>
>
> Should be fixed in the latest version. Probably there were some problems
> with invisible whitespace characters.
>
> Thank you for checking.
>
> Pavel
>
>
>
>> (firefox 78.6.1, debian)
>>
>
and here is the patch

Regards

Pavel


>>
>> Thanks,
>>
>> Erik Rijkers
>>
>>
>>


schema-variables-20200118.patch.gz
Description: application/gzip


Re: Proposal: Global Index

2021-01-18 Thread 曾文旌


> On Jan 12, 2021, at 02:37, Robert Haas wrote:
> 
> On Mon, Jan 11, 2021 at 12:46 PM Bruce Momjian  wrote:
>>> For 1) The DETACH old child table can be finished immediately, global index 
>>> can be kept valid after DETACH is completed, and the cleanup of garbage 
>>> data in global index can be deferred to VACUUM.
>>> This is similar to the global index optimization done by Oracle12c.
>>> For 2) ATTACH new empty child table can also be completed immediately.
>>> If this is the case, many of the advantages of partitioned tables will be 
>>> retained, while the advantages of global indexes will be gained.
>> 
>> Yes, we can keep the index rows for the deleted partition and clean them
>> up later, but what is the advantage of partitioning then?  Just heap
>> deletion quickly?  Is that enough of a value?
> 
> I actually think the idea of lazily deleting the index entries is
> pretty good, but it won't work if the way the global index is
> implemented is by adding a tableoid column. Because then, I might
> detach a partition and later reattach it and the old index entries are
> still there but the table contents might have changed. Worse yet, the
> table might be dropped and the table OID reused for a completely
> unrelated table with completely unrelated contents, which could then
> be attached as a new partition.
> 
> One of the big selling points of global indexes is that they allow you
> to enforce uniqueness on a column unrelated to the partitioning
> column. Another is that you can look up a value by doing a single
> index scan on the global index rather than an index scan per
> partition. Those things are no less valuable for performing index
> deletion lazily.
> 
> However, there is a VACUUM amplification effect to worry about here
> which Wenjing seems not to be considering. Suppose I have a table
> which is not partitioned and it is 1TB in size with an index that is
> 128GB in size. To vacuum the table, I need to do 1TB + 128GB of I/O.
> Now, suppose I now partition the table into 1024 partitions each with
> its own local index. Each partition is 1GB in size and the index on
> each partition is 128MB in size. To vacuum an individual partition
> requires 1GB + 128MB of I/O, so to vacuum all the partitions requires
> the same amount of total I/O as before. But, now suppose that I have a
> single global index instead of a local index per partition. First, how
> big will that index be? It will not be 128GB, but somewhat bigger,
> because it needs extra space for every indexed tuple. Let's say 140GB.
> Furthermore, it will need to be vacuumed whenever any child is
> vacuumed, because it contains some index entries from every child. So
> the total I/O to vacuum all partitions is now 1GB * 1024 + 140GB *
> 1024 = 141TB, which is a heck of a lot worse than the 1.125TB I
> required with the unpartitioned table or the locally partitioned
> table.
Thank you for pointing this out.
It seems that some optimization can be done, but there is no good way
to completely eliminate the vacuum amplification effect of the global index.
Maybe we can only count on zheap, which doesn't need to do VACUUM.



> 
> That's not necessarily a death sentence for every use case, but it's
> going to be pretty bad for tables that are big and heavily updated.
> 
> -- 
> Robert Haas
> EDB: http://www.enterprisedb.com





Re: Parallel INSERT (INTO ... SELECT ...)

2021-01-18 Thread Greg Nancarrow
On Mon, Jan 18, 2021 at 8:10 PM Amit Kapila  wrote:
>
> > 1)
> >
> > >Here, it seems we are accessing the relation descriptor without any
> > >lock on the table which is dangerous considering the table can be
> > >modified in a parallel session. Is there a reason why you think this
> > >is safe? Did you check anywhere else such a coding pattern?
> >
> > Yes, there's a very good reason and I certainly have checked for the
> > same coding pattern elsewhere, and not just randomly decided that
> > locking can be ignored.
> > The table has ALREADY been locked (by the caller) during the
> > parse/analyze phase.
> >
>
> Fair enough. I suggest adding a comment saying so, so that the reader
> doesn't get confused.
>

OK, I'll add a comment.

> > (This is not the case for a partition, in which case the patch code
> > uses AccessShareLock, as you will see).
>
> Okay, but I see you release the lock on partition rel immediately
> after checking parallel-safety. What if a user added some
> parallel-unsafe constraint (via Alter Table) after that check?
>
> >

I'm not sure. But there would be a similar concern for the current
Parallel SELECT functionality, right?
My recollection is that ALTER TABLE obtains an exclusive lock on the
table which it retains until the end of the transaction, so that will
result in blocking at certain points, during parallel-safety checks and
execution, but there may still be a window.

> > 4)
> >
> > >domain_max_parallel_hazard_for_modify()
> > >{
> > >..
> > >+ if (isnull)
> > >+ {
> > >+ /*
> > >+ * This shouldn't ever happen, but if it does, log a WARNING
> > >+ * and return UNSAFE, rather than erroring out.
> > >+ */
> > >+ elog(WARNING, "null conbin for constraint %u", con->oid);
> > >+ context->max_hazard = PROPARALLEL_UNSAFE;
> > >+ break;
> > >+ }
> > >..
> > >}
> > >index_expr_max_parallel_hazard_for_modify()
> > >{
> > >..
> > >+ if (index_expr_item == NULL) /* shouldn't happen */
> > >+ {
> > >+ index_close(index_rel, lockmode);
> > >+ context->max_hazard = PROPARALLEL_UNSAFE;
> > >+ return context->max_hazard;
> > >+ }
> > >..
> > >}
> >
> > >It is not clear why the above two are shouldn't happen cases and if so
> > >why we want to treat them as unsafe. Ideally, there should be an
> > >Assert if these can't happen but it is difficult to decide without
> > >knowing why you have considered them unsafe?
> >
> > The checks being done here for "should never happen" cases are THE
> > SAME as other parts of the Postgres code.
> > For example, search Postgres code for "null conbin" and you'll find 6
> > other places in the Postgres code which actually ERROR out if conbin
> > (binary representation of the constraint) in a pg_constraint tuple is
> > found to be null.
> > The cases that you point out in the patch used to also error out in
> > the same way, but Vignesh suggested changing them to just return
> > parallel-unsafe instead of erroring-out, which I agree with.
> >
>
> You have not raised a WARNING for the second case.

The same checks in current Postgres code also don't raise a WARNING
for that case, so I'm just being consistent with existing Postgres
code (which itself isn't consistent for those two cases).

>But in the first
> place, what is the reasoning for making this different from other parts
> of the code? If we don't have a solid reason, then I suggest keeping these
> checks and errors the same as in other parts of the code.
>

The checks are the same as those in the existing Postgres source, but
instead of failing with an ERROR (i.e. the whole backend dies) in the
middle of parallel-safety checking, the code has been changed to regard
the operation as parallel-unsafe, so that it will try to execute in
non-parallel mode, where it will most likely fail too when those
corrupted attributes are accessed - but it will fail in the way that
it currently does in Postgres, should that very rare condition ever
happen. This was suggested by Vignesh, and I agree with him. So in
effect, it just allows the existing error paths in the code to be used,
rather than introducing new ERROR points.

Regards,
Greg Nancarrow
Fujitsu Australia




Re: Allow matching whole DN from a client certificate

2021-01-18 Thread Daniel Gustafsson
I've circled back to this and tested/read it more, and I'm still of the opinion
that it's a good feature addition. A few small comments on the patch:

+   is the default, the username is matched against the certificate's
In the docs we use "user name" instead of "username" in descriptive text.
There are a couple of instances of "username" in this added paragraph.

+   This option is probably best used in comjunction with a username map.
Typo: s/comjunction/conjunction/

+   /* use commas instead of slashes */
+   X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_SEP_COMMA_PLUS);
I don't have strong opinions on whether we should use slashes or commas, but I
think it needs to be documented that commas are required, since slashes are the
more common way to print a DN.  pg_stat_ssl and sslinfo also display the
DN with slashes.
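
For reference, the kind of setup this enables might look like this
(illustrative only; I've guessed the user-facing option spelling as
clientname=DN, and the exact syntax may differ in the final patch):

# pg_hba.conf: authenticate by certificate, matching the whole DN
# through a user name map
hostssl all all 0.0.0.0/0 cert clientname=DN map=dn_map

# pg_ident.conf: map any DN containing OU=engineering to one login role
dn_map  /OU=engineering,  engineer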

/* Make sure we have received a username in the certificate */
-   if (port->peer_cn == NULL ||
-   strlen(port->peer_cn) <= 0)
+   peer_username = port->hba->clientcertname == clientCertCN ? 
port->peer_cn : port->peer_dn;
Nitpickering nitpickery perhaps, but when we can inspect the DN and not just
the CN it seems a bit odd to talk about "username", which is again echoed in
the error message just below here.  Not sure what would be better though, since
"user information" doesn't really convey enough detail/meaning.

+   /* and extract the Common Name / Distinguished Name from it. */
Not introduced in this patch, but single line comments should not be punctuated
IIRC.

The patch still applies and tests pass, I'm moving this to ready for committer.

cheers ./daniel



search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread David Geier

Hi hackers,

While working with cursors that reference plans with CustomScanState
nodes, I encountered a segfault which originates from
search_plan_tree(). The query plan is the result of a simple SELECT
statement into which I inject a Custom Scan node at the root to do some
post-processing before returning rows. This plan is referenced by a
second plan with a Tid Scan which originates from a query of the form
DELETE FROM foo WHERE CURRENT OF my_cursor;


search_plan_tree() assumes that
CustomScanState::ScanState::ss_currentRelation is never NULL. In my
understanding, that only holds for CustomScanState nodes which are at the
bottom of the plan and actually read from a relation. CustomScanState
nodes which are not at the bottom don't have ss_currentRelation set. I
believe that for such nodes search_plan_tree() should instead recurse into
CustomScanState::custom_ps.


I attached a patch. Any thoughts?

Best regards,
David
Swarm64

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index f89319fcd8..0d5f09402b 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -326,7 +326,6 @@ search_plan_tree(PlanState *node, Oid table_oid,
 		case T_BitmapHeapScanState:
 		case T_TidScanState:
 		case T_ForeignScanState:
-		case T_CustomScanState:
 			{
 				ScanState  *sstate = (ScanState *) node;
 
@@ -335,6 +334,39 @@ search_plan_tree(PlanState *node, Oid table_oid,
 				break;
 			}
 
+			/*
+			 * Custom scan nodes can be leaf nodes or inner nodes and
+			 * therefore need special treatment.
+			 */
+		case T_CustomScanState:
+			{
+				CustomScanState *css = castNode(CustomScanState, node);
+				ScanState  *sstate = (ScanState *) node;
+
+				if (sstate->ss_currentRelation == NULL)		/* inner node */
+				{
+					ListCell   *lc;
+
+					foreach(lc, css->custom_ps)
+					{
+						ScanState  *elem = search_plan_tree((PlanState *) lfirst(lc),
+															table_oid,
+															pending_rescan);
+
+						if (!elem)
+							continue;
+						if (result)
+							return NULL;	/* multiple matches */
+						result = elem;
+					}
+				}
+				else	/* leaf node */
+				{
+					if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+						result = sstate;
+				}
+
+				break;
+			}
+
 			/*
 			 * For Append, we must look through the members; watch out for
 			 * multiple matches (possible if it was from UNION ALL)


Re: Single transaction in the tablesync worker?

2021-01-18 Thread Peter Smith
Hi Amit.

PSA the v16 patch for the Tablesync Solution1.

Main differences from v15:
+ Tablesync cleanups of DropSubscription/AlterSubscription_refresh are
re-implemented as a ProcessInterrupts function



Features:

* The tablesync slot is now permanent instead of temporary.

* The tablesync slot name is no longer tied to the Subscription slot name.

* The tablesync worker is now allowing multiple tx instead of single tx

* A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a
successful copy_table in tablesync's LogicalRepSyncTableStart.

* If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY
then it will bypass the initial copy_table phase.

* Now tablesync sets up replication origin tracking in
LogicalRepSyncTableStart (similar as done for the apply worker). The
origin is advanced when first created.

* Cleanup of tablesync resources:
- The tablesync slot cleanup (drop) code is added to the
process_syncing_tables_for_sync function.
- The tablesync replication origin tracking is cleaned up in
process_syncing_tables_for_apply.
- A tablesync function to clean up its own slot/origin is called from
ProcessInterrupts. This is indirectly invoked by
DropSubscription/AlterSubscription when they signal the tablesync
worker to stop.

* Updates to PG docs.

TODO / Known Issues:

* Race condition observed in "make check" may be related to this patch.

* Add test cases.

---

Please also see some test scenario logging which shows the new
tablesync cleanup function getting called as results of
Drop/AlterSUbscription.

---

Kind Regards,
Peter Smith.
Fujitsu Australia
TEST SCENARIO

Purpose: To observe that the patch can process cleanups caused by the tablesync
worker being killed during DropSubscription.

Note: The "!!>>" is extra logging added for testing, not a normal part of PG.

Steps:
1.  CREATE PUBLICATION for some table T1
2.  CREATE SUBSCRIPTION for that publication
3.  Pause the tablesync worker in CATCHUP state (so there would be a 
tablesync slot in need of cleanup)
4.  Show the slots 
5.  DROP SUBSCRIPTION
// this will signal SIGTERM the tablesync workers
6. Allow the tablesync worker to proceed.
// See if it gets into the interrupt cleanup function and drops the 
slots/origin as expected

==

##
## No slots to start with
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "select * from pg_replication_slots;"
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size 
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
(0 rows)


##
## Normal PUBLICATION of a table 
##

[postgres@CentOS7-x64 ~]$ psql -d test_pub -c "CREATE PUBLICATION tap_pub FOR 
TABLE test_tab;"
CREATE PUBLICATION


##
## Create subscription, and pause the tablesync process when gets to CATCHUP 
state
##

[postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION 
tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' 
PUBLICATION tap_pub;"
2021-01-18 18:15:55.951 AEDT [2068] LOG:  logical decoding found consistent 
point at 0/1614640
2021-01-18 18:15:55.951 AEDT [2068] DETAIL:  There are no running transactions.
2021-01-18 18:15:55.951 AEDT [2068] STATEMENT:  CREATE_REPLICATION_SLOT 
"tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT
NOTICE:  created replication slot "tap_sub" on publisher
CREATE SUBSCRIPTION
2021-01-18 18:15:55.975 AEDT [2069] LOG:  logical replication apply worker for 
subscription "tap_sub" has started
2021-01-18 18:15:55.976 AEDT [2069] LOG:  !!>> The apply worker process has PID 
= 2069
[postgres@CentOS7-x64 ~]$ 2021-01-18 18:15:55.984 AEDT [2074] LOG:  starting 
logical decoding for slot "tap_sub"
2021-01-18 18:15:55.984 AEDT [2074] DETAIL:  Streaming transactions committing 
after 0/1614678, reading WAL from 0/1614640.
2021-01-18 18:15:55.984 AEDT [2074] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-01-18 18:15:55.984 AEDT [2069] LOG:  !!>> apply worker: LogicalRepApplyLoop
2021-01-18 18:15:55.984 AEDT [2074] LOG:  logical decoding found consistent 
point at 0/1614640
2021-01-18 18:15:55.984 AEDT [2074] DETAIL:  There are no running transactions.
2021-01-18 18:15:55.984 AEDT [2074] STATEMENT:  START_REPLICATION SLOT 
"tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"')
2021-01-18 18:15:55.984 AEDT [2069] LOG:  !!>> apply worker: called 
process_syncing_tables
2021-01-18 18:15:55.988 AEDT [2077] LOG:  logical replication table 
synchronization worker for subscription "tap_sub", table "test_tab" has started
2021-01-18 18:15:55.988 AEDT [2077] LOG:  !!>> The tablesync worker process has 
PID = 2077
2021-01-18 18:15:55.989 AEDT [2077] LOG:  !!>>


Sleeping 30 secs. For debugging, attach to process 2077 now!

Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Victor Yegorov
пн, 18 янв. 2021 г. в 07:44, Amit Kapila :

> The summary of the above is that with Case-1 (clean-up based on hint
> received with the last item on the page) it takes fewer operations to
> cause a page split as compared to Case-2 (clean-up based on hint
> received with at least of the items on the page) for a mixed workload.
> How can we say that it doesn't matter?
>

I cannot understand this, perhaps there's a word missing in the brackets?..


Thinking of your proposal to undertake bottom-up deletion also on the
last-to-fit tuple insertion, I'd like to start with my understanding of
the design (a toy model of this ladder follows the list):

* we try to avoid index bloat, but we do it in the “lazy” way (for a
reason, see below)

* it means, that if there is enough space on the leaf page to fit new index
tuple,
  we just fit it there

* if there's not enough space, we first look at the presence of LP_DEAD
  tuples, and if they do exist, we scan the whole (index) page to re-check
  all tuples in order to find other, not-yet-marked-but-actually-being-LP_DEAD
  tuples and clean all those up.

* if still not enough space, only now do we try bottom-up-deletion. This is
  a heavy operation and can cause extra IO (tests do show this), therefore it
  is undertaken at the point when we can justify the extra IO against a leaf
  page split.

* if there's still no space after bottom-up-deletion, we perform
  deduplication (if possible)

* finally, we split the page.
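
To make the ordering concrete, here's a small compilable toy model of the
fallback ladder above (purely my own illustration; the page fields, byte
counts and helper functions are stand-ins, not PostgreSQL code):

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyPage
{
    int  free_bytes;
    bool has_lp_dead;   /* any items already marked LP_DEAD? */
    bool dup_hinted;    /* executor hinted logically-unchanged duplicates? */
} ToyPage;

static bool fits(const ToyPage *p, int need) { return p->free_bytes >= need; }

/* Stand-ins for the three space-reclaiming passes, cheapest first. */
static void simple_delete_pass(ToyPage *p)    { p->free_bytes += 100; }
static void bottom_up_delete_pass(ToyPage *p) { p->free_bytes += 50; }  /* may do heap IO */
static void deduplicate_pass(ToyPage *p)      { p->free_bytes += 20; }

static const char *
insert_path(ToyPage *p, int need)
{
    if (fits(p, need))
        return "plain insert";
    if (p->has_lp_dead)
        simple_delete_pass(p);          /* cheap: reuse LP_DEAD markings */
    if (!fits(p, need) && p->dup_hinted)
        bottom_up_delete_pass(p);       /* heavy: justified only against a split */
    if (!fits(p, need))
        deduplicate_pass(p);
    return fits(p, need) ? "insert after cleanup" : "page split";
}

int
main(void)
{
    ToyPage page = {10, false, true};

    printf("%s\n", insert_path(&page, 40)); /* bottom-up pass avoids the split */
    return 0;
}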

Should we try a bottom-up-deletion pass in a situation where we're inserting
the last possible tuple into the page (enough space for the *current* insert,
but likely no space for the next), then (among others) the following
possibilities exist:

- no new tuples ever come to this page anymore (happy accident), which means
  that we've wasted IO cycles

- we undertake a bottom-up-deletion pass without success and we're asked to
  insert a new tuple into this fully packed index page. This can unroll to:
  - due to the delay, we've managed to find some space either due to LP_DEAD
    or bottom-up-cleanup, which means we've wasted IO cycles on the previous
    iteration (too early an attempt).
  - we still failed to find any space and are forced to split the page.
    In this case we've wasted IO cycles twice.

In my view these cases where we generate wasted IO cycles (producing no
benefit) should be avoided.
And this is the main reason for the current approach.

Again, this is my understanding and I hope I got it right.


-- 
Victor Yegorov


Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Ashutosh Bapat
On Mon, Jan 18, 2021 at 4:13 PM David Geier  wrote:
>
> Hi hackers,
>
> While working with cursors that reference plans with CustomScanStates
> nodes, I encountered a segfault which originates from
> search_plan_tree(). The query plan is the result of a simple SELECT
> statement into which I inject a Custom Scan node at the root to do some
> post-processing before returning rows. This plan is referenced by a
> second plan with a Tid Scan which originates from a query of the form
> DELETE FROM foo WHERE CURRENT OF my_cursor;
>
> search_plan_tree() assumes that
> CustomScanState::ScanState::ss_currentRelation is never NULL. In my
> understanding that only holds for CustomScanState nodes which are at the
> bottom of the plan and actually read from a relation. CustomScanState
> nodes which are not at the bottom don't have ss_currentRelation set. I
> believe for such nodes, instead search_plan_tree() should recurse into
> CustomScanState::custom_ps.
>
> I attached a patch. Any thoughts?

I don't have any comments about your patch as such, but ForeignScan is
similar to CustomScan. ForeignScan can also leave ss_currentRelation
NULL if it represents a join between two foreign tables. So either
ForeignScan has the same problem as CustomScan (it's just above the
CustomScan case in search_plan_tree()) or it's handling it in some
other way. In the first case we may want to fix that too in the same
manner (not necessarily in the same patch), and in the latter case
CustomScan can handle it the same way.

That said, I didn't notice any field in ForeignScan parallel
to custom_ps, so what you are proposing is still needed.

-- 
Best Wishes,
Ashutosh Bapat




Re: [PATCH] Add extra statistics to explain for Nested Loop

2021-01-18 Thread Michael Christofides
> New version of this patch prints extra statistics for all cases of
> multiple loops, not only for Nested Loop. Also I fixed the example by
> adding VERBOSE.
>
> Please don't hesitate to share any thoughts on this topic!

Thanks a lot for working on this! I really like the extra details, and
including it only with VERBOSE sounds good.

> rows * loops is still an important calculation.
>
> Why not just add total_rows while we are at it - last in the listing?
>
> (actual rows=N loops=N min_rows=N max_rows=N total_rows=N)

This total_rows idea from David would really help us too, especially
in the cases where the actual rows is rounded down to zero. We make an
explain visualisation tool, and it'd be nice to show people a better
total than loops * actual rows. It would also help the accuracy of
some of our tips, that use this number.

Apologies if this input is too late to be helpful.

Cheers,
Michael




Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple

2021-01-18 Thread Bharath Rupireddy
On Mon, Jan 18, 2021 at 1:39 PM japin  wrote:
>
>
> On Mon, 18 Jan 2021 at 15:59, Bharath Rupireddy 
>  wrote:
> > On Mon, Jan 18, 2021 at 1:16 PM japin  wrote:
> >>
> >>
> >> Hi,
> >>
> >> I find that the outputstr variable in logicalrep_write_tuple() only use in
> >> `else` branch, I think we can narrow the scope, just like variable 
> >> outputbytes
> >> in `if` branch (for more readable).
> >>
> >> /*
> >>  * Send in binary if requested and type has suitable send function.
> >>  */
> >> if (binary && OidIsValid(typclass->typsend))
> >> {
> >> bytea  *outputbytes;
> >> int len;
> >>
> >> pq_sendbyte(out, LOGICALREP_COLUMN_BINARY);
> >> outputbytes = OidSendFunctionCall(typclass->typsend, 
> >> values[i]);
> >> len = VARSIZE(outputbytes) - VARHDRSZ;
> >> pq_sendint(out, len, 4);/* length */
> >> pq_sendbytes(out, VARDATA(outputbytes), len);   /* data */
> >> pfree(outputbytes);
> >> }
> >> else
> >> {
> >> pq_sendbyte(out, LOGICALREP_COLUMN_TEXT);
> >> outputstr = OidOutputFunctionCall(typclass->typoutput, 
> >> values[i]);
> >> pq_sendcountedtext(out, outputstr, strlen(outputstr), false);
> >> pfree(outputstr);
> >> }
> >>
> >> Attached is a samll patch to fix it.
> >
> > +1. Binary mode uses block level variable outputbytes, so making
> > outputstr block level is fine IMO.
> >
> > Patch basically looks good to me, but it doesn't apply on my system.
> > Looks like it's not created with git commit. Please create the patch
> > with git commit command.
> >
> > git apply 
> > /mnt/hgfs/Shared/narrow-the-scope-of-the-variable-in-logicalrep_write_tuple.patch
> > error: corrupt patch at line 10
> >
>
> Thanks for reviewing! Attached v2 as you suggested.

Thanks. v2 patch LGTM.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: ResourceOwner refactoring

2021-01-18 Thread Heikki Linnakangas

On 18/01/2021 09:49, kuroda.hay...@fujitsu.com wrote:

Dear Heikki,

I apologize for sending again.


I will check another ResourceOwnerEnlarge() if I have a time.


I checked all ResourceOwnerEnlarge() types after applying patch 0001.
(searched by "grep -rI ResourceOwnerEnlarge")
No problem was found.


Great, thanks!

Here are more details on the performance testing I did:

Unpatched
--

postgres=# \i snaptest.sql
 numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
       0 |        1 |         38.2 |         34.8
       0 |        5 |         35.7 |         35.4
       0 |       10 |         37.6 |         37.6
       0 |       60 |         35.4 |         39.2
       0 |       70 |         55.0 |         53.8
       0 |      100 |         48.6 |         48.9
       0 |     1000 |         54.8 |         57.0
       0 |    10000 |         63.9 |         67.1
       9 |       10 |         39.3 |         38.8
       9 |      100 |         44.0 |         42.5
       9 |     1000 |         45.8 |         47.1
       9 |    10000 |         64.2 |         66.8
      65 |       70 |         45.7 |         43.7
      65 |      100 |         42.9 |         43.7
      65 |     1000 |         46.9 |         45.7
      65 |    10000 |         65.0 |         64.5
(16 rows)


Patched


postgres=# \i snaptest.sql
 numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
       0 |        1 |         35.3 |         33.8
       0 |        5 |         34.8 |         34.6
       0 |       10 |         49.8 |         51.4
       0 |       60 |         53.1 |         57.2
       0 |       70 |         53.2 |         59.6
       0 |      100 |         53.1 |         58.2
       0 |     1000 |         63.1 |         69.7
       0 |    10000 |         83.3 |         83.5
       9 |       10 |         35.0 |         35.1
       9 |      100 |         55.4 |         54.1
       9 |     1000 |         56.8 |         60.3
       9 |    10000 |         88.6 |         82.0
      65 |       70 |         36.4 |         35.1
      65 |      100 |         52.4 |         53.0
      65 |     1000 |         55.8 |         59.4
      65 |    10000 |         79.3 |         80.3
(16 rows)

About the test:

Each test calls Register/UnregisterSnapshot in a loop. numsnaps is the 
number of snapshots that are registered in each iteration. For example, 
on the second line with numkeep=0 and numsnaps=5, each iteration in the 
LIFO test does essentially:


rs[0] = RegisterSnapshot()
rs[1] = RegisterSnapshot()
rs[2] = RegisterSnapshot()
rs[3] = RegisterSnapshot()
rs[4] = RegisterSnapshot()
UnregisterSnapshot(rs[4]);
UnregisterSnapshot(rs[3]);
UnregisterSnapshot(rs[2]);
UnregisterSnapshot(rs[1]);
UnregisterSnapshot(rs[0]);

In the FIFO tests, the Unregister calls are made in reverse order.

When numkeep is non-zero, that many snapshots are registered once at the 
beginning of the test, and released only after the test loop. So this 
simulates the scenario that you remember a bunch of resources and hold 
them for a long time, and while holding them, you do many more 
remember/forget calls.


Based on this test, this patch makes things slightly slower overall. 
However, *not* in the cases that matter the most I believe. That's the 
cases where the (numsnaps - numkeep) is small, so that the hot action 
happens in the array and the hashing is not required. In my testing 
earlier, about 95% of the ResourceOwnerRemember calls in the regression 
tests fall into that category.


There are a few simple things we could do speed this up, if needed. A 
different hash function might be cheaper, for example. And we could 
inline the fast paths of the ResourceOwnerEnlarge and 
ResourceOwnerRemember() into the callers. But I didn't do those things 
as part of this patch, because it wouldn't be a fair comparison; we 
could do those with the old implementation too. And unless we really 
need the speed, it's more readable this way.


I'm OK with these results. Any objections?

Attached are the patches again. Same as previous version, except for a 
couple of little comment changes. And a third patch containing the 
source for the performance test.


- Heikki
>From ca5ef92d30d6d2d77e6758da3ae60d075451d5c1 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 13 Jan 2021 12:21:28 +0200
Subject: [PATCH v5 1/2] Move a few ResourceOwnerEnlarge() calls for safety and
 clarity.

These are functions where quite a lot of things happen between the
ResourceOwnerEnlarge and ResourceOwnerRemember calls. It's important that
there are no unrelated ResourceOwnerRemember() calls in the code
in between, otherwise the entry reserved by the ResourceOwnerEnlarge() call
might be used up by the intervening ResourceOwnerRemember() and not be
available at the intended ResourceOwnerRemember() call anymore. The longer
the code path between them is, the harder it is to verify that.

In bu

Re: Yet another fast GiST build

2021-01-18 Thread Heikki Linnakangas

On 18/01/2021 01:10, Peter Geoghegan wrote:

On Sun, Jan 17, 2021 at 3:04 PM Peter Geoghegan  wrote:

I personally agree with you - it's not like there aren't other ways
for superusers to crash the server (most of which seem very similar to
this gist_page_items() issue, in fact). I just think that it's worth
being clear about that being a trade-off that we've accepted.


Can we rename gist_page_items_bytea() to gist_page_items(), and at the
same time rename the current gist_page_items() -- perhaps call it
gist_page_items_output()?

That way we could add a bt_page_items_output() function later, while
leaving everything consistent (actually not quite, since
bt_page_items() outputs text instead of bytea -- but that seems worth
fixing too). This also has the merit of making the unsafe "output"
variant into the special case.


bt_page_items() and bt_page_items_bytea() exist already. And 
brin_page_items() also calls the output functions (there's no bytea 
version of that). Perhaps it would've been better to make the 
bytea-variants the default, but I'm afraid that ship has already sailed.


We're not terribly consistent; heap_page_items(), hash_page_items() and 
gin_page_items() don't attempt to call the output functions.


Then again, I don't think we need to worry much about backwards 
compatibility in pageinspect, so I guess we could rename them all. It 
doesn't bother me enough to make me do it, but I won't object if you 
want to.


- Heikki




simplifying foreign key/RI checks

2021-01-18 Thread Amit Langote
While discussing the topic of foreign key performance off-list with
Robert and Corey (also came up briefly on the list recently [1], [2]),
a few ideas were thrown around to simplify our current system of RI
checks to enforce foreign keys with the aim of reducing some of its
overheads.  The two main aspects of how we do these checks that
seemingly cause the most overhead are:

* Using row-level triggers that are fired during the modification of
the referencing and the referenced relations to perform them

* Using plain SQL queries issued over SPI

There is a discussion nearby titled "More efficient RI checks - take
2" [2] to address this problem from the viewpoint that it is using
row-level triggers that causes the most overhead, although there are
some posts mentioning that SQL-over-SPI is not without blame here.  I
decided to focus on the latter aspect and tried reimplementing some
checks such that SPI can be skipped altogether.

I started with the check that's performed when inserting into or
updating the referencing table to confirm that the new row points to a
valid row in the referenced relation.  The corresponding SQL is this:

SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x

$1 is the value of the foreign key of the new row.  If the query
returns a row, all good.  Thanks to SPI, or its use of plan caching,
the query is re-planned only a handful of times before making a
generic plan that is then saved and reused, which looks like this:

  QUERY PLAN
--
 LockRows
   ->  Index Scan using pk_pkey on pk x
 Index Cond: (a = $1)
(3 rows)

So in most cases, the trigger's function would only execute the plan
that's already there, at least in a given session.  That's good but
what we realized would be even better is if we didn't have to
"execute" a full-fledged "plan" for this, that is, to simply find out
whether a row containing the key we're looking for exists in the
referenced relation and, if found, lock it.  Directly scanning the index
and locking the found row with table_tuple_lock(), like ExecLockRows()
does, gives us exactly that behavior, which seems simple enough to be
done in a not-so-long local function in ri_triggers.c.  I gave that a
try and came up with the attached.  It also takes care of the case
where the referenced relation is partitioned, in which case the
appropriate leaf partition's index is scanned.
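
For concreteness, the core of such a function could look roughly like this
(my sketch, not the attached patch; error handling, partition routing and
the per-type lookup of the equality operator are all elided, and F_INT8EQ
is just a placeholder for that lookup):

#include "postgres.h"

#include "access/genam.h"
#include "access/tableam.h"
#include "access/xact.h"
#include "executor/tuptable.h"
#include "utils/fmgroids.h"
#include "utils/snapmgr.h"

/*
 * Probe pk_idx for keyval and, if a visible row is found, lock it with
 * FOR KEY SHARE semantics -- no SPI, no plan.
 */
static bool
ri_key_exists_and_lock(Relation pk_rel, Relation pk_idx, Datum keyval)
{
    ScanKeyData skey;
    IndexScanDesc scan;
    TupleTableSlot *slot;
    bool        found = false;

    /* F_INT8EQ stands in for looking up the PK type's equality proc */
    ScanKeyInit(&skey, 1, BTEqualStrategyNumber, F_INT8EQ, keyval);

    slot = table_slot_create(pk_rel, NULL);
    scan = index_beginscan(pk_rel, pk_idx, GetActiveSnapshot(), 1, 0);
    index_rescan(scan, &skey, 1, NULL, 0);

    if (index_getnext_slot(scan, ForwardScanDirection, slot))
    {
        TM_FailureData tmfd;

        /* the equivalent of the FOR KEY SHARE row mark */
        found = (table_tuple_lock(pk_rel, &slot->tts_tid,
                                  GetActiveSnapshot(), slot,
                                  GetCurrentCommandId(false),
                                  LockTupleKeyShare, LockWaitBlock,
                                  0, &tmfd) == TM_Ok);
    }

    index_endscan(scan);
    ExecDropSingleTupleTableSlot(slot);
    return found;
}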

The patch results in ~2x improvement in the performance of inserts and
updates on referencing tables:

create table p (a numeric primary key);
insert into p select generate_series(1, 100);
create table f (a bigint references p);

-- unpatched
insert into f select generate_series(1, 200, 2);
INSERT 0 100
Time: 6340.733 ms (00:06.341)

update f set a = a + 1;
UPDATE 100
Time: 7490.906 ms (00:07.491)

-- patched
insert into f select generate_series(1, 200, 2);
INSERT 0 100
Time: 3340.808 ms (00:03.341)

update f set a = a + 1;
UPDATE 100
Time: 4178.171 ms (00:04.178)

The improvement is even more dramatic when the referenced table (that
we're no longer querying over SPI) is partitioned.  Here are the
numbers when the PK relation has 1000 hash partitions.

Unpatched:

insert into f select generate_series(1, 200, 2);
INSERT 0 100
Time: 35898.783 ms (00:35.899)

update f set a = a + 1;
UPDATE 100
Time: 37736.294 ms (00:37.736)

Patched:

insert into f select generate_series(1, 200, 2);
INSERT 0 100
Time: 5633.377 ms (00:05.633)

update f set a = a + 1;
UPDATE 100
Time: 6345.029 ms (00:06.345)

That's over ~5x improvement!

While the above case seemed straightforward enough for skipping SPI,
it seems a bit hard to do the same for other cases where we query the
*referencing* relation during an operation on the referenced table
(for example, checking if the row being deleted is still referenced),
because the plan in those cases is not predictably an index scan.
Also, the filters in those queries are more than likely to not match
the partition key of a partitioned referencing relation, so all
partitions will have to scanned.  I have left those cases as future
work.

The patch seems simple enough to consider for inclusion in v14 unless
of course we stumble into some dealbreaker(s).  I will add this to
March CF.

-- 
Amit Langote
EDB: http://www.enterprisedb.com

[1] 
https://www.postgresql.org/message-id/CADkLM%3DcTt_8Fg1Jtij5j%2BQEBOxz9Cuu4DiMDYOwdtktDAKzuLw%40mail.gmail.com

[2] https://www.postgresql.org/message-id/1813.1586363881%40antos


v1-0001-Export-get_partition_for_tuple.patch
Description: Binary data


v1-0002-Avoid-using-SPI-for-some-RI-checks.patch
Description: Binary data


Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Amit Kapila
On Mon, Jan 18, 2021 at 5:11 PM Victor Yegorov  wrote:
>
> пн, 18 янв. 2021 г. в 07:44, Amit Kapila :
>>
>> The summary of the above is that with Case-1 (clean-up based on hint
>> received with the last item on the page) it takes fewer operations to
>> cause a page split as compared to Case-2 (clean-up based on hint
>> received with at least of the items on the page) for a mixed workload.
>> How can we say that it doesn't matter?
>
>
> I cannot understand this, perhaps there's a word missing in the brackets?..
>

There is a missing word (one) in Case-2, let me write it again to
avoid any confusion. Case-2 (clean-up based on hint received with at
least one of the items on the page).

>
> Thinking of your proposal to undertake bottom-up deletion also on the 
> last-to-fit tuple insertion,
>

I think there is some misunderstanding between what I am trying to say
and what you have concluded from it. See below.

> I'd like to start with my understanding of the design:
>
> * we try to avoid index bloat, but we do it in the “lazy” way (for a reason, 
> see below)
>
> * it means, that if there is enough space on the leaf page to fit new index 
> tuple,
>   we just fit it there
>
> * if there's not enough space, we first look at the presence of LP_DEAD
>   tuples, and if they do exist, we scan the whole (index) page to re-check
>   all tuples in order to find other, not-yet-marked-but-actually-being-LP_DEAD
>   tuples and clean all those up.
>
> * if still not enough space, only now do we try bottom-up-deletion. This is
>   a heavy operation and can cause extra IO (tests do show this), therefore it
>   is undertaken at the point when we can justify the extra IO against a leaf
>   page split.
>
> * if no space also after bottom-up-deletion, we perform deduplication (if 
> possible)
>
> * finally, we split the page.
>
> Should we try a bottom-up-deletion pass in a situation where we're inserting
> the last possible tuple into the page (enough space for the *current* insert,
> but likely no space for the next),
>

I am saying that we try bottom-up deletion when the new insert item
didn't find enough space on the page and there was previously some
indexUnchanged tuple(s) inserted into that page. Of course, like now
it will be attempted after an attempt to remove LP_DEAD items.

> then (among others) the following possibilities exist:
>
> - no new tuples ever come to this page anymore (happy accident), which means
>   that we've wasted IO cycles
>
> - we undertake a bottom-up-deletion pass without success and we're asked to
>   insert a new tuple into this fully packed index page. This can unroll to:
>   - due to the delay, we've managed to find some space either due to LP_DEAD
>     or bottom-up-cleanup, which means we've wasted IO cycles on the previous
>     iteration (too early an attempt).
>   - we still failed to find any space and are forced to split the page.
>     In this case we've wasted IO cycles twice.
>
> In my view these cases where we generate wasted IO cycles (producing no
> benefit) should be avoided.
> And this is the main reason for the current approach.

I don't think any of these can happen in what I am actually saying. Do
you still have the same feeling after reading this email? Off-hand, I
don't see any downsides as compared to the current approach and it
will have fewer splits in some other workloads like when there is a
mix of inserts and non-HOT updates (that don't logically change the
index).

-- 
With Regards,
Amit Kapila.




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Fujii Masao




On 2021/01/18 15:37, Bharath Rupireddy wrote:

On Mon, Jan 18, 2021 at 11:58 AM Fujii Masao
 wrote:




On 2021/01/18 15:02, Hou, Zhijie wrote:

We need to create the loopback3 with user mapping public, otherwise the
test might become unstable as shown below. Note that loopback and
loopback2 are not dropped in the test, so no problem with them.

   ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off');  DROP
SERVER loopback3 CASCADE;
   NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to user mapping for postgres on server loopback3
+DETAIL:  drop cascades to user mapping for bharath on server loopback3

Attaching v2 patch for postgres_fdw_get_connections. Please have a look.

Hi

I have a comment for the doc about postgres_fdw_get_connections.

+postgres_fdw_get_connections(OUT server_name text, OUT valid boolean) 
returns setof record
+
+ 
+  This function returns the foreign server names of all the open
+  connections that postgres_fdw established from
+  the local session to the foreign servers. It also returns whether
+  each connection is valid or not. false is returned
+  if the foreign server connection is used in the current local
+  transaction but its foreign server or user mapping is changed or
+  dropped, and then such invalid connection will be closed at
+  the end of that transaction. true is returned
+  otherwise. If there are no open connections, no record is returned.
+  Example usage of the function:

The doc does not seem to mention the case when the function returns NULL in 
server_name.
Users may be a little confused about why NULL was returned.


Yes, so what about adding

  (Note that the returned server name of invalid connection is NULL if its 
server is dropped)

into the following (just after "dropped")?

+  if the foreign server connection is used in the current local
+  transaction but its foreign server or user mapping is changed or
+  dropped

Or better description?


+1 to add it after "dropped (Note )", how about as follows
with slight changes?

dropped (Note that server name of an invalid connection can be NULL if
the server is dropped), and then such .


Yes, I like this one. One question: wouldn't "should" or "is" be better than
"can" in this case, because the server name of an invalid connection is
always NULL when its server is dropped?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: BEFORE ROW triggers for partitioned tables

2021-01-18 Thread 李杰(慎追)
Hi committer,

Many of our customers expect to use BR triggers on partitioned tables.
After following your discussion, I also checked your patch.
Here are two questions that confuse me:

1. Your modification removes the check forbidding BR triggers on partitioned
tables, and a more friendly error message is added to ExecInsert and
ExecUpdate. You are correct that ExecInsert does not reroute partitions.
However, when ExecUpdate modifies partition keys, it is almost equivalent
to ExecDelete plus ExecInsert, and the tuple is re-routed
(ExecPrepareTupleRouting) once before ExecInsert. Therefore,
why should an error be thrown in ExecUpdate?
Let's look at a case : 
```
postgres=# create table parted (a int, b int, c text) partition by list 
(a);
CREATE TABLE
postgres=# create table parted_1 partition of parted for values in (1);
CREATE TABLE
postgres=# create table parted_2 partition of parted for values in (2);
CREATE TABLE
postgres=# create function parted_trigfunc() returns trigger 
language plpgsql as $$
 begin
   new.a = new.a + 1;
   return new;
  end;
 $$;
CREATE FUNCTION
postgres=# insert into parted values (1, 1, 'uno uno v1'); 
INSERT 0 1
postgres=# create trigger t before update on parted
   for each row execute function parted_trigfunc();
CREATE TRIGGER
postgres=# update parted set c = c || 'v3'; 
   ```
If there were no check in ExecUpdate,
the above UPDATE SQL would execute successfully.
However, with your code, this will fail.
So, what is the reasoning behind your choice?

2. In this patch, you only removed the restriction on BR triggers against
the partitioned table, but did not fundamentally solve the problem caused
by modifying partition keys in a BR trigger. What are the difficulties in
solving this problem fundamentally? We plan to implement it.
Can you give us some suggestions?


--
From: Alvaro Herrera 
Sent: Monday, January 18, 2021 20:36
To: Ashutosh Bapat 
Cc: Ashutosh Bapat ; Pg Hackers 

Subject: Re: BEFORE ROW triggers for partitioned tables

Thanks for the reviews; I have pushed it now.

(I made the doc edits you mentioned too.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Bharath Rupireddy
On Mon, Jan 18, 2021 at 6:17 PM Fujii Masao  wrote:
> > +1 to add it after "dropped (Note )", how about as follows
> > with slight changes?
> >
> > dropped (Note that server name of an invalid connection can be NULL if
> > the server is dropped), and then such .
>
> Yes, I like this one. One question is; "should" or "is" is better than
> "can" in this case because the server name of invalid connection is
> always NULL when its server is dropped?

I think "dropped (Note that server name of an invalid connection will
be NULL if the server is dropped), and then such ."

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2021-01-18 Thread Antonin Houska
Dmitry Dolgov <9erthali...@gmail.com> wrote:

> Thanks for the updated patch. As I've mentioned off the list I'm slowly
> looking through it with the intent to concentrate on undo progress
> tracking. But before I post anything I want to mention a couple of
> strange issues I see, otherwise I will forget for sure. Maybe it's
> already known, but running 'make installcheck' several times against a
> freshly built postgres with the patch applied, from time to time I
> observe various errors.
> 
> This one happens on a crash recovery, seems like
> UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in
> the replay process:
> 
> TRAP: FailedAssertion("page_offset + this_page_bytes <= 
> uph->ud_insertion_point", File: "undopage.c", Line: 300)
> postgres: startup recovering 
> 00010012(ExceptionalCondition+0xa1)[0x558b38b8a350]
> postgres: startup recovering 
> 00010012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e]
> postgres: startup recovering 
> 00010012(UndoReplay+0xa1d)[0x558b38766f32]
> postgres: startup recovering 
> 00010012(XactUndoReplay+0x77)[0x558b38769281]
> postgres: startup recovering 
> 00010012(smgr_redo+0x1af)[0x558b387aa7bd]
> 
> This one is somewhat similar:
> 
> TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: 
> "undopage.c", Line: 287)
> postgres: undo worker for database 36893 
> (ExceptionalCondition+0xa1)[0x5559c90f1350]
> postgres: undo worker for database 36893 
> (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3]
> postgres: undo worker for database 36893 
> (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008]
> postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989]

Well, on repeated run of the test I could also hit the first one. I could fix
it and will post a new version of the patch (along with some other small
changes) this week.

> There are also occasional messages about undo files not being found:
> 
> ERROR:  cannot open undo segment file 'base/undo/08.02': No 
> such file or directory
> WARNING:  failed to undo transaction

I don't see this one in the log so far, will try again.

Thanks for the report!

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: list of extended statistics on psql

2021-01-18 Thread Tomas Vondra




On 1/18/21 8:31 AM, Tatsuro Yamada wrote:

Hi Tomas and Shinoda-san,

On 2021/01/17 23:31, Tomas Vondra wrote:



On 1/17/21 3:01 AM, Tomas Vondra wrote:

On 1/17/21 2:41 AM, Shinoda, Noriyoshi (PN Japan FSIP) wrote:

Hi, hackers.

I tested this committed feature.
It doesn't seem to be available to non-superusers due to the 
inability to access pg_statistic_ext_data.

Is this the expected behavior?



Ugh. I overlooked testing the case where the user doesn't have superuser 
privilege. A user without the privilege was able to access pg_statistic_ext, 
therefore I supposed that it's also able to access pg_statistic_ext_data. Oops.




Hmmm, that's a good point. Bummer we haven't noticed that earlier.

I wonder what the right fix should be - presumably we could do 
something like pg_stats_ext (we can't use that view directly, because 
it formats the data, so the sizes are different).


But should it list just the stats the user has access to, or should 
it list everything and leave the inaccessible fields NULL?




I've reverted the commit - once we find the right way to handle this, 
I'll get it committed again.


As for how to deal with this, I can think of about three ways:

1) simplify the command to only print information from 
pg_statistic_ext (so no information about which stats are built or their sizes)


2) extend pg_stats_ext with necessary information (e.g. sizes)

3) create a new system view, with necessary information (so that 
pg_stats_ext does not need to be modified)


4) add functions returning the necessary information, possibly only 
for statistics the user can access (similarly to what pg_stats_ext does)


Options 2-4 have the obvious disadvantage that this won't work on 
older releases (we can't add views or functions there). So I'm leaning 
towards #1 even if that means we have to remove some of the details. 
We can consider adding that for new releases, though.



Thanks for the useful advice. I'll go with option 1).
The following query was created using pg_stats_ext instead of 
pg_statistic_ext and pg_statistic_ext_data. However, I was confused
about writing the part of the query that calculates the MCV size, because
there are four columns related to MCV: most_common_vals, 
most_common_val_nulls, most_common_freqs, and most_common_base_freqs.

Currently, I don't know how to calculate the size of the MCV using these
four columns. Thoughts? :-)


Well, my suggestion was to use pg_statistic_ext, because that lists all 
statistics, while pg_stats_ext filters statistics depending on 
access privileges. I think that's more appropriate for \dX; the contents 
should not change depending on the user.
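
Roughly something like this, for illustration (an untested sketch of a
\dX-style listing that touches only pg_statistic_ext, so the result set
is the same for every user; the size columns are simply left out):

SELECT es.stxnamespace::pg_catalog.regnamespace AS "Schema",
       es.stxname AS "Name",
       es.stxrelid::pg_catalog.regclass AS "Table",
       'd' = ANY (es.stxkind) AS "Ndistinct",
       'f' = ANY (es.stxkind) AS "Dependencies",
       'm' = ANY (es.stxkind) AS "MCV"
FROM pg_catalog.pg_statistic_ext es
ORDER BY 1, 2;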


Also, let me clarify - with option (1) we'd not show the sizes at all. 
The size of the formatted statistics may be very different from the 
on-disk representation, so I see no point in showing it in \dX.


We might show other stats (e.g. number of MCV items, or the fraction of 
data represented by the MCV list), but the user can inspect pg_stats_ext 
if needed.


What we might do is show those stats when a superuser is running this 
command, but I'm not sure that's a good idea (or how difficult it would 
be to implement).



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Added schema level support for publication.

2021-01-18 Thread vignesh C
On Sat, Jan 9, 2021 at 5:21 PM vignesh C  wrote:
>
> On Fri, Jan 8, 2021 at 4:32 PM Amit Kapila  wrote:
> >
> > On Thu, Jan 7, 2021 at 10:03 PM vignesh C  wrote:
> > >
> > > This feature adds schema option while creating publication. Users will
> > > be able to specify one or more schemas while creating publication,
> > > when the user specifies schema option, then the data changes for the
> > > tables present in the schema specified by the user will be replicated
> > > to the subscriber. Few examples have been listed below:
> > >
> > > Create a publication that publishes all changes for all the tables
> > > present in production schema:
> > > CREATE PUBLICATION production_publication FOR ALL TABLES SCHEMA 
> > > production;
> > >
> > > Create a publication that publishes all changes for all the tables
> > > present in marketing and sales schemas:
> > > CREATE PUBLICATION sales_publication FOR ALL TABLES SCHEMA marketing, 
> > > sales;
> > >
> > > Add some schemas to the publication:
> > > ALTER PUBLICATION sales_publication ADD SCHEMA marketing_june, sales_june;
> > >
> > > Drop some schema from the publication:
> > > ALTER PUBLICATION production_quarterly_publication DROP SCHEMA 
> > > production_july;
> > >
> > > Attached is a POC patch for the same. I felt this feature would be quite 
> > > useful.
> > >
> >
> > What do we do if the user Drops the schema? Do we automatically remove
> > it from the publication?
> >
> I have not yet handled this scenario yet, I will handle this and
> adding of tests in the next patch.
>

I have handled the above scenario (drop schema should automatically
remove the schema entry from the publication schema relation) and added
tests in the new v2 patch attached.
Thoughts?


Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From d14e47368091d81f22ab11b94ba0716e0d918471 Mon Sep 17 00:00:00 2001
From: Vignesh C 
Date: Mon, 18 Jan 2021 18:08:51 +0530
Subject: [PATCH v2]  Added schema level support for publication.

This patch adds schema level support for publication.  Users can specify multiple
schemas with the schema option. When the user specifies the schema option, the
tables present in the specified schemas will be selected by the publisher for
sending the data to the subscriber.
---
 doc/src/sgml/ref/alter_publication.sgml  |  45 +++-
 doc/src/sgml/ref/create_publication.sgml |  31 ++-
 src/backend/catalog/Makefile |   4 +-
 src/backend/catalog/aclchk.c |   2 +
 src/backend/catalog/dependency.c |   9 +
 src/backend/catalog/objectaddress.c  | 138 +
 src/backend/catalog/pg_publication.c | 134 +++-
 src/backend/commands/alter.c |   1 +
 src/backend/commands/event_trigger.c |   4 +
 src/backend/commands/publicationcmds.c   | 266 +++-
 src/backend/commands/seclabel.c  |   1 +
 src/backend/commands/tablecmds.c |   1 +
 src/backend/parser/gram.y|  76 ---
 src/backend/utils/cache/syscache.c   |  23 +++
 src/bin/pg_dump/common.c |   3 +
 src/bin/pg_dump/pg_backup_archiver.c |   3 +-
 src/bin/pg_dump/pg_dump.c| 155 +-
 src/bin/pg_dump/pg_dump.h|  17 ++
 src/bin/pg_dump/pg_dump_sort.c   |   7 +
 src/bin/psql/describe.c  | 110 +-
 src/include/catalog/dependency.h |   1 +
 src/include/catalog/pg_publication.h |  16 +-
 src/include/catalog/pg_publication_schema.h  |  49 +
 src/include/commands/publicationcmds.h   |   1 +
 src/include/nodes/parsenodes.h   |   3 +
 src/include/utils/syscache.h |   2 +
 src/test/regress/expected/object_address.out |   6 +-
 src/test/regress/expected/publication.out| 298 ++-
 src/test/regress/expected/sanity_check.out   |   1 +
 src/test/regress/sql/object_address.sql  |   3 +
 src/test/regress/sql/publication.sql |  87 +++-
 31 files changed, 1401 insertions(+), 96 deletions(-)
 create mode 100644 src/include/catalog/pg_publication_schema.h

diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml
index faa114b..d6199af 100644
--- a/doc/src/sgml/ref/alter_publication.sgml
+++ b/doc/src/sgml/ref/alter_publication.sgml
@@ -24,6 +24,9 @@ PostgreSQL documentation
 ALTER PUBLICATION name ADD TABLE [ ONLY ] table_name [ * ] [, ...]
 ALTER PUBLICATION name SET TABLE [ ONLY ] table_name [ * ] [, ...]
 ALTER PUBLICATION name DROP TABLE [ ONLY ] table_name [ * ] [, ...]
+ALTER PUBLICATION name ADD SCHEMA schema_name [, ...]
+ALTER PUBLICATION name SET SCHEMA schema_name [, ...]
+ALTER PUBLICATION name DROP SCHEMA schema_name [, ...]
 ALTER PUBLICATION name SET ( publication_parameter [= value] [, ... ] )
 ALTER PUBLICATION name OWNER TO { new_owner | CURRENT_ROLE | CURRENT_USER | SESSION_USER }
 ALTER PUBLICATION name RENAME TO new

Re: [PATCH] ProcessInterrupts_hook

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 3:00 AM Craig Ringer
 wrote:
> A few times lately I've been doing things in extensions that've made me want 
> to be able to run my own code whenever InterruptPending is true and 
> CHECK_FOR_INTERRUPTS() calls ProcessInterrupts()

I've wanted this in the past, too, so +1 from me.
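
For anyone skimming the thread: extension usage would presumably follow
the familiar hook idiom, along these lines (hypothetical names -- the
typedef and the pending flag are mine, assuming the patch mirrors
ProcessUtility_hook with a standard_ProcessInterrupts fallback):

static ProcessInterrupts_hook_type prev_ProcessInterrupts = NULL;

/* extension-defined flag, e.g. set from a ProcSignal callback */
static volatile sig_atomic_t MyCustomInterruptPending = false;

static void
my_ProcessInterrupts(void)
{
    if (MyCustomInterruptPending)
    {
        MyCustomInterruptPending = false;
        /* extension-specific reaction goes here */
    }

    /* fall through to the next hook, or to the standard handling */
    if (prev_ProcessInterrupts)
        prev_ProcessInterrupts();
    else
        standard_ProcessInterrupts();
}

void
_PG_init(void)
{
    prev_ProcessInterrupts = ProcessInterrupts_hook;
    ProcessInterrupts_hook = my_ProcessInterrupts;
}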

> What I really want to go along with this is a way for any backend to observe 
> the postmaster's pmState and its "Shutdown" variable's value, so any backend 
> can tell if we're in FastShutdown, SmartShutdown, etc. Copies in shmem only 
> obviously. But I'm not convinced it's right to just copy these vars as-is to 
> shmem, and I don't want to use the memory for a ProcSignal slot for something 
> that won't be relevant for most backends for most of the postmaster lifetime. 
> Ideas welcomed.

I've wanted something along this line, too, but what I was thinking
about was more along the lines of having the postmaster signal the
backends when a smart shutdown happened. After all when a fast
shutdown happens the backends already get told to terminate, and that
seems like it ought to be enough: I'm not sure backends have any
business caring about why they are being asked to terminate. But they
might well want to know whether a smart shutdown is in progress, and
right now there's no way for them to know that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Robert Haas
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin  wrote:
> Does anyone maintain opensource pg_surgery analogs for released versions of 
> PG?
> It seems to me I'll have to use something like this and I just thought that I 
> should consider pg_surgery in favour of our pg_dirty_hands.

I do not. I'm still of the opinion that we ought to back-patch
pg_surgery. This didn't attract a consensus before, and it's hard to
dispute that it's a new feature in what would be a back branch. But
it's unclear to me how users are otherwise supposed to recover from
some of the bugs that are or have been present in those back branches.
I'm not sure that I see the logic in telling people we'll try to
prevent them from getting hosed in the future but if they're already
hosed they can wait for v14 to fix it. They can't wait that long, and
a dump-and-restore cycle is awfully painful.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Victor Yegorov
пн, 18 янв. 2021 г. в 13:42, Amit Kapila :

> I don't think any of these can happen in what I am actually saying. Do
> you still have the same feeling after reading this email? Off-hand, I
> don't see any downsides as compared to the current approach and it
> will have fewer splits in some other workloads like when there is a
> mix of inserts and non-HOT updates (that doesn't logically change the
> index).
>

If I understand you correctly, you suggest tracking _all_ the hints that
came from the executor for possible non-HOT logical duplicates somewhere on
the page. And when we hit the no-space case, we'll check not only the last
item that was hinted, but all items on the page, which makes it more probable
to kick in and do something.

Sounds interesting. And judging by Peter's tests of extra LP_DEAD tuples
found on the page (almost always more than actually flagged), this can
have some positive effects.

Also, I'm not sure where to put it. We've deprecated the BTP_HAS_GARBAGE
flag, maybe it can be reused for this purpose?


-- 
Victor Yegorov


Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-18 Thread Zhihong Yu
Hi, Masahiko-san:

bq. How about FdwXactRequestToLaunchResolver()?

Sounds good to me.

bq. But there is already a function named FdwXactExists()

Then we can leave the function name as it is.

Cheers

On Sun, Jan 17, 2021 at 9:55 PM Masahiko Sawada 
wrote:

> On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu  wrote:
> >
> > For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :
> >
> > +   entry->changing_xact_state = true;
> > ...
> > +   entry->changing_xact_state = abort_cleanup_failure;
> >
> > I don't see return statement in between the two assignments. I wonder
> why entry->changing_xact_state is set to true, and later being assigned
> again.
>
> Because postgresRollbackForeignTransaction() can get called again in
> the case where an error occurred while aborting and cleaning up the
> transaction. For example, if an error occurred when executing ABORT
> TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR),
> postgresRollbackForeignTransaction() will get called again while
> entry->changing_xact_state is still true. Then the entry will be
> caught by the following condition and cleaned up:
>
> /*
>  * If connection is before starting transaction or is already
> unsalvageable,
>  * do only the cleanup and don't touch it further.
>  */
> if (entry->changing_xact_state)
> {
> pgfdw_cleanup_after_transaction(entry);
> return;
> }
>
> >
> > For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :
> >
> > bq. This commits introduces to new background processes: foreign
> >
> > commits introduces to new -> commit introduces two new
>
> Fixed.
>
> >
> > +FdwXactExistsXid(TransactionId xid)
> >
> > Since Xid is the parameter to this method, I think the Xid suffix can be
> dropped from the method name.
>
> But there is already a function named FdwXactExists()?
>
> bool
> FdwXactExists(Oid dbid, Oid serverid, Oid userid)
>
> As far as I read other code, we already have such functions that have
> the same functionality but have different arguments. For instance,
> SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think
> we can leave it as is, but would it be better to have something like
> FdwXactCheckExistence() and FdwXactCheckExistenceByXid()?
>
> >
> > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group
> >
> > Please correct year in the next patch set.
>
> Fixed.
>
> >
> > +FdwXactLauncherRequestToLaunch(void)
> >
> > Since the launcher's job is to 'launch', I think the Launcher can be
> omitted from the method name.
>
> Agreed. How about FdwXactRequestToLaunchResolver()?
>
> >
> > +/* Report shared memory space needed by FdwXactRsoverShmemInit */
> > +Size
> > +FdwXactRslvShmemSize(void)
> >
> > Are both Rsover and Rslv referring to resolver ? It would be better to
> use whole word which reduces confusion.
> > Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or
> FdwXactResolveShmemInit)
>
> Agreed. I realized that these functions are the launcher's function,
> not resolver's. So I'd change to FdwXactLauncherShmemSize() and
> FdwXactLauncherShmemInit() respectively.
>
> >
> > +fdwxact_launch_resolver(Oid dbid)
> >
> > The above method is not in camel case. It would be better if method
> names are consistent (in casing).
>
> Fixed.
>
> >
> > +errmsg("out of foreign transaction resolver slots"),
> > +errhint("You might need to increase
> max_foreign_transaction_resolvers.")));
> >
> > It would be nice to include the value of max_foreign_xact_resolvers
>
> I agree it would be nice but looking at other code we don't include
> the value in this kind of messages.
>
> >
> > For fdwxact_resolver_onexit():
> >
> > +   LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
> > +   fdwxact->locking_backend = InvalidBackendId;
> > +   LWLockRelease(FdwXactLock);
> >
> > There is no call to method inside the for loop which may take time. I
> wonder if the lock can be obtained prior to the for loop and released
> coming out of the for loop.
>
> Agreed.
>
> >
> > +FXRslvLoop(void)
> >
> > Please use Resolver instead of Rslv
>
> Fixed.
>
> >
> > +   FdwXactResolveFdwXacts(held_fdwxacts, nheld);
> >
> > Fdw and Xact are repeated twice each in the method name. Probably the
> method name can be made shorter.
>
> Fixed.
>
> Regards,
>
> --
> Masahiko Sawada
> EnterpriseDB:  https://www.enterprisedb.com/
>


Re: Additional Chapter for Tutorial - arch-dev.sgml

2021-01-18 Thread Heikki Linnakangas

On 20/11/2020 23:52, Erik Rijkers wrote:

(smallish) Changes to arch-dev.sgml


This looks good to me. One little complaint:


@@ -125,7 +122,7 @@
 use a supervisor process (also
 master process) that spawns a new
 server process every time a connection is requested. This supervisor
-process is called postgres and listens at a
+process is called postgres (formerly 'postmaster') and 
listens at a
 specified TCP/IP port for incoming connections. Whenever a request
 for a connection is detected the postgres
 process spawns a new server process. The server tasks


I believe we still call it the postmaster process. We renamed the binary 
a long time ago (commit 5266f221a2), and the above text was changed as 
part of that commit. I think that was a mistake, and this should say simply:


... This supervisor process is called postmaster and ...

like it did before we renamed the binary.

Barring objections, I'll commit this with that change (as attached).

- Heikki
diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml
index ade0ad97d8..a27c8477c2 100644
--- a/doc/src/sgml/arch-dev.sgml
+++ b/doc/src/sgml/arch-dev.sgml
@@ -7,7 +7,7 @@
Author

 This chapter originated as part of
-, Stefan Simkovics'
+ Stefan Simkovics'
 Master's Thesis prepared at Vienna University of Technology under the direction
 of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr.

@@ -17,10 +17,7 @@
This chapter gives an overview of the internal structure of the
backend of PostgreSQL.  After having
read the following sections you should have an idea of how a query
-   is processed. This chapter does not aim to provide a detailed
-   description of the internal operation of
-   PostgreSQL, as such a document would be
-   very extensive. Rather, this chapter is intended to help the reader
+   is processed.  This chapter is intended to help the reader
understand the general sequence of operations that occur within the
backend from the point at which a query is received, to the point
at which the results are returned to the client.
@@ -30,8 +27,8 @@
The Path of a Query
 

-Here we give a short overview of the stages a query has to pass in
-order to obtain a result.
+Here we give a short overview of the stages a query has to pass 
+to obtain a result.

 

@@ -125,9 +122,9 @@
 use a supervisor process (also
 master process) that spawns a new
 server process every time a connection is requested. This supervisor
-process is called postgres and listens at a
+process is called postmaster and listens at a
 specified TCP/IP port for incoming connections. Whenever a request
-for a connection is detected the postgres
+for a connection is detected the postmaster
 process spawns a new server process. The server tasks
 communicate with each other using semaphores and
 shared memory to ensure data integrity
@@ -230,7 +227,7 @@
 
  A detailed description of bison or
  the grammar rules given in gram.y would be
- beyond the scope of this paper. There are many books and
+ beyond the scope of this manual. There are many books and
  documents dealing with flex and
  bison. You should be familiar with
  bison before you start to study the
@@ -343,8 +340,8 @@

 
  In some situations, examining each possible way in which a query
- can be executed would take an excessive amount of time and memory
- space. In particular, this occurs when executing queries
+ can be executed would take an excessive amount of time and memory.
+ In particular, this occurs when executing queries
  involving large numbers of join operations. In order to determine
  a reasonable (not necessarily optimal) query plan in a reasonable amount
  of time, PostgreSQL uses a Genetic
@@ -411,7 +408,7 @@
 merge join: Each relation is sorted on the join
 attributes before the join starts. Then the two relations are
 scanned in parallel, and matching rows are combined to form
-join rows. This kind of join is more
+join rows. This kind of join is
 attractive because each relation has to be scanned only once.
 The required sorting might be achieved either by an explicit sort
 step, or by scanning the relation in the proper order using an
@@ -442,7 +439,7 @@
  If the query uses fewer than 
  relations, a near-exhaustive search is conducted to find the best
  join sequence.  The planner preferentially considers joins between any
- two relations for which there exist a corresponding join clause in the
+ two relations for which there exists a corresponding join clause in the
  WHERE qualification (i.e., for
  which a restriction like where rel1.attr1=rel2.attr2
  exists). Join pairs with no join clause are considered only when there


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Bharath Rupireddy
On Mon, Jan 18, 2021 at 11:44 AM Fujii Masao
 wrote:
> > I will post patches for the other function postgres_fdw_disconnect,
> > GUC and server level option later.
>
> Thanks!

Attaching v12 patch set. 0001 is for postgres_fdw_disconnect()
function, 0002 is for keep_connections GUC and 0003 is for
keep_connection server level option.

Please review it further.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com


v12-0001-postgres_fdw-function-to-discard-cached-connecti.patch
Description: Binary data


v12-0002-postgres_fdw-add-keep_connections-GUC-to-not-cac.patch
Description: Binary data


v12-0003-postgres_fdw-server-level-option-keep_connection.patch
Description: Binary data


TOAST condition for column size

2021-01-18 Thread torikoshia

Hi,

When I created a table consisting of 400 VARCHAR columns and tried
to INSERT a record whose columns were all the same size, there were
cases where I got an error due to exceeding the size limit per
row.

  =# -- create a table consisting of 400 VARCHAR columns
  =# CREATE TABLE t1 (c1 VARCHAR(100),
  c2 VARCHAR(100),
  ...
  c400 VARCHAR(100));

  =# -- insert one record whose columns are all 20 bytes
  =# INSERT INTO t1 VALUES (repeat('a', 20),
repeat('a', 20),
...
repeat('a', 20));
ERROR:  row is too big: size 8424, maximum size 8160

What is interesting is that it failed only when the size of each
column was 20~23 bytes, as shown below.

  size of each column  |  result
  ---
  18 bytes |  success
  19 bytes |  success
  20 bytes |  failure
  21 bytes |  failure
  22 bytes |  failure
  23 bytes |  failure
  24 bytes |  success
  25 bytes |  success


When the size of each column is 19 bytes or less, the INSERT succeeds
because the row size is within the page size.
When the size of each column is 24 bytes or more, it also
succeeds because columns are TOASTed and the row size is reduced
to less than one page.
OTOH, when it's more than 19 bytes and less than 24 bytes,
columns aren't TOASTed because they don't meet the condition of
the following if statement.

 --src/backend/access/table/toast_helper.c

   toast_tuple_find_biggest_attribute(ToastTupleContext *ttc,
 bool for_compression, bool check_main)
   ...(snip)...
   int32biggest_size = MAXALIGN(TOAST_POINTER_SIZE);
   ...(snip)...
   if (ttc->ttc_attr[i].tai_size > biggest_size) // <- here
   {
   biggest_attno = i;
   biggest_size = ttc->ttc_attr[i].tai_size;
   }


Since TOAST_POINTER_SIZE is 18 bytes but
MAXALIGN(TOAST_POINTER_SIZE) is 24 bytes, columns are not TOASTed
until their size becomes larger than 24 bytes.

I confirmed these sizes in my environment but AFAIU they would be
the same size in any environment.

So, as a result of the alignment adjustment, inserts of 20~23 bytes
seem to fail.
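
For reference, the arithmetic comes from the alignment macros in c.h; a
standalone sketch follows (MAXIMUM_ALIGNOF is assumed to be 8 here, and
the 1-byte short varlena header is an assumption in the tai_size model,
which matches the observed 20~23 byte window):

#include <stdint.h>
#include <stdio.h>

#define MAXIMUM_ALIGNOF 8       /* platform-dependent; 8 assumed */
#define TYPEALIGN(ALIGNVAL, LEN) \
    (((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))
#define MAXALIGN(LEN) TYPEALIGN(MAXIMUM_ALIGNOF, (LEN))

int
main(void)
{
    /*
     * TOAST_POINTER_SIZE is 18, but the comparison uses MAXALIGN(18) == 24,
     * so a column is considered for toasting only when its varlena size
     * (data bytes + 1-byte short header, by assumption) exceeds 24.
     */
    for (int n = 18; n <= 25; n++)
        printf("data=%2d bytes  tai_size=%2d  considered=%s\n",
               n, n + 1,
               (n + 1) > (int) MAXALIGN(18) ? "yes" : "no");
    return 0;
}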

I wonder if it might be better not to adjust the alignment here,
as in the attached patch, because with that change inserting 20~23
byte records succeeded.
Or are there reasons to apply the alignment here?

I understand that TOAST is not effective for small data and it's
not recommended to create a table containing hundreds of columns,
but I think cases that can be successful should be successful.

Any thoughts?


Regards,

--
Atsushi Torikoshidiff --git a/src/backend/access/table/toast_helper.c b/src/backend/access/table/toast_helper.c
index fb36151ce5..e916c0f95c 100644
--- a/src/backend/access/table/toast_helper.c
+++ b/src/backend/access/table/toast_helper.c
@@ -183,7 +183,7 @@ toast_tuple_find_biggest_attribute(ToastTupleContext *ttc,
 	TupleDesc	tupleDesc = ttc->ttc_rel->rd_att;
 	int			numAttrs = tupleDesc->natts;
 	int			biggest_attno = -1;
-	int32		biggest_size = MAXALIGN(TOAST_POINTER_SIZE);
+	int32		biggest_size = TOAST_POINTER_SIZE;
 	int32		skip_colflags = TOASTCOL_IGNORE;
 	int			i;
 


Re: proposal: schema variables

2021-01-18 Thread Erik Rijkers

On 2021-01-18 10:59, Pavel Stehule wrote:



and here is the patch
[schema-variables-20200118.patch.gz ]



One small thing:

The drop variable synopsis is:

DROP VARIABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]

In the text following it, 'RESTRICT' is not documented. When used, it
does not give an error, but I don't see how it 'works'.



Erik







Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Michael Banck
On Mon, Jan 18, 2021 at 08:54:10AM -0500, Robert Haas wrote:
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin  wrote:
> > Does anyone maintain opensource pg_surgery analogs for released
> > versions of PG?  It seems to me I'll have to use something like this
> > and I just though that I should consider pg_surgery in favour of our
> > pg_dirty_hands.
> 
> I do not. I'm still of the opinion that we ought to back-patch
> pg_surgery. This didn't attract a consensus before, and it's hard to
> dispute that it's a new feature in what would be a back branch. But
> it's unclear to me how users are otherwise supposed to recover from
> some of the bugs that are or have been present in those back branches.

One other possibility would be to push a version of pg_surgery that is
compatible with the back-branches somewhere external (e.g. either
git.postgresql.org and/or Github), so that it can be picked up by
distributions and/or individual users in need.

That is assuming it does not need assorted server changes to go with it; I
did not read the thread in detail, but I was under the impression it is a
client program?


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: ResourceOwner refactoring

2021-01-18 Thread Alvaro Herrera
On 2021-Jan-18, Heikki Linnakangas wrote:

> +static ResourceOwnerFuncs jit_funcs =
> +{
> + /* relcache references */
> + .name = "LLVM JIT context",
> + .phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> + .ReleaseResource = ResOwnerReleaseJitContext,
> + .PrintLeakWarning = ResOwnerPrintJitContextLeakWarning
> +};

I think you mean jit_resowner_funcs here; "jit_funcs" is a bit
excessively vague.  Also, why did you choose not to define
ResourceOwnerRememberJIT?  You do that in other modules and it seems
better.
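
Something like this, I mean, assuming the patch's generic
ResourceOwnerRemember(owner, datum, funcs) entry point (just a sketch):

    static inline void
    ResourceOwnerRememberJIT(ResourceOwner owner, LLVMJitContext *context)
    {
        ResourceOwnerRemember(owner, PointerGetDatum(context),
                              &jit_resowner_funcs);
    }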

> +static ResourceOwnerFuncs catcache_funcs =
> +{
> + /* catcache references */
> + .name = "catcache reference",
> + .phase = RESOURCE_RELEASE_AFTER_LOCKS,
> + .ReleaseResource = ResOwnerReleaseCatCache,
> + .PrintLeakWarning = ResOwnerPrintCatCacheLeakWarning
> +};
> +
> +static ResourceOwnerFuncs catlistref_funcs =
> +{
> + /* catcache-list pins */
> + .name = "catcache list reference",
> + .phase = RESOURCE_RELEASE_AFTER_LOCKS,
> + .ReleaseResource = ResOwnerReleaseCatCacheList,
> + .PrintLeakWarning = ResOwnerPrintCatCacheListLeakWarning
> +};

Similar naming issue here, I say.


> diff --git a/src/common/cryptohash_openssl.c b/src/common/cryptohash_openssl.c
> index 551ec392b60..642e72d8c0f 100644
> --- a/src/common/cryptohash_openssl.c
> +++ b/src/common/cryptohash_openssl.c
> @@ -60,6 +59,21 @@ struct pg_cryptohash_ctx
>  #endif
>  };
>  
> +/* ResourceOwner callbacks to hold JitContexts  */
> +#ifndef FRONTEND
> +static void ResOwnerReleaseCryptoHash(Datum res);
> +static void ResOwnerPrintCryptoHashLeakWarning(Datum res);

The comment is wrong -- "CryptoHashes", not "JitContexts".


So according to your performance benchmark, we're willing to accept a
30% performance loss on an allegedly common operation -- numkeep=0
numsnaps=10 becomes 49.8ns from 37.6ns.  That seems a bit shocking.
Maybe you can claim that these operations aren't exactly hot spots, and
so the fact that we remain in the same power-of-ten is sufficient.  Is
that the argument?

-- 
Álvaro Herrera39°49'30"S 73°17'W
"The Postgresql hackers have what I call a "NASA space shot" mentality.
 Quite refreshing in a world of "weekend drag racer" developers."
(Scott Marlowe)




Re: BEFORE ROW triggers for partitioned tables

2021-01-18 Thread 李杰(慎追)
Hi committer,

Many of our customers expect to use BR triggers on partitioned tables.
After following your discussion, I also checked your patch.
Here are two questions that confuse me:

1. Your modification removes the check for BR triggers against partitioned
tables, and a friendlier error message is added to ExecInsert and ExecUpdate.
You are correct that ExecInsert does not reroute partitions.
However, when ExecUpdate modifies partition keys, it is almost equivalent
to ExecDelete plus ExecInsert, and the tuple is re-routed
(ExecPrepareTupleRouting) once before ExecInsert. Therefore,
why should an error be thrown in ExecUpdate?
Let's look at a case:
```
postgres=# create table parted (a int, b int, c text) partition by list 
(a);
CREATE TABLE
postgres=# create table parted_1 partition of parted for values in (1);
CREATE TABLE
postgres=# create table parted_2 partition of parted for values in (2);
CREATE TABLE
postgres=# create function parted_trigfunc() returns trigger 
language plpgsql as $$
 begin
   new.a = new.a + 1;
   return new;
  end;
 $$;
CREATE FUNCTION
postgres=# insert into parted values (1, 1, 'uno uno v1'); 
INSERT 0 1
postgres=# create trigger t before update on parted
   for each row execute function parted_trigfunc();
CREATE TRIGGER
postgres=# update parted set c = c || 'v3'; 
   ```
If there were no check in ExecUpdate,
the above UPDATE would execute successfully.
However, with your code, it fails.
So, what is the reasoning behind this decision?

2. In this patch, you only removed the restriction on BR triggers against
partitioned tables, but did not fundamentally solve the problem caused
by modifying partition keys in a BR trigger. What are the difficulties in
solving this problem fundamentally? We plan to implement it.
Can you give us some suggestions?

We look forward to your reply.
Thank you very much,
 Regards,  Adger





--
From: Alvaro Herrera 
Sent: Monday, January 18, 2021 20:36
To: Ashutosh Bapat 
Cc: Ashutosh Bapat ; Pg Hackers 

Subject: Re: BEFORE ROW triggers for partitioned tables

Thanks for the reviews; I have pushed it now.

(I made the doc edits you mentioned too.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 9:25 AM Michael Banck  wrote:
> One other possiblity would be to push a version of pg_surgery that is
> compatible with the back-branches somewhere external (e.g. either
> git.postgresql.org and/or Github), so that it can be picked up by
> distributions and/or individual users in need.

Sure, but I don't see how that's better.

> That is Assuming it does not need assorted server changes to go with; I
> did not read the thread in detail but I was under the assumption it is a
> client program?

It's a server extension. It does not require core changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: New IndexAM API controlling index vacuum strategies

2021-01-18 Thread Zhihong Yu
Hi, Masahiko-san:

For v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch :

For blvacuumstrategy():

+   if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
+   return INDEX_VACUUM_STRATEGY_NONE;
+   else
+   return INDEX_VACUUM_STRATEGY_BULKDELETE;

The 'else' can be omitted.
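
i.e.:

    if (params->index_cleanup == VACOPT_TERNARY_DISABLED)
        return INDEX_VACUUM_STRATEGY_NONE;

    return INDEX_VACUUM_STRATEGY_BULKDELETE;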

Similar comment for ginvacuumstrategy().

For v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch :

If index_cleanup option is specified neither VACUUM command nor
storage option

I think this is what you meant (by not using passive voice):

If index_cleanup option specifies neither VACUUM command nor
storage option,

- * integer, but you can't fit that many items on a page. 11 ought to be more
+ * integer, but you can't fit that many items on a page. 13 ought to be more

It would be nice to add a note explaining why the number of bits is increased.

For choose_vacuum_strategy():

+   IndexVacuumStrategy ivstrat;

The variable is only used inside the loop. You can
use vacrelstats->ivstrategies[i] directly and omit the variable.

+   int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7);

How was the factor of 0.7 determined? The comment below only mentions 'safety
factor' but not how it was chosen.
I also wonder if this factor should be exposed as a GUC.

+   if (nworkers > 0)
+   nworkers--;

Should a log message / assertion be added for when nworkers is <= 0?

+ * XXX: allowing to fill the heap page with only line pointer seems a overkill.

'a overkill' -> 'an overkill'

Cheers

On Sun, Jan 17, 2021 at 10:21 PM Masahiko Sawada 
wrote:

> On Mon, Jan 18, 2021 at 2:18 PM Masahiko Sawada 
> wrote:
> >
> > On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada 
> wrote:
> > >
> > > On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada 
> wrote:
> > > >
> > > > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan  wrote:
> > > > >
> > > > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan 
> wrote:
> > > > > > I experimented with this today, and I think that it is a good
> way to
> > > > > > do it. I like the idea of choose_vacuum_strategy() understanding
> that
> > > > > > heap pages that are subject to many non-HOT updates have a
> "natural
> > > > > > extra capacity for LP_DEAD items" that it must care about
> directly (at
> > > > > > least with non-default heap fill factor settings). My early
> testing
> > > > > > shows that it will often take a surprisingly long time for the
> most
> > > > > > heavily updated heap page to have more than about 100 LP_DEAD
> items.
> > > > >
> > > > > Attached is a rough patch showing what I did here. It was applied
> on
> > > > > top of my bottom-up index deletion patch series and your
> > > > > poc_vacuumstrategy.patch patch. This patch was written as a quick
> and
> > > > > dirty way of simulating what I thought would work best for
> bottom-up
> > > > > index deletion for one specific benchmark/test, which was
> > > > > non-hot-update heavy. This consists of a variant pgbench with
> several
> > > > > indexes on pgbench_accounts (almost the same as most other
> bottom-up
> > > > > deletion benchmarks I've been running). Only one index is
> "logically
> > > > > modified" by the updates, but of course we still physically modify
> all
> > > > > indexes on every update. I set fill factor to 90 for this
> benchmark,
> > > > > which is an important factor for how your VACUUM patch works during
> > > > > the benchmark.
> > > > >
> > > > > This rough supplementary patch includes VACUUM logic that assumes
> (but
> > > > > doesn't check) that the table has heap fill factor set to 90 --
> see my
> > > > > changes to choose_vacuum_strategy(). This benchmark is really about
> > > > > stability over time more than performance (though performance is
> also
> > > > > improved significantly). I wanted to keep both the table/heap and
> the
> > > > > logically unmodified indexes (i.e. 3 out of 4 indexes on
> > > > > pgbench_accounts) exactly the same size *forever*.
> > > > >
> > > > > Does this make sense?
> > > >
> > > > Thank you for sharing the patch. That makes sense.
> > > >
> > > > +if (!vacuum_heap)
> > > > +{
> > > > +if (maxdeadpage > 130 ||
> > > > +/* Also check if maintenance_work_mem space is
> running out */
> > > > +vacrelstats->dead_tuples->num_tuples >
> > > > +vacrelstats->dead_tuples->max_tuples / 2)
> > > > +vacuum_heap = true;
> > > > +}
> > > >
> > > > The second test checking if maintenane_work_mem space is running out
> > > > also makes sense to me. Perhaps another idea would be to compare the
> > > > number of collected garbage tuple to the total number of heap tuples
> > > > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a
> > > > certain amount of garbage in the table.
> > > >
> > > > >
> > > > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that
> > > > > pg_stat_database shows an almost ~28% reduction in blks_read after
> an
> > > > > o

Re: ResourceOwner refactoring

2021-01-18 Thread Heikki Linnakangas

On 18/01/2021 16:34, Alvaro Herrera wrote:

So according to your performance benchmark, we're willing to accept a
30% performance loss on an allegedly common operation -- numkeep=0
numsnaps=10 becomes 49.8ns from 37.6ns.  That seems a bit shocking.
Maybe you can claim that these operations aren't exactly hot spots, and
so the fact that we remain in the same power-of-ten is sufficient.  Is
that the argument?


That's right. The fast path is fast, and that's important. The slow path 
becomes 30% slower, but that's acceptable.


- Heikki




Re: support for MERGE

2021-01-18 Thread Alvaro Herrera
Jaime Casanova just reported that this patch causes a crash on the
regression database with this query:

 MERGE INTO public.pagg_tab_ml_p3 as target_0 
 USING public.prt2_l_p3_p2 as ref_0 ON target_0.a = ref_0.a 
 WHEN MATCHED   AND cast(null as tid) <= cast(null as tid)THEN DELETE;

The reason is down to adjust_partition_tlist() not being willing to deal
with empty tlists.  So this is the most direct fix:

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1fa4d84c42..d6b478ec33 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -976,7 +976,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 			conv_tl = map_partition_varattnos((List *) action->targetList,
 											  firstVarno,
 											  partrel, firstResultRel);
-			conv_tl = adjust_partition_tlist(conv_tl, map);
+			if (conv_tl != NIL)
+				conv_tl = adjust_partition_tlist(conv_tl, map);
 			tupdesc = ExecTypeFromTL(conv_tl);
 			/* XXX gotta pfree conv_tl and tupdesc? */
 
 

But I wonder if it wouldn't be better to patch adjust_partition_tlist()
to return NIL on NIL input, instead, like this:

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1fa4d84c42..6a170eea03 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1589,6 +1589,9 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
 	AttrMap    *attrMap = map->attrMap;
 	AttrNumber	attrno;
 
+	if (tlist == NIL)
+		return NIL;
+
 	Assert(tupdesc->natts == attrMap->maplen);
 	for (attrno = 1; attrno <= tupdesc->natts; attrno++)
 	{

I lean towards the latter myself.

-- 
Álvaro Herrera   Valdivia, Chile




Re: POC: postgres_fdw insert batching

2021-01-18 Thread Zhihong Yu
Hi, Takayuki-san:

+   if (batch_size <= 0)
+   ereport(ERROR,
+   (errcode(ERRCODE_SYNTAX_ERROR),
+errmsg("%s requires a non-negative integer value",

It seems the message doesn't match the check w.r.t. the batch size of 0.
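
That is, presumably either the check should be batch_size < 0, or the
message should say something like (wording is just a suggestion):

    if (batch_size <= 0)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("%s requires a positive integer value",
                        def->defname)));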

+   int numInserted = numSlots;

Since numInserted is filled by ExecForeignBatchInsert(), the initialization
can be done with 0.

Cheers

On Sun, Jan 17, 2021 at 10:52 PM tsunakawa.ta...@fujitsu.com <
tsunakawa.ta...@fujitsu.com> wrote:

> Tomas-san,
>
> From: Amit Langote 
> > Good thing you reminded me that this is about inserts, and in that
> > case no, ExecInitModifyTable() doesn't know all leaf partitions, it
> > only sees the root table whose batch_size doesn't really matter.  So
> > it's really ExecInitRoutingInfo() that I would recommend to set
> > ri_BatchSize; right after this block:
> >
> > /*
> >  * If the partition is a foreign table, let the FDW init itself for
> >  * routing tuples to the partition.
> >  */
> > if (partRelInfo->ri_FdwRoutine != NULL &&
> > partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
> > partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);
> >
> > Note that ExecInitRoutingInfo() is called only once for a partition
> > when it is initialized after being inserted into for the first time.
> >
> > For a non-partitioned targets, I'd still say set ri_BatchSize in
> > ExecInitModifyTable().
>
> Attached is the patch that added call to GetModifyBatchSize() to the above
> two places.  The regression test passes.
>
> (FWIW, frankly, I prefer the previous version because the code is a bit
> smaller...  Maybe we should refactor the code someday to reduce similar
> processings in both the partitioned case and non-partitioned case.)
>
>
> Regards
> Takayuki Tsunakawa
>
>


Re: popcount

2021-01-18 Thread Tom Lane
Peter Eisentraut  writes:
> [ assorted nits ]

At the level of bikeshedding ... I quite dislike using the name "popcount"
for these functions.  I'm aware that some C compilers provide primitives
of that name, but I wouldn't expect a SQL programmer to know that;
without that context the name seems pretty random and unintuitive.
Moreover, it invites confusion with SQL's use of "pop" to abbreviate
"population" in the statistical aggregates, such as var_pop().

Perhaps something along the lines of count_ones() or count_set_bits()
would be more apropos.

regards, tom lane




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Fujii Masao




On 2021/01/18 23:14, Bharath Rupireddy wrote:

On Mon, Jan 18, 2021 at 11:44 AM Fujii Masao
 wrote:

I will post patches for the other function postgres_fdw_disconnect,
GUC and server level option later.


Thanks!


Attaching v12 patch set. 0001 is for postgres_fdw_disconnect()
function, 0002 is for keep_connections GUC and 0003 is for
keep_connection server level option.


Thanks!



Please review it further.


+   server = GetForeignServerByName(servername, true);
+
+   if (!server)
+       ereport(ERROR,
+               (errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
+                errmsg("foreign server \"%s\" does not exist", servername)));

ISTM we can simplify this code as follows.

server = GetForeignServerByName(servername, false);


+   hash_seq_init(&scan, ConnectionHash);
+   while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))

When the server name is specified, even if its connection is successfully
closed, postgres_fdw_disconnect() scans through all the entries to check
whether there are active connections. But if "result" is true and
active_conn_exists is true, we can get out of this loop to avoid unnecessary
scans.
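
For example (a sketch, with variable names as in the patch):

    if (result && active_conn_exists)
    {
        hash_seq_term(&scan);
        break;
    }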


+   /*
+    * Destroy the cache if we discarded all active connections i.e. if there
+    * is no single active connection, which we can know while scanning the
+    * cached entries in the above loop. Destroying the cache is better than to
+    * keep it in the memory with all inactive entries in it to save some
+    * memory. Cache can get initialized on the subsequent queries to foreign
+    * server.

How much memory is assumed to be saved by destroying the cache in
many cases? I'm not sure if it's really worth destroying the cache to save
the memory.


+  a warning is issued and false is returned. false
+  is returned when there are no open connections. When there are some open
+  connections, but there is no connection for the given foreign server,
+  then false is returned. When no foreign server exists
+  with the given name, an error is emitted. Example usage of the function:

When a non-existent server name is specified, postgres_fdw_disconnect()
emits an error if there is at least one open connection, but just returns
false otherwise. At least for me, this behavior looks inconsistent and strange.
In that case, IMO the function should always emit an error.

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Key management with tests

2021-01-18 Thread Bruce Momjian
On Sun, Jan 17, 2021 at 07:50:13PM -0500, Robert Haas wrote:
> On Fri, Jan 15, 2021 at 7:56 PM Andres Freund  wrote:
> > I think that's not at all acceptable. I don't mind hashing out details
> > on calls / off-list, but the design needs to be public, documented, and
> > reviewable.  And if it's something the community can't understand, then
> > it can't get in. We're going to have to maintain this going forward.
> 
> I agree. If the community is unable to clearly understand what
> something is, and why we should have it, then we shouldn't have it --
> even if the reason is that we're too dumb to understand, as Bruce

I am not sure why you are bringing intelligence into this discussion. 
You have to understand Postgres internals and cryptography tradeoffs to
understand why some of the design decisions were made.  It is a
knowledge issue, not an intelligence issue.  The wiki page is the result
of those phone discussions.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: PoC/WIP: Extended statistics on expressions

2021-01-18 Thread Dean Rasheed
Looking through extended_stats.c, I found a corner case that can lead
to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting
nvacatts to be 0. I can't think of any case where building stats on a
table with no analysable columns is useful, so it should probably just
exit early in that case.
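
i.e., something like (a sketch; the variable name follows the
description above):

    /* nothing to do if there are no analysable columns */
    if (nvacatts == 0)
        return;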


In BuildRelationExtStatistics(), it looks like min_attrs should be
declared assert-only.


In evaluate_expressions():

+   /* set the pointers */
+   result = (ExprInfo *) ptr;
+   ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().
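
That is, presumably:

    ptr += MAXALIGN(sizeof(ExprInfo));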


A slightly bigger issue that I don't like is the way it assigns
attribute numbers for expressions starting from
MaxHeapAttributeNumber+1, so the first expression has an attnum of
1601. That leads to pretty inefficient use of Bitmapsets, since most
tables only contain a handful of columns, and there's a large block of
unused space in the middle of the Bitmapset.

An alternative approach might be to use regular attnums for columns
and use negative indexes -1, -2, -3, ... for expressions in the stored
stats. Then when storing and retrieving attnums from Bitmapsets, it
could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values
in the Bitmapsets, since there can't be more than that many
expressions (just like other code stores system attributes using
FirstLowInvalidHeapAttributeNumber).

That would be a somewhat bigger change, but hopefully fairly
mechanical, and then some code like add_expressions_to_attributes()
would go away.


Looking at the new view pg_stats_ext_exprs, I noticed that it fails to
show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-+---+--+
 s   | foo   |  |
(1 row)

but after populating and analysing the table, this becomes:

 statistics_name | tablename |  expr   | n_distinct
-+---+-+
 s   | foo   | (a + b) | 11
 s   | foo   | (a * b) | 11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as
well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-+---+--+
 s   | foo   |  |
(1 row)

and those values will never be populated, since they're not
expressions, so I would expect them to not be shown in the view.

So basically, instead of

+ LEFT JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+ JOIN LATERAL (
+ SELECT
+ *
+ FROM (
+ SELECT
+
unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+ unnest(sd.stxdexpr)::pg_statistic AS a
+ ) x
+ ) stat ON true;

Regards,
Dean




Re: Key management with tests

2021-01-18 Thread Bruce Momjian
On Sun, Jan 17, 2021 at 11:54:57AM +0530, Amit Kapila wrote:
> > Is there a design document for a Postgres feature of this size and
> > scope that people feel would serve as a good example? Alternatively,
> > is there a design document template that has been successfully used in
> > the past?
> 
> We normally write the design considerations and choices we made with
> the reasons in README and code comments. Personally, I am not sure if
> there is a need for any specific document per-se but a README and
> detailed comments in the code should suffice what people are worried
> about here. It is mostly from the perspective that other developers
> reading the code, want to do bug-fix, or later enhance that code
> should be able to understand it. One recent example I can give is
> Peter's work on bottom-up deletion [1] which I have read today where I
> find that the design is captured via README, appropriate comments in
> the code, and documentation. This feature is quite different and
> probably a lot more new concepts are being introduced but I hope that
> will give you some clue.
> 
> [1] - 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf

OK, I looked at that and it is good, and I see my patch is missing that.
Are people looking for me to take the wiki content, expand on it and tie
it to the code that will be applied, or something else like all the
various crypto options and why we chose what we did beyond what is
already on the wiki?  I can easily go from what we have on the wiki to
implementation code steps, but the other part is harder to explain and
that is why I offered to talk to people via voice.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Key management with tests

2021-01-18 Thread Bruce Momjian
On Sat, Jan 16, 2021 at 10:58:47PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2021-01-17 11:54:57 +0530, Amit Kapila wrote:
> > On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid  
> > wrote:
> > > Admittedly I am a novice on this topic, and the majority of the
> > > PostgreSQL source code, however I am hopeful enough (those of you who
> > > know me understand that I suffer from eternal optimism) that I am
> > > going to attempt to help.
> > >
> > > Is there a design document for a Postgres feature of this size and
> > > scope that people feel would serve as a good example? Alternatively,
> > > is there a design document template that has been successfully used in
> > > the past?
> > >
> > 
> > We normally write the design considerations and choices we made with
> > the reasons in README and code comments. Personally, I am not sure if
> > there is a need for any specific document per-se but a README and
> > detailed comments in the code should suffice what people are worried
> > about here.
> 
> Right. It could be a README file, or a long comment at a start of one of
> the files. It doesn't matter too much. What matters is that people that
> haven't been on those phone call can understand the design and the
> implications it has.

OK, so does the wiki page contain most of what you want, but is missing
the connection between the design and the code?

https://wiki.postgresql.org/wiki/Transparent_Data_Encryption

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-18 Thread Zhihong Yu
Hi,
For 0001-Allow-REINDEX-to-change-tablespace.patch :

+ * InvalidOid, use the tablespace in-use instead.

'in-use' seems a bit redundant in the sentence.
How about: InvalidOid, use the tablespace of the index instead.

Cheers

On Mon, Jan 18, 2021 at 12:38 AM Justin Pryzby  wrote:

> On Mon, Jan 18, 2021 at 02:18:44PM +0900, Michael Paquier wrote:
> > On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote:
> > > It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and
> ReindexParams
> > > In my v31 patch, I moved ReindexOptions to a private structure in
> indexcmds.c,
> > > with an "int options" bitmask which is passed to reindex_index() et
> al.  Your
> > > patch keeps/puts ReindexOptions index.h, so it also applies to
> reindex_index,
> > > which I think is good.
> >
> > a3dc926 is an equivalent of 0001~0003 merged together.  0008 had
> > better be submitted into a separate thread if there is value to it.
> > 0004~0007 are the pieces remaining.  Could it be possible to rebase
> > things on HEAD and put the tablespace bits into the structures
> > {Vacuum,Reindex,Cluster}Params?
>
> Attached.  I will re-review these myself tomorrow.
>
> --
> Justin
>


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-18 Thread Fujii Masao



On 2021/01/18 22:03, Bharath Rupireddy wrote:

On Mon, Jan 18, 2021 at 6:17 PM Fujii Masao  wrote:

+1 to add it after "dropped (Note )", how about as follows
with slight changes?

dropped (Note that server name of an invalid connection can be NULL if
the server is dropped), and then such .


Yes, I like this one. One question is: isn't "should" or "is" better
than "can" in this case, since the server name of an invalid connection
is always NULL when its server is dropped?


I think "dropped (Note that server name of an invalid connection will
be NULL if the server is dropped), and then such ."


Sounds good to me. So patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
From cbe84856899f23a810fe4d42cf1cd5e46b3d6870 Mon Sep 17 00:00:00 2001
From: Fujii Masao 
Date: Tue, 19 Jan 2021 00:56:10 +0900
Subject: [PATCH] doc: Add note about the server name of
 postgres_fdw_get_connections() returns.

Previously the document didn't mention the case where
postgres_fdw_get_connections() returns NULL in server_name column.
Users might be confused about why NULL was returned.

This commit adds the note that, in postgres_fdw_get_connections(),
the server name of an invalid connection will be NULL if the server is dropped.

Suggested-by: Zhijie Hou
Author: Bharath Rupireddy
Reviewed-by: Fujii Masao
Discussion: 
https://postgr.es/m/e7ddd14e96444fce88e47a709c196537@G08CNEXMBPEKD05.g08.fujitsu.local
---
 doc/src/sgml/postgres-fdw.sgml | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 6a91926da8..9adc8d12a9 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -493,7 +493,9 @@ OPTIONS (ADD password_required 'false');
   each connection is valid or not. false is returned
   if the foreign server connection is used in the current local
   transaction but its foreign server or user mapping is changed or
-  dropped, and then such invalid connection will be closed at
+  dropped (Note that server name of an invalid connection will be
+  NULL if the server is dropped),
+  and then such invalid connection will be closed at
   the end of that transaction. true is returned
   otherwise. If there are no open connections, no record is returned.
   Example usage of the function:
-- 
2.27.0



Re: recovering from "found xmin ... from before relfrozenxid ..."

2021-01-18 Thread Andrey Borodin



> On 18 Jan 2021, at 18:54, Robert Haas  wrote:
> 
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin  wrote:
>> Does anyone maintain opensource pg_surgery analogs for released versions of 
>> PG?
>> It seems to me I'll have to use something like this and I just though that I 
>> should consider pg_surgery in favour of our pg_dirty_hands.
> 
> I do not. I'm still of the opinion that we ought to back-patch
> pg_surgery.
+1.
Yesterday I spent a few hours packaging pg_dirty_hands and pg_surgery (BTW, it
works fine on 12).
It's a kind of 911 tool: one doesn't think they will need it until they
actually do. And the clock is ticking.
OTOH, it opens new ways to shoot yourself in the foot.

Best regards, Andrey Borodin.



Re: support for MERGE

2021-01-18 Thread Robert Haas
On Thu, Dec 31, 2020 at 8:48 AM Alvaro Herrera  wrote:
> Here's a rebase of Simon/Pavan's MERGE patch to current sources.  I
> cleaned up some minor things in it, but aside from rebasing, it's pretty
> much their work (even the commit message is Simon's).

It's my impression that the previous discussion concluded that their
version needed pretty substantial design changes. Is there a plan to
work on that? Or was some of that stuff done already?

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Add session statistics to pg_stat_database

2021-01-18 Thread Laurenz Albe
On Sun, 2021-01-17 at 14:07 +0100, Magnus Hagander wrote:
> I have applied this version, with some minor changes:
> 
> * I renamed the n__time members in the struct to just
> total__time. The n_ indicates "number of" and is thus wrong for
> time parameters.

Right.

> * Some very minor wording changes.
>
> * catversion bump (for once I didn't forget it!)

Thank you!

You included the catversion bump, but shouldn't PGSTAT_FILE_FORMAT_ID
in "include/pgstat.h" be updated as well?

Yours,
Laurenz Albe





Re: ResourceOwner refactoring

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 10:19 AM Heikki Linnakangas  wrote:
>
> On 18/01/2021 16:34, Alvaro Herrera wrote:
> > So according to your performance benchmark, we're willing to accept a
> > 30% performance loss on an allegedly common operation -- numkeep=0
> > numsnaps=10 becomes 49.8ns from 37.6ns.  That seems a bit shocking.
> > Maybe you can claim that these operations aren't exactly hot spots, and
> > so the fact that we remain in the same power-of-ten is sufficient.  Is
> > that the argument?
>
> That's right. The fast path is fast, and that's important. The slow path
> becomes 30% slower, but that's acceptable.
>
> - Heikki
>
>


-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: ResourceOwner refactoring

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 11:11 AM Robert Haas  wrote:
> On Mon, Jan 18, 2021 at 10:19 AM Heikki Linnakangas  wrote:
> > On 18/01/2021 16:34, Alvaro Herrera wrote:
> > > So according to your performance benchmark, we're willing to accept a
> > > 30% performance loss on an allegedly common operation -- numkeep=0
> > > numsnaps=10 becomes 49.8ns from 37.6ns.  That seems a bit shocking.
> > > Maybe you can claim that these operations aren't exactly hot spots, and
> > > so the fact that we remain in the same power-of-ten is sufficient.  Is
> > > that the argument?
> >
> > That's right. The fast path is fast, and that's important. The slow path
> > becomes 30% slower, but that's acceptable.

Sorry for the empty message.

I don't know whether a 30% slowdown will hurt anybody, but it seems
like kind of a lot, and I'm not sure I understand what corresponding
benefit we're getting.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: popcount

2021-01-18 Thread Daniel Verite
Peter Eisentraut wrote:

> +   /*
> +* ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
> +* done to minimize branches and instructions.
> +*/
> 
> I don't know what that is supposed to mean or why that kind of tuning 
> would be necessary for a user-callable function.

Also, the formula just below looks wrong in the current patch  (v3):

+ /*
+  * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+  * done to minimize branches and instructions.
+  */
+ len = (VARBITLEN(arg1) + BITS_PER_BYTE + 1) / BITS_PER_BYTE;
+ p = VARBITS(arg1);
+
+ popcount = pg_popcount((const char *)p, len);

It should be 
 len = (VARBITLEN(arg1) + BITS_PER_BYTE - 1) / BITS_PER_BYTE

If we add 1 to BITS_PER_BYTE instead of subtracting 1, when
VARBITLEN(arg1) is a multiple of BITS_PER_BYTE, "len" is one byte too
large.
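
For example, with VARBITLEN(arg1) = 16 (exactly two bytes):

    (16 + BITS_PER_BYTE + 1) / BITS_PER_BYTE = 25 / 8 = 3    -- one byte too many
    (16 + BITS_PER_BYTE - 1) / BITS_PER_BYTE = 23 / 8 = 2    -- correct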

The correct formula is already used in include/utils/varbit.h near the
definition of VARBITLEN:

#define VARBITTOTALLEN(BITLEN)  (((BITLEN) + BITS_PER_BYTE-1)/BITS_PER_BYTE + \
								 VARHDRSZ + VARBITHDRSZ)


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite




Re: support for MERGE

2021-01-18 Thread Alvaro Herrera
On 2021-Jan-18, Robert Haas wrote:

> On Thu, Dec 31, 2020 at 8:48 AM Alvaro Herrera  
> wrote:
> > Here's a rebase of Simon/Pavan's MERGE patch to current sources.  I
> > cleaned up some minor things in it, but aside from rebasing, it's pretty
> > much their work (even the commit message is Simon's).
> 
> It's my impression that the previous discussion concluded that their
> version needed pretty substantial design changes. Is there a plan to
> work on that? Or was some of that stuff done already?

I think some of the issues were handled as I adapted the patch to
current sources.  However, the extensive refactoring that had been
recommended in the old threads has not been done.  I mainly have these
two comments:

https://postgr.es/m/CABOikdOeraX0RKojBs0jZxY5SmbQF3GJBP0+Rat=2g+vqsa...@mail.gmail.com
https://postgr.es/m/7168.1547584...@sss.pgh.pa.us

I'll try to get to those, but I have some other patches that I need to
handle first.

-- 
Álvaro Herrera   Valdivia, Chile




Re: Add session statistics to pg_stat_database

2021-01-18 Thread Magnus Hagander
On Mon, Jan 18, 2021 at 5:11 PM Laurenz Albe  wrote:
>
> On Sun, 2021-01-17 at 14:07 +0100, Magnus Hagander wrote:
> > I have applied this version, with some minor changes:
> >
> > * I renamed the n__time members in the struct to just
> > total__time. The n_ indicates "number of" and is thus wrong for
> > time parameters.
>
> Right.
>
> > * Some very minor wording changes.
> >
> > * catversion bump (for once I didn't forget it!)
>
> Thank you!
>
> You included the catversion bump, but shouldn't PGSTAT_FILE_FORMAT_ID
> in "include/pgstat.h" be updated as well?

Yup, you are absolutely correct. Will fix.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: [PATCH] ProcessInterrupts_hook

2021-01-18 Thread Tom Lane
Robert Haas  writes:
> On Mon, Jan 18, 2021 at 3:00 AM Craig Ringer
>  wrote:
>> A few times lately I've been doing things in extensions that've made me want 
>> to be able to run my own code whenever InterruptPending is true and 
>> CHECK_FOR_INTERRUPTS() calls ProcessInterrupts()

> I've wanted this in the past, too, so +1 from me.

I dunno, this seems pretty scary and easily abusable.  There's not all
that much that can be done safely in ProcessInterrupts(), and we should
not be encouraging extensions to think they can add random processing
there.

>> What I really want to go along with this is a way for any backend to observe 
>> the postmaster's pmState and its "Shutdown" variable's value, so any backend 
>> can tell if we're in FastShutdown, SmartShutdown, etc.

> I've wanted something along this line, too, but what I was thinking
> about was more along the lines of having the postmaster signal the
> backends when a smart shutdown happened.

We're about halfway there already, see 7e784d1dc.  I didn't do the
other half because it wasn't necessary to the problem, but exposing
the shutdown state more fully seems reasonable.

regards, tom lane




Re: Key management with tests

2021-01-18 Thread Bruce Momjian
On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote:
> OK, I looked at that and it is good, and I see my patch is missing that.
> Are people looking for me to take the wiki content, expand on it and tie
> it to the code that will be applied, or something else like all the
> various crypto options and why we chose what we did beyond what is
> already on the wiki?  I can easily go from what we have on the wiki to
> implementation code steps, but the other part is harder to explain and
> that is why I offered to talk to people via voice.

Just to clarify why voice calls can be helpful --- if you have to get
into "you have to understand X to understand Y", that's where a voice
call works best, because understanding X will require understanding
A/B/C, and everyone's missing pieces are different, so you have to
customize it for the individual.  

You can explain some of this in a README, but trying to cover all of it
leads to a combinatorial problem of trying to explain everything. 
Ideally the wiki page can be expanded so people can ask and answer all
posted issues, perhaps in a Q&A format.  Someone could go through the
archives and post why certain decisions were made, and link to the
original emails.

I have to admit I was kind of baffled that the wiki page wasn't
sufficient, because it is one of the longest Postgres feature
explanations I have seen, but I now think the missing part is tying
the wiki contents to the code implementation.  If that is it, please
confirm.  If it is something else, also explain.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Add primary keys to system catalogs

2021-01-18 Thread Tom Lane
Robert Haas  writes:
> I don't have any complaint about labelling some of the unique indexes
> as primary keys, but I think installing foreign keys that don't really
> enforce anything may lead to confusion.

I'm not sure if I buy the "confusion" argument, but after thinking
about this more I realized that there's a stronger roadblock for
treating catalog interrelationships as SQL foreign keys.  Namely,
that we always represent no-reference situations with a zero OID,
whereas it'd have to be NULL to look like a valid foreign-key case.
Changing to real NULLs in those columns would of course break a ton
of C code, so I don't think that's gonna happen.

We could overlay an FK onto OID columns that should never have zero
entries, but that would miss enough of the interesting cases that
I don't find it a very attractive proposal.

So I think we should finish up the project of making real SQL
constraints for the catalog indexes, but the idea of making FK
constraints too looks like it'll end in failure.

The reason I got interested in this right now is the nearby
discussion [1] about why findoidjoins misses some catalog
relationships and whether we should fix that and/or make its results
more readily accessible.  I'd thought perhaps FK constraint entries
could supersede the need for that tool altogether, but now it seems
like not.  I still like the idea of marking OID relationships in the
catalog headers though.  Perhaps we should take Joel's suggestion
of a new system catalog more seriously, and have genbki.pl populate
such a catalog from info in the catalog header files.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/d1f413ff-0e50-413d-910c-bdd9ee9752c1%40www.fastmail.com




Re: Key management with tests

2021-01-18 Thread Andres Freund
Hi,

On 2021-01-18 12:06:35 -0500, Bruce Momjian wrote:
> On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote:
> > OK, I looked at that and it is good, and I see my patch is missing that.
> > Are people looking for me to take the wiki content, expand on it and tie
> > it to the code that will be applied, or something else like all the
> > various crypto options and why we chose what we did beyond what is
> > already on the wiki?  I can easily go from what we have on the wiki to
> > implementation code steps, but the other part is harder to explain and
> > that is why I offered to talk to people via voice.
> 
> Just to clarify why voice calls can be helpful --- if you have to get
> into "you have to understand X to understand Y", that's where a voice
> call works best, because understanding X will require understanding
> A/B/C, and everyone's missing pieces are different, so you have to
> customize it for the individual.  

I don't think anybody argued against having voice calls.


> You can explain some of this in a README, but trying to cover all of it
> leads to a combinatorial problem of trying to explain everything. 
> Ideally the wiki page can be expanded so people can ask and answer all
> posted issues, perhaps in a Q&A format.  Someone could go through the
> archives and post why certain decisions were made, and link to the
> original emails.
> 
> I have to admit I was kind of baffled that the wiki page wasn't
> sufficient, because it is one of the longest Postgres feature
> explanations I have seen, but I now think the missing part is tying
> the wiki contents to the code implementation.  If that is it, please
> confirm.  If it is something else, also explain.

I don't think the wiki right now covers what's needed. The "Overview",
"Threat model" and "Scope of TDE" are a start, but beyond that it's
missing a bunch of things. And it's not in the source tree (we'll soon
have multiple versions of postgres with increasing levels of TDE
  features; the wiki doesn't help with that).

Missing:
- talks about cluster-wide encryption being simpler, without mentioning
  what it's being compared to, and what makes it simpler
- no differentiation from file system / block level encryption
- there's no explanation of which/why specific crypto primitives were
  chosen, what the tradeoffs are
- no explanation which keys exists, stored where
- the key management patch introduces new files, not documented
- there are new types of lock files, the possibility of interrupted
  operations, ... - no documentation of what that means
- there's no documentation what "key wrapping" actually precisely is,
  what the danger of the two-tier model is, ...
- are there dangers in not encrypting zero pages etc?
- ...



Personally (though I admit there are legitimate reasons to differ on
that note), I don't think it's reasonable for a feature this invasive to
commit preliminary patches without the major subsequent patches being in
a shape that allows reviewing the whole picture.

Greetings,

Andres Freund




Re: simplifying foreign key/RI checks

2021-01-18 Thread Zhihong Yu
Hi,
I was looking at this statement:

insert into f select generate_series(1, 200, 2);

Since certain generated values (the second half) are not in table p,
wouldn't insertion for those values fail?
I tried a scaled-down version (1/1000th) of your example:

yugabyte=# insert into f select generate_series(1, 2000, 2);
ERROR:  insert or update on table "f" violates foreign key constraint
"f_a_fkey"
DETAIL:  Key (a)=(1001) is not present in table "p".

For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch :

+* Collect partition key values from the unique key.

At the end of the nested loop, should there be an assertion
that partkey->partnatts partition key values have been found?
This can be done by using a counter (initialized to 0) which is incremented
when a match is found by the inner loop.
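
For example (a sketch; the inner-loop names are illustrative, not the
patch's actual ones):

    int     nmatched = 0;

    for (i = 0; i < partkey->partnatts; i++)
    {
        for (j = 0; j < indnkeyatts; j++)
        {
            if (partkey->partattrs[i] == indkeys[j])
            {
                nmatched++;
                break;
            }
        }
    }
    Assert(nmatched == partkey->partnatts);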

Cheers

On Mon, Jan 18, 2021 at 4:40 AM Amit Langote 
wrote:

> While discussing the topic of foreign key performance off-list with
> Robert and Corey (also came up briefly on the list recently [1], [2]),
> a few ideas were thrown around to simplify our current system of RI
> checks to enforce foreign keys with the aim of reducing some of its
> overheads.  The two main aspects of  how we do these checks that
> seemingly cause the most overhead are:
>
> * Using row-level triggers that are fired during the modification of
> the referencing and the referenced relations to perform them
>
> * Using plain SQL queries issued over SPI
>
> There is a discussion nearby titled "More efficient RI checks - take
> 2" [2] to address this problem from the viewpoint that it is using
> row-level triggers that causes the most overhead, although there are
> some posts mentioning that SQL-over-SPI is not without blame here.  I
> decided to focus on the latter aspect and tried reimplementing some
> checks such that SPI can be skipped altogether.
>
> I started with the check that's performed when inserting into or
> updating the referencing table to confirm that the new row points to a
> valid row in the referenced relation.  The corresponding SQL is this:
>
> SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x
>
> $1 is the value of the foreign key of the new row.  If the query
> returns a row, all good.  Thanks to SPI, or its use of plan caching,
> the query is re-planned only a handful of times before making a
> generic plan that is then saved and reused, which looks like this:
>
>   QUERY PLAN
> --
>  LockRows
>->  Index Scan using pk_pkey on pk x
>  Index Cond: (a = $1)
> (3 rows)
>
> So in most cases, the trigger's function would only execute the plan
> that's already there, at least in a given session.  That's good but
> what we realized would be even better is if we didn't have to
> "execute" a full-fledged "plan" for this, that is, to simply find out
> whether a row containing the key we're looking for exists in the
> referenced relation and if found lock it.  Directly scanning the index
> and locking it directly with table_tuple_lock() like ExecLockRows()
> does gives us exactly that behavior, which seems simple enough to be
> done in a not-so-long local function in ri_trigger.c.  I gave that a
> try and came up with the attached.  It also takes care of the case
> where the referenced relation is partitioned in which case its
> appropriate leaf partition's index is scanned.
>
> The patch results in ~2x improvement in the performance of inserts and
> updates on referencing tables:
>
> create table p (a numeric primary key);
> insert into p select generate_series(1, 100);
> create table f (a bigint references p);
>
> -- unpatched
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 6340.733 ms (00:06.341)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 7490.906 ms (00:07.491)
>
> -- patched
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 3340.808 ms (00:03.341)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 4178.171 ms (00:04.178)
>
> The improvement is even more dramatic when the referenced table (that
> we're no longer querying over SPI) is partitioned.  Here are the
> numbers when the PK relation has 1000 hash partitions.
>
> Unpatched:
>
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 35898.783 ms (00:35.899)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 37736.294 ms (00:37.736)
>
> Patched:
>
> insert into f select generate_series(1, 200, 2);
> INSERT 0 100
> Time: 5633.377 ms (00:05.633)
>
> update f set a = a + 1;
> UPDATE 100
> Time: 6345.029 ms (00:06.345)
>
> That's over ~5x improvement!
>
> While the above case seemed straightforward enough for skipping SPI,
> it seems a bit hard to do the same for other cases where we query the
> *referencing* relation during an operation on the referenced table
> (for example, checking if the row being deleted is still referenced),
> because the plan in those cases is not predict
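
A rough sketch of the SPI-free lookup being described, heavily
simplified (all names here are illustrative only; the actual patch
additionally deals with partitioned PK relations, rechecks, and
lock-wait policies):

    ScanKeyData skey;
    IndexScanDesc scan;
    TupleTableSlot *slot = table_slot_create(pk_rel, NULL);
    bool        found = false;

    /* probe the PK index for the new row's FK value */
    ScanKeyInit(&skey, 1, BTEqualStrategyNumber, F_NUMERIC_EQ, key_datum);
    scan = index_beginscan(pk_rel, pk_idx, GetLatestSnapshot(), 1, 0);
    index_rescan(scan, &skey, 1, NULL, 0);

    if (index_getnext_slot(scan, ForwardScanDirection, slot))
    {
        TM_FailureData tmfd;

        /* lock the matching row FOR KEY SHARE, as the SQL version does */
        if (table_tuple_lock(pk_rel, &slot->tts_tid, GetLatestSnapshot(),
                             slot, GetCurrentCommandId(false),
                             LockTupleKeyShare, LockWaitBlock, 0,
                             &tmfd) == TM_Ok)
            found = true;
    }

    index_endscan(scan);
    ExecDropSingleTupleTableSlot(slot);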

Re: POC: postgres_fdw insert batching

2021-01-18 Thread Tomas Vondra



On 1/18/21 7:51 AM, tsunakawa.ta...@fujitsu.com wrote:

Tomas-san,

From: Amit Langote 
Good thing you reminded me that this is about inserts, and in that
case no, ExecInitModifyTable() doesn't know all leaf partitions;
it only sees the root table, whose batch_size doesn't really matter.
So it's really ExecInitRoutingInfo() that I would recommend to set 
ri_BatchSize; right after this block:


/*
 * If the partition is a foreign table, let the FDW init itself for
 * routing tuples to the partition.
 */
if (partRelInfo->ri_FdwRoutine != NULL &&
    partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
    partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);

Note that ExecInitRoutingInfo() is called only once for a
partition when it is initialized after being inserted into for the
first time.

For non-partitioned targets, I'd still say set ri_BatchSize in
ExecInitModifyTable().


Attached is the patch that adds calls to GetModifyBatchSize() in the
above two places.  The regression test passes.

(FWIW, frankly, I prefer the previous version because the code is a
bit smaller...  Maybe we should refactor the code someday to reduce
similar processing in both the partitioned case and the non-partitioned
case.)



Less code would be nice, but it's not always the right thing to do, 
unfortunately :-(


I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so 
attached is a rebased patch (0001) fixing that.


0002 adds a couple comments and minor tweaks

0003 addresses a couple of shortcomings related to EXPLAIN - we haven't 
been showing the batch size for EXPLAIN (VERBOSE), because there'd be no 
FdwState, so this tries to fix that. Furthermore, there were no tests 
for EXPLAIN output with batch size, so I added a couple.



regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>From 9425a8501543d4caf55a302a877745b0f83b6046 Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Mon, 18 Jan 2021 15:18:14 +0100
Subject: [PATCH 1/3] Add bulk insert for foreign tables

---
 contrib/postgres_fdw/deparse.c|  43 ++-
 .../postgres_fdw/expected/postgres_fdw.out| 116 ++-
 contrib/postgres_fdw/option.c |  14 +
 contrib/postgres_fdw/postgres_fdw.c   | 302 ++
 contrib/postgres_fdw/postgres_fdw.h   |   5 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |  91 ++
 doc/src/sgml/fdwhandler.sgml  |  89 +-
 doc/src/sgml/postgres-fdw.sgml|  13 +
 src/backend/executor/execPartition.c  |  12 +
 src/backend/executor/nodeModifyTable.c| 157 +
 src/backend/nodes/list.c  |  15 +
 src/include/foreign/fdwapi.h  |  10 +
 src/include/nodes/execnodes.h |   6 +
 src/include/nodes/pg_list.h   |  15 +
 14 files changed, 822 insertions(+), 66 deletions(-)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 3cf7b4eb1e..2d38ab25cb 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -1711,7 +1711,7 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
  Index rtindex, Relation rel,
  List *targetAttrs, bool doNothing,
  List *withCheckOptionList, List *returningList,
- List **retrieved_attrs)
+ List **retrieved_attrs, int *values_end_len)
 {
 	AttrNumber	pindex;
 	bool		first;
@@ -1754,6 +1754,7 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 	}
 	else
 		appendStringInfoString(buf, " DEFAULT VALUES");
+	*values_end_len = buf->len;
 
 	if (doNothing)
 		appendStringInfoString(buf, " ON CONFLICT DO NOTHING");
@@ -1763,6 +1764,46 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte,
 		 withCheckOptionList, returningList, retrieved_attrs);
 }
 
+/*
+ * rebuild remote INSERT statement
+ *
+ */
+void
+rebuildInsertSql(StringInfo buf, char *orig_query,
+ int values_end_len, int num_cols,
+ int num_rows)
+{
+	int			i, j;
+	int			pindex;
+	bool		first;
+
+	/* Copy up to the end of the first record from the original query */
+	appendBinaryStringInfo(buf, orig_query, values_end_len);
+
+	/* Add records to VALUES clause */
+	pindex = num_cols + 1;
+	for (i = 0; i < num_rows; i++)
+	{
+		appendStringInfoString(buf, ", (");
+
+		first = true;
+		for (j = 0; j < num_cols; j++)
+		{
+			if (!first)
+appendStringInfoString(buf, ", ");
+			first = false;
+
+			appendStringInfo(buf, "$%d", pindex);
+			pindex++;
+		}
+
+		appendStringInfoChar(buf, ')');
+	}
+
+	/* Copy stuff after VALUES clause from the original query */
+	appendStringInfoString(buf, orig_query + values_end_len);
+}
+
 /*
  * deparse remote UPDATE statement
  *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1cad311436..8c0fdb5a9a 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -8923,7 +8

Re: [PATCH] ProcessInterrupts_hook

2021-01-18 Thread Robert Haas
On Mon, Jan 18, 2021 at 11:56 AM Tom Lane  wrote:
> > I've wanted this in the past, too, so +1 from me.
>
> I dunno, this seems pretty scary and easily abusable.  There's not all
> that much that can be done safely in ProcessInterrupts(), and we should
> not be encouraging extensions to think they can add random processing
> there.

We've had this disagreement before about other things, and I just
don't agree. If somebody uses a hook for something wildly unsafe, that
will break their stuff, not ours. That's not to say I endorse adding
hooks for random purposes in random places. In particular, if it's
impossible to use a particular hook in a reasonably safe way, that's a
sign that the hook is badly-designed and that we should not have it.
But, that's not the case here. I care more about smart extension
authors being able to do useful things than I do about the possibility
that dumb extension authors will do stupid things. We can't really
prevent the latter anyway: this is open source.

> We're about halfway there already, see 7e784d1dc.  I didn't do the
> other half because it wasn't necessary to the problem, but exposing
> the shutdown state more fully seems reasonable.

Ah, I hadn't realized.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: simplifying foreign key/RI checks

2021-01-18 Thread Pavel Stehule
On Mon, Jan 18, 2021 at 1:40 PM Amit Langote wrote:

> While discussing the topic of foreign key performance off-list with
> Robert and Corey (also came up briefly on the list recently [1], [2]),
> a few ideas were thrown around to simplify our current system of RI
> checks to enforce foreign keys with the aim of reducing some of its
> overheads.  The two main aspects of  how we do these checks that
> seemingly cause the most overhead are:
>
> * Using row-level triggers that are fired during the modification of
> the referencing and the referenced relations to perform them
>
> * Using plain SQL queries issued over SPI
>
> There is a discussion nearby titled "More efficient RI checks - take
> 2" [2] to address this problem from the viewpoint that it is using
> row-level triggers that causes the most overhead, although there are
> some posts mentioning that SQL-over-SPI is not without blame here.  I
> decided to focus on the latter aspect and tried reimplementing some
> checks such that SPI can be skipped altogether.
>
> I started with the check that's performed when inserting into or
> updating the referencing table to confirm that the new row points to a
> valid row in the referenced relation.  The corresponding SQL is this:
>
> SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x
>
> $1 is the value of the foreign key of the new row.  If the query
> returns a row, all good.  Thanks to SPI, or its use of plan caching,
> the query is re-planned only a handful of times before making a
> generic plan that is then saved and reused, which looks like this:
>
>   QUERY PLAN
> --
>  LockRows
>->  Index Scan using pk_pkey on pk x
>  Index Cond: (a = $1)
> (3 rows)
>
>
>

What is the performance when the referenced table is small? A lot of
codebooks are small, between 1,000 and 10,000 rows.
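
For concreteness, a minimal sketch of the case being discussed (table and
column names are hypothetical):

CREATE TABLE pk_rel (pkey int PRIMARY KEY);        -- a small "codebook"
CREATE TABLE fk_rel (pkey int REFERENCES pk_rel);

-- Each INSERT/UPDATE on fk_rel currently fires a row-level trigger that
-- runs the quoted query over SPI, i.e. roughly:
--   SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x
INSERT INTO fk_rel VALUES (42);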


Re: [Patch] ALTER SYSTEM READ ONLY

2021-01-18 Thread Robert Haas
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul  wrote:
> To move development, testing, and the review forward, I have commented out the
> code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and
> added the changes for the checkpointer so that system read-write state change
> request can be processed as soon as possible, as suggested by Robert[1].
>
> I have started a new thread[2] to understand the need for the CheckpointLock 
> in
> CreateCheckPoint() function. Until then we can continue work on this feature 
> by
> skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is
> marked WIP.

Based on the favorable review comment from Andres upthread and also
your feedback, I committed 0001.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Zhihong Yu
Hi,

+* Custom scan nodes can be leaf nodes or inner nodes and therfore need special treatment.

The special treatment applies only to inner nodes; the comment above should
be rephrased to make that clear.

Cheers

On Mon, Jan 18, 2021 at 2:43 AM David Geier  wrote:

> Hi hackers,
>
> While working with cursors that reference plans with CustomScanState
> nodes, I encountered a segfault which originates from
> search_plan_tree(). The query plan is the result of a simple SELECT
> statement into which I inject a Custom Scan node at the root to do some
> post-processing before returning rows. This plan is referenced by a
> second plan with a Tid Scan which originates from a query of the form
> DELETE FROM foo WHERE CURRENT OF my_cursor;
>
> search_plan_tree() assumes that
> CustomScanState::ScanState::ss_currentRelation is never NULL. In my
> understanding that only holds for CustomScanState nodes which are at the
> bottom of the plan and actually read from a relation. CustomScanState
> nodes which are not at the bottom don't have ss_currentRelation set. I
> believe for such nodes, instead search_plan_tree() should recurse into
> CustomScanState::custom_ps.
>
> I attached a patch. Any thoughts?
>
> Best regards,
> David
> Swarm64
>
>


Re: Online checksums patch - once again

2021-01-18 Thread Michael Banck
Hi,

On Fri, Jan 15, 2021 at 11:32:56AM +0100, Daniel Gustafsson wrote:
> The attached v30 adds the proposed optimizations in this thread as previously
> asked for, as well as some small cleanups to the procsignal calling codepath
> (replacing single call functions with just calling the function) and some
> function comments which were missing.

0002 (now 0001 I guess) needs a rebase due to a3ed4d1e.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: proposal: schema variables

2021-01-18 Thread Pavel Stehule
On Mon, Jan 18, 2021 at 3:24 PM Erik Rijkers wrote:

> On 2021-01-18 10:59, Pavel Stehule wrote:
> >>
> > and here is the patch
> > [schema-variables-20200118.patch.gz ]
>
>
> One small thing:
>
> The drop variable synopsis is:
>
> DROP VARIABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ]
>
> In that text following it, 'RESTRICT' is not documented. When used it
> does not give an error but I don't see how it 'works'.
>

This should be fixed now. Thank you for checking.

Regards

Pavel



>
> Erik
>
>
>
>


schema-variables-20200118-2.patch.gz
Description: application/gzip


Re: Add primary keys to system catalogs

2021-01-18 Thread Tom Lane
I wrote:
> ... I still like the idea of marking OID relationships in the
> catalog headers though.  Perhaps we should take Joel's suggestion
> of a new system catalog more seriously, and have genbki.pl populate
> such a catalog from info in the catalog header files.

On second thought, a catalog is overkill; it'd only be useful if the data
could change after initdb, which this data surely cannot.  The right way
to expose such info to SQL is with a set-returning function reading a
constant table in the C code, a la pg_get_keywords().  That way takes a
whole lot less infrastructure and is just as useful to SQL code.
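
As a concrete reference for that pattern, pg_get_keywords() is queryable
from SQL like any other set-returning function:

SELECT word, catcode FROM pg_get_keywords() LIMIT 3;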

[ wanders away wondering about a debug mode in which we actually validate
OIDs inserted into such columns... ]

regards, tom lane




Re: CLUSTER on partitioned index

2021-01-18 Thread Justin Pryzby
On Sat, Nov 28, 2020 at 08:03:02PM -0600, Justin Pryzby wrote:
> On Sun, Nov 15, 2020 at 07:53:35PM -0600, Justin Pryzby wrote:
> > On Wed, Nov 04, 2020 at 08:23:56PM -0600, Justin Pryzby wrote:
> > > On Tue, Oct 27, 2020 at 07:33:12PM -0500, Justin Pryzby wrote:
> > > > I'm attaching a counter-proposal to your catalog change, which preserves
> > > > indisclustered on children of clustered, partitioned indexes, and 
> > > > invalidates
> > > > indisclustered when attaching unclustered indexes.
> > > 
> > > ..and now propagates CLUSTER ON to child indexes.
> > > 
> > > I left this as separate patches to show what I mean and what's new while 
> > > we
> > > discuss it.
> > 
> > This fixes some omissions in the previous patch and error in its test cases.
> > 
> > CLUSTER ON recurses to children, since I think a clustered parent index 
> > means
> > that all its child indexes are clustered.  "SET WITHOUT CLUSTER" doesn't 
> > have
> > to recurse to children, but I did it like that for consistency and it avoids
> > the need to special case InvalidOid.
> 
> The previous patch failed pg_upgrade when restoring a clustered, parent index,
> since it's marked INVALID until indexes have been built on all child tables, 
> so
> CLUSTER ON was rejected on invalid index.
> 
> So I think CLUSTER ON needs to be a separate pg_dump object, to allow 
> attaching
> the child index (thereby making the parent "valid") to happen before SET
> CLUSTER on the parent index.

Rebased on b5913f612 and now a3dc92600.

This patch is intertwined with the tablespace patch: not only will it get
rebase conflict, but will also need to test the functionality of
CLUSTER (TABLESPACE a) partitioned_table;

-- 
Justin
>From cb817c479d2a2d4ae94cfa8d215e369b5d206738 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Thu, 26 Nov 2020 14:37:08 -0600
Subject: [PATCH v6 1/7] pg_dump: make CLUSTER ON a separate dump object..

..since it needs to be restored after any child indexes are restored *and
attached*.  The order needs to be:

1) restore child and parent index (order doesn't matter);
2) attach child index;
3) set cluster on child and parent index (order doesn't matter);
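
A sketch of the restore sequence this ordering enables (object names are
hypothetical):

CREATE INDEX p_idx ON ONLY parent (a);     -- parent index, INVALID so far
CREATE INDEX c_idx ON child (a);           -- child index
ALTER INDEX p_idx ATTACH PARTITION c_idx;  -- parent index becomes valid
ALTER TABLE child CLUSTER ON c_idx;        -- only now is CLUSTER ON accepted
ALTER TABLE parent CLUSTER ON p_idx;
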
---
 src/bin/pg_dump/pg_dump.c  | 86 ++
 src/bin/pg_dump/pg_dump.h  |  8 
 src/bin/pg_dump/pg_dump_sort.c |  8 
 3 files changed, 82 insertions(+), 20 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 798d14580e..e6526392e5 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -208,6 +208,7 @@ static void dumpSequence(Archive *fout, TableInfo *tbinfo);
 static void dumpSequenceData(Archive *fout, TableDataInfo *tdinfo);
 static void dumpIndex(Archive *fout, IndxInfo *indxinfo);
 static void dumpIndexAttach(Archive *fout, IndexAttachInfo *attachinfo);
+static void dumpIndexClusterOn(Archive *fout, IndexClusterInfo *clusterinfo);
 static void dumpStatisticsExt(Archive *fout, StatsExtInfo *statsextinfo);
 static void dumpConstraint(Archive *fout, ConstraintInfo *coninfo);
 static void dumpTableConstraintComment(Archive *fout, ConstraintInfo *coninfo);
@@ -7092,6 +7093,11 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 i_inddependcollversions;
 	int			ntups;
 
+	int	ncluster = 0;
+	IndexClusterInfo *clusterinfo;
+	clusterinfo = (IndexClusterInfo *)
+		pg_malloc0(numTables * sizeof(IndexClusterInfo));
+
 	for (i = 0; i < numTables; i++)
 	{
 		TableInfo  *tbinfo = &tblinfo[i];
@@ -7471,6 +7477,27 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables)
 /* Plain secondary index */
 indxinfo[j].indexconstraint = 0;
 			}
+
+			/* Record each table's CLUSTERed index, if any */
+			if (indxinfo[j].indisclustered)
+			{
+IndxInfo   *index = &indxinfo[j];
+IndexClusterInfo *cluster = &clusterinfo[ncluster];
+
+cluster->dobj.objType = DO_INDEX_CLUSTER_ON;
+cluster->dobj.catId.tableoid = 0;
+cluster->dobj.catId.oid = 0;
+AssignDumpId(&cluster->dobj);
+cluster->dobj.name = pg_strdup(index->dobj.name);
+cluster->dobj.namespace = index->indextable->dobj.namespace;
+cluster->index = index;
+cluster->indextable = &tblinfo[i];
+
+/* The CLUSTER ON depends on its index.. */
+addObjectDependency(&cluster->dobj, index->dobj.dumpId);
+
+ncluster++;
+			}
 		}
 
 		PQclear(res);
@@ -10296,6 +10323,9 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj)
 		case DO_SUBSCRIPTION:
 			dumpSubscription(fout, (SubscriptionInfo *) dobj);
 			break;
+		case DO_INDEX_CLUSTER_ON:
+			dumpIndexClusterOn(fout, (IndexClusterInfo *) dobj);
+			break;
 		case DO_PRE_DATA_BOUNDARY:
 		case DO_POST_DATA_BOUNDARY:
 			/* never dumped, nothing to do */
@@ -16516,6 +16546,41 @@ getAttrName(int attrnum, TableInfo *tblInfo)
 	return NULL;/* keep compiler quiet */
 }
 
+/*
+ * dumpIndexClusterOn
+ *	  record that the index is clustered.
+ */
+static void
+dumpIndexClusterOn(Archive *fout, IndexClusterInfo *clusterinfo)
+{
+	Du

Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Tom Lane
David Geier  writes:
> search_plan_tree() assumes that 
> CustomScanState::ScanState::ss_currentRelation is never NULL. In my 
> understanding that only holds for CustomScanState nodes which are at the 
> bottom of the plan and actually read from a relation. CustomScanState 
> nodes which are not at the bottom don't have ss_currentRelation set. I 
> believe for such nodes, instead search_plan_tree() should recurse into 
> CustomScanState::custom_ps.

Hm.  I agree that we shouldn't simply assume that ss_currentRelation
isn't null.  However, we cannot make search_plan_tree() descend
through non-leaf CustomScan nodes, because we don't know what processing
is involved there.  We need to find a scan that is guaranteed to return
rows that are one-to-one with the cursor output.  This is why the function
doesn't descend through join or aggregation nodes, and I see no argument
by which we should assume we know more about what a customscan node will
do than we know about those.

So I'm inclined to think a suitable fix is just

-   if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+   if (sstate->ss_currentRelation &&
+   RelationGetRelid(sstate->ss_currentRelation) == table_oid)
result = sstate;

regards, tom lane




Re: WIP: System Versioned Temporal Table

2021-01-18 Thread Surafel Temesgen
On Mon, Jan 18, 2021 at 1:43 AM Vik Fearing  wrote:

>
> This is not good, and I see that DROP SYSTEM VERSIONING also removes
> these columns which is even worse.  Please read the standard that you
> are trying to implement!
>
>
The standard states the function of ALTER TABLE ADD SYSTEM VERSIONING
as "Alter a regular persistent base table to a system-versioned table", and
a system-versioned table is described in the standard by two generated
stored constraint columns; the patch implements it as such.
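
For reference, the statements being debated, sketched with a hypothetical
table name:

ALTER TABLE accounts ADD SYSTEM VERSIONING;   -- adds the two generated columns
ALTER TABLE accounts DROP SYSTEM VERSIONING;  -- currently also removes them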


> I will do a more thorough review of the functionalities in this patch
> (not necessarily the code) this week.
>
>
Please do

regards
Surafel


Re: Key management with tests

2021-01-18 Thread Bruce Momjian
On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> Personally, but I admit that there's legitimate reasons to differ on
> that note, I don't think it's reasonable for a feature this invasive to
> commit preliminary patches without the major subsequent patches being in
> a shape that allows reviewing the whole picture.

OK, if that is a requirement, I can't help anymore since there are
already complaints that the patch is too large to review, even if broken
into pieces.  Please let me know what the community decides.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Key management with tests

2021-01-18 Thread Tom Kincaid
> > I have to admit I was kind of baffled that the wiki page wasn't
> > sufficient, because it is one of the longest Postgres feature
> > explanations I have seen, but I now think the missing part is tying
> > the wiki contents to the code implementation.  If that is it, please
> > confirm.  If it is something else, also explain.
>
> I don't think the wiki right now covers what's needed. The "Overview",
> "Threat model" and "Scope of TDE" are a start, but beyond that it's
> missing a bunch of things. And it's not in the source tree (we'll soon
> have multiple versions of postgres with increasing levels of TDE
> features, the wiki doesn't help with that)
>

Thanks, the versioning issue makes sense as a reason for the design
document to be part of the source tree.


As I was reading the README for the patch Amit referenced and as I am
going through this patch, I feel the desire to incorporate diagrams.
Are design diagrams ever incorporated in the source tree as a part of
the design description of a feature? If not, any concerns about doing
that? I think that is likely where I can contribute the most.


> Missing:
> - talks about cluster wide encyrption being simpler, without mentioning
>   what it's being compared to, and what makes it simpler
> - no differentiation from file system / block level encryption
> - there's no explanation of which/why specific crypto primitives were
>   chosen, what the tradeoffs are
> - no explanation which keys exists, stored where
> - the key management patch introduces new files, not documented
> - there's new types of lock files, possibility of interrupted
>   operations, ... - no documentation of what that means
> - there's no documentation what "key wrapping" actually precisely is,
>   what the danger of the two-tier model is, ...
> - are there dangers in not encrypting zero pages etc?
> - ...
>

Some of the missing things you mention above are about the design of
the TDE feature in general. However, this patch is about Key Management,
which is going to be part of the larger TDE feature. So it feels as though
there is a need for both a general design document about the overall
vision and approach for TDE and a specific design document for Key
Management. Is it appropriate to include both of those in the same
patch?

Something along the lines here is the overall design of TDE and here
is how the Key Management portion is designed and implemented. I guess
in that case, follow on patches for TDE could refer to the overall
design described in this patch.




>
>
> Personally, but I admit that there's legitimate reasons to differ on
> that note, I don't think it's reasonable for a feature this invasive to
> commit preliminary patches without the major subsequent patches being in
> a shape that allows reviewing the whole picture.
>
> Greetings,
>
> Andres Freund



-- 
Thomas John Kincaid




Re: Key management with tests

2021-01-18 Thread Andres Freund
On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote:
> On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote:
> > Personally, but I admit that there's legitimate reasons to differ on
> > that note, I don't think it's reasonable for a feature this invasive to
> > commit preliminary patches without the major subsequent patches being in
> > a shape that allows reviewing the whole picture.
> 
> OK, if that is a requirement, I can't help anymore since there are
> already complaints that the patch is too large to review, even if broken
> into pieces.  Please let me know what the community decides.

Those aren't conflicting demands. Having later patches around to
validate the design of earlier patches doesn't necessitates that the
later patches need to be reviewed at the same time.




Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Zhihong Yu
Hi,
It seems sstate->ss_currentRelation being null can only
occur for T_ForeignScanState and T_CustomScanState.

What about the following change ?

Cheers

diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index 0852bb9cec..56e31951d1 100644
--- a/src/backend/executor/execCurrent.c
+++ b/src/backend/executor/execCurrent.c
@@ -325,12 +325,21 @@ search_plan_tree(PlanState *node, Oid table_oid,
 		case T_IndexOnlyScanState:
 		case T_BitmapHeapScanState:
 		case T_TidScanState:
+			{
+				ScanState  *sstate = (ScanState *) node;
+
+				if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+					result = sstate;
+				break;
+			}
+
 		case T_ForeignScanState:
 		case T_CustomScanState:
 			{
 				ScanState  *sstate = (ScanState *) node;
 
-				if (RelationGetRelid(sstate->ss_currentRelation) == table_oid)
+				if (sstate->ss_currentRelation &&
+					RelationGetRelid(sstate->ss_currentRelation) == table_oid)
 					result = sstate;
 				break;
 			}

On Mon, Jan 18, 2021 at 10:46 AM Tom Lane  wrote:

> David Geier  writes:
> > search_plan_tree() assumes that
> > CustomScanState::ScanState::ss_currentRelation is never NULL. In my
> > understanding that only holds for CustomScanState nodes which are at the
> > bottom of the plan and actually read from a relation. CustomScanState
> > nodes which are not at the bottom don't have ss_currentRelation set. I
> > believe for such nodes, instead search_plan_tree() should recurse into
> > CustomScanState::custom_ps.
>
> Hm.  I agree that we shouldn't simply assume that ss_currentRelation
> isn't null.  However, we cannot make search_plan_tree() descend
> through non-leaf CustomScan nodes, because we don't know what processing
> is involved there.  We need to find a scan that is guaranteed to return
> rows that are one-to-one with the cursor output.  This is why the function
> doesn't descend through join or aggregation nodes, and I see no argument
> by which we should assume we know more about what a customscan node will
> do than we know about those.
>
> So I'm inclined to think a suitable fix is just
>
> -   if (RelationGetRelid(sstate->ss_currentRelation) ==
> table_oid)
> +   if (sstate->ss_currentRelation &&
> +   RelationGetRelid(sstate->ss_currentRelation) ==
> table_oid)
> result = sstate;
>
> regards, tom lane
>
>
>


Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-18 Thread Robert Haas
On Thu, Jan 14, 2021 at 10:21 AM Robert Haas  wrote:
> Yeah, I think this lock is useless. In fact, I think it's harmful,
> because it makes large sections of the checkpointer code, basically
> all of it really, run with interrupts held off for no reason.
>
> It's not impossible that removing the lock could reveal bugs
> elsewhere: suppose we have segments of code that actually aren't safe
> to interrupt, but because of this LWLock, it never happens. But, if
> that happens to be true, I think we should just view those cases as
> bugs to be fixed.
>
> One thing that blunts the impact of this quite a bit is that the
> checkpointer doesn't rely much on CHECK_FOR_INTERRUPTS() in the
> first place. It has its own machinery: HandleCheckpointerInterrupts().
> Possibly that's something we should be looking to refactor somehow as
> well.

Here's a patch to remove CheckpointLock completely. For
ProcessInterrupts() to do anything, one of the following things would
have to be true:

1. ProcDiePending = true. Normally this is set by die(), but the
checkpointer does not use die(). Perhaps someday it should, or maybe
not, but that's an issue for another day.

2. ClientConnectionLost = true. This gets set in pqcomm.c's
internal_flush(), but the checkpointer has no client connection.

3. DoingCommandRead = true. Can't happen; only set in PostgresMain().

4. QueryCancelPending = true. Can be set by recovery conflicts, but
those shouldn't happen in the checkpointer. Normally set by
StatementCancelHandler, which the checkpointer doesn't use.

5. IdleInTransactionSessionTimeoutPending = true. Set up by
InitPostgres(), which is not used by auxiliary processes.

6. ProcSignalBarrierPending = true. This is the case we actually care
about allowing for the ASRO patch, so if this occurs, it's good.

7. ParallelMessagePending = true. The checkpointer definitely never
starts parallel workers.

So I don't see any problem with just ripping this out entirely, but
I'd like to know if anybody else does.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-Remove-CheckpointLock.patch
Description: Binary data


Re: Deleting older versions in unique indexes to avoid page splits

2021-01-18 Thread Peter Geoghegan
On Sun, Jan 17, 2021 at 10:44 PM Amit Kapila  wrote:
> With the above example, it seems like it would also help when this is not 
> true.

I'll respond to your remarks here separately, in a later email.

> I am not sure what I proposed fits here, but "bottom-up" sounds like
> we are starting from the leaf level and moving upwards to the root level,
> which I think is not true here.

I guess that that's understandable, because it is true that B-Trees
are maintained in a bottom-up fashion. However, it's also true that
you can have top-down and bottom-up approaches in query optimizers,
and many other things (it's even something that is used to describe
governance models). The whole point of the term "bottom-up" is to
suggest that bottom-up deletion complements top-down cleanup by
VACUUM. I think that this design embodies certain principles that can
be generalized to other areas, such as heap pruning.

I recall that Robert Haas hated the name deduplication. I'm not about
to argue that my choice of "deduplication" was objectively better than
whatever he would have preferred. Same thing applies here - I more or
less chose a novel name because the concept is itself novel (unlike
deduplication). Reasonable people can disagree about what exact name
might have been better, but it's not like we're going to arrive at
something that everybody can be happy with. And whatever we arrive at
probably won't matter that much - the vast majority of users will
never need to know what either thing is. They may be important things,
but that doesn't mean that many people will ever think about them (in
fact I specifically hope that they don't ever need to think about
them).

> How does that work? Can heapam.c somehow indicate additional
> tuples (beyond what has been marked/sent by the IndexAM as promising) as
> deletable? I see in heap_index_delete_tuples that we mark the status
> of the passed tuples as deletable (by setting the knowndeletable flag for
> them).

The bottom-up nbtree caller to
heap_index_delete_tuples()/table_index_delete_tuple() (not to be
confused with the simple/LP_DEAD heap_index_delete_tuples() caller)
always provides heapam.c with a complete picture of the index page, in
the sense that it exhaustively has a delstate.deltids entry for each
and every TID on the page, no matter what. This is the case even
though in practice there is usually no possible way to check even 20%
of the deltids entries within heapam.c.

In general, the goal during a bottom-up pass is *not* to maximize
expected utility (i.e. the number of deleted index tuples/space
freed/whatever), exactly. The goal is to lower the variance across
related calls, so that we'll typically manage to free a fair number of
index tuples when we need to. In general it is much better for
heapam.c to make its decisions based on 2 or 3 good reasons rather
than just 1 excellent reason. And so heapam.c applies a power-of-two
bucketing scheme, never truly giving too much weight to what nbtree
tells it about duplicates. See comments above
bottomup_nblocksfavorable(), and bottomup_sort_and_shrink() comments
(both are from heapam.c).

-- 
Peter Geoghegan




Re: [PATCH] Add support for leading/trailing bytea trim()ing

2021-01-18 Thread Tom Lane
"Joel Jacobson"  writes:
> On Fri, Dec 4, 2020, at 22:02, Tom Lane wrote:
>> (Maybe the existing ltrim/rtrim descrs are also like this, but if so
>> I'd change them too.)

> They weren't, but I think the description for the bytea functions
> can be improved to have a more precise description
> if we take inspiration from the the text functions.

Yeah, I agree with making the bytea descriptions look like the
text ones.  Pushed with minor additional doc fixes.

regards, tom lane




Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-18 Thread Matthias van de Meent
On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera  wrote:
>
> So one last remaining improvement was to have VACUUM ignore processes
> doing CIC and RC to compute the Xid horizon of tuples to remove.  I
> think we can do something simple like the attached patch.

Regarding the patch:

> +			if (statusFlags & PROC_IN_SAFE_IC)
> +				h->catalog_oldest_nonremovable =
> +					TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> +			else
> +				h->data_oldest_nonremovable = h->catalog_oldest_nonremovable =
> +					TransactionIdOlder(h->data_oldest_nonremovable, xmin);

Would this not need to be the following? Right now, it resets
potentially older h->catalog_oldest_nonremovable (which is set in the
PROC_IN_SAFE_IC branch).

> +			if (statusFlags & PROC_IN_SAFE_IC)
> +				h->catalog_oldest_nonremovable =
> +					TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> +			else
> +			{
> +				h->data_oldest_nonremovable =
> +					TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> +				h->catalog_oldest_nonremovable =
> +					TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> +			}


Prior to reading the rest of my response: I do not understand the
intricacies of the VACUUM process, nor the systems of db snapshots, so
it's reasonably possible I misunderstand this.
But would this patch not allow for tuples to be created, deleted and
vacuumed away from a table whilst rebuilding an index on that table,
resulting in invalid pointers in the index?

Example:

1.) RI starts
2.) PHASE 2: filling the index:
2.1.) scanning the heap (live tuple is cached)
< tuple is deleted
< last transaction other than RI commits, only snapshot of RI exists
< vacuum drops the tuple, and cannot remove it from the new index
because this new index is not yet populated.
2.2.) sorting tuples
2.3.) index filled with tuples, incl. deleted tuple
3.) PHASE 3: wait for transactions
4.) PHASE 4: validate does not remove the tuple from the index,
because it is not built to do so: it will only insert new tuples.
Tuples that are marked for deletion are removed from the index only
through VACUUM (and optimistic ALL_DEAD detection).

According to my limited knowledge of RI, it requires VACUUM to not run
on the table during the initial index build process (which is
currently guaranteed through the use of a snapshot).
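
A minimal SQL sketch of that interleaving (object names are hypothetical):

-- session 1: phase 2 heap scan has cached a live tuple
REINDEX INDEX CONCURRENTLY foo_idx;

-- session 2, concurrently: delete that tuple and commit
DELETE FROM foo WHERE id = 1;

-- if a VACUUM could run at this point, it might remove the dead tuple
-- without touching the not-yet-populated new index
VACUUM foo;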


Regards,

Matthias van de Meent




Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault

2021-01-18 Thread Tom Lane
Zhihong Yu  writes:
> It seems sstate->ss_currentRelation being null can only
> occur for T_ForeignScanState and T_CustomScanState.
> What about the following change ?

Seems like more code for no very good reason.

regards, tom lane




Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-18 Thread Tom Lane
Robert Haas  writes:
> Here's a patch to remove CheckpointLock completely.  ...
> So I don't see any problem with just ripping this out entirely, but
> I'd like to know if anybody else does.

If memory serves, the reason for the lock was that the CHECKPOINT
command used to run the checkpointing code directly in the calling
backend, so we needed it to keep more than one process from doing
that at once.  AFAICS, it's no longer possible for more than one
process to try to run that code concurrently, so we shouldn't need
the lock anymore.

regards, tom lane




Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-18 Thread Álvaro Herrera
On 2021-Jan-18, Matthias van de Meent wrote:

> Example:
> 
> 1.) RI starts
> 2.) PHASE 2: filling the index:
> 2.1.) scanning the heap (live tuple is cached)
> < tuple is deleted
> < last transaction other than RI commits, only snapshot of RI exists
> < vacuum drops the tuple, and cannot remove it from the new index
> because this new index is not yet populated.
> 2.2.) sorting tuples
> 2.3.) index filled with tuples, incl. deleted tuple
> 3.) PHASE 3: wait for transactions
> 4.) PHASE 4: validate does not remove the tuple from the index,
> because it is not built to do so: it will only insert new tuples.
> Tuples that are marked for deletion are removed from the index only
> through VACUUM (and optimistic ALL_DEAD detection).
> 
> According to my limited knowledge of RI, it requires VACUUM to not run
> on the table during the initial index build process (which is
> currently guaranteed through the use of a snapshot).

VACUUM cannot run concurrently with CIC or RI in a table -- both acquire
ShareUpdateExclusiveLock, which conflicts with itself, so this cannot
occur.
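
That is easy to see from the lock levels (object names are hypothetical):
both statements below take ShareUpdateExclusiveLock on the table, so the
second simply blocks until the first finishes:

-- session 1
VACUUM foo;
-- session 2, concurrently
REINDEX INDEX CONCURRENTLY foo_idx;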

I do wonder if the problem you suggest (or something similar) can occur
via HOT pruning, though.

-- 
Álvaro Herrera   Valdivia, Chile




Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements

2021-01-18 Thread Álvaro Herrera
On 2021-Jan-18, Matthias van de Meent wrote:

> On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera  wrote:

> Would this not need to be the following? Right now, it resets
> potentially older h->catalog_oldest_nonremovable (which is set in the
> PROC_IN_SAFE_IC branch).
> 
> > +			if (statusFlags & PROC_IN_SAFE_IC)
> > +				h->catalog_oldest_nonremovable =
> > +					TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> > +			else
> > +			{
> > +				h->data_oldest_nonremovable =
> > +					TransactionIdOlder(h->data_oldest_nonremovable, xmin);
> > +				h->catalog_oldest_nonremovable =
> > +					TransactionIdOlder(h->catalog_oldest_nonremovable, xmin);
> > +			}

Oops, you're right.

-- 
Álvaro Herrera   Valdivia, Chile




Re: PG vs LLVM 12 on seawasp, next round

2021-01-18 Thread Fabien COELHO



Hello Alvaro,


The "no such file" error seems more like a machine local issue to me.


I'll look into it when I have time, which may take some time. Hopefully
over the holidays.


This is still happening ... Any chance you can have a look at it?


Indeed. I'll try to look (again) into it soon. I had a look but did not 
find anything obvious in the short time frame I had. The last two months 
were rather overloaded for me, so I let quite a few things slip. If you 
want to disable the animal as Tom suggests, go ahead.


--
Fabien.



