Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple
On Mon, Jan 18, 2021 at 1:16 PM japin wrote: > > > Hi, > > I find that the outputstr variable in logicalrep_write_tuple() only use in > `else` branch, I think we can narrow the scope, just like variable outputbytes > in `if` branch (for more readable). > > /* > * Send in binary if requested and type has suitable send function. > */ > if (binary && OidIsValid(typclass->typsend)) > { > bytea *outputbytes; > int len; > > pq_sendbyte(out, LOGICALREP_COLUMN_BINARY); > outputbytes = OidSendFunctionCall(typclass->typsend, values[i]); > len = VARSIZE(outputbytes) - VARHDRSZ; > pq_sendint(out, len, 4);/* length */ > pq_sendbytes(out, VARDATA(outputbytes), len); /* data */ > pfree(outputbytes); > } > else > { > pq_sendbyte(out, LOGICALREP_COLUMN_TEXT); > outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]); > pq_sendcountedtext(out, outputstr, strlen(outputstr), false); > pfree(outputstr); > } > > Attached is a samll patch to fix it. +1. Binary mode uses block level variable outputbytes, so making outputstr block level is fine IMO. Patch basically looks good to me, but it doesn't apply on my system. Looks like it's not created with git commit. Please create the patch with git commit command. git apply /mnt/hgfs/Shared/narrow-the-scope-of-the-variable-in-logicalrep_write_tuple.patch error: corrupt patch at line 10 With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
[PATCH] ProcessInterrupts_hook
Hi folks A few times lately I've been doing things in extensions that've made me want to be able to run my own code whenever InterruptPending is true and CHECK_FOR_INTERRUPTS() calls ProcessInterrupts() So here's a simple patch to add ProcessInterrupts_hook. It follows the usual pattern like ProcessUtility_hook and standard_ProcessUtility. Why? Because sometimes I want most of the behaviour of die(), but the option to override it with some bgworker-specific choices occasionally. HOLD_INTERRUPTS() is too big a hammer. What I really want to go along with this is a way for any backend to observe the postmaster's pmState and its "Shutdown" variable's value, so any backend can tell if we're in FastShutdown, SmartShutdown, etc. Copies in shmem only obviously. But I'm not convinced it's right to just copy these vars as-is to shmem, and I don't want to use the memory for a ProcSignal slot for something that won't be relevant for most backends for most of the postmaster lifetime. Ideas welcomed. From 1c8c477a5814420011fa034323e82d8eabc6bc5f Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Mon, 18 Jan 2021 14:14:46 +0800 Subject: [PATCH v1 1/4] Provide a hook for ProcessInterrupts() Code may now register a ProcessInterrupts_hook to fire its own logic before and/or after the main ProcessInterrupts handler for signal processing. The hook must call standard_ProcessInterrupts to execute the normal interrupt handling or set InterruptsPending to true to cause the interrupt to be re-processed at the next opportunity. This follows the consistent pattern used by other PostgreSQL hooks like the ProcessUtility_hook and standard_ProcessUtility() function. The purpose of this hook is to allow extensions to react to their own custom interrupt flags during CHECK_FOR_INTERRUPTS() calls invoked in main PostgreSQL code paths. --- src/backend/tcop/postgres.c | 20 src/include/miscadmin.h | 3 +++ 2 files changed, 23 insertions(+) diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 28055680aa..8cf1359e33 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -102,6 +102,8 @@ int max_stack_depth = 100; /* wait N seconds to allow attach from a debugger */ int PostAuthDelay = 0; +/* Intercept CHECK_FOR_INTERRUPTS() responses */ +ProcessInterrupts_hook_type ProcessInterrupts_hook = NULL; /* @@ -3064,8 +3066,26 @@ ProcessInterrupts(void) /* OK to accept any interrupts now? */ if (InterruptHoldoffCount != 0 || CritSectionCount != 0) return; + InterruptPending = false; + if (ProcessInterrupts_hook) + ProcessInterrupts_hook(); + else + standard_ProcessInterrupts(); +} + +/* + * Implement the default signal handling behaviour for most backend types + * including user backends and bgworkers. + * + * This is where CHECK_FOR_INTERRUPTS() eventually lands up unless intercepted + * by ProcessInterrupts_hook. 
+ */ +void +standard_ProcessInterrupts(void) +{ + if (ProcDiePending) { ProcDiePending = false; diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 1bdc97e308..f082d04448 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -94,6 +94,9 @@ extern PGDLLIMPORT volatile uint32 CritSectionCount; /* in tcop/postgres.c */ extern void ProcessInterrupts(void); +typedef void (*ProcessInterrupts_hook_type)(void); +extern ProcessInterrupts_hook_type ProcessInterrupts_hook; +extern void standard_ProcessInterrupts(void); #ifndef WIN32 -- 2.29.2 From 58b9ac884ef5bd35a2aac9da85079e24097612be Mon Sep 17 00:00:00 2001 From: Craig Ringer Date: Mon, 18 Jan 2021 15:26:43 +0800 Subject: [PATCH v1 2/4] Test for ProcessInterrupts_hook --- src/test/modules/Makefile | 3 +- src/test/modules/test_hooks/.gitignore| 4 + src/test/modules/test_hooks/Makefile | 28 .../expected/test_processinterrupt_hook.out | 0 src/test/modules/test_hooks/hooks.conf| 1 + .../sql/test_processinterrupt_hook.sql| 16 ++ .../modules/test_hooks/test_hooks--1.0.sql| 8 + src/test/modules/test_hooks/test_hooks.c | 37 + .../modules/test_hooks/test_hooks.control | 4 + src/test/modules/test_hooks/test_hooks.h | 20 +++ .../test_hooks/test_process_interrupts_hook.c | 140 ++ 11 files changed, 260 insertions(+), 1 deletion(-) create mode 100644 src/test/modules/test_hooks/.gitignore create mode 100644 src/test/modules/test_hooks/Makefile create mode 100644 src/test/modules/test_hooks/expected/test_processinterrupt_hook.out create mode 100644 src/test/modules/test_hooks/hooks.conf create mode 100644 src/test/modules/test_hooks/sql/test_processinterrupt_hook.sql create mode 100644 src/test/modules/test_hooks/test_hooks--1.0.sql create mode 100644 src/test/modules/test_hooks/test_hooks.c create mode 100644 src/test/modules/test_hooks/test_hooks.control create mode 100644 src/test/modules/test_hooks/tes
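As an illustration of how an extension would use this, a minimal sketch follows (it assumes the patch above is applied; the extension-side names such as my_interrupt_flag and my_ProcessInterrupts are made up for the example):

#include "postgres.h"

#include <signal.h>

#include "fmgr.h"
#include "miscadmin.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

/* hypothetical flag, e.g. set from a bgworker signal handler or ProcSignal callback */
static volatile sig_atomic_t my_interrupt_flag = false;

static ProcessInterrupts_hook_type prev_ProcessInterrupts_hook = NULL;

static void
my_ProcessInterrupts(void)
{
	/* handle the extension's own interrupt reason first */
	if (my_interrupt_flag)
	{
		my_interrupt_flag = false;
		elog(LOG, "handling extension-specific interrupt");
	}

	/* then chain to the previous hook, or to the standard behaviour */
	if (prev_ProcessInterrupts_hook)
		prev_ProcessInterrupts_hook();
	else
		standard_ProcessInterrupts();
}

void
_PG_init(void)
{
	prev_ProcessInterrupts_hook = ProcessInterrupts_hook;
	ProcessInterrupts_hook = my_ProcessInterrupts;
}

Chaining to any previously installed hook (or to standard_ProcessInterrupts()) keeps die()/cancel handling and other extensions working, following the same convention as ProcessUtility_hook.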
Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple
On Mon, 18 Jan 2021 at 15:59, Bharath Rupireddy wrote: > On Mon, Jan 18, 2021 at 1:16 PM japin wrote: >> >> >> Hi, >> >> I find that the outputstr variable in logicalrep_write_tuple() only use in >> `else` branch, I think we can narrow the scope, just like variable >> outputbytes >> in `if` branch (for more readable). >> >> /* >> * Send in binary if requested and type has suitable send function. >> */ >> if (binary && OidIsValid(typclass->typsend)) >> { >> bytea *outputbytes; >> int len; >> >> pq_sendbyte(out, LOGICALREP_COLUMN_BINARY); >> outputbytes = OidSendFunctionCall(typclass->typsend, values[i]); >> len = VARSIZE(outputbytes) - VARHDRSZ; >> pq_sendint(out, len, 4);/* length */ >> pq_sendbytes(out, VARDATA(outputbytes), len); /* data */ >> pfree(outputbytes); >> } >> else >> { >> pq_sendbyte(out, LOGICALREP_COLUMN_TEXT); >> outputstr = OidOutputFunctionCall(typclass->typoutput, >> values[i]); >> pq_sendcountedtext(out, outputstr, strlen(outputstr), false); >> pfree(outputstr); >> } >> >> Attached is a samll patch to fix it. > > +1. Binary mode uses block level variable outputbytes, so making > outputstr block level is fine IMO. > > Patch basically looks good to me, but it doesn't apply on my system. > Looks like it's not created with git commit. Please create the patch > with git commit command. > > git apply > /mnt/hgfs/Shared/narrow-the-scope-of-the-variable-in-logicalrep_write_tuple.patch > error: corrupt patch at line 10 > Thanks for reviewing! Attached v2 as you suggested. -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. >From 89fb10efc9eccf11f531ab8675376b57a12a6cb1 Mon Sep 17 00:00:00 2001 From: Japin Li Date: Mon, 18 Jan 2021 16:04:54 +0800 Subject: [PATCH v2] Narrow the scope of the variable outputstr in logicalrep_write_tuple --- src/backend/replication/logical/proto.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index 62275ebabe..f2c85cabb5 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -493,7 +493,6 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple, bool binar HeapTuple typtup; Form_pg_type typclass; Form_pg_attribute att = TupleDescAttr(desc, i); - char *outputstr; if (att->attisdropped || att->attgenerated) continue; @@ -537,6 +536,8 @@ logicalrep_write_tuple(StringInfo out, Relation rel, HeapTuple tuple, bool binar } else { + char *outputstr; + pq_sendbyte(out, LOGICALREP_COLUMN_TEXT); outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]); pq_sendcountedtext(out, outputstr, strlen(outputstr), false); -- 2.30.0
RE: Parallel INSERT (INTO ... SELECT ...)
From: Tang, Haiying > (does this patch make some optimizes in serial insert? I'm a little confused > here, Because the patched execution time is less than unpatched, but I didn't > find information in commit messages about it. If I missed something, please > kindly let me know.) I haven't thought of anything yet. Could you show us the output of EXPLAIN (ANALYZE, BUFFERS, VERBOSE) of 1,000 partitions case for the patched and unpatched? If it doesn't show any difference, the output of perf may be necessary next. (BTW, were all the 1,000 rows stored in the target table?) > I did another test which made check overhead obvious. this case is not fitting > for partition purpose, but I put it here as an extreme case too. > Select table total rows are 1,000, # of partitions is 2000. So only the first > 1000 > partitions have 1 row per partition, the last 1000 partitions have no data > inserted. Thank you, that's a good test. Regards Takayuki Tsunakawa
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 18, 2021 at 1:02 PM Tang, Haiying wrote: > > > From: Amit Kapila > > > Can we test cases when we have few rows in the Select table (say > > > 1000) and there 500 or 1000 partitions. In that case, we won't > > > select parallelism but we have to pay the price of checking > > > parallel-safety of all partitions. Can you check this with 100, 200, > > > 500, 1000 partitions table? > > > > I also wanted to see such an extreme(?) case. The 1,000 rows is not > > the count per partition but the total count of all partitions.e.g., > > when # of partitions is 100, # of rows per partition is 10. > > Below results are in serial plan which select table total rows are 1,000. The > Excution Time + Planning Time is still less than unpatched. > (does this patch make some optimizes in serial insert? I'm a little confused > here, Because the patched execution time is less than unpatched, but I didn't > find information in commit messages about it. If I missed something, please > kindly let me know.) > I don't think the patch should have any impact on the serial case. I think you can try to repeat each test 3 times both with and without a patch and take the median of the three. -- With Regards, Amit Kapila.
Re: Wrong usage of RelationNeedsWAL
At Sun, 17 Jan 2021 23:02:18 -0800, Noah Misch wrote in > On Sun, Jan 17, 2021 at 10:36:31PM -0800, Noah Misch wrote: > > I wrote the above based on the "PageGetLSN(page) > (snapshot)->lsn" check in > > TestForOldSnapshot(). If the LSN isn't important, what else explains > > RelationAllowsEarlyPruning() checking RelationNeedsWAL()? > > Thinking about it more, some RelationAllowsEarlyPruning() callers need > different treatment. Above, I was writing about the case of deciding whether > to do early pruning. The other RelationAllowsEarlyPruning() call sites are > deciding whether the relation might be lacking old data. For that case, we > should check RELPERSISTENCE_PERMANENT, not RelationNeedsWAL(). Otherwise, we > could fail to report an old-snapshot error in a case like this: > > setup: CREATE TABLE t AS SELECT ...; > xact1: BEGIN ISOLATION LEVEL REPEATABLE READ; SELECT 1; -- take snapshot > xact2: DELETE FROM t; > (plenty of time passes) > xact3: SELECT count(*) FROM t; -- early pruning > xact1: SAVEPOINT q; SELECT count(*) FROM t; ROLLBACK TO q; -- "snapshot too > old" > xact1: ALTER TABLE t SET TABLESPACE something; -- start skipping WAL > xact1: SELECT count(*) FROM t; -- no error, wanted "snapshot too old" > > Is that plausible? Thank you for the consideration and yes. But I get "snapshot too old" from the last query with the patched version. maybe I'm missing something. I'm going to investigate the case. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Determine parallel-safety of partition relations for Inserts
On Mon, Jan 18, 2021 at 10:27 AM tsunakawa.ta...@fujitsu.com wrote:
>
> From: Amit Kapila
> > We already allow users to specify the degree of parallelism for all
> > the parallel operations via guc's max_parallel_maintenance_workers,
> > max_parallel_workers_per_gather, then we have a reloption
> > parallel_workers and vacuum command has the parallel option where
> > users can specify the number of workers that can be used for
> > parallelism. The parallelism considers these as hints but decides
> > parallelism based on some other parameters like if there are that many
> > workers available, etc. Why the users would expect differently for
> > parallel DML?
>
> I agree that the user would want to specify the degree of parallelism of DML,
> too. My simple (probably silly) question was, in INSERT SELECT,
>
> * If the target table has 10 partitions and the source table has 100
> partitions, how would the user want to specify parameters?
>
> * If the source and target tables have the same number of partitions, and the
> user specified different values to parallel_workers and parallel_dml_workers,
> how many parallel workers would run?
>

Good question. I think if we choose to have a separate parameter for
DML, it can probably be a boolean to just indicate whether to enable
parallel DML for a specified table, and use the parallel_workers
specified for the table used in the SELECT.

--
With Regards,
Amit Kapila.
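To make the boolean-parameter idea concrete, usage might look something like the sketch below (purely illustrative — parallel_dml_enabled is a hypothetical option name, not something from the posted patches; the table names are placeholders):

-- hypothetical table option, for illustration only
ALTER TABLE target_tbl SET (parallel_dml_enabled = true);

-- the degree of parallelism would still come from the SELECT side,
-- e.g. the source table's existing parallel_workers reloption
ALTER TABLE source_tbl SET (parallel_workers = 4);

INSERT INTO target_tbl SELECT * FROM source_tbl;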
Re: Improve handling of parameter differences in physical replication
On 2021-01-15 12:28, Sergei Kornilov wrote: The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested Documentation:tested, passed Hello Look good for me. I think the patch is ready for commiter. The new status of this patch is: Ready for Committer committed -- Peter Eisentraut 2ndQuadrant, an EDB company https://www.2ndquadrant.com/
RE: Determine parallel-safety of partition relations for Inserts
From: Amit Kapila > Good question. I think if we choose to have a separate parameter for > DML, it can probably a boolean to just indicate whether to enable > parallel DML for a specified table and use the parallel_workers > specified in the table used in SELECT. Agreed. Regards Takayuki Tsunakawa
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On Mon, Jan 18, 2021 at 02:18:44PM +0900, Michael Paquier wrote: > On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote: > > It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and > > ReindexParams > > In my v31 patch, I moved ReindexOptions to a private structure in > > indexcmds.c, > > with an "int options" bitmask which is passed to reindex_index() et al. > > Your > > patch keeps/puts ReindexOptions index.h, so it also applies to > > reindex_index, > > which I think is good. > > a3dc926 is an equivalent of 0001~0003 merged together. 0008 had > better be submitted into a separate thread if there is value to it. > 0004~0007 are the pieces remaining. Could it be possible to rebase > things on HEAD and put the tablespace bits into the structures > {Vacuum,Reindex,Cluster}Params? Attached. I will re-review these myself tomorrow. -- Justin >From 42991c90afda1b2a9fe4095418a87b5ead4b23d9 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH 1/4] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 22 + src/backend/catalog/index.c | 93 ++- src/backend/commands/indexcmds.c | 103 +- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 53 +++ src/test/regress/output/tablespace.source | 102 + 7 files changed, 374 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..4f84060c4d 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,19 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + This specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. + + + + VERBOSE @@ -210,6 +224,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index b8cd35e995..9aa9fdf291 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -57,6 +57,7 @@ #include "commands/event_trigger.h" #include "commands/progress.h" #include "commands/tablecmds.h" +#include "commands/tablespace.h" #include "commands/trigger.h" #include "executor/executor.h" #include "miscadmin.h" @@ -1394,9 +1395,13 @@ index_update_collation_versions(Oid relid, Oid coll) * Create concurrently an index based on the definition of the one provided by * caller. The index is inserted into catalogs and needs to be built later * on. This is called during concurrent reindex processing. + * + * "tablespaceOid" is the new tablespace to use for this index. If + * InvalidOid, use the tablespace in-use instead. 
*/ Oid -index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName) +index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, + Oid tablespaceOid, const char *newName) { Relation indexRelation; IndexInfo *oldInfo, @@ -1526,7 +1531,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char newInfo, indexColNames, indexRelation->rd_rel->relam, - indexRelation->rd_rel->reltablespace, + OidIsValid(tablespaceOid) ? +tablespaceOid : indexRelation->rd_rel->reltablespace, indexRelation->rd_indcollation, indclass->values, indcoloptions->values, @@ -3591,6 +3597,8 @@ IndexGetRelation(Oid indexId, bool missing_ok) /* * reindex_index - This routine is used to recreate a single index + * + * See comments of reindex_relation() for details about "tablespaceOid". */ void reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, @@ -3603,6 +3611,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, volatile bool skipped_constraint = false; PGRUsage ru0; bool progress = ((params->options & REINDEXOPT_REPORT_PROGRESS) != 0); + bool set_tablespace = OidIsValid(params->tablespaceOid); pg_rusage_init(&ru0); @@ -3654,6 +3663,35 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, get_namespace_name(RelationGetNamespace(iRel)), RelationGetRelationName(iRel));
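Going by the synopsis in the documentation hunk above, usage would look roughly like this (a sketch against this patch version only; the option placement may still change during review, and the tablespace/table names are placeholders):

-- rebuild all of a table's indexes into a different tablespace
REINDEX (TABLESPACE fast_ssd, VERBOSE) TABLE my_table;

-- same, but for a single index
REINDEX (TABLESPACE fast_ssd) INDEX my_table_pkey;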
Re: evtfoid and evtowner missing in findoidjoins/README
On Sun, Jan 17, 2021, at 21:32, Tom Lane wrote: >Well, SGML is actually plenty easy to parse as long as you've got xml >tooling at hand. We'd never want to introduce such a dependency in the >normal build process, but making something like findoidjoins depend on >such tools seems within reason. I recall having whipped up some one-off >Perl scripts of that sort when I was doing that massive rewrite of the >func.sgml tables last year. I didn't save them though, and in any case >I'm the world's worst Perl programmer, so it'd be better for someone >with more Perl-fu to take point if we decide to go that route. I went a head and implemented the parser, it was indeed easy. Patch attached. Add catalog_oidjoins.pl -- parses catalog references out of catalogs.sgml Since doc/src/sgml/catalogs.sgml is the master where catalog references are to be documented, if someone needs machine-readable access to such information, it should be extracted from this document. The added script catalog_oidjoins.pl parses the SGML and extract the fields necessary to produce the same output as generated by findoidjoins, which has historically been copy/pasted to the README. This is to allow for easy comparison, to verify the correctness of catalog_oidjoins.pl, and if necessary, to update the README based on the information in catalogs.sgml. Helper-files: diff_oidjoins.sh Runs ./catalog_oidjoins.pl and compares its output with STDIN. Shows a diff of the result, witout any context. test_oidjoins.sh Runs ./diff_oidjoins.sh for both the README and the output from ./findoidjoins regression. bogus_oidjoins.txt List of bogus diff entires to ignore, based on documentation in README. After having run make installcheck in src/test/regress, the test_oidjoins.sh can be run: $ ./test_oidjoins.sh README: + Join pg_catalog.pg_attrdef.adnum => pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_class.relrewrite => pg_catalog.pg_class.oid + Join pg_catalog.pg_constraint.confkey []=> pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_constraint.conkey []=> pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_db_role_setting.setrole => pg_catalog.pg_authid.oid + Join pg_catalog.pg_default_acl.defaclnamespace => pg_catalog.pg_namespace.oid + Join pg_catalog.pg_default_acl.defaclrole => pg_catalog.pg_authid.oid + Join pg_catalog.pg_event_trigger.evtfoid => pg_catalog.pg_proc.oid + Join pg_catalog.pg_event_trigger.evtowner => pg_catalog.pg_authid.oid + Join pg_catalog.pg_extension.extconfig []=> pg_catalog.pg_class.oid + Join pg_catalog.pg_foreign_data_wrapper.fdwhandler => pg_catalog.pg_proc.oid + Join pg_catalog.pg_foreign_data_wrapper.fdwvalidator => pg_catalog.pg_proc.oid + Join pg_catalog.pg_foreign_table.ftrelid => pg_catalog.pg_class.oid + Join pg_catalog.pg_foreign_table.ftserver => pg_catalog.pg_foreign_server.oid + Join pg_catalog.pg_index.indkey => pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_partitioned_table.partattrs => pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_policy.polroles []=> pg_catalog.pg_authid.oid + Join pg_catalog.pg_publication.pubowner => pg_catalog.pg_authid.oid + Join pg_catalog.pg_publication_rel.prpubid => pg_catalog.pg_publication.oid + Join pg_catalog.pg_publication_rel.prrelid => pg_catalog.pg_class.oid + Join pg_catalog.pg_range.rngmultitypid => pg_catalog.pg_type.oid + Join pg_catalog.pg_seclabel.classoid => pg_catalog.pg_class.oid + Join pg_catalog.pg_shseclabel.classoid => pg_catalog.pg_class.oid + Join pg_catalog.pg_statistic.staattnum => pg_catalog.pg_attribute.attnum + Join 
pg_catalog.pg_statistic_ext.stxkeys => pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_subscription.subdbid => pg_catalog.pg_database.oid + Join pg_catalog.pg_subscription.subowner => pg_catalog.pg_authid.oid + Join pg_catalog.pg_subscription_rel.srrelid => pg_catalog.pg_class.oid + Join pg_catalog.pg_subscription_rel.srsubid => pg_catalog.pg_subscription.oid + Join pg_catalog.pg_trigger.tgattr => pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_type.typsubscript => pg_catalog.pg_proc.oid + Join pg_catalog.pg_user_mapping.umserver => pg_catalog.pg_foreign_server.oid + Join pg_catalog.pg_user_mapping.umuser => pg_catalog.pg_authid.oid findoidjoins: + Join pg_catalog.pg_attrdef.adnum => pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_class.relrewrite => pg_catalog.pg_class.oid + Join pg_catalog.pg_constraint.confkey []=> pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_constraint.conkey []=> pg_catalog.pg_attribute.attnum + Join pg_catalog.pg_db_role_setting.setrole => pg_catalog.pg_authid.oid + Join pg_catalog.pg_default_acl.defaclnamespace => pg_catalog.pg_namespace.oid + Join pg_catalog.pg_default_acl.defaclrole => pg_catalog.pg_authid.oid + Join pg_catalog.pg_event_trigger.evtfoid => pg_catalog.pg_proc.oid + Join pg_catalog.pg_event_trigger.evtowner => pg_catalog.pg_authid.oid + Join pg_catalog.pg_extension.extconfig []=>
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 18, 2021 at 10:45 AM Greg Nancarrow wrote: > > On Fri, Jan 15, 2021 at 7:39 PM Amit Kapila wrote: > > > > Here is an additional review of > > v11-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT. There are > > quite a few comments raised on the V11-0001* patch. I suggest first > > post a revised version of V11-0001* patch addressing those comments > > and then you can separately post a revised version of > > v11-0003-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO. > > > > 1) > > >Here, it seems we are accessing the relation descriptor without any > >lock on the table which is dangerous considering the table can be > >modified in a parallel session. Is there a reason why you think this > >is safe? Did you check anywhere else such a coding pattern? > > Yes, there's a very good reason and I certainly have checked for the > same coding pattern elsewhere, and not just randomly decided that > locking can be ignored. > The table has ALREADY been locked (by the caller) during the > parse/analyze phase. > Fair enough. I suggest adding a comment saying the same so that the reader doesn't get confused about the same. > (This is not the case for a partition, in which case the patch code > uses AccessShareLock, as you will see). Okay, but I see you release the lock on partition rel immediately after checking parallel-safety. What if a user added some parallel-unsafe constraint (via Alter Table) after that check? > > 4) > > >domain_max_parallel_hazard_for_modify() > >{ > >.. > >+ if (isnull) > >+ { > >+ /* > >+ * This shouldn't ever happen, but if it does, log a WARNING > >+ * and return UNSAFE, rather than erroring out. > >+ */ > >+ elog(WARNING, "null conbin for constraint %u", con->oid); > >+ context->max_hazard = PROPARALLEL_UNSAFE; > >+ break; > >+ } > >.. > >} > >index_expr_max_parallel_hazard_for_modify() > >{ > >.. > >+ if (index_expr_item == NULL) /* shouldn't happen */ > >+ { > >+ index_close(index_rel, lockmode); > >+ context->max_hazard = PROPARALLEL_UNSAFE; > >+ return context->max_hazard; > >+ } > >.. > >} > > >It is not clear why the above two are shouldn't happen cases and if so > >why we want to treat them as unsafe. Ideally, there should be an > >Assert if these can't happen but it is difficult to decide without > >knowing why you have considered them unsafe? > > The checks being done here for "should never happen" cases are THE > SAME as other parts of the Postgres code. > For example, search Postgres code for "null conbin" and you'll find 6 > other places in the Postgres code which actually ERROR out if conbin > (binary representation of the constraint) in a pg_constraint tuple is > found to be null. > The cases that you point out in the patch used to also error out in > the same way, but Vignesh suggested changing them to just return > parallel-unsafe instead of erroring-out, which I agree with. > You have not raised a WARNING for the second case. But in the first place what is the reasoning for making this different from other parts of code? If we don't have a solid reason then I suggest keeping these checks and errors the same as in other parts of the code. -- With Regards, Amit Kapila.
RE: Parallel INSERT (INTO ... SELECT ...)
Hi Tsunakawa-san > From: Tang, Haiying > > (does this patch make some optimizes in serial insert? I'm a little > > confused here, Because the patched execution time is less than > > unpatched, but I didn't find information in commit messages about it. > > If I missed something, please kindly let me know.) > > I haven't thought of anything yet. Could you show us the output of > EXPLAIN (ANALYZE, BUFFERS, VERBOSE) of 1,000 partitions case for the > patched and unpatched? If it doesn't show any difference, the output > of perf may be necessary next. Execute EXPLAIN on Patched: postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * from test_data1; QUERY PLAN Insert on public.test_part (cost=0.00..15.00 rows=0 width=0) (actual time=44.139..44.140 rows=0 loops=1) Buffers: shared hit=1005 read=1000 dirtied=3000 written=2000 -> Seq Scan on public.test_data1 (cost=0.00..15.00 rows=1000 width=8) (actual time=0.007..0.201 rows=1000 loops=1) Output: test_data1.a, test_data1.b Buffers: shared hit=5 Planning: Buffers: shared hit=27011 Planning Time: 24.526 ms Execution Time: 44.981 ms Execute EXPLAIN on non-Patched: postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * from test_data1; QUERY PLAN Insert on public.test_part (cost=0.00..15.00 rows=0 width=0) (actual time=72.656..72.657 rows=0 loops=1) Buffers: shared hit=22075 read=1000 dirtied=3000 written=2000 -> Seq Scan on public.test_data1 (cost=0.00..15.00 rows=1000 width=8) (actual time=0.010..0.175 rows=1000 loops=1) Output: test_data1.a, test_data1.b Buffers: shared hit=5 Planning: Buffers: shared hit=72 Planning Time: 0.135 ms Execution Time: 79.058 ms > (BTW, were all the 1,000 rows stored in the target table?) Yes, I checked all rows stored in target table. postgres=# select count(*) from test_part; count --- 1000 Regards, Tang
RE: Parallel INSERT (INTO ... SELECT ...)
Hi Amit

> I don't think the patch should have any impact on the serial case. I
> think you can try to repeat each test 3 times both with and without a
> patch and take the median of the three.

Actually, I repeated the test about 10 times, and the patched execution time was always less than unpatched.

Regards,
Tang
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 18, 2021 at 2:40 PM Tang, Haiying wrote: > > Hi Tsunakawa-san > > > From: Tang, Haiying > > > (does this patch make some optimizes in serial insert? I'm a little > > > confused here, Because the patched execution time is less than > > > unpatched, but I didn't find information in commit messages about it. > > > If I missed something, please kindly let me know.) > > > > I haven't thought of anything yet. Could you show us the output of > > EXPLAIN (ANALYZE, BUFFERS, VERBOSE) of 1,000 partitions case for the > > patched and unpatched? If it doesn't show any difference, the output > > of perf may be necessary next. > > Execute EXPLAIN on Patched: > postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * > from test_data1; >QUERY PLAN > > Insert on public.test_part (cost=0.00..15.00 rows=0 width=0) (actual > time=44.139..44.140 rows=0 loops=1) >Buffers: shared hit=1005 read=1000 dirtied=3000 written=2000 >-> Seq Scan on public.test_data1 (cost=0.00..15.00 rows=1000 width=8) > (actual time=0.007..0.201 rows=1000 loops=1) > Output: test_data1.a, test_data1.b > Buffers: shared hit=5 > Planning: >Buffers: shared hit=27011 > Planning Time: 24.526 ms > Execution Time: 44.981 ms > > Execute EXPLAIN on non-Patched: > postgres=# explain (ANALYZE, BUFFERS, VERBOSE) insert into test_part select * > from test_data1; >QUERY PLAN > > Insert on public.test_part (cost=0.00..15.00 rows=0 width=0) (actual > time=72.656..72.657 rows=0 loops=1) >Buffers: shared hit=22075 read=1000 dirtied=3000 written=2000 >-> Seq Scan on public.test_data1 (cost=0.00..15.00 rows=1000 width=8) > (actual time=0.010..0.175 rows=1000 loops=1) > Output: test_data1.a, test_data1.b > Buffers: shared hit=5 > Planning: >Buffers: shared hit=72 > Planning Time: 0.135 ms > Execution Time: 79.058 ms > So, the results indicate that after the patch we touch more buffers during planning which I think is because of accessing the partition information, and during execution, the patch touches fewer buffers for the same reason. But why this can reduce the time with patch? I think this needs some investigation. -- With Regards, Amit Kapila.
Re: popcount
On 2021-01-11 17:13, David Fetter wrote: On Mon, Jan 11, 2021 at 03:50:54PM +0100, Peter Eisentraut wrote: On 2020-12-30 17:41, David Fetter wrote: The input may have more than 2 billion bits set to 1. The biggest possible result should be 8 billion for bytea (1 GB with all bits set to 1). So shouldn't this function return an int8? It does now, and thanks for looking at this. The documentation still reflects the previous int4 return type (in two different spellings, too). Thanks for looking this over! Please find attached the next version with corrected documentation. The documentation entries should be placed in alphabetical order in their respective tables. For the bytea function, maybe choose a simpler example that a reader can compute in their head. Also for the test cases. In varbit.c: The comment says + * Returns the number of bits set in a bit array. but it should be "bit string". + int8popcount; should be int64. Also, pg_popcount() returns uint64, not int64. Perhaps overflow is not possible here, but perhaps a small comment about why or an assertion could be appropriate. + p = VARBITS(arg1); Why not assign that in the declaration block as well? This comment: + /* +* ceil(VARBITLEN(ARG1)/BITS_PER_BYTE) +* done to minimize branches and instructions. +*/ I don't know what that is supposed to mean or why that kind of tuning would be necessary for a user-callable function. + popcount = pg_popcount((const char *)p, len); The "const" is probably not necessary. byteapopcount() looks better, but also needs int8 -> int64.
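To make the suggested fixes concrete, the varbit function would end up looking roughly like the sketch below (the function name and details here are assumptions based on the review comments, not the actual patch):

#include "postgres.h"

#include "fmgr.h"
#include "port/pg_bitutils.h"
#include "utils/varbit.h"

/*
 * Returns the number of bits set in a bit string.
 */
Datum
bit_popcount(PG_FUNCTION_ARGS)
{
	VarBit	   *arg1 = PG_GETARG_VARBIT_P(0);
	bits8	   *p = VARBITS(arg1);
	int			len = VARBITBYTES(arg1);	/* trailing padding bits are always zero */
	int64		popcount;

	/*
	 * pg_popcount() returns uint64, but the result cannot exceed 8 billion
	 * for a 1 GB input, so assigning it to int64 cannot overflow.
	 */
	popcount = (int64) pg_popcount((char *) p, len);

	PG_RETURN_INT64(popcount);
}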
Re: proposal: schema variables
On Thu, Jan 14, 2021 at 10:24, Erik Rijkers wrote:

> On 2021-01-14 07:35, Pavel Stehule wrote:
> > [schema-variables-20210114.patch.gz]
>
> Build is fine. My (small) list of tests run OK.
>
> I did notice a few more documentation peculiarities:
>
> 'The PostgreSQL has schema variables' should be
> 'PostgreSQL has schema variables'

fixed

> A link to the LET command should be added to the 'See Also' of the
> CREATE VARIABLE, ALTER VARIABLE, and DROP VARIABLE pages. (After all,
> the LET command is the most interesting)
> Similarly, an ALTER VARIABLE link should be added to the 'See Also'
> section of LET.

fixed

> Somehow, the sgml in the doc files causes too large spacing in the html,
> example:
> I copy from the LET html:
> 'if that is defined. If no explicit'
> (6 spaces between 'defined.' and 'If')
> Can you have a look? Sorry - I can't find the cause. It occurs on a
> few more places in the newly added sgml/html. The unwanted spaces are
> visible also in the pdf.

Should be fixed in the last version. Probably there were some problems with invisible white chars.

Thank you for check

Pavel

> (firefox 78.6.1, debian)
>
> Thanks,
>
> Erik Rijkers
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 18, 2021 at 2:42 PM Amit Kapila wrote: > > On Mon, Jan 18, 2021 at 10:45 AM Greg Nancarrow wrote: > > > > On Fri, Jan 15, 2021 at 7:39 PM Amit Kapila wrote: > > > > > > Here is an additional review of > > > v11-0001-Enable-parallel-SELECT-for-INSERT-INTO-.-SELECT. There are > > > quite a few comments raised on the V11-0001* patch. I suggest first > > > post a revised version of V11-0001* patch addressing those comments > > > and then you can separately post a revised version of > > > v11-0003-Enable-parallel-INSERT-and-or-SELECT-for-INSERT-INTO. > > > > > > > 1) > > > > >Here, it seems we are accessing the relation descriptor without any > > >lock on the table which is dangerous considering the table can be > > >modified in a parallel session. Is there a reason why you think this > > >is safe? Did you check anywhere else such a coding pattern? > > > > Yes, there's a very good reason and I certainly have checked for the > > same coding pattern elsewhere, and not just randomly decided that > > locking can be ignored. > > The table has ALREADY been locked (by the caller) during the > > parse/analyze phase. > > > > Fair enough. I suggest adding a comment saying the same so that the > reader doesn't get confused about the same. > > > (This is not the case for a partition, in which case the patch code > > uses AccessShareLock, as you will see). > > Okay, but I see you release the lock on partition rel immediately > after checking parallel-safety. What if a user added some > parallel-unsafe constraint (via Alter Table) after that check? > > > > > 4) > > > > >domain_max_parallel_hazard_for_modify() > > >{ > > >.. > > >+ if (isnull) > > >+ { > > >+ /* > > >+ * This shouldn't ever happen, but if it does, log a WARNING > > >+ * and return UNSAFE, rather than erroring out. > > >+ */ > > >+ elog(WARNING, "null conbin for constraint %u", con->oid); > > >+ context->max_hazard = PROPARALLEL_UNSAFE; > > >+ break; > > >+ } > > >.. > > >} > > >index_expr_max_parallel_hazard_for_modify() > > >{ > > >.. > > >+ if (index_expr_item == NULL) /* shouldn't happen */ > > >+ { > > >+ index_close(index_rel, lockmode); > > >+ context->max_hazard = PROPARALLEL_UNSAFE; > > >+ return context->max_hazard; > > >+ } > > >.. > > >} > > > > >It is not clear why the above two are shouldn't happen cases and if so > > >why we want to treat them as unsafe. Ideally, there should be an > > >Assert if these can't happen but it is difficult to decide without > > >knowing why you have considered them unsafe? > > > > The checks being done here for "should never happen" cases are THE > > SAME as other parts of the Postgres code. > > For example, search Postgres code for "null conbin" and you'll find 6 > > other places in the Postgres code which actually ERROR out if conbin > > (binary representation of the constraint) in a pg_constraint tuple is > > found to be null. > > The cases that you point out in the patch used to also error out in > > the same way, but Vignesh suggested changing them to just return > > parallel-unsafe instead of erroring-out, which I agree with. > > > > You have not raised a WARNING for the second case. But in the first > place what is the reasoning for making this different from other parts > of code? > On again, thinking about this, I see a reason why one wants to do like you have done currently in the patch. 
It helps us to avoid giving such errors when they are really not required, say when the problem occurred while checking parallel-safety for a particular partition that, in reality, we will never insert into; there are probably other similar cases. I guess we should give the WARNING consistently in all such cases.

--
With Regards,
Amit Kapila.
Re: proposal: schema variables
Hi

On Thu, Jan 14, 2021 at 11:31, Josef Šimánek wrote:

> I did some testing locally. All runs smoothly, compiled without warning.
>
> Later on (once merged) it would be nice to write down a documentation
> page for the whole feature as a set next to documented individual
> commands.
> It took me a few moments to understand how this works.
>
> I was looking how to create variable with non default value in one
> command, but if I understand it correctly, that's not the purpose.
> Variable lives in a schema with default value, but you can set it per
> session via LET.
>
> Thus "CREATE VARIABLE" statement should not be usually part of
> "classic" queries, but it should be treated more like TABLE or
> other DDL statements.
>
> On the other side LET is there to use the variable in "classic" queries.
>
> Does that make sense? Feel free to ping me if any help with
> documentation would be needed. I can try to prepare an initial
> variables guide once I'll ensure I do understand this feature well.

I invite any help with the docs. Maybe there can be a page in the "Advanced Features" section:
https://www.postgresql.org/docs/current/tutorial-advanced.html

Regards

Pavel

> PS: Now it is clear to me why it is called "schema variables", not
> "session variables".
>
> On Thu, Jan 14, 2021 at 7:36, Pavel Stehule wrote:
> >
> > Hi
> >
> > rebase
> >
> > Regards
> >
> > Pavel
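For readers new to the feature, a minimal sketch of the commands being discussed (based on the CREATE VARIABLE, LET and DROP VARIABLE commands named in this thread; the exact syntax is whatever the current patch version implements):

-- a schema variable is DDL, like CREATE TABLE; the DEFAULT applies per session
CREATE VARIABLE public.current_customer_id AS integer DEFAULT 0;

-- give it a session-local value
LET public.current_customer_id = 42;

-- read it from an ordinary query
SELECT public.current_customer_id;

DROP VARIABLE public.current_customer_id;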
Re: proposal: schema variables
On Mon, Jan 18, 2021 at 10:50, Pavel Stehule wrote:

> On Thu, Jan 14, 2021 at 10:24, Erik Rijkers wrote:
>
>> On 2021-01-14 07:35, Pavel Stehule wrote:
>> > [schema-variables-20210114.patch.gz]
>>
>> Build is fine. My (small) list of tests run OK.
>>
>> I did notice a few more documentation peculiarities:
>>
>> 'The PostgreSQL has schema variables' should be
>> 'PostgreSQL has schema variables'
>
> fixed
>
>> A link to the LET command should be added to the 'See Also' of the
>> CREATE VARIABLE, ALTER VARIABLE, and DROP VARIABLE pages. (After all,
>> the LET command is the most interesting)
>> Similarly, an ALTER VARIABLE link should be added to the 'See Also'
>> section of LET.
>
> fixed
>
>> Somehow, the sgml in the doc files causes too large spacing in the html,
>> example:
>> I copy from the LET html:
>> 'if that is defined. If no explicit'
>> (6 spaces between 'defined.' and 'If')
>> Can you have a look? Sorry - I can't find the cause. It occurs on a
>> few more places in the newly added sgml/html. The unwanted spaces are
>> visible also in the pdf.
>
> Should be fixed in the last version. Probably there were some problems
> with invisible white chars.
>
> Thank you for check
>
> Pavel
>
>> (firefox 78.6.1, debian)

and here is the patch

Regards

Pavel

>>
>> Thanks,
>>
>> Erik Rijkers
>>

schema-variables-20200118.patch.gz
Description: application/gzip
Re: Proposal: Global Index
> On Jan 12, 2021, at 02:37, Robert Haas wrote:
>
> On Mon, Jan 11, 2021 at 12:46 PM Bruce Momjian wrote:
>>> For 1) The DETACH old child table can be finished immediately, global index
>>> can be kept valid after DETACH is completed, and the cleanup of garbage
>>> data in global index can be deferred to VACUUM.
>>> This is similar to the global index optimization done by Oracle12c.
>>> For 2) ATTACH new empty child table can also be completed immediately.
>>> If this is the case, many of the advantages of partitioned tables will be
>>> retained, while the advantages of global indexes will be gained.
>>
>> Yes, we can keep the index rows for the deleted partition and clean them
>> up later, but what is the advantage of partitioning then? Just heap
>> deletion quickly? Is that enough of a value?
>
> I actually think the idea of lazily deleting the index entries is
> pretty good, but it won't work if the way the global index is
> implemented is by adding a tableoid column. Because then, I might
> detach a partition and later reattach it and the old index entries are
> still there but the table contents might have changed. Worse yet, the
> table might be dropped and the table OID reused for a completely
> unrelated table with completely unrelated contents, which could then
> be attached as a new partition.
>
> One of the big selling points of global indexes is that they allow you
> to enforce uniqueness on a column unrelated to the partitioning
> column. Another is that you can look up a value by doing a single
> index scan on the global index rather than an index scan per
> partition. Those things are no less valuable for performing index
> deletion lazily.
>
> However, there is a VACUUM amplification effect to worry about here
> which Wenjing seems not to be considering. Suppose I have a table
> which is not partitioned and it is 1TB in size with an index that is
> 128GB in size. To vacuum the table, I need to do 1TB + 128GB of I/O.
> Now, suppose I now partition the table into 1024 partitions each with
> its own local index. Each partition is 1GB in size and the index on
> each partition is 128MB in size. To vacuum an individual partition
> requires 1GB + 128MB of I/O, so to vacuum all the partitions requires
> the same amount of total I/O as before. But, now suppose that I have a
> single global index instead of a local index per partition. First, how
> big will that index be? It will not be 128GB, but somewhat bigger,
> because it needs extra space for every indexed tuple. Let's say 140GB.
> Furthermore, it will need to be vacuumed whenever any child is
> vacuumed, because it contains some index entries from every child. So
> the total I/O to vacuum all partitions is now 1GB * 1024 + 140GB *
> 1024 = 141TB, which is a heck of a lot worse than the 1.125TB I
> required with the unpartitioned table or the locally partitioned
> table.

Thank you for pointing this out.
It seems that some optimization can be done, but there is no good way to
completely eliminate the vacuum amplification effect of the global index.
Maybe we can only count on Zheap, which doesn't need to do VACUUM.

> That's not necessarily a death sentence for every use case, but it's
> going to be pretty bad for tables that are big and heavily updated.
>
> --
> Robert Haas
> EDB: http://www.enterprisedb.com

smime.p7s
Description: S/MIME cryptographic signature
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Jan 18, 2021 at 8:10 PM Amit Kapila wrote: > > > 1) > > > > >Here, it seems we are accessing the relation descriptor without any > > >lock on the table which is dangerous considering the table can be > > >modified in a parallel session. Is there a reason why you think this > > >is safe? Did you check anywhere else such a coding pattern? > > > > Yes, there's a very good reason and I certainly have checked for the > > same coding pattern elsewhere, and not just randomly decided that > > locking can be ignored. > > The table has ALREADY been locked (by the caller) during the > > parse/analyze phase. > > > > Fair enough. I suggest adding a comment saying the same so that the > reader doesn't get confused about the same. > OK, I'll add a comment. > > (This is not the case for a partition, in which case the patch code > > uses AccessShareLock, as you will see). > > Okay, but I see you release the lock on partition rel immediately > after checking parallel-safety. What if a user added some > parallel-unsafe constraint (via Alter Table) after that check? > > > I'm not sure. But there would be a similar concern for current Parallel SELECT functionality, right? My recollection is that ALTER TABLE obtains an exclusive lock on the table which it retains until the end of the transaction, so that will result in blocking at certain points, during parallel-checks and execution, but there may still be a window. > > 4) > > > > >domain_max_parallel_hazard_for_modify() > > >{ > > >.. > > >+ if (isnull) > > >+ { > > >+ /* > > >+ * This shouldn't ever happen, but if it does, log a WARNING > > >+ * and return UNSAFE, rather than erroring out. > > >+ */ > > >+ elog(WARNING, "null conbin for constraint %u", con->oid); > > >+ context->max_hazard = PROPARALLEL_UNSAFE; > > >+ break; > > >+ } > > >.. > > >} > > >index_expr_max_parallel_hazard_for_modify() > > >{ > > >.. > > >+ if (index_expr_item == NULL) /* shouldn't happen */ > > >+ { > > >+ index_close(index_rel, lockmode); > > >+ context->max_hazard = PROPARALLEL_UNSAFE; > > >+ return context->max_hazard; > > >+ } > > >.. > > >} > > > > >It is not clear why the above two are shouldn't happen cases and if so > > >why we want to treat them as unsafe. Ideally, there should be an > > >Assert if these can't happen but it is difficult to decide without > > >knowing why you have considered them unsafe? > > > > The checks being done here for "should never happen" cases are THE > > SAME as other parts of the Postgres code. > > For example, search Postgres code for "null conbin" and you'll find 6 > > other places in the Postgres code which actually ERROR out if conbin > > (binary representation of the constraint) in a pg_constraint tuple is > > found to be null. > > The cases that you point out in the patch used to also error out in > > the same way, but Vignesh suggested changing them to just return > > parallel-unsafe instead of erroring-out, which I agree with. > > > > You have not raised a WARNING for the second case. The same checks in current Postgres code also don't raise a WARNING for that case, so I'm just being consistent with existing Postgres code (which itself isn't consistent for those two cases). >But in the first > place what is the reasoning for making this different from other parts > of code? If we don't have a solid reason then I suggest keeping these > checks and errors the same as in other parts of the code. > The checks are the same as done in existing Postgres source - but instead of failing with an ERROR (i.e. 
whole backend dies), in the middle of parallel-safety-checking, it has been changed to regard the operation as parallel-unsafe, so that it will try to execute in non-parallel mode, where it will most likely fail too when those corrupted attributes are accessed - but it will fail in the way that it currently does in Postgres, should that very rare condition ever happen. This was suggested by Vignesh, and I agree with him. So in effect, it's just allowing it to use the existing error paths in the code, rather than introducing new ERROR points. Regards, Greg Nancarrow Fujitsu Australia
Re: Allow matching whole DN from a client certificate
I've circled back to this and tested/read it more, and I'm still of the
opinion that it's a good feature addition. A few small comments on the patch:

+        is the default, the username is matched against the certificate's

In the docs we use "user name" instead of "username" in descriptive text.
There are a couple of "username" in this added paragraph.

+        This option is probably best used in comjunction with a username map.

Typo: s/comjunction/conjunction/

+	/* use commas instead of slashes */
+	X509_NAME_print_ex(bio, x509name, 0, XN_FLAG_SEP_COMMA_PLUS);

I don't have strong opinions on whether we should use slashes or commas, but I
think it needs to be documented that commas are required since slashes is the
more common way to print a DN. pg_stat_ssl and sslinfo are also displaying the
DN with slashes.

 	/* Make sure we have received a username in the certificate */
-	if (port->peer_cn == NULL ||
-		strlen(port->peer_cn) <= 0)
+	peer_username = port->hba->clientcertname == clientCertCN ? port->peer_cn : port->peer_dn;

Nitpickering nitpickery perhaps, but when we can inspect the DN and not just
the CN it seems a bit odd to talk about "username", which is again echoed in
the error message just below here. Not sure what would be better though, since
"user information" doesn't really convey enough detail/meaning.

+	/* and extract the Common Name / Distinguished Name from it. */

Not introduced in this patch, but single-line comments should not be
punctuated IIRC.

The patch still applies and tests pass; I'm moving this to Ready for Committer.

cheers ./daniel
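As a usage illustration for other reviewers, configuration would go along these lines (a sketch only — I'm assuming the hba option is spelled clientname=DN; check the patch's documentation for the exact spelling, and the DN, map name and role below are placeholders):

# pg_hba.conf: authenticate with a client certificate, matching the full DN
hostssl all             all             0.0.0.0/0            cert clientname=DN map=dn_map

# pg_ident.conf: the DN arrives comma-separated (XN_FLAG_SEP_COMMA_PLUS as
# discussed above), so match it in that form
dn_map  "/^CN=alice,OU=engineering,O=example$"  alice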
search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi hackers, While working with cursors that reference plans with CustomScanStates nodes, I encountered a segfault which originates from search_plan_tree(). The query plan is the result of a simple SELECT statement into which I inject a Custom Scan node at the root to do some post-processing before returning rows. This plan is referenced by a second plan with a Tid Scan which originates from a query of the form DELETE FROM foo WHERE CURRENT OF my_cursor; search_plan_tree() assumes that CustomScanState::ScanState::ss_currentRelation is never NULL. In my understanding that only holds for CustomScanState nodes which are at the bottom of the plan and actually read from a relation. CustomScanState nodes which are not at the bottom don't have ss_currentRelation set. I believe for such nodes, instead search_plan_tree() should recurse into CustomScanState::custom_ps. I attached a patch. Any thoughts? Best regards, David Swarm64 diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c index f89319fcd8..0d5f09402b 100644 --- a/src/backend/executor/execCurrent.c +++ b/src/backend/executor/execCurrent.c @@ -326,7 +326,6 @@ search_plan_tree(PlanState *node, Oid table_oid, case T_BitmapHeapScanState: case T_TidScanState: case T_ForeignScanState: - case T_CustomScanState: { ScanState *sstate = (ScanState *) node; @@ -335,6 +334,39 @@ search_plan_tree(PlanState *node, Oid table_oid, break; } + + /* +* Custom scan nodes can be leaf nodes or inner nodes and therfore need special treatment. +*/ + case T_CustomScanState: + { + CustomScanState *css = castNode(CustomScanState, node); + ScanState *sstate = (ScanState *) node; + + if (sstate->ss_currentRelation == NULL) /* inner node */ + { + ListCell *lc; + + foreach (lc, sstate->custom_ps) + { + ScanState *elem = search_plan_tree((PlanState *)lfirst(lc), table_oid, pending_rescan); + + if (!elem) + continue; + if (result) + return NULL;/* multiple matches */ + result = elem; + } + } + else /* leaf node */ + { + if (RelationGetRelid(sstate->ss_currentRelation) == table_oid) + result = sstate; + } + + break; + } + /* * For Append, we must look through the members; watch out for * multiple matches (possible if it was from UNION ALL)
Re: Single transaction in the tablesync worker?
Hi Amit. PSA the v16 patch for the Tablesync Solution1. Main differences from v15: + Tablesync cleanups of DropSubscription/AlterSubscription_refresh are re-implemented as as ProcessInterrupts function Features: * The tablesync slot is now permanent instead of temporary. * The tablesync slot name is no longer tied to the Subscription slot name. * The tablesync worker is now allowing multiple tx instead of single tx * A new state (SUBREL_STATE_FINISHEDCOPY) is persisted after a successful copy_table in tablesync's LogicalRepSyncTableStart. * If a re-launched tablesync finds state SUBREL_STATE_FINISHEDCOPY then it will bypass the initial copy_table phase. * Now tablesync sets up replication origin tracking in LogicalRepSyncTableStart (similar as done for the apply worker). The origin is advanced when first created. * Cleanup of tablesync resources: - The tablesync slot cleanup (drop) code is added for process_syncing_tables_for_sync functions. - The tablesync replication origin tracking is cleaned process_syncing_tables_for_apply. - A tablesync function to cleanup its own slot/origin is called from ProcessInterrupt. This is indirectly invoked by DropSubscription/AlterSubscrition when they signal the tablesync worker to stop. * Updates to PG docs. TODO / Known Issues: * Race condition observed in "make check" may be related to this patch. * Add test cases. --- Please also see some test scenario logging which shows the new tablesync cleanup function getting called as results of Drop/AlterSUbscription. --- Kind Regards, Peter Smith. Fujitsu Australia TEST SCENARIO Purpose: To observe that the patch can process cleanups caused by the tablesync kills during DropSubscription Note: The "!!>>" is extra logging added for testing, not a normal part of PG. Steps: 1. CREATE PUBLICATION for some table T1 2. CREATE SUBSCRIPTION for that publication 3. Pause the tablesync worker in CATCHUP state (so there would be a tablesync slot in need of cleanup) 4. Show the slots 5. DROP SUBSCRIPTION // this will signal SIGTERM the tablesync workers 6. Allow the tablesync work to proceed. // See if it gets into the interrupt cleanup function and drops the slots/origin as expected == ## ## No slots to start with ## [postgres@CentOS7-x64 ~]$ psql -d test_pub -c "select * from pg_replication_slots;" slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | conf irmed_flush_lsn | wal_status | safe_wal_size ---++---++--+---+++--+--+-+- ++--- (0 rows) ## ## Normal PUBLICATION of a table ## [postgres@CentOS7-x64 ~]$ psql -d test_pub -c "CREATE PUBLICATION tap_pub FOR TABLE test_tab;" CREATE PUBLICATION ## ## Create subscription, and pause the tablesync process when gets to CATCHUP state ## [postgres@CentOS7-x64 ~]$ psql -d test_sub -p 54321 -c "CREATE SUBSCRIPTION tap_sub CONNECTION 'host=localhost dbname=test_pub application_name=tap_sub' PUBLICATION tap_pub;" 2021-01-18 18:15:55.951 AEDT [2068] LOG: logical decoding found consistent point at 0/1614640 2021-01-18 18:15:55.951 AEDT [2068] DETAIL: There are no running transactions. 
2021-01-18 18:15:55.951 AEDT [2068] STATEMENT: CREATE_REPLICATION_SLOT "tap_sub" LOGICAL pgoutput NOEXPORT_SNAPSHOT NOTICE: created replication slot "tap_sub" on publisher CREATE SUBSCRIPTION 2021-01-18 18:15:55.975 AEDT [2069] LOG: logical replication apply worker for subscription "tap_sub" has started 2021-01-18 18:15:55.976 AEDT [2069] LOG: !!>> The apply worker process has PID = 2069 [postgres@CentOS7-x64 ~]$ 2021-01-18 18:15:55.984 AEDT [2074] LOG: starting logical decoding for slot "tap_sub" 2021-01-18 18:15:55.984 AEDT [2074] DETAIL: Streaming transactions committing after 0/1614678, reading WAL from 0/1614640. 2021-01-18 18:15:55.984 AEDT [2074] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-01-18 18:15:55.984 AEDT [2069] LOG: !!>> apply worker: LogicalRepApplyLoop 2021-01-18 18:15:55.984 AEDT [2074] LOG: logical decoding found consistent point at 0/1614640 2021-01-18 18:15:55.984 AEDT [2074] DETAIL: There are no running transactions. 2021-01-18 18:15:55.984 AEDT [2074] STATEMENT: START_REPLICATION SLOT "tap_sub" LOGICAL 0/0 (proto_version '2', publication_names '"tap_pub"') 2021-01-18 18:15:55.984 AEDT [2069] LOG: !!>> apply worker: called process_syncing_tables 2021-01-18 18:15:55.988 AEDT [2077] LOG: logical replication table synchronization worker for subscription "tap_sub", table "test_tab" has started 2021-01-18 18:15:55.988 AEDT [2077] LOG: !!>> The tablesync worker process has PID = 2077 2021-01-18 18:15:55.989 AEDT [2077] LOG: !!>> Sleeping 30 secs. For debugging, attach to process 2077 now!
Re: Deleting older versions in unique indexes to avoid page splits
On Mon, Jan 18, 2021 at 07:44, Amit Kapila wrote:
> The summary of the above is that with Case-1 (clean-up based on hint
> received with the last item on the page) it takes fewer operations to
> cause a page split as compared to Case-2 (clean-up based on hint
> received with at least of the items on the page) for a mixed workload.
> How can we say that it doesn't matter?

I cannot understand this, perhaps there's a word missing in the brackets?..

Thinking of your proposal to undertake bottom-up deletion also on the last-to-fit tuple insertion, I'd like to start with my understanding of the design:

* we try to avoid index bloat, but we do it in the “lazy” way (for a reason, see below)

* it means that if there is enough space on the leaf page to fit the new index tuple, we just fit it there

* if there's not enough space, we first look at the presence of LP_DEAD tuples, and if they do exist, we scan the whole (index) page to re-check all tuples in order to find other, not-yet-marked-but-actually-being-LP_DEAD tuples and clean all those up.

* if there's still not enough space, only now do we try bottom-up deletion. This is a heavy operation and can cause extra IO (tests do show this), therefore this operation is undertaken at the point when we can justify extra IO against a leaf page split.

* if there's no space even after bottom-up deletion, we perform deduplication (if possible)

* finally, we split the page.

Should we try a bottom-up-deletion pass in a situation where we're inserting the last possible tuple into the page (enough space for the *current* insert, but likely no space for the next), then (among others) the following possibilities exist:

- no new tuple ever comes to this page anymore (happy accident), which means that we've wasted IO cycles

- we undertake a bottom-up-deletion pass without success and we're later asked to insert a new tuple into this fully packed index page. This can unroll to:
  - due to the delay, we've managed to find some space either due to LP_DEAD or bottom-up cleanup, which means we've wasted IO cycles on the previous iteration (too early an attempt);
  - we still failed to find any space and are forced to split the page; in this case we've wasted IO cycles twice.

In my view these cases where we generated wasted IO cycles (producing no benefit) should be avoided. And this is the main reason for the current approach.

Again, this is my understanding and I hope I got it right.

--
Victor Yegorov
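For readers skimming the thread, the ordering described above can be condensed into a toy decision function. This is only a model of the flow as summarized in this message — every name below is invented for illustration, and it is not the actual nbtree code:

#include <stdbool.h>

typedef enum
{
	ACTION_INSERT,				/* enough free space, just insert */
	ACTION_SIMPLE_DELETE,		/* remove already-marked LP_DEAD items */
	ACTION_BOTTOM_UP_DELETE,	/* heap-visiting bottom-up deletion (extra IO) */
	ACTION_DEDUPLICATE,			/* merge duplicate index tuples */
	ACTION_SPLIT				/* last resort: split the leaf page */
} leaf_action;

/*
 * Decide the next step for a leaf page that must accept a new tuple.
 * Each "tried_*" flag means the corresponding step already ran without
 * freeing enough space.  (The real code additionally gates the
 * bottom-up pass on the executor's "index unchanged" hint.)
 */
leaf_action
next_leaf_action(bool fits, bool has_lp_dead,
				 bool tried_simple, bool tried_bottom_up, bool tried_dedup)
{
	if (fits)
		return ACTION_INSERT;
	if (has_lp_dead && !tried_simple)
		return ACTION_SIMPLE_DELETE;
	if (!tried_bottom_up)
		return ACTION_BOTTOM_UP_DELETE;
	if (!tried_dedup)
		return ACTION_DEDUPLICATE;
	return ACTION_SPLIT;
}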
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
On Mon, Jan 18, 2021 at 4:13 PM David Geier wrote:
>
> Hi hackers,
>
> While working with cursors that reference plans with CustomScanStates
> nodes, I encountered a segfault which originates from
> search_plan_tree(). The query plan is the result of a simple SELECT
> statement into which I inject a Custom Scan node at the root to do some
> post-processing before returning rows. This plan is referenced by a
> second plan with a Tid Scan which originates from a query of the form
> DELETE FROM foo WHERE CURRENT OF my_cursor;
>
> search_plan_tree() assumes that
> CustomScanState::ScanState::ss_currentRelation is never NULL. In my
> understanding that only holds for CustomScanState nodes which are at the
> bottom of the plan and actually read from a relation. CustomScanState
> nodes which are not at the bottom don't have ss_currentRelation set. I
> believe for such nodes, instead search_plan_tree() should recurse into
> CustomScanState::custom_ps.
>
> I attached a patch. Any thoughts?

I don't have any comments about your patch as such, but ForeignScan is similar to CustomScan. ForeignScan also can leave ss_currentRelation NULL if it represents a join between two foreign tables. So either ForeignScan has the same problem as CustomScan (it's just above the CustomScan case in search_plan_tree()) or it's handling it in some other way. In the first case we may want to fix that too in the same manner (not necessarily in the same patch), and in the latter case CustomScan can handle it the same way. That said, I didn't notice any field in ForeignScan which is parallel to custom_ps, so what you are proposing is still needed.

--
Best Wishes,
Ashutosh Bapat
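To make the suggested handling concrete, here is a rough sketch (not the attached patch) of recursing into custom_ps when ss_currentRelation is NULL. The helper name is invented, the call to the existing search_plan_tree() helper in execCurrent.c is shown with an abbreviated argument list (the real function threads through more state), and, like the existing multi-child node handling there, it refuses to return an ambiguous answer when more than one child matches:

/*
 * Sketch only: look for a matching leaf scan underneath a
 * CustomScanState that does not itself scan a relation.
 */
static ScanState *
search_custom_children(CustomScanState *css, Oid table_oid)
{
	ScanState  *result = NULL;
	ListCell   *lc;

	foreach(lc, css->custom_ps)
	{
		ScanState  *elem = search_plan_tree((PlanState *) lfirst(lc),
											table_oid);

		if (!elem)
			continue;
		if (result)
			return NULL;		/* multiple matches: not unambiguous */
		result = elem;
	}
	return result;
}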
Re: [PATCH] Add extra statistics to explain for Nested Loop
> New version of this patch prints extra statistics for all cases of > multiple loops, not only for Nested Loop. Also I fixed the example by > adding VERBOSE. > > Please don't hesitate to share any thoughts on this topic! Thanks a lot for working on this! I really like the extra details, and including it only with VERBOSE sounds good. > rows * loops is still an important calculation. > > Why not just add total_rows while we are at it - last in the listing? > > (actual rows=N loops=N min_rows=N max_rows=N total_rows=N) This total_rows idea from David would really help us too, especially in the cases where the actual rows is rounded down to zero. We make an explain visualisation tool, and it'd be nice to show people a better total than loops * actual rows. It would also help the accuracy of some of our tips, that use this number. Apologies if this input is too late to be helpful. Cheers, Michael
Re: Narrow the scope of the variable outputstr in logicalrep_write_tuple
On Mon, Jan 18, 2021 at 1:39 PM japin wrote: > > > On Mon, 18 Jan 2021 at 15:59, Bharath Rupireddy > wrote: > > On Mon, Jan 18, 2021 at 1:16 PM japin wrote: > >> > >> > >> Hi, > >> > >> I find that the outputstr variable in logicalrep_write_tuple() only use in > >> `else` branch, I think we can narrow the scope, just like variable > >> outputbytes > >> in `if` branch (for more readable). > >> > >> /* > >> * Send in binary if requested and type has suitable send function. > >> */ > >> if (binary && OidIsValid(typclass->typsend)) > >> { > >> bytea *outputbytes; > >> int len; > >> > >> pq_sendbyte(out, LOGICALREP_COLUMN_BINARY); > >> outputbytes = OidSendFunctionCall(typclass->typsend, > >> values[i]); > >> len = VARSIZE(outputbytes) - VARHDRSZ; > >> pq_sendint(out, len, 4);/* length */ > >> pq_sendbytes(out, VARDATA(outputbytes), len); /* data */ > >> pfree(outputbytes); > >> } > >> else > >> { > >> pq_sendbyte(out, LOGICALREP_COLUMN_TEXT); > >> outputstr = OidOutputFunctionCall(typclass->typoutput, > >> values[i]); > >> pq_sendcountedtext(out, outputstr, strlen(outputstr), false); > >> pfree(outputstr); > >> } > >> > >> Attached is a samll patch to fix it. > > > > +1. Binary mode uses block level variable outputbytes, so making > > outputstr block level is fine IMO. > > > > Patch basically looks good to me, but it doesn't apply on my system. > > Looks like it's not created with git commit. Please create the patch > > with git commit command. > > > > git apply > > /mnt/hgfs/Shared/narrow-the-scope-of-the-variable-in-logicalrep_write_tuple.patch > > error: corrupt patch at line 10 > > > > Thanks for reviewing! Attached v2 as you suggested. Thanks. v2 patch LGTM. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
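For readers following along, the change under review presumably boils down to declaring the variable in the only branch that uses it, mirroring outputbytes in the binary branch:

		else
		{
			char	   *outputstr;	/* now block-local, like outputbytes above */

			pq_sendbyte(out, LOGICALREP_COLUMN_TEXT);
			outputstr = OidOutputFunctionCall(typclass->typoutput, values[i]);
			pq_sendcountedtext(out, outputstr, strlen(outputstr), false);
			pfree(outputstr);
		}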
Re: ResourceOwner refactoring
On 18/01/2021 09:49, kuroda.hay...@fujitsu.com wrote:
Dear Heikki,
I apologize for sending again. I will check another ResourceOwnerEnlarge() if I have a time.
I checked all ResourceOwnerEnlarge() types after applying patch 0001. (searched by "grep -rI ResourceOwnerEnlarge") No problem was found.

Great, thanks!

Here are more details on the performance testing I did:

Unpatched
---------

postgres=# \i snaptest.sql
 numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
       0 |        1 |         38.2 |         34.8
       0 |        5 |         35.7 |         35.4
       0 |       10 |         37.6 |         37.6
       0 |       60 |         35.4 |         39.2
       0 |       70 |         55.0 |         53.8
       0 |      100 |         48.6 |         48.9
       0 |     1000 |         54.8 |         57.0
       0 |    10000 |         63.9 |         67.1
       9 |       10 |         39.3 |         38.8
       9 |      100 |         44.0 |         42.5
       9 |     1000 |         45.8 |         47.1
       9 |    10000 |         64.2 |         66.8
      65 |       70 |         45.7 |         43.7
      65 |      100 |         42.9 |         43.7
      65 |     1000 |         46.9 |         45.7
      65 |    10000 |         65.0 |         64.5
(16 rows)

Patched
-------

postgres=# \i snaptest.sql
 numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
       0 |        1 |         35.3 |         33.8
       0 |        5 |         34.8 |         34.6
       0 |       10 |         49.8 |         51.4
       0 |       60 |         53.1 |         57.2
       0 |       70 |         53.2 |         59.6
       0 |      100 |         53.1 |         58.2
       0 |     1000 |         63.1 |         69.7
       0 |    10000 |         83.3 |         83.5
       9 |       10 |         35.0 |         35.1
       9 |      100 |         55.4 |         54.1
       9 |     1000 |         56.8 |         60.3
       9 |    10000 |         88.6 |         82.0
      65 |       70 |         36.4 |         35.1
      65 |      100 |         52.4 |         53.0
      65 |     1000 |         55.8 |         59.4
      65 |    10000 |         79.3 |         80.3
(16 rows)

About the test: Each test calls Register/UnregisterSnapshot in a loop. numsnaps is the number of snapshots that are registered in each iteration. For example, on the second line with numkeep=0 and numsnaps=5, each iteration in the LIFO test does essentially:

rs[0] = RegisterSnapshot()
rs[1] = RegisterSnapshot()
rs[2] = RegisterSnapshot()
rs[3] = RegisterSnapshot()
rs[4] = RegisterSnapshot()
UnregisterSnapshot(rs[4]);
UnregisterSnapshot(rs[3]);
UnregisterSnapshot(rs[2]);
UnregisterSnapshot(rs[1]);
UnregisterSnapshot(rs[0]);

In the FIFO tests, the Unregister calls are made in reverse order.

When numkeep is non-zero, that many snapshots are registered once at the beginning of the test, and released only after the test loop. So this simulates the scenario that you remember a bunch of resources and hold them for a long time, and while holding them, you do many more remember/forget calls.

Based on this test, this patch makes things slightly slower overall. However, *not* in the cases that matter the most, I believe. Those are the cases where (numsnaps - numkeep) is small, so that the hot action happens in the array and the hashing is not required. In my testing earlier, about 95% of the ResourceOwnerRemember calls in the regression tests fall into that category.

There are a few simple things we could do to speed this up, if needed. A different hash function might be cheaper, for example. And we could inline the fast paths of ResourceOwnerEnlarge() and ResourceOwnerRemember() into the callers. But I didn't do those things as part of this patch, because it wouldn't be a fair comparison; we could do those with the old implementation too. And unless we really need the speed, it's more readable this way.

I'm OK with these results. Any objections?

Attached are the patches again. Same as previous version, except for a couple of little comment changes. And a third patch containing the source for the performance test.

- Heikki

>From ca5ef92d30d6d2d77e6758da3ae60d075451d5c1 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 13 Jan 2021 12:21:28 +0200
Subject: [PATCH v5 1/2] Move a few ResourceOwnerEnlarge() calls for safety and clarity.

These are functions where quite a lot of things happen between the ResourceOwnerEnlarge and ResourceOwnerRemember calls.
It's important that there are no unrelated ResourceOwnerRemember() calls in the code in between, otherwise the entry reserved by the ResourceOwnerEnlarge() call might be used up by the intervening ResourceOwnerRemember() and not be available at the intended ResourceOwnerRemember() call anymore. The longer the code path between them is, the harder it is to verify that. In bu
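To illustrate the pairing rule that the commit message describes, the canonical usage pattern looks roughly like the fragment below. The generic ResourceOwnerEnlarge()/ResourceOwnerRemember() API shown here is the one proposed by this patch series (the committed form may differ), and AcquireSomething() plus something_resowner_funcs are placeholders:

	/*
	 * Reserve an array slot *before* acquiring the resource, so that
	 * remembering it afterwards cannot fail and leak the resource.
	 */
	ResourceOwnerEnlarge(CurrentResourceOwner);

	thing = AcquireSomething();		/* placeholder for the real acquisition */

	/* no unrelated ResourceOwnerRemember() calls may happen in between */
	ResourceOwnerRemember(CurrentResourceOwner,
						  PointerGetDatum(thing),
						  &something_resowner_funcs);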
Re: Yet another fast GiST build
On 18/01/2021 01:10, Peter Geoghegan wrote: On Sun, Jan 17, 2021 at 3:04 PM Peter Geoghegan wrote: I personally agree with you - it's not like there aren't other ways for superusers to crash the server (most of which seem very similar to this gist_page_items() issue, in fact). I just think that it's worth being clear about that being a trade-off that we've accepted. Can we rename gist_page_items_bytea() to gist_page_items(), and at the same time rename the current gist_page_items() -- perhaps call it gist_page_items_output()? That way we could add a bt_page_items_output() function later, while leaving everything consistent (actually not quite, since bt_page_items() outputs text instead of bytea -- but that seems worth fixing too). This also has the merit of making the unsafe "output" variant into the special case. bt_page_items() and bt_page_items_bytea() exist already. And brin_page_items() also calls the output functions (there's no bytea version of that). Perhaps it would've been better to make the bytea-variants the default, but I'm afraid that ship has already sailed. We're not terribly consistent; heap_page_items(), hash_page_items() and gin_page_items() don't attempt to call the output functions. Then again, I don't think we need to worry much about backwards compatibility in pageinspect, so I guess we could rename them all. It doesn't bother me enough to make me do it, but I won't object if you want to. - Heikki
simplifying foreign key/RI checks
While discussing the topic of foreign key performance off-list with Robert and Corey (also came up briefly on the list recently [1], [2]), a few ideas were thrown around to simplify our current system of RI checks to enforce foreign keys with the aim of reducing some of its overheads. The two main aspects of how we do these checks that seemingly cause the most overhead are: * Using row-level triggers that are fired during the modification of the referencing and the referenced relations to perform them * Using plain SQL queries issued over SPI There is a discussion nearby titled "More efficient RI checks - take 2" [2] to address this problem from the viewpoint that it is using row-level triggers that causes the most overhead, although there are some posts mentioning that SQL-over-SPI is not without blame here. I decided to focus on the latter aspect and tried reimplementing some checks such that SPI can be skipped altogether. I started with the check that's performed when inserting into or updating the referencing table to confirm that the new row points to a valid row in the referenced relation. The corresponding SQL is this: SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x $1 is the value of the foreign key of the new row. If the query returns a row, all good. Thanks to SPI, or its use of plan caching, the query is re-planned only a handful of times before making a generic plan that is then saved and reused, which looks like this: QUERY PLAN -- LockRows -> Index Scan using pk_pkey on pk x Index Cond: (a = $1) (3 rows) So in most cases, the trigger's function would only execute the plan that's already there, at least in a given session. That's good but what we realized would be even better is if we didn't have to "execute" a full-fledged "plan" for this, that is, to simply find out whether a row containing the key we're looking for exists in the referenced relation and if found lock it. Directly scanning the index and locking it directly with table_tuple_lock() like ExecLockRows() does gives us exactly that behavior, which seems simple enough to be done in a not-so-long local function in ri_trigger.c. I gave that a try and came up with the attached. It also takes care of the case where the referenced relation is partitioned in which case its appropriate leaf partition's index is scanned. The patch results in ~2x improvement in the performance of inserts and updates on referencing tables: create table p (a numeric primary key); insert into p select generate_series(1, 100); create table f (a bigint references p); -- unpatched insert into f select generate_series(1, 200, 2); INSERT 0 100 Time: 6340.733 ms (00:06.341) update f set a = a + 1; UPDATE 100 Time: 7490.906 ms (00:07.491) -- patched insert into f select generate_series(1, 200, 2); INSERT 0 100 Time: 3340.808 ms (00:03.341) update f set a = a + 1; UPDATE 100 Time: 4178.171 ms (00:04.178) The improvement is even more dramatic when the referenced table (that we're no longer querying over SPI) is partitioned. Here are the numbers when the PK relation has 1000 hash partitions. Unpatched: insert into f select generate_series(1, 200, 2); INSERT 0 100 Time: 35898.783 ms (00:35.899) update f set a = a + 1; UPDATE 100 Time: 37736.294 ms (00:37.736) Patched: insert into f select generate_series(1, 200, 2); INSERT 0 100 Time: 5633.377 ms (00:05.633) update f set a = a + 1; UPDATE 100 Time: 6345.029 ms (00:06.345) That's over ~5x improvement! 
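To make the idea concrete, here is a heavily simplified sketch (not the attached patch) of the probe-and-lock that replaces the SELECT ... FOR KEY SHARE query. The function name is invented; a single-column key is assumed, and coercion of the FK value to the PK column's type, TM_Updated/recheck handling, snapshot details and partition routing are all omitted:

/*
 * Probe the PK index for 'fkval' and take a KEY SHARE lock on the
 * matching row, roughly what the SPI query does today.
 */
static bool
find_and_lock_pk_row(Relation pkrel, Relation idxrel,
					 Datum fkval, Oid eqop, Snapshot snapshot)
{
	ScanKeyData skey;
	IndexScanDesc scan;
	TupleTableSlot *slot = table_slot_create(pkrel, NULL);
	bool		found = false;

	ScanKeyInit(&skey, 1, BTEqualStrategyNumber, get_opcode(eqop), fkval);

	scan = index_beginscan(pkrel, idxrel, snapshot, 1, 0);
	index_rescan(scan, &skey, 1, NULL, 0);

	if (index_getnext_slot(scan, ForwardScanDirection, slot))
	{
		TM_FailureData tmfd;
		TM_Result	res;

		res = table_tuple_lock(pkrel, &slot->tts_tid, snapshot, slot,
							   GetCurrentCommandId(false),
							   LockTupleKeyShare, LockWaitBlock,
							   0, &tmfd);
		found = (res == TM_Ok);
	}

	index_endscan(scan);
	ExecDropSingleTupleTableSlot(slot);
	return found;
}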
While the above case seemed straightforward enough for skipping SPI, it seems a bit hard to do the same for other cases where we query the *referencing* relation during an operation on the referenced table (for example, checking if the row being deleted is still referenced), because the plan in those cases is not predictably an index scan. Also, the filters in those queries are more than likely to not match the partition key of a partitioned referencing relation, so all partitions will have to be scanned. I have left those cases as future work.

The patch seems simple enough to consider for inclusion in v14 unless of course we stumble into some dealbreaker(s). I will add this to the March CF.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1] https://www.postgresql.org/message-id/CADkLM%3DcTt_8Fg1Jtij5j%2BQEBOxz9Cuu4DiMDYOwdtktDAKzuLw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/1813.1586363881%40antos

v1-0001-Export-get_partition_for_tuple.patch
Description: Binary data

v1-0002-Avoid-using-SPI-for-some-RI-checks.patch
Description: Binary data
Re: Deleting older versions in unique indexes to avoid page splits
On Mon, Jan 18, 2021 at 5:11 PM Victor Yegorov wrote: > > пн, 18 янв. 2021 г. в 07:44, Amit Kapila : >> >> The summary of the above is that with Case-1 (clean-up based on hint >> received with the last item on the page) it takes fewer operations to >> cause a page split as compared to Case-2 (clean-up based on hint >> received with at least of the items on the page) for a mixed workload. >> How can we say that it doesn't matter? > > > I cannot understand this, perhaps there's a word missing in the brackets?.. > There is a missing word (one) in Case-2, let me write it again to avoid any confusion. Case-2 (clean-up based on hint received with at least one of the items on the page). > > Thinking of your proposal to undertake bottom-up deletion also on the > last-to-fit tuple insertion, > I think there is some misunderstanding in what I am trying to say and your conclusion of the same. See below. > I'd like to start with my understanding of the design: > > * we try to avoid index bloat, but we do it in the “lazy” way (for a reason, > see below) > > * it means, that if there is enough space on the leaf page to fit new index > tuple, > we just fit it there > > * if there's not enough space, we first look at the presence of LP_DEAD > tuples, > and if they do exits, we scan the whole (index) page to re-check all tuples > in order > to find others, not-yet-marked-but-actually-being-LP_DEAD tuples and clean > all those up. > > * if still not enough space, only now we try bottom-up-deletion. This is > heavy operation and > can cause extra IO (tests do show this), therefore this operation is > undertaken at the point, > when we can justify extra IO against leaf page split. > > * if no space also after bottom-up-deletion, we perform deduplication (if > possible) > > * finally, we split the page. > > Should we try bottom-up-deletion pass in a situation where we're inserting > the last possible tuple > into the page (enough space for *current* insert, but likely no space for the > next), > I am saying that we try bottom-up deletion when the new insert item didn't find enough space on the page and there was previously some indexUnchanged tuple(s) inserted into that page. Of course, like now it will be attempted after an attempt to remove LP_DEAD items. > then (among others) exists the following possibilities: > > - no new tuples ever comes to this page anymore (happy accident), which means > that > we've wasted IO cycles > > - we undertake bottom-up-deletion pass without success and we're asked to > insert new tuple in this > fully packed index page. This can unroll to: > - due to the delay, we've managed to find some space either due to LP_DEAD > or bottom-up-cleanup. > which means we've wasted IO cycles on the previous iteration (too early > attempt). > - we still failed to find any space and are forced to split the page. > in this case we've wasted IO cycles twice. > > In my view these cases when we generated wasted IO cycles (producing no > benefit) should be avoided. > I don't think any of these can happen in what I am actually saying. Do you still have the same feeling after reading this email? Off-hand, I don't see any downsides as compared to the current approach and it will have fewer splits in some other workloads like when there is a mix of inserts and non-HOT updates (that doesn't logically change the index). -- With Regards, Amit Kapila.
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/18 15:37, Bharath Rupireddy wrote: On Mon, Jan 18, 2021 at 11:58 AM Fujii Masao wrote: On 2021/01/18 15:02, Hou, Zhijie wrote: We need to create the loopback3 with user mapping public, otherwise the test might become unstable as shown below. Note that loopback and loopback2 are not dropped in the test, so no problem with them. ALTER SERVER loopback OPTIONS (ADD use_remote_estimate 'off'); DROP SERVER loopback3 CASCADE; NOTICE: drop cascades to 2 other objects -DETAIL: drop cascades to user mapping for postgres on server loopback3 +DETAIL: drop cascades to user mapping for bharath on server loopback3 Attaching v2 patch for postgres_fdw_get_connections. Please have a look. Hi I have a comment for the doc about postgres_fdw_get_connections. +postgres_fdw_get_connections(OUT server_name text, OUT valid boolean) returns setof record + + + This function returns the foreign server names of all the open + connections that postgres_fdw established from + the local session to the foreign servers. It also returns whether + each connection is valid or not. false is returned + if the foreign server connection is used in the current local + transaction but its foreign server or user mapping is changed or + dropped, and then such invalid connection will be closed at + the end of that transaction. true is returned + otherwise. If there are no open connections, no record is returned. + Example usage of the function: The doc seems does not memtion the case when the function returns NULL in server_name. Users may be a little confused about why NULL was returned. Yes, so what about adding (Note that the returned server name of invalid connection is NULL if its server is dropped) into the following (just after "dropped")? + if the foreign server connection is used in the current local + transaction but its foreign server or user mapping is changed or + dropped Or better description? +1 to add it after "dropped (Note )", how about as follows with slight changes? dropped (Note that server name of an invalid connection can be NULL if the server is dropped), and then such . Yes, I like this one. One question is; "should" or "is" is better than "can" in this case because the server name of invalid connection is always NULL when its server is dropped? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: BEFORE ROW triggers for partitioned tables
Hi committer,

Many of our customers expect to use BR triggers on partitioned tables. After I followed your discussion, I also checked your patch. Here are two questions that confuse me:

1. Your modification removes the check for BR triggers against partitioned tables, and a more friendly error message is added to ExecInsert and ExecUpdate. You are correct that ExecInsert does not reroute partitions. However, when ExecUpdate modifies partition keys, it is almost equivalent to ExecDelete plus ExecInsert, and it is re-routed (ExecPrepareTupleRouting) once before ExecInsert. Therefore, why should an error be thrown in ExecUpdate? Let's look at a case:

```
postgres=# create table parted (a int, b int, c text) partition by list (a);
CREATE TABLE
postgres=# create table parted_1 partition of parted for values in (1);
CREATE TABLE
postgres=# create table parted_2 partition of parted for values in (2);
CREATE TABLE
postgres=# create function parted_trigfunc() returns trigger language plpgsql as $$
begin
  new.a = new.a + 1;
  return new;
end;
$$;
CREATE FUNCTION
postgres=# insert into parted values (1, 1, 'uno uno v1');
INSERT 0 1
postgres=# create trigger t before update on parted for each row execute function parted_trigfunc();
CREATE TRIGGER
postgres=# update parted set c = c || 'v3';
```

If there were no check in ExecUpdate, the above update SQL would execute successfully. However, with your code, this will fail. So, what is the reason behind this decision?

2. In this patch, you only removed the restriction on BR triggers against the partitioned table, but did not fundamentally solve the problem caused by modifying partition keys in a BR trigger. What are the difficulties in solving this problem fundamentally? We plan to implement it. Can you give us some suggestions?

We look forward to your reply. Thank you very much.

Regards,
Adger

------------------------------------------------------------------
From: Alvaro Herrera
Sent: Monday, January 18, 2021 20:36
To: Ashutosh Bapat
Cc: Ashutosh Bapat; Pg Hackers
Subject: Re: BEFORE ROW triggers for partitioned tables

Thanks for the reviews; I have pushed it now. (I made the doc edits you mentioned too.)

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On Mon, Jan 18, 2021 at 6:17 PM Fujii Masao wrote: > > +1 to add it after "dropped (Note )", how about as follows > > with slight changes? > > > > dropped (Note that server name of an invalid connection can be NULL if > > the server is dropped), and then such . > > Yes, I like this one. One question is; "should" or "is" is better than > "can" in this case because the server name of invalid connection is > always NULL when its server is dropped? I think "dropped (Note that server name of an invalid connection will be NULL if the server is dropped), and then such ." With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: POC: Cleaning up orphaned files using undo logs
Dmitry Dolgov <9erthali...@gmail.com> wrote: > Thanks for the updated patch. As I've mentioned off the list I'm slowly > looking through it with the intent to concentrate on undo progress > tracking. But before I will post anything I want to mention couple of > strange issues I see, otherwise I will forget for sure. Maybe it's > already known, but running several times 'make installcheck' against a > freshly build postgres with the patch applied from time to time I > observe various errors. > > This one happens on a crash recovery, seems like > UndoRecordSetXLogBufData has usr_type = USRT_INVALID and is involved in > the replay process: > > TRAP: FailedAssertion("page_offset + this_page_bytes <= > uph->ud_insertion_point", File: "undopage.c", Line: 300) > postgres: startup recovering > 00010012(ExceptionalCondition+0xa1)[0x558b38b8a350] > postgres: startup recovering > 00010012(UndoPageSkipOverwrite+0x0)[0x558b38761b7e] > postgres: startup recovering > 00010012(UndoReplay+0xa1d)[0x558b38766f32] > postgres: startup recovering > 00010012(XactUndoReplay+0x77)[0x558b38769281] > postgres: startup recovering > 00010012(smgr_redo+0x1af)[0x558b387aa7bd] > > This one is somewhat similar: > > TRAP: FailedAssertion("page_offset >= SizeOfUndoPageHeaderData", File: > "undopage.c", Line: 287) > postgres: undo worker for database 36893 > (ExceptionalCondition+0xa1)[0x5559c90f1350] > postgres: undo worker for database 36893 > (UndoPageOverwrite+0xa6)[0x5559c8cc8ae3] > postgres: undo worker for database 36893 > (UpdateLastAppliedRecord+0xbe)[0x5559c8ccd008] > postgres: undo worker for database 36893 (smgr_undo+0xa6)[0x5559c8d11989] Well, on repeated run of the test I could also hit the first one. I could fix it and will post a new version of the patch (along with some other small changes) this week. > There are also here and there messages about not found undo files: > > ERROR: cannot open undo segment file 'base/undo/08.02': No > such file or directory > WARNING: failed to undo transaction I don't see this one in the log so far, will try again. Thanks for the report! -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: list of extended statistics on psql
On 1/18/21 8:31 AM, Tatsuro Yamada wrote: Hi Tomas and Shinoda-san, On 2021/01/17 23:31, Tomas Vondra wrote: On 1/17/21 3:01 AM, Tomas Vondra wrote: On 1/17/21 2:41 AM, Shinoda, Noriyoshi (PN Japan FSIP) wrote: Hi, hackers. I tested this committed feature. It doesn't seem to be available to non-superusers due to the inability to access pg_statistics_ext_data. Is this the expected behavior? Ugh. I overlooked the test to check the case of the user hasn't Superuser privilege. The user without the privilege was able to access pg_statistics_ext. Therefore I supposed that it's also able to access pg_statics_ext_data. Oops. Hmmm, that's a good point. Bummer we haven't noticed that earlier. I wonder what the right fix should be - presumably we could do something like pg_stats_ext (we can't use that view directly, because it formats the data, so the sizes are different). But should it list just the stats the user has access to, or should it list everything and leave the inaccessible fields NULL? I've reverted the commit - once we find the right way to handle this, I'll get it committed again. As for how to deal with this, I can think of about three ways: 1) simplify the command to only print information from pg_statistic_ext (so on information about which stats are built or sizes) 2) extend pg_stats_ext with necessary information (e.g. sizes) 3) create a new system view, with necessary information (so that pg_stats_ext does not need to be modified) 4) add functions returning the necessary information, possibly only for statistics the user can access (similarly to what pg_stats_ext does) Options 2-4 have the obvious disadvantage that this won't work on older releases (we can't add views or functions there). So I'm leaning towards #1 even if that means we have to remove some of the details. We can consider adding that for new releases, though. Thanks for the useful advice. I go with option 1). The following query is created by using pg_stats_ext instead of pg_statistic_ext and pg_statistic_ext_data. However, I was confused about writing a part of the query for calculating MCV size because there are four columns related to MCV. For example, most_common_vals, most_common_val_nulls, most_common_freqs, and most_common_base_freqs. Currently, I don't know how to calculate the size of MCV by using the four columns. Thoughts? :-) Well, my suggestion was to use pg_statistic_ext, because that lists all statistics, while pg_stats_ext is filtering statistics depending on access privileges. I think that's more appropriate for \dX, the contents should not change depending on the user. Also, let me clarify - with option (1) we'd not show the sizes at all. The size of the formatted statistics may be very different from the on-disk representation, so I see no point in showing it in \dX. We might show other stats (e.g. number of MCV items, or the fraction of data represented by the MCV list), but the user can inspect pg_stats_ext if needed. What we might do is to show those stats when a superuser is running this command, but I'm not sure that's a good idea (or how difficult would it be to implement). regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Added schema level support for publication.
On Sat, Jan 9, 2021 at 5:21 PM vignesh C wrote: > > On Fri, Jan 8, 2021 at 4:32 PM Amit Kapila wrote: > > > > On Thu, Jan 7, 2021 at 10:03 PM vignesh C wrote: > > > > > > This feature adds schema option while creating publication. Users will > > > be able to specify one or more schemas while creating publication, > > > when the user specifies schema option, then the data changes for the > > > tables present in the schema specified by the user will be replicated > > > to the subscriber. Few examples have been listed below: > > > > > > Create a publication that publishes all changes for all the tables > > > present in production schema: > > > CREATE PUBLICATION production_publication FOR ALL TABLES SCHEMA > > > production; > > > > > > Create a publication that publishes all changes for all the tables > > > present in marketing and sales schemas: > > > CREATE PUBLICATION sales_publication FOR ALL TABLES SCHEMA marketing, > > > sales; > > > > > > Add some schemas to the publication: > > > ALTER PUBLICATION sales_publication ADD SCHEMA marketing_june, sales_june; > > > > > > Drop some schema from the publication: > > > ALTER PUBLICATION production_quarterly_publication DROP SCHEMA > > > production_july; > > > > > > Attached is a POC patch for the same. I felt this feature would be quite > > > useful. > > > > > > > What do we do if the user Drops the schema? Do we automatically remove > > it from the publication? > > > I have not yet handled this scenario yet, I will handle this and > adding of tests in the next patch. > I have handled the above scenario(drop schema should automatically remove the schema entry from publication schema relation) & addition of tests in the new v2 patch attached. Thoughts? Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com From d14e47368091d81f22ab11b94ba0716e0d918471 Mon Sep 17 00:00:00 2001 From: Vignesh C Date: Mon, 18 Jan 2021 18:08:51 +0530 Subject: [PATCH v2] Added schema level support for publication. This patch adds schema level support for publication. User can specify multiple schemas with schema option. When user specifies schema option, then the tables present in the schema specified will be selected by publisher for sending the data to subscriber. 
--- doc/src/sgml/ref/alter_publication.sgml | 45 +++- doc/src/sgml/ref/create_publication.sgml | 31 ++- src/backend/catalog/Makefile | 4 +- src/backend/catalog/aclchk.c | 2 + src/backend/catalog/dependency.c | 9 + src/backend/catalog/objectaddress.c | 138 + src/backend/catalog/pg_publication.c | 134 +++- src/backend/commands/alter.c | 1 + src/backend/commands/event_trigger.c | 4 + src/backend/commands/publicationcmds.c | 266 +++- src/backend/commands/seclabel.c | 1 + src/backend/commands/tablecmds.c | 1 + src/backend/parser/gram.y| 76 --- src/backend/utils/cache/syscache.c | 23 +++ src/bin/pg_dump/common.c | 3 + src/bin/pg_dump/pg_backup_archiver.c | 3 +- src/bin/pg_dump/pg_dump.c| 155 +- src/bin/pg_dump/pg_dump.h| 17 ++ src/bin/pg_dump/pg_dump_sort.c | 7 + src/bin/psql/describe.c | 110 +- src/include/catalog/dependency.h | 1 + src/include/catalog/pg_publication.h | 16 +- src/include/catalog/pg_publication_schema.h | 49 + src/include/commands/publicationcmds.h | 1 + src/include/nodes/parsenodes.h | 3 + src/include/utils/syscache.h | 2 + src/test/regress/expected/object_address.out | 6 +- src/test/regress/expected/publication.out| 298 ++- src/test/regress/expected/sanity_check.out | 1 + src/test/regress/sql/object_address.sql | 3 + src/test/regress/sql/publication.sql | 87 +++- 31 files changed, 1401 insertions(+), 96 deletions(-) create mode 100644 src/include/catalog/pg_publication_schema.h diff --git a/doc/src/sgml/ref/alter_publication.sgml b/doc/src/sgml/ref/alter_publication.sgml index faa114b..d6199af 100644 --- a/doc/src/sgml/ref/alter_publication.sgml +++ b/doc/src/sgml/ref/alter_publication.sgml @@ -24,6 +24,9 @@ PostgreSQL documentation ALTER PUBLICATION name ADD TABLE [ ONLY ] table_name [ * ] [, ...] ALTER PUBLICATION name SET TABLE [ ONLY ] table_name [ * ] [, ...] ALTER PUBLICATION name DROP TABLE [ ONLY ] table_name [ * ] [, ...] +ALTER PUBLICATION name ADD SCHEMA schema_name [, ...] +ALTER PUBLICATION name SET SCHEMA schema_name [, ...] +ALTER PUBLICATION name DROP SCHEMA schema_name [, ...] ALTER PUBLICATION name SET ( publication_parameter [= value] [, ... ] ) ALTER PUBLICATION name OWNER TO { new_owner | CURRENT_ROLE | CURRENT_USER | SESSION_USER } ALTER PUBLICATION name RENAME TO new
Re: [PATCH] ProcessInterrupts_hook
On Mon, Jan 18, 2021 at 3:00 AM Craig Ringer wrote: > A few times lately I've been doing things in extensions that've made me want > to be able to run my own code whenever InterruptPending is true and > CHECK_FOR_INTERRUPTS() calls ProcessInterrupts() I've wanted this in the past, too, so +1 from me. > What I really want to go along with this is a way for any backend to observe > the postmaster's pmState and its "Shutdown" variable's value, so any backend > can tell if we're in FastShutdown, SmartShutdown, etc. Copies in shmem only > obviously. But I'm not convinced it's right to just copy these vars as-is to > shmem, and I don't want to use the memory for a ProcSignal slot for something > that won't be relevant for most backends for most of the postmaster lifetime. > Ideas welcomed. I've wanted something along this line, too, but what I was thinking about was more along the lines of having the postmaster signal the backends when a smart shutdown happened. After all when a fast shutdown happens the backends already get told to terminate, and that seems like it ought to be enough: I'm not sure backends have any business caring about why they are being asked to terminate. But they might well want to know whether a smart shutdown is in progress, and right now there's no way for them to know that. -- Robert Haas EDB: http://www.enterprisedb.com
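For context, an extension would presumably use such a hook with the usual chaining pattern, assuming the patch keeps the ProcessUtility-style naming (a ProcessInterrupts_hook_type variable plus a standard_ProcessInterrupts() fallback, both of which exist only in the proposed patch); the flag and log message below are purely illustrative:

/* hook type and standard_ProcessInterrupts() come from the proposed patch */
static ProcessInterrupts_hook_type prev_ProcessInterrupts_hook = NULL;
static volatile sig_atomic_t my_flag_pending = false;	/* illustrative flag */

static void
my_ProcessInterrupts(void)
{
	if (my_flag_pending)
	{
		my_flag_pending = false;
		elog(LOG, "handled bgworker-specific interrupt flag");
	}

	/* then fall through to the normal interrupt handling */
	if (prev_ProcessInterrupts_hook)
		prev_ProcessInterrupts_hook();
	else
		standard_ProcessInterrupts();
}

void
_PG_init(void)
{
	prev_ProcessInterrupts_hook = ProcessInterrupts_hook;
	ProcessInterrupts_hook = my_ProcessInterrupts;
}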
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin wrote: > Does anyone maintain opensource pg_surgery analogs for released versions of > PG? > It seems to me I'll have to use something like this and I just though that I > should consider pg_surgery in favour of our pg_dirty_hands. I do not. I'm still of the opinion that we ought to back-patch pg_surgery. This didn't attract a consensus before, and it's hard to dispute that it's a new feature in what would be a back branch. But it's unclear to me how users are otherwise supposed to recover from some of the bugs that are or have been present in those back branches. I'm not sure that I see the logic in telling people we'll try to prevent them from getting hosed in the future but if they're already hosed they can wait for v14 to fix it. They can't wait that long, and a dump-and-restore cycle is awfully painful. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Deleting older versions in unique indexes to avoid page splits
On Mon, Jan 18, 2021 at 13:42, Amit Kapila wrote:
> I don't think any of these can happen in what I am actually saying. Do
> you still have the same feeling after reading this email? Off-hand, I
> don't see any downsides as compared to the current approach and it
> will have fewer splits in some other workloads like when there is a
> mix of inserts and non-HOT updates (that doesn't logically change the
> index).

If I understand you correctly, you suggest tracking _all_ the hints that came from the executor for possible non-HOT logical duplicates somewhere on the page. And when we hit the no-space case, we'll check not only the last item being hinted, but all items on the page, which makes it more probable to kick in and do something. Sounds interesting.

And judging by Peter's tests of extra LP_DEAD tuples found on the page (almost always more than were actually flagged), this can have some positive effects.

Also, I'm not sure where to put it. We've deprecated the BTP_HAS_GARBAGE flag; maybe it can be reused for this purpose?

--
Victor Yegorov
Re: Transactions involving multiple postgres foreign servers, take 2
Hi, Masahiko-san: bq. How about FdwXactRequestToLaunchResolver()? Sounds good to me. bq. But there is already a function named FdwXactExists() Then we can leave the function name as it is. Cheers On Sun, Jan 17, 2021 at 9:55 PM Masahiko Sawada wrote: > On Fri, Jan 15, 2021 at 7:45 AM Zhihong Yu wrote: > > > > For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch : > > > > + entry->changing_xact_state = true; > > ... > > + entry->changing_xact_state = abort_cleanup_failure; > > > > I don't see return statement in between the two assignments. I wonder > why entry->changing_xact_state is set to true, and later being assigned > again. > > Because postgresRollbackForeignTransaction() can get called again in > case where an error occurred during aborting and cleanup the > transaction. For example, if an error occurred when executing ABORT > TRANSACTION (pgfdw_get_cleanup_result() could emit an ERROR), > postgresRollbackForeignTransaction() will get called again while > entry->changing_xact_state is still true. Then the entry will be > caught by the following condition and cleaned up: > > /* > * If connection is before starting transaction or is already > unsalvageable, > * do only the cleanup and don't touch it further. > */ > if (entry->changing_xact_state) > { > pgfdw_cleanup_after_transaction(entry); > return; > } > > > > > For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch : > > > > bq. This commits introduces to new background processes: foreign > > > > commits introduces to new -> commit introduces two new > > Fixed. > > > > > +FdwXactExistsXid(TransactionId xid) > > > > Since Xid is the parameter to this method, I think the Xid suffix can be > dropped from the method name. > > But there is already a function named FdwXactExists()? > > bool > FdwXactExists(Oid dbid, Oid serverid, Oid userid) > > As far as I read other code, we already have such functions that have > the same functionality but have different arguments. For instance, > SearchSysCacheExists() and SearchSysCacheExistsAttName(). So I think > we can leave as it is but is it better to have like > FdwXactCheckExistence() and FdwXactCheckExistenceByXid()? > > > > > + * Portions Copyright (c) 2020, PostgreSQL Global Development Group > > > > Please correct year in the next patch set. > > Fixed. > > > > > +FdwXactLauncherRequestToLaunch(void) > > > > Since the launcher's job is to 'launch', I think the Launcher can be > omitted from the method name. > > Agreed. How about FdwXactRequestToLaunchResolver()? > > > > > +/* Report shared memory space needed by FdwXactRsoverShmemInit */ > > +Size > > +FdwXactRslvShmemSize(void) > > > > Are both Rsover and Rslv referring to resolver ? It would be better to > use whole word which reduces confusion. > > Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or > FdwXactResolveShmemInit) > > Agreed. I realized that these functions are the launcher's function, > not resolver's. So I'd change to FdwXactLauncherShmemSize() and > FdwXactLauncherShmemInit() respectively. > > > > > +fdwxact_launch_resolver(Oid dbid) > > > > The above method is not in camel case. It would be better if method > names are consistent (in casing). > > Fixed. > > > > > +errmsg("out of foreign transaction resolver slots"), > > +errhint("You might need to increase > max_foreign_transaction_resolvers."))); > > > > It would be nice to include the value of max_foreign_xact_resolvers > > I agree it would be nice but looking at other code we don't include > the value in this kind of messages. 
> > > > > For fdwxact_resolver_onexit(): > > > > + LWLockAcquire(FdwXactLock, LW_EXCLUSIVE); > > + fdwxact->locking_backend = InvalidBackendId; > > + LWLockRelease(FdwXactLock); > > > > There is no call to method inside the for loop which may take time. I > wonder if the lock can be obtained prior to the for loop and released > coming out of the for loop. > > Agreed. > > > > > +FXRslvLoop(void) > > > > Please use Resolver instead of Rslv > > Fixed. > > > > > + FdwXactResolveFdwXacts(held_fdwxacts, nheld); > > > > Fdw and Xact are repeated twice each in the method name. Probably the > method name can be made shorter. > > Fixed. > > Regards, > > -- > Masahiko Sawada > EnterpriseDB: https://www.enterprisedb.com/ >
Re: Additional Chapter for Tutorial - arch-dev.sgml
On 20/11/2020 23:52, Erik Rijkers wrote: (smallish) Changes to arch-dev.sgml This looks good to me. One little complaint: @@ -125,7 +122,7 @@ use a supervisor process (also master process) that spawns a new server process every time a connection is requested. This supervisor -process is called postgres and listens at a +process is called postgres (formerly 'postmaster') and listens at a specified TCP/IP port for incoming connections. Whenever a request for a connection is detected the postgres process spawns a new server process. The server tasks I believe we still call it the postmaster process. We renamed the binary a long time ago (commit 5266f221a2), and the above text was changed as part of that commit. I think that was a mistake, and this should say simply: ... This supervisor process is called postmaster and ... like it did before we renamed the binary. Barring objections, I'll commit this with that change (as attached). - Heikki diff --git a/doc/src/sgml/arch-dev.sgml b/doc/src/sgml/arch-dev.sgml index ade0ad97d8..a27c8477c2 100644 --- a/doc/src/sgml/arch-dev.sgml +++ b/doc/src/sgml/arch-dev.sgml @@ -7,7 +7,7 @@ Author This chapter originated as part of -, Stefan Simkovics' + Stefan Simkovics' Master's Thesis prepared at Vienna University of Technology under the direction of O.Univ.Prof.Dr. Georg Gottlob and Univ.Ass. Mag. Katrin Seyr. @@ -17,10 +17,7 @@ This chapter gives an overview of the internal structure of the backend of PostgreSQL. After having read the following sections you should have an idea of how a query - is processed. This chapter does not aim to provide a detailed - description of the internal operation of - PostgreSQL, as such a document would be - very extensive. Rather, this chapter is intended to help the reader + is processed. This chapter is intended to help the reader understand the general sequence of operations that occur within the backend from the point at which a query is received, to the point at which the results are returned to the client. @@ -30,8 +27,8 @@ The Path of a Query -Here we give a short overview of the stages a query has to pass in -order to obtain a result. +Here we give a short overview of the stages a query has to pass +to obtain a result. @@ -125,9 +122,9 @@ use a supervisor process (also master process) that spawns a new server process every time a connection is requested. This supervisor -process is called postgres and listens at a +process is called postmaster and listens at a specified TCP/IP port for incoming connections. Whenever a request -for a connection is detected the postgres +for a connection is detected the postmaster process spawns a new server process. The server tasks communicate with each other using semaphores and shared memory to ensure data integrity @@ -230,7 +227,7 @@ A detailed description of bison or the grammar rules given in gram.y would be - beyond the scope of this paper. There are many books and + beyond the scope of this manual. There are many books and documents dealing with flex and bison. You should be familiar with bison before you start to study the @@ -343,8 +340,8 @@ In some situations, examining each possible way in which a query - can be executed would take an excessive amount of time and memory - space. In particular, this occurs when executing queries + can be executed would take an excessive amount of time and memory. + In particular, this occurs when executing queries involving large numbers of join operations. 
In order to determine a reasonable (not necessarily optimal) query plan in a reasonable amount of time, PostgreSQL uses a Genetic @@ -411,7 +408,7 @@ merge join: Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form -join rows. This kind of join is more +join rows. This kind of join is attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an @@ -442,7 +439,7 @@ If the query uses fewer than relations, a near-exhaustive search is conducted to find the best join sequence. The planner preferentially considers joins between any - two relations for which there exist a corresponding join clause in the + two relations for which there exists a corresponding join clause in the WHERE qualification (i.e., for which a restriction like where rel1.attr1=rel2.attr2 exists). Join pairs with no join clause are considered only when there
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On Mon, Jan 18, 2021 at 11:44 AM Fujii Masao wrote: > > I will post patches for the other function postgres_fdw_disconnect, > > GUC and server level option later. > > Thanks! Attaching v12 patch set. 0001 is for postgres_fdw_disconnect() function, 0002 is for keep_connections GUC and 0003 is for keep_connection server level option. Please review it further. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v12-0001-postgres_fdw-function-to-discard-cached-connecti.patch Description: Binary data v12-0002-postgres_fdw-add-keep_connections-GUC-to-not-cac.patch Description: Binary data v12-0003-postgres_fdw-server-level-option-keep_connection.patch Description: Binary data
TOAST condition for column size
Hi,

When I created a table consisting of 400 VARCHAR columns and tried to INSERT a record whose column values were all the same size, there were cases where I got an error due to exceeding the size limit per row.

=# -- create a table consisting of 400 VARCHAR columns
=# CREATE TABLE t1 (c1 VARCHAR(100), c2 VARCHAR(100), ... c400 VARCHAR(100));
=# -- insert one record whose column values are all 20 bytes
=# INSERT INTO t1 VALUES (repeat('a', 20), repeat('a', 20), ... repeat('a', 20));
ERROR: row is too big: size 8424, maximum size 8160

What is interesting is that it failed only when the size of each column was 20~23 bytes, as shown below.

size of each column | result
--------------------+--------
 18 bytes           | success
 19 bytes           | success
 20 bytes           | failure
 21 bytes           | failure
 22 bytes           | failure
 23 bytes           | failure
 24 bytes           | success
 25 bytes           | success

When the size of each column was 19 bytes or less, it succeeds because the row size is within the page size. When the size of each column was 24 bytes or more, it also succeeds because columns are TOASTed and the row size is reduced to less than one page size. OTOH, when it's more than 19 bytes and less than 24 bytes, columns aren't TOASTed because it doesn't meet the condition of the following if statement.

--src/backend/access/table/toast_helper.c
toast_tuple_find_biggest_attribute(ToastTupleContext *ttc,
                                   bool for_compression, bool check_main)
...(snip)...
    int32       biggest_size = MAXALIGN(TOAST_POINTER_SIZE);
...(snip)...
        if (ttc->ttc_attr[i].tai_size > biggest_size)   // <- here
        {
            biggest_attno = i;
            biggest_size = ttc->ttc_attr[i].tai_size;
        }

Since TOAST_POINTER_SIZE is 18 bytes but MAXALIGN(TOAST_POINTER_SIZE) is 24 bytes, columns are not TOASTed until their size becomes larger than 24 bytes. I confirmed these sizes in my environment but AFAIU they would be the same in any environment. So, as a result of adjusting the alignment, 20~23 bytes seems to fail.

I wonder if it might be better not to adjust the alignment here, as in the attached patch, because with that change it succeeded in inserting 20~23 byte records. Or are there reasons to apply the alignment here?

I understand that TOAST is not effective for small data and it's not recommended to create a table containing hundreds of columns, but I think cases that can be successful should be successful.

Any thoughts?

Regards,

--
Atsushi Torikoshi

diff --git a/src/backend/access/table/toast_helper.c b/src/backend/access/table/toast_helper.c
index fb36151ce5..e916c0f95c 100644
--- a/src/backend/access/table/toast_helper.c
+++ b/src/backend/access/table/toast_helper.c
@@ -183,7 +183,7 @@ toast_tuple_find_biggest_attribute(ToastTupleContext *ttc,
 	TupleDesc	tupleDesc = ttc->ttc_rel->rd_att;
 	int			numAttrs = tupleDesc->natts;
 	int			biggest_attno = -1;
-	int32		biggest_size = MAXALIGN(TOAST_POINTER_SIZE);
+	int32		biggest_size = TOAST_POINTER_SIZE;
 	int32		skip_colflags = TOASTCOL_IGNORE;
 	int			i;
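As a quick sanity check of the reported number: assuming each 20-byte varchar value is stored with a 1-byte short varlena header, no per-column padding, and a 24-byte heap tuple header (no null bitmap), the arithmetic lands exactly on the size in the error message:

#include <stdio.h>

int
main(void)
{
	int		ncols = 400;
	int		datalen = 20;			/* payload bytes per column */
	int		tuple_header = 24;		/* MAXALIGN'd heap tuple header, no nulls */
	int		row = tuple_header + ncols * (1 + datalen);

	printf("estimated row size: %d bytes\n", row);	/* prints 8424 */
	return 0;
}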
Re: proposal: schema variables
On 2021-01-18 10:59, Pavel Stehule wrote: and here is the patch [schema-variables-20200118.patch.gz ] One small thing: The drop variable synopsis is: DROP VARIABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ] In that text following it, 'RESTRICT' is not documented. When used it does not give an error but I don't see how it 'works'. Erik
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Mon, Jan 18, 2021 at 08:54:10AM -0500, Robert Haas wrote:
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin wrote:
> > Does anyone maintain opensource pg_surgery analogs for released
> > versions of PG? It seems to me I'll have to use something like this
> > and I just though that I should consider pg_surgery in favour of our
> > pg_dirty_hands.
>
> I do not. I'm still of the opinion that we ought to back-patch
> pg_surgery. This didn't attract a consensus before, and it's hard to
> dispute that it's a new feature in what would be a back branch. But
> it's unclear to me how users are otherwise supposed to recover from
> some of the bugs that are or have been present in those back branches.

One other possibility would be to push a version of pg_surgery that is compatible with the back-branches somewhere external (e.g. either git.postgresql.org and/or Github), so that it can be picked up by distributions and/or individual users in need.

That is assuming it does not need assorted server changes to go with; I did not read the thread in detail but I was under the assumption it is a client program?


Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following provisions: https://www.credativ.de/datenschutz
Re: ResourceOwner refactoring
On 2021-Jan-18, Heikki Linnakangas wrote: > +static ResourceOwnerFuncs jit_funcs = > +{ > + /* relcache references */ > + .name = "LLVM JIT context", > + .phase = RESOURCE_RELEASE_BEFORE_LOCKS, > + .ReleaseResource = ResOwnerReleaseJitContext, > + .PrintLeakWarning = ResOwnerPrintJitContextLeakWarning > +}; I think you mean jit_resowner_funcs here; "jit_funcs" is a bit excessively vague. Also, why did you choose not to define ResourceOwnerRememberJIT? You do that in other modules and it seems better. > +static ResourceOwnerFuncs catcache_funcs = > +{ > + /* catcache references */ > + .name = "catcache reference", > + .phase = RESOURCE_RELEASE_AFTER_LOCKS, > + .ReleaseResource = ResOwnerReleaseCatCache, > + .PrintLeakWarning = ResOwnerPrintCatCacheLeakWarning > +}; > + > +static ResourceOwnerFuncs catlistref_funcs = > +{ > + /* catcache-list pins */ > + .name = "catcache list reference", > + .phase = RESOURCE_RELEASE_AFTER_LOCKS, > + .ReleaseResource = ResOwnerReleaseCatCacheList, > + .PrintLeakWarning = ResOwnerPrintCatCacheListLeakWarning > +}; Similar naming issue here, I say. > diff --git a/src/common/cryptohash_openssl.c b/src/common/cryptohash_openssl.c > index 551ec392b60..642e72d8c0f 100644 > --- a/src/common/cryptohash_openssl.c > +++ b/src/common/cryptohash_openssl.c > @@ -60,6 +59,21 @@ struct pg_cryptohash_ctx > #endif > }; > > +/* ResourceOwner callbacks to hold JitContexts */ > +#ifndef FRONTEND > +static void ResOwnerReleaseCryptoHash(Datum res); > +static void ResOwnerPrintCryptoHashLeakWarning(Datum res); The comment is wrong -- "Crypohashes", not "JitContexts". So according to your performance benchmark, we're willing to accept a 30% performance loss on an allegedly common operation -- numkeep=0 numsnaps=10 becomes 49.8ns from 37.6ns. That seems a bit shocking. Maybe you can claim that these operations aren't exactly hot spots, and so the fact that we remain in the same power-of-ten is sufficient. Is that the argument? -- Álvaro Herrera39°49'30"S 73°17'W "The Postgresql hackers have what I call a "NASA space shot" mentality. Quite refreshing in a world of "weekend drag racer" developers." (Scott Marlowe)
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Mon, Jan 18, 2021 at 9:25 AM Michael Banck wrote: > One other possiblity would be to push a version of pg_surgery that is > compatible with the back-branches somewhere external (e.g. either > git.postgresql.org and/or Github), so that it can be picked up by > distributions and/or individual users in need. Sure, but I don't see how that's better. > That is Assuming it does not need assorted server changes to go with; I > did not read the thread in detail but I was under the assumption it is a > client program? It's a server extension. It does not require core changes. -- Robert Haas EDB: http://www.enterprisedb.com
Re: New IndexAM API controlling index vacuum strategies
Hi, Masahiko-san: For v2-0001-Introduce-IndexAM-API-for-choosing-index-vacuum-s.patch : For blvacuumstrategy(): + if (params->index_cleanup == VACOPT_TERNARY_DISABLED) + return INDEX_VACUUM_STRATEGY_NONE; + else + return INDEX_VACUUM_STRATEGY_BULKDELETE; The 'else' can be omitted. Similar comment for ginvacuumstrategy(). For v2-0002-Choose-index-vacuum-strategy-based-on-amvacuumstr.patch : If index_cleanup option is specified neither VACUUM command nor storage option I think this is what you meant (by not using passive voice): If index_cleanup option specifies neither VACUUM command nor storage option, - * integer, but you can't fit that many items on a page. 11 ought to be more + * integer, but you can't fit that many items on a page. 13 ought to be more It would be nice to add a note why the number of bits is increased. For choose_vacuum_strategy(): + IndexVacuumStrategy ivstrat; The variable is only used inside the loop. You can use vacrelstats->ivstrategies[i] directly and omit the variable. + int ndeaditems_limit = (int) ((freespace / sizeof(ItemIdData)) * 0.7); How was the factor of 0.7 determined ? Comment below only mentioned 'safety factor' but not how it was chosen. I also wonder if this factor should be exposed as GUC. + if (nworkers > 0) + nworkers--; Should log / assert be added when nworkers is <= 0 ? + * XXX: allowing to fill the heap page with only line pointer seems a overkill. 'a overkill' -> 'an overkill' Cheers On Sun, Jan 17, 2021 at 10:21 PM Masahiko Sawada wrote: > On Mon, Jan 18, 2021 at 2:18 PM Masahiko Sawada > wrote: > > > > On Tue, Jan 5, 2021 at 10:35 AM Masahiko Sawada > wrote: > > > > > > On Tue, Dec 29, 2020 at 3:25 PM Masahiko Sawada > wrote: > > > > > > > > On Tue, Dec 29, 2020 at 7:06 AM Peter Geoghegan wrote: > > > > > > > > > > On Sun, Dec 27, 2020 at 11:41 PM Peter Geoghegan > wrote: > > > > > > I experimented with this today, and I think that it is a good > way to > > > > > > do it. I like the idea of choose_vacuum_strategy() understanding > that > > > > > > heap pages that are subject to many non-HOT updates have a > "natural > > > > > > extra capacity for LP_DEAD items" that it must care about > directly (at > > > > > > least with non-default heap fill factor settings). My early > testing > > > > > > shows that it will often take a surprisingly long time for the > most > > > > > > heavily updated heap page to have more than about 100 LP_DEAD > items. > > > > > > > > > > Attached is a rough patch showing what I did here. It was applied > on > > > > > top of my bottom-up index deletion patch series and your > > > > > poc_vacuumstrategy.patch patch. This patch was written as a quick > and > > > > > dirty way of simulating what I thought would work best for > bottom-up > > > > > index deletion for one specific benchmark/test, which was > > > > > non-hot-update heavy. This consists of a variant pgbench with > several > > > > > indexes on pgbench_accounts (almost the same as most other > bottom-up > > > > > deletion benchmarks I've been running). Only one index is > "logically > > > > > modified" by the updates, but of course we still physically modify > all > > > > > indexes on every update. I set fill factor to 90 for this > benchmark, > > > > > which is an important factor for how your VACUUM patch works during > > > > > the benchmark. 
> > > > > > > > > > This rough supplementary patch includes VACUUM logic that assumes > (but > > > > > doesn't check) that the table has heap fill factor set to 90 -- > see my > > > > > changes to choose_vacuum_strategy(). This benchmark is really about > > > > > stability over time more than performance (though performance is > also > > > > > improved significantly). I wanted to keep both the table/heap and > the > > > > > logically unmodified indexes (i.e. 3 out of 4 indexes on > > > > > pgbench_accounts) exactly the same size *forever*. > > > > > > > > > > Does this make sense? > > > > > > > > Thank you for sharing the patch. That makes sense. > > > > > > > > +if (!vacuum_heap) > > > > +{ > > > > +if (maxdeadpage > 130 || > > > > +/* Also check if maintenance_work_mem space is > running out */ > > > > +vacrelstats->dead_tuples->num_tuples > > > > > +vacrelstats->dead_tuples->max_tuples / 2) > > > > +vacuum_heap = true; > > > > +} > > > > > > > > The second test checking if maintenane_work_mem space is running out > > > > also makes sense to me. Perhaps another idea would be to compare the > > > > number of collected garbage tuple to the total number of heap tuples > > > > so that we do lazy_vacuum_heap() only when we’re likely to reclaim a > > > > certain amount of garbage in the table. > > > > > > > > > > > > > > Anyway, with a 15k TPS limit on a pgbench scale 3000 DB, I see that > > > > > pg_stat_database shows an almost ~28% reduction in blks_read after > an > > > > > o
Re: ResourceOwner refactoring
On 18/01/2021 16:34, Alvaro Herrera wrote: So according to your performance benchmark, we're willing to accept a 30% performance loss on an allegedly common operation -- numkeep=0 numsnaps=10 becomes 49.8ns from 37.6ns. That seems a bit shocking. Maybe you can claim that these operations aren't exactly hot spots, and so the fact that we remain in the same power-of-ten is sufficient. Is that the argument? That's right. The fast path is fast, and that's important. The slow path becomes 30% slower, but that's acceptable. - Heikki
Re: support for MERGE
Jaime Casanova just reported that this patch causes a crash on the regression database with this query:

MERGE INTO public.pagg_tab_ml_p3 as target_0
  USING public.prt2_l_p3_p2 as ref_0
  ON target_0.a = ref_0.a
  WHEN MATCHED AND cast(null as tid) <= cast(null as tid) THEN DELETE;

The reason is down to adjust_partition_tlist() not being willing to deal with empty tlists. So this is the most direct fix:

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1fa4d84c42..d6b478ec33 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -976,7 +976,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 			conv_tl = map_partition_varattnos((List *) action->targetList,
 											  firstVarno, partrel,
 											  firstResultRel);
-			conv_tl = adjust_partition_tlist(conv_tl, map);
+			if (conv_tl != NIL)
+				conv_tl = adjust_partition_tlist(conv_tl, map);
 
 			tupdesc = ExecTypeFromTL(conv_tl);
 			/* XXX gotta pfree conv_tl and tupdesc? */

But I wonder if it wouldn't be better to patch adjust_partition_tlist() to return NIL on NIL input, instead, like this:

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1fa4d84c42..6a170eea03 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1589,6 +1589,9 @@ adjust_partition_tlist(List *tlist, TupleConversionMap *map)
 	AttrMap    *attrMap = map->attrMap;
 	AttrNumber	attrno;
 
+	if (tlist == NIL)
+		return NIL;
+
 	Assert(tupdesc->natts == attrMap->maplen);
 	for (attrno = 1; attrno <= tupdesc->natts; attrno++)
 	{

I lean towards the latter myself.

--
Álvaro Herrera  Valdivia, Chile
Re: POC: postgres_fdw insert batching
Hi, Takayuki-san: + if (batch_size <= 0) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("%s requires a non-negative integer value", It seems the message doesn't match the check w.r.t. the batch size of 0. + int numInserted = numSlots; Since numInserted is filled by ExecForeignBatchInsert(), the initialization can be done with 0. Cheers On Sun, Jan 17, 2021 at 10:52 PM tsunakawa.ta...@fujitsu.com < tsunakawa.ta...@fujitsu.com> wrote: > Tomas-san, > > From: Amit Langote > > Good thing you reminded me that this is about inserts, and in that > > case no, ExecInitModifyTable() doesn't know all leaf partitions, it > > only sees the root table whose batch_size doesn't really matter. So > > it's really ExecInitRoutingInfo() that I would recommend to set > > ri_BatchSize; right after this block: > > > > /* > > * If the partition is a foreign table, let the FDW init itself for > > * routing tuples to the partition. > > */ > > if (partRelInfo->ri_FdwRoutine != NULL && > > partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) > > partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo); > > > > Note that ExecInitRoutingInfo() is called only once for a partition > > when it is initialized after being inserted into for the first time. > > > > For a non-partitioned targets, I'd still say set ri_BatchSize in > > ExecInitModifyTable(). > > Attached is the patch that added call to GetModifyBatchSize() to the above > two places. The regression test passes. > > (FWIW, frankly, I prefer the previous version because the code is a bit > smaller... Maybe we should refactor the code someday to reduce similar > processings in both the partitioned case and non-partitioned case.) > > > Regards > Takayuki Tsunakawa > >
Re: popcount
Peter Eisentraut writes: > [ assorted nits ] At the level of bikeshedding ... I quite dislike using the name "popcount" for these functions. I'm aware that some C compilers provide primitives of that name, but I wouldn't expect a SQL programmer to know that; without that context the name seems pretty random and unintuitive. Moreover, it invites confusion with SQL's use of "pop" to abbreviate "population" in the statistical aggregates, such as var_pop(). Perhaps something along the lines of count_ones() or count_set_bits() would be more apropos. regards, tom lane
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/18 23:14, Bharath Rupireddy wrote:
On Mon, Jan 18, 2021 at 11:44 AM Fujii Masao wrote:
I will post patches for the other function postgres_fdw_disconnect, GUC and server level option later.
Thanks! Attaching v12 patch set. 0001 is for postgres_fdw_disconnect() function, 0002 is for keep_connections GUC and 0003 is for keep_connection server level option.
Thanks! Please review it further.

+	server = GetForeignServerByName(servername, true);
+
+	if (!server)
+		ereport(ERROR,
+				(errcode(ERRCODE_CONNECTION_DOES_NOT_EXIST),
+				 errmsg("foreign server \"%s\" does not exist", servername)));

ISTM we can simplify this code as follows.

	server = GetForeignServerByName(servername, false);

+	hash_seq_init(&scan, ConnectionHash);
+	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))

When the server name is specified, even if its connection is successfully closed, postgres_fdw_disconnect() scans through all the entries to check whether there are active connections. But if "result" is true and active_conn_exists is true, we can get out of this loop to avoid unnecessary scans.

+	/*
+	 * Destroy the cache if we discarded all active connections i.e. if there
+	 * is no single active connection, which we can know while scanning the
+	 * cached entries in the above loop. Destroying the cache is better than to
+	 * keep it in the memory with all inactive entries in it to save some
+	 * memory. Cache can get initialized on the subsequent queries to foreign
+	 * server.

How much memory is assumed to be saved by destroying the cache in many cases? I'm not sure if it's really worth destroying the cache to save the memory.

+   a warning is issued and false is returned. false
+   is returned when there are no open connections. When there are some open
+   connections, but there is no connection for the given foreign server,
+   then false is returned. When no foreign server exists
+   with the given name, an error is emitted. Example usage of the function:

When a non-existent server name is specified, postgres_fdw_disconnect() emits an error if there is at least one open connection, but just returns false otherwise. At least for me, this behavior looks inconsistent and strange. In that case, IMO the function always should emit an error.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
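A minimal sketch of the early-exit idea suggested above, assuming the loop and variable names (scan, entry, result, active_conn_exists) from the v12 patch; note that hash_seq_term() must be called when a hash_seq_search() scan is abandoned before it returns NULL:

```c
/* Sketch only -- not the patch's actual code. */
while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)) != NULL)
{
	/* ... close the matching connection, track whether any active ones remain ... */

	if (result && active_conn_exists)
	{
		/* nothing left to learn from the remaining entries */
		hash_seq_term(&scan);
		break;
	}
}
```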
Re: Key management with tests
On Sun, Jan 17, 2021 at 07:50:13PM -0500, Robert Haas wrote:
> On Fri, Jan 15, 2021 at 7:56 PM Andres Freund wrote:
> > I think that's not at all acceptable. I don't mind hashing out details
> > on calls / off-list, but the design needs to be public, documented, and
> > reviewable. And if it's something the community can't understand, then
> > it can't get in. We're going to have to maintain this going forward.
>
> I agree. If the community is unable to clearly understand what
> something is, and why we should have it, then we shouldn't have it --
> even if the reason is that we're too dumb to understand, as Bruce

I am not sure why you are bringing intelligence into this discussion. You have to understand Postgres internals and cryptography tradeoffs to understand why some of the design decisions were made. It is a knowledge issue, not an intelligence issue. The wiki page is the result of those phone discussions.

--
  Bruce Momjian  https://momjian.us
  EnterpriseDB   https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee
Re: PoC/WIP: Extended statistics on expressions
Looking through extended_stats.c, I found a corner case that can lead to a seg-fault:

CREATE TABLE foo();
CREATE STATISTICS s ON (1) FROM foo;
ANALYSE foo;

This crashes in lookup_var_attr_stats(), because it isn't expecting nvacatts to be 0. I can't think of any case where building stats on a table with no analysable columns is useful, so it should probably just exit early in that case.

In BuildRelationExtStatistics(), it looks like min_attrs should be declared assert-only.

In evaluate_expressions():

+	/* set the pointers */
+	result = (ExprInfo *) ptr;
+	ptr += sizeof(ExprInfo);

I think that should probably have a MAXALIGN().

A slightly bigger issue that I don't like is the way it assigns attribute numbers for expressions starting from MaxHeapAttributeNumber+1, so the first expression has an attnum of 1601. That leads to pretty inefficient use of Bitmapsets, since most tables only contain a handful of columns, and there's a large block of unused space in the middle of the Bitmapset.

An alternative approach might be to use regular attnums for columns and use negative indexes -1, -2, -3, ... for expressions in the stored stats. Then when storing and retrieving attnums from Bitmapsets, it could just offset by STATS_MAX_DIMENSIONS (8) to avoid negative values in the Bitmapsets, since there can't be more than that many expressions (just like other code stores system attributes using FirstLowInvalidHeapAttributeNumber). That would be a somewhat bigger change, but hopefully fairly mechanical, and then some code like add_expressions_to_attributes() would go away.

Looking at the new view pg_stats_ext_exprs, I noticed that it fails to show expressions until the statistics have been built. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON (a+b), (a*b) FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

but after populating and analysing the table, this becomes:

 statistics_name | tablename |  expr   | n_distinct
-----------------+-----------+---------+------------
 s               | foo       | (a + b) |         11
 s               | foo       | (a * b) |         11
(2 rows)

I think it should show the expressions even before the stats have been built.

Another issue is that it returns rows for non-expression stats as well. For example:

CREATE TABLE foo(a int, b int);
CREATE STATISTICS s ON a, b FROM foo;
SELECT statistics_name, tablename, expr, n_distinct FROM pg_stats_ext_exprs;

 statistics_name | tablename | expr | n_distinct
-----------------+-----------+------+------------
 s               | foo       |      |
(1 row)

and those values will never be populated, since they're not expressions, so I would expect them to not be shown in the view. So basically, instead of

+        LEFT JOIN LATERAL (
+            SELECT
+                *
+            FROM (
+                SELECT
+                    unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                    unnest(sd.stxdexpr)::pg_statistic AS a
+            ) x
+        ) stat ON sd.stxdexpr IS NOT NULL;

perhaps just

+        JOIN LATERAL (
+            SELECT
+                *
+            FROM (
+                SELECT
+                    unnest(pg_get_statisticsobjdef_expressions(s.oid)) AS expr,
+                    unnest(sd.stxdexpr)::pg_statistic AS a
+            ) x
+        ) stat ON true;

Regards,
Dean
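To make the suggested attnum mapping concrete, here is a minimal sketch. The helper names are hypothetical (not from the patch); it assumes expressions are stored with negative attnums -1, -2, ... and that STATS_MAX_DIMENSIONS is 8, as noted above:

```c
/* Hypothetical helpers -- names are not from the patch. */

/* the i-th expression (0-based) is stored as a negative attnum */
#define STAT_EXPR_ATTNUM(i)		(-((i) + 1))	/* 0 -> -1, 1 -> -2, ... */

/*
 * Shift both column and expression attnums into non-negative space before
 * storing them in a Bitmapset, mirroring how system attributes are shifted
 * by FirstLowInvalidHeapAttributeNumber elsewhere.
 */
static inline int
stat_attnum_to_bms_member(int attnum)
{
	return attnum + STATS_MAX_DIMENSIONS;	/* -8..-1 -> 0..7, 1..N -> 9.. */
}

static inline int
bms_member_to_stat_attnum(int member)
{
	return member - STATS_MAX_DIMENSIONS;
}
```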
Re: Key management with tests
On Sun, Jan 17, 2021 at 11:54:57AM +0530, Amit Kapila wrote: > > Is there a design document for a Postgres feature of this size and > > scope that people feel would serve as a good example? Alternatively, > > is there a design document template that has been successfully used in > > the past? > > We normally write the design considerations and choices we made with > the reasons in README and code comments. Personally, I am not sure if > there is a need for any specific document per-se but a README and > detailed comments in the code should suffice what people are worried > about here. It is mostly from the perspective that other developers > reading the code, want to do bug-fix, or later enhance that code > should be able to understand it. One recent example I can give is > Peter's work on bottom-up deletion [1] which I have read today where I > find that the design is captured via README, appropriate comments in > the code, and documentation. This feature is quite different and > probably a lot more new concepts are being introduced but I hope that > will give you some clue. > > [1] - > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=d168b666823b6e0bcf60ed19ce24fb5fb91b8ccf OK, I looked at that and it is good, and I see my patch is missing that. Are people looking for me to take the wiki content, expand on it and tie it to the code that will be applied, or something else like all the various crypto options and why we chose what we did beyond what is already on the wiki? I can easily go from what we have on the wiki to implementation code steps, but the other part is harder to explain and that is why I offered to talk to people via voice. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Key management with tests
On Sat, Jan 16, 2021 at 10:58:47PM -0800, Andres Freund wrote: > Hi, > > On 2021-01-17 11:54:57 +0530, Amit Kapila wrote: > > On Sun, Jan 17, 2021 at 5:38 AM Tom Kincaid > > wrote: > > > Admittedly I am a novice on this topic, and the majority of the > > > PostgreSQL source code, however I am hopeful enough (those of you who > > > know me understand that I suffer from eternal optimism) that I am > > > going to attempt to help. > > > > > > Is there a design document for a Postgres feature of this size and > > > scope that people feel would serve as a good example? Alternatively, > > > is there a design document template that has been successfully used in > > > the past? > > > > > > > We normally write the design considerations and choices we made with > > the reasons in README and code comments. Personally, I am not sure if > > there is a need for any specific document per-se but a README and > > detailed comments in the code should suffice what people are worried > > about here. > > Right. It could be a README file, or a long comment at a start of one of > the files. It doesn't matter too much. What matters is that people that > haven't been on those phone call can understand the design and the > implications it has. OK, so does the wiki page contain most of what you want, but is missing the connection between the design and the code? https://wiki.postgresql.org/wiki/Transparent_Data_Encryption -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi, For 0001-Allow-REINDEX-to-change-tablespace.patch : + * InvalidOid, use the tablespace in-use instead. 'in-use' seems a bit redundant in the sentence. How about : InvalidOid, use the tablespace of the index instead. Cheers On Mon, Jan 18, 2021 at 12:38 AM Justin Pryzby wrote: > On Mon, Jan 18, 2021 at 02:18:44PM +0900, Michael Paquier wrote: > > On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote: > > > It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and > ReindexParams > > > In my v31 patch, I moved ReindexOptions to a private structure in > indexcmds.c, > > > with an "int options" bitmask which is passed to reindex_index() et > al. Your > > > patch keeps/puts ReindexOptions index.h, so it also applies to > reindex_index, > > > which I think is good. > > > > a3dc926 is an equivalent of 0001~0003 merged together. 0008 had > > better be submitted into a separate thread if there is value to it. > > 0004~0007 are the pieces remaining. Could it be possible to rebase > > things on HEAD and put the tablespace bits into the structures > > {Vacuum,Reindex,Cluster}Params? > > Attached. I will re-review these myself tomorrow. > > -- > Justin >
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2021/01/18 22:03, Bharath Rupireddy wrote: On Mon, Jan 18, 2021 at 6:17 PM Fujii Masao wrote: +1 to add it after "dropped (Note )", how about as follows with slight changes? dropped (Note that server name of an invalid connection can be NULL if the server is dropped), and then such . Yes, I like this one. One question is; "should" or "is" is better than "can" in this case because the server name of invalid connection is always NULL when its server is dropped? I think "dropped (Note that server name of an invalid connection will be NULL if the server is dropped), and then such ." Sounds good to me. So patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION From cbe84856899f23a810fe4d42cf1cd5e46b3d6870 Mon Sep 17 00:00:00 2001 From: Fujii Masao Date: Tue, 19 Jan 2021 00:56:10 +0900 Subject: [PATCH] doc: Add note about the server name of postgres_fdw_get_connections() returns. Previously the document didn't mention the case where postgres_fdw_get_connections() returns NULL in server_name column. Users might be confused about why NULL was returned. This commit adds the note that, in postgres_fdw_get_connections(), the server name of an invalid connection will be NULL if the server is dropped. Suggested-by: Zhijie Hou Author: Bharath Rupireddy Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/e7ddd14e96444fce88e47a709c196537@G08CNEXMBPEKD05.g08.fujitsu.local --- doc/src/sgml/postgres-fdw.sgml | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml index 6a91926da8..9adc8d12a9 100644 --- a/doc/src/sgml/postgres-fdw.sgml +++ b/doc/src/sgml/postgres-fdw.sgml @@ -493,7 +493,9 @@ OPTIONS (ADD password_required 'false'); each connection is valid or not. false is returned if the foreign server connection is used in the current local transaction but its foreign server or user mapping is changed or - dropped, and then such invalid connection will be closed at + dropped (Note that server name of an invalid connection will be + NULL if the server is dropped), + and then such invalid connection will be closed at the end of that transaction. true is returned otherwise. If there are no open connections, no record is returned. Example usage of the function: -- 2.27.0
Re: recovering from "found xmin ... from before relfrozenxid ..."
> On Jan 18, 2021, at 18:54, Robert Haas wrote:
>
> On Sat, Jan 16, 2021 at 10:41 AM Andrey Borodin wrote:
>> Does anyone maintain opensource pg_surgery analogs for released versions of PG?
>> It seems to me I'll have to use something like this and I just thought that I should consider pg_surgery in favour of our pg_dirty_hands.
>
> I do not. I'm still of the opinion that we ought to back-patch pg_surgery.

+1. Yesterday I spent a few hours packaging pg_dirty_hands and pg_surgery (BTW, it works fine for 12). It's a kind of 911 tool: one doesn't think they will need it until they actually do, and the clock is ticking. OTOH, it opens new ways to shoot oneself in the foot.

Best regards, Andrey Borodin.
Re: support for MERGE
On Thu, Dec 31, 2020 at 8:48 AM Alvaro Herrera wrote: > Here's a rebase of Simon/Pavan's MERGE patch to current sources. I > cleaned up some minor things in it, but aside from rebasing, it's pretty > much their work (even the commit message is Simon's). It's my impression that the previous discussion concluded that their version needed pretty substantial design changes. Is there a plan to work on that? Or was some of that stuff done already? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add session statistics to pg_stat_database
On Sun, 2021-01-17 at 14:07 +0100, Magnus Hagander wrote: > I have applied this version, with some minor changes: > > * I renamed the n__time members in the struct to just > total__time. The n_ indicates "number of" and is thus wrong for > time parameters. Right. > * Some very minor wording changes. > > * catversion bump (for once I didn't forget it!) Thank you! You included the catversion bump, but shouldn't PGSTAT_FILE_FORMAT_ID in "include/pgstat.h" be updated as well? Yours, Laurenz Albe
Re: ResourceOwner refactoring
On Mon, Jan 18, 2021 at 10:19 AM Heikki Linnakangas wrote: > > On 18/01/2021 16:34, Alvaro Herrera wrote: > > So according to your performance benchmark, we're willing to accept a > > 30% performance loss on an allegedly common operation -- numkeep=0 > > numsnaps=10 becomes 49.8ns from 37.6ns. That seems a bit shocking. > > Maybe you can claim that these operations aren't exactly hot spots, and > > so the fact that we remain in the same power-of-ten is sufficient. Is > > that the argument? > > That's right. The fast path is fast, and that's important. The slow path > becomes 30% slower, but that's acceptable. > > - Heikki > > -- Robert Haas EDB: http://www.enterprisedb.com
Re: ResourceOwner refactoring
On Mon, Jan 18, 2021 at 11:11 AM Robert Haas wrote: > On Mon, Jan 18, 2021 at 10:19 AM Heikki Linnakangas wrote: > > On 18/01/2021 16:34, Alvaro Herrera wrote: > > > So according to your performance benchmark, we're willing to accept a > > > 30% performance loss on an allegedly common operation -- numkeep=0 > > > numsnaps=10 becomes 49.8ns from 37.6ns. That seems a bit shocking. > > > Maybe you can claim that these operations aren't exactly hot spots, and > > > so the fact that we remain in the same power-of-ten is sufficient. Is > > > that the argument? > > > > That's right. The fast path is fast, and that's important. The slow path > > becomes 30% slower, but that's acceptable. Sorry for the empty message. I don't know whether a 30% slowdown will hurt anybody, but it seems like kind of a lot, and I'm not sure I understand what corresponding benefit we're getting. -- Robert Haas EDB: http://www.enterprisedb.com
Re: popcount
Peter Eisentraut wrote:

> +	/*
> +	 * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
> +	 * done to minimize branches and instructions.
> +	 */
>
> I don't know what that is supposed to mean or why that kind of tuning
> would be necessary for a user-callable function.

Also, the formula just below looks wrong in the current patch (v3):

+	/*
+	 * ceil(VARBITLEN(ARG1)/BITS_PER_BYTE)
+	 * done to minimize branches and instructions.
+	 */
+	len = (VARBITLEN(arg1) + BITS_PER_BYTE + 1) / BITS_PER_BYTE;
+	p = VARBITS(arg1);
+
+	popcount = pg_popcount((const char *)p, len);

It should be

	len = (VARBITLEN(arg1) + BITS_PER_BYTE - 1) / BITS_PER_BYTE

If we add 1 to BITS_PER_BYTE instead of subtracting 1, when VARBITLEN(arg1) is a multiple of BITS_PER_BYTE, "len" is one byte too large.

The correct formula is already used in include/utils/varbit.h near the definition of VARBITLEN:

#define VARBITTOTALLEN(BITLEN)	(((BITLEN) + BITS_PER_BYTE-1)/BITS_PER_BYTE + \
								 VARHDRSZ + VARBITHDRSZ)

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite
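A tiny standalone check of the arithmetic (illustrative values only, not part of the patch):

```c
#include <stdio.h>

#define BITS_PER_BYTE 8

int
main(void)
{
	int		bitlen = 16;	/* a bit length that is an exact multiple of 8 */

	/* patch v3: rounds one byte too far for exact multiples */
	int		wrong = (bitlen + BITS_PER_BYTE + 1) / BITS_PER_BYTE;	/* 25/8 = 3 */

	/* standard ceiling division, as in VARBITTOTALLEN */
	int		right = (bitlen + BITS_PER_BYTE - 1) / BITS_PER_BYTE;	/* 23/8 = 2 */

	printf("wrong = %d bytes, right = %d bytes\n", wrong, right);
	return 0;
}
```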
Re: support for MERGE
On 2021-Jan-18, Robert Haas wrote: > On Thu, Dec 31, 2020 at 8:48 AM Alvaro Herrera > wrote: > > Here's a rebase of Simon/Pavan's MERGE patch to current sources. I > > cleaned up some minor things in it, but aside from rebasing, it's pretty > > much their work (even the commit message is Simon's). > > It's my impression that the previous discussion concluded that their > version needed pretty substantial design changes. Is there a plan to > work on that? Or was some of that stuff done already? I think some of the issues were handled as I adapted the patch to current sources. However, the extensive refactoring that had been recommended in the old threads has not been done. I have these two comments mainly: https://postgr.es/m/CABOikdOeraX0RKojBs0jZxY5SmbQF3GJBP0+Rat=2g+vqsa...@mail.gmail.com https://postgr.es/m/7168.1547584...@sss.pgh.pa.us I'll try to get to those, but I have some other patches that I need to handle first. -- Álvaro Herrera Valdivia, Chile
Re: Add session statistics to pg_stat_database
On Mon, Jan 18, 2021 at 5:11 PM Laurenz Albe wrote: > > On Sun, 2021-01-17 at 14:07 +0100, Magnus Hagander wrote: > > I have applied this version, with some minor changes: > > > > * I renamed the n__time members in the struct to just > > total__time. The n_ indicates "number of" and is thus wrong for > > time parameters. > > Right. > > > * Some very minor wording changes. > > > > * catversion bump (for once I didn't forget it!) > > Thank you! > > You included the catversion bump, but shouldn't PGSTAT_FILE_FORMAT_ID > in "include/pgstat.h" be updated as well? Yup, you are absolutely correct. Will fix. -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
Re: [PATCH] ProcessInterrupts_hook
Robert Haas writes: > On Mon, Jan 18, 2021 at 3:00 AM Craig Ringer > wrote: >> A few times lately I've been doing things in extensions that've made me want >> to be able to run my own code whenever InterruptPending is true and >> CHECK_FOR_INTERRUPTS() calls ProcessInterrupts() > I've wanted this in the past, too, so +1 from me. I dunno, this seems pretty scary and easily abusable. There's not all that much that can be done safely in ProcessInterrupts(), and we should not be encouraging extensions to think they can add random processing there. >> What I really want to go along with this is a way for any backend to observe >> the postmaster's pmState and its "Shutdown" variable's value, so any backend >> can tell if we're in FastShutdown, SmartShutdown, etc. > I've wanted something along this line, too, but what I was thinking > about was more along the lines of having the postmaster signal the > backends when a smart shutdown happened. We're about halfway there already, see 7e784d1dc. I didn't do the other half because it wasn't necessary to the problem, but exposing the shutdown state more fully seems reasonable. regards, tom lane
Re: Key management with tests
On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote: > OK, I looked at that and it is good, and I see my patch is missing that. > Are people looking for me to take the wiki content, expand on it and tie > it to the code that will be applied, or something else like all the > various crypto options and why we chose what we did beyond what is > already on the wiki? I can easily go from what we have on the wiki to > implementation code steps, but the other part is harder to explain and > that is why I offered to talk to people via voice. Just to clarify why voice calls can be helpful --- if you have to get into "you have to understand X to understand Y", that's where a voice call works best, because understanding X will require understanding A/B/C, and everyone's missing pieces are different, so you have to customize it for the individual. You can explain some of this in a README, but trying to cover all of it leads to a combinatorial problem of trying to explain everything. Ideally the wiki page can be expanded so people can ask and answer all posted issues, perhaps in a Q&A format. Someone could go through the archives and post why certain decisions were made, and link to the original emails. I have to admit I was kind of baffled that the wiki page wasn't sufficient, because it is one of the longest Postgres feature explanations I have seen, but I now think the missing part is tying the wiki contents to the code implementation. If that is it, please confirm. If it is something else, also explain. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Add primary keys to system catalogs
Robert Haas writes: > I don't have any complaint about labelling some of the unique indexes > as primary keys, but I think installing foreign keys that don't really > enforce anything may lead to confusion. I'm not sure if I buy the "confusion" argument, but after thinking about this more I realized that there's a stronger roadblock for treating catalog interrelationships as SQL foreign keys. Namely, that we always represent no-reference situations with a zero OID, whereas it'd have to be NULL to look like a valid foreign-key case. Changing to real NULLs in those columns would of course break a ton of C code, so I don't think that's gonna happen. We could overlay an FK onto OID columns that should never have zero entries, but that would miss enough of the interesting cases that I don't find it a very attractive proposal. So I think we should finish up the project of making real SQL constraints for the catalog indexes, but the idea of making FK constraints too looks like it'll end in failure. The reason I got interested in this right now is the nearby discussion [1] about why findoidjoins misses some catalog relationships and whether we should fix that and/or make its results more readily accessible. I'd thought perhaps FK constraint entries could supersede the need for that tool altogether, but now it seems like not. I still like the idea of marking OID relationships in the catalog headers though. Perhaps we should take Joel's suggestion of a new system catalog more seriously, and have genbki.pl populate such a catalog from info in the catalog header files. regards, tom lane [1] https://www.postgresql.org/message-id/flat/d1f413ff-0e50-413d-910c-bdd9ee9752c1%40www.fastmail.com
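To illustrate the zero-vs-NULL point with catalog columns that exist today (my example, not from the thread): pg_class.reltablespace stores 0 to mean "the database's default tablespace", and there is no pg_tablespace row with OID 0, so a literal foreign key from that column to pg_tablespace.oid could not validate:

```sql
-- rows whose reltablespace is the "no specific tablespace" marker
SELECT count(*) FROM pg_class WHERE reltablespace = 0;

-- but a would-be FK target row with OID 0 does not exist
SELECT count(*) FROM pg_tablespace WHERE oid = 0;   -- returns 0
```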
Re: Key management with tests
Hi,

On 2021-01-18 12:06:35 -0500, Bruce Momjian wrote:
> On Mon, Jan 18, 2021 at 10:50:37AM -0500, Bruce Momjian wrote:
> > OK, I looked at that and it is good, and I see my patch is missing that.
> > Are people looking for me to take the wiki content, expand on it and tie
> > it to the code that will be applied, or something else like all the
> > various crypto options and why we chose what we did beyond what is
> > already on the wiki? I can easily go from what we have on the wiki to
> > implementation code steps, but the other part is harder to explain and
> > that is why I offered to talk to people via voice.
>
> Just to clarify why voice calls can be helpful --- if you have to get
> into "you have to understand X to understand Y", that's where a voice
> call works best, because understanding X will require understanding
> A/B/C, and everyone's missing pieces are different, so you have to
> customize it for the individual.

I don't think anybody argued against having voice calls.

> You can explain some of this in a README, but trying to cover all of it
> leads to a combinatorial problem of trying to explain everything.
> Ideally the wiki page can be expanded so people can ask and answer all
> posted issues, perhaps in a Q&A format. Someone could go through the
> archives and post why certain decisions were made, and link to the
> original emails.
>
> I have to admit I was kind of baffled that the wiki page wasn't
> sufficient, because it is one of the longest Postgres feature
> explanations I have seen, but I now think the missing part is tying
> the wiki contents to the code implementation. If that is it, please
> confirm. If it is something else, also explain.

I don't think the wiki right now covers what's needed. The "Overview", "Threat model" and "Scope of TDE" sections are a start, but beyond that it's missing a bunch of things. And it's not in the source tree (we'll soon have multiple versions of postgres with increasing levels of TDE features; the wiki doesn't help with that).

Missing:
- it talks about cluster-wide encryption being simpler, without mentioning what it's being compared to, and what makes it simpler
- no differentiation from file system / block level encryption
- there's no explanation of which/why specific crypto primitives were chosen, or what the tradeoffs are
- no explanation of which keys exist, or where they are stored
- the key management patch introduces new files, not documented
- there are new types of lock files, possibility of interrupted operations, ... - no documentation of what that means
- there's no documentation of what "key wrapping" actually precisely is, or what the danger of the two-tier model is, ...
- are there dangers in not encrypting zero pages etc?
- ...

Personally, though I admit that there are legitimate reasons to differ on this, I don't think it's reasonable for a feature this invasive to commit preliminary patches without the major subsequent patches being in a shape that allows reviewing the whole picture.

Greetings,

Andres Freund
Re: simplifying foreign key/RI checks
Hi, I was looking at this statement: insert into f select generate_series(1, 200, 2); Since certain generated values (the second half) are not in table p, wouldn't insertion for those values fail ? I tried a scaled down version (1000th) of your example: yugabyte=# insert into f select generate_series(1, 2000, 2); ERROR: insert or update on table "f" violates foreign key constraint "f_a_fkey" DETAIL: Key (a)=(1001) is not present in table "p". For v1-0002-Avoid-using-SPI-for-some-RI-checks.patch : +* Collect partition key values from the unique key. At the end of the nested loop, should there be an assertion that partkey->partnatts partition key values have been found ? This can be done by using a counter (initialized to 0) which is incremented when a match is found by the inner loop. Cheers On Mon, Jan 18, 2021 at 4:40 AM Amit Langote wrote: > While discussing the topic of foreign key performance off-list with > Robert and Corey (also came up briefly on the list recently [1], [2]), > a few ideas were thrown around to simplify our current system of RI > checks to enforce foreign keys with the aim of reducing some of its > overheads. The two main aspects of how we do these checks that > seemingly cause the most overhead are: > > * Using row-level triggers that are fired during the modification of > the referencing and the referenced relations to perform them > > * Using plain SQL queries issued over SPI > > There is a discussion nearby titled "More efficient RI checks - take > 2" [2] to address this problem from the viewpoint that it is using > row-level triggers that causes the most overhead, although there are > some posts mentioning that SQL-over-SPI is not without blame here. I > decided to focus on the latter aspect and tried reimplementing some > checks such that SPI can be skipped altogether. > > I started with the check that's performed when inserting into or > updating the referencing table to confirm that the new row points to a > valid row in the referenced relation. The corresponding SQL is this: > > SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x > > $1 is the value of the foreign key of the new row. If the query > returns a row, all good. Thanks to SPI, or its use of plan caching, > the query is re-planned only a handful of times before making a > generic plan that is then saved and reused, which looks like this: > > QUERY PLAN > -- > LockRows >-> Index Scan using pk_pkey on pk x > Index Cond: (a = $1) > (3 rows) > > So in most cases, the trigger's function would only execute the plan > that's already there, at least in a given session. That's good but > what we realized would be even better is if we didn't have to > "execute" a full-fledged "plan" for this, that is, to simply find out > whether a row containing the key we're looking for exists in the > referenced relation and if found lock it. Directly scanning the index > and locking it directly with table_tuple_lock() like ExecLockRows() > does gives us exactly that behavior, which seems simple enough to be > done in a not-so-long local function in ri_trigger.c. I gave that a > try and came up with the attached. It also takes care of the case > where the referenced relation is partitioned in which case its > appropriate leaf partition's index is scanned. 
> > The patch results in ~2x improvement in the performance of inserts and > updates on referencing tables: > > create table p (a numeric primary key); > insert into p select generate_series(1, 100); > create table f (a bigint references p); > > -- unpatched > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 6340.733 ms (00:06.341) > > update f set a = a + 1; > UPDATE 100 > Time: 7490.906 ms (00:07.491) > > -- patched > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 3340.808 ms (00:03.341) > > update f set a = a + 1; > UPDATE 100 > Time: 4178.171 ms (00:04.178) > > The improvement is even more dramatic when the referenced table (that > we're no longer querying over SPI) is partitioned. Here are the > numbers when the PK relation has 1000 hash partitions. > > Unpatched: > > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 35898.783 ms (00:35.899) > > update f set a = a + 1; > UPDATE 100 > Time: 37736.294 ms (00:37.736) > > Patched: > > insert into f select generate_series(1, 200, 2); > INSERT 0 100 > Time: 5633.377 ms (00:05.633) > > update f set a = a + 1; > UPDATE 100 > Time: 6345.029 ms (00:06.345) > > That's over ~5x improvement! > > While the above case seemed straightforward enough for skipping SPI, > it seems a bit hard to do the same for other cases where we query the > *referencing* relation during an operation on the referenced table > (for example, checking if the row being deleted is still referenced), > because the plan in those cases is not predict
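To illustrate the approach the quoted mail describes (scanning the referenced table's unique index directly and locking the matching row instead of running SQL over SPI), here is a rough, heavily simplified sketch. It is not the patch's actual code: it assumes a single int8 key column, and it ignores partitioning, NULL keys, and the retry logic needed when the locked row is concurrently updated.

```c
#include "postgres.h"

#include "access/genam.h"
#include "access/skey.h"
#include "access/stratnum.h"
#include "access/tableam.h"
#include "access/xact.h"
#include "executor/tuptable.h"
#include "utils/fmgroids.h"
#include "utils/snapmgr.h"

/*
 * Check that keyval exists in the referenced (PK) table by scanning its
 * unique index and locking the matching row -- roughly what the
 * "SELECT 1 ... FOR KEY SHARE" issued over SPI accomplishes.
 */
static bool
ri_key_exists_and_lock(Relation pk_rel, Relation pk_idx, Datum keyval)
{
	ScanKeyData skey;
	IndexScanDesc scan;
	TupleTableSlot *slot;
	Snapshot	snapshot;
	bool		found = false;

	ScanKeyInit(&skey, 1, BTEqualStrategyNumber, F_INT8EQ, keyval);

	snapshot = RegisterSnapshot(GetLatestSnapshot());
	slot = table_slot_create(pk_rel, NULL);
	scan = index_beginscan(pk_rel, pk_idx, snapshot, 1, 0);
	index_rescan(scan, &skey, 1, NULL, 0);

	if (index_getnext_slot(scan, ForwardScanDirection, slot))
	{
		TM_FailureData tmfd;
		TM_Result	res;

		/* lock the row the way SELECT ... FOR KEY SHARE would */
		res = table_tuple_lock(pk_rel, &slot->tts_tid, snapshot, slot,
							   GetCurrentCommandId(false),
							   LockTupleKeyShare, LockWaitBlock,
							   0, &tmfd);
		found = (res == TM_Ok);
	}

	index_endscan(scan);
	ExecDropSingleTupleTableSlot(slot);
	UnregisterSnapshot(snapshot);

	return found;
}
```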
Re: POC: postgres_fdw insert batching
On 1/18/21 7:51 AM, tsunakawa.ta...@fujitsu.com wrote: Tomas-san, From: Amit Langote Good thing you reminded me that this is about inserts, and in that case no, ExecInitModifyTable() doesn't know all leaf partitions, it only sees the root table whose batch_size doesn't really matter. So it's really ExecInitRoutingInfo() that I would recommend to set ri_BatchSize; right after this block: /* * If the partition is a foreign table, let the FDW init itself for * routing tuples to the partition. */ if (partRelInfo->ri_FdwRoutine != NULL && partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL) partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo); Note that ExecInitRoutingInfo() is called only once for a partition when it is initialized after being inserted into for the first time. For a non-partitioned targets, I'd still say set ri_BatchSize in ExecInitModifyTable(). Attached is the patch that added call to GetModifyBatchSize() to the above two places. The regression test passes. (FWIW, frankly, I prefer the previous version because the code is a bit smaller... Maybe we should refactor the code someday to reduce similar processings in both the partitioned case and non-partitioned case.) Less code would be nice, but it's not always the right thing to do, unfortunately :-( I took a look at this - there's a bit of bitrot due to 708d165ddb92c, so attached is a rebased patch (0001) fixing that. 0002 adds a couple comments and minor tweaks 0003 addresses a couple shortcomings related to explain - we haven't been showing the batch size for EXPLAIN (VERBOSE), because there'd be no FdwState, so this tries to fix that. Furthermore, there were no tests for EXPLAIN output with batch size, so I added a couple. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company >From 9425a8501543d4caf55a302a877745b0f83b6046 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Mon, 18 Jan 2021 15:18:14 +0100 Subject: [PATCH 1/3] Add bulk insert for foreign tables --- contrib/postgres_fdw/deparse.c| 43 ++- .../postgres_fdw/expected/postgres_fdw.out| 116 ++- contrib/postgres_fdw/option.c | 14 + contrib/postgres_fdw/postgres_fdw.c | 302 ++ contrib/postgres_fdw/postgres_fdw.h | 5 +- contrib/postgres_fdw/sql/postgres_fdw.sql | 91 ++ doc/src/sgml/fdwhandler.sgml | 89 +- doc/src/sgml/postgres-fdw.sgml| 13 + src/backend/executor/execPartition.c | 12 + src/backend/executor/nodeModifyTable.c| 157 + src/backend/nodes/list.c | 15 + src/include/foreign/fdwapi.h | 10 + src/include/nodes/execnodes.h | 6 + src/include/nodes/pg_list.h | 15 + 14 files changed, 822 insertions(+), 66 deletions(-) diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c index 3cf7b4eb1e..2d38ab25cb 100644 --- a/contrib/postgres_fdw/deparse.c +++ b/contrib/postgres_fdw/deparse.c @@ -1711,7 +1711,7 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte, Index rtindex, Relation rel, List *targetAttrs, bool doNothing, List *withCheckOptionList, List *returningList, - List **retrieved_attrs) + List **retrieved_attrs, int *values_end_len) { AttrNumber pindex; bool first; @@ -1754,6 +1754,7 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte, } else appendStringInfoString(buf, " DEFAULT VALUES"); + *values_end_len = buf->len; if (doNothing) appendStringInfoString(buf, " ON CONFLICT DO NOTHING"); @@ -1763,6 +1764,46 @@ deparseInsertSql(StringInfo buf, RangeTblEntry *rte, withCheckOptionList, returningList, retrieved_attrs); } +/* + * rebuild remote INSERT statement + * + */ +void 
+rebuildInsertSql(StringInfo buf, char *orig_query, + int values_end_len, int num_cols, + int num_rows) +{ + int i, j; + int pindex; + bool first; + + /* Copy up to the end of the first record from the original query */ + appendBinaryStringInfo(buf, orig_query, values_end_len); + + /* Add records to VALUES clause */ + pindex = num_cols + 1; + for (i = 0; i < num_rows; i++) + { + appendStringInfoString(buf, ", ("); + + first = true; + for (j = 0; j < num_cols; j++) + { + if (!first) +appendStringInfoString(buf, ", "); + first = false; + + appendStringInfo(buf, "$%d", pindex); + pindex++; + } + + appendStringInfoChar(buf, ')'); + } + + /* Copy stuff after VALUES clause from the original query */ + appendStringInfoString(buf, orig_query + values_end_len); +} + /* * deparse remote UPDATE statement * diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index 1cad311436..8c0fdb5a9a 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -8923,7 +8
Re: [PATCH] ProcessInterrupts_hook
On Mon, Jan 18, 2021 at 11:56 AM Tom Lane wrote: > > I've wanted this in the past, too, so +1 from me. > > I dunno, this seems pretty scary and easily abusable. There's not all > that much that can be done safely in ProcessInterrupts(), and we should > not be encouraging extensions to think they can add random processing > there. We've had this disagreement before about other things, and I just don't agree. If somebody uses a hook for something wildly unsafe, that will break their stuff, not ours. That's not to say I endorse adding hooks for random purposes in random places. In particular, if it's impossible to use a particular hook in a reasonably safe way, that's a sign that the hook is badly-designed and that we should not have it. But, that's not the case here. I care more about smart extension authors being able to do useful things than I do about the possibility that dumb extension authors will do stupid things. We can't really prevent the latter anyway: this is open source. > We're about halfway there already, see 7e784d1dc. I didn't do the > other half because it wasn't necessary to the problem, but exposing > the shutdown state more fully seems reasonable. Ah, I hadn't realized. -- Robert Haas EDB: http://www.enterprisedb.com
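As a concrete illustration of the usage pattern being debated, here is a minimal sketch of how an extension might install the proposed hook. It assumes the hook type takes no arguments and that a standard_ProcessInterrupts() fallback exists, per the proposal; the extension-specific flag and handler are hypothetical.

```c
/* Sketch only -- assumes the hook API from the proposed patch. */
#include "postgres.h"

#include <signal.h>

#include "fmgr.h"
#include "miscadmin.h"

PG_MODULE_MAGIC;

/* hypothetical extension-specific interrupt flag */
static volatile sig_atomic_t my_worker_reload_pending = false;

static ProcessInterrupts_hook_type prev_ProcessInterrupts_hook = NULL;

static void
my_ProcessInterrupts(void)
{
	/* react to our own flag first, keeping the work minimal */
	if (my_worker_reload_pending)
	{
		my_worker_reload_pending = false;
		/* e.g. set a latch or flag state; defer heavy work to the main loop */
	}

	/* then fall through to the normal interrupt handling */
	if (prev_ProcessInterrupts_hook)
		prev_ProcessInterrupts_hook();
	else
		standard_ProcessInterrupts();
}

void
_PG_init(void)
{
	prev_ProcessInterrupts_hook = ProcessInterrupts_hook;
	ProcessInterrupts_hook = my_ProcessInterrupts;
}
```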
Re: simplifying foreign key/RI checks
po 18. 1. 2021 v 13:40 odesílatel Amit Langote napsal: > While discussing the topic of foreign key performance off-list with > Robert and Corey (also came up briefly on the list recently [1], [2]), > a few ideas were thrown around to simplify our current system of RI > checks to enforce foreign keys with the aim of reducing some of its > overheads. The two main aspects of how we do these checks that > seemingly cause the most overhead are: > > * Using row-level triggers that are fired during the modification of > the referencing and the referenced relations to perform them > > * Using plain SQL queries issued over SPI > > There is a discussion nearby titled "More efficient RI checks - take > 2" [2] to address this problem from the viewpoint that it is using > row-level triggers that causes the most overhead, although there are > some posts mentioning that SQL-over-SPI is not without blame here. I > decided to focus on the latter aspect and tried reimplementing some > checks such that SPI can be skipped altogether. > > I started with the check that's performed when inserting into or > updating the referencing table to confirm that the new row points to a > valid row in the referenced relation. The corresponding SQL is this: > > SELECT 1 FROM pk_rel x WHERE x.pkey = $1 FOR KEY SHARE OF x > > $1 is the value of the foreign key of the new row. If the query > returns a row, all good. Thanks to SPI, or its use of plan caching, > the query is re-planned only a handful of times before making a > generic plan that is then saved and reused, which looks like this: > > QUERY PLAN > -- > LockRows >-> Index Scan using pk_pkey on pk x > Index Cond: (a = $1) > (3 rows) > > > What is performance when the referenced table is small? - a lot of codebooks are small between 1000 to 10K rows.
Re: [Patch] ALTER SYSTEM READ ONLY
On Thu, Jan 14, 2021 at 6:29 AM Amul Sul wrote: > To move development, testing, and the review forward, I have commented out the > code acquiring CheckpointLock from CreateCheckPoint() in the 0003 patch and > added the changes for the checkpointer so that system read-write state change > request can be processed as soon as possible, as suggested by Robert[1]. > > I have started a new thread[2] to understand the need for the CheckpointLock > in > CreateCheckPoint() function. Until then we can continue work on this feature > by > skipping CheckpointLock in CreateCheckPoint(), and therefore the 0003 patch is > marked WIP. Based on the favorable review comment from Andres upthread and also your feedback, I committed 0001. -- Robert Haas EDB: http://www.enterprisedb.com
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi, +* Custom scan nodes can be leaf nodes or inner nodes and therfore need special treatment. The special treatment applies to inner nodes. The above should be better phrased to clarify. Cheers On Mon, Jan 18, 2021 at 2:43 AM David Geier wrote: > Hi hackers, > > While working with cursors that reference plans with CustomScanStates > nodes, I encountered a segfault which originates from > search_plan_tree(). The query plan is the result of a simple SELECT > statement into which I inject a Custom Scan node at the root to do some > post-processing before returning rows. This plan is referenced by a > second plan with a Tid Scan which originates from a query of the form > DELETE FROM foo WHERE CURRENT OF my_cursor; > > search_plan_tree() assumes that > CustomScanState::ScanState::ss_currentRelation is never NULL. In my > understanding that only holds for CustomScanState nodes which are at the > bottom of the plan and actually read from a relation. CustomScanState > nodes which are not at the bottom don't have ss_currentRelation set. I > believe for such nodes, instead search_plan_tree() should recurse into > CustomScanState::custom_ps. > > I attached a patch. Any thoughts? > > Best regards, > David > Swarm64 > >
Re: Online checksums patch - once again
Hi, On Fri, Jan 15, 2021 at 11:32:56AM +0100, Daniel Gustafsson wrote: > The attached v30 adds the proposed optimizations in this thread as previously > asked for, as well as some small cleanups to the procsignal calling codepath > (replacing single call functions with just calling the function) and some > function comments which were missing. 0002 (now 0001 I guess) needs a rebase due to a3ed4d1e. Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.ba...@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer Unser Umgang mit personenbezogenen Daten unterliegt folgenden Bestimmungen: https://www.credativ.de/datenschutz
Re: proposal: schema variables
po 18. 1. 2021 v 15:24 odesílatel Erik Rijkers napsal: > On 2021-01-18 10:59, Pavel Stehule wrote: > >> > > and here is the patch > > [schema-variables-20200118.patch.gz ] > > > One small thing: > > The drop variable synopsis is: > > DROP VARIABLE [ IF EXISTS ] name [, ...] [ CASCADE | RESTRICT ] > > In that text following it, 'RESTRICT' is not documented. When used it > does not give an error but I don't see how it 'works'. > should be fixed now. Thank you for check Regards Pavel > > Erik > > > > schema-variables-20200118-2.patch.gz Description: application/gzip
Re: Add primary keys to system catalogs
I wrote: > ... I still like the idea of marking OID relationships in the > catalog headers though. Perhaps we should take Joel's suggestion > of a new system catalog more seriously, and have genbki.pl populate > such a catalog from info in the catalog header files. On second thought, a catalog is overkill; it'd only be useful if the data could change after initdb, which this data surely cannot. The right way to expose such info to SQL is with a set-returning function reading a constant table in the C code, a la pg_get_keywords(). That way takes a whole lot less infrastructure and is just as useful to SQL code. [ wanders away wondering about a debug mode in which we actually validate OIDs inserted into such columns... ] regards, tom lane
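For reference, a minimal sketch of that pattern; the function name, the struct, and the constant table's contents are hypothetical, invented only to illustrate the pg_get_keywords()-style approach of a set-returning function reading a constant table in C:

```c
/* Sketch only -- hypothetical function and data, not a real catalog API. */
#include "postgres.h"

#include "access/htup_details.h"
#include "fmgr.h"
#include "funcapi.h"
#include "utils/builtins.h"

typedef struct OidRef
{
	const char *catalog;		/* referencing catalog */
	const char *column;			/* OID column in that catalog */
	const char *referenced;		/* catalog the OID points into */
} OidRef;

/* constant table, which genbki.pl could emit from the catalog headers */
static const OidRef oid_refs[] = {
	{"pg_class", "relnamespace", "pg_namespace"},
	{"pg_class", "relam", "pg_am"},
	{"pg_attribute", "atttypid", "pg_type"},
};

PG_FUNCTION_INFO_V1(pg_get_catalog_oid_refs);

Datum
pg_get_catalog_oid_refs(PG_FUNCTION_ARGS)
{
	FuncCallContext *funcctx;

	if (SRF_IS_FIRSTCALL())
	{
		MemoryContext oldcontext;
		TupleDesc	tupdesc;

		funcctx = SRF_FIRSTCALL_INIT();
		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
		if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
			elog(ERROR, "return type must be a row type");
		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
		MemoryContextSwitchTo(oldcontext);
	}

	funcctx = SRF_PERCALL_SETUP();

	if (funcctx->call_cntr < lengthof(oid_refs))
	{
		const OidRef *ref = &oid_refs[funcctx->call_cntr];
		Datum		values[3];
		bool		nulls[3] = {false, false, false};
		HeapTuple	tuple;

		values[0] = CStringGetTextDatum(ref->catalog);
		values[1] = CStringGetTextDatum(ref->column);
		values[2] = CStringGetTextDatum(ref->referenced);

		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
	}

	SRF_RETURN_DONE(funcctx);
}
```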
Re: CLUSTER on partitioned index
On Sat, Nov 28, 2020 at 08:03:02PM -0600, Justin Pryzby wrote: > On Sun, Nov 15, 2020 at 07:53:35PM -0600, Justin Pryzby wrote: > > On Wed, Nov 04, 2020 at 08:23:56PM -0600, Justin Pryzby wrote: > > > On Tue, Oct 27, 2020 at 07:33:12PM -0500, Justin Pryzby wrote: > > > > I'm attaching a counter-proposal to your catalog change, which preserves > > > > indisclustered on children of clustered, partitioned indexes, and > > > > invalidates > > > > indisclustered when attaching unclustered indexes. > > > > > > ..and now propagates CLUSTER ON to child indexes. > > > > > > I left this as separate patches to show what I mean and what's new while > > > we > > > discuss it. > > > > This fixes some omissions in the previous patch and error in its test cases. > > > > CLUSTER ON recurses to children, since I think a clustered parent index > > means > > that all its child indexes are clustered. "SET WITHOUT CLUSTER" doesn't > > have > > to recurse to children, but I did it like that for consistency and it avoids > > the need to special case InvalidOid. > > The previous patch failed pg_upgrade when restoring a clustered, parent index, > since it's marked INVALID until indexes have been built on all child tables, > so > CLUSTER ON was rejected on invalid index. > > So I think CLUSTER ON needs to be a separate pg_dump object, to allow > attaching > the child index (thereby making the parent "valid") to happen before SET > CLUSTER on the parent index. Rebased on b5913f612 and now a3dc92600. This patch is intertwined with the tablespace patch: not only will it get rebase conflict, but will also need to test the functionality of CLUSTER (TABLESPACE a) partitioned_table; -- Justin >From cb817c479d2a2d4ae94cfa8d215e369b5d206738 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 26 Nov 2020 14:37:08 -0600 Subject: [PATCH v6 1/7] pg_dump: make CLUSTER ON a separate dump object.. ..since it needs to be restored after any child indexes are restored *and attached*. 
The order needs to be: 1) restore child and parent index (order doesn't matter); 2) attach child index; 3) set cluster on child and parent index (order doesn't matter); --- src/bin/pg_dump/pg_dump.c | 86 ++ src/bin/pg_dump/pg_dump.h | 8 src/bin/pg_dump/pg_dump_sort.c | 8 3 files changed, 82 insertions(+), 20 deletions(-) diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c index 798d14580e..e6526392e5 100644 --- a/src/bin/pg_dump/pg_dump.c +++ b/src/bin/pg_dump/pg_dump.c @@ -208,6 +208,7 @@ static void dumpSequence(Archive *fout, TableInfo *tbinfo); static void dumpSequenceData(Archive *fout, TableDataInfo *tdinfo); static void dumpIndex(Archive *fout, IndxInfo *indxinfo); static void dumpIndexAttach(Archive *fout, IndexAttachInfo *attachinfo); +static void dumpIndexClusterOn(Archive *fout, IndexClusterInfo *clusterinfo); static void dumpStatisticsExt(Archive *fout, StatsExtInfo *statsextinfo); static void dumpConstraint(Archive *fout, ConstraintInfo *coninfo); static void dumpTableConstraintComment(Archive *fout, ConstraintInfo *coninfo); @@ -7092,6 +7093,11 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables) i_inddependcollversions; int ntups; + int ncluster = 0; + IndexClusterInfo *clusterinfo; + clusterinfo = (IndexClusterInfo *) + pg_malloc0(numTables * sizeof(IndexClusterInfo)); + for (i = 0; i < numTables; i++) { TableInfo *tbinfo = &tblinfo[i]; @@ -7471,6 +7477,27 @@ getIndexes(Archive *fout, TableInfo tblinfo[], int numTables) /* Plain secondary index */ indxinfo[j].indexconstraint = 0; } + + /* Record each table's CLUSTERed index, if any */ + if (indxinfo[j].indisclustered) + { +IndxInfo *index = &indxinfo[j]; +IndexClusterInfo *cluster = &clusterinfo[ncluster]; + +cluster->dobj.objType = DO_INDEX_CLUSTER_ON; +cluster->dobj.catId.tableoid = 0; +cluster->dobj.catId.oid = 0; +AssignDumpId(&cluster->dobj); +cluster->dobj.name = pg_strdup(index->dobj.name); +cluster->dobj.namespace = index->indextable->dobj.namespace; +cluster->index = index; +cluster->indextable = &tblinfo[i]; + +/* The CLUSTER ON depends on its index.. */ +addObjectDependency(&cluster->dobj, index->dobj.dumpId); + +ncluster++; + } } PQclear(res); @@ -10296,6 +10323,9 @@ dumpDumpableObject(Archive *fout, DumpableObject *dobj) case DO_SUBSCRIPTION: dumpSubscription(fout, (SubscriptionInfo *) dobj); break; + case DO_INDEX_CLUSTER_ON: + dumpIndexClusterOn(fout, (IndexClusterInfo *) dobj); + break; case DO_PRE_DATA_BOUNDARY: case DO_POST_DATA_BOUNDARY: /* never dumped, nothing to do */ @@ -16516,6 +16546,41 @@ getAttrName(int attrnum, TableInfo *tblInfo) return NULL;/* keep compiler quiet */ } +/* + * dumpIndexClusterOn + * record that the index is clustered. + */ +static void +dumpIndexClusterOn(Archive *fout, IndexClusterInfo *clusterinfo) +{ + Du
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
David Geier writes: > search_plan_tree() assumes that > CustomScanState::ScanState::ss_currentRelation is never NULL. In my > understanding that only holds for CustomScanState nodes which are at the > bottom of the plan and actually read from a relation. CustomScanState > nodes which are not at the bottom don't have ss_currentRelation set. I > believe for such nodes, instead search_plan_tree() should recurse into > CustomScanState::custom_ps. Hm. I agree that we shouldn't simply assume that ss_currentRelation isn't null. However, we cannot make search_plan_tree() descend through non-leaf CustomScan nodes, because we don't know what processing is involved there. We need to find a scan that is guaranteed to return rows that are one-to-one with the cursor output. This is why the function doesn't descend through join or aggregation nodes, and I see no argument by which we should assume we know more about what a customscan node will do than we know about those. So I'm inclined to think a suitable fix is just - if (RelationGetRelid(sstate->ss_currentRelation) == table_oid) + if (sstate->ss_currentRelation && + RelationGetRelid(sstate->ss_currentRelation) == table_oid) result = sstate; regards, tom lane
Re: WIP: System Versioned Temporal Table
On Mon, Jan 18, 2021 at 1:43 AM Vik Fearing wrote: > This is not good, and I see that DROP SYSTEM VERSIONING also removes > these columns which is even worse. Please read the standard that you > are trying to implement! The standard describes the function of ALTER TABLE ADD SYSTEM VERSIONING as "Alter a regular persistent base table to a system-versioned table", and a system-versioned table is described in the standard by two generated stored constraint columns and is implemented as such. > I will do a more thorough review of the functionalities in this patch > (not necessarily the code) this week. Please do. Regards Surafel
Re: Key management with tests
On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote: > Personally, but I admit that there's legitimate reasons to differ on > that note, I don't think it's reasonable for a feature this invasive to > commit preliminary patches without the major subsequent patches being in > a shape that allows reviewing the whole picture. OK, if that is a requirement, I can't help anymore since there are already complaints that the patch is too large to review, even if broken into pieces. Please let me know what the community decides. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Key management with tests
> > I have to admit I was kind of baffled that the wiki page wasn't > > sufficient, because it is one of the longest Postgres feature > > explanations I have seen, but I now think the missing part is tying > > the wiki contents to the code implementation. If that is it, please > > confirm. If it is something else, also explain. > > I don't think the wiki right now covers what's needed. The "Overview", > "Threat model" and "Scope of TDE" are a start, but beyond that it's > missing a bunch of things. And it's not in the source tree (we'll soon > have multiple versions of postgres with increasing levels of TDE > features, the wiki doesn't help with that) > Thanks, the versioning issue makes sense as a reason for the design document to be part of the source tree. As I was reading the README for the patch Amit referenced, and as I go through this patch, I feel the desire to incorporate diagrams. Are design diagrams ever incorporated in the source tree as a part of the design description of a feature? If not, any concerns about doing that? I think that is likely where I can contribute the most. > Missing: > - talks about cluster wide encyrption being simpler, without mentioning > what it's being compared to, and what makes it simpler > - no differentiation from file system / block level encryption > - there's no explanation of which/why specific crypto primitives were > chosen, what the tradeoffs are > - no explanation which keys exists, stored where > - the key management patch introduces new files, not documented > - there's new types of lock files, possibility of interrupted > operations, ... - no documentation of what that means > - there's no documentation what "key wrapping" actually precisely is, > what the danger of the two-tier model is, ... > - are there dangers in not encrypting zero pages etc? > - ... > Some of the missing things you mention above are about the design of the TDE feature in general. However, this patch is about Key Management, which is going to be part of the larger TDE feature. So it feels as though there is a need for a general design document about the overall vision / approach for TDE and a specific design doc. for Key Management. Is it appropriate to include both of those in the same patch? Something along the lines of: here is the overall design of TDE, and here is how the Key Management portion is designed and implemented. I guess in that case, follow-on patches for TDE could refer to the overall design described in this patch. > > > Personally, but I admit that there's legitimate reasons to differ on > that note, I don't think it's reasonable for a feature this invasive to > commit preliminary patches without the major subsequent patches being in > a shape that allows reviewing the whole picture. > > Greetings, > > Andres Freund -- Thomas John Kincaid
Re: Key management with tests
On 2021-01-18 13:58:20 -0500, Bruce Momjian wrote: > On Mon, Jan 18, 2021 at 09:42:54AM -0800, Andres Freund wrote: > > Personally, but I admit that there's legitimate reasons to differ on > > that note, I don't think it's reasonable for a feature this invasive to > > commit preliminary patches without the major subsequent patches being in > > a shape that allows reviewing the whole picture. > > OK, if that is a requirement, I can't help anymore since there are > already complaints that the patch is too large to review, even if broken > into pieces. Please let me know what the community decides. Those aren't conflicting demands. Having later patches around to validate the design of earlier patches doesn't necessitate that the later patches need to be reviewed at the same time.
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Hi, It seems sstate->ss_currentRelation being null can only occur for T_ForeignScanState and T_CustomScanState. What about the following change ? Cheers diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c index 0852bb9cec..56e31951d1 100644 --- a/src/backend/executor/execCurrent.c +++ b/src/backend/executor/execCurrent.c @@ -325,12 +325,21 @@ search_plan_tree(PlanState *node, Oid table_oid, case T_IndexOnlyScanState: case T_BitmapHeapScanState: case T_TidScanState: +{ +ScanState *sstate = (ScanState *) node; + +if (RelationGetRelid(sstate->ss_currentRelation) == table_oid) +result = sstate; +break; +} + case T_ForeignScanState: case T_CustomScanState: { ScanState *sstate = (ScanState *) node; -if (RelationGetRelid(sstate->ss_currentRelation) == table_oid) +if (sstate->ss_currentRelation && +RelationGetRelid(sstate->ss_currentRelation) == table_oid) result = sstate; break; } On Mon, Jan 18, 2021 at 10:46 AM Tom Lane wrote: > David Geier writes: > > search_plan_tree() assumes that > > CustomScanState::ScanState::ss_currentRelation is never NULL. In my > > understanding that only holds for CustomScanState nodes which are at the > > bottom of the plan and actually read from a relation. CustomScanState > > nodes which are not at the bottom don't have ss_currentRelation set. I > > believe for such nodes, instead search_plan_tree() should recurse into > > CustomScanState::custom_ps. > > Hm. I agree that we shouldn't simply assume that ss_currentRelation > isn't null. However, we cannot make search_plan_tree() descend > through non-leaf CustomScan nodes, because we don't know what processing > is involved there. We need to find a scan that is guaranteed to return > rows that are one-to-one with the cursor output. This is why the function > doesn't descend through join or aggregation nodes, and I see no argument > by which we should assume we know more about what a customscan node will > do than we know about those. > > So I'm inclined to think a suitable fix is just > > - if (RelationGetRelid(sstate->ss_currentRelation) == > table_oid) > + if (sstate->ss_currentRelation && > + RelationGetRelid(sstate->ss_currentRelation) == > table_oid) > result = sstate; > > regards, tom lane > > >
Re: CheckpointLock needed in CreateCheckPoint()?
On Thu, Jan 14, 2021 at 10:21 AM Robert Haas wrote: > Yeah, I think this lock is useless. In fact, I think it's harmful, > because it makes large sections of the checkpointer code, basically > all of it really, run with interrupts held off for no reason. > > It's not impossible that removing the lock could reveal bugs > elsewhere: suppose we have segments of code that actually aren't safe > to interrupt, but because of this LWLock, it never happens. But, if > that happens to be true, I think we should just view those cases as > bugs to be fixed. > > One thing that blunts the impact of this quite a bit is that the > checkpointer doesn't use rely much on CHECK_FOR_INTERRUPTS() in the > first place. It has its own machinery: HandleCheckpointerInterrupts(). > Possibly that's something we should be looking to refactor somehow as > well. Here's a patch to remove CheckpointLock completely. For ProcessInterrupts() to do anything, one of the following things would have to be true: 1. ProcDiePending = true. Normally this is set by die(), but the checkpointer does not use die(). Perhaps someday it should, or maybe not, but that's an issue for another day. 2. ClientConnectionLost = true. This gets set in pqcomm.c's internal_flush(), but the checkpointer has no client connection. 3. DoingCommandRead = true. Can't happen; only set in PostgresMain(). 4. QueryCancelPending = true. Can be set by recovery conflicts, but those shouldn't happen in the checkpointer. Normally set by StatementCancelHandler, which the checkpointer doesn't use. 5. IdleInTransactionSessionTimeoutPending = true. Set up by InitPostgres(), which is not used by auxiliary processes. 6. ProcSignalBarrierPending = true. This is the case we actually care about allowing for the ASRO patch, so if this occurs, it's good. 7. ParallelMessagePending = true. The checkpointer definitely never starts parallel workers. So I don't see any problem with just ripping this out entirely, but I'd like to know if anybody else does. -- Robert Haas EDB: http://www.enterprisedb.com [Attachment: v1-0001-Remove-CheckpointLock.patch]
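For readers without the attachment handy, the essence of such a change is presumably just deleting the acquire/release pair that brackets the body of CreateCheckPoint() in xlog.c, plus dropping the lock's definition; a rough sketch (not the attached patch itself):

 void
 CreateCheckPoint(int flags)
 {
 	/* ... */
-	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
 	/* ... determine REDO point, flush dirty buffers, write the
 	 * checkpoint record, update the control file ... */
-	LWLockRelease(CheckpointLock);
 	/* ... */
 }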
Re: Deleting older versions in unique indexes to avoid page splits
On Sun, Jan 17, 2021 at 10:44 PM Amit Kapila wrote: > With the above example, it seems like it would also help when this is not > true. I'll respond to your remarks here separately, in a later email. > I am not sure what I proposed fits here but the bottom-up sounds like > we are starting from the leaf level and move upwards to root level > which I think is not true here. I guess that that's understandable, because it is true that B-Trees are maintained in a bottom-up fashion. However, it's also true that you can have top-down and bottom-up approaches in query optimizers, and many other things (it's even something that is used to describe governance models). The whole point of the term "bottom-up" is to suggest that bottom-up deletion complements top-down cleanup by VACUUM. I think that this design embodies certain principles that can be generalized to other areas, such as heap pruning. I recall that Robert Haas hated the name deduplication. I'm not about to argue that my choice of "deduplication" was objectively better than whatever he would have preferred. Same thing applies here - I more or less chose a novel name because the concept is itself novel (unlike deduplication). Reasonable people can disagree about what exact name might have been better, but it's not like we're going to arrive at something that everybody can be happy with. And whatever we arrive at probably won't matter that much - the vast majority of users will never need to know what either thing is. They may be important things, but that doesn't mean that many people will ever think about them (in fact I specifically hope that they don't ever need to think about them). > How is that working? Does heapam.c can someway indicate additional > tuples (extra to what has been marked/sent by IndexAM as promising) as > deletable? I see in heap_index_delete_tuples that we mark the status > of the passed tuples as delectable (by setting knowndeletable flag for > them). The bottom-up nbtree caller to heap_index_delete_tuples()/table_index_delete_tuple() (not to be confused with the simple/LP_DEAD heap_index_delete_tuples() caller) always provides heapam.c with a complete picture of the index page, in the sense that it exhaustively has a delstate.deltids entry for each and every TID on the page, no matter what. This is the case even though in practice there is usually no possible way to check even 20% of the deltids entries within heapam.c. In general, the goal during a bottom-up pass is *not* to maximize expected utility (i.e. the number of deleted index tuples/space freed/whatever), exactly. The goal is to lower the variance across related calls, so that we'll typically manage to free a fair number of index tuples when we need to. In general it is much better for heapam.c to make its decisions based on 2 or 3 good reasons rather than just 1 excellent reason. And so heapam.c applies a power-of-two bucketing scheme, never truly giving too much weight to what nbtree tells it about duplicates. See comments above bottomup_nblocksfavorable(), and bottomup_sort_and_shrink() comments (both are from heapam.c). -- Peter Geoghegan
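As a toy illustration of the power-of-two bucketing mentioned above (a simplification, not the real bottomup_sort_and_shrink()/bottomup_nblocksfavorable() code): heap blocks are compared by the rough magnitude of their promising-tuple counts rather than by exact counts, so no single signal from nbtree can dominate the ordering.

/*
 * Toy sketch: round a block's count of "promising" index tuples down to a
 * power of two.  Sorting blocks by this bucketed value (with block number
 * as a tiebreaker) means small differences in the counts reported by the
 * index AM cannot dominate which heap blocks get visited.
 */
static int
promising_count_bucket(int npromising)
{
	int			bucket;

	if (npromising <= 0)
		return 0;

	bucket = 1;
	while (bucket * 2 <= npromising)
		bucket *= 2;

	return bucket;
}

In the real code the secondary criteria are more involved (physically contiguous blocks are favored, for instance), but the bucketing is what keeps the heuristic from giving too much weight to what nbtree says about duplicates.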
Re: [PATCH] Add support for leading/trailing bytea trim()ing
"Joel Jacobson" writes: > On Fri, Dec 4, 2020, at 22:02, Tom Lane wrote: >> (Maybe the existing ltrim/rtrim descrs are also like this, but if so >> I'd change them too.) > They weren't, but I think the description for the bytea functions > can be improved to have a more precise description > if we take inspiration from the the text functions. Yeah, I agree with making the bytea descriptions look like the text ones. Pushed with minor additional doc fixes. regards, tom lane
Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera wrote: > > So one last remaining improvement was to have VACUUM ignore processes > doing CIC and RC to compute the Xid horizon of tuples to remove. I > think we can do something simple like the attached patch. Regarding the patch: > +if (statusFlags & PROC_IN_SAFE_IC) > +h->catalog_oldest_nonremovable = > +TransactionIdOlder(h->catalog_oldest_nonremovable, xmin); > +else > +h->data_oldest_nonremovable = h->catalog_oldest_nonremovable > = > +TransactionIdOlder(h->data_oldest_nonremovable, xmin); Would this not need to be the following? Right now, it resets potentially older h->catalog_oldest_nonremovable (which is set in the PROC_IN_SAFE_IC branch). > +if (statusFlags & PROC_IN_SAFE_IC) > +h->catalog_oldest_nonremovable = > +TransactionIdOlder(h->catalog_oldest_nonremovable, xmin); > +else > +{ > +h->data_oldest_nonremovable = > +TransactionIdOlder(h->data_oldest_nonremovable, xmin); > +h->catalog_oldest_nonremovable = > +TransactionIdOlder(h->catalog_oldest_nonremovable, xmin); > +} Prior to reading the rest of my response: I do not understand the intricacies of the VACUUM process, nor the systems of db snapshots, so it's reasonably possible I misunderstand this. But would this patch not allow for tuples to be created, deleted and vacuumed away from a table whilst rebuilding an index on that table, resulting in invalid pointers in the index? Example: 1.) RI starts 2.) PHASE 2: filling the index: 2.1.) scanning the heap (live tuple is cached) < tuple is deleted < last transaction other than RI commits, only snapshot of RI exists < vacuum drops the tuple, and cannot remove it from the new index because this new index is not yet populated. 2.2.) sorting tuples 2.3.) index filled with tuples, incl. deleted tuple 3.) PHASE 3: wait for transactions 4.) PHASE 4: validate does not remove the tuple from the index, because it is not built to do so: it will only insert new tuples. Tuples that are marked for deletion are removed from the index only through VACUUM (and optimistic ALL_DEAD detection). According to my limited knowledge of RI, it requires VACUUM to not run on the table during the initial index build process (which is currently guaranteed through the use of a snapshot). Regards, Matthias van de Meent
Re: search_plan_tree(): handling of non-leaf CustomScanState nodes causes segfault
Zhihong Yu writes: > It seems sstate->ss_currentRelation being null can only > occur for T_ForeignScanState and T_CustomScanState. > What about the following change ? Seems like more code for no very good reason. regards, tom lane
Re: CheckpointLock needed in CreateCheckPoint()?
Robert Haas writes: > Here's a patch to remove CheckpointLock completely. ... > So I don't see any problem with just ripping this out entirely, but > I'd like to know if anybody else does. If memory serves, the reason for the lock was that the CHECKPOINT command used to run the checkpointing code directly in the calling backend, so we needed it to keep more than one process from doing that at once. AFAICS, it's no longer possible for more than one process to try to run that code concurrently, so we shouldn't need the lock anymore. regards, tom lane
Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On 2021-Jan-18, Matthias van de Meent wrote: > Example: > > 1.) RI starts > 2.) PHASE 2: filling the index: > 2.1.) scanning the heap (live tuple is cached) > < tuple is deleted > < last transaction other than RI commits, only snapshot of RI exists > < vacuum drops the tuple, and cannot remove it from the new index > because this new index is not yet populated. > 2.2.) sorting tuples > 2.3.) index filled with tuples, incl. deleted tuple > 3.) PHASE 3: wait for transactions > 4.) PHASE 4: validate does not remove the tuple from the index, > because it is not built to do so: it will only insert new tuples. > Tuples that are marked for deletion are removed from the index only > through VACUUM (and optimistic ALL_DEAD detection). > > According to my limited knowledge of RI, it requires VACUUM to not run > on the table during the initial index build process (which is > currently guaranteed through the use of a snapshot). VACUUM cannot run concurrently with CIC or RI in a table -- both acquire ShareUpdateExclusiveLock, which conflicts with itself, so this cannot occur. I do wonder if the problem you suggest (or something similar) can occur via HOT pruning, though. -- Álvaro Herrera Valdivia, Chile
Re: {CREATE INDEX, REINDEX} CONCURRENTLY improvements
On 2021-Jan-18, Matthias van de Meent wrote: > On Fri, 15 Jan 2021 at 15:29, Álvaro Herrera wrote: > Would this not need to be the following? Right now, it resets > potentially older h->catalog_oldest_nonremovable (which is set in the > PROC_IN_SAFE_IC branch). > > > +if (statusFlags & PROC_IN_SAFE_IC) > > +h->catalog_oldest_nonremovable = > > +TransactionIdOlder(h->catalog_oldest_nonremovable, > > xmin); > > +else > > +{ > > +h->data_oldest_nonremovable = > > +TransactionIdOlder(h->data_oldest_nonremovable, xmin); > > +h->catalog_oldest_nonremovable = > > +TransactionIdOlder(h->catalog_oldest_nonremovable, > > xmin); > > +} Oops, you're right. -- Álvaro Herrera Valdivia, Chile
Re: PG vs LLVM 12 on seawasp, next round
Hello Alvaro, The "no such file" error seems more like a machine local issue to me. I'll look into it when I have time, which may take some time. Hopefully over the holidays. This is still happening ... Any chance you can have a look at it? Indeed. I'll try to look (again) into it soon. I had a look but did not find anything obvious in the short time frame I had. The last two months were a bit overloaded for me, so I let quite a few things slip. If you want to disable the animal as Tom suggests, do as you want. -- Fabien.