RE: Schema variables - new implementation for Postgres 15
From: Pavel Stehule -- I am sorry, but on this topic we have different opinions. Variables in PL/pgSQL are not transactional either (the same is true in Perl, Python, ...). Session variables in Oracle, MS SQL, DB2, and MySQL are not transactional either. My primary focus is PL/pgSQL - I would like to use schema variables as global PL/pgSQL variables (from the PL/pgSQL perspective), which means session variables from the Postgres perspective. But in Postgres I have to write features that will work with the other PLs too - PL/Python, PL/Perl, ... The SET statement in the ANSI SQL standard (SQL/PSM) doesn't expect transactional behaviour for variables either. Unfortunately the SET keyword is used in Postgres for GUCs and cannot be reused without a compatibility break. The PostgreSQL configuration is transactional, but it is a different feature designed for different purposes. Using GUCs as session variables is just a workaround. It can be useful in some cases, sure, but it is not the usual behaviour, and in other cases the transactional behaviour is not practical. Schema variables are not a replacement for GUCs, and schema variables are not a replacement for temporary tables. There is a prepared patch for global temp tables, and I hope that patch can be committed to Postgres 15; global temp tables fix almost all the disadvantages of temporary tables in Postgres. So a schema variable is not a one-row table. It is a different creature - designed to support server-side procedural features. -- +1 I understand (and hope) this feature is intended to ease migration from Oracle PL/SQL, which will further increase the popularity of Postgres. So transactional behavior is not necessary unless Oracle has such a feature. Furthermore, Postgres already has some non-transactional SQL commands, so I don't think we need to reject a non-transactional LET. 
* Sequence operations: SELECT nextval/setval
* SET [SESSION]
* SET ROLE
* SET SESSION AUTHORIZATION

-- I have prepared a patch that allows non-default transactional behaviour (but this behaviour should not be the default - I didn't design schema variables as a temp table replacement). That patch increases the size of the current patch by about a quarter, and I have enough work rebasing the current patch, so I didn't send it to the commitfest now. If schema variables get into core, that same day I'll send the patch that allows transactional behaviour for schema variables - I promise. -- I prefer the simpler, targeted version without transactional behavior initially, because the added complexity might prevent this feature from being committed in PG 15. Regards Takayuki Tsunakawa
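A toy model (my own sketch, not anything from the patch) of the distinction being drawn above -- transactional GUC-style SET versus the proposed non-transactional LET:

```python
# Toy model only -- not Postgres code. It contrasts the two behaviors
# discussed above: a GUC-style SET is undone by ROLLBACK, while the
# proposed LET for schema variables is not.
class Session:
    def __init__(self):
        self.guc = {}        # transactional settings (SET)
        self.var = {}        # session/schema variables (LET)
        self._snapshot = None

    def begin(self):
        # only the transactional state is snapshotted
        self._snapshot = dict(self.guc)

    def set_guc(self, name, value):
        self.guc[name] = value

    def let(self, name, value):
        self.var[name] = value

    def rollback(self):
        self.guc = self._snapshot


s = Session()
s.begin()
s.set_guc("work_mem", "1GB")
s.let("my_counter", 42)
s.rollback()

print(s.guc.get("work_mem"))    # None: SET was rolled back
print(s.var.get("my_counter"))  # 42: LET survives the rollback
```

The point of the thread is exactly the second print: whether surviving a rollback is the right default for schema variables.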
Re: doc review for v14
A bunch more found with things like this. find src -name '*.c' |xargs grep '^[[:space:]]*/\?\*' |grep -woE '[[:lower:]]{3,8}' |sed 's/.*/\L&/' | sort |uniq -c |sort -nr |awk '$1==1' |less >From 9fd5e2797efc14f83bfb2d47d5174de81a34ac6e Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 4 Mar 2021 19:32:05 -0600 Subject: [PATCH 01/27] doc review: Add support for PROVE_TESTS and PROVE_FLAGS in MSVC scripts 5bca69a76b3046a85c60c48271c1831fd5affa51 --- doc/src/sgml/install-windows.sgml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml index 64687b12e6..cb6bb05dc5 100644 --- a/doc/src/sgml/install-windows.sgml +++ b/doc/src/sgml/install-windows.sgml @@ -499,8 +499,8 @@ $ENV{PERL5LIB}=$ENV{PERL5LIB} . ';c:\IPC-Run-0.94\lib'; The TAP tests run with vcregress support the - environment variables PROVE_TESTS, that is expanded - automatically using the name patterns given, and + environment variables PROVE_TESTS, which is + expanded as a glob pattern, and PROVE_FLAGS. These can be set on a Windows terminal, before running vcregress: -- 2.17.0 >From 1392ef571f6afb62f5cd79db691a9c2d2255dde0 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 23 Mar 2021 13:39:36 -0500 Subject: [PATCH 02/27] doc review: Pass all scan keys to BRIN consistent function at once commit a1c649d889bdf6e74e9382e1e28574d7071568de --- doc/src/sgml/brin.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index d2f12bb605..d2476481af 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -833,7 +833,7 @@ typedef struct BrinOpcInfo Returns whether all the ScanKey entries are consistent with the given indexed values for a range. The attribute number to use is passed as part of the scan key. 
- Multiple scan keys for the same attribute may be passed at once, the + Multiple scan keys for the same attribute may be passed at once; the number of entries is determined by the nkeys parameter. -- 2.17.0 >From 1323e85320a0e5414b490ef7f0082f3cca60864c Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 26 Mar 2021 23:14:57 -0500 Subject: [PATCH 03/27] doc review: BRIN minmax-multi indexes ab596105b55f1d7fbd5a66b66f65227d210b047d --- doc/src/sgml/brin.sgml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index d2476481af..ce7c210575 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -730,7 +730,7 @@ LOG: request for BRIN range summarization for index "brin_wi_idx" page 128 was for . When set to a positive value, each block range is assumed to contain this number of distinct non-null values. When set to a negative value, which must be greater than or - equal to -1, the number of distinct non-null is assumed linear with + equal to -1, the number of distinct non-null values is assumed to grow linearly with the maximum possible number of tuples in the block range (about 290 rows per block). The default value is -0.1, and the minimum number of distinct non-null values is 16. @@ -1214,7 +1214,7 @@ typedef struct BrinOpcInfo The minmax-multi operator class is also intended for data types implementing - a totally ordered sets, and may be seen as a simple extension of the minmax + a totally ordered set, and may be seen as a simple extension of the minmax operator class. While minmax operator class summarizes values from each block range into a single contiguous interval, minmax-multi allows summarization into multiple smaller intervals to improve handling of outlier values. 
-- 2.17.0 >From dbf9f5cd9e2c0d7ee471c29b241ae8d5bc565fd6 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 18 Feb 2021 10:13:15 -0600 Subject: [PATCH 04/27] doc review: VACUUM (PROCESS_TOAST) 7cb3048f38e26b39dd5fd412ed8a4981b6809b35 Michael had some comments here, but it seems to exactly match my existing language. https://www.postgresql.org/message-id/yddar29pdixpg...@paquier.xyz --- doc/src/sgml/ref/vacuum.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml index 6a0028a514..949ca23797 100644 --- a/doc/src/sgml/ref/vacuum.sgml +++ b/doc/src/sgml/ref/vacuum.sgml @@ -219,7 +219,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ connection_name ] DEC Description - DECLARE STATEMENT declares SQL statement identifier. + DECLARE STATEMENT declares a SQL statement identifier. SQL statement identifier can be associated with the connection. - When the identifier is used by dynamic SQL statements, these SQLs are executed - by using the associated connection. + When the identifier is used by dynamic SQL statements, the statements are execu
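The shell pipeline at the top of this message can be mimicked in a few lines of Python, which makes the idea easier to see: collect lowercase words from comment lines and report the ones that occur exactly once as typo candidates (the sample comment lines here are hypothetical):

```python
import re
from collections import Counter

# Hypothetical sample input: two comment lines, one containing the typo "brwn".
source_lines = [
    "/* the quick brown fox */",
    " * the quick brwn fox",
    "int x;  /* not a comment-only line, so it is skipped */",
]

comment_re = re.compile(r"^\s*/?\*")     # lines starting with /* or *
word_re = re.compile(r"\b[a-z]{3,8}\b")  # like grep -woE '[[:lower:]]{3,8}'

counts = Counter(
    word
    for line in source_lines
    if comment_re.match(line)
    for word in word_re.findall(line)
)
singletons = sorted(w for w, n in counts.items() if n == 1)
print(singletons)  # words appearing once -- worth eyeballing for typos
```

Words that appear only once across all comments are disproportionately likely to be misspellings, which is the heuristic the original `awk '$1==1'` encodes.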
Re: Truncate in synchronous logical replication failed
On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: >> >> >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila wrote: >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila >> > wrote: >> >> >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek >> >> wrote: >> >> > >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila wrote: >> >> > > >> >> > > The problem happens only when we try to fetch IDENTITY_KEY attributes >> >> > > because pgoutput uses RelationGetIndexAttrBitmap() to get that >> >> > > information which locks the required indexes. Now, because TRUNCATE >> >> > > has already acquired an exclusive lock on the index, it seems to >> >> > > create a sort of deadlock where the actual Truncate operation waits >> >> > > for logical replication of operation to complete and logical >> >> > > replication waits for actual Truncate operation to finish. >> >> > > >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock the main >> >> > > relation, we just scan the system table and build that information >> >> > > using a historic snapshot. Can't we do something similar here? >> >> > > >> >> > > Adding Petr J. and Peter E. to know their views as this seems to be an >> >> > > old problem (since the decoding of Truncate operation is introduced). >> >> > >> >> > We used RelationGetIndexAttrBitmap because it already existed, no other >> >> > reason. >> >> > >> >> >> >> Fair enough. But I think we should do something about it because using >> >> the same (RelationGetIndexAttrBitmap) just breaks the synchronous >> >> logical replication. I think this is broken since the logical >> >> replication of Truncate is supported. >> >> >> >> > I am not sure what exact locking we need but I would have guessed at >> >> > least AccessShareLock would be needed. >> >> > >> >> >> >> Are you telling that we need AccessShareLock on the index? 
If so, why >> >> is it different from how we access the relation during decoding >> >> (basically in ReorderBufferProcessTXN, we directly use >> >> RelationIdGetRelation() without any lock on the relation)? I think we >> >> do it that way because we need it to process WAL entries and we need >> >> the relation state based on the historic snapshot, so, even if the >> >> relation is later changed/dropped, we are fine with the old state we >> >> got. Isn't the same thing applies here in logicalrep_write_attrs? If >> >> that is true then some equivalent of RelationGetIndexAttrBitmap where >> >> we use RelationIdGetRelation instead of index_open should work? >> >> >> > >> > Today, again I have thought about this and don't see a problem with >> > the above idea. If the above understanding is correct, then I think >> > for our purpose in pgoutput, we just need to call RelationGetIndexList >> > and then build the idattr list for relation->rd_replidindex. >> >> Sorry, I don't know how can we build the idattr without open the index. >> If we should open the index, then we should use NoLock, since the TRUNCATE >> side hold AccessExclusiveLock. As Osumi points out in [1], The NoLock mode >> assumes that the appropriate lock on the index is already taken. >> > > Why can't we use RelationIdGetRelation() by passing the required > indexOid to it? Hi Amit, as your suggested, I try to use RelationIdGetRelation() replace the index_open() to avoid specify the AccessSharedLock, then the TRUNCATE will not be blocked. Attached patch fixed it. Thoughts? -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 29702d6eab..0ad59ef189 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -5060,7 +5060,7 @@ restart: bool isPK; /* primary key */ bool isIDKey; /* replica identity index */ - indexDesc = index_open(indexOid, AccessShareLock); + indexDesc = RelationIdGetRelation(indexOid); /* * Extract index expressions and index predicate. Note: Don't use @@ -5134,7 +5134,7 @@ restart: /* Collect all attributes in the index predicate, too */ pull_varattnos(indexPredicate, 1, &indexattrs); - index_close(indexDesc, AccessShareLock); + RelationClose(indexDesc); } /* diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl index be2c0bdc35..cfcee2f1a7 100644 --- a/src/test/subscription/t/010_truncate.pl +++ b/src/test/subscription/t/010_truncate.pl @@ -3,7 +3,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 9; +use Test::More tests => 11; # setup @@ -158,3 +158,28 @@ is($result, qq(0||), 'truncate of multiple tables some not published'); $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), min(a), max(a) FROM tab2"); is($result, qq(3|1|3), 'truncate of multiple tables some not published'); + +# setup synchronous logical replicatio
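To restate the failure mode in this thread as a sketch: TRUNCATE holds AccessExclusiveLock on the index and waits for the synchronous standby's ack; the walsender doing the decoding calls RelationGetIndexAttrBitmap(), which tries to lock that same index; neither can proceed. A toy wait-for graph (the labels are my own, not real process titles) makes the cycle explicit:

```python
# Toy wait-for graph for the hang described above (labels are illustrative).
waits_for = {
    "truncate backend": "sync-rep ack from walsender",
    "sync-rep ack from walsender": "AccessShareLock on index",
    "AccessShareLock on index": "truncate backend",  # holder of AccessExclusiveLock
}

def find_cycle(graph, start):
    seen = []
    node = start
    while node not in seen:
        seen.append(node)
        node = graph[node]
    return seen[seen.index(node):]

cycle = find_cycle(waits_for, "truncate backend")
print(cycle)  # all three nodes: a closed wait cycle
```

Breaking any edge breaks the cycle; the approach discussed here removes the third edge by reading the index relcache entry via RelationIdGetRelation() instead of taking a lock through index_open().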
RE: Truncate in synchronous logical replication failed
Hi On Friday, April 16, 2021 11:02 AM Japin Li > On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: > >> > >> > >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila > wrote: > >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila > wrote: > >> >> > >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek > >> >> wrote: > >> >> > > >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila > wrote: > >> >> > > > >> >> > > The problem happens only when we try to fetch IDENTITY_KEY > >> >> > > attributes because pgoutput uses RelationGetIndexAttrBitmap() > >> >> > > to get that information which locks the required indexes. Now, > >> >> > > because TRUNCATE has already acquired an exclusive lock on the > >> >> > > index, it seems to create a sort of deadlock where the actual > >> >> > > Truncate operation waits for logical replication of operation > >> >> > > to complete and logical replication waits for actual Truncate > operation to finish. > >> >> > > > >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build > >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock > >> >> > > the main relation, we just scan the system table and build > >> >> > > that information using a historic snapshot. Can't we do something > similar here? > >> >> > > > >> >> > > Adding Petr J. and Peter E. to know their views as this seems > >> >> > > to be an old problem (since the decoding of Truncate operation is > introduced). > >> >> > > >> >> > We used RelationGetIndexAttrBitmap because it already existed, no > other reason. > >> >> > > >> >> > >> >> Fair enough. But I think we should do something about it because > >> >> using the same (RelationGetIndexAttrBitmap) just breaks the > >> >> synchronous logical replication. I think this is broken since the > >> >> logical replication of Truncate is supported. > >> >> > >> >> > I am not sure what exact locking we need but I would have guessed > at least AccessShareLock would be needed. 
> >> >> > > >> >> > >> >> Are you telling that we need AccessShareLock on the index? If so, > >> >> why is it different from how we access the relation during > >> >> decoding (basically in ReorderBufferProcessTXN, we directly use > >> >> RelationIdGetRelation() without any lock on the relation)? I think > >> >> we do it that way because we need it to process WAL entries and we > >> >> need the relation state based on the historic snapshot, so, even > >> >> if the relation is later changed/dropped, we are fine with the old > >> >> state we got. Isn't the same thing applies here in > >> >> logicalrep_write_attrs? If that is true then some equivalent of > >> >> RelationGetIndexAttrBitmap where we use RelationIdGetRelation > instead of index_open should work? > >> >> > >> > > >> > Today, again I have thought about this and don't see a problem with > >> > the above idea. If the above understanding is correct, then I think > >> > for our purpose in pgoutput, we just need to call > >> > RelationGetIndexList and then build the idattr list for > relation->rd_replidindex. > >> > >> Sorry, I don't know how can we build the idattr without open the index. > >> If we should open the index, then we should use NoLock, since the > >> TRUNCATE side hold AccessExclusiveLock. As Osumi points out in [1], > >> The NoLock mode assumes that the appropriate lock on the index is > already taken. > >> > > > > Why can't we use RelationIdGetRelation() by passing the required > > indexOid to it? > > Thanks for your reminder. It might be a way to solve this problem. Yeah. I've made the 1st patch for this issue. In my env, with the patch the TRUNCATE in synchronous logical replication doesn't hang. It's OK with make check-world as well. Best Regards, Takamichi Osumi Truncate_in_synchronous_logical_replication_v01.patch Description: Truncate_in_synchronous_logical_replication_v01.patch
Re: Allowing to create LEAKPROOF functions to non-superuser
On Mon, Apr 12, 2021 at 02:35:27PM -0700, Andres Freund wrote: > On 2021-04-12 17:14:20 -0400, Tom Lane wrote: > > I doubt that falsely labeling a function LEAKPROOF can get you more > > than the ability to read data you're not supposed to be able to read > > ... but that ability is then available to all users, or at least all > > users who can execute the function in question. So it definitely is a > > fairly serious security hazard, and one that's not well modeled by > > role labels. If you give somebody e.g. pg_read_all_data privileges, > > you don't expect that that means they can give it to other users. I do expect that, essentially. Like Andres describes for BYPASSRLS, they can create and GRANT a SECURITY DEFINER function that performs an arbitrary query and returns a refcursor (or stores the data to a table of the caller's choosing, etc.). Unlike BYPASSRLS, they can even make pg_read_all_data own the function, making the situation persist after one drops the actor's role and that role's objects. > A user with BYPASSRLS can create public security definer functions > returning data. If the concern is a BYPASSRLS user intentionally > exposing data, then there's not a meaningful increase to allow defining > LEAKPROOF functions. Hence, I do find it reasonable to let pg_read_all_data be sufficient for setting LEAKPROOF. I would not consult datdba, because datdba currently has no special read abilities. It feels too weird to let BYPASSRLS start affecting non-RLS access controls. A reasonable person may assume that BYPASSRLS has no consequences until someone uses CREATE POLICY. That said, I wouldn't be horrified if BYPASSRLS played a part. BYPASSRLS, like pg_read_all_data, clearly isn't something to grant lightly.
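Noah's point -- that a role with read-all rights can effectively hand those rights to others via a SECURITY DEFINER wrapper -- can be modeled in a few lines (a toy privilege model of my own, nothing Postgres-specific):

```python
# Toy model: functions run with their owner's privileges, like SECURITY DEFINER.
TABLE_ACL = {"secret_table": {"admin"}}   # only admin may read
DATA = {"secret_table": ["s3cr3t"]}

def read_table(role, table):
    if role not in TABLE_ACL[table]:
        raise PermissionError(role)
    return DATA[table]

def make_definer_function(owner, table):
    # the closure captures the owner's identity, as SECURITY DEFINER does
    def fn(_caller_role):
        return read_table(owner, table)
    return fn

# admin creates the wrapper and "grants execute" to everyone
leak = make_definer_function("admin", "secret_table")
print(leak("unprivileged_user"))  # the caller reads data it has no right to
```

Direct access still fails for the unprivileged role; the wrapper is the escalation path, which is why "can they give the ability to other users" is already answered yes for such roles.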
Re: fix old confusing JSON example
On Sat, Apr 03, 2021 at 02:28:38PM +0200, Erik Rijkers wrote: > So, that gives information on two operators, and then gives one > example query for each. Clearly, the second example was meant to > illustrate a where-clause with the @? operator. > > Small change to prevent great confusion (I'll admit it took me far > too long to understand this). Once one guesses the definition of the table to use with the sample data at disposal in the docs, it is easy to see that both queries should return the same result, but the second one misses the shot and is corrected as you say. So, applied. My apologies for the delay. -- Michael signature.asc Description: PGP signature
Re: Converting built-in SQL functions to new style
On Thu, Apr 15, 2021 at 07:25:39PM -0400, Tom Lane wrote: > Here's a draft patch that converts all the built-in and information_schema > SQL functions to new style, except for half a dozen that cannot be > converted because they use polymorphic arguments. This patch looks good. > One thing I was wondering about, but did not pull the trigger on > here, is whether to split off the function-related stuff in > system_views.sql into a new file "system_functions.sql", as has > long been speculated about by the comments in system_views.sql. > I think it is time to do this because > > (a) The function stuff now amounts to a full third of the file. Fair. > (b) While the views made by system_views.sql are intentionally > not pinned, the function-related commands are messing with > pre-existing objects that *are* pinned. This seems quite > confusing to me, and it might interfere with the intention that > you could reload the system view definitions using this file. I'm not aware of that causing a problem. Currently, the views give a few errors, and the functions do not.
Re: Truncate in synchronous logical replication failed
On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com wrote:
>
> > Thanks for your reminder. It might be a way to solve this problem.
> Yeah. I've made the 1st patch for this issue.
>
> In my env, with the patch
> the TRUNCATE in synchronous logical replication doesn't hang.
>

Few initial comments:
=
1.
+ relreplindex = relation->rd_replidindex;
+
+ /*
+ * build attributes to idindexattrs.
+ */
+ idindexattrs = NULL;
+ foreach(l, indexoidlist)
+ {
+ Oid indexOid = lfirst_oid(l);
+ Relation indexDesc;
+ int i;
+ bool isIDKey; /* replica identity index */
+
+ indexDesc = RelationIdGetRelation(indexOid);

When you have the oid of the replica identity index (relreplindex), then what is the need to traverse all the indexes?

2.
It is better to name the function as RelationGet...

--
With Regards,
Amit Kapila.
Re: Truncate in synchronous logical replication failed
On Fri, Apr 16, 2021 at 12:55 PM Japin Li wrote: > > On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: > >> > >> > >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila wrote: > >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila > >> > wrote: > >> >> > >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek > >> >> wrote: > >> >> > > >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila > >> >> > > wrote: > >> >> > > > >> >> > > The problem happens only when we try to fetch IDENTITY_KEY > >> >> > > attributes > >> >> > > because pgoutput uses RelationGetIndexAttrBitmap() to get that > >> >> > > information which locks the required indexes. Now, because TRUNCATE > >> >> > > has already acquired an exclusive lock on the index, it seems to > >> >> > > create a sort of deadlock where the actual Truncate operation waits > >> >> > > for logical replication of operation to complete and logical > >> >> > > replication waits for actual Truncate operation to finish. > >> >> > > > >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build > >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock the > >> >> > > main > >> >> > > relation, we just scan the system table and build that information > >> >> > > using a historic snapshot. Can't we do something similar here? > >> >> > > > >> >> > > Adding Petr J. and Peter E. to know their views as this seems to be > >> >> > > an > >> >> > > old problem (since the decoding of Truncate operation is > >> >> > > introduced). > >> >> > > >> >> > We used RelationGetIndexAttrBitmap because it already existed, no > >> >> > other reason. > >> >> > > >> >> > >> >> Fair enough. But I think we should do something about it because using > >> >> the same (RelationGetIndexAttrBitmap) just breaks the synchronous > >> >> logical replication. I think this is broken since the logical > >> >> replication of Truncate is supported. 
> >> >> > >> >> > I am not sure what exact locking we need but I would have guessed at > >> >> > least AccessShareLock would be needed. > >> >> > > >> >> > >> >> Are you telling that we need AccessShareLock on the index? If so, why > >> >> is it different from how we access the relation during decoding > >> >> (basically in ReorderBufferProcessTXN, we directly use > >> >> RelationIdGetRelation() without any lock on the relation)? I think we > >> >> do it that way because we need it to process WAL entries and we need > >> >> the relation state based on the historic snapshot, so, even if the > >> >> relation is later changed/dropped, we are fine with the old state we > >> >> got. Isn't the same thing applies here in logicalrep_write_attrs? If > >> >> that is true then some equivalent of RelationGetIndexAttrBitmap where > >> >> we use RelationIdGetRelation instead of index_open should work? > >> >> > >> > > >> > Today, again I have thought about this and don't see a problem with > >> > the above idea. If the above understanding is correct, then I think > >> > for our purpose in pgoutput, we just need to call RelationGetIndexList > >> > and then build the idattr list for relation->rd_replidindex. > >> > >> Sorry, I don't know how can we build the idattr without open the index. > >> If we should open the index, then we should use NoLock, since the TRUNCATE > >> side hold AccessExclusiveLock. As Osumi points out in [1], The NoLock mode > >> assumes that the appropriate lock on the index is already taken. > >> > > > > Why can't we use RelationIdGetRelation() by passing the required > > indexOid to it? > > Hi Amit, as your suggested, I try to use RelationIdGetRelation() replace > the index_open() to avoid specify the AccessSharedLock, then the TRUNCATE > will not be blocked. > It is okay as POC but we can't change the existing function RelationGetIndexAttrBitmap. It is used from other places as well. 
It might be better to write a separate function for this, something like what Osumi-San's patch is trying to do. -- With Regards, Amit Kapila.
RE: Truncate in synchronous logical replication failed
Hi On Friday, April 16, 2021 5:50 PM Amit Kapila wrote: > On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com > wrote: > > > > > Thanks for your reminder. It might be a way to solve this problem. > > Yeah. I've made the 1st patch for this issue. > > > > In my env, with the patch > > the TRUNCATE in synchronous logical replication doesn't hang. > > > > Few initial comments: > = > 1. > + relreplindex = relation->rd_replidindex; > + > + /* > + * build attributes to idindexattrs. > + */ > + idindexattrs = NULL; > + foreach(l, indexoidlist) > + { > + Oid indexOid = lfirst_oid(l); > + Relation indexDesc; > + int i; > + bool isIDKey; /* replica identity index */ > + > + indexDesc = RelationIdGetRelation(indexOid); > > When you have oid of replica identity index (relreplindex) then what is the > need to traverse all the indexes? Ok. No need to traverse all the indexes. Will fix this part. > 2. > It is better to name the function as RelationGet... You are right. I'll modify this in my next version. Best Regards, Takamichi Osumi
Re: Truncate in synchronous logical replication failed
On Fri, 16 Apr 2021 at 16:52, Amit Kapila wrote: > On Fri, Apr 16, 2021 at 12:55 PM Japin Li wrote: >> >> On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: >> > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: >> >> >> >> >> >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila wrote: >> >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila >> >> > wrote: >> >> >> >> >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek >> >> >> wrote: >> >> >> > >> >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila >> >> >> > > wrote: >> >> >> > > >> >> >> > > The problem happens only when we try to fetch IDENTITY_KEY >> >> >> > > attributes >> >> >> > > because pgoutput uses RelationGetIndexAttrBitmap() to get that >> >> >> > > information which locks the required indexes. Now, because TRUNCATE >> >> >> > > has already acquired an exclusive lock on the index, it seems to >> >> >> > > create a sort of deadlock where the actual Truncate operation waits >> >> >> > > for logical replication of operation to complete and logical >> >> >> > > replication waits for actual Truncate operation to finish. >> >> >> > > >> >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build >> >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock the >> >> >> > > main >> >> >> > > relation, we just scan the system table and build that information >> >> >> > > using a historic snapshot. Can't we do something similar here? >> >> >> > > >> >> >> > > Adding Petr J. and Peter E. to know their views as this seems to >> >> >> > > be an >> >> >> > > old problem (since the decoding of Truncate operation is >> >> >> > > introduced). >> >> >> > >> >> >> > We used RelationGetIndexAttrBitmap because it already existed, no >> >> >> > other reason. >> >> >> > >> >> >> >> >> >> Fair enough. But I think we should do something about it because using >> >> >> the same (RelationGetIndexAttrBitmap) just breaks the synchronous >> >> >> logical replication. 
I think this is broken since the logical >> >> >> replication of Truncate is supported. >> >> >> >> >> >> > I am not sure what exact locking we need but I would have guessed at >> >> >> > least AccessShareLock would be needed. >> >> >> > >> >> >> >> >> >> Are you telling that we need AccessShareLock on the index? If so, why >> >> >> is it different from how we access the relation during decoding >> >> >> (basically in ReorderBufferProcessTXN, we directly use >> >> >> RelationIdGetRelation() without any lock on the relation)? I think we >> >> >> do it that way because we need it to process WAL entries and we need >> >> >> the relation state based on the historic snapshot, so, even if the >> >> >> relation is later changed/dropped, we are fine with the old state we >> >> >> got. Isn't the same thing applies here in logicalrep_write_attrs? If >> >> >> that is true then some equivalent of RelationGetIndexAttrBitmap where >> >> >> we use RelationIdGetRelation instead of index_open should work? >> >> >> >> >> > >> >> > Today, again I have thought about this and don't see a problem with >> >> > the above idea. If the above understanding is correct, then I think >> >> > for our purpose in pgoutput, we just need to call RelationGetIndexList >> >> > and then build the idattr list for relation->rd_replidindex. >> >> >> >> Sorry, I don't know how can we build the idattr without open the index. >> >> If we should open the index, then we should use NoLock, since the TRUNCATE >> >> side hold AccessExclusiveLock. As Osumi points out in [1], The NoLock mode >> >> assumes that the appropriate lock on the index is already taken. >> >> >> > >> > Why can't we use RelationIdGetRelation() by passing the required >> > indexOid to it? >> >> Hi Amit, as your suggested, I try to use RelationIdGetRelation() replace >> the index_open() to avoid specify the AccessSharedLock, then the TRUNCATE >> will not be blocked. 
>> > > It is okay as POC but we can't change the existing function > RelationGetIndexAttrBitmap. It is used from other places as well. It > might be better to write a separate function for this, something like > what Osumi-San's patch is trying to do. Agreed! Hi Osumi-San, can you merge the test case in your next version? -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Replication slot stats misgivings
On Mon, Apr 12, 2021 at 2:57 PM vignesh C wrote: > > On Sat, Mar 20, 2021 at 9:26 AM Amit Kapila wrote: > > > > On Sat, Mar 20, 2021 at 12:22 AM Andres Freund wrote: > > > > > > And then more generally about the feature: > > > - If a slot was used to stream out a large amount of changes (say an > > > initial data load), but then replication is interrupted before the > > > transaction is committed/aborted, stream_bytes will not reflect the > > > many gigabytes of data we may have sent. > > > > > > > We can probably update the stats each time we spilled or streamed the > > transaction data but it was not clear at that stage whether or how > > much it will be useful. > > > > I felt we can update the replication slot statistics data each time we > spill/stream the transaction data instead of accumulating the > statistics and updating at the end. I have tried this in the attached > patch and the statistics data were getting updated. > Thoughts? > Did you check if we can update the stats when we release the slot as discussed above? I am not sure if it is easy to do at the time of slot release because this information might not be accessible there and in some cases, we might have already released the decoding context/reorderbuffer where this information is stored. It might be okay to update this when we stream or spill but let's see if we can do it easily at the time of slot release. -- With Regards, Amit Kapila.
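The difference under discussion -- accumulating spill statistics and reporting only at the end versus reporting at each spill -- can be sketched like this (a toy accumulator of my own, not the actual pgstat machinery):

```python
# Toy accumulator: the "reported" list stands in for what the stats
# collector has seen so far.
class SlotStats:
    def __init__(self):
        self.spill_bytes = 0
        self.reported = []

    def record_spill(self, nbytes):
        self.spill_bytes += nbytes
        # report at spill time: if replication is interrupted before
        # commit/abort, the earlier spills are still visible
        self.reported.append(self.spill_bytes)


stats = SlotStats()
for chunk in (100, 200, 300):
    stats.record_spill(chunk)
print(stats.reported)  # progress visible after every spill, not only at the end
```

With end-of-transaction-only reporting, an interrupted stream would leave `reported` empty even after gigabytes were spilled -- the scenario Andres raised.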
Re: Truncate in synchronous logical replication failed
On Fri, Apr 16, 2021 at 3:08 PM Japin Li wrote: > > On Fri, 16 Apr 2021 at 16:52, Amit Kapila wrote: > > On Fri, Apr 16, 2021 at 12:55 PM Japin Li wrote: > > It is okay as POC but we can't change the existing function > > RelationGetIndexAttrBitmap. It is used from other places as well. It > > might be better to write a separate function for this, something like > > what Osumi-San's patch is trying to do. > > Agreed! > > Hi Osumi-San, can you merge the test case in your next version? > +1. Your test looks reasonable to me. -- With Regards, Amit Kapila.
Re: logical replication worker accesses catalogs in error context callback
On Tue, Mar 16, 2021 at 2:21 AM Tom Lane wrote: > > Bharath Rupireddy writes: > > Thanks for pointing to the changes in the commit message. I corrected > > them. Attaching v4 patch set, consider it for further review. > > I took a quick look at this. I'm quite worried about the potential > performance cost of the v4-0001 patch (the one for fixing > slot_store_error_callback). Previously, we didn't pay any noticeable > cost for having the callback unless there actually was an error. > As patched, we perform several catalog lookups per column per row, > even in the non-error code path. That seems like it'd be a noticeable > performance hit. Just to add insult to injury, it leaks memory. > > I propose a more radical but simpler solution: let's just not bother > with including the type names in the context message. How much are > they really buying? << Attaching v5-0001 here again for completion >> I'm attaching the v5-0001 patch that avoids printing the column type names in the context message, so no cache lookups have to be done in the error context callback. I think the column name is enough to know on which column the error occurred and, if required, its type can be known by the user. This patch gets rid of printing local and remote type names in slot_store_error_callback and also removes logicalrep_typmap_gettypname because it's unnecessary. Thoughts? > v4-0002 (for postgres_fdw's conversion_error_callback) has the same > problems, although mitigated a bit by not needing to do any catalog > lookups in the non-join case. For the join case, I wonder whether > we could push the lookups back to plan setup time, so they don't > need to be done over again for each row. (Not doing lookups at all > doesn't seem attractive there; it'd render the context message nearly > useless.) A different idea is to try to get the column names out > of the query's rangetable instead of going to the catalogs.
I'm attaching v5-0002, which stores the required attribute information for foreign joins in postgresBeginForeignScan, which is a one-time job, as opposed to the per-row system catalog lookup v4-0002 was doing. I'm storing the foreign table relid (as key), relname and the retrieved attributes' attnum and attname into a hash table. Whenever a conversion error occurs, using relid, the hash table is looked up to fetch the relname and attname. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v5-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patch Description: Binary data v5-0002-Avoid-Catalogue-Accesses-In-conversion_error_call.patch Description: Binary data
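The one-time-setup idea in v5-0002 can be sketched outside the backend roughly as follows: a small cache of column names built once at scan start, consulted by the error callback with no catalog access per row. All names here are invented for illustration; the real patch keys a hash table by relid and uses the PostgreSQL dynahash machinery.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-in for the v5-0002 approach: column names are captured
 * once at "BeginForeignScan" time, so the conversion error callback can
 * report them without any per-row catalog lookup. */
#define MAX_ATTS 8

typedef struct AttCache
{
	const char *relname;
	const char *attnames[MAX_ATTS];
	int			natts;
} AttCache;

/* one-time setup, analogous to work done in postgresBeginForeignScan */
static void
att_cache_init(AttCache *c, const char *relname,
			   const char *const *attnames, int natts)
{
	c->relname = relname;
	c->natts = natts;
	for (int i = 0; i < natts; i++)
		c->attnames[i] = attnames[i];
}

/* cheap lookup, analogous to what conversion_error_callback would do */
static const char *
att_cache_name(const AttCache *c, int attnum)
{
	return (attnum >= 0 && attnum < c->natts) ? c->attnames[attnum] : "?";
}
```

The point of the design is that the (possibly expensive) name resolution happens exactly once per scan, while the error path only does an array/hash lookup.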
Re: Replication slot stats misgivings
On Fri, Apr 16, 2021 at 3:16 PM Amit Kapila wrote: > > On Mon, Apr 12, 2021 at 2:57 PM vignesh C wrote: > > > > On Sat, Mar 20, 2021 at 9:26 AM Amit Kapila wrote: > > > > > > On Sat, Mar 20, 2021 at 12:22 AM Andres Freund wrote: > > > > > > > > And then more generally about the feature: > > > > - If a slot was used to stream out a large amount of changes (say an > > > > initial data load), but then replication is interrupted before the > > > > transaction is committed/aborted, stream_bytes will not reflect the > > > > many gigabytes of data we may have sent. > > > > > > > > > > We can probably update the stats each time we spilled or streamed the > > > transaction data but it was not clear at that stage whether or how > > > much it will be useful. > > > > > > > I felt we can update the replication slot statistics data each time we > > spill/stream the transaction data instead of accumulating the > > statistics and updating at the end. I have tried this in the attached > > patch and the statistics data were getting updated. > > Thoughts? > > > > Did you check if we can update the stats when we release the slot as > discussed above? I am not sure if it is easy to do at the time of slot > release because this information might not be accessible there and in > some cases, we might have already released the decoding > context/reorderbuffer where this information is stored. It might be > okay to update this when we stream or spill but let's see if we can do > it easily at the time of slot release. > I'm not sure if we will be able to update stats from here, as we will not have access to decoding context/reorderbuffer at this place, and also like you pointed out I noticed that the decoding context gets released earlier itself. Regards, Vignesh
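The trade-off being discussed — reporting counters at each spill/stream instead of once at transaction end — can be sketched as follows. The structure and names are invented for illustration; the real counters live in the decoding context and are sent to the stats collector.

```c
#include <assert.h>

/* Hypothetical per-slot counters, flushed at every spill rather than at
 * commit/abort, so an interrupted walsender cannot lose a large
 * accumulated delta. */
typedef struct SlotStats
{
	long		spill_txns;
	long		spill_bytes;
} SlotStats;

static SlotStats shared_stats;	/* stands in for the stats collector */

static void
spill_transaction(SlotStats *pending, long nbytes)
{
	pending->spill_txns += 1;
	pending->spill_bytes += nbytes;

	/* flush immediately instead of accumulating until transaction end */
	shared_stats.spill_txns += pending->spill_txns;
	shared_stats.spill_bytes += pending->spill_bytes;
	pending->spill_txns = 0;
	pending->spill_bytes = 0;
}
```

Flushing per spill is slightly more message traffic, but as Andres's example shows, the end-of-transaction alternative loses gigabytes of accounting if replication is interrupted first.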
Re: Replication slot stats misgivings
On Fri, Apr 16, 2021 at 11:28 AM Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:35 PM Masahiko Sawada wrote: > > > > Thank you for the update! The patch looks good to me. > > > > I have pushed the first patch. Comments on the next patch > v13-0001-Use-HTAB-for-replication-slot-statistics: Also should we change PGSTAT_FILE_FORMAT_ID as we have modified the replication slot statistics? Regards, Vignesh
Re: ATTACH PARTITION locking documentation for DEFAULT partitions
On Thu, 15 Apr 2021 at 21:24, Justin Pryzby wrote: > > On Thu, Apr 15, 2021 at 08:47:26PM +0200, Matthias van de Meent wrote: > > I recently noticed that ATTACH PARTITION also recursively locks the > > default partition with ACCESS EXCLUSIVE mode when its constraints do > > not explicitly exclude the to-be-attached partition, which I couldn't > > find documented (has been there since PG10 I believe). > > I'm not sure it's what you're looking for, but maybe you saw: > https://www.postgresql.org/docs/12/sql-altertable.html > |The default partition can't contain any rows that would need to be moved to > the > |new partition, and will be scanned to verify that none are present. This > scan, > |like the scan of the new partition, can be avoided if an appropriate > |CHECK constraint is present. > > And since 2a4d96ebb: > |Attaching a partition acquires a SHARE UPDATE EXCLUSIVE lock on the parent > table, in addition to ACCESS EXCLUSIVE locks on the table to be attached and > on the default partition (if any). From the current documentation the recursive locking isn't clear: I didn't expect an ACCESS EXCLUSIVE on the whole hierarchy of both the to-be-attached and the default partitions whilst scanning, because the SUEL on the shared parent is not propagated to all its children either. > From your patch: > > > + > > + Similarly, if you have a default partition on the parent table, it is > > + recommended to create a CHECK constraint that > > excludes > > + the to be attached partition constraint. Here, too, without the > > + CHECK constraint, this table will be scanned to > > + validate that the updated default partition constraints while holding > > + an ACCESS EXCLUSIVE lock on the default partition. > > + > > The AEL is acquired in any case, right ? Yes, the main point is that the validation scan runs whilst holding the AEL on the partition (sub)tree of that default partition.
After looking a bit more at the code, I agree that my current patch is not descriptive enough. I compared adding a partition to running `ADD CONSTRAINT ... NOT VALID` on the to-be-altered partitions (using AEL), + `VALIDATE CONSTRAINT` running recursively over its partitions (using SHARE UPDATE EXCLUSIVE). We only expect an SUEL for VALIDATE CONSTRAINT, and the constraint itself is only added/updated on the direct descendants of the parent, not their recursive descendants. Insertions already can only happen when the whole upward hierarchy of a partition allows for inserts, so this shouldn't be that much of an issue. > I think whatever we say here needs to be crystal clear that only the scan can > be skipped. Yes, but when we skip the scan for the default partition, we also skip locking its partition tree with AEL. The partition tree of the table that is being attached, however, is fully locked regardless of constraint definitions. > I suggest that maybe the existing paragraph in alter_table.sgml could maybe > say > that an exclusive lock is held, maybe like. > > |The default partition can't contain any rows that would need to be moved to > the > |new partition, and will be scanned to verify that none are present. This > scan, > |like the scan of the new partition, can be avoided if an appropriate > |CHECK constraint is present. > |The scan of the default partition occurs while it is exclusively locked. PFA an updated patch. I've updated the wording of the previous patch, and also updated this section in alter_table.sgml, but with different wording, explicitly explaining the process used to validate the altered default constraint. Thanks for the review. With regards, Matthias van de Meent From 230c594950886bf50bf4c69295aa0cb67c16b8a6 Mon Sep 17 00:00:00 2001 From: Matthias van de Meent Date: Thu, 15 Apr 2021 20:43:00 +0200 Subject: [PATCH v2] ATTACH PARTITION locking documentation for DEFAULT partitions.
--- doc/src/sgml/ddl.sgml | 18 -- doc/src/sgml/ref/alter_table.sgml | 8 ++-- 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 30e4170963..8f0b38847a 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3934,6 +3934,9 @@ ALTER TABLE measurement_y2008m02 ADD CONSTRAINT y2008m02 \copy measurement_y2008m02 from 'measurement_y2008m02' -- possibly some other data preparation work +ALTER TABLE measurement_default ADD CONSTRAINT excl_y2008m02 + CHECK ( (logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01') IS FALSE ); + ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02 FOR VALUES FROM ('2008-02-01') TO ('2008-03-01' ); @@ -3947,12 +3950,23 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02 which is otherwise needed to validate the implicit partition constraint. Without the CHECK constraint, the table will be scanned to validate the partition constraint while - holding both an ACCESS EXCLUSIVE lock on tha
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 2:16 AM Thomas Munro wrote: > In commit 1d257577e08d3e598011d6850fd1025858de8c8c, there is a change in file format for stats, won't it require bumping PGSTAT_FILE_FORMAT_ID? Actually, I came across this while working on my today's commit f5fc2f5b23 where I forgot to bump PGSTAT_FILE_FORMAT_ID. So, I thought maybe we can bump it just once if required? -- With Regards, Amit Kapila.
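The reason PGSTAT_FILE_FORMAT_ID must be bumped when the on-disk stats layout changes can be sketched as a header check: a reader has to be able to reject a file written with a different layout. The constant and struct below are invented for illustration, not the real PGSTAT_FILE_FORMAT_ID or pgstat file layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invented format identifier; stands in for PGSTAT_FILE_FORMAT_ID. */
#define STATS_FORMAT_ID 0x01A5BC9EU

typedef struct StatsHeader
{
	uint32_t	format_id;		/* must match, else the file is discarded */
	uint32_t	nslots;
} StatsHeader;

/* Returns 0 on success, -1 when the file must be discarded. */
static int
read_stats_header(const unsigned char *buf, size_t len, StatsHeader *out)
{
	if (len < sizeof(StatsHeader))
		return -1;
	memcpy(out, buf, sizeof(StatsHeader));
	if (out->format_id != STATS_FORMAT_ID)
		return -1;				/* written by a different format version */
	return 0;
}
```

This is why bumping the ID once per release cycle (rather than per commit) is enough: all that matters is that a file written with the old layout is rejected by a server using the new one.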
Re: Retry in pgbench
On Fri, 16 Apr 2021 10:28:48 +0900 (JST) Tatsuo Ishii wrote: > > By the way, I've been playing with the idea of failing gracefully and retry > > indefinitely (or until given -T) on SQL error AND connection issue. > > > > It would be useful to test replicating clusters with a (switch|fail)over > > procedure. > > Interesting idea but in general a failover takes some time (like a few > minutes), and it will strongly affect TPS. I think in the end it just > measures the failover time. This use case is not about benchmarking. It's about generating constant traffic to be able to practice/train some [auto]switchover procedures while being close to production activity. In this context, the max-saturated TPS of one node is not relevant. But being able to add some stats about downtime might be a good addition. Regards,
Re: Retry in pgbench
> This use case is not about benchmarking. It's about generating constant traffic > to be able to practice/train some [auto]switchover procedures while being > close > to production activity. > > In this context, the max-saturated TPS of one node is not relevant. But being > able to add some stats about downtime might be a good addition. Oh I see. That makes sense. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
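The retry behaviour discussed for switchover drills can be sketched as a bounded retry loop: keep trying a failing operation up to a limit instead of aborting the run, and report how many attempts were needed (which is the raw material for a downtime statistic). The flaky "connection" below is simulated; all names are invented for illustration, not pgbench internals.

```c
#include <assert.h>

typedef int (*op_fn) (void *state);

/* Returns the attempt number that succeeded, or -1 after giving up. */
static int
retry_until(op_fn op, void *state, int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++)
	{
		if (op(state) == 0)
			return attempt;
		/* a real client would sleep/back off here, and could record the
		 * elapsed time as the downtime statistic mentioned in the thread */
	}
	return -1;
}

/* simulated connection that fails a fixed number of times, then succeeds */
static int
flaky_connect(void *state)
{
	int		   *failures_left = state;

	return (*failures_left)-- > 0 ? -1 : 0;
}
```

With -T semantics, max_attempts would instead be a deadline check, but the shape of the loop is the same.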
Error when defining a set returning function
Dear all

Since I was receiving an error when defining a set returning function, I borrowed a function from PostgreSQL as follows

/* C definition */
typedef struct testState
{
	int			current;
	int			finish;
	int			step;
} testState;

/**
 * test_srf(startval int, endval int, step int)
 */
PG_FUNCTION_INFO_V1(test_srf);
Datum
test_srf(PG_FUNCTION_ARGS)
{
	FuncCallContext *funcctx;
	testState  *fctx;
	int			result;			/* the actual return value */

	if (SRF_IS_FIRSTCALL())
	{
		/* Get input values */
		int			start = PG_GETARG_INT32(0);
		int			finish = PG_GETARG_INT32(1);
		int			step = PG_GETARG_INT32(2);
		MemoryContext oldcontext;

		/* create a function context for cross-call persistence */
		funcctx = SRF_FIRSTCALL_INIT();

		/* switch to memory context appropriate for multiple function calls */
		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

		/* quick opt-out if we get nonsensical inputs */
		if (step <= 0 || start == finish)
		{
			funcctx = SRF_PERCALL_SETUP();
			SRF_RETURN_DONE(funcctx);
		}

		/* allocate memory for function context */
		fctx = (testState *) palloc0(sizeof(testState));
		fctx->current = start;
		fctx->finish = finish;
		fctx->step = step;
		funcctx->user_fctx = fctx;
		MemoryContextSwitchTo(oldcontext);
	}

	/* stuff done on every call of the function */
	funcctx = SRF_PERCALL_SETUP();

	/* get state */
	fctx = funcctx->user_fctx;
	result = fctx->current;
	fctx->current += fctx->step;

	/* Stop when we have generated all values */
	if (fctx->current > fctx->finish)
	{
		SRF_RETURN_DONE(funcctx);
	}

	SRF_RETURN_NEXT(funcctx, Int32GetDatum(result));
}

/* SQL definition */
CREATE OR REPLACE FUNCTION testSRF(startval int, endval int, step int)
RETURNS SETOF integer
AS 'MODULE_PATHNAME', 'test_srf'
LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;

When I execute this function I obtain

select testSRF(1,10, 2);
ERROR: unrecognized table-function returnMode: 257

select version();
PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit

Any idea what could be wrong?
Thanks for your help Esteban
Re: Bogus collation version recording in recordMultipleDependencies
Julien Rouhaud writes: > On Thu, Apr 15, 2021 at 10:06:24AM -0400, Tom Lane wrote: >> 0001 fails for me :-(. I think that requires default collation to be C. > Oh right, adding --no-locale to the regress opts I see that create_index is > failing, and that's not the one I was expecting. > We could change create_index test to create c2 with a C collation, in order to > test that we don't track dependency on unversioned locales, and add an extra > test in collate.linux.utf8 to check that we do track a dependency on the > default collation as this test isn't run in the --no-locale case. The only > case not tested would be default unversioned collation, but I'm not sure where > to properly test that. Maybe a short leading test in collate.linux.utf8 that > would be run on linux in that case (when getdatabaseencoding() != 'UTF8')? It > would require an extra alternate file but it wouldn't cause too much > maintenance problem as there should be only one test. Since the proposed patch removes the dependency code's special-case handling of the default collation, I don't feel like we need to jump through hoops to prove that the default collation is tracked the same as other collations. A regression test with alternative outputs is a significant ongoing maintenance burden, and I do not see where we're getting a commensurate improvement in test coverage. Especially since, AFAICS, the two alternative outputs would essentially have to accept both the "it works" and "it doesn't work" outcomes. So I propose that we do 0001 below, which is my first patch plus your suggestion about fixing up create_index.sql. This passes check-world for me under both C and en_US.utf8 prevailing locales. 
regards, tom lane diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c index 362db7fe91..1217c01b8a 100644 --- a/src/backend/catalog/pg_depend.c +++ b/src/backend/catalog/pg_depend.c @@ -73,7 +73,6 @@ recordMultipleDependencies(const ObjectAddress *depender, max_slots, slot_init_count, slot_stored_count; - char *version = NULL; if (nreferenced <= 0) return; /* nothing to do */ @@ -104,31 +103,22 @@ recordMultipleDependencies(const ObjectAddress *depender, slot_init_count = 0; for (i = 0; i < nreferenced; i++, referenced++) { - bool ignore_systempin = false; + char *version = NULL; if (record_version) { /* For now we only know how to deal with collations. */ if (referenced->classId == CollationRelationId) { -/* C and POSIX don't need version tracking. */ +/* These are unversioned, so don't waste cycles on them. */ if (referenced->objectId == C_COLLATION_OID || referenced->objectId == POSIX_COLLATION_OID) continue; version = get_collation_version_for_oid(referenced->objectId, false); - -/* - * Default collation is pinned, so we need to force recording - * the dependency to store the version. - */ -if (referenced->objectId == DEFAULT_COLLATION_OID) - ignore_systempin = true; } } - else - Assert(!version); /* * If the referenced object is pinned by the system, there's no real @@ -136,7 +126,7 @@ recordMultipleDependencies(const ObjectAddress *depender, * version. This saves lots of space in pg_depend, so it's worth the * time taken to check. 
*/ - if (!ignore_systempin && isObjectPinned(referenced, dependDesc)) + if (version == NULL && isObjectPinned(referenced, dependDesc)) continue; if (slot_init_count < max_slots) diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out index 830fdddf24..7f8f91b92c 100644 --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out @@ -2026,10 +2026,10 @@ REINDEX TABLE concur_reindex_tab; -- notice NOTICE: table "concur_reindex_tab" has no indexes to reindex REINDEX (CONCURRENTLY) TABLE concur_reindex_tab; -- notice NOTICE: table "concur_reindex_tab" has no indexes that can be reindexed concurrently -ALTER TABLE concur_reindex_tab ADD COLUMN c2 text; -- add toast index +ALTER TABLE concur_reindex_tab ADD COLUMN c2 text COLLATE "C"; -- add toast index -- Normal index with integer column CREATE UNIQUE INDEX concur_reindex_ind1 ON concur_reindex_tab(c1); --- Normal index with text column +-- Normal index with text column (with unversioned collation) CREATE INDEX concur_reindex_ind2 ON concur_reindex_tab(c2); -- UNIQUE index with expression CREATE UNIQUE INDEX concur_reindex_ind3 ON concur_reindex_tab(abs(c1)); @@ -2069,16 +2069,14 @@ WHERE classid = 'pg_class'::regclass AND obj| objref | deptype --++- index concur_reindex_ind1| constraint concur_reindex_ind1 on table concur_
Re: fix old confusing JSON example
On Fri, Apr 16, 2021 at 11:00 AM Michael Paquier wrote: > On Sat, Apr 03, 2021 at 02:28:38PM +0200, Erik Rijkers wrote: > > So, that gives information on two operators, and then gives one > > example query for each. Clearly, the second example was meant to > > illustrate a where-clause with the @? operator. > > > > Small change to prevent great confusion (I'll admit it took me far > > too long to understand this). > > Once one guesses the definition of the table to use with the sample > data at disposal in the docs, it is easy to see that both queries > should return the same result, but the second one misses the shot and > is corrected as you say. So, applied. > > My apologies for the delay. My apologies for missing this. And thank you for taking care! -- Regards, Alexander Korotkov
Re: Error when defining a set returning function
Esteban Zimanyi writes: > Since I was receiving an error when defining a set returning function, I > borrowed a function from PostgreSQL as follows > ... > When I execute this function I obtain > select testSRF(1,10, 2); > ERROR: unrecognized table-function returnMode: 257 Hmm, I compiled this function up and it works for me: regression=# select testSRF(1,10, 2); testsrf -- 1 3 5 7 (4 rows) I think your "quick opt-out" code is a bit broken, because it fails to restore the current memory context; but there's nothing wrong with the main code path. Hence, the problem is somewhere else. The first theory that comes to mind is that you're compiling against Postgres headers that don't match the server version you're actually loading the code into. In theory the PG_MODULE_MAGIC infrastructure ought to catch that, but maybe you've found some creative way to fool that :-(. One way maybe would be if the headers were from some pre-release v13 version that wasn't ABI-compatible with 13.0. Or it could be something else, but I'd counsel looking for build process mistakes, cause this C code isn't the problem. regards, tom lane
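The PG_MODULE_MAGIC mechanism Tom mentions can be sketched as follows: the module embeds a block of build-time constants, and the loader refuses to load it when any field differs from the server's own block. Fields and values here are invented for illustration; the important caveat — which is exactly the failure mode Tom describes — is that the check only covers what is recorded, so an ABI change not reflected in any field slips through.

```c
#include <assert.h>

/* Invented stand-in for the Pg_magic_struct recorded by PG_MODULE_MAGIC. */
typedef struct MagicBlock
{
	int			version;		/* server major version the module was built for */
	int			funcmaxargs;
	int			namedatalen;
} MagicBlock;

/* Loader-side check: every recorded field must match exactly. */
static int
magic_matches(const MagicBlock *server, const MagicBlock *module)
{
	return server->version == module->version &&
		server->funcmaxargs == module->funcmaxargs &&
		server->namedatalen == module->namedatalen;
}
```

A module built against pre-release headers of the same major version would carry an identical block and load cleanly, even if struct layouts changed underneath it.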
default_tablespace doc and partitioned rels
commit 87259588d0ab0b8e742e30596afa7ae25caadb18 Author: Alvaro Herrera Date: Thu Apr 25 10:20:23 2019 -0400 Fix tablespace inheritance for partitioned rels This doc change doesn't make sense to me: +++ b/doc/src/sgml/config.sgml @@ -7356,7 +7356,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; This variable specifies the default tablespace in which to create objects (tables and indexes) when a CREATE command does -not explicitly specify a tablespace. +not explicitly specify a tablespace. It also determines the tablespace +that a partitioned relation will direct future partitions to. default_tablespace is a global GUC, so if a partitioned relation "directs" partitions anywhere, it's not to the fallback value of the GUC, but to its reltablespace, as this patch wrote in doc/src/sgml/ref/create_table.sgml: + the tablespace specified overrides default_tablespace + as the default tablespace to use for any newly created partitions when no + other tablespace is explicitly specified. Maybe I'm misreading config.sgml somehow ? I thought it would be more like this (but actually I think shouldn't say anything at all): +... It also determines the tablespace where new partitions are created, +if the parent, partitioned relation doesn't have a tablespace set. -- Justin
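The precedence under discussion can be restated as a small sketch (semantics as described in the thread and in create_table.sgml, not actual backend code): an explicit TABLESPACE clause on the new partition wins, then the partitioned parent's reltablespace, and only then the default_tablespace GUC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative precedence for where a new partition lands.  NULL means
 * "not set" at that level. */
static const char *
partition_tablespace(const char *explicit_clause,
					 const char *parent_reltablespace,
					 const char *default_tablespace_guc)
{
	if (explicit_clause != NULL)
		return explicit_clause;
	if (parent_reltablespace != NULL)
		return parent_reltablespace;
	return default_tablespace_guc;	/* NULL here means the database default */
}
```

On this reading, the config.sgml sentence is misleading precisely because the GUC only applies when the parent's reltablespace is unset — the parent never "directs" partitions to the GUC's value.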
Re: wrong units in ExplainPropertyFloat
Justin Pryzby writes: > Text mode uses appendStringInfo() for high density output, so this only > affects > non-text output, but it turns out that units aren't shown for nontext format > anyway - this seems like a deficiency, but it means there's no visible bug. Yeah, I concur: these should say "ms", but it's only latent so it's not surprising nobody's noticed. Pushed. regards, tom lane
Re: Truncate in synchronous logical replication failed
On Fri, 16 Apr 2021 at 17:19, osumi.takami...@fujitsu.com wrote: > Hi > > > On Friday, April 16, 2021 5:50 PM Amit Kapila wrote: >> On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com >> wrote: >> > >> > > Thanks for your reminder. It might be a way to solve this problem. >> > Yeah. I've made the 1st patch for this issue. >> > >> > In my env, with the patch >> > the TRUNCATE in synchronous logical replication doesn't hang. >> > >> >> Few initial comments: >> = >> 1. >> + relreplindex = relation->rd_replidindex; >> + >> + /* >> + * build attributes to idindexattrs. >> + */ >> + idindexattrs = NULL; >> + foreach(l, indexoidlist) >> + { >> + Oid indexOid = lfirst_oid(l); >> + Relation indexDesc; >> + int i; >> + bool isIDKey; /* replica identity index */ >> + >> + indexDesc = RelationIdGetRelation(indexOid); >> >> When you have oid of replica identity index (relreplindex) then what is the >> need to traverse all the indexes? > Ok. No need to traverse all the indexes. Will fix this part. > >> 2. >> It is better to name the function as RelationGet... > You are right. I'll modify this in my next version. > I took the liberty to address the review comments and provide a v2 patch on top of your v1 patch, also merging the test case. Sorry for attaching. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index 2a1f9830e0..1cf59e0fb0 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -668,8 +668,7 @@ logicalrep_write_attrs(StringInfo out, Relation rel) /* fetch bitmap of REPLICATION IDENTITY attributes */ replidentfull = (rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL); if (!replidentfull) - idattrs = RelationGetIndexAttrBitmap(rel, - INDEX_ATTR_BITMAP_IDENTITY_KEY); + idattrs = RelationGetIdentityKeyBitmap(rel); /* send the attributes */ for (i = 0; i < desc->natts; i++) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 29702d6eab..ace167f324 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -5206,6 +5206,94 @@ restart: } } +Bitmapset * +RelationGetIdentityKeyBitmap(Relation relation) +{ + Bitmapset *idindexattrs; /* columns in the replica identity */ + List *indexoidlist; + List *newindexoidlist; + Oid relpkindex; + Oid relreplindex; + Relation indexDesc; + int i; + MemoryContext oldcxt; + + /* Quick exit if we already computed the result. */ + if (relation->rd_idattr != NULL) + return bms_copy(relation->rd_idattr); + + /* Fast path if definitely no indexes */ + if (!RelationGetForm(relation)->relhasindex) + return NULL; + + /* + * Get cached list of index OIDs. If we have to start over, we do so here. + */ +restart: + indexoidlist = RelationGetIndexList(relation); + + /* Fall out if no indexes (but relhasindex was set) */ + if (indexoidlist == NIL) + return NULL; + + /* Save some values to compare after building attributes */ + relpkindex = relation->rd_pkindex; + relreplindex = relation->rd_replidindex; + + /* + * build attributes to idindexattrs. 
+ */ + idindexattrs = NULL; + indexDesc = RelationIdGetRelation(relreplindex); + + /* Collect simple attribute references */ + for (i = 0; i < indexDesc->rd_index->indnatts; i++) + { + int attrnum = indexDesc->rd_index->indkey.values[i]; + + /* Romove non-key columns here. */ + if (attrnum != 0) + { + if (i < indexDesc->rd_index->indnkeyatts) +idindexattrs = bms_add_member(idindexattrs, + attrnum - FirstLowInvalidHeapAttributeNumber); + } + } + RelationClose(indexDesc); + + /* Check if we need to redo */ + newindexoidlist = RelationGetIndexList(relation); + if (equal(indexoidlist, newindexoidlist) && + relpkindex == relation->rd_pkindex && + relreplindex == relation->rd_replidindex) + { + /* Still the same index set, so proceed */ + list_free(newindexoidlist); + list_free(indexoidlist); + } + else + { + /* Gotta do it over ... might as well not leak memory */ + list_free(newindexoidlist); + list_free(indexoidlist); + bms_free(idindexattrs); + + goto restart; + } + + /* Don't leak the old values of these bitmaps, if any */ + bms_free(relation->rd_idattr); + relation->rd_idattr = NULL; + + /* Now save copies of the bitmaps in the relcache entry */ + oldcxt = MemoryContextSwitchTo(CacheMemoryContext); + relation->rd_idattr = bms_copy(idindexattrs); + MemoryContextSwitchTo(oldcxt); + + /* We return our original working copy for caller to play with */ + return idindexattrs; +} + /* * RelationGetExclusionInfo -- get info about index's exclusion constraint * diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h index 2fcdf79323..f772855ac6 100644 --- a/src/include/utils/relcache.h +++ b/src/include/utils/relcache.h @@ -65,6 +65,8 @@ typedef enum IndexAttrBitmapKind extern Bitmapset *Relati
Re: Bogus collation version recording in recordMultipleDependencies
On Fri, Apr 16, 2021 at 10:03:42AM -0400, Tom Lane wrote: > > Since the proposed patch removes the dependency code's special-case > handling of the default collation, I don't feel like we need to jump > through hoops to prove that the default collation is tracked the > same as other collations. A regression test with alternative outputs > is a significant ongoing maintenance burden, and I do not see where > we're getting a commensurate improvement in test coverage. Especially > since, AFAICS, the two alternative outputs would essentially have to > accept both the "it works" and "it doesn't work" outcomes. Fine by me, I was mentioning those in case we wanted to keep some extra coverage for that, but I agree it doesn't add much value. > So I propose that we do 0001 below, which is my first patch plus your > suggestion about fixing up create_index.sql. This passes check-world > for me under both C and en_US.utf8 prevailing locales. That's what I ended up with too, so LGTM!
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
On Thu, Apr 15, 2021 at 11:06 AM Matthias van de Meent wrote: > I've noticed there are a lot of places in the btree index > infrastructure (and also some other index AMs) that effectively > iterate over the attributes of the index tuple, but won't use > index_deform_tuple for reasons. However, this implies that they must > repeatedly call index_getattr, which in the worst case is O(n) for the > n-th attribute, slowing down extraction of multi-column indexes > significantly. As such, I've added some API that allows for iteration > (ish) over index attributes. Interesting approach. I think that in an ideal world we would have a tuple format with attribute lengths/offsets right in the header. But it's too late for that, so other approaches seem worth considering. > Also attached is 0002, which dynamically truncates attribute prefixes > of tuples whilst _binsrch-ing through a nbtree page. It greatly uses > the improved performance of 0001; they work very well together. The > problems that Peter (cc-ed) mentions in [0] only result in invalid > search bounds when traversing the tree, but on the page level valid > bounds can be constructed. > > This is patchset 1 of a series of patches I'm starting for eventually > adding static prefix truncation into nbtree infrastructure in > PostgreSQL. I've put up a wiki page [1] with my current research and > thoughts on that topic. The idea of making _bt_truncate() produce new leaf page high keys based on the lastleft tuple rather than the firstright tuple (i.e. +inf truncated attribute values rather than the current -inf) seems like a non-starter. As you point out in "1.) Suffix-truncation; -INF in high keys" on the Postgres wiki page, the current approach truncates firstright (not lastleft), making the left page's new high key contain what you call a 'foreign' value. But I see that as a big advantage of the current approach. Consider, for example, the nbtree high key "continuescan" optimization added by commit 29b64d1d. 
The fact that leaf page high keys are generated in this way kind of allows us to "peek" at the page to the immediate right before actually visiting it -- possibly without ever visiting it (which is where the benefit of that particular optimization lies). _bt_check_unique() uses a similar trick. After the Postgres 12 work, _bt_check_unique() will only visit a second page in the extreme case where we cannot possibly fit all of the relevant version duplicates on even one whole leaf page (i.e. practically never). There is also cleverness inside _bt_compare() to make sure that we handle the boundary cases perfectly while descending the tree. You might also consider how the nbtsplitloc.c code works with duplicates, and how that would be affected by +inf truncated attributes. The leaf-page-packing performed in the SPLIT_SINGLE_VALUE case only goes ahead when the existing high key confirms that this must be the rightmost page. Now, I guess that you could still do something like that if we switched to +inf semantics. But, the fact that the new right page will have a 'foreign' value in the SPLIT_SINGLE_VALUE-split case is also of benefit -- it is practically empty right after the split (since the original/left page is packed full), and we want this empty space to be eligible to either take more duplicates, or to take values that may happen to fit between the highly duplicated value and the original foreign high key value. We want that flexibility, I think. I also find -inf much more natural. If in the future we teach nbtree to truncate "inside" text attributes (say text columns), you'd pretty much be doing the same thing at the level of characters rather than whole columns. The -inf semantics are like strcmp() semantics. If you're going to pursue full prefix compression anyway, maybe you should use a low key on the leaf level in cases where the optimization is in use.
This creates complexity during page deletion, because the low key in the subtree to the right of the deletion target subtree may need to be updated. Perhaps you can find a way to make that work that isn't too complicated. > I've run some tests with regards to performance on my laptop; which > tests nbtree index traversal. The test is based on a recent UK land > registry sales prices dataset (25744780 rows), being copied from one > table into an unlogged table with disabled autovacuum, with one index > as specified by the result. Master @ 99964c4a, patched is with both > 0001 and 0002. The results are averages over 3 runs, with plain > configure, compiled by gcc (Debian 6.3.0-18+deb9u1). You should probably account for index size here. I have lots of my own tests for space utilization, using data from a variety of sources. -- Peter Geoghegan
Re: Bogus collation version recording in recordMultipleDependencies
Julien Rouhaud writes: > On Fri, Apr 16, 2021 at 10:03:42AM -0400, Tom Lane wrote: >> So I propose that we do 0001 below, which is my first patch plus your >> suggestion about fixing up create_index.sql. This passes check-world >> for me under both C and en_US.utf8 prevailing locales. > That's what I ended up with too, so LGTM! Pushed, thanks for review! (and I'll update the open items list in a sec) regards, tom lane
Re: Error when defining a set returning function
Dear Tom, Many thanks for answering my question so quickly. After your answer, I downloaded brand new versions of PostgreSQL 13.2, PostGIS 2.5.5, and compiled/installed with the standard parameters. I didn't get any error messages in the build. I then recompiled MobilityDB again and got the same error message. When debugging the function with gdb, I noticed that the rsinfo variable of the PostgreSQL function ExecMakeFunctionResultSet is modified in the macro SRF_RETURN_NEXT, causing the problem. Any idea how to solve this?

4353 SRF_RETURN_NEXT(funcctx, Int32GetDatum(result));
(gdb) up
#1 0x55b8a871fc56 in ExecMakeFunctionResultSet (fcache=0x55b8a8e6d9a0, econtext=0x55b8a8e6cfa0, argContext=0x55b8a9d00dd0, isNull=0x55b8a8e6d930, isDone=0x55b8a8e6d988) at /home/esteban/src/postgresql-13.2/build_dir/../src/backend/executor/execSRF.c:614
614 result = FunctionCallInvoke(fcinfo);
(gdb) p rsinfo
$5 = {type = T_ReturnSetInfo, econtext = 0x55b8a8e6cfa0, expectedDesc = 0x55b8a8e6e8f0, allowedModes = 3, returnMode = SFRM_ValuePerCall, isDone = ExprSingleResult, setResult = 0x0, setDesc = 0x0}
(gdb) n
4354 }
(gdb)
ExecMakeFunctionResultSet (fcache=0x55b8a8e6d9a0, econtext=0x55b8a8e6cfa0, argContext=0x55b8a9d00dd0, isNull=0x55b8a8e6d930, isDone=0x55b8a8e6d988) at /home/esteban/src/postgresql-13.2/build_dir/../src/backend/executor/execSRF.c:615
615 *isNull = fcinfo->isnull;
(gdb) p rsinfo
$6 = {type = T_ReturnSetInfo, econtext = 0x55b8a8e6cfa0, expectedDesc = 0x55b8a8e6e8f0, allowedModes = 3, returnMode = (SFRM_ValuePerCall | unknown: 256), isDone = ExprSingleResult, setResult = 0x0, setDesc = 0x0}
(gdb)
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: >> That's what I ended up with too, so LGTM! > Pushed, thanks for review! (and I'll update the open items list in a > sec) ... or maybe not just yet. Andres' buildfarm critters seem to have a different opinion than my machine about what the output of collate.icu.utf8 ought to be. I wonder what the prevailing LANG setting is for them, and which ICU version they're using. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
Hi, On 2021-04-16 12:55:28 -0400, Tom Lane wrote: > I wrote: > >> That's what I ended up with too, so LGTM! > > > Pushed, thanks for review! (and I'll update the open items list in a > > sec) > > ... or maybe not just yet. Andres' buildfarm critters seem to have > a different opinion than my machine about what the output of > collate.icu.utf8 ought to be. I wonder what the prevailing LANG > setting is for them, and which ICU version they're using.

andres@andres-pg-buildfarm-valgrind:~/src/pgbuildfarm-client-stock$ grep calliph *.conf
build-farm-copyparse.conf:animal => "calliphoridae",
build-farm-copyparse.conf:build_root => '/mnt/resource/andres/bf/calliphoridae',
andres@andres-pg-buildfarm-valgrind:~/src/pgbuildfarm-client-stock$ dpkg -l|grep icu
ii icu-devtools 67.1-6 amd64 Development utilities for International Components for Unicode
ii libicu-dev:amd64 67.1-6 amd64 Development files for International Components for Unicode
ii libicu67:amd64 67.1-6 amd64 International Components for Unicode
andres@andres-pg-buildfarm-valgrind:~/src/pgbuildfarm-client-stock$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

Greetings, Andres Freund
Re: Error when defining a set returning function
Esteban Zimanyi writes: > When debugging the function with gdb, I noticed that the rsinfo variable of > the PostgreSQL function ExecMakeFunctionResultSet is modified in the > macro SRF_RETURN_NEXT causing the problem. Any idea how to solve this? Well, what SRF_RETURN_NEXT thinks it's doing is rsi->isDone = ExprMultipleResult; \ which surely shouldn't change the returnMode field. At this point I'm guessing that you are compiling the PG headers with some compiler pragma that changes the struct packing rules. Don't do that. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: > ... or maybe not just yet. Andres' buildfarm critters seem to have > a different opinion than my machine about what the output of > collate.icu.utf8 ought to be. I wonder what the prevailing LANG > setting is for them, and which ICU version they're using. Oh, I bet it's "C.utf8", because I can reproduce the failure with that. This crystallizes a nagging feeling I'd had that you were misdescribing the collate.icu.utf8 test as not being run under --no-locale. Actually, it's only skipped if the encoding isn't UTF8, not the same thing. I think we need to remove the default-collation cases from that test too. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
Andres Freund writes: > On 2021-04-16 12:55:28 -0400, Tom Lane wrote: >> ... or maybe not just yet. Andres' buildfarm critters seem to have >> a different opinion than my machine about what the output of >> collate.icu.utf8 ought to be. I wonder what the prevailing LANG >> setting is for them, and which ICU version they're using. > LANG=C.UTF-8 I'd guessed that shortly later, but thanks for confirming. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: > Oh, I bet it's "C.utf8", because I can reproduce the failure with that. > This crystallizes a nagging feeling I'd had that you were misdescribing > the collate.icu.utf8 test as not being run under --no-locale. Actually, > it's only skipped if the encoding isn't UTF8, not the same thing. > I think we need to remove the default-collation cases from that test too. Hmm ... this is more subtle than it seemed. I tried to figure out where the default-collation dependencies were coming from, and it's quite non-obvious, at least for some of them. Observe:

u8de=# create table t1 (f1 text collate "fr_FR");
CREATE TABLE
u8de=# create index on t1(f1) where f1 > 'foo';
CREATE INDEX
u8de=# SELECT objid::regclass, refobjid::regcollation, refobjversion
FROM pg_depend d LEFT JOIN pg_class c ON c.oid = d.objid
WHERE refclassid = 'pg_collation'::regclass
AND coalesce(relkind, 'i') = 'i' AND relname LIKE 't1_%';
   objid   | refobjid  | refobjversion
-----------+-----------+---------------
 t1_f1_idx | "fr_FR"   | 2.28
 t1_f1_idx | "fr_FR"   | 2.28
 t1_f1_idx | "default" | 2.28
(3 rows)

(The "default" item doesn't show up if default collation is C, which is what's causing the buildfarm instability.) Now, it certainly looks like that index definition ought to only have fr_FR dependencies. I dug into it and discovered that the reason we're coming up with a dependency on "default" is that the WHERE clause looks like

{OPEXPR
 :opno 666
 :opfuncid 742
 :opresulttype 16
 :opretset false
 :opcollid 0
 :inputcollid 14484
 :args (
    {VAR
     :varno 1
     :varattno 1
     :vartype 25
     :vartypmod -1
     :varcollid 14484
     :varlevelsup 0
     :varnosyn 1
     :varattnosyn 1
     :location 23
    }
    {CONST
     :consttype 25
     :consttypmod -1
     :constcollid 100
     :constlen -1
     :constbyval false
     :constisnull false
     :location 28
     :constvalue 7 [ 28 0 0 0 102 111 111 ]
    }
 )
 :location 26
}

So sure enough, the comparison operator's inputcollid is fr_FR, but the 'foo' constant has constcollid = "default". 
That will have exactly zero impact on the semantics of the expression, but dependency.c doesn't realize that and reports it as a dependency anyway. I feel like this is telling us that there's a fundamental misunderstanding in find_expr_references_walker about which collation dependencies to report. It's reporting all the leaf-node collations, and ignoring the ones that actually count semantically, that is the inputcollid fields of function and operator nodes. Not sure what's the best thing to do here. Redesigning this post-feature-freeze doesn't seem terribly appetizing, but on the other hand, this index collation recording feature has put a premium on not overstating the collation dependencies of an expression. We don't want to tell users that an index is broken when it isn't really. regards, tom lane
Re: Forget close an open relation in ReorderBufferProcessTXN()
Hi, On 2021-04-16 08:42:40 +0530, Amit Kapila wrote: > I think it is because relation_open expects either caller to have a > lock on the relation or don't use 'NoLock' lockmode. AFAIU, we don't > need to acquire a lock on relation while decoding changes from WAL > because it uses a historic snapshot to build a relcache entry and all > the later changes to the rel are absorbed while decoding WAL. Right. > I think it is also important to *not* acquire any lock on relation > otherwise it can lead to some sort of deadlock or infinite wait in the > decoding process. Consider a case for operations like Truncate (or if > the user has acquired an exclusive lock on the relation in some other > way say via Lock command) which acquires an exclusive lock on > relation, it won't get replicated in synchronous mode (when > synchronous_standby_name is configured). The truncate operation will > wait for the transaction to be replicated to the subscriber and the > decoding process will wait for the Truncate operation to finish. However, this cannot be really relied upon for catalog tables. An output function might acquire locks or such. But for those we do not need to decode contents... This made me take a brief look at pgoutput.c - maybe I am missing something, but how is the following not a memory leak? static void maybe_send_schema(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, ReorderBufferChange *change, Relation relation, RelationSyncEntry *relentry) { ... /* Map must live as long as the session does. */ oldctx = MemoryContextSwitchTo(CacheMemoryContext); relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc), CreateTupleDescCopy(outdesc)); MemoryContextSwitchTo(oldctx); send_relation_and_attrs(ancestor, xid, ctx); RelationClose(ancestor); If - and that's common - convert_tuples_by_name() won't have to do anything, the copied tuple descs will be permanently leaked. Greetings, Andres Freund
pg_amcheck option to install extension
Hi, Peter Geoghegan suggested that I have the cross version upgrade checker run pg_amcheck on the upgraded module. This seemed to me like a good idea, so I tried it, only to find that it refuses to run unless the amcheck extension is installed. That's fair enough, but it also seems to me like we should have an option to have pg_amcheck install the extension if it's not present, by running something like 'create extension if not exists amcheck'. Maybe in such a case there could also be an option to drop the extension when pg_amcheck's work is done - I haven't thought through all the implications. Given pg_amcheck is a new piece of work I'm not sure if we can sneak this in under the wire for release 14. I will certainly undertake to review anything expeditiously. I can work around this issue in the buildfarm, but it seems like something other users are likely to want. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Error when defining a set returning function
Many thanks Tom for your help ! I removed the flag -fshort-enums and everything works fine ! On Fri, Apr 16, 2021 at 7:04 PM Tom Lane wrote: > Esteban Zimanyi writes: > > When debugging the function with gdb, I noticed that the rsinfo variable > of > > the PostgreSQL function ExecMakeFunctionResultSet is modified in the > > macro SRF_RETURN_NEXT causing the problem. Any idea how to solve this? > > Well, what SRF_RETURN_NEXT thinks it's doing is > > rsi->isDone = ExprMultipleResult; \ > > which surely shouldn't change the returnMode field. At this point > I'm guessing that you are compiling the PG headers with some compiler > pragma that changes the struct packing rules. Don't do that. > > regards, tom lane >
Re: ANALYZE counts LP_DEAD line pointers as n_dead_tup
On Wed, Apr 14, 2021 at 7:11 AM Masahiko Sawada wrote: > If we create a table with vacuum_index_cleanup = off or execute VACUUM > with INDEX_CLEANUP = off, vacuum updates pg_stat_all_tables.n_dead_tup > to the number of HEAPTUPLE_RECENTLY_DEAD tuples. Whereas analyze > updates it to the sum of the number of HEAPTUPLE_DEAD/RECENTLY_DEAD > tuples and LP_DEAD line pointers. So if the table has many LP_DEAD > line pointers due to skipping index cleanup, autovacuum is triggered > every time after analyze/autoanalyze. This issue seems to happen also > on back branches, probably from 12 where INDEX_CLEANUP option was > introduced. Hmm. > I think we can have heapam_scan_analyze_next_tuple() not count LP_DEAD > line pointer as lazy_scan_prune() does. Attached the patch for that. lazy_scan_prune() is concerned about what the state of the table *will be* when VACUUM finishes, based on its working assumption that index vacuuming and heap vacuuming always go ahead. This is exactly the same reason why lazy_scan_prune() will set LVPagePruneState.hastup to 'false' in the presence of an LP_DEAD item -- which is not how count_nondeletable_pages() considers whether the same page 'hastup' much later on, right at the end of the VACUUM (it will only consider the page safe to truncate away if it now only contains LP_UNUSED items -- LP_DEAD items make heap/table truncation unsafe). In general accounting rules like this may need to work slightly differently across near-identical functions because of "being versus becoming" issues. It is necessary to distinguish between "being" code (e.g., this ANALYZE code, count_nondeletable_pages() and its hastup issue) and "becoming" code (e.g., lazy_scan_prune() and its approach to counting "remaining" dead tuples as well as hastup-ness). I tend to doubt that your patch is the right approach because the two code paths already "agree" once you assume that the LP_DEAD items that lazy_scan_prune() sees will be gone at the end of the VACUUM. 
I do agree that this is a problem, though. Generally speaking, the "becoming" code from lazy_scan_prune() is not 100% sure that it will be correct in each case, for a large variety of reasons. But I think that we should expect it to be mostly correct. We definitely cannot allow it to be quite wrong all the time with some workloads. And so I agree that this is a problem for the INDEX_CLEANUP = off case, though it's equally an issue for the recently added failsafe mechanism. I do not believe that it is a problem for the bypass-indexes optimization, though, because that is designed to only be applied when there are practically zero LP_DEAD items. The optimization can make VACUUM record that there are zero dead tuples after the VACUUM finishes, even though there were in fact a very small non-zero number of dead tuples -- but that's not appreciably different from any of the other ways that the count of dead tuples could be inaccurate (e.g. concurrent opportunistic pruning). The specific tests that we apply inside lazy_vacuum() should make sure that autovacuum scheduling is never affected. The autovacuum scheduling code can safely "believe" that the indexes were vacuumed, because it really is the same as if there were precisely zero LP_DEAD items (or the same for all practical purposes). I'm not sure what to do, though. Both the INDEX_CLEANUP = off case and the failsafe case are only intended for emergencies. And it's hard to know what to do in a code path that is designed to rarely or never be used. -- Peter Geoghegan
Re: default_tablespace doc and partitioned rels
On 2021-Apr-16, Justin Pryzby wrote: > +++ b/doc/src/sgml/config.sgml > @@ -7356,7 +7356,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH > csv; > > This variable specifies the default tablespace in which to create > objects (tables and indexes) when a CREATE > command does > -not explicitly specify a tablespace. > +not explicitly specify a tablespace. It also determines the > tablespace > +that a partitioned relation will direct future partitions to. > > > default_tablespace is a global GUC, so if a partitioned relation "directs" > partitions anywhere, it's not to the fallback value of the GUC, but to its > reltablespace, as this patch wrote in doc/src/sgml/ref/create_table.sgml: Yes, but also the partitioned table's reltablespace is going to be set to default_tablespace, if no tablespace is explicitly specified in the partitioned table creation. A partitioned table is not created anywhere itself; the only thing it can do, is direct where are future partitions created. I don't think it's 100% obvious that default_tablespace will become the partitioned table's tablespace, which is why I added that phrase. I understand that the language might be unclear, but I don't think either of your suggestions make this any clearer. Removing it just hides the behavior, and this one: > +... It also determines the tablespace where new partitions are > created, > +if the parent, partitioned relation doesn't have a tablespace set. just documents that default_tablespace will be in effect at partition CREATE time, but it fails to remind the user that the partitioned table will acquire default_tablespace as its own tablespace. Maybe we can reword it in some other way. "If this parameter is set when a partitioned table is created, it will become the default tablespace for future partitions too, even if default_tablespace itself is reset later" ...?? 
-- Álvaro Herrera 39°49'30"S 73°17'W "En las profundidades de nuestro inconsciente hay una obsesiva necesidad de un universo lógico y coherente. Pero el universo real se halla siempre un paso más allá de la lógica" (Irulan)
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: > I feel like this is telling us that there's a fundamental > misunderstanding in find_expr_references_walker about which > collation dependencies to report. It's reporting all the > leaf-node collations, and ignoring the ones that actually > count semantically, that is the inputcollid fields of > function and operator nodes. > Not sure what's the best thing to do here. Redesigning > this post-feature-freeze doesn't seem terribly appetizing, > but on the other hand, this index collation recording > feature has put a premium on not overstating the collation > dependencies of an expression. We don't want to tell users > that an index is broken when it isn't really. I felt less hesitant to modify find_expr_references_walker's behavior w.r.t. collations after realizing that most of it was not of long standing, but came in with 257836a75. So here's a draft patch that redesigns it as suggested above. Along the way I discovered that GetTypeCollations was quite broken for ranges and arrays, so this fixes that too. Per the changes in collate.icu.utf8.out, this gets rid of a lot of imaginary collation dependencies, but it also gets rid of some arguably-real ones. In particular, calls of record_eq and its siblings will be considered not to have any collation dependencies, although we know that internally those will look up per-column collations of their input types. We could imagine special-casing record_eq etc here, but that sure seems like a hack. I'm starting to have a bad feeling about 257836a75 overall. As I think I've complained before, I do not like anything about what it's done to pg_depend; it's forcing that relation to serve two masters, neither one well. 
We now see that the same remark applies to find_expr_references(), because the semantics of "which collations does this expression's behavior depend on" aren't identical to "which collations need to be recorded as direct dependencies of this expression", especially not if you'd prefer to minimize either list. (Which is important.) Moreover, for all the complexity it's introducing, it's next door to useless for glibc collations --- we might as well tell people "reindex everything when your glibc version changes", which could be done with a heck of a lot less infrastructure. The situation on Windows looks pretty user-unfriendly as well, per the other thread. So I wonder if, rather than continuing to pursue this right now, we shouldn't revert 257836a75 and try again later with a new design that doesn't try to commandeer the existing dependency infrastructure. We might have a better idea about what to do on Windows by the time that's done, too.

regards, tom lane

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 8d8e926c21..41093ea6ae 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1835,8 +1835,17 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
  * the datatype.  However we do need a type dependency if there is no such
  * indirect dependency, as for example in Const and CoerceToDomain nodes.
  *
- * Similarly, we don't need to create dependencies on collations except where
- * the collation is being freshly introduced to the expression.
+ * Collations are handled primarily by recording the inputcollid's of node
+ * types that have them, as those are the ones that are semantically
+ * significant during expression evaluation.  We also record the collation of
+ * CollateExpr nodes, since those will be needed to print such nodes even if
+ * they don't really affect semantics.  Collations of leaf nodes such as Vars
+ * can be ignored on the grounds that if they're not default, they came from
+ * the referenced object (e.g., a table column), so the dependency on that
+ * object is enough.  (Note: in a post-const-folding expression tree, a
+ * CollateExpr's collation could have been absorbed into a Const or
+ * RelabelType node.  While ruleutils.c prints such collations for clarity,
+ * we may ignore them here as they have no semantic effect.)
  */
 static bool
 find_expr_references_walker(Node *node,
@@ -1876,29 +1885,6 @@ find_expr_references_walker(Node *node,
 			/* If it's a plain relation, reference this column */
 			add_object_address(OCLASS_CLASS, rte->relid, var->varattno,
 							   context->addrs);
-
-			/* Top-level collation if valid */
-			if (OidIsValid(var->varcollid))
-				add_object_address(OCLASS_COLLATION, var->varcollid, 0,
-								   context->addrs);
-			/* Otherwise, it may be a type with internal collations */
-			else if (var->vartype >= FirstNormalObjectId)
-			{
-				List	   *collations;
-				ListCell   *lc;
-
-				collations = GetTypeCollations(var->vartype);
-
-				foreach(lc, collations)
-				{
-					Oid			coll = lfirst_oid(lc);
-
-					if (OidIsValid(coll))
-						add_object_address(OCLASS_COLLATION,
-										   lfirst_oid(lc), 0,
-										   context->addrs);
-				}
-			}
 		}

 		/*
@@ -
Re: Error when defining a set returning function
On 4/16/21 3:32 PM, Esteban Zimanyi wrote: > Many thanks Tom for your help ! > > I removed the flag -fshort-enums and everything works fine ! > > If you build with pgxs it should supply the appropriate compiler flags. Alternatively, get the right settings from pg_config. In general rolling your own is a bad idea. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: default_tablespace doc and partitioned rels
On Fri, Apr 16, 2021 at 04:19:18PM -0400, Alvaro Herrera wrote: > Maybe we can reword it in some other way. "If this parameter is set > when a partitioned table is created, it will become the default > tablespace for future partitions too, even if default_tablespace itself > is reset later" ...?? +1 I'd say: If this parameter is set when a partitioned table is created, the partitioned table's tablespace will be set to the given tablespace, which will also be the default tablespace for partitions created in the future, even if default_tablespace itself has since been changed. -- Justin
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
On Fri, 16 Apr 2021 at 18:03, Peter Geoghegan wrote: > > On Thu, Apr 15, 2021 at 11:06 AM Matthias van de Meent > wrote: > > I've noticed there are a lot of places in the btree index > > infrastructure (and also some other index AMs) that effectively > > iterate over the attributes of the index tuple, but won't use > > index_deform_tuple for reasons. However, this implies that they must > > repeatedly call index_getattr, which in the worst case is O(n) for the > > n-th attribute, slowing down extraction of multi-column indexes > > significantly. As such, I've added some API that allows for iteration > > (ish) over index attributes. > > Interesting approach. I think that in an ideal world we would have a > tuple format with attribute lengths/offsets right in the header. I believe that that would indeed be ideal w.r.t. access speed, but also quite expensive w.r.t. amount of data stored. This would add 2 bytes per attribute in the current infrastructure (11 bits at least for each attribute to store offsets), on the current 12 bytes of overhead per indextuple (= 8 for IndexTuple + 4 for ItemId). That is probably always going to be a non-starter, seeing that we can relatively easily optimize our current attribute access patterns. > But > it's too late for that, so other approaches seem worth considering. Yep. > > Also attached is 0002, which dynamically truncates attribute prefixes > > of tuples whilst _binsrch-ing through a nbtree page. It greatly uses > > the improved performance of 0001; they work very well together. The > > problems that Peter (cc-ed) mentions in [0] only result in invalid > > search bounds when traversing the tree, but on the page level valid > > bounds can be constructed. > > > > This is patchset 1 of a series of patches I'm starting for eventually > > adding static prefix truncation into nbtree infrastructure in > > PostgreSQL. I've put up a wiki page [1] with my current research and > > thoughts on that topic. 
> > The idea of making _bt_truncate() produce new leaf page high keys > based on the lastleft tuple rather than the firstright tuple (i.e. > +inf truncated attribute values rather than the current -inf) seems > like a non-starter. As you point out in "1.) Suffix-truncation; -INF > in high keys" on the Postgres wiki page, the current approach > truncates firstright (not lastleft), making the left page's new high > key contain what you call a 'foreign' value. But I see that as a big > advantage of the current approach. > > Consider, for example, the nbtree high key "continuescan" optimization > added by commit 29b64d1d. The fact that leaf page high keys are > generated in this way kind of allows us to "peek" on the page to the > immediate right before actually visiting it -- possibly without ever > visiting it (which is where the benefit of that particular > optimization lies). _bt_check_unique() uses a similar trick. After the > Postgres 12 work, _bt_check_unique() will only visit a second page in > the extreme case where we cannot possibly fit all of the relevant > version duplicates on even one whole leaf page (i.e. practically > never). There is also cleverness inside _bt_compare() to make sure > that we handle the boundary cases perfectly while descending the tree. I understand and appreciate that the "-INF" truncation that is currently in place is being relied upon in quite a few places. Part of the effort for "+INF" truncation would be determining where and how to keep the benefits of the "-INF" truncation. I also believe that for internal pages truncating to "+INF" would be perfectly fine; the optimizations that I know of only rely on it at the leaf level. Completely separate from that, there's no reason (except for a potential lack of unused bits) we can't flag suffix-truncated columns as either "+INF" or "-INF" - that would allow us to apply each where useful. 
> You might also consider how the nbtsplitloc.c code works with > duplicates, and how that would be affected by +inf truncated > attributes. The leaf-page-packing performed in the SPLIT_SINGLE_VALUE > case only goes ahead when the existing high key confirms that this > must be the rightmost page. Now, I guess that you could still do > something like that if we switched to +inf semantics. But, the fact > that the new right page will have a 'foreign' value in the > SPLIT_SINGLE_VALUE-split case is also of benefit -- it is practically > empty right after the split (since the original/left page is packed > full), and we want this empty space to be eligible to either take more > duplicates, or to take values that may happen to fit between the > highly duplicated value and the original foreign high key value. We > want that flexibility, I think. > > I also find -inf much more natural. If in the future we teach nbtree > to truncate "inside" text attributes (say text columns), you'd pretty > much be doing the same thing at the level of characters rather than > whole columns. The -inf semantics are like strcmp() semantics.
Re: Bogus collation version recording in recordMultipleDependencies
On Sat, Apr 17, 2021 at 8:39 AM Tom Lane wrote: > Per the changes in collate.icu.utf8.out, this gets rid of > a lot of imaginary collation dependencies, but it also gets > rid of some arguably-real ones. In particular, calls of > record_eq and its siblings will be considered not to have > any collation dependencies, although we know that internally > those will look up per-column collations of their input types. > We could imagine special-casing record_eq etc here, but that > sure seems like a hack. Thanks for looking into all this. Hmm. > I"m starting to have a bad feeling about 257836a75 overall. > As I think I've complained before, I do not like anything about > what it's done to pg_depend; it's forcing that relation to serve > two masters, neither one well. ... We did worry about (essentially) this question quite a bit in the discussion thread, but we figured that you'd otherwise have to create a parallel infrastructure that would look almost identical (for example [1]). > ... We now see that the same remark > applies to find_expr_references(), because the semantics of > "which collations does this expression's behavior depend on" aren't > identical to "which collations need to be recorded as direct > dependencies of this expression", especially not if you'd prefer > to minimize either list. (Which is important.) ... Bugs in the current analyser code aside, if we had a second catalog and a second analyser for this stuff, then you'd still have the union of both minimised sets in total, with some extra duplication because you'd have some rows in both places that are currently handled by one row, no? > ... Moreover, for all > the complexity it's introducing, it's next door to useless for > glibc collations --- we might as well tell people "reindex > everything when your glibc version changes", which could be done > with a heck of a lot less infrastructure. ... 
You do gain reliable tracking of which indexes remain to be rebuilt, and warnings for common hazards like hot standbys with mismatched glibc, so I think it's pretty useful. As for the poverty of information from glibc, I don't see why it should hold ICU, Windows, FreeBSD users back. In fact I am rather hoping that by shipping this, glibc developers will receive encouragement to add the trivial interface we need to do better. > ... The situation on Windows > looks pretty user-unfriendly as well, per the other thread. That is unfortunate, it seems like such a stupid problem. Restating here for the sake of the list: initdb just needs to figure out how to ask for the current environment's locale in BCP 47 format ("en-US") when setting the default for your template databases, not the traditional format ("English_United States.1252") that Microsoft explicitly tells us not to store in databases and that doesn't work in the versioning API, but since we're mostly all Unix hackers we don't know how. > So I wonder if, rather than continuing to pursue this right now, > we shouldn't revert 257836a75 and try again later with a new design > that doesn't try to commandeer the existing dependency infrastructure. > We might have a better idea about what to do on Windows by the time > that's done, too. It seems to me that there are two things that would be needed to salvage this for PG14: (1) deciding that we're unlikely to come up with a better idea than using pg_depend for this (following the argument that it'd only create duplication to have a parallel dedicated catalog), (2) fixing any remaining flaws in the dependency analyser code. I'll look into the details some more on Monday. [1] https://www.postgresql.org/message-id/e9e22c5e-c018-f4ea-24c8-5b6d6fdacf30%402ndquadrant.com
Re: Bogus collation version recording in recordMultipleDependencies
Thomas Munro writes: > I'll look into the details some more on Monday. Fair enough. Although there are only a few buildfarm members complaining, I don't really want to leave them red all weekend. I could either commit the patch I just presented, or revert ef387bed8 ... got a preference? regards, tom lane
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
On Fri, Apr 16, 2021 at 2:20 PM Matthias van de Meent wrote: > > Interesting approach. I think that in an ideal world we would have a > > tuple format with attribute lengths/offsets right in the header. > > I believe that that would indeed be ideal w.r.t. access speed, but > also quite expensive w.r.t. amount of data stored. This would add 2 > bytes per attribute in the current infrastructure (11 bits at least > for each attribute to store offsets), on the current 12 bytes of > overhead per indextuple (= 8 for IndexTuple + 4 for ItemId). That is > probably always going to be a non-starter, seeing that we can > relatively easily optimize our current attribute access patterns. I don't think that that's why it's a non-starter. This design assumes a world in which everything has already been optimized for this layout. You no longer get to store the varlena header inline, which would break a lot of things in Postgres were it ever to be attempted. The space efficiency issues don't really apply because you have an offset for fixed-length types -- their presence is always implied. I think that you need to encode NULLs differently, which is a lot less space efficient when there are a lot of NULLs. But on the whole this design seems more efficient than what we have currently. This representation of index tuples would be a totally reasonable design were we in a green field situation. (Which is pretty far from the situation we're actually in, of course.) > I understand and appreciate that the "-INF" truncation that is > currently in place is being relied upon in quite some places. Part of > the effort for "+INF" truncation would be determining where and how to > keep the benefits of the "-INF" truncation. I also believe that for > internal pages truncating to "+INF" would be perfectly fine; the > optimizations that I know of only rely on it at the leaf level. I don't doubt that there is nothing special about -inf from a key space point of view. 
Actually...you could say that -inf is special to the limited extent that we know it only appears in pivot tuples and exploit that property when the !pivotsearch case/optimization is used. But that isn't much of an exception at a high level, so whatever. Anyway, it follows that +inf could in principle be used instead in some or all cases -- all that is truly essential for correctness is that the invariants always be respected. We're still in agreement up until here. > Yes, I also read and appreciate your comments on +inf vs -inf when > this came up in [0]. I'm impressed that you've done your homework on this. > However, if we could choose, I think that having > both options could be quite beneficial, especially when dealing with > many duplicates or duplicate prefixes. This is where things are much less clear -- maybe we're not in agreement here. Who knows, though -- maybe you're right. But you haven't presented any kind of argument. I understand that it's hard to articulate what effects might be in play with stuff like this, so I won't force the issue now. Strong evidence is of course the only way that you'll reliably convince me of this. I should point out that I am a little confused about how this +inf business could be both independently useful and pivotal to implementing [dynamic] prefix truncation/compression. Seems...weird to discuss them together, except maybe to mention in passing that this +inf thing is notable for particularly helping dynamic prefix stuff -- which is it? It is my strong preference that nbtsplitloc.c continue to know approximately nothing about compression or deduplication. 
While it is true that nbtsplitloc.c's _bt_recsplitloc() is aware of posting lists, this is strictly an optimization that is only justified by the fact that posting lists are sometimes very large, and therefore worth considering directly -- just to get a more accurate idea of how a relevant split point choice affects the balance of free space (we don't bother to do the same thing with non-key INCLUDE columns because they're generally small and equi-sized). And so this _bt_recsplitloc() thing is no exception to the general rule, which is: deduplication/posting list maintenance should be *totally* orthogonal to the page split choice logic (the design of posting list splits helps a lot with that). We can afford to have complicated split point choice logic because the question of which split point is optimal is totally decoupled from the question of which ones are correct -- in particular, from the correctness of the space accounting used to generate candidate split points. It may interest you to know that I once thought that it would be nice to have the *option* of +inf too, so that we could use it in very rare cases like the pathological SPLIT_MANY_DUPLICATES case that _bt_bestsplitloc() has some defenses against. It would perhaps be nice if we could use +inf selectively in that case. I never said anything about this publicly before now, mostly because it wasn't that important --
PATCH: generate fractional cheapest paths in generate_orderedappend_path
Hi,

This patch is a WIP fix for the issue described in [1], where the planner picks a more expensive plan with partition-wise joins enabled, and disabling this option produces a cheaper plan. That's a bit strange because with the option disabled we consider *fewer* plans, so we should not be able to generate a cheaper one.

The problem lies in generate_orderedappend_paths(), which builds two types of append paths - one with minimal startup cost, and one with minimal total cost. That however does not work for queries with LIMIT, which also need to consider the fractional cost; the path that is interesting from this perspective may be eliminated by the other two. Consider three paths (this comes from the reproducer shared in [1]):

A: nestloop_path   startup 0.585000      total 35708.292500
B: nestloop_path   startup 0.292500      total 150004297.292500
C: mergejoin_path  startup 9748.112737   total 14102.092737

With some reasonable LIMIT value (e.g. 5% of the data), we really want to pick path A. But that path is dominated both in startup cost (by B) and in total cost (by C). Hence generate_orderedappend_paths() will ignore it, and we'll generate a more expensive LIMIT plan.

In [2] Tom proposed to modify generate_orderedappend_paths() to also consider the fractional cheapest path, just like we do for startup and total costs. The attached patch does exactly that, and it does seem to do the trick. There are some loose ends, though:

1) If get_cheapest_fractional_path_for_pathkeys returns NULL, it's not clear whether to default to cheapest_startup or cheapest_total. We might also consider an incremental sort, in which case the startup cost matters (for Sort it does not). So we may need an enhanced version of get_cheapest_fractional_path_for_pathkeys that generates such paths.

2) Same for cheapest_total - maybe there's a partially sorted path, and using it with an incremental sort on top would be better than using cheapest_total_path + sort.

3) Not sure if get_cheapest_fractional_path_for_pathkeys should worry about require_parallel_safe, just like the other functions nearby.

I'd argue that this patch does not need to add the Incremental Sort capabilities in (1) and (2) - it's just another place where we decided not to consider this sort variant yet.

I'm not sure how much this depends on partition-wise join, and why disabling it generates the right plan. The reproducer uses that, but AFAICS generate_orderedappend_paths() has worked like this since 2010 (commit 11cad29c915). I'd bet the issue exists since then and it's possible to get similar cases even without partition-wise join. I can reproduce it on PostgreSQL 12, though (which however supports partition-wise join). Not sure whether this should be backpatched. We didn't get any reports until now, so it doesn't seem like a pressing issue. OTOH most users won't actually notice this, they'll just get worse plans without realizing there's a better option.

regards

[1] https://www.postgresql.org/message-id/011937a3-7427-b99f-13f1-c07a127cf94c%40enterprisedb.com
[2] https://www.postgresql.org/message-id/4006636.1618577893%40sss.pgh.pa.us

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index edba5e49a8..0284162034 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1716,6 +1716,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *pathkeys = (List *) lfirst(lcp);
 		List	   *startup_subpaths = NIL;
 		List	   *total_subpaths = NIL;
+		List	   *fractional_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
 		bool		match_partition_order;
@@ -1745,7 +1746,8 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		{
 			RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
 			Path	   *cheapest_startup,
-					   *cheapest_total;
+					   *cheapest_total,
+					   *cheapest_fractional = NULL;
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
@@ -1761,6 +1763,21 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 											   TOTAL_COST,
 											   false);
 
+			/*
+			 * XXX strange that get_cheapest_fractional_path_for_pathkeys
+			 * does not have require_parallel_safe.
+			 */
+			if (root->tuple_fraction > 0)
+			{
+				double		path_fraction = 1.0 / root->tuple_fraction;
+
+				cheapest_fractional =
+					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
+															  pathkeys,
+															  NULL,
+															  path_fraction);
+			}
+
 			/*
 			 * If we can't find any paths with the right order just use the
 			 * cheapest-total path; we'll have to sort it later.
@@ -1773,6 +1790,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				Assert(cheapest_total->param_info == NULL);
 			}
 
+			/*
+			 * XXX Do we need to do something about
Re: Bogus collation version recording in recordMultipleDependencies
On Sat, Apr 17, 2021 at 10:47 AM Tom Lane wrote: > Thomas Munro writes: > > I'll look into the details some more on Monday. > > Fair enough. > > Although there are only a few buildfarm members complaining, I don't > really want to leave them red all weekend. I could either commit the > patch I just presented, or revert ef387bed8 ... got a preference? +1 for committing the new patch for now. I will look into the record problem. More in a couple of days.
Re: ANALYZE counts LP_DEAD line pointers as n_dead_tup
On Fri, Apr 16, 2021 at 1:16 PM Peter Geoghegan wrote: > I'm not sure what to do, though. Both the INDEX_CLEANUP = off case and > the failsafe case are only intended for emergencies. And it's hard to > know what to do in a code path that is designed to rarely or never be > used. How about just documenting it in comments, as in the attached patch? I tried to address all of the issues with LP_DEAD accounting together. Both the issue raised by Masahiko, and one or two others that were also discussed recently on other threads. They all seem kind of related to me. I didn't address the INDEX_CLEANUP = off case in the comments directly (I just addressed the failsafe case). There is no good reason to think that the situation will resolve with INDEX_CLEANUP = off, so it didn't seem wise to mention it there. But that should still be okay -- INDEX_CLEANUP = off has practically been superseded by the failsafe, since it is much more flexible. And anybody that uses INDEX_CLEANUP = off cannot expect to skip index cleanup forever without seriously bad consequences all over the place.

-- Peter Geoghegan

From 5d24338e90f0f3eec7134c328d40270f2c50 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Fri, 9 Apr 2021 18:50:25 -0700
Subject: [PATCH] VACUUM accounting: Explain LP_DEAD accounting.

---
 src/backend/access/heap/vacuumlazy.c | 70 +++-
 1 file changed, 47 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9f9ba5d308..ad37f25e2a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -686,7 +686,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 new_min_multi,
 						 false);
 
-	/* report results to the stats collector, too */
+	/*
+	 * Report results to the stats collector, too.
+	 *
+	 * We deliberately avoid telling the stats collector about LP_DEAD items
+	 * that remain in the table when index/heap vacuuming has been bypassed by
+	 * the failsafe mechanism.  Avoid behaving too aggressively once the
+	 * danger of wraparound failure subsides.  Autoanalyze should notice any
+	 * remaining LP_DEAD items and schedule an autovacuum if nothing else
+	 * does.
+	 */
 	pgstat_report_vacuum(RelationGetRelid(rel),
 						 rel->rd_rel->relisshared,
 						 Max(new_live_tuples, 0),
@@ -1334,6 +1343,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 */
 	lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate);
 
+	Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+	Assert(!all_visible_according_to_vm || prunestate.all_visible);
+
 	/* Remember the location of the last page with nonremovable tuples */
 	if (prunestate.hastup)
 		vacrel->nonempty_pages = blkno + 1;
@@ -1404,7 +1416,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * Handle setting visibility map bit based on what the VM said about
 	 * the page before pruning started, and using prunestate
 	 */
-	Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
 	if (!all_visible_according_to_vm && prunestate.all_visible)
 	{
 		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
@@ -1737,19 +1748,22 @@ retry:
 	/*
 	 * LP_DEAD items are processed outside of the loop.
 	 *
-	 * Note that we deliberately don't set hastup=true in the case of an
-	 * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
+	 * Our working assumption is that any LP_DEAD items we encounter here
+	 * will become LP_UNUSED inside lazy_vacuum_heap_page() before VACUUM
+	 * finishes.
+	 *
+	 * This is why the number of LP_DEAD items won't be reported to the
+	 * stats collector -- we simply won't leave any behind.  (Actually,
+	 * cases where we bypass index vacuuming will in fact leave behind
+	 * LP_DEAD items, but that only happens in special circumstances.  We
+	 * avoid trying to compensate for that in our later report to the
+	 * stats collector.)
+	 *
+	 * This is also why we deliberately don't set hastup=true in the case
+	 * of an LP_DEAD item.  This is not how lazy_check_needs_freeze() or
 	 * count_nondeletable_pages() do it -- they only consider pages empty
 	 * when they only have LP_UNUSED items, which is important for
 	 * correctness.
-	 *
-	 * Our assumption is that any LP_DEAD items we encounter here will
-	 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-	 * call count_nondeletable_pages().  In any case our opinion of
-	 * whether or not a page 'hastup' (which is how our caller sets its
-	 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-	 * treated as advisory/unreliable, so we might as well be slightly
-	 * optimistic.
 	 */
 	if (ItemIdIsDead(itemid))
 	{
@@ -1901,9 +1915,6 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * L
Re: terminate called after throwing an instance of 'std::bad_alloc'
Hi, On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote: > I'd be happy to run with a prototype fix for the leak to see if the other > issue > does (not) recur. I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look! Greetings, Andres Freund
Re: Bogus collation version recording in recordMultipleDependencies
Thomas Munro writes: > On Sat, Apr 17, 2021 at 10:47 AM Tom Lane wrote: >> Although there are only a few buildfarm members complaining, I don't >> really want to leave them red all weekend. I could either commit the >> patch I just presented, or revert ef387bed8 ... got a preference? > +1 for committing the new patch for now. I will look into the > record problem. More in a couple of days. OK, done. regards, tom lane
Re: terminate called after throwing an instance of 'std::bad_alloc'
On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > I'd be happy to run with a prototype fix for the leak to see if the other issue does (not) recur.
>
> I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look!

This doesn't seem to address the problem triggered by the reproducer at https://www.postgresql.org/message-id/20210331040751.gu4...@telsasoft.com (sorry I didn't CC you)

CREATE OR REPLACE FUNCTION cfn() RETURNS void LANGUAGE PLPGSQL AS $$
declare
    a record;
begin
    FOR a IN SELECT generate_series(1,99) LOOP
        PERFORM format('select 1');
    END LOOP;
END
$$;

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql 2>&1 |grep 'max resident'

-- Justin
Re: terminate called after throwing an instance of 'std::bad_alloc'
On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > I'd be happy to run with a prototype fix for the leak to see if the other issue does (not) recur.
> >
> > I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look!
>
> This doesn't seem to address the problem triggered by the reproducer at https://www.postgresql.org/message-id/20210331040751.gu4...@telsasoft.com (sorry I didn't CC you)

I take that back - I forgot that this doesn't release RAM until hitting a threshold.

Without the patch, it looks like:

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
!       61820 kB max resident size
!       65020 kB max resident size
!       68812 kB max resident size
!       71152 kB max resident size
!       76820 kB max resident size
!       78760 kB max resident size
!       81140 kB max resident size
!       83520 kB max resident size
!       93084 kB max resident size
!       94756 kB max resident size
!       96416 kB max resident size

With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like this:

!       61436 kB max resident size
!       61572 kB max resident size
!       63236 kB max resident size
!       63236 kB max resident size
!       63556 kB max resident size
!       63556 kB max resident size
!       63880 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size

-- Justin
RE: Truncate in synchronous logical replication failed
On Saturday, April 17, 2021 12:53 AM Japin Li > On Fri, 16 Apr 2021 at 17:19, osumi.takami...@fujitsu.com > wrote: > > On Friday, April 16, 2021 5:50 PM Amit Kapila > wrote: > >> On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com > >> wrote: > >> > > >> > > Thanks for your reminder. It might be a way to solve this problem. > >> > Yeah. I've made the 1st patch for this issue. > >> > > >> > In my env, with the patch > >> > the TRUNCATE in synchronous logical replication doesn't hang. > >> > > >> > >> Few initial comments: > >> = > >> 1. > >> + relreplindex = relation->rd_replidindex; > >> + > >> + /* > >> + * build attributes to idindexattrs. > >> + */ > >> + idindexattrs = NULL; > >> + foreach(l, indexoidlist) > >> + { > >> + Oid indexOid = lfirst_oid(l); > >> + Relation indexDesc; > >> + int i; > >> + bool isIDKey; /* replica identity index */ > >> + > >> + indexDesc = RelationIdGetRelation(indexOid); > >> > >> When you have oid of replica identity index (relreplindex) then what > >> is the need to traverse all the indexes? > > Ok. No need to traverse all the indexes. Will fix this part. > > > >> 2. > >> It is better to name the function as RelationGet... > > You are right. I'll modify this in my next version. > > > > I took the liberty to address review comments and provide a v2 patch on top > of your v1 patch, also merging the test case. > > Sorry for attaching. No problem. Thank you for updating the patch. I've made some cosmetic changes - could you please check them? They have already been run through pgindent. I ran the regression tests and saw no failures. Just in case, I also ran the 010_truncate.pl test 100 times in a tight loop, which didn't fail either. Best Regards, Takamichi Osumi

Attachment: truncate_in_synchronous_logical_replication_v03.patch
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund wrote:
>
> > I think it is also important to *not* acquire any lock on relation otherwise it can lead to some sort of deadlock or infinite wait in the decoding process. Consider a case for operations like Truncate (or if the user has acquired an exclusive lock on the relation in some other way say via Lock command) which acquires an exclusive lock on relation, it won't get replicated in synchronous mode (when synchronous_standby_name is configured). The truncate operation will wait for the transaction to be replicated to the subscriber and the decoding process will wait for the Truncate operation to finish.
>
> However, this cannot be really relied upon for catalog tables. An output function might acquire locks or such. But for those we do not need to decode contents...

True, so, if we don't need to decode contents then we won't have the problems of the above kind.

> This made me take a brief look at pgoutput.c - maybe I am missing something, but how is the following not a memory leak?
>
> static void
> maybe_send_schema(LogicalDecodingContext *ctx,
>                   ReorderBufferTXN *txn, ReorderBufferChange *change,
>                   Relation relation, RelationSyncEntry *relentry)
> {
> ...
>     /* Map must live as long as the session does. */
>     oldctx = MemoryContextSwitchTo(CacheMemoryContext);
>     relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
>                                            CreateTupleDescCopy(outdesc));
>     MemoryContextSwitchTo(oldctx);
>     send_relation_and_attrs(ancestor, xid, ctx);
>     RelationClose(ancestor);
>
> If - and that's common - convert_tuples_by_name() won't have to do anything, the copied tuple descs will be permanently leaked.

I also think this is a permanent leak. I think we need to free all the memory associated with this map on the invalidation of this particular relsync entry (basically in rel_sync_cache_relation_cb).

-- With Regards, Amit Kapila.
Re: terminate called after throwing an instance of 'std::bad_alloc'
Hi,

On Fri, Apr 16, 2021, at 20:18, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> > On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > > I'd be happy to run with a prototype fix for the leak to see if the other issue does (not) recur.
> > >
> > > I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look!
> >
> > This doesn't seem to address the problem triggered by the reproducer at https://www.postgresql.org/message-id/20210331040751.gu4...@telsasoft.com (sorry I didn't CC you)
>
> I take that back - I forgot that this doesn't release RAM until hitting a threshold.

Phew.

> Without the patch, it looks like:
>
> $ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
> !       61820 kB max resident size
> !       65020 kB max resident size
> !       68812 kB max resident size
> !       71152 kB max resident size
> !       76820 kB max resident size
> !       78760 kB max resident size
> !       81140 kB max resident size
> !       83520 kB max resident size
> !       93084 kB max resident size
> !       94756 kB max resident size
> !       96416 kB max resident size
>
> With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like this:
>
> !       61436 kB max resident size
> !       61572 kB max resident size
> !       63236 kB max resident size
> !       63236 kB max resident size
> !       63556 kB max resident size
> !       63556 kB max resident size
> !       63880 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size

If you set the define to a much lower value, how low does this get?
I didn't see an easy way to make the recreation of the context dependent on memory usage, but that'd of course be much better than an #iterations-based approach. I'll play around with your test tomorrow or so, if you don't, but now the next step is risotto... Regards, Andres
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:00 PM Japin Li wrote: > > > > The RelationIdGetRelation() comment says: > > > > > Caller should eventually decrement count. (Usually, > > > that happens by calling RelationClose().) > > > > However, it doesn't do it in ReorderBufferProcessTXN(). > > I think we should close it, here is a patch that fixes it. Thoughts? > > > > +1. Your fix looks correct to me but can we test it in some way? > I have tried to find a test but was not able to find one. I have tried with a foreign table but we don't log truncate for it, see ExecuteTruncate. It has a check that it will log for relids where RelationIsLogicallyLogged. If that is the case, it is not clear to me how we can ever hit this condition. Have you tried to find the test? -- With Regards, Amit Kapila.
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund wrote: > > > > I think it is also important to *not* acquire any lock on relation > > otherwise it can lead to some sort of deadlock or infinite wait in the > > decoding process. Consider a case for operations like Truncate (or if > > the user has acquired an exclusive lock on the relation in some other > > way say via Lock command) which acquires an exclusive lock on > > relation, it won't get replicated in synchronous mode (when > > synchronous_standby_name is configured). The truncate operation will > > wait for the transaction to be replicated to the subscriber and the > > decoding process will wait for the Truncate operation to finish. > > However, this cannot be really relied upon for catalog tables. An output > function might acquire locks or such. But for those we do not need to > decode contents... > I see that if we define a user_catalog_table (create table t1_cat(c1 int) WITH(user_catalog_table = true);), we are able to decode operations like (insert, truncate) on such a table. What do you mean by "But for those we do not need to decode contents"? -- With Regards, Amit Kapila.
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Sat, 17 Apr 2021 at 14:09, Amit Kapila wrote: > On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila wrote: >> >> On Thu, Apr 15, 2021 at 4:00 PM Japin Li wrote: >> > >> > The RelationIdGetRelation() comment says: >> > >> > > Caller should eventually decrement count. (Usually, >> > > that happens by calling RelationClose().) >> > >> > However, it doesn't do it in ReorderBufferProcessTXN(). >> > I think we should close it, here is a patch that fixes it. Thoughts? >> > >> >> +1. Your fix looks correct to me but can we test it in some way? >> > > I have tried to find a test but not able to find one. I have tried > with a foreign table but we don't log truncate for it, see > ExecuteTruncate. It has a check that it will log for relids where > RelationIsLogicallyLogged. If that is the case, it is not clear to me > how we can ever hit this condition? Have you tried to find the test? I also couldn't find a test for this. It was introduced in 5dfd1e5a6696, written by Simon Riggs, Marco Nenciarini and Peter Eisentraut. Maybe they can explain when we can enter this condition? -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.