RE: Schema variables - new implementation for Postgres 15
From: Pavel Stehule -- I am sorry, but on this topic we have different opinions. Variables in PL/pgSQL are not transactional either (the same is true in Perl, Python, ...). Session variables in Oracle, MS SQL, DB2, and MySQL are not transactional either. My primary focus is PL/pgSQL - I would like to use schema variables as global PL/pgSQL variables (from the PL/pgSQL perspective), which means session variables from the Postgres perspective. But in Postgres I have to write features that will work with the other PLs too - PL/Python, PL/Perl, ... The SET statement in the ANSI SQL standard (SQL/PSM) doesn't expect transactional behaviour for variables either. Unfortunately the SET keyword is used in Postgres for GUCs and cannot be reused without a compatibility break. The PostgreSQL configuration is transactional, but it is a different feature designed for different purposes. Using GUCs as session variables is just a workaround. It can be useful in some cases, sure, but it is not the usual behaviour, and in other cases the transactional behaviour is not practical. Schema variables are not a replacement for GUCs, and schema variables are not a replacement for temporary tables. There is a prepared patch for global temp tables, and I hope that patch can be committed to Postgres 15; global temp tables fix almost all the disadvantages of temporary tables in Postgres. So a schema variable is not a one-row table. It is a different creature - designed to support server-side procedural features. -- +1 I understand (and hope) this feature is intended to ease migration from Oracle PL/SQL, which will further increase the popularity of Postgres. So transactional behavior is not necessary unless Oracle has such a feature. Furthermore, Postgres already has some non-transactional SQL commands, so I don't think we need to reject a non-transactional LET. 
* Sequence operations: SELECT nextval/setval
* SET [SESSION]
* SET ROLE
* SET SESSION AUTHORIZATION

-- I have prepared a patch that allows non-default transactional behaviour (but this behaviour should not be the default - I didn't design schema variables as a temp table replacement). That patch increases the size of the current patch by about a quarter, and I have enough work rebasing the current patch, so I didn't send it to the commitfest now. If schema variables get into core, that same day I'll send the patch that allows transactional behaviour for schema variables - I promise. -- I prefer the simpler, targeted version without transactional behavior initially, because the added complexity might prevent this feature from being committed in PG 15. Regards Takayuki Tsunakawa
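A toy model (my own sketch, not anything from the patch) of the distinction being drawn above -- transactional GUC-style SET versus the proposed non-transactional LET:

```python
# Toy model only -- not Postgres code. It contrasts the two behaviors
# discussed above: a GUC-style SET is undone by ROLLBACK, while the
# proposed LET for schema variables is not.
class Session:
    def __init__(self):
        self.guc = {}        # transactional settings (SET)
        self.var = {}        # session/schema variables (LET)
        self._snapshot = None

    def begin(self):
        # only the transactional state is snapshotted
        self._snapshot = dict(self.guc)

    def set_guc(self, name, value):
        self.guc[name] = value

    def let(self, name, value):
        self.var[name] = value

    def rollback(self):
        self.guc = self._snapshot


s = Session()
s.begin()
s.set_guc("work_mem", "1GB")
s.let("my_counter", 42)
s.rollback()

print(s.guc.get("work_mem"))    # None: SET was rolled back
print(s.var.get("my_counter"))  # 42: LET survives the rollback
```

The point of the thread is exactly the second print: whether surviving a rollback is the right default for schema variables.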
Re: doc review for v14
A bunch more found with things like this. find src -name '*.c' |xargs grep '^[[:space:]]*/\?\*' |grep -woE '[[:lower:]]{3,8}' |sed 's/.*/\L&/' | sort |uniq -c |sort -nr |awk '$1==1' |less >From 9fd5e2797efc14f83bfb2d47d5174de81a34ac6e Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 4 Mar 2021 19:32:05 -0600 Subject: [PATCH 01/27] doc review: Add support for PROVE_TESTS and PROVE_FLAGS in MSVC scripts 5bca69a76b3046a85c60c48271c1831fd5affa51 --- doc/src/sgml/install-windows.sgml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml index 64687b12e6..cb6bb05dc5 100644 --- a/doc/src/sgml/install-windows.sgml +++ b/doc/src/sgml/install-windows.sgml @@ -499,8 +499,8 @@ $ENV{PERL5LIB}=$ENV{PERL5LIB} . ';c:\IPC-Run-0.94\lib'; The TAP tests run with vcregress support the - environment variables PROVE_TESTS, that is expanded - automatically using the name patterns given, and + environment variables PROVE_TESTS, which is + expanded as a glob pattern, and PROVE_FLAGS. These can be set on a Windows terminal, before running vcregress: -- 2.17.0 >From 1392ef571f6afb62f5cd79db691a9c2d2255dde0 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 23 Mar 2021 13:39:36 -0500 Subject: [PATCH 02/27] doc review: Pass all scan keys to BRIN consistent function at once commit a1c649d889bdf6e74e9382e1e28574d7071568de --- doc/src/sgml/brin.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index d2f12bb605..d2476481af 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -833,7 +833,7 @@ typedef struct BrinOpcInfo Returns whether all the ScanKey entries are consistent with the given indexed values for a range. The attribute number to use is passed as part of the scan key. 
- Multiple scan keys for the same attribute may be passed at once, the + Multiple scan keys for the same attribute may be passed at once; the number of entries is determined by the nkeys parameter. -- 2.17.0 >From 1323e85320a0e5414b490ef7f0082f3cca60864c Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 26 Mar 2021 23:14:57 -0500 Subject: [PATCH 03/27] doc review: BRIN minmax-multi indexes ab596105b55f1d7fbd5a66b66f65227d210b047d --- doc/src/sgml/brin.sgml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index d2476481af..ce7c210575 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -730,7 +730,7 @@ LOG: request for BRIN range summarization for index "brin_wi_idx" page 128 was for . When set to a positive value, each block range is assumed to contain this number of distinct non-null values. When set to a negative value, which must be greater than or - equal to -1, the number of distinct non-null is assumed linear with + equal to -1, the number of distinct non-null values is assumed to grow linearly with the maximum possible number of tuples in the block range (about 290 rows per block). The default value is -0.1, and the minimum number of distinct non-null values is 16. @@ -1214,7 +1214,7 @@ typedef struct BrinOpcInfo The minmax-multi operator class is also intended for data types implementing - a totally ordered sets, and may be seen as a simple extension of the minmax + a totally ordered set, and may be seen as a simple extension of the minmax operator class. While minmax operator class summarizes values from each block range into a single contiguous interval, minmax-multi allows summarization into multiple smaller intervals to improve handling of outlier values. 
-- 2.17.0 >From dbf9f5cd9e2c0d7ee471c29b241ae8d5bc565fd6 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Thu, 18 Feb 2021 10:13:15 -0600 Subject: [PATCH 04/27] doc review: VACUUM (PROCESS_TOAST) 7cb3048f38e26b39dd5fd412ed8a4981b6809b35 Michael had some comments here, but it seems to exactly match my existing language. https://www.postgresql.org/message-id/yddar29pdixpg...@paquier.xyz --- doc/src/sgml/ref/vacuum.sgml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml index 6a0028a514..949ca23797 100644 --- a/doc/src/sgml/ref/vacuum.sgml +++ b/doc/src/sgml/ref/vacuum.sgml @@ -219,7 +219,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ connection_name ] DEC Description - DECLARE STATEMENT declares SQL statement identifier. + DECLARE STATEMENT declares a SQL statement identifier. SQL statement identifier can be associated with the connection. - When the identifier is used by dynamic SQL statements, these SQLs are executed - by using the associated connection. + When the identifier is used by dynamic SQL statements, the statements are execu
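The shell pipeline at the top of this message can be mimicked in a few lines of Python, which makes the idea easier to see: collect lowercase words from comment lines and report the ones that occur exactly once as typo candidates (the sample comment lines here are hypothetical):

```python
import re
from collections import Counter

# Hypothetical sample input: two comment lines, one containing the typo "brwn".
source_lines = [
    "/* the quick brown fox */",
    " * the quick brwn fox",
    "int x;  /* not a comment-only line, so it is skipped */",
]

comment_re = re.compile(r"^\s*/?\*")     # lines starting with /* or *
word_re = re.compile(r"\b[a-z]{3,8}\b")  # like grep -woE '[[:lower:]]{3,8}'

counts = Counter(
    word
    for line in source_lines
    if comment_re.match(line)
    for word in word_re.findall(line)
)
singletons = sorted(w for w, n in counts.items() if n == 1)
print(singletons)  # words appearing once -- worth eyeballing for typos
```

Words that appear only once across all comments are disproportionately likely to be misspellings, which is the heuristic the original `awk '$1==1'` encodes.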
Re: Truncate in synchronous logical replication failed
On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: >> >> >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila wrote: >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila >> > wrote: >> >> >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek >> >> wrote: >> >> > >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila wrote: >> >> > > >> >> > > The problem happens only when we try to fetch IDENTITY_KEY attributes >> >> > > because pgoutput uses RelationGetIndexAttrBitmap() to get that >> >> > > information which locks the required indexes. Now, because TRUNCATE >> >> > > has already acquired an exclusive lock on the index, it seems to >> >> > > create a sort of deadlock where the actual Truncate operation waits >> >> > > for logical replication of operation to complete and logical >> >> > > replication waits for actual Truncate operation to finish. >> >> > > >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock the main >> >> > > relation, we just scan the system table and build that information >> >> > > using a historic snapshot. Can't we do something similar here? >> >> > > >> >> > > Adding Petr J. and Peter E. to know their views as this seems to be an >> >> > > old problem (since the decoding of Truncate operation is introduced). >> >> > >> >> > We used RelationGetIndexAttrBitmap because it already existed, no other >> >> > reason. >> >> > >> >> >> >> Fair enough. But I think we should do something about it because using >> >> the same (RelationGetIndexAttrBitmap) just breaks the synchronous >> >> logical replication. I think this is broken since the logical >> >> replication of Truncate is supported. >> >> >> >> > I am not sure what exact locking we need but I would have guessed at >> >> > least AccessShareLock would be needed. >> >> > >> >> >> >> Are you telling that we need AccessShareLock on the index? 
If so, why >> >> is it different from how we access the relation during decoding >> >> (basically in ReorderBufferProcessTXN, we directly use >> >> RelationIdGetRelation() without any lock on the relation)? I think we >> >> do it that way because we need it to process WAL entries and we need >> >> the relation state based on the historic snapshot, so, even if the >> >> relation is later changed/dropped, we are fine with the old state we >> >> got. Isn't the same thing applies here in logicalrep_write_attrs? If >> >> that is true then some equivalent of RelationGetIndexAttrBitmap where >> >> we use RelationIdGetRelation instead of index_open should work? >> >> >> > >> > Today, again I have thought about this and don't see a problem with >> > the above idea. If the above understanding is correct, then I think >> > for our purpose in pgoutput, we just need to call RelationGetIndexList >> > and then build the idattr list for relation->rd_replidindex. >> >> Sorry, I don't know how can we build the idattr without open the index. >> If we should open the index, then we should use NoLock, since the TRUNCATE >> side hold AccessExclusiveLock. As Osumi points out in [1], The NoLock mode >> assumes that the appropriate lock on the index is already taken. >> > > Why can't we use RelationIdGetRelation() by passing the required > indexOid to it? Hi Amit, as your suggested, I try to use RelationIdGetRelation() replace the index_open() to avoid specify the AccessSharedLock, then the TRUNCATE will not be blocked. Attached patch fixed it. Thoughts? -- Regrads, Japin Li. ChengDu WenWu Information Technology Co.,Ltd. 
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 29702d6eab..0ad59ef189 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -5060,7 +5060,7 @@ restart: bool isPK; /* primary key */ bool isIDKey; /* replica identity index */ - indexDesc = index_open(indexOid, AccessShareLock); + indexDesc = RelationIdGetRelation(indexOid); /* * Extract index expressions and index predicate. Note: Don't use @@ -5134,7 +5134,7 @@ restart: /* Collect all attributes in the index predicate, too */ pull_varattnos(indexPredicate, 1, &indexattrs); - index_close(indexDesc, AccessShareLock); + RelationClose(indexDesc); } /* diff --git a/src/test/subscription/t/010_truncate.pl b/src/test/subscription/t/010_truncate.pl index be2c0bdc35..cfcee2f1a7 100644 --- a/src/test/subscription/t/010_truncate.pl +++ b/src/test/subscription/t/010_truncate.pl @@ -3,7 +3,7 @@ use strict; use warnings; use PostgresNode; use TestLib; -use Test::More tests => 9; +use Test::More tests => 11; # setup @@ -158,3 +158,28 @@ is($result, qq(0||), 'truncate of multiple tables some not published'); $result = $node_subscriber->safe_psql('postgres', "SELECT count(*), min(a), max(a) FROM tab2"); is($result, qq(3|1|3), 'truncate of multiple tables some not published'); + +# setup synchronous logical replicatio
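To restate the failure mode in this thread as a sketch: TRUNCATE holds AccessExclusiveLock on the index and waits for the synchronous standby's ack; the walsender doing the decoding calls RelationGetIndexAttrBitmap(), which tries to lock that same index; neither can proceed. A toy wait-for graph (the labels are my own, not real process titles) makes the cycle explicit:

```python
# Toy wait-for graph for the hang described above (labels are illustrative).
waits_for = {
    "truncate backend": "sync-rep ack from walsender",
    "sync-rep ack from walsender": "AccessShareLock on index",
    "AccessShareLock on index": "truncate backend",  # holder of AccessExclusiveLock
}

def find_cycle(graph, start):
    seen = []
    node = start
    while node not in seen:
        seen.append(node)
        node = graph[node]
    return seen[seen.index(node):]

cycle = find_cycle(waits_for, "truncate backend")
print(cycle)  # all three nodes: a closed wait cycle
```

Breaking any edge breaks the cycle; the approach discussed here removes the third edge by reading the index relcache entry via RelationIdGetRelation() instead of taking a lock through index_open().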
RE: Truncate in synchronous logical replication failed
Hi On Friday, April 16, 2021 11:02 AM Japin Li > On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: > >> > >> > >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila > wrote: > >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila > wrote: > >> >> > >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek > >> >> wrote: > >> >> > > >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila > wrote: > >> >> > > > >> >> > > The problem happens only when we try to fetch IDENTITY_KEY > >> >> > > attributes because pgoutput uses RelationGetIndexAttrBitmap() > >> >> > > to get that information which locks the required indexes. Now, > >> >> > > because TRUNCATE has already acquired an exclusive lock on the > >> >> > > index, it seems to create a sort of deadlock where the actual > >> >> > > Truncate operation waits for logical replication of operation > >> >> > > to complete and logical replication waits for actual Truncate > operation to finish. > >> >> > > > >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build > >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock > >> >> > > the main relation, we just scan the system table and build > >> >> > > that information using a historic snapshot. Can't we do something > similar here? > >> >> > > > >> >> > > Adding Petr J. and Peter E. to know their views as this seems > >> >> > > to be an old problem (since the decoding of Truncate operation is > introduced). > >> >> > > >> >> > We used RelationGetIndexAttrBitmap because it already existed, no > other reason. > >> >> > > >> >> > >> >> Fair enough. But I think we should do something about it because > >> >> using the same (RelationGetIndexAttrBitmap) just breaks the > >> >> synchronous logical replication. I think this is broken since the > >> >> logical replication of Truncate is supported. > >> >> > >> >> > I am not sure what exact locking we need but I would have guessed > at least AccessShareLock would be needed. 
> >> >> > > >> >> > >> >> Are you telling that we need AccessShareLock on the index? If so, > >> >> why is it different from how we access the relation during > >> >> decoding (basically in ReorderBufferProcessTXN, we directly use > >> >> RelationIdGetRelation() without any lock on the relation)? I think > >> >> we do it that way because we need it to process WAL entries and we > >> >> need the relation state based on the historic snapshot, so, even > >> >> if the relation is later changed/dropped, we are fine with the old > >> >> state we got. Isn't the same thing applies here in > >> >> logicalrep_write_attrs? If that is true then some equivalent of > >> >> RelationGetIndexAttrBitmap where we use RelationIdGetRelation > instead of index_open should work? > >> >> > >> > > >> > Today, again I have thought about this and don't see a problem with > >> > the above idea. If the above understanding is correct, then I think > >> > for our purpose in pgoutput, we just need to call > >> > RelationGetIndexList and then build the idattr list for > relation->rd_replidindex. > >> > >> Sorry, I don't know how can we build the idattr without open the index. > >> If we should open the index, then we should use NoLock, since the > >> TRUNCATE side hold AccessExclusiveLock. As Osumi points out in [1], > >> The NoLock mode assumes that the appropriate lock on the index is > already taken. > >> > > > > Why can't we use RelationIdGetRelation() by passing the required > > indexOid to it? > > Thanks for your reminder. It might be a way to solve this problem. Yeah. I've made the 1st patch for this issue. In my env, with the patch the TRUNCATE in synchronous logical replication doesn't hang. It's OK with make check-world as well. Best Regards, Takamichi Osumi Truncate_in_synchronous_logical_replication_v01.patch Description: Truncate_in_synchronous_logical_replication_v01.patch
Re: Allowing to create LEAKPROOF functions to non-superuser
On Mon, Apr 12, 2021 at 02:35:27PM -0700, Andres Freund wrote: > On 2021-04-12 17:14:20 -0400, Tom Lane wrote: > > I doubt that falsely labeling a function LEAKPROOF can get you more > > than the ability to read data you're not supposed to be able to read > > ... but that ability is then available to all users, or at least all > > users who can execute the function in question. So it definitely is a > > fairly serious security hazard, and one that's not well modeled by > > role labels. If you give somebody e.g. pg_read_all_data privileges, > > you don't expect that that means they can give it to other users. I do expect that, essentially. Like Andres describes for BYPASSRLS, they can create and GRANT a SECURITY DEFINER function that performs an arbitrary query and returns a refcursor (or stores the data to a table of the caller's choosing, etc.). Unlike BYPASSRLS, they can even make pg_read_all_data own the function, making the situation persist after one drops the actor's role and that role's objects. > A user with BYPASSRLS can create public security definer functions > returning data. If the concern is a BYPASSRLS user intentionally > exposing data, then there's not a meaningful increase to allow defining > LEAKPROOF functions. Hence, I do find it reasonable to let pg_read_all_data be sufficient for setting LEAKPROOF. I would not consult datdba, because datdba currently has no special read abilities. It feels too weird to let BYPASSRLS start affecting non-RLS access controls. A reasonable person may assume that BYPASSRLS has no consequences until someone uses CREATE POLICY. That said, I wouldn't be horrified if BYPASSRLS played a part. BYPASSRLS, like pg_read_all_data, clearly isn't something to grant lightly.
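Noah's point -- that a role with read-all rights can effectively hand those rights to others via a SECURITY DEFINER wrapper -- can be modeled in a few lines (a toy privilege model of my own, nothing Postgres-specific):

```python
# Toy model: functions run with their owner's privileges, like SECURITY DEFINER.
TABLE_ACL = {"secret_table": {"admin"}}   # only admin may read
DATA = {"secret_table": ["s3cr3t"]}

def read_table(role, table):
    if role not in TABLE_ACL[table]:
        raise PermissionError(role)
    return DATA[table]

def make_definer_function(owner, table):
    # the closure captures the owner's identity, as SECURITY DEFINER does
    def fn(_caller_role):
        return read_table(owner, table)
    return fn

# admin creates the wrapper and "grants execute" to everyone
leak = make_definer_function("admin", "secret_table")
print(leak("unprivileged_user"))  # the caller reads data it has no right to
```

Direct access still fails for the unprivileged role; the wrapper is the escalation path, which is why "can they give the ability to other users" is already answered yes for such roles.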
Re: fix old confusing JSON example
On Sat, Apr 03, 2021 at 02:28:38PM +0200, Erik Rijkers wrote: > So, that gives information on two operators, and then gives one > example query for each. Clearly, the second example was meant to > illustrate a where-clause with the @? operator. > > Small change to prevent great confusion (I'll admit it took me far > too long to understand this). Once one guesses the definition of the table to use with the sample data at disposal in the docs, it is easy to see that both queries should return the same result, but the second one misses the shot and is corrected as you say. So, applied. My apologies for the delay. -- Michael signature.asc Description: PGP signature
Re: Converting built-in SQL functions to new style
On Thu, Apr 15, 2021 at 07:25:39PM -0400, Tom Lane wrote: > Here's a draft patch that converts all the built-in and information_schema > SQL functions to new style, except for half a dozen that cannot be > converted because they use polymorphic arguments. This patch looks good. > One thing I was wondering about, but did not pull the trigger on > here, is whether to split off the function-related stuff in > system_views.sql into a new file "system_functions.sql", as has > long been speculated about by the comments in system_views.sql. > I think it is time to do this because > > (a) The function stuff now amounts to a full third of the file. Fair. > (b) While the views made by system_views.sql are intentionally > not pinned, the function-related commands are messing with > pre-existing objects that *are* pinned. This seems quite > confusing to me, and it might interfere with the intention that > you could reload the system view definitions using this file. I'm not aware of that causing a problem. Currently, the views give a few errors, and the functions do not.
Re: Truncate in synchronous logical replication failed
On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com wrote:
>
> > Thanks for your reminder. It might be a way to solve this problem.
> Yeah. I've made the 1st patch for this issue.
>
> In my env, with the patch
> the TRUNCATE in synchronous logical replication doesn't hang.
>

Few initial comments:
=
1.
+ relreplindex = relation->rd_replidindex;
+
+ /*
+ * build attributes to idindexattrs.
+ */
+ idindexattrs = NULL;
+ foreach(l, indexoidlist)
+ {
+ Oid indexOid = lfirst_oid(l);
+ Relation indexDesc;
+ int i;
+ bool isIDKey; /* replica identity index */
+
+ indexDesc = RelationIdGetRelation(indexOid);

When you have the oid of the replica identity index (relreplindex), then what is the need to traverse all the indexes?

2.
It is better to name the function as RelationGet...

--
With Regards,
Amit Kapila.
Re: Truncate in synchronous logical replication failed
On Fri, Apr 16, 2021 at 12:55 PM Japin Li wrote: > > On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: > >> > >> > >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila wrote: > >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila > >> > wrote: > >> >> > >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek > >> >> wrote: > >> >> > > >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila > >> >> > > wrote: > >> >> > > > >> >> > > The problem happens only when we try to fetch IDENTITY_KEY > >> >> > > attributes > >> >> > > because pgoutput uses RelationGetIndexAttrBitmap() to get that > >> >> > > information which locks the required indexes. Now, because TRUNCATE > >> >> > > has already acquired an exclusive lock on the index, it seems to > >> >> > > create a sort of deadlock where the actual Truncate operation waits > >> >> > > for logical replication of operation to complete and logical > >> >> > > replication waits for actual Truncate operation to finish. > >> >> > > > >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build > >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock the > >> >> > > main > >> >> > > relation, we just scan the system table and build that information > >> >> > > using a historic snapshot. Can't we do something similar here? > >> >> > > > >> >> > > Adding Petr J. and Peter E. to know their views as this seems to be > >> >> > > an > >> >> > > old problem (since the decoding of Truncate operation is > >> >> > > introduced). > >> >> > > >> >> > We used RelationGetIndexAttrBitmap because it already existed, no > >> >> > other reason. > >> >> > > >> >> > >> >> Fair enough. But I think we should do something about it because using > >> >> the same (RelationGetIndexAttrBitmap) just breaks the synchronous > >> >> logical replication. I think this is broken since the logical > >> >> replication of Truncate is supported. 
> >> >> > >> >> > I am not sure what exact locking we need but I would have guessed at > >> >> > least AccessShareLock would be needed. > >> >> > > >> >> > >> >> Are you telling that we need AccessShareLock on the index? If so, why > >> >> is it different from how we access the relation during decoding > >> >> (basically in ReorderBufferProcessTXN, we directly use > >> >> RelationIdGetRelation() without any lock on the relation)? I think we > >> >> do it that way because we need it to process WAL entries and we need > >> >> the relation state based on the historic snapshot, so, even if the > >> >> relation is later changed/dropped, we are fine with the old state we > >> >> got. Isn't the same thing applies here in logicalrep_write_attrs? If > >> >> that is true then some equivalent of RelationGetIndexAttrBitmap where > >> >> we use RelationIdGetRelation instead of index_open should work? > >> >> > >> > > >> > Today, again I have thought about this and don't see a problem with > >> > the above idea. If the above understanding is correct, then I think > >> > for our purpose in pgoutput, we just need to call RelationGetIndexList > >> > and then build the idattr list for relation->rd_replidindex. > >> > >> Sorry, I don't know how can we build the idattr without open the index. > >> If we should open the index, then we should use NoLock, since the TRUNCATE > >> side hold AccessExclusiveLock. As Osumi points out in [1], The NoLock mode > >> assumes that the appropriate lock on the index is already taken. > >> > > > > Why can't we use RelationIdGetRelation() by passing the required > > indexOid to it? > > Hi Amit, as your suggested, I try to use RelationIdGetRelation() replace > the index_open() to avoid specify the AccessSharedLock, then the TRUNCATE > will not be blocked. > It is okay as POC but we can't change the existing function RelationGetIndexAttrBitmap. It is used from other places as well. 
It might be better to write a separate function for this, something like what Osumi-San's patch is trying to do. -- With Regards, Amit Kapila.
RE: Truncate in synchronous logical replication failed
Hi On Friday, April 16, 2021 5:50 PM Amit Kapila wrote: > On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com > wrote: > > > > > Thanks for your reminder. It might be a way to solve this problem. > > Yeah. I've made the 1st patch for this issue. > > > > In my env, with the patch > > the TRUNCATE in synchronous logical replication doesn't hang. > > > > Few initial comments: > = > 1. > + relreplindex = relation->rd_replidindex; > + > + /* > + * build attributes to idindexattrs. > + */ > + idindexattrs = NULL; > + foreach(l, indexoidlist) > + { > + Oid indexOid = lfirst_oid(l); > + Relation indexDesc; > + int i; > + bool isIDKey; /* replica identity index */ > + > + indexDesc = RelationIdGetRelation(indexOid); > > When you have oid of replica identity index (relreplindex) then what is the > need to traverse all the indexes? Ok. No need to traverse all the indexes. Will fix this part. > 2. > It is better to name the function as RelationGet... You are right. I'll modify this in my next version. Best Regards, Takamichi Osumi
Re: Truncate in synchronous logical replication failed
On Fri, 16 Apr 2021 at 16:52, Amit Kapila wrote: > On Fri, Apr 16, 2021 at 12:55 PM Japin Li wrote: >> >> On Thu, 15 Apr 2021 at 19:25, Amit Kapila wrote: >> > On Thu, Apr 15, 2021 at 4:30 PM Japin Li wrote: >> >> >> >> >> >> On Thu, 15 Apr 2021 at 18:22, Amit Kapila wrote: >> >> > On Wed, Apr 14, 2021 at 3:31 PM Amit Kapila >> >> > wrote: >> >> >> >> >> >> On Tue, Apr 13, 2021 at 8:07 PM Petr Jelinek >> >> >> wrote: >> >> >> > >> >> >> > > On 12 Apr 2021, at 08:58, Amit Kapila >> >> >> > > wrote: >> >> >> > > >> >> >> > > The problem happens only when we try to fetch IDENTITY_KEY >> >> >> > > attributes >> >> >> > > because pgoutput uses RelationGetIndexAttrBitmap() to get that >> >> >> > > information which locks the required indexes. Now, because TRUNCATE >> >> >> > > has already acquired an exclusive lock on the index, it seems to >> >> >> > > create a sort of deadlock where the actual Truncate operation waits >> >> >> > > for logical replication of operation to complete and logical >> >> >> > > replication waits for actual Truncate operation to finish. >> >> >> > > >> >> >> > > Do we really need to use RelationGetIndexAttrBitmap() to build >> >> >> > > IDENTITY_KEY attributes? During decoding, we don't even lock the >> >> >> > > main >> >> >> > > relation, we just scan the system table and build that information >> >> >> > > using a historic snapshot. Can't we do something similar here? >> >> >> > > >> >> >> > > Adding Petr J. and Peter E. to know their views as this seems to >> >> >> > > be an >> >> >> > > old problem (since the decoding of Truncate operation is >> >> >> > > introduced). >> >> >> > >> >> >> > We used RelationGetIndexAttrBitmap because it already existed, no >> >> >> > other reason. >> >> >> > >> >> >> >> >> >> Fair enough. But I think we should do something about it because using >> >> >> the same (RelationGetIndexAttrBitmap) just breaks the synchronous >> >> >> logical replication. 
I think this is broken since the logical >> >> >> replication of Truncate is supported. >> >> >> >> >> >> > I am not sure what exact locking we need but I would have guessed at >> >> >> > least AccessShareLock would be needed. >> >> >> > >> >> >> >> >> >> Are you telling that we need AccessShareLock on the index? If so, why >> >> >> is it different from how we access the relation during decoding >> >> >> (basically in ReorderBufferProcessTXN, we directly use >> >> >> RelationIdGetRelation() without any lock on the relation)? I think we >> >> >> do it that way because we need it to process WAL entries and we need >> >> >> the relation state based on the historic snapshot, so, even if the >> >> >> relation is later changed/dropped, we are fine with the old state we >> >> >> got. Isn't the same thing applies here in logicalrep_write_attrs? If >> >> >> that is true then some equivalent of RelationGetIndexAttrBitmap where >> >> >> we use RelationIdGetRelation instead of index_open should work? >> >> >> >> >> > >> >> > Today, again I have thought about this and don't see a problem with >> >> > the above idea. If the above understanding is correct, then I think >> >> > for our purpose in pgoutput, we just need to call RelationGetIndexList >> >> > and then build the idattr list for relation->rd_replidindex. >> >> >> >> Sorry, I don't know how can we build the idattr without open the index. >> >> If we should open the index, then we should use NoLock, since the TRUNCATE >> >> side hold AccessExclusiveLock. As Osumi points out in [1], The NoLock mode >> >> assumes that the appropriate lock on the index is already taken. >> >> >> > >> > Why can't we use RelationIdGetRelation() by passing the required >> > indexOid to it? >> >> Hi Amit, as your suggested, I try to use RelationIdGetRelation() replace >> the index_open() to avoid specify the AccessSharedLock, then the TRUNCATE >> will not be blocked. 
>> > > It is okay as POC but we can't change the existing function > RelationGetIndexAttrBitmap. It is used from other places as well. It > might be better to write a separate function for this, something like > what Osumi-San's patch is trying to do. Agreed! Hi Osumi-San, can you merge the test case in your next version? -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
Re: Replication slot stats misgivings
On Mon, Apr 12, 2021 at 2:57 PM vignesh C wrote: > > On Sat, Mar 20, 2021 at 9:26 AM Amit Kapila wrote: > > > > On Sat, Mar 20, 2021 at 12:22 AM Andres Freund wrote: > > > > > > And then more generally about the feature: > > > - If a slot was used to stream out a large amount of changes (say an > > > initial data load), but then replication is interrupted before the > > > transaction is committed/aborted, stream_bytes will not reflect the > > > many gigabytes of data we may have sent. > > > > > > > We can probably update the stats each time we spilled or streamed the > > transaction data but it was not clear at that stage whether or how > > much it will be useful. > > > > I felt we can update the replication slot statistics data each time we > spill/stream the transaction data instead of accumulating the > statistics and updating at the end. I have tried this in the attached > patch and the statistics data were getting updated. > Thoughts? > Did you check if we can update the stats when we release the slot as discussed above? I am not sure if it is easy to do at the time of slot release because this information might not be accessible there and in some cases, we might have already released the decoding context/reorderbuffer where this information is stored. It might be okay to update this when we stream or spill but let's see if we can do it easily at the time of slot release. -- With Regards, Amit Kapila.
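The difference under discussion -- accumulating spill statistics and reporting only at the end versus reporting at each spill -- can be sketched like this (a toy accumulator of my own, not the actual pgstat machinery):

```python
# Toy accumulator: the "reported" list stands in for what the stats
# collector has seen so far.
class SlotStats:
    def __init__(self):
        self.spill_bytes = 0
        self.reported = []

    def record_spill(self, nbytes):
        self.spill_bytes += nbytes
        # report at spill time: if replication is interrupted before
        # commit/abort, the earlier spills are still visible
        self.reported.append(self.spill_bytes)


stats = SlotStats()
for chunk in (100, 200, 300):
    stats.record_spill(chunk)
print(stats.reported)  # progress visible after every spill, not only at the end
```

With end-of-transaction-only reporting, an interrupted stream would leave `reported` empty even after gigabytes were spilled -- the scenario Andres raised.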
Re: Truncate in synchronous logical replication failed
On Fri, Apr 16, 2021 at 3:08 PM Japin Li wrote: > > On Fri, 16 Apr 2021 at 16:52, Amit Kapila wrote: > > On Fri, Apr 16, 2021 at 12:55 PM Japin Li wrote: > > It is okay as POC but we can't change the existing function > > RelationGetIndexAttrBitmap. It is used from other places as well. It > > might be better to write a separate function for this, something like > > what Osumi-San's patch is trying to do. > > Agreed! > > Hi Osumi-San, can you merge the test case in your next version? > +1. Your test looks reasonable to me. -- With Regards, Amit Kapila.
Re: logical replication worker accesses catalogs in error context callback
On Tue, Mar 16, 2021 at 2:21 AM Tom Lane wrote: > > Bharath Rupireddy writes: > > Thanks for pointing to the changes in the commit message. I corrected > > them. Attaching v4 patch set, consider it for further review. > > I took a quick look at this. I'm quite worried about the potential > performance cost of the v4-0001 patch (the one for fixing > slot_store_error_callback). Previously, we didn't pay any noticeable > cost for having the callback unless there actually was an error. > As patched, we perform several catalog lookups per column per row, > even in the non-error code path. That seems like it'd be a noticeable > performance hit. Just to add insult to injury, it leaks memory. > > I propose a more radical but simpler solution: let's just not bother > with including the type names in the context message. How much are > they really buying? << Attaching v5-0001 here again for completion >> I'm attaching the v5-0001 patch that avoids printing the column type names in the context message, so no cache lookups have to be done in the error context callback. I think the column name is enough to know on which column the error occurred and, if required, its type can be known by the user. This patch gets rid of printing local and remote type names in slot_store_error_callback and also removes logicalrep_typmap_gettypname because it's unnecessary. Thoughts? > v4-0002 (for postgres_fdw's conversion_error_callback) has the same > problems, although mitigated a bit by not needing to do any catalog > lookups in the non-join case. For the join case, I wonder whether > we could push the lookups back to plan setup time, so they don't > need to be done over again for each row. (Not doing lookups at all > doesn't seem attractive there; it'd render the context message nearly > useless.) A different idea is to try to get the column names out > of the query's rangetable instead of going to the catalogs.
I'm attaching v5-0002, which stores the required attribute information for foreign joins in postgresBeginForeignScan, which is a one-time job, as opposed to the per-row system catalog lookup v4-0002 was doing. I'm storing the foreign table relid (as key), relname and the retrieved attributes' attnum and attname into a hash table. Whenever a conversion error occurs, using relid, the hash table is looked up to fetch the relname and attname. Thoughts? With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v5-0001-Avoid-Catalogue-Accesses-In-slot_store_error_call.patch Description: Binary data v5-0002-Avoid-Catalogue-Accesses-In-conversion_error_call.patch Description: Binary data
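The one-time-setup idea in v5-0002 can be sketched outside the backend roughly as follows: a small cache of column names built once at scan start, consulted by the error callback with no catalog access per row. All names here are invented for illustration; the real patch keys a hash table by relid and uses the PostgreSQL dynahash machinery.

```c
#include <assert.h>
#include <string.h>

/* Illustrative stand-in for the v5-0002 approach: column names are captured
 * once at "BeginForeignScan" time, so the conversion error callback can
 * report them without any per-row catalog lookup. */
#define MAX_ATTS 8

typedef struct AttCache
{
	const char *relname;
	const char *attnames[MAX_ATTS];
	int			natts;
} AttCache;

/* one-time setup, analogous to work done in postgresBeginForeignScan */
static void
att_cache_init(AttCache *c, const char *relname,
			   const char *const *attnames, int natts)
{
	c->relname = relname;
	c->natts = natts;
	for (int i = 0; i < natts; i++)
		c->attnames[i] = attnames[i];
}

/* cheap lookup, analogous to what conversion_error_callback would do */
static const char *
att_cache_name(const AttCache *c, int attnum)
{
	return (attnum >= 0 && attnum < c->natts) ? c->attnames[attnum] : "?";
}
```

The point of the design is that the (possibly expensive) name resolution happens exactly once per scan, while the error path only does an array/hash lookup.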
Re: Replication slot stats misgivings
On Fri, Apr 16, 2021 at 3:16 PM Amit Kapila wrote: > > On Mon, Apr 12, 2021 at 2:57 PM vignesh C wrote: > > > > On Sat, Mar 20, 2021 at 9:26 AM Amit Kapila wrote: > > > > > > On Sat, Mar 20, 2021 at 12:22 AM Andres Freund wrote: > > > > > > > > And then more generally about the feature: > > > > - If a slot was used to stream out a large amount of changes (say an > > > > initial data load), but then replication is interrupted before the > > > > transaction is committed/aborted, stream_bytes will not reflect the > > > > many gigabytes of data we may have sent. > > > > > > > > > > We can probably update the stats each time we spilled or streamed the > > > transaction data but it was not clear at that stage whether or how > > > much it will be useful. > > > > > > > I felt we can update the replication slot statistics data each time we > > spill/stream the transaction data instead of accumulating the > > statistics and updating at the end. I have tried this in the attached > > patch and the statistics data were getting updated. > > Thoughts? > > > > Did you check if we can update the stats when we release the slot as > discussed above? I am not sure if it is easy to do at the time of slot > release because this information might not be accessible there and in > some cases, we might have already released the decoding > context/reorderbuffer where this information is stored. It might be > okay to update this when we stream or spill but let's see if we can do > it easily at the time of slot release. > I'm not sure if we will be able to update stats from here, as we will not have access to decoding context/reorderbuffer at this place, and also like you pointed out I noticed that the decoding context gets released earlier itself. Regards, Vignesh
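The trade-off being discussed — reporting counters at each spill/stream instead of once at transaction end — can be sketched as follows. The structure and names are invented for illustration; the real counters live in the decoding context and are sent to the stats collector.

```c
#include <assert.h>

/* Hypothetical per-slot counters, flushed at every spill rather than at
 * commit/abort, so an interrupted walsender cannot lose a large
 * accumulated delta. */
typedef struct SlotStats
{
	long		spill_txns;
	long		spill_bytes;
} SlotStats;

static SlotStats shared_stats;	/* stands in for the stats collector */

static void
spill_transaction(SlotStats *pending, long nbytes)
{
	pending->spill_txns += 1;
	pending->spill_bytes += nbytes;

	/* flush immediately instead of accumulating until transaction end */
	shared_stats.spill_txns += pending->spill_txns;
	shared_stats.spill_bytes += pending->spill_bytes;
	pending->spill_txns = 0;
	pending->spill_bytes = 0;
}
```

Flushing per spill is slightly more message traffic, but as Andres's example shows, the end-of-transaction alternative loses gigabytes of accounting if replication is interrupted first.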
Re: Replication slot stats misgivings
On Fri, Apr 16, 2021 at 11:28 AM Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:35 PM Masahiko Sawada wrote: > > > > Thank you for the update! The patch looks good to me. > > > > I have pushed the first patch. Comments on the next patch > v13-0001-Use-HTAB-for-replication-slot-statistics: Also should we change PGSTAT_FILE_FORMAT_ID as we have modified the replication slot statistics? Regards, Vignesh
Re: ATTACH PARTITION locking documentation for DEFAULT partitions
On Thu, 15 Apr 2021 at 21:24, Justin Pryzby wrote: > > On Thu, Apr 15, 2021 at 08:47:26PM +0200, Matthias van de Meent wrote: > > I recently noticed that ATTACH PARTITION also recursively locks the > > default partition with ACCESS EXCLUSIVE mode when its constraints do > > not explicitly exclude the to-be-attached partition, which I couldn't > > find documented (has been there since PG10 I believe). > > I'm not sure it's what you're looking for, but maybe you saw: > https://www.postgresql.org/docs/12/sql-altertable.html > |The default partition can't contain any rows that would need to be moved to > the > |new partition, and will be scanned to verify that none are present. This > scan, > |like the scan of the new partition, can be avoided if an appropriate > |CHECK constraint is present. > > And since 2a4d96ebb: > |Attaching a partition acquires a SHARE UPDATE EXCLUSIVE lock on the parent > table, in addition to ACCESS EXCLUSIVE locks on the table to be attached and > on the default partition (if any). From the current documentation the recursive locking isn't clear: I didn't expect an ACCESS EXCLUSIVE on the whole hierarchy of both the to-be-attached and the default partitions whilst scanning, because the SUEL on the shared parent is not propagated to all its children either. > From your patch: > > > + > > + Similarly, if you have a default partition on the parent table, it is > > + recommended to create a CHECK constraint that > > excludes > > + the to be attached partition constraint. Here, too, without the > > + CHECK constraint, this table will be scanned to > > + validate that the updated default partition constraints while holding > > + an ACCESS EXCLUSIVE lock on the default partition. > > + > > The AEL is acquired in any case, right ? Yes, the main point is that the validation scan runs whilst holding the AEL on the partition (sub)tree of that default partition.
After looking a bit more at the code, I agree that my current patch is not descriptive enough. I compared adding a partition to running `ADD CONSTRAINT ... NOT VALID` on the to-be-altered partitions (using AEL), + `VALIDATE CONSTRAINT` running recursively over its partitions (using SHARE UPDATE EXCLUSIVE). We only expect an SUEL for VALIDATE CONSTRAINT, and the constraint itself is only added/updated on the direct descendants of the parent, not their recursive descendants. Insertions already can only happen when the whole upward hierarchy of a partition allows for inserts, so this shouldn't be that much of an issue. > I think whatever we say here needs to be crystal clear that only the scan can > be skipped. Yes, but when we skip the scan for the default partition, we also skip locking its partition tree with AEL. The partition tree of the table that is being attached, however, is fully locked regardless of constraint definitions. > I suggest that maybe the existing paragraph in alter_table.sgml could maybe > say > that an exclusive lock is held, maybe like. > > |The default partition can't contain any rows that would need to be moved to > the > |new partition, and will be scanned to verify that none are present. This > scan, > |like the scan of the new partition, can be avoided if an appropriate > |CHECK constraint is present. > |The scan of the default partition occurs while it is exclusively locked. PFA an updated patch. I've updated the wording of the previous patch, and also updated this section in alter_table.sgml, but with different wording, explicitly explaining the process used to validate the altered default constraint. Thanks for the review. With regards, Matthias van de Meent From 230c594950886bf50bf4c69295aa0cb67c16b8a6 Mon Sep 17 00:00:00 2001 From: Matthias van de Meent Date: Thu, 15 Apr 2021 20:43:00 +0200 Subject: [PATCH v2] ATTACH PARTITION locking documentation for DEFAULT partitions.
--- doc/src/sgml/ddl.sgml | 18 -- doc/src/sgml/ref/alter_table.sgml | 8 ++-- 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index 30e4170963..8f0b38847a 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -3934,6 +3934,9 @@ ALTER TABLE measurement_y2008m02 ADD CONSTRAINT y2008m02 \copy measurement_y2008m02 from 'measurement_y2008m02' -- possibly some other data preparation work +ALTER TABLE measurement_default ADD CONSTRAINT excl_y2008m02 + CHECK ( (logdate >= DATE '2008-02-01' AND logdate < DATE '2008-03-01') IS FALSE ); + ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02 FOR VALUES FROM ('2008-02-01') TO ('2008-03-01' ); @@ -3947,12 +3950,23 @@ ALTER TABLE measurement ATTACH PARTITION measurement_y2008m02 which is otherwise needed to validate the implicit partition constraint. Without the CHECK constraint, the table will be scanned to validate the partition constraint while - holding both an ACCESS EXCLUSIVE lock on tha
Re: WIP: WAL prefetch (another approach)
On Sat, Apr 10, 2021 at 2:16 AM Thomas Munro wrote: > In commit 1d257577e08d3e598011d6850fd1025858de8c8c, there is a change in file format for stats, won't it require bumping PGSTAT_FILE_FORMAT_ID? Actually, I came across this while working on my today's commit f5fc2f5b23 where I forgot to bump PGSTAT_FILE_FORMAT_ID. So, I thought maybe we can bump it just once if required? -- With Regards, Amit Kapila.
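The reason PGSTAT_FILE_FORMAT_ID must be bumped when the on-disk stats layout changes can be sketched as a header check: a reader has to be able to reject a file written with a different layout. The constant and struct below are invented for illustration, not the real PGSTAT_FILE_FORMAT_ID or pgstat file layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Invented format identifier; stands in for PGSTAT_FILE_FORMAT_ID. */
#define STATS_FORMAT_ID 0x01A5BC9EU

typedef struct StatsHeader
{
	uint32_t	format_id;		/* must match, else the file is discarded */
	uint32_t	nslots;
} StatsHeader;

/* Returns 0 on success, -1 when the file must be discarded. */
static int
read_stats_header(const unsigned char *buf, size_t len, StatsHeader *out)
{
	if (len < sizeof(StatsHeader))
		return -1;
	memcpy(out, buf, sizeof(StatsHeader));
	if (out->format_id != STATS_FORMAT_ID)
		return -1;				/* written by a different format version */
	return 0;
}
```

This is why bumping the ID once per release cycle (rather than per commit) is enough: all that matters is that a file written with the old layout is rejected by a server using the new one.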
Re: Retry in pgbench
On Fri, 16 Apr 2021 10:28:48 +0900 (JST) Tatsuo Ishii wrote: > > By the way, I've been playing with the idea of failing gracefully and retry > > indefinitely (or until given -T) on SQL error AND connection issue. > > > > It would be useful to test replicating clusters with a (switch|fail)over > > procedure. > > Interesting idea but in general a failover takes some time (like a few > minutes), and it will strongly affect TPS. I think in the end it just > measures the failover time. This use case is not about benchmarking. It's about generating constant traffic to be able to practice/train some [auto]switchover procedures while being close to production activity. In this context, the max-saturated TPS of one node is not relevant. But being able to add some stats about downtime might be a good addition. Regards,
Re: Retry in pgbench
> This use case is not about benchmarking. It's about generating constant traffic > to be able to practice/train some [auto]switchover procedures while being > close > to production activity. > > In this context, the max-saturated TPS of one node is not relevant. But being > able to add some stats about downtime might be a good addition. Oh I see. That makes sense. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
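The retry behaviour discussed for switchover drills can be sketched as a bounded retry loop: keep trying a failing operation up to a limit instead of aborting the run, and report how many attempts were needed (which is the raw material for a downtime statistic). The flaky "connection" below is simulated; all names are invented for illustration, not pgbench internals.

```c
#include <assert.h>

typedef int (*op_fn) (void *state);

/* Returns the attempt number that succeeded, or -1 after giving up. */
static int
retry_until(op_fn op, void *state, int max_attempts)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++)
	{
		if (op(state) == 0)
			return attempt;
		/* a real client would sleep/back off here, and could record the
		 * elapsed time as the downtime statistic mentioned in the thread */
	}
	return -1;
}

/* simulated connection that fails a fixed number of times, then succeeds */
static int
flaky_connect(void *state)
{
	int		   *failures_left = state;

	return (*failures_left)-- > 0 ? -1 : 0;
}
```

With -T semantics, max_attempts would instead be a deadline check, but the shape of the loop is the same.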
Error when defining a set returning function
Dear all

Since I was receiving an error when defining a set returning function, I borrowed a function from PostgreSQL as follows

/* C definition */
typedef struct testState
{
	int			current;
	int			finish;
	int			step;
} testState;

/**
 * test_srf(startval int, endval int, step int)
 */
PG_FUNCTION_INFO_V1(test_srf);
Datum
test_srf(PG_FUNCTION_ARGS)
{
	FuncCallContext *funcctx;
	testState  *fctx;
	int			result;			/* the actual return value */

	if (SRF_IS_FIRSTCALL())
	{
		/* Get input values */
		int			start = PG_GETARG_INT32(0);
		int			finish = PG_GETARG_INT32(1);
		int			step = PG_GETARG_INT32(2);
		MemoryContext oldcontext;

		/* create a function context for cross-call persistence */
		funcctx = SRF_FIRSTCALL_INIT();

		/* switch to memory context appropriate for multiple function calls */
		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

		/* quick opt-out if we get nonsensical inputs */
		if (step <= 0 || start == finish)
		{
			funcctx = SRF_PERCALL_SETUP();
			SRF_RETURN_DONE(funcctx);
		}

		/* allocate memory for function context */
		fctx = (testState *) palloc0(sizeof(testState));
		fctx->current = start;
		fctx->finish = finish;
		fctx->step = step;
		funcctx->user_fctx = fctx;
		MemoryContextSwitchTo(oldcontext);
	}

	/* stuff done on every call of the function */
	funcctx = SRF_PERCALL_SETUP();

	/* get state */
	fctx = funcctx->user_fctx;
	result = fctx->current;
	fctx->current += fctx->step;

	/* Stop when we have generated all values */
	if (fctx->current > fctx->finish)
	{
		SRF_RETURN_DONE(funcctx);
	}

	SRF_RETURN_NEXT(funcctx, Int32GetDatum(result));
}

/* SQL definition */
CREATE OR REPLACE FUNCTION testSRF(startval int, endval int, step int)
RETURNS SETOF integer
AS 'MODULE_PATHNAME', 'test_srf'
LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;

When I execute this function I obtain

select testSRF(1,10, 2);
ERROR: unrecognized table-function returnMode: 257

select version();
PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, 64-bit

Any idea what could be wrong?
Thanks for your help Esteban
Re: Bogus collation version recording in recordMultipleDependencies
Julien Rouhaud writes: > On Thu, Apr 15, 2021 at 10:06:24AM -0400, Tom Lane wrote: >> 0001 fails for me :-(. I think that requires default collation to be C. > Oh right, adding --no-locale to the regress opts I see that create_index is > failing, and that's not the one I was expecting. > We could change create_index test to create c2 with a C collation, in order to > test that we don't track dependency on unversioned locales, and add an extra > test in collate.linux.utf8 to check that we do track a dependency on the > default collation as this test isn't run in the --no-locale case. The only > case not tested would be default unversioned collation, but I'm not sure where > to properly test that. Maybe a short leading test in collate.linux.utf8 that > would be run on linux in that case (when getdatabaseencoding() != 'UTF8')? It > would require an extra alternate file but it wouldn't cause too much > maintenance problem as there should be only one test. Since the proposed patch removes the dependency code's special-case handling of the default collation, I don't feel like we need to jump through hoops to prove that the default collation is tracked the same as other collations. A regression test with alternative outputs is a significant ongoing maintenance burden, and I do not see where we're getting a commensurate improvement in test coverage. Especially since, AFAICS, the two alternative outputs would essentially have to accept both the "it works" and "it doesn't work" outcomes. So I propose that we do 0001 below, which is my first patch plus your suggestion about fixing up create_index.sql. This passes check-world for me under both C and en_US.utf8 prevailing locales. 
regards, tom lane diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c index 362db7fe91..1217c01b8a 100644 --- a/src/backend/catalog/pg_depend.c +++ b/src/backend/catalog/pg_depend.c @@ -73,7 +73,6 @@ recordMultipleDependencies(const ObjectAddress *depender, max_slots, slot_init_count, slot_stored_count; - char *version = NULL; if (nreferenced <= 0) return; /* nothing to do */ @@ -104,31 +103,22 @@ recordMultipleDependencies(const ObjectAddress *depender, slot_init_count = 0; for (i = 0; i < nreferenced; i++, referenced++) { - bool ignore_systempin = false; + char *version = NULL; if (record_version) { /* For now we only know how to deal with collations. */ if (referenced->classId == CollationRelationId) { -/* C and POSIX don't need version tracking. */ +/* These are unversioned, so don't waste cycles on them. */ if (referenced->objectId == C_COLLATION_OID || referenced->objectId == POSIX_COLLATION_OID) continue; version = get_collation_version_for_oid(referenced->objectId, false); - -/* - * Default collation is pinned, so we need to force recording - * the dependency to store the version. - */ -if (referenced->objectId == DEFAULT_COLLATION_OID) - ignore_systempin = true; } } - else - Assert(!version); /* * If the referenced object is pinned by the system, there's no real @@ -136,7 +126,7 @@ recordMultipleDependencies(const ObjectAddress *depender, * version. This saves lots of space in pg_depend, so it's worth the * time taken to check. 
*/ - if (!ignore_systempin && isObjectPinned(referenced, dependDesc)) + if (version == NULL && isObjectPinned(referenced, dependDesc)) continue; if (slot_init_count < max_slots) diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out index 830fdddf24..7f8f91b92c 100644 --- a/src/test/regress/expected/create_index.out +++ b/src/test/regress/expected/create_index.out @@ -2026,10 +2026,10 @@ REINDEX TABLE concur_reindex_tab; -- notice NOTICE: table "concur_reindex_tab" has no indexes to reindex REINDEX (CONCURRENTLY) TABLE concur_reindex_tab; -- notice NOTICE: table "concur_reindex_tab" has no indexes that can be reindexed concurrently -ALTER TABLE concur_reindex_tab ADD COLUMN c2 text; -- add toast index +ALTER TABLE concur_reindex_tab ADD COLUMN c2 text COLLATE "C"; -- add toast index -- Normal index with integer column CREATE UNIQUE INDEX concur_reindex_ind1 ON concur_reindex_tab(c1); --- Normal index with text column +-- Normal index with text column (with unversioned collation) CREATE INDEX concur_reindex_ind2 ON concur_reindex_tab(c2); -- UNIQUE index with expression CREATE UNIQUE INDEX concur_reindex_ind3 ON concur_reindex_tab(abs(c1)); @@ -2069,16 +2069,14 @@ WHERE classid = 'pg_class'::regclass AND obj| objref | deptype --++- index concur_reindex_ind1| constraint concur_reindex_ind1 on table concur_
Re: fix old confusing JSON example
On Fri, Apr 16, 2021 at 11:00 AM Michael Paquier wrote: > On Sat, Apr 03, 2021 at 02:28:38PM +0200, Erik Rijkers wrote: > > So, that gives information on two operators, and then gives one > > example query for each. Clearly, the second example was meant to > > illustrate a where-clause with the @? operator. > > > > Small change to prevent great confusion (I'll admit it took me far > > too long to understand this). > > Once one guesses the definition of the table to use with the sample > data at disposal in the docs, it is easy to see that both queries > should return the same result, but the second one misses the shot and > is corrected as you say. So, applied. > > My apologies for the delay. My apologies for missing this. And thank you for taking care! -- Regards, Alexander Korotkov
Re: Error when defining a set returning function
Esteban Zimanyi writes: > Since I was receiving an error when defining a set returning function, I > borrowed a function from PostgreSQL as follows > ... > When I execute this function I obtain > select testSRF(1,10, 2); > ERROR: unrecognized table-function returnMode: 257 Hmm, I compiled this function up and it works for me: regression=# select testSRF(1,10, 2); testsrf -- 1 3 5 7 (4 rows) I think your "quick opt-out" code is a bit broken, because it fails to restore the current memory context; but there's nothing wrong with the main code path. Hence, the problem is somewhere else. The first theory that comes to mind is that you're compiling against Postgres headers that don't match the server version you're actually loading the code into. In theory the PG_MODULE_MAGIC infrastructure ought to catch that, but maybe you've found some creative way to fool that :-(. One way maybe would be if the headers were from some pre-release v13 version that wasn't ABI-compatible with 13.0. Or it could be something else, but I'd counsel looking for build process mistakes, cause this C code isn't the problem. regards, tom lane
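The PG_MODULE_MAGIC mechanism Tom mentions can be sketched as follows: the module embeds a block of build-time constants, and the loader refuses to load it when any field differs from the server's own block. Fields and values here are invented for illustration; the important caveat — which is exactly the failure mode Tom describes — is that the check only covers what is recorded, so an ABI change not reflected in any field slips through.

```c
#include <assert.h>

/* Invented stand-in for the Pg_magic_struct recorded by PG_MODULE_MAGIC. */
typedef struct MagicBlock
{
	int			version;		/* server major version the module was built for */
	int			funcmaxargs;
	int			namedatalen;
} MagicBlock;

/* Loader-side check: every recorded field must match exactly. */
static int
magic_matches(const MagicBlock *server, const MagicBlock *module)
{
	return server->version == module->version &&
		server->funcmaxargs == module->funcmaxargs &&
		server->namedatalen == module->namedatalen;
}
```

A module built against pre-release headers of the same major version would carry an identical block and load cleanly, even if struct layouts changed underneath it.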
default_tablespace doc and partitioned rels
commit 87259588d0ab0b8e742e30596afa7ae25caadb18 Author: Alvaro Herrera Date: Thu Apr 25 10:20:23 2019 -0400 Fix tablespace inheritance for partitioned rels This doc change doesn't make sense to me: +++ b/doc/src/sgml/config.sgml @@ -7356,7 +7356,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv; This variable specifies the default tablespace in which to create objects (tables and indexes) when a CREATE command does -not explicitly specify a tablespace. +not explicitly specify a tablespace. It also determines the tablespace +that a partitioned relation will direct future partitions to. default_tablespace is a global GUC, so if a partitioned relation "directs" partitions anywhere, it's not to the fallback value of the GUC, but to its reltablespace, as this patch wrote in doc/src/sgml/ref/create_table.sgml: + the tablespace specified overrides default_tablespace + as the default tablespace to use for any newly created partitions when no + other tablespace is explicitly specified. Maybe I'm misreading config.sgml somehow ? I thought it would be more like this (but actually I think shouldn't say anything at all): +... It also determines the tablespace where new partitions are created, +if the parent, partitioned relation doesn't have a tablespace set. -- Justin
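The precedence under discussion can be restated as a small sketch (semantics as described in the thread and in create_table.sgml, not actual backend code): an explicit TABLESPACE clause on the new partition wins, then the partitioned parent's reltablespace, and only then the default_tablespace GUC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative precedence for where a new partition lands.  NULL means
 * "not set" at that level. */
static const char *
partition_tablespace(const char *explicit_clause,
					 const char *parent_reltablespace,
					 const char *default_tablespace_guc)
{
	if (explicit_clause != NULL)
		return explicit_clause;
	if (parent_reltablespace != NULL)
		return parent_reltablespace;
	return default_tablespace_guc;	/* NULL here means the database default */
}
```

On this reading, the config.sgml sentence is misleading precisely because the GUC only applies when the parent's reltablespace is unset — the parent never "directs" partitions to the GUC's value.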
Re: wrong units in ExplainPropertyFloat
Justin Pryzby writes: > Text mode uses appendStringInfo() for high density output, so this only > affects > non-text output, but it turns out that units aren't shown for nontext format > anyway - this seems like a deficiency, but it means there's no visible bug. Yeah, I concur: these should say "ms", but it's only latent so it's not surprising nobody's noticed. Pushed. regards, tom lane
Re: Truncate in synchronous logical replication failed
On Fri, 16 Apr 2021 at 17:19, osumi.takami...@fujitsu.com wrote: > Hi > > > On Friday, April 16, 2021 5:50 PM Amit Kapila wrote: >> On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com >> wrote: >> > >> > > Thanks for your reminder. It might be a way to solve this problem. >> > Yeah. I've made the 1st patch for this issue. >> > >> > In my env, with the patch >> > the TRUNCATE in synchronous logical replication doesn't hang. >> > >> >> Few initial comments: >> = >> 1. >> + relreplindex = relation->rd_replidindex; >> + >> + /* >> + * build attributes to idindexattrs. >> + */ >> + idindexattrs = NULL; >> + foreach(l, indexoidlist) >> + { >> + Oid indexOid = lfirst_oid(l); >> + Relation indexDesc; >> + int i; >> + bool isIDKey; /* replica identity index */ >> + >> + indexDesc = RelationIdGetRelation(indexOid); >> >> When you have oid of replica identity index (relreplindex) then what is the >> need to traverse all the indexes? > Ok. No need to traverse all the indexes. Will fix this part. > >> 2. >> It is better to name the function as RelationGet... > You are right. I'll modify this in my next version. > I took the liberty to address the review comments and provide a v2 patch on top of your v1 patch, also merging the test case. Sorry for attaching. -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.
diff --git a/src/backend/replication/logical/proto.c b/src/backend/replication/logical/proto.c index 2a1f9830e0..1cf59e0fb0 100644 --- a/src/backend/replication/logical/proto.c +++ b/src/backend/replication/logical/proto.c @@ -668,8 +668,7 @@ logicalrep_write_attrs(StringInfo out, Relation rel) /* fetch bitmap of REPLICATION IDENTITY attributes */ replidentfull = (rel->rd_rel->relreplident == REPLICA_IDENTITY_FULL); if (!replidentfull) - idattrs = RelationGetIndexAttrBitmap(rel, - INDEX_ATTR_BITMAP_IDENTITY_KEY); + idattrs = RelationGetIdentityKeyBitmap(rel); /* send the attributes */ for (i = 0; i < desc->natts; i++) diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 29702d6eab..ace167f324 100644 --- a/src/backend/utils/cache/relcache.c +++ b/src/backend/utils/cache/relcache.c @@ -5206,6 +5206,94 @@ restart: } } +Bitmapset * +RelationGetIdentityKeyBitmap(Relation relation) +{ + Bitmapset *idindexattrs; /* columns in the replica identity */ + List *indexoidlist; + List *newindexoidlist; + Oid relpkindex; + Oid relreplindex; + Relation indexDesc; + int i; + MemoryContext oldcxt; + + /* Quick exit if we already computed the result. */ + if (relation->rd_idattr != NULL) + return bms_copy(relation->rd_idattr); + + /* Fast path if definitely no indexes */ + if (!RelationGetForm(relation)->relhasindex) + return NULL; + + /* + * Get cached list of index OIDs. If we have to start over, we do so here. + */ +restart: + indexoidlist = RelationGetIndexList(relation); + + /* Fall out if no indexes (but relhasindex was set) */ + if (indexoidlist == NIL) + return NULL; + + /* Save some values to compare after building attributes */ + relpkindex = relation->rd_pkindex; + relreplindex = relation->rd_replidindex; + + /* + * build attributes to idindexattrs. 
+ */ + idindexattrs = NULL; + indexDesc = RelationIdGetRelation(relreplindex); + + /* Collect simple attribute references */ + for (i = 0; i < indexDesc->rd_index->indnatts; i++) + { + int attrnum = indexDesc->rd_index->indkey.values[i]; + + /* Romove non-key columns here. */ + if (attrnum != 0) + { + if (i < indexDesc->rd_index->indnkeyatts) +idindexattrs = bms_add_member(idindexattrs, + attrnum - FirstLowInvalidHeapAttributeNumber); + } + } + RelationClose(indexDesc); + + /* Check if we need to redo */ + newindexoidlist = RelationGetIndexList(relation); + if (equal(indexoidlist, newindexoidlist) && + relpkindex == relation->rd_pkindex && + relreplindex == relation->rd_replidindex) + { + /* Still the same index set, so proceed */ + list_free(newindexoidlist); + list_free(indexoidlist); + } + else + { + /* Gotta do it over ... might as well not leak memory */ + list_free(newindexoidlist); + list_free(indexoidlist); + bms_free(idindexattrs); + + goto restart; + } + + /* Don't leak the old values of these bitmaps, if any */ + bms_free(relation->rd_idattr); + relation->rd_idattr = NULL; + + /* Now save copies of the bitmaps in the relcache entry */ + oldcxt = MemoryContextSwitchTo(CacheMemoryContext); + relation->rd_idattr = bms_copy(idindexattrs); + MemoryContextSwitchTo(oldcxt); + + /* We return our original working copy for caller to play with */ + return idindexattrs; +} + /* * RelationGetExclusionInfo -- get info about index's exclusion constraint * diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h index 2fcdf79323..f772855ac6 100644 --- a/src/include/utils/relcache.h +++ b/src/include/utils/relcache.h @@ -65,6 +65,8 @@ typedef enum IndexAttrBitmapKind extern Bitmapset *Relati
Re: Bogus collation version recording in recordMultipleDependencies
On Fri, Apr 16, 2021 at 10:03:42AM -0400, Tom Lane wrote: > > Since the proposed patch removes the dependency code's special-case > handling of the default collation, I don't feel like we need to jump > through hoops to prove that the default collation is tracked the > same as other collations. A regression test with alternative outputs > is a significant ongoing maintenance burden, and I do not see where > we're getting a commensurate improvement in test coverage. Especially > since, AFAICS, the two alternative outputs would essentially have to > accept both the "it works" and "it doesn't work" outcomes. Fine by me, I was mentioning those in case we wanted to keep some extra coverage for that, but I agree it doesn't add much value. > So I propose that we do 0001 below, which is my first patch plus your > suggestion about fixing up create_index.sql. This passes check-world > for me under both C and en_US.utf8 prevailing locales. That's what I ended up with too, so LGTM!
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
On Thu, Apr 15, 2021 at 11:06 AM Matthias van de Meent wrote: > I've noticed there are a lot of places in the btree index > infrastructure (and also some other index AMs) that effectively > iterate over the attributes of the index tuple, but won't use > index_deform_tuple for reasons. However, this implies that they must > repeatedly call index_getattr, which in the worst case is O(n) for the > n-th attribute, slowing down extraction of multi-column indexes > significantly. As such, I've added some API that allows for iteration > (ish) over index attributes. Interesting approach. I think that in an ideal world we would have a tuple format with attribute lengths/offsets right in the header. But it's too late for that, so other approaches seem worth considering. > Also attached is 0002, which dynamically truncates attribute prefixes > of tuples whilst _binsrch-ing through a nbtree page. It greatly uses > the improved performance of 0001; they work very well together. The > problems that Peter (cc-ed) mentions in [0] only result in invalid > search bounds when traversing the tree, but on the page level valid > bounds can be constructed. > > This is patchset 1 of a series of patches I'm starting for eventually > adding static prefix truncation into nbtree infrastructure in > PostgreSQL. I've put up a wiki page [1] with my current research and > thoughts on that topic. The idea of making _bt_truncate() produce new leaf page high keys based on the lastleft tuple rather than the firstright tuple (i.e. +inf truncated attribute values rather than the current -inf) seems like a non-starter. As you point out in "1.) Suffix-truncation; -INF in high keys" on the Postgres wiki page, the current approach truncates firstright (not lastleft), making the left page's new high key contain what you call a 'foreign' value. But I see that as a big advantage of the current approach. Consider, for example, the nbtree high key "continuescan" optimization added by commit 29b64d1d. 
The fact that leaf page high keys are generated in this way kind of allows us to "peek" at the page to the immediate right before actually visiting it -- possibly without ever visiting it (which is where the benefit of that particular optimization lies). _bt_check_unique() uses a similar trick. After the Postgres 12 work, _bt_check_unique() will only visit a second page in the extreme case where we cannot possibly fit all of the relevant version duplicates on even one whole leaf page (i.e. practically never). There is also cleverness inside _bt_compare() to make sure that we handle the boundary cases perfectly while descending the tree. You might also consider how the nbtsplitloc.c code works with duplicates, and how that would be affected by +inf truncated attributes. The leaf-page-packing performed in the SPLIT_SINGLE_VALUE case only goes ahead when the existing high key confirms that this must be the rightmost page. Now, I guess that you could still do something like that if we switched to +inf semantics. But, the fact that the new right page will have a 'foreign' value in the SPLIT_SINGLE_VALUE-split case is also of benefit -- it is practically empty right after the split (since the original/left page is packed full), and we want this empty space to be eligible to either take more duplicates, or to take values that may happen to fit between the highly duplicated value and the original foreign high key value. We want that flexibility, I think. I also find -inf much more natural. If in the future we teach nbtree to truncate "inside" text attributes (say text columns), you'd pretty much be doing the same thing at the level of characters rather than whole columns. The -inf semantics are like strcmp() semantics. If you're going to pursue full prefix compression anyway, maybe you should use a low key on the leaf level in cases where the optimization is in use.
This creates complexity during page deletion, because the low key in the subtree to the right of the deletion target subtree may need to be updated. Perhaps you can find a way to make that work that isn't too complicated. > I've run some tests with regards to performance on my laptop; which > tests nbtree index traversal. The test is based on a recent UK land > registry sales prices dataset (25744780 rows), being copied from one > table into an unlogged table with disabled autovacuum, with one index > as specified by the result. Master @ 99964c4a, patched is with both > 0001 and 0002. The results are averages over 3 runs, with plain > configure, compiled by gcc (Debian 6.3.0-18+deb9u1). You should probably account for index size here. I have lots of my own tests for space utilization, using data from a variety of sources. -- Peter Geoghegan
Re: Bogus collation version recording in recordMultipleDependencies
Julien Rouhaud writes: > On Fri, Apr 16, 2021 at 10:03:42AM -0400, Tom Lane wrote: >> So I propose that we do 0001 below, which is my first patch plus your >> suggestion about fixing up create_index.sql. This passes check-world >> for me under both C and en_US.utf8 prevailing locales. > That's what I ended up with too, so LGTM! Pushed, thanks for review! (and I'll update the open items list in a sec) regards, tom lane
Re: Error when defining a set returning function
Dear Tom, Many thanks for answering my question so quickly. After your answer, I downloaded brand new versions of PostgreSQL 13.2, PostGIS 2.5.5, and compiled/installed with the standard parameters. I didn't get any error messages in the build. I then recompiled MobilityDB again and got the same error message. When debugging the function with gdb, I noticed that the rsinfo variable of the PostgreSQL function ExecMakeFunctionResultSet is modified in the macro SRF_RETURN_NEXT, causing the problem. Any idea how to solve this?

4353 SRF_RETURN_NEXT(funcctx, Int32GetDatum(result));
(gdb) up
#1 0x55b8a871fc56 in ExecMakeFunctionResultSet (fcache=0x55b8a8e6d9a0, econtext=0x55b8a8e6cfa0, argContext=0x55b8a9d00dd0, isNull=0x55b8a8e6d930, isDone=0x55b8a8e6d988) at /home/esteban/src/postgresql-13.2/build_dir/../src/backend/executor/execSRF.c:614
614 result = FunctionCallInvoke(fcinfo);
(gdb) p rsinfo
$5 = {type = T_ReturnSetInfo, econtext = 0x55b8a8e6cfa0, expectedDesc = 0x55b8a8e6e8f0, allowedModes = 3, returnMode = SFRM_ValuePerCall, isDone = ExprSingleResult, setResult = 0x0, setDesc = 0x0}
(gdb) n
4354 }
(gdb)
ExecMakeFunctionResultSet (fcache=0x55b8a8e6d9a0, econtext=0x55b8a8e6cfa0, argContext=0x55b8a9d00dd0, isNull=0x55b8a8e6d930, isDone=0x55b8a8e6d988) at /home/esteban/src/postgresql-13.2/build_dir/../src/backend/executor/execSRF.c:615
615 *isNull = fcinfo->isnull;
(gdb) p rsinfo
$6 = {type = T_ReturnSetInfo, econtext = 0x55b8a8e6cfa0, expectedDesc = 0x55b8a8e6e8f0, allowedModes = 3, returnMode = (SFRM_ValuePerCall | unknown: 256), isDone = ExprSingleResult, setResult = 0x0, setDesc = 0x0}
(gdb)
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: >> That's what I ended up with too, so LGTM! > Pushed, thanks for review! (and I'll update the open items list in a > sec) ... or maybe not just yet. Andres' buildfarm critters seem to have a different opinion than my machine about what the output of collate.icu.utf8 ought to be. I wonder what the prevailing LANG setting is for them, and which ICU version they're using. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
Hi, On 2021-04-16 12:55:28 -0400, Tom Lane wrote: > I wrote: > >> That's what I ended up with too, so LGTM! > > > Pushed, thanks for review! (and I'll update the open items list in a > > sec) > > ... or maybe not just yet. Andres' buildfarm critters seem to have > a different opinion than my machine about what the output of > collate.icu.utf8 ought to be. I wonder what the prevailing LANG > setting is for them, and which ICU version they're using.

andres@andres-pg-buildfarm-valgrind:~/src/pgbuildfarm-client-stock$ grep calliph *.conf
build-farm-copyparse.conf:animal => "calliphoridae",
build-farm-copyparse.conf:build_root => '/mnt/resource/andres/bf/calliphoridae',
andres@andres-pg-buildfarm-valgrind:~/src/pgbuildfarm-client-stock$ dpkg -l|grep icu
ii icu-devtools 67.1-6 amd64 Development utilities for International Components for Unicode
ii libicu-dev:amd64 67.1-6 amd64 Development files for International Components for Unicode
ii libicu67:amd64 67.1-6 amd64 International Components for Unicode
andres@andres-pg-buildfarm-valgrind:~/src/pgbuildfarm-client-stock$ locale
LANG=C.UTF-8
LANGUAGE=
LC_CTYPE="C.UTF-8"
LC_NUMERIC="C.UTF-8"
LC_TIME="C.UTF-8"
LC_COLLATE="C.UTF-8"
LC_MONETARY="C.UTF-8"
LC_MESSAGES="C.UTF-8"
LC_PAPER="C.UTF-8"
LC_NAME="C.UTF-8"
LC_ADDRESS="C.UTF-8"
LC_TELEPHONE="C.UTF-8"
LC_MEASUREMENT="C.UTF-8"
LC_IDENTIFICATION="C.UTF-8"
LC_ALL=

Greetings, Andres Freund
Re: Error when defining a set returning function
Esteban Zimanyi writes: > When debugging the function with gdb, I noticed that the rsinfo variable of > the PostgreSQL function ExecMakeFunctionResultSet is modified in the > macro SRF_RETURN_NEXT causing the problem. Any idea how to solve this? Well, what SRF_RETURN_NEXT thinks it's doing is rsi->isDone = ExprMultipleResult; \ which surely shouldn't change the returnMode field. At this point I'm guessing that you are compiling the PG headers with some compiler pragma that changes the struct packing rules. Don't do that. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: > ... or maybe not just yet. Andres' buildfarm critters seem to have > a different opinion than my machine about what the output of > collate.icu.utf8 ought to be. I wonder what the prevailing LANG > setting is for them, and which ICU version they're using. Oh, I bet it's "C.utf8", because I can reproduce the failure with that. This crystallizes a nagging feeling I'd had that you were misdescribing the collate.icu.utf8 test as not being run under --no-locale. Actually, it's only skipped if the encoding isn't UTF8, not the same thing. I think we need to remove the default-collation cases from that test too. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
Andres Freund writes: > On 2021-04-16 12:55:28 -0400, Tom Lane wrote: >> ... or maybe not just yet. Andres' buildfarm critters seem to have >> a different opinion than my machine about what the output of >> collate.icu.utf8 ought to be. I wonder what the prevailing LANG >> setting is for them, and which ICU version they're using. > LANG=C.UTF-8 I'd guessed that shortly later, but thanks for confirming. regards, tom lane
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: > Oh, I bet it's "C.utf8", because I can reproduce the failure with that. > This crystallizes a nagging feeling I'd had that you were misdescribing > the collate.icu.utf8 test as not being run under --no-locale. Actually, > it's only skipped if the encoding isn't UTF8, not the same thing. > I think we need to remove the default-collation cases from that test too. Hmm ... this is more subtle than it seemed. I tried to figure out where the default-collation dependencies were coming from, and it's quite non-obvious, at least for some of them. Observe:

u8de=# create table t1 (f1 text collate "fr_FR");
CREATE TABLE
u8de=# create index on t1(f1) where f1 > 'foo';
CREATE INDEX
u8de=# SELECT objid::regclass, refobjid::regcollation, refobjversion
FROM pg_depend d LEFT JOIN pg_class c ON c.oid = d.objid
WHERE refclassid = 'pg_collation'::regclass
AND coalesce(relkind, 'i') = 'i' AND relname LIKE 't1_%';
   objid   | refobjid  | refobjversion
-----------+-----------+---------------
 t1_f1_idx | "fr_FR"   | 2.28
 t1_f1_idx | "fr_FR"   | 2.28
 t1_f1_idx | "default" | 2.28
(3 rows)

(The "default" item doesn't show up if default collation is C, which is what's causing the buildfarm instability.) Now, it certainly looks like that index definition ought to only have fr_FR dependencies. I dug into it and discovered that the reason we're coming up with a dependency on "default" is that the WHERE clause looks like

{OPEXPR
 :opno 666
 :opfuncid 742
 :opresulttype 16
 :opretset false
 :opcollid 0
 :inputcollid 14484
 :args (
    {VAR
     :varno 1
     :varattno 1
     :vartype 25
     :vartypmod -1
     :varcollid 14484
     :varlevelsup 0
     :varnosyn 1
     :varattnosyn 1
     :location 23
    }
    {CONST
     :consttype 25
     :consttypmod -1
     :constcollid 100
     :constlen -1
     :constbyval false
     :constisnull false
     :location 28
     :constvalue 7 [ 28 0 0 0 102 111 111 ]
    }
 )
 :location 26
}

So sure enough, the comparison operator's inputcollid is fr_FR, but the 'foo' constant has constcollid = "default". 
That will have exactly zero impact on the semantics of the expression, but dependency.c doesn't realize that and reports it as a dependency anyway. I feel like this is telling us that there's a fundamental misunderstanding in find_expr_references_walker about which collation dependencies to report. It's reporting all the leaf-node collations, and ignoring the ones that actually count semantically, that is the inputcollid fields of function and operator nodes. Not sure what's the best thing to do here. Redesigning this post-feature-freeze doesn't seem terribly appetizing, but on the other hand, this index collation recording feature has put a premium on not overstating the collation dependencies of an expression. We don't want to tell users that an index is broken when it isn't really. regards, tom lane
Re: Forget close an open relation in ReorderBufferProcessTXN()
Hi, On 2021-04-16 08:42:40 +0530, Amit Kapila wrote: > I think it is because relation_open expects either caller to have a > lock on the relation or don't use 'NoLock' lockmode. AFAIU, we don't > need to acquire a lock on relation while decoding changes from WAL > because it uses a historic snapshot to build a relcache entry and all > the later changes to the rel are absorbed while decoding WAL. Right. > I think it is also important to *not* acquire any lock on relation > otherwise it can lead to some sort of deadlock or infinite wait in the > decoding process. Consider a case for operations like Truncate (or if > the user has acquired an exclusive lock on the relation in some other > way say via Lock command) which acquires an exclusive lock on > relation, it won't get replicated in synchronous mode (when > synchronous_standby_name is configured). The truncate operation will > wait for the transaction to be replicated to the subscriber and the > decoding process will wait for the Truncate operation to finish. However, this cannot be really relied upon for catalog tables. An output function might acquire locks or such. But for those we do not need to decode contents... This made me take a brief look at pgoutput.c - maybe I am missing something, but how is the following not a memory leak? static void maybe_send_schema(LogicalDecodingContext *ctx, ReorderBufferTXN *txn, ReorderBufferChange *change, Relation relation, RelationSyncEntry *relentry) { ... /* Map must live as long as the session does. */ oldctx = MemoryContextSwitchTo(CacheMemoryContext); relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc), CreateTupleDescCopy(outdesc)); MemoryContextSwitchTo(oldctx); send_relation_and_attrs(ancestor, xid, ctx); RelationClose(ancestor); If - and that's common - convert_tuples_by_name() won't have to do anything, the copied tuple descs will be permanently leaked. Greetings, Andres Freund
pg_amcheck option to install extension
Hi, Peter Geoghegan suggested that I have the cross version upgrade checker run pg_amcheck on the upgraded module. This seemed to me like a good idea, so I tried it, only to find that it refuses to run unless the amcheck extension is installed. That's fair enough, but it also seems to me like we should have an option to have pg_amcheck install the extension if it's not present, by running something like 'create extension if not exists amcheck'. Maybe in such a case there could also be an option to drop the extension when pg_amcheck's work is done - I haven't thought through all the implications. Given pg_amcheck is a new piece of work I'm not sure if we can sneak this in under the wire for release 14. I will certainly undertake to review anything expeditiously. I can work around this issue in the buildfarm, but it seems like something other users are likely to want. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: Error when defining a set returning function
Many thanks Tom for your help ! I removed the flag -fshort-enums and everything works fine ! On Fri, Apr 16, 2021 at 7:04 PM Tom Lane wrote: > Esteban Zimanyi writes: > > When debugging the function with gdb, I noticed that the rsinfo variable > of > > the PostgreSQL function ExecMakeFunctionResultSet is modified in the > > macro SRF_RETURN_NEXT causing the problem. Any idea how to solve this? > > Well, what SRF_RETURN_NEXT thinks it's doing is > > rsi->isDone = ExprMultipleResult; \ > > which surely shouldn't change the returnMode field. At this point > I'm guessing that you are compiling the PG headers with some compiler > pragma that changes the struct packing rules. Don't do that. > > regards, tom lane >
Re: ANALYZE counts LP_DEAD line pointers as n_dead_tup
On Wed, Apr 14, 2021 at 7:11 AM Masahiko Sawada wrote: > If we create a table with vacuum_index_cleanup = off or execute VACUUM > with INDEX_CLEANUP = off, vacuum updates pg_stat_all_tables.n_dead_tup > to the number of HEAPTUPLE_RECENTLY_DEAD tuples. Whereas analyze > updates it to the sum of the number of HEAPTUPLE_DEAD/RECENTLY_DEAD > tuples and LP_DEAD line pointers. So if the table has many LP_DEAD > line pointers due to skipping index cleanup, autovacuum is triggered > every time after analyze/autoanalyze. This issue seems to happen also > on back branches, probably from 12 where INDEX_CLEANUP option was > introduced. Hmm. > I think we can have heapam_scan_analyze_next_tuple() not count LP_DEAD > line pointer as lazy_scan_prune() does. Attached the patch for that. lazy_scan_prune() is concerned about what the state of the table *will be* when VACUUM finishes, based on its working assumption that index vacuuming and heap vacuuming always go ahead. This is exactly the same reason why lazy_scan_prune() will set LVPagePruneState.hastup to 'false' in the presence of an LP_DEAD item -- which is not how count_nondeletable_pages() considers whether the same page 'hastup' much later on, right at the end of the VACUUM (it will only consider the page safe to truncate away if it now only contains LP_UNUSED items -- LP_DEAD items make heap/table truncation unsafe). In general accounting rules like this may need to work slightly differently across near-identical functions because of "being versus becoming" issues. It is necessary to distinguish between "being" code (e.g., this ANALYZE code, count_nondeletable_pages() and its hastup issue) and "becoming" code (e.g., lazy_scan_prune() and its approach to counting "remaining" dead tuples as well as hastup-ness). I tend to doubt that your patch is the right approach because the two code paths already "agree" once you assume that the LP_DEAD items that lazy_scan_prune() sees will be gone at the end of the VACUUM. 
I do agree that this is a problem, though. Generally speaking, the "becoming" code from lazy_scan_prune() is not 100% sure that it will be correct in each case, for a large variety of reasons. But I think that we should expect it to be mostly correct. We definitely cannot allow it to be quite wrong all the time with some workloads. And so I agree that this is a problem for the INDEX_CLEANUP = off case, though it's equally an issue for the recently added failsafe mechanism. I do not believe that it is a problem for the bypass-indexes optimization, though, because that is designed to only be applied when there are practically zero LP_DEAD items. The optimization can make VACUUM record that there are zero dead tuples after the VACUUM finishes, even though there were in fact a very small non-zero number of dead tuples -- but that's not appreciably different from any of the other ways that the count of dead tuples could be inaccurate (e.g. concurrent opportunistic pruning). The specific tests that we apply inside lazy_vacuum() should make sure that autovacuum scheduling is never affected. The autovacuum scheduling code can safely "believe" that the indexes were vacuumed, because it really is the same as if there were precisely zero LP_DEAD items (or the same for all practical purposes). I'm not sure what to do, though. Both the INDEX_CLEANUP = off case and the failsafe case are only intended for emergencies. And it's hard to know what to do in a code path that is designed to rarely or never be used. -- Peter Geoghegan
Re: default_tablespace doc and partitioned rels
On 2021-Apr-16, Justin Pryzby wrote: > +++ b/doc/src/sgml/config.sgml > @@ -7356,7 +7356,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH > csv; > > This variable specifies the default tablespace in which to create > objects (tables and indexes) when a CREATE > command does > -not explicitly specify a tablespace. > +not explicitly specify a tablespace. It also determines the > tablespace > +that a partitioned relation will direct future partitions to. > > > default_tablespace is a global GUC, so if a partitioned relation "directs" > partitions anywhere, it's not to the fallback value of the GUC, but to its > reltablespace, as this patch wrote in doc/src/sgml/ref/create_table.sgml: Yes, but also the partitioned table's reltablespace is going to be set to default_tablespace, if no tablespace is explicitly specified in the partitioned table creation. A partitioned table is not created anywhere itself; the only thing it can do, is direct where are future partitions created. I don't think it's 100% obvious that default_tablespace will become the partitioned table's tablespace, which is why I added that phrase. I understand that the language might be unclear, but I don't think either of your suggestions make this any clearer. Removing it just hides the behavior, and this one: > +... It also determines the tablespace where new partitions are > created, > +if the parent, partitioned relation doesn't have a tablespace set. just documents that default_tablespace will be in effect at partition CREATE time, but it fails to remind the user that the partitioned table will acquire default_tablespace as its own tablespace. Maybe we can reword it in some other way. "If this parameter is set when a partitioned table is created, it will become the default tablespace for future partitions too, even if default_tablespace itself is reset later" ...?? 
-- Álvaro Herrera 39°49'30"S 73°17'W "En las profundidades de nuestro inconsciente hay una obsesiva necesidad de un universo lógico y coherente. Pero el universo real se halla siempre un paso más allá de la lógica" (Irulan)
Re: Bogus collation version recording in recordMultipleDependencies
I wrote: > I feel like this is telling us that there's a fundamental > misunderstanding in find_expr_references_walker about which > collation dependencies to report. It's reporting all the > leaf-node collations, and ignoring the ones that actually > count semantically, that is the inputcollid fields of > function and operator nodes. > Not sure what's the best thing to do here. Redesigning > this post-feature-freeze doesn't seem terribly appetizing, > but on the other hand, this index collation recording > feature has put a premium on not overstating the collation > dependencies of an expression. We don't want to tell users > that an index is broken when it isn't really. I felt less hesitant to modify find_expr_references_walker's behavior w.r.t. collations after realizing that most of it was not of long standing, but came in with 257836a75. So here's a draft patch that redesigns it as suggested above. Along the way I discovered that GetTypeCollations was quite broken for ranges and arrays, so this fixes that too. Per the changes in collate.icu.utf8.out, this gets rid of a lot of imaginary collation dependencies, but it also gets rid of some arguably-real ones. In particular, calls of record_eq and its siblings will be considered not to have any collation dependencies, although we know that internally those will look up per-column collations of their input types. We could imagine special-casing record_eq etc here, but that sure seems like a hack. I'm starting to have a bad feeling about 257836a75 overall. As I think I've complained before, I do not like anything about what it's done to pg_depend; it's forcing that relation to serve two masters, neither one well. 
We now see that the same remark applies to find_expr_references(), because the semantics of "which collations does this expression's behavior depend on" aren't identical to "which collations need to be recorded as direct dependencies of this expression", especially not if you'd prefer to minimize either list. (Which is important.) Moreover, for all the complexity it's introducing, it's next door to useless for glibc collations --- we might as well tell people "reindex everything when your glibc version changes", which could be done with a heck of a lot less infrastructure. The situation on Windows looks pretty user-unfriendly as well, per the other thread. So I wonder if, rather than continuing to pursue this right now, we shouldn't revert 257836a75 and try again later with a new design that doesn't try to commandeer the existing dependency infrastructure. We might have a better idea about what to do on Windows by the time that's done, too.

regards, tom lane

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 8d8e926c21..41093ea6ae 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1835,8 +1835,17 @@ recordDependencyOnSingleRelExpr(const ObjectAddress *depender,
  * the datatype.  However we do need a type dependency if there is no such
  * indirect dependency, as for example in Const and CoerceToDomain nodes.
  *
- * Similarly, we don't need to create dependencies on collations except where
- * the collation is being freshly introduced to the expression.
+ * Collations are handled primarily by recording the inputcollid's of node
+ * types that have them, as those are the ones that are semantically
+ * significant during expression evaluation.  We also record the collation of
+ * CollateExpr nodes, since those will be needed to print such nodes even if
+ * they don't really affect semantics.  Collations of leaf nodes such as Vars
+ * can be ignored on the grounds that if they're not default, they came from
+ * the referenced object (e.g., a table column), so the dependency on that
+ * object is enough.  (Note: in a post-const-folding expression tree, a
+ * CollateExpr's collation could have been absorbed into a Const or
+ * RelabelType node.  While ruleutils.c prints such collations for clarity,
+ * we may ignore them here as they have no semantic effect.)
  */
 static bool
 find_expr_references_walker(Node *node,
@@ -1876,29 +1885,6 @@ find_expr_references_walker(Node *node,
 			/* If it's a plain relation, reference this column */
 			add_object_address(OCLASS_CLASS, rte->relid, var->varattno,
 							   context->addrs);
-
-			/* Top-level collation if valid */
-			if (OidIsValid(var->varcollid))
-				add_object_address(OCLASS_COLLATION, var->varcollid, 0,
-								   context->addrs);
-			/* Otherwise, it may be a type with internal collations */
-			else if (var->vartype >= FirstNormalObjectId)
-			{
-				List	   *collations;
-				ListCell   *lc;
-
-				collations = GetTypeCollations(var->vartype);
-
-				foreach(lc, collations)
-				{
-					Oid			coll = lfirst_oid(lc);
-
-					if (OidIsValid(coll))
-						add_object_address(OCLASS_COLLATION,
-										   lfirst_oid(lc), 0,
-										   context->addrs);
-				}
-			}
 		}

 		/*
@@ -
Re: Error when defining a set returning function
On 4/16/21 3:32 PM, Esteban Zimanyi wrote: > Many thanks Tom for your help ! > > I removed the flag -fshort-enums and everything works fine ! > > If you build with pgxs it should supply the appropriate compiler flags. Alternatively, get the right settings from pg_config. In general rolling your own is a bad idea. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Re: default_tablespace doc and partitioned rels
On Fri, Apr 16, 2021 at 04:19:18PM -0400, Alvaro Herrera wrote: > Maybe we can reword it in some other way. "If this parameter is set > when a partitioned table is created, it will become the default > tablespace for future partitions too, even if default_tablespace itself > is reset later" ...?? +1 I'd say: If this parameter is set when a partitioned table is created, the partitioned table's tablespace will be set to the given tablespace, which will also be the default tablespace for partitions created in the future, even if default_tablespace itself has since been changed. -- Justin
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
On Fri, 16 Apr 2021 at 18:03, Peter Geoghegan wrote: > > On Thu, Apr 15, 2021 at 11:06 AM Matthias van de Meent > wrote: > > I've noticed there are a lot of places in the btree index > > infrastructure (and also some other index AMs) that effectively > > iterate over the attributes of the index tuple, but won't use > > index_deform_tuple for reasons. However, this implies that they must > > repeatedly call index_getattr, which in the worst case is O(n) for the > > n-th attribute, slowing down extraction of multi-column indexes > > significantly. As such, I've added some API that allows for iteration > > (ish) over index attributes. > > Interesting approach. I think that in an ideal world we would have a > tuple format with attribute lengths/offsets right in the header. I believe that that would indeed be ideal w.r.t. access speed, but also quite expensive w.r.t. amount of data stored. This would add 2 bytes per attribute in the current infrastructure (11 bits at least for each attribute to store offsets), on the current 12 bytes of overhead per indextuple (= 8 for IndexTuple + 4 for ItemId). That is probably always going to be a non-starter, seeing that we can relatively easily optimize our current attribute access patterns. > But > it's too late for that, so other approaches seem worth considering. Yep. > > Also attached is 0002, which dynamically truncates attribute prefixes > > of tuples whilst _binsrch-ing through a nbtree page. It greatly uses > > the improved performance of 0001; they work very well together. The > > problems that Peter (cc-ed) mentions in [0] only result in invalid > > search bounds when traversing the tree, but on the page level valid > > bounds can be constructed. > > > > This is patchset 1 of a series of patches I'm starting for eventually > > adding static prefix truncation into nbtree infrastructure in > > PostgreSQL. I've put up a wiki page [1] with my current research and > > thoughts on that topic. 
> > The idea of making _bt_truncate() produce new leaf page high keys > based on the lastleft tuple rather than the firstright tuple (i.e. > +inf truncated attribute values rather than the current -inf) seems > like a non-starter. As you point out in "1.) Suffix-truncation; -INF > in high keys" on the Postgres wiki page, the current approach > truncates firstright (not lastleft), making the left page's new high > key contain what you call a 'foreign' value. But I see that as a big > advantage of the current approach. > > Consider, for example, the nbtree high key "continuescan" optimization > added by commit 29b64d1d. The fact that leaf page high keys are > generated in this way kind of allows us to "peek" on the page to the > immediate right before actually visiting it -- possibly without ever > visiting it (which is where the benefit of that particular > optimization lies). _bt_check_unique() uses a similar trick. After the > Postgres 12 work, _bt_check_unique() will only visit a second page in > the extreme case where we cannot possibly fit all of the relevant > version duplicates on even one whole leaf page (i.e. practically > never). There is also cleverness inside _bt_compare() to make sure > that we handle the boundary cases perfectly while descending the tree. I understand and appreciate that the "-INF" truncation that is currently in place is being relied upon in quite a few places. Part of the effort for "+INF" truncation would be determining where and how to keep the benefits of the "-INF" truncation. I also believe that for internal pages truncating to "+INF" would be perfectly fine; the optimizations that I know of only rely on it at the leaf level. Completely separate from that, there's no reason (except for a potential lack of unused bits) we can't flag suffix-truncated columns as either "+INF" or "-INF" - that would allow us to apply each where useful. 
> You might also consider how the nbtsplitloc.c code works with > duplicates, and how that would be affected by +inf truncated > attributes. The leaf-page-packing performed in the SPLIT_SINGLE_VALUE > case only goes ahead when the existing high key confirms that this > must be the rightmost page. Now, I guess that you could still do > something like that if we switched to +inf semantics. But, the fact > that the new right page will have a 'foreign' value in the > SPLIT_SINGLE_VALUE-split case is also of benefit -- it is practically > empty right after the split (since the original/left page is packed > full), and we want this empty space to be eligible to either take more > duplicates, or to take values that may happen to fit between the > highly duplicated value and the original foreign high key value. We > want that flexibility, I think. > > I also find -inf much more natural. If in the future we teach nbtree > to truncate "inside" text attributes (say text columns), you'd pretty > much be doing the same thing at the level of characters rather than > whole columns. The -inf semantics are like strcmp() semantics.
Re: Bogus collation version recording in recordMultipleDependencies
On Sat, Apr 17, 2021 at 8:39 AM Tom Lane wrote: > Per the changes in collate.icu.utf8.out, this gets rid of > a lot of imaginary collation dependencies, but it also gets > rid of some arguably-real ones. In particular, calls of > record_eq and its siblings will be considered not to have > any collation dependencies, although we know that internally > those will look up per-column collations of their input types. > We could imagine special-casing record_eq etc here, but that > sure seems like a hack. Thanks for looking into all this. Hmm. > I"m starting to have a bad feeling about 257836a75 overall. > As I think I've complained before, I do not like anything about > what it's done to pg_depend; it's forcing that relation to serve > two masters, neither one well. ... We did worry about (essentially) this question quite a bit in the discussion thread, but we figured that you'd otherwise have to create a parallel infrastructure that would look almost identical (for example [1]). > ... We now see that the same remark > applies to find_expr_references(), because the semantics of > "which collations does this expression's behavior depend on" aren't > identical to "which collations need to be recorded as direct > dependencies of this expression", especially not if you'd prefer > to minimize either list. (Which is important.) ... Bugs in the current analyser code aside, if we had a second catalog and a second analyser for this stuff, then you'd still have the union of both minimised sets in total, with some extra duplication because you'd have some rows in both places that are currently handled by one row, no? > ... Moreover, for all > the complexity it's introducing, it's next door to useless for > glibc collations --- we might as well tell people "reindex > everything when your glibc version changes", which could be done > with a heck of a lot less infrastructure. ... 
You do gain reliable tracking of which indexes remain to be rebuilt, and warnings for common hazards like hot standbys with mismatched glibc, so I think it's pretty useful. As for the poverty of information from glibc, I don't see why it should hold ICU, Windows, FreeBSD users back. In fact I am rather hoping that by shipping this, glibc developers will receive encouragement to add the trivial interface we need to do better. > ... The situation on Windows > looks pretty user-unfriendly as well, per the other thread. That is unfortunate, it seems like such a stupid problem. Restating here for the sake of the list: initdb just needs to figure out how to ask for the current environment's locale in BCP 47 format ("en-US") when setting the default for your template databases, not the traditional format ("English_United States.1252") that Microsoft explicitly tells us not to store in databases and that doesn't work in the versioning API, but since we're mostly all Unix hackers we don't know how. > So I wonder if, rather than continuing to pursue this right now, > we shouldn't revert 257836a75 and try again later with a new design > that doesn't try to commandeer the existing dependency infrastructure. > We might have a better idea about what to do on Windows by the time > that's done, too. It seems to me that there are two things that would be needed to salvage this for PG14: (1) deciding that we're unlikely to come up with a better idea than using pg_depend for this (following the argument that it'd only create duplication to have a parallel dedicated catalog), (2) fixing any remaining flaws in the dependency analyser code. I'll look into the details some more on Monday. [1] https://www.postgresql.org/message-id/e9e22c5e-c018-f4ea-24c8-5b6d6fdacf30%402ndquadrant.com
Re: Bogus collation version recording in recordMultipleDependencies
Thomas Munro writes: > I'll look into the details some more on Monday. Fair enough. Although there are only a few buildfarm members complaining, I don't really want to leave them red all weekend. I could either commit the patch I just presented, or revert ef387bed8 ... got a preference? regards, tom lane
Re: Iterating on IndexTuple attributes and nbtree page-level dynamic prefix truncation
On Fri, Apr 16, 2021 at 2:20 PM Matthias van de Meent wrote: > > Interesting approach. I think that in an ideal world we would have a > > tuple format with attribute lengths/offsets right in the header. > > I believe that that would indeed be ideal w.r.t. access speed, but > also quite expensive w.r.t. amount of data stored. This would add 2 > bytes per attribute in the current infrastructure (11 bits at least > for each attribute to store offsets), on the current 12 bytes of > overhead per indextuple (= 8 for IndexTuple + 4 for ItemId). That is > probably always going to be a non-starter, seeing that we can > relatively easily optimize our current attribute access patterns. I don't think that that's why it's a non-starter. This design assumes a world in which everything has already been optimized for this layout. You no longer get to store the varlena header inline, which would break a lot of things in Postgres were it ever to be attempted. The space efficiency issues don't really apply because you have an offset for fixed-length types -- their presence is always implied. I think that you need to encode NULLs differently, which is a lot less space efficient when there are a lot of NULLs. But on the whole this design seems more efficient than what we have currently. This representation of index tuples would be a totally reasonable design were we in a green field situation. (Which is pretty far from the situation we're actually in, of course.) > I understand and appreciate that the "-INF" truncation that is > currently in place is being relied upon in quite some places. Part of > the effort for "+INF" truncation would be determining where and how to > keep the benefits of the "-INF" truncation. I also believe that for > internal pages truncating to "+INF" would be perfectly fine; the > optimizations that I know of only rely on it at the leaf level. I don't doubt that there is nothing special about -inf from a key space point of view. 
Actually...you could say that -inf is special to the limited extent that we know it only appears in pivot tuples and exploit that property when the !pivotsearch case/optimization is used. But that isn't much of an exception at a high level, so whatever. Anyway, it follows that +inf could in principle be used instead in some or all cases -- all that is truly essential for correctness is that the invariants always be respected. We're still in agreement up until here. > Yes, I also read and appreciate your comments on +inf vs -inf when > this came up in [0]. I'm impressed that you've done your homework on this. > However, if we could choose, I think that having > both options could be quite beneficial, especially when dealing with > many duplicates or duplicate prefixes. This is where things are much less clear -- maybe we're not in agreement here. Who knows, though -- maybe you're right. But you haven't presented any kind of argument. I understand that it's hard to articulate what effects might be in play with stuff like this, so I won't force the issue now. Strong evidence is of course the only way that you'll reliably convince me of this. I should point out that I am a little confused about how this +inf business could be both independently useful and pivotal to implementing [dynamic] prefix truncation/compression. Seems...weird to discuss them together, except maybe to mention in passing that this +inf thing is notable for particularly helping dynamic prefix stuff -- which is it? It is my strong preference that nbtsplitloc.c continue to know approximately nothing about compression or deduplication. 
While it is true that nbtsplitloc.c's _bt_recsplitloc() is aware of posting lists, this is strictly an optimization that is only justified by the fact that posting lists are sometimes very large, and therefore worth considering directly -- just to get a more accurate idea of how a relevant split point choice affects the balance of free space (we don't bother to do the same thing with non-key INCLUDE columns because they're generally small and equi-sized). And so this _bt_recsplitloc() thing is no exception to the general rule, which is: deduplication/posting list maintenance should be *totally* orthogonal to the page split choice logic (the design of posting list splits helps a lot with that). We can afford to have complicated split point choice logic because the question of which split point is optimal is totally decoupled from the question of which ones are correct -- in particular, from the correctness of the space accounting used to generate candidate split points. It may interest you to know that I once thought that it would be nice to have the *option* of +inf too, so that we could use it in very rare cases like the pathological SPLIT_MANY_DUPLICATES case that _bt_bestsplitloc() has some defenses against. It would perhaps be nice if we could use +inf selectively in that case. I never said anything about this publicly before now, mostly because it wasn't that important --
PATCH: generate fractional cheapest paths in generate_orderedappend_path
Hi,

This patch is a WIP fix for the issue described in [1], where the planner picks a more expensive plan with partition-wise joins enabled, and disabling this option produces a cheaper plan. That's a bit strange because with the option disabled we consider *fewer* plans, so we should not be able to generate a cheaper one.

The problem lies in generate_orderedappend_paths(), which builds two types of append paths - one with minimal startup cost, and one with minimal total cost. That however does not work for queries with LIMIT, which also need to consider the fractional cost; the path that is interesting from this perspective may be eliminated by the other two. Consider three paths (this comes from the reproducer shared in [1]):

A: nestloop_path   startup 0.585000      total 35708.292500
B: nestloop_path   startup 0.292500      total 150004297.292500
C: mergejoin_path  startup 9748.112737   total 14102.092737

With some reasonable LIMIT value (e.g. 5% of the data), we really want to pick path A. But that path is dominated both in startup cost (by B) and in total cost (by C). Hence generate_orderedappend_paths() will ignore it, and we'll generate a more expensive LIMIT plan.

In [2] Tom proposed to modify generate_orderedappend_paths() to also consider the fractional cheapest path, just like we do for startup and total costs. The attached patch does exactly that, and it does seem to do the trick. There are some loose ends, though:

1) If get_cheapest_fractional_path_for_pathkeys returns NULL, it's not clear whether to default to cheapest_startup or cheapest_total. We might also consider an incremental sort, in which case the startup cost matters (for Sort it does not). So we may need an enhanced version of get_cheapest_fractional_path_for_pathkeys that generates such paths.

2) Same for cheapest_total - maybe there's a partially sorted path, and using it with an incremental sort on top would be better than using cheapest_total_path + sort.

3) Not sure if get_cheapest_fractional_path_for_pathkeys should worry about require_parallel_safe, just like the other functions nearby.

I'd argue that this patch does not need to add the Incremental Sort capabilities in (1) and (2) - it's just another place where we decided not to consider this sort variant yet.

I'm not sure how much this depends on partition-wise join, and why disabling it generates the right plan. The reproducer uses that, but AFAICS generate_orderedappend_paths() has worked like this since 2010 (commit 11cad29c915). I'd bet the issue exists since then and it's possible to get similar cases even without partition-wise join. I can reproduce it on PostgreSQL 12, though (which however supports partition-wise join). Not sure whether this should be backpatched. We didn't get any reports until now, so it doesn't seem like a pressing issue. OTOH most users won't actually notice this, they'll just get worse plans without realizing there's a better option.

regards

[1] https://www.postgresql.org/message-id/011937a3-7427-b99f-13f1-c07a127cf94c%40enterprisedb.com
[2] https://www.postgresql.org/message-id/4006636.1618577893%40sss.pgh.pa.us

-- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index edba5e49a8..0284162034 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1716,6 +1716,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		List	   *pathkeys = (List *) lfirst(lcp);
 		List	   *startup_subpaths = NIL;
 		List	   *total_subpaths = NIL;
+		List	   *fractional_subpaths = NIL;
 		bool		startup_neq_total = false;
 		ListCell   *lcr;
 		bool		match_partition_order;
@@ -1745,7 +1746,8 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		{
 			RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
 			Path	   *cheapest_startup,
-					   *cheapest_total;
+					   *cheapest_total,
+					   *cheapest_fractional = NULL;
 
 			/* Locate the right paths, if they are available. */
 			cheapest_startup =
@@ -1761,6 +1763,21 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 											   TOTAL_COST,
 											   false);
 
+			/*
+			 * XXX strange that get_cheapest_fractional_path_for_pathkeys
+			 * does not have require_parallel_safe.
+			 */
+			if (root->tuple_fraction > 0)
+			{
+				double		path_fraction = 1.0 / root->tuple_fraction;
+
+				cheapest_fractional =
+					get_cheapest_fractional_path_for_pathkeys(childrel->pathlist,
+															  pathkeys,
+															  NULL,
+															  path_fraction);
+			}
+
 			/*
 			 * If we can't find any paths with the right order just use the
 			 * cheapest-total path; we'll have to sort it later.
@@ -1773,6 +1790,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				Assert(cheapest_total->param_info == NULL);
 			}
 
+			/*
+			 * XXX Do we need to do something about
Re: Bogus collation version recording in recordMultipleDependencies
On Sat, Apr 17, 2021 at 10:47 AM Tom Lane wrote: > Thomas Munro writes: > > I'll look into the details some more on Monday. > > Fair enough. > > Although there are only a few buildfarm members complaining, I don't > really want to leave them red all weekend. I could either commit the > patch I just presented, or revert ef387bed8 ... got a preference? +1 for committing the new patch for now. I will look into the record problem. More in a couple of days.
Re: ANALYZE counts LP_DEAD line pointers as n_dead_tup
On Fri, Apr 16, 2021 at 1:16 PM Peter Geoghegan wrote: > I'm not sure what to do, though. Both the INDEX_CLEANUP = off case and > the failsafe case are only intended for emergencies. And it's hard to > know what to do in a code path that is designed to rarely or never be > used. How about just documenting it in comments, as in the attached patch? I tried to address all of the issues with LP_DEAD accounting together. Both the issue raised by Masahiko, and one or two others that were also discussed recently on other threads. They all seem kind of related to me. I didn't address the INDEX_CLEANUP = off case in the comments directly (I just addressed the failsafe case). There is no good reason to think that the situation will resolve with INDEX_CLEANUP = off, so it didn't seem wise to mention it there. But that should still be okay -- INDEX_CLEANUP = off has practically been superseded by the failsafe, since it is much more flexible. And anybody that uses INDEX_CLEANUP = off cannot expect to skip index cleanup forever without seriously bad consequences all over the place.

-- Peter Geoghegan

From 5d24338e90f0f3eec7134c328d40270f2c50 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan
Date: Fri, 9 Apr 2021 18:50:25 -0700
Subject: [PATCH] VACUUM accounting: Explain LP_DEAD accounting.

---
 src/backend/access/heap/vacuumlazy.c | 70 +++-
 1 file changed, 47 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9f9ba5d308..ad37f25e2a 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -686,7 +686,16 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 						 new_min_multi,
 						 false);
 
-	/* report results to the stats collector, too */
+	/*
+	 * Report results to the stats collector, too.
+	 *
+	 * We deliberately avoid telling the stats collector about LP_DEAD items
+	 * that remain in the table when index/heap vacuuming has been bypassed by
+	 * the failsafe mechanism.  Avoid behaving too aggressively once the
+	 * danger of wraparound failure subsides.  Autoanalyze should notice any
+	 * remaining LP_DEAD items and schedule an autovacuum if nothing else
+	 * does.
+	 */
 	pgstat_report_vacuum(RelationGetRelid(rel),
 						 rel->rd_rel->relisshared,
 						 Max(new_live_tuples, 0),
@@ -1334,6 +1343,9 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 */
 	lazy_scan_prune(vacrel, buf, blkno, page, vistest, &prunestate);
 
+	Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+	Assert(!all_visible_according_to_vm || prunestate.all_visible);
+
 	/* Remember the location of the last page with nonremovable tuples */
 	if (prunestate.hastup)
 		vacrel->nonempty_pages = blkno + 1;
@@ -1404,7 +1416,6 @@ lazy_scan_heap(LVRelState *vacrel, VacuumParams *params, bool aggressive)
 	 * Handle setting visibility map bit based on what the VM said about
 	 * the page before pruning started, and using prunestate
 	 */
-	Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
 	if (!all_visible_according_to_vm && prunestate.all_visible)
 	{
 		uint8		flags = VISIBILITYMAP_ALL_VISIBLE;
@@ -1737,19 +1748,22 @@ retry:
 	/*
 	 * LP_DEAD items are processed outside of the loop.
 	 *
-	 * Note that we deliberately don't set hastup=true in the case of an
-	 * LP_DEAD item here, which is not how lazy_check_needs_freeze() or
+	 * Our working assumption is that any LP_DEAD items we encounter here
+	 * will become LP_UNUSED inside lazy_vacuum_heap_page() before VACUUM
+	 * finishes.
+	 *
+	 * This is why the number of LP_DEAD items won't be reported to the
+	 * stats collector -- we simply won't leave any behind.  (Actually,
+	 * cases where we bypass index vacuuming will in fact leave behind
+	 * LP_DEAD items, but that only happens in special circumstances.  We
+	 * avoid trying to compensate for that in our later report to the
+	 * stats collector.)
+	 *
+	 * This is also why we deliberately don't set hastup=true in the case
+	 * of an LP_DEAD item.  This is not how lazy_check_needs_freeze() or
 	 * count_nondeletable_pages() do it -- they only consider pages empty
 	 * when they only have LP_UNUSED items, which is important for
 	 * correctness.
-	 *
-	 * Our assumption is that any LP_DEAD items we encounter here will
-	 * become LP_UNUSED inside lazy_vacuum_heap_page() before we actually
-	 * call count_nondeletable_pages().  In any case our opinion of
-	 * whether or not a page 'hastup' (which is how our caller sets its
-	 * vacrel->nonempty_pages value) is inherently race-prone.  It must be
-	 * treated as advisory/unreliable, so we might as well be slightly
-	 * optimistic.
 	 */
 	if (ItemIdIsDead(itemid))
 	{
@@ -1901,9 +1915,6 @@ retry:
 	 * that will need to be vacuumed in indexes later, or a LP_NORMAL tuple
 	 * that remains and needs to be considered for freezing now (LP_UNUSED and
 	 * L
Re: terminate called after throwing an instance of 'std::bad_alloc'
Hi, On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote: > I'd be happy to run with a prototype fix for the leak to see if the other > issue > does (not) recur. I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look! Greetings, Andres Freund
Re: Bogus collation version recording in recordMultipleDependencies
Thomas Munro writes: > On Sat, Apr 17, 2021 at 10:47 AM Tom Lane wrote: >> Although there are only a few buildfarm members complaining, I don't >> really want to leave them red all weekend. I could either commit the >> patch I just presented, or revert ef387bed8 ... got a preference? > +1 for committing the new patch for now. I will look into the > record problem. More in a couple of days. OK, done. regards, tom lane
Re: terminate called after throwing an instance of 'std::bad_alloc'
On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> Hi,
>
> On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > I'd be happy to run with a prototype fix for the leak to see if the other issue does (not) recur.
>
> I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look!

This doesn't seem to address the problem triggered by the reproducer at https://www.postgresql.org/message-id/20210331040751.gu4...@telsasoft.com (sorry I didn't CC you)

CREATE OR REPLACE FUNCTION cfn() RETURNS void LANGUAGE PLPGSQL AS $$
declare
    a record;
begin
    FOR a IN SELECT generate_series(1,99) LOOP
        PERFORM format('select 1');
    END LOOP;
END
$$;

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql 2>&1 |grep 'max resident'

-- Justin
Re: terminate called after throwing an instance of 'std::bad_alloc'
On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > I'd be happy to run with a prototype fix for the leak to see if the other issue does (not) recur.
> >
> > I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look!
>
> This doesn't seem to address the problem triggered by the reproducer at https://www.postgresql.org/message-id/20210331040751.gu4...@telsasoft.com (sorry I didn't CC you)

I take that back - I forgot that this doesn't release RAM until hitting a threshold.

Without the patch, it looks like:

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
!       61820 kB max resident size
!       65020 kB max resident size
!       68812 kB max resident size
!       71152 kB max resident size
!       76820 kB max resident size
!       78760 kB max resident size
!       81140 kB max resident size
!       83520 kB max resident size
!       93084 kB max resident size
!       94756 kB max resident size
!       96416 kB max resident size

With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like this:

!       61436 kB max resident size
!       61572 kB max resident size
!       63236 kB max resident size
!       63236 kB max resident size
!       63556 kB max resident size
!       63556 kB max resident size
!       63880 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size

-- Justin
RE: Truncate in synchronous logical replication failed
On Saturday, April 17, 2021 12:53 AM Japin Li > On Fri, 16 Apr 2021 at 17:19, osumi.takami...@fujitsu.com > wrote: > > On Friday, April 16, 2021 5:50 PM Amit Kapila > wrote: > >> On Fri, Apr 16, 2021 at 12:56 PM osumi.takami...@fujitsu.com > >> wrote: > >> > > >> > > Thanks for your reminder. It might be a way to solve this problem. > >> > Yeah. I've made the 1st patch for this issue. > >> > > >> > In my env, with the patch > >> > the TRUNCATE in synchronous logical replication doesn't hang. > >> > > >> > >> Few initial comments: > >> = > >> 1. > >> + relreplindex = relation->rd_replidindex; > >> + > >> + /* > >> + * build attributes to idindexattrs. > >> + */ > >> + idindexattrs = NULL; > >> + foreach(l, indexoidlist) > >> + { > >> + Oid indexOid = lfirst_oid(l); > >> + Relation indexDesc; > >> + int i; > >> + bool isIDKey; /* replica identity index */ > >> + > >> + indexDesc = RelationIdGetRelation(indexOid); > >> > >> When you have oid of replica identity index (relreplindex) then what > >> is the need to traverse all the indexes? > > Ok. No need to traverse all the indexes. Will fix this part. > > > >> 2. > >> It is better to name the function as RelationGet... > > You are right. I'll modify this in my next version. > > > > I took the liberty to address review comments and provide a v2 patch on top > of your v1 patch, also merging the test case. > > Sorry for attaching. No problem. Thank you for updating the patch. I've made some cosmetic changes - could you please check them? They have already been run through pgindent. I ran the regression tests and saw no failures. Just in case, I also ran the 010_truncate.pl test 100 times in a tight loop, which didn't fail either. Best Regards, Takamichi Osumi

Attachment: truncate_in_synchronous_logical_replication_v03.patch
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund wrote:
>
> > I think it is also important to *not* acquire any lock on relation otherwise it can lead to some sort of deadlock or infinite wait in the decoding process. Consider a case for operations like Truncate (or if the user has acquired an exclusive lock on the relation in some other way say via Lock command) which acquires an exclusive lock on relation, it won't get replicated in synchronous mode (when synchronous_standby_name is configured). The truncate operation will wait for the transaction to be replicated to the subscriber and the decoding process will wait for the Truncate operation to finish.
>
> However, this cannot be really relied upon for catalog tables. An output function might acquire locks or such. But for those we do not need to decode contents...

True, so, if we don't need to decode contents then we won't have the problems of the above kind.

> This made me take a brief look at pgoutput.c - maybe I am missing something, but how is the following not a memory leak?
>
> static void
> maybe_send_schema(LogicalDecodingContext *ctx,
>                   ReorderBufferTXN *txn, ReorderBufferChange *change,
>                   Relation relation, RelationSyncEntry *relentry)
> {
> ...
>     /* Map must live as long as the session does. */
>     oldctx = MemoryContextSwitchTo(CacheMemoryContext);
>     relentry->map = convert_tuples_by_name(CreateTupleDescCopy(indesc),
>                                            CreateTupleDescCopy(outdesc));
>     MemoryContextSwitchTo(oldctx);
>     send_relation_and_attrs(ancestor, xid, ctx);
>     RelationClose(ancestor);
>
> If - and that's common - convert_tuples_by_name() won't have to do anything, the copied tuple descs will be permanently leaked.

I also think this is a permanent leak. I think we need to free all the memory associated with this map on the invalidation of this particular relsync entry (basically in rel_sync_cache_relation_cb).

-- With Regards, Amit Kapila.
Re: terminate called after throwing an instance of 'std::bad_alloc'
Hi,

On Fri, Apr 16, 2021, at 20:18, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> > On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > > Hi,
> > >
> > > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > > I'd be happy to run with a prototype fix for the leak to see if the other issue does (not) recur.
> > >
> > > I just posted a prototype fix to https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de (just because that was the first thread I re-found). It'd be cool if you could have a look!
> >
> > This doesn't seem to address the problem triggered by the reproducer at https://www.postgresql.org/message-id/20210331040751.gu4...@telsasoft.com (sorry I didn't CC you)
>
> I take that back - I forgot that this doesn't release RAM until hitting a threshold.

Phew.

> Without the patch, it looks like:
>
> $ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
> !       61820 kB max resident size
> !       65020 kB max resident size
> !       68812 kB max resident size
> !       71152 kB max resident size
> !       76820 kB max resident size
> !       78760 kB max resident size
> !       81140 kB max resident size
> !       83520 kB max resident size
> !       93084 kB max resident size
> !       94756 kB max resident size
> !       96416 kB max resident size
>
> With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like this:
>
> !       61436 kB max resident size
> !       61572 kB max resident size
> !       63236 kB max resident size
> !       63236 kB max resident size
> !       63556 kB max resident size
> !       63556 kB max resident size
> !       63880 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size

If you set the define to a much lower value, how low does this get?
I didn't see an easy way to make the recreation of the context dependent on memory usage, but that'd of course be much better than an #iterations-based approach. I'll play around with your test tomorrow or so, if you don't, but now the next step is risotto... Regards, Andres
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila wrote: > > On Thu, Apr 15, 2021 at 4:00 PM Japin Li wrote: > > > > The RelationIdGetRelation() comment says: > > > > > Caller should eventually decrement count. (Usually, > > > that happens by calling RelationClose().) > > > > However, it doesn't do it in ReorderBufferProcessTXN(). > > I think we should close it, here is a patch that fixes it. Thoughts? > > > > +1. Your fix looks correct to me but can we test it in some way? > I have tried to find a test but was not able to find one. I have tried with a foreign table but we don't log truncate for it, see ExecuteTruncate. It has a check that it will log for relids where RelationIsLogicallyLogged. If that is the case, it is not clear to me how we can ever hit this condition. Have you tried to find the test? -- With Regards, Amit Kapila.
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Fri, Apr 16, 2021 at 11:24 PM Andres Freund wrote: > > > > I think it is also important to *not* acquire any lock on relation > > otherwise it can lead to some sort of deadlock or infinite wait in the > > decoding process. Consider a case for operations like Truncate (or if > > the user has acquired an exclusive lock on the relation in some other > > way say via Lock command) which acquires an exclusive lock on > > relation, it won't get replicated in synchronous mode (when > > synchronous_standby_name is configured). The truncate operation will > > wait for the transaction to be replicated to the subscriber and the > > decoding process will wait for the Truncate operation to finish. > > However, this cannot be really relied upon for catalog tables. An output > function might acquire locks or such. But for those we do not need to > decode contents... > I see that if we define a user_catalog_table (create table t1_cat(c1 int) WITH(user_catalog_table = true);), we are able to decode operations like (insert, truncate) on such a table. What do you mean by "But for those we do not need to decode contents"? -- With Regards, Amit Kapila.
Re: Forget close an open relation in ReorderBufferProcessTXN()
On Sat, 17 Apr 2021 at 14:09, Amit Kapila wrote: > On Thu, Apr 15, 2021 at 4:53 PM Amit Kapila wrote: >> >> On Thu, Apr 15, 2021 at 4:00 PM Japin Li wrote: >> > >> > The RelationIdGetRelation() comment says: >> > >> > > Caller should eventually decrement count. (Usually, >> > > that happens by calling RelationClose().) >> > >> > However, it doesn't do it in ReorderBufferProcessTXN(). >> > I think we should close it, here is a patch that fixes it. Thoughts? >> > >> >> +1. Your fix looks correct to me but can we test it in some way? >> > > I have tried to find a test but not able to find one. I have tried > with a foreign table but we don't log truncate for it, see > ExecuteTruncate. It has a check that it will log for relids where > RelationIsLogicallyLogged. If that is the case, it is not clear to me > how we can ever hit this condition? Have you tried to find the test? I also couldn't find a test for this. It was introduced in 5dfd1e5a6696, written by Simon Riggs, Marco Nenciarini and Peter Eisentraut. Maybe they can explain when we can enter this condition? -- Regards, Japin Li. ChengDu WenWu Information Technology Co.,Ltd.