date:20200803

I propose to replace the remaining uses of StrNCpy() with strlcpy() and 
remove the former.  It's clear that strlcpy() has won the popularity 
contest, and the presence of the former is just confusing now.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From ae0540f02eedfcf7a2b31ea8570d6bd26b31ef19 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 3 Aug 2020 08:25:54 +0200
Subject: [PATCH] Replace remaining StrNCpy() by strlcpy()

They are equivalent, and strlcpy() has become the preferred spelling.
Convert over the remaining cases.  Remove StrNCpy() as it is now
unused.
---
 contrib/pgcrypto/crypt-des.c  |  2 +-
 src/backend/access/transam/slru.c |  2 +-
 src/backend/access/transam/xlogarchive.c  |  2 +-
 src/backend/catalog/pg_constraint.c   |  2 +-
 src/backend/commands/indexcmds.c  |  2 +-
 src/backend/commands/statscmds.c  |  2 +-
 src/backend/commands/tablecmds.c  |  2 +-
 src/backend/postmaster/pgstat.c   |  2 +-
 src/backend/replication/logical/logical.c |  3 +-
 src/backend/replication/slot.c|  2 +-
 src/backend/utils/adt/formatting.c|  8 ++---
 src/backend/utils/adt/name.c  |  4 +--
 src/backend/utils/adt/pg_locale.c |  9 --
 src/backend/utils/adt/ruleutils.c |  2 +-
 src/common/exec.c |  4 +--
 src/include/c.h   | 29 ---
 src/interfaces/ecpg/pgtypeslib/dt_common.c|  4 +--
 src/interfaces/ecpg/test/pg_regress_ecpg.c|  2 +-
 .../ssl_passphrase_func.c |  2 +-
 19 files changed, 24 insertions(+), 61 deletions(-)

diff --git a/contrib/pgcrypto/crypt-des.c b/contrib/pgcrypto/crypt-des.c
index 6efaa609c9..98c30ea122 100644
--- a/contrib/pgcrypto/crypt-des.c
+++ b/contrib/pgcrypto/crypt-des.c
@@ -720,7 +720,7 @@ px_crypt_des(const char *key, const char *setting)
if (des_setkey((char *) keybuf))
return NULL;
}
-   StrNCpy(output, setting, 10);
+   strlcpy(output, setting, 10);
 
/*
 * Double check that we weren't given a short setting. If we 
were, the
diff --git a/src/backend/access/transam/slru.c 
b/src/backend/access/transam/slru.c
index 9e145f1c36..d1dbb43e09 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -252,7 +252,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, 
int nlsns,
 */
ctl->shared = shared;
ctl->do_fsync = true;   /* default behavior */
-   StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
+   strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
 }
 
 /*
diff --git a/src/backend/access/transam/xlogarchive.c 
b/src/backend/access/transam/xlogarchive.c
index cdd586fcfb..8f8734dc1d 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -323,7 +323,7 @@ ExecuteRecoveryCommand(const char *command, const char 
*commandName, bool failOn
case 'r':
/* %r: filename of last restartpoint */
sp++;
-   StrNCpy(dp, lastRestartPointFname, endp 
- dp);
+   strlcpy(dp, lastRestartPointFname, endp 
- dp);
dp += strlen(dp);
break;
case '%':
diff --git a/src/backend/catalog/pg_constraint.c 
b/src/backend/catalog/pg_constraint.c
index fdc63e7dea..6a6b2cb8c0 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -484,7 +484,7 @@ ChooseConstraintName(const char *name1, const char *name2,
conDesc = table_open(ConstraintRelationId, AccessShareLock);
 
/* try the unmodified label first */
-   StrNCpy(modlabel, label, sizeof(modlabel));
+   strlcpy(modlabel, label, sizeof(modlabel));
 
for (;;)
{
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f..7819266a63 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2246,7 +2246,7 @@ ChooseRelationName(const char *name1, const char *name2,
charmodlabel[NAMEDATALEN];
 
/* try the unmodified label first */
-   StrNCpy(modlabel, label, sizeof(modlabel));
+   strlcpy(modlabel, label, sizeof(modlabel));
 
for (;;)
{
diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c
index 974828545c..3057d89d50 100644
--- a/src/backend/commands/statscmds.c
+++ b/src/backend/commands/statscmds.c
@@ -681,7 +681,7 @@ ChooseExtendedStatisticName(const char *nam

Re: [Proposal] Global temporary tables

2020-08-03 Thread movead...@highgo.ca


>Fixed in global_temporary_table_v29-pg13.patch
>Please check.

I find this is the most latest mail with an attachment, so I test and reply on
this thread, several points as below:

1. I notice it produces new relfilenode when new session login and some
data insert. But the relfilenode column in pg_class still the one when create
the global temp table. I think you can try to show 0 in this area as what nail
relation does. 

2. The nail relations handle their relfilenodes by RelMapFile struct, and this
patch use hash entry and relfilenode_list, maybe RelMapFile approach more
understandable in my opinion. Sorry if I miss the real design for that.

3. I get a wrong result of pg_relation_filepath() function for global temp 
table,
I think it's necessaryto keep this an correct output.

4. In gtt_search_by_relid() function, it has not handle the missing_ok argument
if gtt_storage_local_hash is null. There should be some comments if it's the 
right
code.

5. It's a long patch and hard to review, I think it will pretty good if it can 
be
divided into several subpatches with relatively independent subfunctions.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca

Re: Is it worth accepting multiple CRLs?

At Fri, 31 Jul 2020 05:53:53 -0700, Henry B Hotz  wrote in 
> A CA may issue a CRL infrequently, but issue a delta-CRL frequently. Does the 
> logic support this properly?

If you are talking about regsitering new revokations while server is
running, it checks newer CRLs upon each lookup according to the
documentation [1], so a new Delta-CRL can be added after server
start. If server restart is allowed, the CRL file specified by
ssl_crl_file can contain multiple CRLs by just concatenation.

[1]: https://www.openssl.org/docs/man1.1.1/man3/X509_LOOKUP_hash_dir.html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Is it worth accepting multiple CRLs?

At Fri, 31 Jul 2020 09:00:14 -0400, Stephen Frost  wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > PostgreSQL server accepts only one CRL file. It is easy to expand
> > be_tls_init to accept a directory set in ssl_crl_file. But I'm not
> > sure CRL is actually even utilized in the field so that could ends
> > with just bloating the documentation.
> > 
> > Is it work doing?
> 
> Yes, CRLs are absolutely used in the field and having this would be
> nice.

Thanks for the opinion. I'll continue working on this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: problem with RETURNING and update row movement

2020-08-03 Thread Amit Langote

On Mon, Aug 3, 2020 at 2:54 PM Amit Langote  wrote:
> However, I noticed that having system columns like ctid, xmin, etc. in
> the RETURNING list is now broken and maybe irrepairably due to the
> approach we are taking in the patch.  Let me show an example:
>
> drop table foo;
> create table foo (a int, b int) partition by list (a);
> create table foo1 (c int, b int, a int);
> alter table foo1 drop c;
> alter table foo attach partition foo1 for values in (1);
> create table foo2 partition of foo for values in (2);
> create table foo3 partition of foo for values in (3);
> create or replace function trigfunc () returns trigger language
> plpgsql as $$ begin new.b := 2; return new; end; $$;
> create trigger trig before insert on foo2 for each row execute
> function trigfunc();
> insert into foo values (1, 1), (2, 2), (3, 3);
> update foo set a = 2 from (values (1), (2), (3)) s(x) where a = s.x
> returning tableoid::regclass, ctid, xmin, xmax, *;
>  tableoid |  ctid  | xmin |xmax| a | b | x
> --++--++---+---+---
>  foo2 | (4294967295,0) |  128 | 4294967295 | 2 | 2 | 1
>  foo2 | (0,3)  |  782 |  0 | 2 | 2 | 2
>  foo2 | (0,4)  |  782 |  0 | 2 | 2 | 3
> (3 rows)
>
> During foo1's update, it appears that we are losing the system
> information in the physical tuple initialized during ExecInsert on
> foo2 during its conversion back to foo1's reltype using the new code.
>  I haven't been able to figure out how to preserve the system
> information in HeapTuple contained in the destination slot across the
> conversion.  Note we want to use the latter to project the RETURNING
> list.
>
> By the way, you'll need two adjustments to even get this example
> working, one of which is a reported problem that system columns in
> RETURNING list during an operation on partitioned table stopped
> working in PG 12 [1] for which I've proposed a workaround (attached).
> Another is that we forgot in our patch to "materialize" the virtual
> tuple after conversion, which means slot_getsysattr() can't find it to
> access system columns like xmin, etc.

The only solution I could think of for this so far is this:

+   if (map)
+   {
+   orig_slot = execute_attr_map_slot(map,
+ res_slot,
+ orig_slot);
+
+   /*
+* A HACK to install system information into the just
+* converted tuple so that RETURNING computes any
+* system columns correctly. This would be the same
+* information that would be present in the HeapTuple
+* version of the tuple in res_slot.
+*/
+   tuple = ExecFetchSlotHeapTuple(orig_slot, true,
+  &should_free);
+   tuple->t_data->t_infomask &= ~(HEAP_XACT_MASK);
+   tuple->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
+   tuple->t_data->t_infomask |= HEAP_XMAX_INVALID;
+   HeapTupleHeaderSetXmin(tuple->t_data,
+  GetCurrentTransactionId());
+   HeapTupleHeaderSetCmin(tuple->t_data,
+  estate->es_output_cid);
+   HeapTupleHeaderSetXmax(tuple->t_data, 0); /*
for cleanliness */
+   }
+   /*
+* Override tts_tableOid with the OID of the destination
+* partition.
+*/
+   orig_slot->tts_tableOid =
RelationGetRelid(destRel->ri_RelationDesc);
+   /* Also the TID. */
+   orig_slot->tts_tid = res_slot->tts_tid;

..but it might be too ugly :(.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-08-03 Thread Michael Paquier

On Mon, Jun 29, 2020 at 09:36:56PM +1200, David Rowley wrote:
> I've attached a graph showing the results for the AMD and Intel runs
> and also csv files of the pgbench tps output.  I've also attached each
> version of the patch I tested.
> 
> It would be good to see some testing done on other hardware.

Worth noting that the patch set fails to apply.  I have moved this
entry to the next CF, waiting on author.
--
Michael


signature.asc
Description: PGP signature

Re: SQL/JSON: JSON_TABLE

2020-08-03 Thread Michael Paquier

On Sun, Jul 05, 2020 at 12:15:58PM -0500, Justin Pryzby wrote:
> It looks like this needs to be additionally rebased - I will set cfbot to
> "Waiting".

...  Something that has not happened in four weeks, so this is marked
as returned with feedback.  Please feel free to resubmit once a rebase
is done.
--
Michael


signature.asc
Description: PGP signature

Re: Confusing behavior of create table like


On 2020-08-01 00:06, Konstantin Knizhnik wrote:

Postgres provides serial and bigserial column types for which it
implicitly creates sequence.
As far as this mechanism is somehow hidden from user, it may be
confusing that table
created with CREATE TABLE LIKE has no associated sequence.


That's why identity columns were added.  You shouldn't use serial 
columns anymore, especially if you are concerned about behaviors like this.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Replace remaining StrNCpy() by strlcpy()

2020-08-03 Thread David Rowley

On Mon, 3 Aug 2020 at 18:59, Peter Eisentraut
 wrote:
> I propose to replace the remaining uses of StrNCpy() with strlcpy() and
> remove the former.  It's clear that strlcpy() has won the popularity
> contest, and the presence of the former is just confusing now.

It certainly would be good to get rid of some of these, but are some
of the changes not a bit questionable?

e.g:

@@ -4367,7 +4367,7 @@ pgstat_send_archiver(const char *xlog, bool failed)
  */
  pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_ARCHIVER);
  msg.m_failed = failed;
- StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
+ strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
  msg.m_timestamp = GetCurrentTimestamp();
  pgstat_send(&msg, sizeof(msg));

Will mean that we'll now no longer zero the full length of the m_xlog
field after the end of the string. Won't that mean we'll start writing
junk bytes to the stats collector?

David

Re: Keep elog(ERROR) and ereport(ERROR) calls in the cold path

2020-08-03 Thread David Rowley

On Mon, 3 Aug 2020 at 19:54, Michael Paquier  wrote:
> Worth noting that the patch set fails to apply.  I have moved this
> entry to the next CF, waiting on author.

Thanks.

NB: It's not a patch set. It's 3 different versions of the patch.
They're not all meant to apply at once. Probably the CF bot wasn't
aware of that though :-(

David

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-03 Thread Ashutosh Sharma

Hi Robert,

Thanks for the review.

I've gone through all your review comments and understood all of them
except this one:

You really cannot
> modify the buffer like this and then decide, oops, never mind, I think
> I won't mark it dirty or write WAL for the changes. If you do that,
> the buffer is still in memory, but it's now been modified. A
> subsequent operation that modifies it will start with the altered
> state you created here, quite possibly leading to WAL that cannot be
> correctly replayed on the standby. In other words, you've got to
> decide for certain whether you want to proceed with the operation
> *before* you enter the critical section.
>

Could you please explain this point once more in detail? I am not quite
able to understand under what circumstances a buffer would be modified, but
won't be marked as dirty or a WAL won't be written for it.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: Is it worth accepting multiple CRLs?

Uggg.

At Mon, 03 Aug 2020 16:19:37 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 31 Jul 2020 05:53:53 -0700, Henry B Hotz  wrote in 
> > A CA may issue a CRL infrequently, but issue a delta-CRL frequently. Does 
> > the logic support this properly?
> 
> If you are talking about regsitering new revokations while server is
> running, it checks newer CRLs upon each lookup according to the
> documentation [1], so a new Delta-CRL can be added after server
> start. If server restart is allowed, the CRL file specified by

I didin't know that ssl files are reloaded by SIGHUP (pg_ctl
reload). So the ssl_crl_file is also reloaded on server reload.

> ssl_crl_file can contain multiple CRLs by just concatenation.
> 
> [1]: https://www.openssl.org/docs/man1.1.1/man3/X509_LOOKUP_hash_dir.html

Still on-demand loading is the advantage of the hashed directory
method.  I'll continue working..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Reduce/eliminate the impact of FPW

2020-08-03 Thread Daniel Wood

I thought that the biggest reason for the pgbench RW slowdown during a 
checkpoint was the flood of dirty page writes increasing the COMMIT latency.  
It turns out that the documentation which states that FPW's start "after a 
checkpoint" really means after a CKPT starts.  And this is the really cause of 
the deep dip in performance.  Maybe only I was fooled... :-)

If we can't eliminate FPW's can we at least solve the impact of it?  Instead of 
writing the before images of pages inline into the WAL, which increases the 
COMMIT latency, write these same images to a separate physical log file.  The 
key idea is that I don't believe that COMMIT's require these buffers to be 
immediately flushed to the physical log.  We only need to flush these before 
the dirty pages are written.  This delay allows the physical before image IO's 
to be decoupled and done in an efficient manner without an impact to COMMIT's.

1. When we generate a physical image add it to an in memory buffer of before 
page images.
2. Put the physical log offset of the before image into the WAL record.  This 
is the current physical log file size plus the offset in the in-memory buffer 
of pages.
3. Set a bit in the bufhdr indicating this was done.
4. COMMIT's do not need to worry about those buffers.
5. Periodically flush the in-memory buffer and clear the bit in the BufHdr.
6. During any dirty page flushing if we see the bit set, which should be rare, 
then make sure we get our before image flushed.  This would be similar to our 
LSN based XLogFlush().
Do we need these before images for more than one CKPT?  I don't think so.  Do 
PITR's require before images since it is a continuous rollforward from a 
restore?  Just some of considerations.

Do I need to back this physical log up?  I likely(?) need to deal with 
replication.

Turning off FPW gives about a 20%, maybe more, boost on a pgbench TPC-B RW 
workload which fits in the buffer cache.  Can I get this 20% improvement with a 
separate physical log of before page images?

Doing IO's off on the side, but decoupled from the WAL stream, doesn't seem to 
impact COMMIT latency on modern SSD based storage systems.  For instance, you 
can hammer a shared data and WAL SSD filesystem with dirty page writes from the 
CKPT, at near the MAX IOPS of the SSD, and not impact COMMIT latency.  However, 
this presumes that the CKPT's natural spreading of dirty page writes across the 
CKPT target doesn't push too many outstanding IO's into the storage write Q on 
the OS/device.
NOTE: I don't believe the CKPT's throttling is perfect and I think a burst of 
dirty pages into the cache just before a CKPT might cause the Q to be flooded 
and this would then also further slow TPS during the CKPT.  But a fix to this 
is off topic from the FPW issue.

Thanks to Andres Freund for both making me aware of the Q depth impact on 
COMMIT latency and the hint that FPW might also be causing the CKPT slowdown.  
FYI, I always knew about FPW slowdown in general but I just didn't realize it 
was THE primary cause of CKPT TPS slowdown on pgbench.  NOTE: I realize that 
spinning media might exhibit different behavior.  And I didn't not say dirty 
page writing has NO impact on good SSD's.  It depends, and this is a subject 
for a later date as I have a theory as to why I something see a sawtooth 
performance for pgbench TPC-B and sometimes a square wave but I want to prove 
if first.

Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2020-08-03 Thread Anastasia Lubennikova


On 31.07.2020 23:28, Robert Haas wrote:

On Tue, Jul 14, 2020 at 1:51 PM Anastasia Lubennikova
 wrote:

Questions from the first review pass:

1) Do we need XLH_INSERT_ALL_VISIBLE_SET ? IIUC, in the patch it is always
implied by XLH_INSERT_ALL_FROZEN_SET.

I agree that it looks unnecessary to have two separate bits.


2) What does this comment mean? Does XXX refers to the lsn comparison?
Since it
is definitely necessary to update the VM.

+ * XXX: This seems entirely unnecessary?
+ *
+ * FIXME: Theoretically we should only do this after we've
+ * modified the heap - but it's safe to do it here I think,
+ * because this means that the page previously was empty.
+ */
+if (lsn > PageGetLSN(vmpage))
+visibilitymap_set(reln, blkno, InvalidBuffer, lsn,
vmbuffer,
+  InvalidTransactionId, vmbits);

I wondered about that too. The comment which precedes it was, I
believe, originally written by me, and copied here from
heap_xlog_visible(). But it's not clear very good practice to just
copy the comment like this. If the same logic applies, the code should
say that we're doing the same thing here as in heap_xlog_visible() for
consistency, or some such thing; after all, that's the primary place
where that happens. But it looks like the XXX might have been added by
a second person who thought that we didn't need this logic at all, and
the FIXME by a third person who thought it was in the wrong place, so
the whole thing is really confusing at this point.

I'm pretty worried about this, too:

+ * FIXME: setting recptr here is a dirty dirty hack, to prevent
+ * visibilitymap_set() from WAL logging.

That is indeed a dirty hack, and something needs to be done about it.

I wonder if it was really all that smart to try to make the
HEAP2_MULTI_INSERT do this instead of just issuing separate WAL
records to mark it all-visible afterwards, but I don't see any reason
why this can't be made to work. It needs substantially more polishing
than it's had, though, I think.


New version of the patch is in the attachment.

New design is more conservative and simpler:
- pin the visibility map page in advance;
- set PageAllVisible;
- call visibilitymap_set() with its own XlogRecord, just like in other 
places.


It allows to remove most of the "hacks" and keep code clean.
The patch passes tests added in previous versions.

I haven't tested performance yet, though. Maybe after tests, I'll bring 
some optimizations back.


--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 7a5dfaa525ba89b86663de417638cdcb30ed147b
Author: anastasia 
Date:   Sun Aug 2 15:28:51 2020 +0300

copy-freeze-vm_freeze_v1.patch
Set VM all_visible and all_frozen bits, when COPY FREEZE inserts tuples into empty page

diff --git a/contrib/pg_visibility/expected/pg_visibility.out b/contrib/pg_visibility/expected/pg_visibility.out
index ca4b6e186b..0017e3415c 100644
--- a/contrib/pg_visibility/expected/pg_visibility.out
+++ b/contrib/pg_visibility/expected/pg_visibility.out
@@ -179,6 +179,69 @@ select pg_truncate_visibility_map('test_partition');
  
 (1 row)
 
+-- test copy freeze
+create table copyfreeze (a int, b char(1500));
+-- load all rows via COPY FREEZE and ensure that all pages are set all-visible
+-- and all-frozen.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | t   | t
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- load half the rows via regular COPY and rest via COPY FREEZE. The pages
+-- which are touched by regular COPY must not be set all-visible/all-frozen. On
+-- the other hand, pages allocated by COPY FREEZE should be marked
+-- all-frozen/all-visible.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | f   | f
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
+-- Try a mix of regular COPY and COPY FREEZE.
+begin;
+truncate copyfreeze;
+copy copyfreeze from stdin freeze;
+copy copyfreeze from stdin;
+copy copyfreeze from stdin freeze;
+commit;
+select * from pg_visibility_map('copyfreeze');
+ blkno | all_visible | all_frozen 
+---+-+
+ 0 | t   | t
+ 1 | f   | f
+ 2 | t   | t
+(3 rows)
+
+select * from pg_check_frozen('copyfreeze');
+ t_ctid 
+
+(0 rows)
+
 -- cleanup
 drop table test_partitioned;
 drop view test_view;
@@ -188,3

Re: Replication & recovery_min_apply_delay

2020-08-03 Thread Daniel Gustafsson

This thread has stalled and the patch no longer applies, so I'm marking this
Returned with Feedback.  Please feel free to open a new entry if this patch is
revived.

cheers ./daniel

Re: deferred primary key and logical replication

2020-08-03 Thread Ajin Cherian

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Patch applies cleanly. Tested that update/delete of tables with deferred 
primary keys now work with logical replication. Code/comments look fine.

The new status of this patch is: Ready for Committer

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-08-03 Thread Bharath Rupireddy

On Tue, Jul 14, 2020 at 10:28 PM Bruce Momjian  wrote:
>
> On Tue, Jul 14, 2020 at 03:38:49PM +0530, Bharath Rupireddy wrote:
> > Approach #4:
> > A postgres_fdw foreign server level option: connection idle time, the
> > amount of idle time for that server cached entry, after which the
> > cached entry goes away. Probably the backend, before itself going to
> > idle, has to be checking the cached entries and see if any of the
> > entries has timed out. One problem is that, if the backend just did it
> > before going idle, then what about sessions that haven't reached the
> > timeout at the point when we go idle, but do reach the timeout later?
>
> Imagine implementing idle_in_session_timeout (which is useful on its
> own), and then, when you connect to a foreign postgres_fdw server, you
> set idle_in_session_timeout on the foreign side, and it just
> disconnects/exits after an idle timeout.  There is nothing the sending
> side has to do.
>

Assuming we use idle_in_session_timeout on remote backends,  the
remote sessions will be closed after timeout, but the locally cached
connection cache entries still exist and become stale. The subsequent
queries that may use the cached connections will fail, of course these
subsequent queries can retry the connections only at the beginning of
a remote txn but not in the middle of a remote txn, as being discussed
in [1]. For instance, in a long running local txn, let say we used a
remote connection at the beginning of the local txn(note that it will
open a remote session and it's entry is cached in local connection
cache), only we use the cached connection later at some point in the
local txn, by then let say the idle_in_session_timeout has happened on
the remote backend and the remote session would have been closed. The
long running local txn will fail instead of succeeding. Isn't it a
problem here? Please correct me, If I miss anything.

IMHO, we are not fully solving the problem with
idle_in_session_timeout on remote backends though we are addressing
the main problem partly by letting the remote sessions close by
themselves.

[1] -  
https://www.postgresql.org/message-id/flat/CALj2ACUAi23vf1WiHNar_LksM9EDOWXcbHCo-fD4Mbr1d%3D78YQ%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

Re: [PATCH] Add section headings to index types doc

2020-08-03 Thread Dagfinn Ilmari Mannsåker

ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:

> Hi hackers,
>
> Every time I have to look up what kinds of operations each index type is
> suitable for, I get annoyed by the index types page being virtually
> unskimmable due to not having headings for each index type.
>
> Attached is a patch that adds  tags for each index type to make
> it easier to see where the description of each one starts.

Added to the next commitfest:

https://commitfest.postgresql.org/29/2665/

Also, for easier review, here's the `git diff -w` output, since the
 tags caused most of the page to have to be renidented.

Tangentially, does anyone know of a tool to strip whitespace changes
from an existing diff, as if it had been generated with `-w` in the
first place?

diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 28adaba72d..332d161547 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -122,6 +122,9 @@
B-tree indexes, which fit the most common situations.
   
 
+  
+   B-tree
+

 
  index
@@ -172,6 +175,10 @@
 This is not always faster than a simple scan and sort, but it is
 often helpful.

+  
+
+  
+   Hash
 

 
@@ -191,6 +198,10 @@
  CREATE INDEX name ON 
table USING HASH (column);
 

+  
+
+  
+   GiST
 

 
@@ -246,6 +257,10 @@
 In , operators that can be
 used in this way are listed in the column Ordering 
Operators.

+  
+
+  
+   SP-GiST
 

 
@@ -286,6 +301,10 @@
 corresponding operator is specified in the Ordering 
Operators
 column in .

+  
+
+  
+   GIN
 

 
@@ -327,6 +346,10 @@
 classes are available in the contrib collection or as 
separate
 projects.  For more information see .

+  
+
+  
+   BRIN
 

 
@@ -360,6 +383,7 @@
 documented in .
 For more information see .

+  
  
 

- ilmari 
-- 
"A disappointingly low fraction of the human race is,
 at any given time, on fire." - Stig Sandbeck Mathisen

Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-08-03 Thread Amit Langote

Hi Andrey,

On Wed, Jul 29, 2020 at 5:36 PM Andrey V. Lepikhov
 wrote:
> > Will send more comments after reading the v5 patch.
> >
> Ok. I'll be waiting for the end of your review.

Sorry about the late reply.

If you'd like to send a new version for other reviewers, please feel
free.  I haven't managed to take more than a brief look at the v5
patch, but will try to look at it (or maybe the new version if you
post) more closely this week.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com

Re: Replace remaining StrNCpy() by strlcpy()

David Rowley  writes:
> - StrNCpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));
> + strlcpy(msg.m_xlog, xlog, sizeof(msg.m_xlog));

> Will mean that we'll now no longer zero the full length of the m_xlog
> field after the end of the string. Won't that mean we'll start writing
> junk bytes to the stats collector?

StrNCpy doesn't zero-fill the destination today either (except for
the very last byte).  If you need that, you need to memset the
dest buffer ahead of time.

I didn't review the patch in complete detail, but the principle
seems sound to me, and strlcpy is surely more standard than StrNCpy.

regards, tom lane

Re: [PATCH] Add section headings to index types doc

2020-08-03 Thread Magnus Hagander

On Mon, Aug 3, 2020 at 1:32 PM Dagfinn Ilmari Mannsåker 
wrote:

> ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes:
>
> > Hi hackers,
> >
> > Every time I have to look up what kinds of operations each index type is
> > suitable for, I get annoyed by the index types page being virtually
> > unskimmable due to not having headings for each index type.
> >
> > Attached is a patch that adds  tags for each index type to make
> > it easier to see where the description of each one starts.
>
> Added to the next commitfest:
>
> https://commitfest.postgresql.org/29/2665/
>
> Also, for easier review, here's the `git diff -w` output, since the
>  tags caused most of the page to have to be renidented.
>
> Tangentially, does anyone know of a tool to strip whitespace changes
> from an existing diff, as if it had been generated with `-w` in the
> first place?
>

I think you can do something like:

combinediff -w 0001-Add-section-headers-to-index-types-doc.patch  /dev/null

(combinediff requires two diffs, but one can be /dev/null)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/

Re: Replace remaining StrNCpy() by strlcpy()

I wrote:
> David Rowley  writes:
>> Will mean that we'll now no longer zero the full length of the m_xlog
>> field after the end of the string. Won't that mean we'll start writing
>> junk bytes to the stats collector?

> StrNCpy doesn't zero-fill the destination today either (except for
> the very last byte).

Oh, no, I take that back --- didn't read all of the strncpy man
page :-(.  Yeah, this is a point.  We'd need to check each call
site to see whether the zero-padding matters.

In the specific case of the stats collector, if you don't want
to be sending junk bytes then you'd better be memset'ing the
whole message buffer not just this string field.  So I'm not
sure that the argument has any force there.  But in places
like namecpy() and namestrcpy() we absolutely do mean to be
zeroing the whole destination buffer.

memset plus strlcpy might still be preferable to StrNCpy for
readability by people new to Postgres; but it's less of a
slam dunk than I thought.

regards, tom lane

Re: [PATCH] Add section headings to index types doc

ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes:
> Also, for easier review, here's the `git diff -w` output, since the
>  tags caused most of the page to have to be renidented.

TBH, I'd suggest just not being anal about whether the indentation
nesting is perfect ;-).  There are certainly plenty of places in
the SGML files today where it is not.  And for something like this,
I doubt the gain is worth the loss of "git blame" tracking and
possible back-patching hazards.

I'm a compulsive neatnik when it comes to indentation of the
C code, but much less so about the SGML docs.  YMMV of course.

regards, tom lane

Re: Confusing behavior of create table like

2020-08-03 Thread Konstantin Knizhnik





On 03.08.2020 11:00, Peter Eisentraut wrote:

On 2020-08-01 00:06, Konstantin Knizhnik wrote:

Postgres provides serial and bigserial column types for which it
implicitly creates sequence.
As far as this mechanism is somehow hidden from user, it may be
confusing that table
created with CREATE TABLE LIKE has no associated sequence.


That's why identity columns were added.  You shouldn't use serial 
columns anymore, especially if you are concerned about behaviors like 
this.



I can completely agree with this position.
There are several things in Postgres which are conceptually similar, 
share a lot of code but... following different rules.
Usually it happens when some new notion is introduced, fully or partly 
substitute old notion.

Inheritance and declarative partitioning is one of such examples.
Although them are used to solve the same goal, there are many cases when 
some optimization works for partitioned table but not for inheritance.


May be generated and identity columns are good things. I have nothing 
against them.
But what preventing us from providing the similar behavior for 
serial/bigseries types?

Re: recovering from "found xmin ... from before relfrozenxid ..."

On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma  wrote:
> Could you please explain this point once more in detail? I am not quite able 
> to understand under what circumstances a buffer would be modified, but won't 
> be marked as dirty or a WAL won't be written for it.

Whenever this branch is taken:

+   if (nskippedItems == noffs)
+   goto skip_wal;

At this point you have already modified the page, using ItemIdSetDead,
HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
skipped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: public schema default ACL

On Mon, Aug 3, 2020 at 2:30 AM Noah Misch  wrote:
> Between (b)(2)(X) and (b)(3)(X), what are folks' preferences?  Does anyone
> strongly favor some other option (including the option of changing nothing)
> over both of those two?

I don't think we have any options here that are secure but do not
break backward compatibility. The present situation, with a writable
public schema, is equivalent to a UNIX system in which /usr/bin is
drwxrwxrwt. Nobody would seriously propose that such a system design
is secure, not so much because it's intrinsically broken if everyone
is careful not to execute any executables they don't know to have been
deposited by people they trust, but because it's quite easy to
accidentally execute one that isn't. However, if people are used to
being able to deposit stuff in /usr/bin and you tell them that they
now can't (because the permissions will henceforth be drwxr-xr-x or
the directly won't exist at all) then some of them are going to
complain. I don't know what to do about that: it's a straightforward
trade-off between security and backward compatibility, and you can't
have both.

I support the idea of having an automatic schema creation option. I
think that would be quite a cool thing to have, whether it's the
default (Y) or not (Z). But I don't know how to choose between (1),
(2), and (3).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] - Provide robust alternatives for replace_string

2020-08-03 Thread Asim Praveen

Thank you Alvaro for reviewing the patch!

> On 01-Aug-2020, at 7:22 AM, Alvaro Herrera  wrote:
> 
> What happens if a replacement string happens to be split in the middle
> by the fgets buffering?  I think it'll fail to be replaced.  This
> applies to both versions.

Can a string to be replaced be split across multiple lines in the source file?  
If I understand correctly, fgets reads one line from input file at a time.  If 
I do not, in the worst case, we will get an un-replaced string in the output, 
such as “@abs_dir@“ and it should be easily detected by a failing diff.

> In the stringinfo version it seemed to me that using pnstrdup is
> possible to avoid copying trailing bytes.
> 

That’s a good suggestion.  Using pnstrdup would look like this:

--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -465,7 +465,7 @@ replace_stringInfo(StringInfo string, const char *replace, 
const char *replaceme

while ((ptr = strstr(string->data, replace)) != NULL)
{
-   char   *dup = pg_strdup(string->data);
+  char   *dup = pnstrdup(string->data, string->maxlen);
size_t  pos = ptr - string->data;

string->len = pos;

> If you're asking for opinion, mine is that StringInfo looks to be the
> better approach, and also you don't need to keep API compatibility.
> 

Thank you.  We also prefer StringInfo solution.

Asim

Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

On Mon, Jul 20, 2020 at 4:30 PM Andres Freund  wrote:
> I'm extremely doubtful this is a good idea. In all likelihood this will
> just exascerbate corruption.
>
> You cannot just stop freezing tuples, that'll lead to relfrozenxid
> getting *further* out of sync with the actual table contents. And you
> cannot just freeze such tuples, because that has a good chance of making
> deleted tuples suddenly visible, leading to unique constraint violations
> etc. Which will then subsequently lead to clog lookup errors and such.

I think that the behavior ought to be:

- If we encounter any damaged tuples (e.g. tuple xid < relfrozenxid),
we give up on advancing relfrozenxid and relminmxid. This vacuum won't
change them at all.

- We do nothing to the damaged tuples themselves.

- We can still prune pages, and we can still freeze tuples that do not
appear to be damaged.

This amounts to an assumption that relfrozenxid is probably sane, and
that there are individual tuples that are messed up. It's probably not
the right thing if relfrozenxid got overwritten with a nonsense value
without changing the table contents. But, I think it's difficult to
cater to all contingencies. In my experience, the normal problem here
is that there are a few tuples or pages in the table that somehow
escaped vacuuming for long enough that they contain references to XIDs
from before the last time relfrozenxid was advanced - so continuing to
do what we can to the rest of the table is the right thing to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: new heapcheck contrib module

2020-08-03 Thread Mark Dilger

> On Aug 2, 2020, at 8:59 PM, Peter Geoghegan  wrote:
> 
> What's the point in not just giving up on the index (though not
> necessarily the table or other indexes) at the first sign of trouble,
> anyway? It makes sense for the heap structure, but not for indexes.

The case that came to mind was an index broken by a glibc update with breaking 
changes to the collation sort order underlying the index.  If the breaking 
change has already been live in production for quite some time before a DBA 
notices, they might want to quantify how broken the index has been for the last 
however many days, not just drop and recreate the index.  I'm happy to drop 
that from the patch, though.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: new heapcheck contrib module

2020-08-03 Thread Mark Dilger

> On Aug 2, 2020, at 9:13 PM, Peter Geoghegan  wrote:
> 
> On Thu, Jul 30, 2020 at 10:59 AM Robert Haas  wrote:
>> I don't really like the name, either. I get that it's probably
>> inspired by Perl, but I think it should be given a less-clever name
>> like report_corruption() or something.
> 
> +1 -- confess() is an awful name for this.

I was trying to limit unnecessary whitespace changes.  s/ereport/econfess/ 
leaves the function name nearly the same length such that the following lines 
of indented error text don't usually get moved by pgindent.  Given the 
unpopularity of the name, it's not worth it, so I'll go with Robert's 
report_corruption, instead.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [PATCH] - Provide robust alternatives for replace_string

2020-08-03 Thread Alvaro Herrera

On 2020-Aug-03, Asim Praveen wrote:

> Thank you Alvaro for reviewing the patch!
> 
> > On 01-Aug-2020, at 7:22 AM, Alvaro Herrera  wrote:
> > 
> > What happens if a replacement string happens to be split in the middle
> > by the fgets buffering?  I think it'll fail to be replaced.  This
> > applies to both versions.
> 
> Can a string to be replaced be split across multiple lines in the source 
> file?  If I understand correctly, fgets reads one line from input file at a 
> time.  If I do not, in the worst case, we will get an un-replaced string in 
> the output, such as “@abs_dir@“ and it should be easily detected by a failing 
> diff.

I meant what if the line is longer than 1023 chars and the replace
marker starts at byte 1021, for example.  Then the first fgets would get
"@ab" and the second fgets would get "s_dir@" and none would see it as
replaceable.

> > In the stringinfo version it seemed to me that using pnstrdup is
> > possible to avoid copying trailing bytes.
> 
> That’s a good suggestion.  Using pnstrdup would look like this:
> 
> --- a/src/test/regress/pg_regress.c
> +++ b/src/test/regress/pg_regress.c
> @@ -465,7 +465,7 @@ replace_stringInfo(StringInfo string, const char 
> *replace, const char *replaceme
> 
> while ((ptr = strstr(string->data, replace)) != NULL)
> {
> -   char   *dup = pg_strdup(string->data);
> +  char   *dup = pnstrdup(string->data, string->maxlen);

I was thinking pnstrdup(string->data, ptr - string->data) to avoid
copying the chars beyond ptr.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: new heapcheck contrib module

On Mon, Aug 3, 2020 at 12:00 AM Peter Geoghegan  wrote:
> Moving on with verification of the same index in the event of B-Tree
> index corruption is a categorical mistake. verify_nbtree.c was simply
> not designed to work that way.
>
> You were determined to avoid allowing any behavior that can result in
> a backend crash in the event of corruption, but this design will
> defeat various measures I took to avoid crashing with corrupt data
> (e.g. in commit a9ce839a313).
>
> What's the point in not just giving up on the index (though not
> necessarily the table or other indexes) at the first sign of trouble,
> anyway? It makes sense for the heap structure, but not for indexes.

I agree that there's a serious design problem with Mark's patch in
this regard, but I disagree that the effort is pointless on its own
terms. You're basically postulating that users don't care how corrupt
their index is: whether there's one problem or one million problems,
it's all the same. If the user presents an index with one million
problems and we tell them about one of them, we've done our job and
can go home.

This doesn't match my experience. When an EDB customer reports
corruption, typically one of the first things I want to understand is
how widespread the problem is. This same issue came up on the thread
about relfrozenxid/relminmxid corruption. If you've got a table with
one or two rows where tuple.xmin < relfrozenxid, that's a different
kind of problem than if 50% of the tuples in the table have tuple.xmin
< relfrozenxid; the latter might well indicate that relfrozenxid value
itself is garbage, while the former indicates that a few tuples
slipped through the cracks somehow. If you're contemplating a recovery
strategy like "nuke the affected tuples from orbit," you really need
to understand which of those cases you've got.

Granted, this is a bit less important with indexes, because in most
cases you're just going to REINDEX. But, even there, the question is
not entirely academic. For instance, consider the case of a user whose
database crashes and then fails to restart because WAL replay fails.
Typically, there is little option here but to run pg_resetwal. At this
point, you know that there is some damage, but you don't know how bad
it is. If there was little system activity at the time of the crash,
there may be only a handful of problems with the database. If there
was a heavy OLTP workload running at the time of the crash, with a
long checkpoint interval, the problems may be widespread. If the user
has done this repeatedly before bothering to contact support, which is
more common than you might suppose, the damage may be extremely
widespread.

Now, you could argue (and not unreasonably) that in any case after
something like this happens even once, the user ought to dump and
restore to get back to a known good state. However, when the cluster
is 10TB in size and there's a $100,000 financial loss for every hour
of downtime, the question naturally arises of how urgent that dump and
restore is. Can we wait until our next maintenance window? Can we at
least wait until off hours? Being able to tell the user whether
they've got a tiny bit of corruption or a whole truckload of
corruption can enable them to make better decisions in such cases, or
at least more educated ones.

Now, again, just replacing ereport(ERROR, ...) with something else
that does not abort the rest of the checks is clearly not OK. I don't
endorse that approach, or anything like it. But neither do I accept
the argument that it would be useless to report all the errors even if
we could do so safely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: new heapcheck contrib module

On Mon, Aug 3, 2020 at 11:02 AM Mark Dilger
 wrote:
> I was trying to limit unnecessary whitespace changes.  s/ereport/econfess/ 
> leaves the function name nearly the same length such that the following lines 
> of indented error text don't usually get moved by pgindent.  Given the 
> unpopularity of the name, it's not worth it, so I'll go with Robert's 
> report_corruption, instead.

Yeah, that's not really a good reason for something like that. I think
what you should do is drop the nbtree portion of this for now; the
length of the name then doesn't even matter at all, because all the
code in which this is used will be new code. Even if we were churning
existing code, mechanical stuff like this isn't really a huge problem
most of the time, but there's no need for that here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: public schema default ACL

On Sun, Aug  2, 2020 at 11:30:50PM -0700, Noah Misch wrote:
> On Fri, Mar 23, 2018 at 07:47:39PM -0700, Noah Misch wrote:
> > In light of the mixed reception, I am withdrawing this proposal.
> 
> I'd like to reopen this.  Reception was mixed, but more in favor than against.
> Also, variations on the idea trade some problems for others and may be more
> attractive.  The taxonomy of variations has three important dimensions:
> 
> Interaction with dump/restore (including pg_upgrade) options:
> a. If the schema has a non-default ACL, dump/restore reproduces it.
>Otherwise, the new default prevails.
> b. Dump/restore always reproduces the schema ACL.

I am worried that someone _slightly_ modifies the ACL permissions on the
schema, and we reproduce it, and they think they are secure, but they
are not.  I guess for the public, and change would be to make it more
secure, so maybe this works, but it seems tricky.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: public schema default ACL

2020-08-03 Thread Stephen Frost

Greetings,

* Noah Misch (n...@leadboat.com) wrote:
> I'd like to reopen this.  Reception was mixed, but more in favor than against.
> Also, variations on the idea trade some problems for others and may be more
> attractive.  The taxonomy of variations has three important dimensions:
> 
> Interaction with dump/restore (including pg_upgrade) options:
> a. If the schema has a non-default ACL, dump/restore reproduces it.
>Otherwise, the new default prevails.
> b. Dump/restore always reproduces the schema ACL.
> 
> Initial ownership of schema "public" options:
> 1. Bootstrap superuser owns it.  (Without superuser cooperation, database
>owners can't drop it or create objects in it.)
> 2. Don't create the schema during initdb.  Database owners can create it or
>any other schema.  (A superuser could create it in template1, which
>converts an installation to option (1).)
> 3. Database owner owns it.  (One might implement this by offering ALTER SCHEMA
>x OWNER TO DATABASE_OWNER, which sets nspowner to a system OID meaning
>"refer to pg_database.datdba".  A superuser could issue DDL to convert to
>option (1) or (2).)
> 
> Automatic creation of $user schemas options:
> X. Automatic schema creation doesn't exist.
> Y. Create $user schemas on-demand (at login time or CREATE TABLE/CREATE
>FUNCTION/etc. time) if the DBA specified a "SCHEMA_CREATE" option in the
>CREATE ROLE statement.
> Z. Like (Y), but SCHEMA_CREATE is the default.
> 
> I started the thread by proposing (a)(1)(X) and mentioning (b)(1)(X) as an
> alternative.  Given the compatibility concerns, I now propose ruling out (a)
> in favor of (b).

I agree that we don't want to effectively change these privileges on a
dump/restore or pg_upgrade.

> I dislike (Z), because it requires updating security guidelines to specify
> NOSCHEMA_CREATE; I think it would be better to leave $SUBJECT unchanged than
> to adopt (Z).  I like (Y) from an SQL standard perspective, but I don't think
> it resolves the ease-of-first-use objections raised against (a)(1)(X).  (If
> changing the public schema ACL is too much of an obstacle for a DBA, adopting
> SCHEMA_CREATE is no easier.)  Hence, I propose ruling out (Y) and (Z).

I'm also in favor of having some flavor of automatic schema creation,
but I view that as something independent from this discussion and which
this change shouldn't depend on.

> That leaves the choice between (2) and (3).  Under (b)(2)(X), first-use guides
> would need to add some CREATE SCHEMA.  While (3) avoids that, some users may
> find themselves setting ownership back to the bootstrap superuser.  (3) also
> makes the system more complex overall.
> 
> Between (b)(2)(X) and (b)(3)(X), what are folks' preferences?  Does anyone
> strongly favor some other option (including the option of changing nothing)
> over both of those two?

Having the database owner own the public schema makes the most sense to
me- that this doesn't happen today has always seemed a bit odd to me as,
notionally, you'd imagine the "owner" of a database as, well, owning the
objects in that database (clearly they shouldn't actually own system
catalogs or functions or such, but the public schema isn't some internal
thing like the system catalogs and such).  Having the database owner not
have to jump through hoops to create objects immediately upon connection
to a new database also seems like it reduces the compatibility impact
that this will have.

In general, I'm still in favor of the overall change and moving to
better and more secure defaults.

Thanks,

Stephen


signature.asc
Description: PGP signature

Re: Reduce/eliminate the impact of FPW

On Mon, Aug 3, 2020 at 5:26 AM Daniel Wood  wrote:
> If we can't eliminate FPW's can we at least solve the impact of it?  Instead 
> of writing the before images of pages inline into the WAL, which increases 
> the COMMIT latency, write these same images to a separate physical log file.  
> The key idea is that I don't believe that COMMIT's require these buffers to 
> be immediately flushed to the physical log.  We only need to flush these 
> before the dirty pages are written.  This delay allows the physical before 
> image IO's to be decoupled and done in an efficient manner without an impact 
> to COMMIT's.

I think this is what's called a double-write buffer, or what was tried
some years ago under that name.  A significant problem is that you
have to fsync() the double-write buffer before you can write the WAL.
So instead of this:

- write WAL to OS
- fsync WAL

You have to do this:

- write double-write buffer to OS
- fsync double-write buffer
- write WAL to OS
- fsync WAL

Note that you cannot overlap these steps -- the first fsync must be
completed before the second write can begin, else you might try to
replay WAL for which the double-write buffer information is not
available.

Because of this, I think this is actually quite expensive. COMMIT
requires the WAL to be flushed, unless you configure
synchronous_commit=off. So this would double the number of fsyncs we
have to do. It's not as bad as all that, because the individual fsyncs
would be smaller, and that makes a significant difference. For a big
transaction that writes a lot of WAL, you'd probably not notice much
difference; instead of writing 1000 pages to WAL, you might write 770
pages to the double-write buffer and 270 to the double-write buffer,
or something like that. But for short transactions, such as those
performed by pgbench, you'd probably end up with a lot of cases where
you had to write 3 pages instead of 2, and not only that, but the
writes have to be consecutive rather than simultaneous, and to
different parts of the disk rather than sequential. That would likely
suck a lot.

It's entirely possible that these kinds of problems could be mitigated
through really good engineering, maybe to the point where this kind of
solution outperforms what we have now for some or even all workloads,
but it seems equally possible that it's just always a loser. I don't
really know. It seems like a very difficult project.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Confusing behavior of create table like

On Mon, Aug 3, 2020 at 8:59 AM Konstantin Knizhnik
 wrote:
> May be generated and identity columns are good things. I have nothing
> against them.
> But what preventing us from providing the similar behavior for
> serial/bigseries types?

Backward compatibility seems like one good argument.

It kind of sucks that we end up with cases where new notions are
introduced to patch up the inadequacies of earlier ideas, but it's
also inevitable. If, after 25+ years of development, we didn't have
cases where somebody had come up with a new plan that was better than
the older plan, that would be pretty scary. We have to remember,
though, that there's a giant user community around PostgreSQL at this
point, and changing things like this can inconvenience large numbers
of those users. Sometimes that's worth it, but I find it pretty
dubious in a case like this. There's every possibility that there are
people out there who rely on the current behavior, and whose stuff
would break if it were changed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby  wrote:
> Based on commit logs, I suspect this may be an "older bug", specifically maybe
> with:
>
> |commit 898e5e3290a72d288923260143930fb32036c00c
> |Author: Robert Haas 
> |Date:   Thu Mar 7 11:13:12 2019 -0500
> |
> |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
>
> I don't think it matters, but the process surrounding the table being INSERTed
> INTO is more than a little special, involving renames, detaches, creation,
> re-attaching within a transaction.  I think that doesn't matter though, and 
> the
> issue is surrounding the table being SELECTed *from*, which is actually behind
> a view.

That's an entirely reasonable guess, but it doesn't seem easy to
understand exactly what happened here based on the provided
information. The assertion failure probably indicates that
pinfo->relid_map[] and partdesc->oids[] differ in some way other than
additional elements having been inserted into the latter. For example,
some elements might have disappeared, or the order might have changed.
This isn't supposed to happen, because DETACH PARTITION requires
heavier locking, and the order changing without anything getting
detached should be impossible. But evidently it did. If we could dump
out the two arrays in question it might shed more light on exactly how
things went wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: WAL prefetch (another approach)

2020-08-03 Thread Tomas Vondra

On Thu, Jul 02, 2020 at 03:09:29PM +1200, Thomas Munro wrote:

On Sat, Jun 6, 2020 at 12:36 PM Stephen Frost  wrote:

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> I wonder if we can collect some stats to measure how effective the
> prefetching actually is. Ultimately we want something like cache hit
> ratio, but we're only preloading into page cache, so we can't easily
> measure that. Perhaps we could measure I/O timings in redo, though?

That would certainly be interesting, particularly as this optimization
seems likely to be useful on some platforms (eg, zfs, where the
filesystem block size is larger than ours..) and less on others
(traditional systems which have a smaller block size).

I know one way to get information about cache hit ratios without the
page cache fuzz factor: if you combine this patch with Andres's
still-in-development AIO prototype and tell it to use direct IO, you
get the undiluted truth about hits and misses by looking at the
"prefetch" and "skip_hit" columns of the stats view.  I'm hoping to
have a bit more to say about how this patch works as a client of that
new magic soon, but I also don't want to make this dependent on that
(it's mostly orthogonal, apart from the "how deep is the queue" part
which will improve with better information).

FYI I am still trying to reproduce and understand the problem Tomas
reported; more soon.

Any luck trying to reproduce thigs? Should I try again and collect some
additional debug info?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Can a background worker exist without shared memory access for EXEC_BACKEND cases?

On Fri, Jul 31, 2020 at 11:13 PM Bharath Rupireddy
 wrote:
> memory. MyLatch variable also gets created in shared mode. And having
> no shared memory access for the worker for EXEC_BACKEND cases(in
> StartBackgroundWorker, the shared memory segments get detached), the
> worker fails to receive all the global state from the postmaster.

What exactly do you mean by "all the global state"?

It's certainly true that if you declare some random static variable
and initialize it in the postmaster, and you don't take any special
precautions to propagate that into workers, then on an EXEC_BACKEND
build, it won't be set in the workers. That's why, for example, most
of the *ShmemInit() functions are written like this:

TwoPhaseState = ShmemInitStruct("Prepared Transaction Table",

 TwoPhaseShmemSize(),
&found);
if (!IsUnderPostmaster)
...initialize the data structure...
else
Assert(found);

The assignment to TwoPhaseState is unconditional, because in an
EXEC_BACKEND build that's going to be done in every process, and
otherwise the variable won't be set. But the initialization of the
shared data structure happens conditionally, because that needs to be
done only once.

See also the BackendParameters stuff, which arranges to pass down a
bunch of things to exec'd backends.

I am not necessarily opposed to trying to clarify the documentation
and/or comments here, but "global state" is a fuzzy term that doesn't
really mean anything to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Cache relation sizes?

2020-08-03 Thread Konstantin Knizhnik





On 01.08.2020 00:56, Thomas Munro wrote:

On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro  wrote:

There's still the matter of crazy numbers of lseeks in regular
backends; looking at all processes while running the above test, I get
1,469,060 (9.18 per pgbench transaction) without -M prepared, and
193,722 with -M prepared (1.21 per pgbench transaction).  Fixing that
with this approach will require bullet-proof shared invalidation, but
I think it's doable, in later work.

I couldn't help hacking on this a bit.  Perhaps instead of
bullet-proof general shared invalidation, we should have a way for
localised bits of code to state that they are ok with a "relaxed"
value.  Then they should explain the theory for why that is safe in
each case based on arguments about memory barrier pairs, but leave all
other callers making the system call so that we don't torpedo the
whole project by making it too hard.  For now, the main cases I'm
interested in are the ones that show up all the time as the dominant
system call in various workloads:

(1) Sequential scan relation-size probe.  This should be OK with a
relaxed value.  You can't miss the invalidation for a truncation,
because the read barrier in your lock acquisition on the relation
pairs with the write barrier in the exclusive lock release of any
session that truncated, and you can't miss relation any relation
extension that your snapshot can see, because the release of the
extension lock pairs with the lock involved in snapshot acquisition.

(2) Planner relation-size probe, which should be OK with a relaxed
value.  Similar arguments give you a fresh enough view, I think.

Or maybe there is a theory along these lines that already covers every
use of smgrnblocks(), so a separate mode isn't require... I don't
know!

The attached sketch-quality patch shows some of the required elements
working, though I ran out of steam trying to figure out how to thread
this thing through the right API layers so for now it always asks for
a relaxed value in table_block_relation_size().

So in this thread three solutions are proposed:
1. "bullet-proof general shared invalidation"
2. recovery-only solution avoiding shared memory and invalidation
3. "relaxed" shared memory cache with simplified invalidation

If solving such very important by still specific problem of caching 
relation size requires so much efforts,
then may be it is time to go further in the direction towards shared 
catalog?

This shared relation cache can easily store relation size as well.
In addition it will solve a lot of other problems:
- noticeable overhead of local relcache warming
- large memory consumption in case of larger number of relations 
O(max_connections*n_relations)

- sophisticated invalidation protocol and related performance issues

Certainly access to shared cache requires extra synchronization.But DDL 
operations are relatively rare.
So in most cases we will have only shared locks. May be overhead of 
locking will not be too large?

Re: FailedAssertion("pd_idx == pinfo->nparts", File: "execPartition.c", Line: 1689)

2020-08-03 Thread Justin Pryzby

On Mon, Aug 03, 2020 at 11:41:37AM -0400, Robert Haas wrote:
> On Sun, Aug 2, 2020 at 2:11 PM Justin Pryzby  wrote:
> > Based on commit logs, I suspect this may be an "older bug", specifically 
> > maybe
> > with:
> >
> > |commit 898e5e3290a72d288923260143930fb32036c00c
> > |Author: Robert Haas 
> > |Date:   Thu Mar 7 11:13:12 2019 -0500
> > |
> > |Allow ATTACH PARTITION with only ShareUpdateExclusiveLock.
> >
> > I don't think it matters, but the process surrounding the table being 
> > INSERTed
> > INTO is more than a little special, involving renames, detaches, creation,
> > re-attaching within a transaction.  I think that doesn't matter though, and 
> > the
> > issue is surrounding the table being SELECTed *from*, which is actually 
> > behind
> > a view.
> 
> That's an entirely reasonable guess, but it doesn't seem easy to
> understand exactly what happened here based on the provided
> information. The assertion failure probably indicates that
> pinfo->relid_map[] and partdesc->oids[] differ in some way other than
> additional elements having been inserted into the latter. For example,
> some elements might have disappeared, or the order might have changed.
> This isn't supposed to happen, because DETACH PARTITION requires
> heavier locking, and the order changing without anything getting
> detached should be impossible. But evidently it did. If we could dump
> out the two arrays in question it might shed more light on exactly how
> things went wrong.

(gdb) p *pinfo->relid_map@414
$8 = {22652203, 22652104, 22651920, 22651654, 22647359, 22645269, 22644012, 
22639600, 22638852, 22621975, 22615355, 22615256, 22615069, 22610573, 22606503, 
22606404, 22600237, 22600131, 22596610, 22595013, 
  22594914, 22594725, 22594464, 22589317, 22589216, 22587504, 22582570, 
22580264, 22577047, 22576948, 22576763, 22576656, 22574077, 22570911, 22570812, 
22564524, 22564113, 22558519, 22557080, 22556981, 22556793, 
  22555205, 22550680, 22550579, 22548884, 22543899, 22540822, 22536665, 
22536566, 22536377, 22535133, 22528876, 22527780, 22526065, 22521131, 22517878, 
22513674, 22513575, 22513405, 22513288, 22507520, 22504728, 
  22504629, 22493699, 22466016, 22458641, 22457551, 22457421, 22457264, 
22452879, 22449864, 22449765, 22443560, 22442952, 22436193, 22434644, 22434469, 
22434352, 22430792, 22426903, 22426804, 22420732, 22420025, 
  22413050, 22411963, 22411864, 22411675, 22407652, 22404156, 22404049, 
22397550, 22394622, 22390035, 22389936, 22389752, 22388386, 22383211, 22382115, 
22381934, 22375210, 22370297, 22367878, 22367779, 22367586, 
  22362556, 22359928, 22358236, 22353374, 22348704, 22345692, 22345593, 
22345399, 22341347, 22336809, 22336709, 22325812, 22292836, 22287756, 22287657, 
22287466, 22283194, 22278659, 22278560, 22272041, 22269121, 
  22264424, 22264325, 22264135, 22260102, 22255418, 22254818, 22248841, 
22245824, 22241490, 22241391, 22241210, 22240354, 22236224, 22235123, 22234060, 
8744, 8345, 8033, 2528, 2429, 2330, 
  2144, 2045, 22218408, 22215986, 22215887, 22209311, 22209212, 
22207919, 22205203, 22203385, 22203298, 22203211, 22203124, 22202954, 22202859, 
22202772, 22201869, 22200438, 22197706, 22195027, 22194932, 
  22194834, 22191208, 22188412, 22187029, 22182238, 22182134, 22182030, 
22181849, 22181737, 22181107, 22175811, 22175710, 22169859, 22169604, 22159266, 
22158131, 22158021, 22157824, 22153348, 22153236, 22147308, 
  22146736, 22143778, 22143599, 22143471, 22138702, 22138590, 22132612, 
22132513, 22132271, 22132172, 22131987, 21935599, 21932664, 21927997, 21925823, 
21885889, 21862973, 21859854, 21859671, 21858869, 21853440, 
  21851884, 21845405, 21842901, 21837523, 21837413, 21837209, 21832347, 
21829359, 21827652, 21822602, 21816150, 21805995, 21805812, 21805235, 21798914, 
21798026, 21791895, 21791124, 21783854, 21783744, 21783540, 
  21780568, 21774797, 21774687, 21768326, 21764063, 21759627, 21759517, 
21759311, 21755697, 21751690, 21751156, 21744906, 21738543, 21736176, 21735992, 
21735769, 21727603, 21725956, 21716432, 21678682, 21670968, 
  21670858, 21670665, 21669342, 21661932, 21661822, 21655311, 21650838, 
21646721, 21646611, 21646409, 21640984, 21637816, 21637706, 21631061, 21622723, 
21621459, 21621320, 21621148, 21612902, 21612790, 21606170, 
  21602265, 21597910, 21597800, 21597605, 21592489, 21589415, 21589305, 
21582910, 21578017, 21576758, 21576648, 21572692, 21566633, 21566521, 21560127, 
21560017, 21553910, 21553800, 21553613, 21553495, 21549102, 
  21548992, 21542759, 21540922, 21532093, 21531983, 21531786, 21531676, 
21531264, 21531154, 21525290, 21524817, 21519470, 21519360, 21519165, 21516571, 
21514269, 21514159, 21508389, 21508138, 21508028, 21507830, 
  21503457, 21502484, 21496897, 21494287, 21493722, 21493527, 21491807, 
21488530, 21486122, 21485766, 21485603, 21485383, 21481969, 21481672, 21476245, 
21472576, 21468851, 21468741, 21468546, 21467832, 21460086, 
  21425406, 21420632, 21420506, 21419974, 2141783

Re: recovering from "found xmin ... from before relfrozenxid ..."

2020-08-03 Thread Ashutosh Sharma

On Mon, Aug 3, 2020 at 7:06 PM Robert Haas  wrote:
>
> On Mon, Aug 3, 2020 at 5:05 AM Ashutosh Sharma  wrote:
> > Could you please explain this point once more in detail? I am not quite 
> > able to understand under what circumstances a buffer would be modified, but 
> > won't be marked as dirty or a WAL won't be written for it.
>
> Whenever this branch is taken:
>
> +   if (nskippedItems == noffs)
> +   goto skip_wal;
>

If the above path is taken that means none of the items in the page
got changed. As per the following if-check whenever an item in the
offnos[] array is found dead or unused, it is skipped (due to continue
statement) which means the item is neither marked dead nor it is
marked frozen. Now, if this happens for all the items in a page, then
the above condition (nskippedItems == noffs) would be true and hence
the buffer would remain unchanged, so, we don't mark such a buffer as
dirty and neither do any WAL logging for it. This is my understanding,
please let me know if I am missing something here. Thank you.

if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
{
nskippedItems++;
ereport(NOTICE,
 (errmsg("skipping tid (%u, %u) because it is
already marked %s",
   blkno, offnos[i],
   ItemIdIsDead(itemid) ? "dead" : "unused")));
continue;
}

> At this point you have already modified the page, using ItemIdSetDead,
> HeapTupleHeaderSet*, and/or directly adjusting htup->infomask. If this
> branch is taken, then MarkBufferDirty() and log_newpage_buffer() are
> skipped.
>

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT

2020-08-03 Thread Wolfgang Walther


Tom Lane:

We don't generally act that way in other ALTER commands and I don't see
a strong argument to start doing so here.  [...]

In short, I'm inclined to argue that this variant of ALTER TABLE
should replace *all* the fields of the constraint with the same
properties it'd have if you'd created it fresh using the same syntax.
This is by analogy to CREATE OR REPLACE commands, which don't
preserve any of the old properties of the replaced object.  Given
the interactions between these fields, I think you're going to end up
with a surprising mess of ad-hoc choices if you do differently.
Indeed, you already have, but I think it'll get worse if anyone
tries to extend the feature set further.


I don't think the analogy to CREATE OR REPLACE holds. Semantically 
REPLACE and ALTER are very different. Using ALTER the expectation is to 
change something, keeping everything else unchanged. Looking at all the 
other ALTER TABLE actions, especially ALTER COLUMN, it looks like every 
command does exactly one thing and not more. I don't think deferrability 
and ON UPDATE / ON CASCADE should be changed together at all, neither 
implicitly nor explicitly.


There seems to be a fundamental difference between deferrability and the 
ON UPDATE/ON DELETE clauses as well - the latter only apply to FOREIGN 
KEYs, while the former apply to multiple types of constraints.


Matheus de Oliveira:
I'm still not sure if the chosen path is the best way. But I'd be glad 
to follow any directions we all see fit.


For now, this patch applies two methods:
1. Changes full constraint definition (which keeps compatibility with 
current ALTER CONSTRAINT):

     ALTER CONSTRAINT [] [] []
2. Changes only the subset explicit seem in the command (a new way, I've 
chosen to just add SET in the middle, similar to `ALTER COLUMN ... SET 
{DEFAULT | NOT NULL}` ):

     ALTER CONSTRAINT SET [] [] []

I'm OK with changing the approach, we just need to chose the color :D


The `ALTER CONSTRAINT SET [] [] []` 
has the same problem about implied changes: What happens if you only do 
e.g. ALTER CONSTRAINT SET ON UPDATE xy - will the ON DELETE part be kept 
as-is or set to the default?


Also, since the ON UPDATE/ON DELETE just applies to FOREIGN KEYs and no 
other constraints, there's one level of "nesting" missing in your SET 
variant, I think.


I suggest to:

- keep `ALTER CONSTRAINT constraint_name [ DEFERRABLE | NOT DEFERRABLE ] 
[ INITIALLY DEFERRED | INITIALLY IMMEDIATE ]` exactly as-is


- add both:
 + `ALTER CONSTRAINT constraint_name [ALTER] FOREIGN KEY ON UPDATE 
referential_action`
 + `ALTER CONSTRAINT constraint_name [ALTER] FOREIGN KEY ON DELETE 
referential_action`


This does not imply any changes, that are not in the command - very much 
analog to the ALTER COLUMN variants.


This could also be extended in the future with stuff like `ALTER 
CONSTRAINT constraint_name [ALTER] FOREIGN KEY MATCH [ FULL | PARTIAL | 
SIMPLE ]`.

Re: Confusing behavior of create table like


On 2020-08-03 14:58, Konstantin Knizhnik wrote:

May be generated and identity columns are good things. I have nothing
against them.
But what preventing us from providing the similar behavior for
serial/bigseries types?


In my mind, serial/bigserial is deprecated and it's not worth spending 
effort on patching them up.


One thing we could do is change serial/bigserial to expand to identity 
column definitions instead of the current behavior.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Cache relation sizes?

2020-08-03 Thread Pavel Stehule

po 3. 8. 2020 v 17:54 odesílatel Konstantin Knizhnik <
k.knizh...@postgrespro.ru> napsal:

>
>
> On 01.08.2020 00:56, Thomas Munro wrote:
> > On Fri, Jul 31, 2020 at 2:36 PM Thomas Munro 
> wrote:
> >> There's still the matter of crazy numbers of lseeks in regular
> >> backends; looking at all processes while running the above test, I get
> >> 1,469,060 (9.18 per pgbench transaction) without -M prepared, and
> >> 193,722 with -M prepared (1.21 per pgbench transaction).  Fixing that
> >> with this approach will require bullet-proof shared invalidation, but
> >> I think it's doable, in later work.
> > I couldn't help hacking on this a bit.  Perhaps instead of
> > bullet-proof general shared invalidation, we should have a way for
> > localised bits of code to state that they are ok with a "relaxed"
> > value.  Then they should explain the theory for why that is safe in
> > each case based on arguments about memory barrier pairs, but leave all
> > other callers making the system call so that we don't torpedo the
> > whole project by making it too hard.  For now, the main cases I'm
> > interested in are the ones that show up all the time as the dominant
> > system call in various workloads:
> >
> > (1) Sequential scan relation-size probe.  This should be OK with a
> > relaxed value.  You can't miss the invalidation for a truncation,
> > because the read barrier in your lock acquisition on the relation
> > pairs with the write barrier in the exclusive lock release of any
> > session that truncated, and you can't miss relation any relation
> > extension that your snapshot can see, because the release of the
> > extension lock pairs with the lock involved in snapshot acquisition.
> >
> > (2) Planner relation-size probe, which should be OK with a relaxed
> > value.  Similar arguments give you a fresh enough view, I think.
> >
> > Or maybe there is a theory along these lines that already covers every
> > use of smgrnblocks(), so a separate mode isn't require... I don't
> > know!
> >
> > The attached sketch-quality patch shows some of the required elements
> > working, though I ran out of steam trying to figure out how to thread
> > this thing through the right API layers so for now it always asks for
> > a relaxed value in table_block_relation_size().
> So in this thread three solutions are proposed:
> 1. "bullet-proof general shared invalidation"
> 2. recovery-only solution avoiding shared memory and invalidation
> 3. "relaxed" shared memory cache with simplified invalidation
>
> If solving such very important by still specific problem of caching
> relation size requires so much efforts,
> then may be it is time to go further in the direction towards shared
> catalog?
> This shared relation cache can easily store relation size as well.
> In addition it will solve a lot of other problems:
> - noticeable overhead of local relcache warming
> - large memory consumption in case of larger number of relations
> O(max_connections*n_relations)
> - sophisticated invalidation protocol and related performance issues
>
> Certainly access to shared cache requires extra synchronization.But DDL
> operations are relatively rare.
>

Some applications use very frequently CREATE TEMP TABLE, DROP TEMP TABLE,
or CREATE TABLE AS SELECT ..

Regards

Pavel

So in most cases we will have only shared locks. May be overhead of
> locking will not be too large?
>
>
>
>

Re: new heapcheck contrib module

On Mon, Aug 3, 2020 at 8:09 AM Robert Haas  wrote:
> I agree that there's a serious design problem with Mark's patch in
> this regard, but I disagree that the effort is pointless on its own
> terms. You're basically postulating that users don't care how corrupt
> their index is: whether there's one problem or one million problems,
> it's all the same. If the user presents an index with one million
> problems and we tell them about one of them, we've done our job and
> can go home.

It's not so much that I think that users won't care about whether any
given index is a bit corrupt or very corrupt. It's more like I don't
think that it's worth the eye-watering complexity, especially without
a real concrete goal in mind. "Counting all the errors, not just the
first" sounds like a tractable goal for the heap/table structure, but
it's just not like that with indexes. If you really wanted to do this,
you'd have to describe a practical scenario under which it made sense
to soldier on, where we'd definitely be able to count the number of
problems in a meaningful way, without much risk of either massively
overcounting or undecounting inconsistencies.

Consider how the search in verify_ntree.c actually works at a high
level. If you thoroughly corrupted one B-Tree leaf page (let's say you
replaced it with an all-zero page image), all pages to the right of
the page would be fundamentally inaccessible to the left-to-right
level search that is coordinated within
bt_check_level_from_leftmost(). And yet, most real index scans can
still be expected to work. How do you know to skip past that one
corrupt leaf page (by going back to the parent to get the next sibling
leaf page) during index verification? That's what it would take to do
this in the general case, I guess. More fundamentally, I wonder how
many inconsistencies one should imagine that this index has, before we
even get into talking about the implementation.

-- 
Peter Geoghegan

Re: Replace remaining StrNCpy() by strlcpy()


On 2020-08-03 14:12, Tom Lane wrote:

I wrote:

David Rowley  writes:

Will mean that we'll now no longer zero the full length of the m_xlog
field after the end of the string. Won't that mean we'll start writing
junk bytes to the stats collector?



StrNCpy doesn't zero-fill the destination today either (except for
the very last byte).


Oh, no, I take that back --- didn't read all of the strncpy man
page :-(.  Yeah, this is a point.  We'd need to check each call
site to see whether the zero-padding matters.


Oh, that's easy to miss.


In the specific case of the stats collector, if you don't want
to be sending junk bytes then you'd better be memset'ing the
whole message buffer not just this string field.  So I'm not
sure that the argument has any force there.  But in places
like namecpy() and namestrcpy() we absolutely do mean to be
zeroing the whole destination buffer.


That's easy to fix, but it's perhaps wondering briefly why it needs to 
be zero-padded.  hashname() doesn't care, heap_form_tuple() doesn't 
care.  Does anything care?


While we're here, shouldn't namestrcpy() do some pg_mbcliplen() stuff 
like namein()?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Replace remaining StrNCpy() by strlcpy()

Peter Eisentraut  writes:
> On 2020-08-03 14:12, Tom Lane wrote:
>> In the specific case of the stats collector, if you don't want
>> to be sending junk bytes then you'd better be memset'ing the
>> whole message buffer not just this string field.  So I'm not
>> sure that the argument has any force there.  But in places
>> like namecpy() and namestrcpy() we absolutely do mean to be
>> zeroing the whole destination buffer.

> That's easy to fix, but it's perhaps wondering briefly why it needs to 
> be zero-padded.  hashname() doesn't care, heap_form_tuple() doesn't 
> care.  Does anything care?

We do have an expectation that there are no undefined bytes in values to
be stored on-disk.  There's even some code in coerce_type() that will
complain about this:

 * For pass-by-reference data types, repeat the conversion to see if
 * the input function leaves any uninitialized bytes in the result. We
 * can only detect that reliably if RANDOMIZE_ALLOCATED_MEMORY is
 * enabled, so we don't bother testing otherwise.  The reason we don't
 * want any instability in the input function is that comparison of
 * Const nodes relies on bytewise comparison of the datums, so if the
 * input function leaves garbage then subexpressions that should be
 * identical may not get recognized as such.  See pgsql-hackers
 * discussion of 2008-04-04.

> While we're here, shouldn't namestrcpy() do some pg_mbcliplen() stuff 
> like namein()?

Excellent point --- probably so, unless the callers are all truncating
in advance, which I doubt.

regards, tom lane

Re: Allow an alias to be attached directly to a JOIN ... USING

2020-08-03 Thread Wolfgang Walther

Peter Eisentraut:

On 2019-12-31 00:07, Vik Fearing wrote:

One thing I notice is that the joined columns are still accessible from
their respective table names when they should not be per spec.  That
might be one of those "silly restrictions" that we choose to ignore, but
it should probably be noted somewhere, at the very least in a code
comment if not in user documentation. (This is my reading of SQL:2016 SR
11.a.i)

Here is a rebased patch.

The above comment is valid.  One reason I didn't implement it is that it 
would create inconsistencies with existing behavior, which is already 
nonstandard.

For example,

create table a (id int, a1 int, a2 int);
create table b (id int, b2 int, b3 int);

makes

select a.id from a join b using (id);

invalid.  Adding an explicit alias for the common column names doesn't 
change that semantically, because an implicit alias also exists if an 
explicit one isn't specified.
I just looked through the patch without applying or testing it - but I 
couldn't find anything that would indicate that this is not going to 
work for e.g. a LEFT JOIN as well. First PG patch I looked at, so tell 
me if I missed something there.

So given this:

SELECT x.id FROM a LEFT JOIN b USING (id) AS x

will this return NULL or a.id for rows that don't match in b? This 
should definitely be mentioned in the docs and I guess a test wouldn't 
be too bad as well?

In any case: If a.id and b.id would not be available anymore, but just 
x.id, either the id value itself or the NULL value (indicating the 
missing row in b) are lost. So this seems like a no-go.

> I agree that some documentation would be in order if we decide to leave
> it like this.

Keep it like that!

Re: public schema default ACL


On 2020-08-03 15:46, Robert Haas wrote:

However, if people are used to
being able to deposit stuff in /usr/bin and you tell them that they
now can't (because the permissions will henceforth be drwxr-xr-x or
the directly won't exist at all) then some of them are going to
complain. I don't know what to do about that: it's a straightforward
trade-off between security and backward compatibility, and you can't
have both.


File system conventions, permissions, and restrictions have been changed 
many times in the history of Unix, Linux, and the like.  Recent examples 
are /usr/bin and /bin unification and that /tmp is changing to a 
per-user mount.  There are of course always a few complaints and some 
breakage, but generally this has been going well and is usually 
appreciated overall.


The important things in my mind are that you keep an easy onboarding 
experience (you can do SQL things without having to create and unlock a 
bunch of things first) and that advanced users can do the things they 
want to do *somehow*.


As an example, per-user /tmp is not hardcoded into the kernel, it's just 
a run-time configuration.  If you want it to behave differently, you can 
set that up.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: avoid bitmapOR-ing indexes with scan condition inconsistent with partition constraint

2020-08-03 Thread Justin Pryzby

Rebased and updated for tests added in 13838740f.

-- 
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581
>From 9272dda812d2b959d0bd536e0679f8f8527da7b1 Mon Sep 17 00:00:00 2001
From: Konstantin Knizhnik 
Date: Fri, 12 Oct 2018 15:53:51 +0300
Subject: [PATCH v3 1/2] Secondary index access optimizations

---
 .../postgres_fdw/expected/postgres_fdw.out|   8 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql |   2 +-
 src/backend/optimizer/path/allpaths.c |   2 +
 src/backend/optimizer/util/plancat.c  |  45 ++
 src/include/optimizer/plancat.h   |   3 +
 src/test/regress/expected/create_table.out|  14 +-
 src/test/regress/expected/inherit.out | 123 ++--
 .../regress/expected/partition_aggregate.out  |  10 +-
 src/test/regress/expected/partition_join.out  |  42 +-
 src/test/regress/expected/partition_prune.out | 587 ++
 src/test/regress/expected/rowsecurity.out |  12 +-
 src/test/regress/expected/update.out  |   4 +-
 12 files changed, 322 insertions(+), 530 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 90db550b92..dbbae1820e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -629,12 +629,12 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- Nu
Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NULL))
 (3 rows)
 
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
- QUERY PLAN  
--
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
+QUERY PLAN
+--
  Foreign Scan on public.ft1 t1
Output: c1, c2, c3, c4, c5, c6, c7, c8
-   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NOT NULL))
+   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((c3 IS NOT NULL))
 (3 rows)
 
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83971665e3..08aef9289e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -304,7 +304,7 @@ RESET enable_nestloop;
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 1; -- Var, OpExpr(b), Const
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 100 AND t1.c2 = 0; -- BoolExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- NullTest
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 = -c1;  -- OpExpr(l)
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE 1 = c1!;   -- OpExpr(r)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..a9171c075c 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -387,6 +387,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		switch (rel->rtekind)
 		{
 			case RTE_RELATION:
+remove_restrictions_implied_by_constraints(root, rel, rte);
 if (rte->relkind == RELKIND_FOREIGN_TABLE)
 {
 	/* Foreign table */
@@ -1040,6 +1041,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			set_dummy_rel_pathlist(childrel);
 			continue;
 		}
+		remove_restrictions_implied_by_constraints(root, childrel, childRTE);
 
 		/*
 		 * Constraint exclusion failed, so copy the parent's join quals and
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 25545029d7..45cd72a0fe 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1557,6 +1557,51 @@ relation_excluded_by_constraints(PlannerInfo *root,
 	return false;
 }
 
+/*
+ * Remove from restrictions list items implied by table constraints
+ */
+void remove_restrictions_implied_by_constraints(PlannerInfo *root,
+RelOptInfo *rel, RangeTblEntry *rte)
+{
+	List	   *constraint_pred;
+	List	   *safe_constraints = NIL;
+	List	   *safe_restrictions = NIL;
+	ListCell   *lc;
+
+	if (rte->rtekind != RTE_RELATION || rte->inh)
+		return;
+
+	/*
+	 * O

Re: Reduce/eliminate the impact of FPW

2020-08-03 Thread Daniel Wood

> On 08/03/2020 8:26 AM Robert Haas  wrote:
...
> I think this is what's called a double-write buffer, or what was tried
> some years ago under that name.  A significant problem is that you
> have to fsync() the double-write buffer before you can write the WAL.

I don't think it does need to be fsync'ed before the WAL.  If the
log record has a FPW reference beyond the physical log EOF then we
don't need to restore the before image because we haven't yet did
the dirty page write from the cache.  The before image only needs
to be flushed before the dirty page write.  Usually this will have
already done.

> ... But for short transactions, such as those
> performed by pgbench, you'd probably end up with a lot of cases where
> you had to write 3 pages instead of 2, and not only that, but the
> writes have to be consecutive rather than simultaneous, and to
> different parts of the disk rather than sequential. That would likely
> suck a lot.

Wherever you write the before images, in the WAL or into a separate
file you would write the same number of pages.  I don't understand
the 3 pages vs 2 pages comment.

And, "different parts of the disk"???  I wouldn't enable the feature
on spinning media unless I had a dedicated disk for it.

NOTE:
If the 90's Informix called this the physical log.  Restoring at
crash time restored physical consistency after which redo/undo
recovery achieved logical consistency.  From their doc's:
"If the before-image of a modified page is stored in the physical-log 
buffer, it is eventually flushed from the physical-log buffer to the physical 
log on disk. The before-image of the page plays a critical role in restoring 
data and fast recovery. For more details, see Physical-Log Buffer."

> -- 
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Re: PG 13 release notes, first draft

On Thu, Jul 30, 2020 at 10:45 AM Peter Geoghegan  wrote:
> On Thu, Jul 30, 2020 at 10:32 AM Bruce Momjian  wrote:
> > I came up with a more verbose documentation suggestion, attached.
>
> I'm okay with this.

Are you going to push this soon, Bruce?

--
Peter Geoghegan

Re: dblnk_is_busy returns 1 for dead connecitons

2020-08-03 Thread Merlin Moncure

On Sun, Aug 2, 2020 at 9:55 PM Merlin Moncure  wrote:
>
> On Sun, Aug 2, 2020 at 7:18 PM Merlin Moncure  wrote:
> >
> > Hackers,
> >
> > I have a situation that I am observing where dblink_is_busy returns 1
> > even though the connection is long gone.   tcp keepalives are on and
> > the connection has been dead for several hours. Looking at the call
> > for dblink_is_busy, I see that  it is a thin wrapper to PQusBusy().
> > If I attempt to call dblink_get_result(), the result comes back with
> > an error mesage, 'invalid socket'. This however is not helpful since
> > there is no way to probe for dead connections in dblink that appears
> > to be 100% reliable.  My workaround that I had been relying on was to
> > call dblink_get_notify twice, which for some weird reason forced the
> > connection error to the surface.  However for whatever reason, that is
> > not working here.
> >
> > In cases the connection was cancelled via dblink_cancel_query(), so in
> > some scenarios connections cancelled that way seem to become 'stuck'.
> > Any thoughts on this?
>
> Correction, keepalives are probably not on, because dblink does not
> have an option to set them.  Also, it looks like dblink_is_busy()
> calls pqConsumeInput without checking the error code.  Is that safe?

I could not reproduce this with application external test script (see
attached if curious).  I alos noticed you can set keepalives in the
libpq connection string, so I'll do that and see if it helps, and
report back for posterity.  Thanks, sorry for the noise.

merlin


dblink_test.sql
Description: Binary data

EDB builds Postgres 13 with an obsolete ICU version

2020-08-03 Thread Daniel Verite

 Hi,

As a follow-up to bug #16570 [1] and other previous discussions
on the mailing-lists, I'm checking out PG13 beta for Windows
from:
 https://www.enterprisedb.com/postgresql-early-experience
and it ships with the same obsolete ICU 53 that was used
for PG 10,11,12.
Besides not having the latest Unicode features and fixes, ICU 53
ignores the BCP 47 tags syntax in collations used as examples
in Postgres documentation, which leads to confusion and
false bug reports.
The current version is ICU 67.

I don't see where the suggestion to upgrade it before the
next PG release should be addressed but maybe some people on
this list do know or have the leverage to make it happen?

[1]
https://www.postgresql.org/message-id/16570-58cc04e1a6ef3c3f%40postgresql.org

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: https://www.manitou-mail.org
Twitter: @DanielVerite

Re: extension patch of CREATE OR REPLACE TRIGGER

2020-08-03 Thread Wolfgang Walther


osumi.takami...@fujitsu.com:

* I'm a little bit concerned about the semantics of changing the
tgdeferrable/tginitdeferred properties of an existing trigger.  If there are 
trigger
events pending, and the trigger is redefined in such a way that those events
should already have been fired, what then?

OK. I need a discussion about this point.
There would be two ideas to define the behavior of this semantics change, I 
think.
The first idea is to throw an error that means
the *pending* trigger can't be replaced during the session.
The second one is just to replace the trigger and ignite the new trigger
at the end of the session when its tginitdeferred is set true.
For me, the first one sounds safer. Yet, I'd like to know other opinions.


IMHO, constraint triggers should behave the same in that regard as other 
constraints. I just checked:


BEGIN;
CREATE TABLE t1 (a int CONSTRAINT u UNIQUE DEFERRABLE INITIALLY DEFERRED);
INSERT INTO t1 VALUES (1),(1);
ALTER TABLE t1 ALTER CONSTRAINT u NOT DEFERRABLE;

will throw with:

ERROR:  cannot ALTER TABLE "t1" because it has pending trigger events
SQL state: 55006

So if a trigger event is pending, CREATE OR REPLACE for that trigger 
should throw. I think it should do in any case, not just when changing 
deferrability. This makes it easier to reason about.


If the user has a pending trigger, they can still do SET CONSTRAINTS 
trigger_name IMMEDIATE; to resolve that and then do CREATE OR REPLACE 
TRIGGER, just like in the ALTER TABLE case.



regression=# create constraint trigger my_trig after insert on trig_table
deferrable initially deferred for each row execute procedure
before_replacement(); CREATE TRIGGER regression=# begin; BEGIN
regression=*# insert into trig_table default values; INSERT 0 1 regression=*#
drop trigger my_trig on trig_table; DROP TRIGGER regression=*# commit;
ERROR:  relation 38489 has no triggers

I could reproduce this bug, using the current master without my patch.
So this is another issue.
I'm thinking that throwing an error when *pending* trigger is dropped
makes sense. Does everyone agree with it ?


Just tested the same example as above, but with DROP TABLE t1; instead 
of ALTER TABLE. This throws with:


ERROR:  cannot DROP TABLE "t1" because it has pending trigger events
SQL state: 55006

So yes, your suggestion makes a lot of sense!

Re: psql - improve test coverage from 41% to 88%

2020-08-03 Thread Fabien COELHO




Re-reading this thread, I see no complaints about introducing a 
dependency on Expect.


Indeed, Tom's complaint was on another thread, possibly when interactive 
tests "src/bin/psql/t/010_tab_completion.pl" were added.


ISTM that one of the issue was that some farm animal would be broken.

I'm quite lost about Expect portability discussion wrt windows, it is 
unclear to me whether it is expected to work there or not.


As I stated, I do not like re-inventing the wheel, probably quite badly, 
when someone else already did a good job.


--
Fabien.

Re: Reduce/eliminate the impact of FPW

2020-08-03 Thread SATYANARAYANA NARLAPURAM

Increasing checkpoint_timeout helps reduce the amount of log written to the
disk. This has several benefits like, reduced number of WAL IO, archival
load on the system, less network traffic to the standby replicas. However,
this increases the crash recovery time and impact server availability.
Investing in parallel recovery for Postgres helps reduce the crash recovery
time and allows us to change the checkpoint frequency to much higher value?
This idea is orthogonal to the double write improvements mentioned in the
thread. Thomas Munro has a patch of doing page prefetching during recovery
which speeds up recovery if the working set doesn't fit in the memory, we
also need parallel recovery to replay huge amounts of WAL, when the working
set is in memory.

Thanks,
Satya

On Mon, Aug 3, 2020 at 11:14 AM Daniel Wood  wrote:

>
> > On 08/03/2020 8:26 AM Robert Haas  wrote:
> ...
> > I think this is what's called a double-write buffer, or what was tried
> > some years ago under that name.  A significant problem is that you
> > have to fsync() the double-write buffer before you can write the WAL.
>
> I don't think it does need to be fsync'ed before the WAL.  If the
> log record has a FPW reference beyond the physical log EOF then we
> don't need to restore the before image because we haven't yet did
> the dirty page write from the cache.  The before image only needs
> to be flushed before the dirty page write.  Usually this will have
> already done.
>
> > ... But for short transactions, such as those
> > performed by pgbench, you'd probably end up with a lot of cases where
> > you had to write 3 pages instead of 2, and not only that, but the
> > writes have to be consecutive rather than simultaneous, and to
> > different parts of the disk rather than sequential. That would likely
> > suck a lot.
>
> Wherever you write the before images, in the WAL or into a separate
> file you would write the same number of pages.  I don't understand
> the 3 pages vs 2 pages comment.
>
> And, "different parts of the disk"???  I wouldn't enable the feature
> on spinning media unless I had a dedicated disk for it.
>
> NOTE:
> If the 90's Informix called this the physical log.  Restoring at
> crash time restored physical consistency after which redo/undo
> recovery achieved logical consistency.  From their doc's:
> "If the before-image of a modified page is stored in the physical-log
> buffer, it is eventually flushed from the physical-log buffer to the
> physical log on disk. The before-image of the page plays a critical role in
> restoring data and fast recovery. For more details, see Physical-Log
> Buffer."
>
> > --
> > Robert Haas
> > EnterpriseDB: http://www.enterprisedb.com
> > The Enterprise PostgreSQL Company
>
>
>

Re: PG 13 release notes, first draft

On Mon, Aug  3, 2020 at 11:35:50AM -0700, Peter Geoghegan wrote:
> On Thu, Jul 30, 2020 at 10:45 AM Peter Geoghegan  wrote:
> > On Thu, Jul 30, 2020 at 10:32 AM Bruce Momjian  wrote:
> > > I came up with a more verbose documentation suggestion, attached.
> >
> > I'm okay with this.
> 
> Are you going to push this soon, Bruce?

Done.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Amcheck: do rightlink verification with lock coupling

On Mon, Jul 20, 2020 at 11:46 AM Andrey M. Borodin  wrote:
> In this thread [0] we decided that lock coupling is necessary for 
> btree_xlog_unlink_page().
> So, maybe let's revive this patch?

Yes, let's do that. Can you resubmit it, please?


Peter Geoghegan

Re: psql - improve test coverage from 41% to 88%

2020-08-03 Thread Andrew Dunstan

On 8/3/20 3:34 PM, Fabien COELHO wrote:
>
>
> I'm quite lost about Expect portability discussion wrt windows, it is
> unclear to me whether it is expected to work there or not.

Sorry if I was unclear. Expect will not work on Windows. Nor will use of
IO::Pty  or IO::Tty, which are what Expect uses under the hood. So use
of any of that needs to be done just as it is done on
010_tab_completion.pl, i.e.

eval { require IO::Pty; };
if ($@)
{
    plan skip_all => 'IO::Pty is needed to run this test';
}

cheers

andrew

-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On Sun, Aug 2, 2020 at 9:07 AM Michail Nikolaev
 wrote:
> Thanks for your work, the patch is looking better now.

Pushed -- thanks!

-- 
Peter Geoghegan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-08-03 Thread Andy Fan

Hi Floris:

On Thu, Jul 23, 2020 at 3:22 AM Floris Van Nee 
wrote:

> Hi Andy,
>
>
>
> A small thing I found:
>
>
>
> +static List *
>
> +get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
>
> +
>List *const_exprs,
>
> +
>List *const_expr_opfamilies,
>
> +
>Bitmapset *used_varattrs,
>
> +
>bool *useful,
>
> +
>bool *multi_nullvals)
>
> …
>
> + indexpr_item = list_head(unique_index->indexprs);
>
> + for(c = 0; c < unique_index->ncolumns; c++)
>
> + {
>
>
>
> I believe the for loop must be over unique_index->nkeycolumns, rather than
> columns. It shouldn’t include the extra non-key columns. This can currently
> lead to invalid memory accesses as well a few lines later when it does an
> array access of unique_index->opfamily[c] – this array only has nkeycolumns
> entries.
>

You are correct, I would include this in the next version patch, Thank you
for this checking!

--
Andy Fan
Best Regards

>
>
>
>
> *From:* Andy Fan 
> *Sent:* Sunday 19 July 2020 5:03 AM
> *To:* Dmitry Dolgov <9erthali...@gmail.com>
> *Cc:* David Rowley ; PostgreSQL Hackers <
> pgsql-hackers@lists.postgresql.org>; Tom Lane ;
> Ashutosh Bapat ; rushabh.lat...@gmail.com
> *Subject:* Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
> [External]
>
>
>
> Fixed a test case in v10.
>
>
>
> --
>
> Best Regards
>
> Andy Fan
>


-- 
Best Regards
Andy Fan

ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-08-03 Thread Alvaro Herrera

I've been working on the ability to detach a partition from a
partitioned table, without causing blockages to concurrent activity.
I think this operation is critical for some use cases.

There was a lot of great discussion which ended up in Robert completing
a much sought implementation of non-blocking ATTACH.  DETACH was
discussed too because it was a goal initially, but eventually dropped
from that patch altogether. Nonetheless, that thread provided a lot of
useful input to this implementation.  Important ones:

[1] 
https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w...@mail.gmail.com
[2] 
https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=lj...@mail.gmail.com
[3] 
https://postgr.es/m/CA+TgmoY13KQZF-=hntrt9uywyx3_oyoqpu9iont49jggidp...@mail.gmail.com

Attached is a patch that implements
ALTER TABLE ... DETACH PARTITION .. CONCURRENTLY.

In the previous thread we were able to implement the concurrent model
without the extra keyword.  For this one I think that won't work; my
implementation works in two transactions so there's a restriction that
you can't run it in a transaction block.  Also, there's a wait phase
that makes it slower than the non-concurrent one.  Those two drawbacks
make me think that it's better to keep both modes available, just like
we offer both CREATE INDEX and CREATE INDEX CONCURRENTLY.

Why two transactions?  The reason is that in order for this to work, we
make a catalog change (mark it detached), and commit so that all
concurrent transactions can see the change.  A second transaction waits
for anybody who holds any lock on the partitioned table and grabs Access
Exclusive on the partition (which now no one cares about, if they're
looking at the partitioned table), where the DDL action on the partition
can be completed.

ALTER TABLE is normally unable to run in two transactions.  I hacked it
(0001) so that the relation can be closed and reopened in the Exec phase
(by having the rel as part of AlteredTableInfo: when ATRewriteCatalogs
returns, it uses that pointer to close the rel).  It turns out that this
is sufficient to make that work.  This means that ALTER TABLE DETACH
CONCURRENTLY cannot work as part of a multi-command ALTER TABLE, but
that's alreay enforced by the grammar anyway.

DETACH CONCURRENTLY doesn't work if a default partition exists.  It's
just too problematic a case; you would still need to have AEL on the
default partition.


I haven't yet experimented with queries running in a standby in tandem
with a detach.

-- 
Álvaro Herrera
>From 2f9202bf6f4c86d607032d7f04d3b2cee74a9617 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 13 Jul 2020 20:15:30 -0400
Subject: [PATCH v1 1/2] Let ALTER TABLE exec routines deal with the relation

This means that ATExecCmd relies on AlteredRelationInfo->rel instead of
keeping the relation as a local variable; this is useful if the
subcommand needs to modify the relation internally.  For example, a
subcommand that internally commits its transaction needs this.
---
 src/backend/commands/tablecmds.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ac53f79ada..07f4f562ee 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -157,6 +157,8 @@ typedef struct AlteredTableInfo
 	Oid			relid;			/* Relation to work on */
 	char		relkind;		/* Its relkind */
 	TupleDesc	oldDesc;		/* Pre-modification tuple descriptor */
+	/* Transiently set during Phase 2, normally set to NULL */
+	Relation	rel;
 	/* Information saved by Phase 1 for Phase 2: */
 	List	   *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */
 	/* Information saved by Phases 1/2 for Phase 3: */
@@ -354,7 +356,7 @@ static void ATPrepCmd(List **wqueue, Relation rel, AlterTableCmd *cmd,
 	  AlterTableUtilityContext *context);
 static void ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			  AlterTableUtilityContext *context);
-static void ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
+static void ATExecCmd(List **wqueue, AlteredTableInfo *tab,
 	  AlterTableCmd *cmd, LOCKMODE lockmode, int cur_pass,
 	  AlterTableUtilityContext *context);
 static AlterTableCmd *ATParseTransformCmd(List **wqueue, AlteredTableInfo *tab,
@@ -4326,7 +4328,6 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 		{
 			AlteredTableInfo *tab = (AlteredTableInfo *) lfirst(ltab);
 			List	   *subcmds = tab->subcmds[pass];
-			Relation	rel;
 			ListCell   *lcmd;
 
 			if (subcmds == NIL)
@@ -4335,10 +4336,10 @@ ATRewriteCatalogs(List **wqueue, LOCKMODE lockmode,
 			/*
 			 * Appropriate lock was obtained by phase 1, needn't get it again
 			 */
-			rel = relation_open(tab->relid, NoLock);
+			tab->rel = relation_open(tab->relid, NoLock);
 
 			foreach(lcmd, subcmds)
-ATExecCmd(wqueue, tab, rel,
+ATExecCmd(wqueue, tab,
 		  castNode(AlterTableCmd, lfirst(lcmd)),
 		  lock

Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2020-08-03 Thread Alvaro Herrera

On 2020-Aug-03, Alvaro Herrera wrote:

> There was a lot of great discussion which ended up in Robert completing
> a much sought implementation of non-blocking ATTACH.  DETACH was
> discussed too because it was a goal initially, but eventually dropped
> from that patch altogether. Nonetheless, that thread provided a lot of
> useful input to this implementation.  Important ones:
> 
> [1] 
> https://postgr.es/m/CA+TgmoYg4x7AH=_QSptvuBKf+3hUdiCa4frPkt+RvXZyjX1n=w...@mail.gmail.com
> [2] 
> https://postgr.es/m/CA+TgmoaAjkTibkEr=xJg3ndbRsHHSiYi2SJgX69MVosj=lj...@mail.gmail.com
> [3] 
> https://postgr.es/m/CA+TgmoY13KQZF-=hntrt9uywyx3_oyoqpu9iont49jggidp...@mail.gmail.com

There was some discussion about having a version number in the partition
descriptor somewhere as a means to implement this.  I couldn't figure
out how that would work, or what the version number would be attached
to.  Surely the idea wasn't to increment the version number to every
partition other than the one being detached?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Document "59.2. Built-in Operator Classes" have a clerical error?

On Sun, Aug  2, 2020 at 08:43:53PM -0700, David G. Johnston wrote:
> On Sun, Aug 2, 2020 at 8:17 PM osdba  wrote:
> 
> hi all:
> 
> In Document "Table 59-1. Built-in GiST Operator Classes":
> 
> "range_ops any range type && &> &< >> << <@ -|- = @> @>"， exist double "@>
> "，
>  
> Should be "<@ @>" ?
> 
> 
> 
> It helps to reference the current version of the page (or provide a url link)
> as that section seems to have migrated to Chapter 64 - though it is unchanged
> even on the main development branch.
> 
> The table itself is extremely difficult to read: it would be more easily
> readable if the font was monospaced, but its not.
> 
> I'm reasonably confident that the equal sign is part of the second-to-last
> operator while the lone @> is the final operator.  Mostly I say this because
> GiST doesn't do straight equality so a lone equal operator isn't valid.

I dug into this.  This query I think explains why the duplicate is
there:

SELECT oprname, oprleft::regtype, oprright::regtype, oprresult::regtype
FROM pg_am
JOIN pg_opclass ON opcmethod = pg_am.oid
JOIN pg_amop ON opcfamily = pg_amop.amopfamily
JOIN pg_operator ON amopopr = pg_operator.oid
WHERE amname = 'gist'
AND opcname = 'range_ops'
ORDER BY 1

 oprname | oprleft  |  oprright  | oprresult
-+--++---
 &&  | anyrange | anyrange   | boolean
 &<  | anyrange | anyrange   | boolean
 &>  | anyrange | anyrange   | boolean
 -|- | anyrange | anyrange   | boolean
 <<  | anyrange | anyrange   | boolean
 <@  | anyrange | anyrange   | boolean
 =   | anyrange | anyrange   | boolean
 >>  | anyrange | anyrange   | boolean
-->  @>  | anyrange | anyrange   | boolean
-->  @>  | anyrange | anyelement | boolean

Notice that @> appears twice.  (I am not sure why @> appears twice in
the SQL output, while <@ appears only once.)  The PG docs explain the
duplicate:

https://www.postgresql.org/docs/12/functions-range.html

@>  contains range  int4range(2,4) @> int4range(2,3)t
@>  contains element'[2011-01-01,2011-03-01)'::tsrange @> 
'2011-01-10'::timestamp   t
<@  range is contained by   int4range(2,4) <@ int4range(1,7)
t
<@  element is contained by 42 <@ int4range(1,7)f

There is an anyrange/anyrange version, and an anyrange/anyelement
version of @> and <@.  Anyway, for the docs, I think we can either
remove the duplicate entry, or modify it to clarify one is for
anyrange/anyrange and another is for anyrange/anyelement.  I suggest the
first option.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: Document "59.2. Built-in Operator Classes" have a clerical error?



Ah, seems Tom has even more detail so we will continue to discuss on
that thread.

---

On Mon, Aug  3, 2020 at 07:51:51PM -0400, Bruce Momjian wrote:
> On Sun, Aug  2, 2020 at 08:43:53PM -0700, David G. Johnston wrote:
> > On Sun, Aug 2, 2020 at 8:17 PM osdba  wrote:
> > 
> > hi all:
> > 
> > In Document "Table 59-1. Built-in GiST Operator Classes":
> > 
> > "range_ops any range type && &> &< >> << <@ -|- = @> @>"， exist double 
> > "@>
> > "，
> >  
> > Should be "<@ @>" ?
> > 
> > 
> > 
> > It helps to reference the current version of the page (or provide a url 
> > link)
> > as that section seems to have migrated to Chapter 64 - though it is 
> > unchanged
> > even on the main development branch.
> > 
> > The table itself is extremely difficult to read: it would be more easily
> > readable if the font was monospaced, but its not.
> > 
> > I'm reasonably confident that the equal sign is part of the second-to-last
> > operator while the lone @> is the final operator.  Mostly I say this because
> > GiST doesn't do straight equality so a lone equal operator isn't valid.
> 
> I dug into this.  This query I think explains why the duplicate is
> there:
> 
>   SELECT oprname, oprleft::regtype, oprright::regtype, oprresult::regtype
>   FROM pg_am
>   JOIN pg_opclass ON opcmethod = pg_am.oid
>   JOIN pg_amop ON opcfamily = pg_amop.amopfamily
>   JOIN pg_operator ON amopopr = pg_operator.oid
>   WHERE amname = 'gist'
>   AND opcname = 'range_ops'
>   ORDER BY 1
> 
>oprname | oprleft  |  oprright  | oprresult
>   -+--++---
>&&  | anyrange | anyrange   | boolean
>&<  | anyrange | anyrange   | boolean
>&>  | anyrange | anyrange   | boolean
>-|- | anyrange | anyrange   | boolean
><<  | anyrange | anyrange   | boolean
><@  | anyrange | anyrange   | boolean
>=   | anyrange | anyrange   | boolean
>>>  | anyrange | anyrange   | boolean
> -->@>  | anyrange | anyrange   | boolean
> -->@>  | anyrange | anyelement | boolean
> 
> Notice that @> appears twice.  (I am not sure why @> appears twice in
> the SQL output, while <@ appears only once.)  The PG docs explain the
> duplicate:
> 
>   https://www.postgresql.org/docs/12/functions-range.html
> 
>   @>  contains range  int4range(2,4) @> int4range(2,3)t
>   @>  contains element'[2011-01-01,2011-03-01)'::tsrange @> 
> '2011-01-10'::timestamp   t
>   <@  range is contained by   int4range(2,4) <@ int4range(1,7)
> t
>   <@  element is contained by 42 <@ int4range(1,7)f
> 
> There is an anyrange/anyrange version, and an anyrange/anyelement
> version of @> and <@.  Anyway, for the docs, I think we can either
> remove the duplicate entry, or modify it to clarify one is for
> anyrange/anyrange and another is for anyrange/anyelement.  I suggest the
> first option.
> 
> -- 
>   Bruce Momjian  https://momjian.us
>   EnterpriseDB https://enterprisedb.com
> 
>   The usefulness of a cup is in its emptiness, Bruce Lee
> 
> 
> 

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: EDB builds Postgres 13 with an obsolete ICU version

On Mon, Aug  3, 2020 at 08:56:06PM +0200, Daniel Verite wrote:
>  Hi,
> 
> As a follow-up to bug #16570 [1] and other previous discussions
> on the mailing-lists, I'm checking out PG13 beta for Windows
> from:
>  https://www.enterprisedb.com/postgresql-early-experience
> and it ships with the same obsolete ICU 53 that was used
> for PG 10,11,12.
> Besides not having the latest Unicode features and fixes, ICU 53
> ignores the BCP 47 tags syntax in collations used as examples
> in Postgres documentation, which leads to confusion and
> false bug reports.
> The current version is ICU 67.
> 
> I don't see where the suggestion to upgrade it before the
> next PG release should be addressed but maybe some people on
> this list do know or have the leverage to make it happen?

Well, you can ask EDB about this, but perhaps the have kept the same ICU
version so indexes will not need to be reindexed.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee

Re: public schema default ACL

2020-08-03 Thread David G. Johnston

On Sun, Aug 2, 2020 at 11:30 PM Noah Misch  wrote:

>
> Interaction with dump/restore (including pg_upgrade) options:
> a. If the schema has a non-default ACL, dump/restore reproduces it.
>Otherwise, the new default prevails.
> b. Dump/restore always reproduces the schema ACL.
>
> Initial ownership of schema "public" options:
> 1. Bootstrap superuser owns it.  (Without superuser cooperation, database
>owners can't drop it or create objects in it.)
> 2. Don't create the schema during initdb.  Database owners can create it or
>any other schema.  (A superuser could create it in template1, which
>converts an installation to option (1).)
> 3. Database owner owns it.  (One might implement this by offering ALTER
> SCHEMA
>x OWNER TO DATABASE_OWNER, which sets nspowner to a system OID meaning
>"refer to pg_database.datdba".  A superuser could issue DDL to convert
> to
>option (1) or (2).)
>
> Automatic creation of $user schemas options:
> X. Automatic schema creation doesn't exist.
> Y. Create $user schemas on-demand (at login time or CREATE TABLE/CREATE
>FUNCTION/etc. time) if the DBA specified a "SCHEMA_CREATE" option in the
>CREATE ROLE statement.
> Z. Like (Y), but SCHEMA_CREATE is the default.
>
> Between (b)(2)(X) and (b)(3)(X), what are folks' preferences?  Does anyone
> strongly favor some other option (including the option of changing nothing)
> over both of those two?
>

Both, as well as a reconsideration of not providing an escape hatch to the
search_path change as part of dump/restore in response to a number of
emails to these lists.

I like an option 2 that simply and quickly allows a DBA to setup a system
with zero-trust and have all grants be made explicitly.  This would go
beyond just the public schema and basically remove the concept of grants to
the built-in PUBLIC group.

I like option 3 for the user-friendly default option that has as few
compatibility issues compared to today as possible.

David J.

Re: 13dev failed assert: comparetup_index_btree(): ItemPointer values should never be equal

On Tue, Jul 28, 2020 at 3:09 PM Peter Geoghegan  wrote:
> Perhaps 2011's commit 520bcd9c9bb missed similar
> HEAPTUPLE_INSERT_IN_PROGRESS issues that manifest themselves within
> Justin's test case now?

Any further thoughts on this, Tom?

This is clearly a live bug that we should fix before too long. I could
write the patch myself, but I would like to get your response to my
analysis before starting down that road.

--
Peter Geoghegan

At the moment JIT compilation, if enabled, is applied to all
expressions in the entire plan. This can sometimes be a problem as
some expressions may be evaluated lots and warrant being JITted, but
others may only be evaluated just a few times, or even not at all.

This problem tends to become large when table partitioning is involved
as the number of expressions in the plan grows with each partition
present in the plan. Some partitions may have many rows and it can be
useful to JIT expression, but others may have few rows or even no
rows, in which case JIT is a waste of effort.

I recall a few cases where people have complained that JIT was too
slow. One case, in particular, is [1].

It would be nice if JIT was more granular about which parts of the
plan it could be enabled for. So I went and did that in the attached.

The patch basically changes the plan-level consideration of if JIT
should be enabled and to what level into a per-plan-node
consideration. So, instead of considering JIT based on the overall
total_cost of the plan, we just consider it on the plan-node's
total_cost.

I was just planing around with a test case of:

create table listp(a int, b int) partition by list(a);
select 'create table listp'|| x || ' partition of listp for values
in('||x||');' from generate_Series(1,1000) x;
\gexec
insert into listp select 1,x from generate_series(1,1) x;
vacuum analyze listp;

explain (analyze, buffers) select count(*) from listp where b < 0;

I get:

master jit=on
JIT:
Functions: 3002
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 141.587 ms, Inlining 11.760 ms, Optimization
6518.664 ms, Emission 3152.266 ms, Total 9824.277 ms
Execution Time: 12588.292 ms
(2013 rows)

master jit=off
Execution Time: 3672.391 ms

patched jit=on
JIT:
Functions: 5
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.675 ms, Inlining 3.322 ms, Optimization 10.766
ms, Emission 5.892 ms, Total 20.655 ms
Execution Time: 2754.160 ms

This explain format will need further work as each of those flags is
now per plan node rather than on the plan as a whole. I considered
just making the true/false a counter to count the number of functions,
e.g Inlined: 5 Optimized: 5 etc.

I understand from [2] that Andres has WIP code to improve the
performance of JIT compilation. That's really great, but I also
believe that no matter how fast we make it, it's going to be a waste
of effort unless the expressions are evaluated enough times for the
cheaper evaluations to pay off the compilation costs. It'll never be a
win when we evaluate certain expressions zero times. What Andres has
should allow us to drop the default jit costs.

Happy to hear people's thoughts on this.

David

[1]
https://www.postgresql.org/message-id/7736c40e-6db5-4e7a-8fe3-4b2ab8e22...@elevated-dev.com
[2]
https://www.postgresql.org/message-id/20200728212806.tu5ebmdbmfrvh...@alap3.anarazel.de

granular_jit_v1.patch
Description: Binary data

Re: Cache relation sizes?

2020-08-03 Thread Thomas Munro

On Tue, Aug 4, 2020 at 3:54 AM Konstantin Knizhnik
 wrote:
> So in this thread three solutions are proposed:
> 1. "bullet-proof general shared invalidation"
> 2. recovery-only solution avoiding shared memory and invalidation
> 3. "relaxed" shared memory cache with simplified invalidation

Hi Konstantin,

By the way, point 2 is now committed (c5315f4f).  As for 1 vs 3, I
wasn't proposing two different invalidation techniques: in both
approaches, I'm calling the cached values "relaxed", meaning that
their freshness is controlled by memory barriers elsewhere that the
caller has to worry about.  I'm just suggesting for idea 3 that it
might be a good idea to use relaxed values only in a couple of hot
code paths where we do the analysis required to convince ourselves
that memory barriers are already in the right places to make it safe.
By "bullet-proof" I meant that we could in theory convince ourselves
that *all* users of smgrnblocks() can safely use relaxed values, but
that's hard.

That said, the sketch patch I posted certainly needs more work, and
maybe someone has a better idea on how to do it.

> If solving such very important by still specific problem of caching
> relation size requires so much efforts,
> then may be it is time to go further in the direction towards shared
> catalog?

I wouldn't say it requires too much effort, at least the conservative
approach (3).  But I also agree with what you're saying, in the long
term:

> This shared relation cache can easily store relation size as well.
> In addition it will solve a lot of other problems:
> - noticeable overhead of local relcache warming
> - large memory consumption in case of larger number of relations
> O(max_connections*n_relations)
> - sophisticated invalidation protocol and related performance issues
> Certainly access to shared cache requires extra synchronization.But DDL
> operations are relatively rare.
> So in most cases we will have only shared locks. May be overhead of
> locking will not be too large?

Yeah, I would be very happy if we get a high performance shared
sys/rel/plan/... caches in the future, and separately, having the
relation size available in shmem is something that has come up in
discussions about other topics too (tree-based buffer mapping,
multi-relation data files, ...).  I agree with you that our cache
memory usage is a big problem, and it will be great to fix that one
day.  I don't think that should stop us from making small improvements
to the existing design in the meantime, though.  "The perfect is the
enemy of the good."  Look at all this trivial stuff:

https://wiki.postgresql.org/wiki/Syscall_Reduction

I don't have high quality data to report yet, but from simple tests
I'm seeing orders of magnitude fewer syscalls per pgbench transaction
in recovery, when comparing 11 to 14devel, due to various changes.
Admittedly, the size probes in regular backends aren't quite as bad as
the recovery ones were, because they're already limited by the rate
you can throw request/response cycles at the DB, but there are still
some cases like planning for high-partition counts and slow
filesystems that can benefit measurably from caching.

SSL TAP test fails due to default client certs.