Re: Windows now has fdatasync()

2022-04-08 Thread Michael Paquier
On Fri, Apr 08, 2022 at 12:40:55AM -0400, Tom Lane wrote:
> As long as the C11-isms are in MSVC-only code, it seems like this is
> exactly equivalent to setting a minimum MSVC version.  I don't see
> an objection-in-principle there, it's just a practical question of
> how far back is reasonable to support MSVC versions.  (That's very
> distinct from how far back we need the built code to run.)

Good question.  Older versions of VS are available, so this is not a
problem:
https://visualstudio.microsoft.com/vs/older-downloads/

I think that we should at least drop 2013, as there is a bunch of
stuff related to _MSC_VER < 1900 that could be removed with that,
particularly for locales.
--
Michael




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-08 Thread Michael Paquier
On Thu, Apr 07, 2022 at 10:19:35PM -0400, Robert Haas wrote:
> Yeah, that's exactly why I didn't do what Michael proposes. If we're
> going to go to this trouble to avoid changing the layout of a PGPROC,
> we must be doing that on the theory that extension code cares about
> delayChkpt. And if that is so, it seems reasonable to suppose that it
> might also want to call the associated functions.

Compatibility does not strike me as a problem if we use two static inline
functions as wrappers around their common logic.

> Honestly, I wouldn't have thought that this mattered, because I
> wouldn't have guessed that any non-core code cared about delayChkpt.
> But I would have been wrong.

That's a minor point.  If you wish to keep this code as you are
proposing, that's fine by me as well.
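
For illustration, the arrangement being discussed might look roughly like
this (a sketch only: the shape follows the thread, but the names and
signatures are simplified stand-ins, not the actual backend API):

#include <stdbool.h>

typedef struct PGPROC PGPROC;   /* opaque in this sketch; fields omitted */

/* The common logic lives in one place; the flag selects which delay to check. */
static bool
DelayingChkptGuts(PGPROC *proc, bool check_end)
{
    /* ... examine the start/end delay flag on the proc ... */
    (void) proc;
    (void) check_end;
    return false;
}

/* Thin, ABI-friendly wrappers around the shared implementation. */
static inline bool
IsDelayingChkptStart(PGPROC *proc)
{
    return DelayingChkptGuts(proc, false);
}

static inline bool
IsDelayingChkptEnd(PGPROC *proc)
{
    return DelayingChkptGuts(proc, true);
}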
--
Michael




Re: Handle infinite recursion in logical replication setup

2022-04-08 Thread vignesh C
On Tue, Apr 5, 2022 at 6:21 AM Peter Smith  wrote:
>
> Here are my comments for the latest patch v6-0001.
>
> (I will post my v6-0002 review comments separately)
>
> PATCH v6-0001 comments
> ==
>
> 1.1 General - Option name
>
> I still feel like the option name is not ideal. Unfortunately, this is
> important because any name change would impact lots of these patch
> files and docs, struct members etc.
>
> It was originally called "local_only", but I thought that as a
> SUBSCRIPTION option that was confusing because "local" means local to
> what? Really it is local to the publisher, not local to the
> subscriber, so that name seemed misleading.
>
> Then I suggested "publish_local_only". Although that resolved the
> ambiguity problem, other people thought it seemed odd to have the
> "publish" prefix for a subscription-side option.
>
> So now it is changed again to "subscribe_local_only" -- It's getting
> better but still, it is implied that the "local" means local to the
> publisher except there is nothing in the option name really to convey
> that meaning. IMO we here all understand the meaning of this option
> mostly by familiarity with this discussion thread, but I think a user
> coming to this for the first time will still be confused by the name.
>
> Below are some other name ideas. Maybe they are not improvements, but
> it might help other people to come up with something better.
>
> subscribe_publocal_only = true/false
> origin = pub_only/any
> adjacent_data_only = true/false
> direct_subscriptions_only = true/false
> ...
>
>
> (FYI, the remainder of these review comments will assume the option is
> still called "subscribe_local_only")

Modified to local_only

> ~~~
>
> 1.2 General - inconsistent members and args
>
> IMO the struct members and args should also be named for close
> consistency with whatever the option name is.
>
> Currently the option is called "subscribe_local_only".  So I think
> the members/args would be better to be called "local_only" instead of
> "only_local".

Modified

> ~~~
>
> 1.3 Commit message - wrong option name
>
> The commit message refers to the option name as "publish_local_only"
> instead of the option name that is currently implemented.

Modified

> ~~~
>
> 1.4 Commit message - wording
>
> The wording seems a bit off. Below is suggested simpler wording which
> AFAIK conveys the same information.
>
> BEFORE
> Add an option publish_local_only which will subscribe only to the locally
> generated data in the publisher node. If subscriber is created with this
> option, publisher will skip publishing the data that was subscribed
> from other nodes. It can be created using following syntax:
> ex: CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port='
> PUBLICATION pub1 with (publish_local_only = on);
>
> SUGGESTION
> This patch adds a new SUBSCRIPTION boolean option
> "subscribe_local_only". The default is false. When a SUBSCRIPTION is
> created with this option enabled, the publisher will only publish data
> that originated at the publisher node.
> Usage:
> CREATE SUBSCRIPTION sub1 CONNECTION 'dbname =postgres port='
> PUBLICATION pub1 with (subscribe_local_only = true);

Modified

> ~~~
>
> 1.5 doc/src/sgml/ref/create_subscription.sgml - "generated" changes.
>
> + 
> +  Specifies whether the subscription will request the publisher to 
> send
> +  locally generated changes or both the locally generated changes and
> +  the replicated changes that was generated from other nodes. The
> +  default is false.
> + 
>
> For some reason, it seemed a bit strange to me to use the term
> "generated" changes. Maybe better to refer to the origin of changes?
>
> SUGGESTION
> Specifies whether the publisher should send only changes that
> originated locally at the publisher node, or send any publisher node
> changes regardless of their origin. The default is false.

Modified

> ~~~
>
> 1.6 src/backend/replication/pgoutput/pgoutput.c -
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM
>
> @@ -496,6 +509,12 @@ pgoutput_startup(LogicalDecodingContext *ctx,
> OutputPluginOptions *opt,
>   else
>   ctx->twophase_opt_given = true;
>
> + if (data->only_local && data->protocol_version <
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("requested proto_version=%d does not support
> subscribe_local_only, need %d or higher",
> + data->protocol_version, LOGICALREP_PROTO_TWOPHASE_VERSION_NUM)));
>
> I thought this code should not be using
> LOGICALREP_PROTO_TWOPHASE_VERSION_NUM. Shouldn't there be some newly
> introduced constant like LOGICALREP_PROTO_LOCALONLY_VERSION_NUM which
> you will use here?

Modified. I have set LOGICALREP_PROTO_LOCALONLY_VERSION_NUM to the same
value as LOGICALREP_PROTO_TWOPHASE_VERSION_NUM, and will increment it
once the server version is changed.
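
An illustrative sketch of that arrangement (not the actual patch hunk; the
existing constant's value of 3 is an assumption here, not taken from this
thread):

#define LOGICALREP_PROTO_TWOPHASE_VERSION_NUM   3
#define LOGICALREP_PROTO_LOCALONLY_VERSION_NUM  LOGICALREP_PROTO_TWOPHASE_VERSION_NUM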

> ~~~
>
> 1.7 src/bin/pg_dump/pg_dump.c - 15
>
> @@ -4451,11 +4452,13 @@ getSubscriptio

Re: Handle infinite recursion in logical replication setup

2022-04-08 Thread vignesh C
On Tue, Apr 5, 2022 at 8:17 AM Peter Smith  wrote:
>
> Here are my comments for the latest patch v6-0002.
>
> PATCH v6-0002 comments
> ==
>
> 2.1 General - should this be an independent patch?
>
> In many ways, I think most of this patch is unrelated to the other
> "local_only" patch (v6-0001).
>
> For example, IIUC even in the current HEAD, we could consider it to be
> a user error if multiple SUBSCRIPTIONS or multiple PUBLICATIONS of the
> same SUBSCRIPTION are replicating to the same TABLE on the same node
> and using "copy_data = on".
>
> So I think it would be ok to throw an ERROR if such a copy_data clash
> is detected, and then the user will have to change to use "copy_data =
> off" for some/all of them to avoid data duplication.
>
> The "local_only" option only adds some small logic to this new ERROR,
> but it's not really a prerequisite at all.
>
> e.g. this whole ERROR part of the patch can be a separate thread.

As Amit also pointed out, the patches have some dependency, so I will
keep it as it is.

> ~~~
>
> 2.2 General - can we remove the "force" enum?
>
> Now, because I consider the clashing "copy_data = on" ERROR to be a
> user error, I think that is something that the user can already take
> care of themselves just using the copy_data = off.
>
> I did not really like the modifying of the "copy_data" option from
> just boolean to some kind of hybrid boolean + "force".
>
> a) It smells a bit off to me. IMO replication is supposed to end up
> with the same (replicated) data on the standby machine but this
> "force" mode seems to be just helping the user to break that concept
> and say - "I know what I'm doing, and I don't care if I get lots of
> duplicated data in the replica table - just let me do it"...
>
> b) It also somehow feels like the new "force" was introduced mostly to
> make the code ERROR handling implementation simpler, rather than to
> make the life of the end-user better. Yes, if force is removed maybe
> the copy-clash-detection-code will need to be internally quite more
> complex than it is now, but that is how it should be, instead of
> putting extra burden on the user (e.g. by complicating the PG docs and
> giving them yet more alternatives to configure). I think any clashing
> copy_data options really is a user error, but also I think the current
> boolean copy_data true/false already gives the user a way to fix it.
>
> c) Actually, your new error hint messages are similar to my
> perspective: They already say "Use CREATE/ALTER SUBSCRIPTION with
> copy_data = off or force". All I am saying is remove the "force", and
> the user can still fix the problem just by using "copy_data = off"
> appropriately.

When a user is trying to add a node to a bidirectional replication
setup, the user will use the force option to copy the data from one of
the nodes and off to skip copying the data from the other nodes. This
option will be used while adding nodes to bidirectional replication;
the same has been documented with examples in the patch. I felt we
should retain this option.
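
To illustrate the workflow described above, a rough sketch (host names are
made up, and the option names are the ones used in this patch version, so
this is not valid on an unpatched server):

-- node3 joins an existing node1 <-> node2 bidirectional setup:
-- take the initial copy from node1 only, and skip it for node2.
CREATE SUBSCRIPTION sub_node3_node1
    CONNECTION 'host=node1 dbname=postgres'
    PUBLICATION pub_node1
    WITH (local_only = true, copy_data = force);

CREATE SUBSCRIPTION sub_node3_node2
    CONNECTION 'host=node2 dbname=postgres'
    PUBLICATION pub_node2
    WITH (local_only = true, copy_data = off);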

> ==
>
> So (from above) I am not much in favour of the copy_data becoming a
> hybrid enum and using "force", yet that is what most of this patch is
> implementing. Anyway, the remainder of my review comments here are for
> the code in its current form. Maybe if "force" can be removed most of
> the following comments end up being redundant.
>
> ==
>
> 2.3 Commit message - wording
>
> This message is difficult to understand.
>
> I think that the long sentence "Now when user is trying..." can be
> broken into more manageable parts.

Slightly modified

> This part "and throw an error so that user can handle the initial copy
> data." also seemed a bit vague.

I have added a reference to the documentation section of the patch that
covers initial data copy handling.

> ~~~
>
> 2.4 Commit message - more functions
>
> "This patch does couple of things:"
>
> IIUC, there seems a third thing implemented by this patch but not
> described by the comment. I think it also adds support for ALTER
> SUBSCRIPTION SET PUBLICATION WITH (subscribe_local_only)

This is part of the 0001 patch; I felt it should not be part of the 0002 patch's commit message.

> ~~~
>
> 2.5 doc/src/sgml/ref/create_subscription.sgml - wording
>
> @@ -161,6 +161,13 @@ CREATE SUBSCRIPTION  class="parameter">subscription_namethe replicated changes that was generated from other nodes. The
>default is false.
>   
> + 
> +  If the tables in the publication were also subscribing to the data 
> in
> +  the publisher from other publishers, it will affect the
> +  CREATE SUBSCRIPTION based on the value specified
> +  for copy_data option. Refer to the
> +   for details.
> + 
>
> Is there a simpler way to express all that?
>
> SUGGESTION
> There is some interaction between the option "subscribe_local_only" and
> option "copy_data". Refer to the  linkend="sql-createsubscription-no

Re: BufferAlloc: don't take two simultaneous locks

2022-04-08 Thread Kyotaro Horiguchi
At Thu, 07 Apr 2022 14:14:59 +0300, Yura Sokolov  
wrote in 
> On Thu, 07/04/2022 at 16:55 +0900, Kyotaro Horiguchi wrote:
> > Hi, Yura.
> > 
> > At Wed, 06 Apr 2022 16:17:28 +0300, Yura Sokolov wrote in 
> > > Ok, I got access to stronger server, did the benchmark, found weird
> > > things, and so here is new version :-)
> > 
> > Thanks for the new version and benchmarking.
> > 
> > > First, I found that if the table size is strictly limited to NBuffers and
> > > FIXED, then under high concurrency get_hash_entry may not find a free
> > > entry even though one must be there. It seems that while a process scans
> > > the free lists, other concurrent processes "move entries around", i.e. one
> > > concurrent process fetches an entry from one free list, another process
> > > puts a new entry on a different freelist, and the unfortunate process
> > > misses it since it tests the freelists only once.
> > 
> > StrategyGetBuffer believes that entries don't move across freelists
> > and it was true before this patch.
> 
> StrategyGetBuffer knows nothing about dynahash's freelist.
> It knows about buffer manager's freelist, which is not partitioned.

Yeah, right. I meant get_hash_entry.

> > > To fix these issues I did the following:
> > > 
> > > # Concurrency
> > > 
> > > First, I limit concurrency by introducing another lwlock tranche -
> > > BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
> > > 128).
> > > If a backend doesn't find a buffer in the buffer table and wants to
> > > introduce it, it first calls
> > > LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
> > > If the lock is acquired, it proceeds to the eviction and replacement
> > > process. Otherwise, it waits for the lock to be released and repeats
> > > the search.
> > >
> > > This greatly improves performance for > 400 clients in pgbench.
> > 
> > So the performance difference between the existing code and v11 is that
> > the latter has a collision cross section eight times smaller than the
> > former?
> 
> No. Acquiring EvictPartitionLock
> 1. doesn't block readers, since readers don't acquire EvictPartitionLock
> 2. doesn't form a "tree of lock dependencies", since EvictPartitionLock is
>   independent of PartitionLock.
> 
> Problem with existing code:
> 1. Process A locks P1 and P2
> 2. Process B (p3-old, p1-new) locks P3 and wants to lock P1
> 3. Process C (p4-new, p1-old) locks P4 and wants to lock P1
> 4. Process D (p5-new, p4-old) locks P5 and wants to lock P4
> At this moment partitions P1, P2, P3, P4 and P5 are all locked and waiting
> for Process A.
> And readers can't read from those same five partitions.
> 
> With new code:
> 1. Process A locks E1 (evict partition) and locks P2,
>then releases P2 and locks P1.
> 2. Process B tries to lock E1, waits, and retries the search.
> 3. Process C locks E4, locks P1, then releases P1 and locks P4
> 4. Process D locks E5, locks P4, then releases P4 and locks P5
> So, there is no network of locks.
> Process A doesn't block Process D at any moment:
> - either A blocks C, but C doesn't block D at this moment,
> - or A doesn't block C.
> And readers don't see five simultaneously held locks that all
> depend on a single Process A.

Thanks for the detailed explanation. I see that.

> > +* Prevent "thundering herd" problem and limit concurrency.
> > 
> > this is something like pressing the accelerator and brake pedals at the
> > same time.  If it improves performance, wouldn't just increasing the
> > number of buffer partitions work?
> 
> To be honest: of course a simple increase of NUM_BUFFER_PARTITIONS
> does improve the average case.
> But it is better to cure the problem than to anesthetize it.
> Increasing NUM_BUFFER_PARTITIONS reduces the probability and relative
> weight of the lock network, but doesn't eliminate it.

Agreed.

> > It's also not great that follower backends run a busy loop on the
> > lock until the top-runner backend inserts the new buffer into the
> > buftable and then releases the newPartitionLock.
> > 
> > > I tried another variant as well:
> > > - first, insert an entry with a dummy buffer index into the buffer table.
> > > - if such an entry was already there, wait for it to be filled.
> > > - otherwise, find a victim buffer and replace the dummy index with the new one.
> > > Waiting was done with a shared lock on EvictPartitionLock as well.
> > > This variant performed about the same.
> > 
> > This one looks better to me. Since a partition can be shared by two or
> > more new buffers, a condition variable seems to work better here...
> > 
> > > Logically I like that variant more, but there is one gotcha:
> > > FlushBuffer could fail with elog(ERROR). Therefore there is
> > > a need to reliably remove the entry with the dummy index.
> > 
> > Perhaps UnlockBuffers can do that.
> 
> Thanks for the suggestion. I'll try to investigate and retry this
> approach in the patch.
> 
> > > And after all, I still need to hold EvictPartitionLock to notice
> > > waiters.
> > > I've tried using a ConditionVariable, but its performance was much
> > > worse.
> > 
> > How many CVs did you use?
> 
> I've tried both NUM_PARTITION_LOCKS and 

Re: Windows now has fdatasync()

2022-04-08 Thread Dave Page
On Fri, 8 Apr 2022 at 05:41, Tom Lane  wrote:

> Michael Paquier  writes:
> > On Fri, Apr 08, 2022 at 02:56:15PM +1200, Thomas Munro wrote:
> >> I propose that we drop support for Windows versions older than
> >> 10/Server 2016 in the PostgreSQL 16 cycle,
>
> Do we have any data on what people are actually using?
>

None that I know of. Anecdotally, we dropped support for pgAdmin on Windows
< 8 (2012 for the server edition), and had a single complaint - and the
user happily acknowledged they were on an old release and expected support
to be dropped sooner or later. Windows 8 was a pretty unpopular release, so
I would expect shifting to 10/2016+ for PG 16 would be unlikely to be a
major problem.

FWIW, Python dropped support for < 8/2012 with v3.9.


>
> > Do you think that we could raise the minimum C standard on WIN32 to
> > C11, at least for MSVC?
>
> As long as the C11-isms are in MSVC-only code, it seems like this is
> exactly equivalent to setting a minimum MSVC version.  I don't see
> an objection-in-principle there, it's just a practical question of
> how far back is reasonable to support MSVC versions.  (That's very
> distinct from how far back we need the built code to run.)
>
> regards, tom lane
>
>
>

-- 
Dave Page
Blog: https://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: https://www.enterprisedb.com


Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-08 Thread Markus Wanner
On Fri, 2022-04-08 at 08:47 +0900, Michael Paquier wrote:
> On Thu, Apr 07, 2022 at 11:19:15AM -0400, Robert Haas wrote:
> > Here are patches for master and v14 to do things this way.
> > Comments?
> 
> Thanks for the patches.  They look correct.

+1, looks good to me and addresses my specific original concern.

> For ~14, I'd rather avoid
> the code duplication done by GetVirtualXIDsDelayingChkptEnd() and
> HaveVirtualXIDsDelayingChkpt() that could be avoided with an extra
> bool argument to the existing routine.  The same kind of duplication
> happens with GetVirtualXIDsDelayingChkpt() and
> GetVirtualXIDsDelayingChkptEnd().

I agree with Michael, it would be nice to not duplicate the code, but
use a common underlying method.  A modified patch is attached.

Best Regards

Markus

From: Robert Haas 
Date: Thu, 7 Apr 2022 11:15:07 -0400
Subject: [PATCH] Rethink the delay-checkpoint-end mechanism in the
 back-branches.

The back-patch of commit bbace5697df12398e87ffd9879171c39d27f5b33 had
the unfortunate effect of changing the layout of PGPROC in the
back-branches, which could break extensions. This happened because it
changed the delayChkpt from type bool to type int. So, change it back,
and add a new bool delayChkptEnd field instead. The new field should
fall within what used to be padding space within the struct, and so
hopefully won't cause any extensions to break.

Per report from Markus Wanner and discussion with Tom Lane and others.
---
 backend/access/transam/multixact.c  |   6 +-
 backend/access/transam/twophase.c   |  13 ++--
 backend/access/transam/xact.c   |   6 +-
 backend/access/transam/xlog.c   |  10 +--
 backend/access/transam/xloginsert.c |   2 
 backend/catalog/storage.c   |   6 +-
 backend/storage/buffer/bufmgr.c |   6 +-
 backend/storage/ipc/procarray.c | 100 +++-
 include/storage/proc.h  |  35 +---
 include/storage/procarray.h |   7 +-
 10 files changed, 107 insertions(+), 84 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 50d8bab9e21..b643564f16a 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	 * crash/basebackup, even though the state of the data directory would
 	 * require it.
 	 */
-	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
-	MyProc->delayChkpt |= DELAY_CHKPT_START;
+	Assert(!MyProc->delayChkpt);
+	MyProc->delayChkpt = true;
 
 	/* WAL log truncation */
 	WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
 	/* Then offsets */
 	PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
+	MyProc->delayChkpt = false;
 
 	END_CRIT_SECTION();
 	LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index dea3f485f7a..633a6f1747f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -474,8 +474,9 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
 	}
 	proc->xid = xid;
 	Assert(proc->xmin == InvalidTransactionId);
-	proc->delayChkpt = 0;
+	proc->delayChkpt = false;
 	proc->statusFlags = 0;
+	proc->delayChkptEnd = false;
 	proc->pid = 0;
 	proc->databaseId = databaseid;
 	proc->roleId = owner;
@@ -1165,8 +1166,7 @@ EndPrepare(GlobalTransaction gxact)
 
 	START_CRIT_SECTION();
 
-	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
-	MyProc->delayChkpt |= DELAY_CHKPT_START;
+	MyProc->delayChkpt = true;
 
 	XLogBeginInsert();
 	for (record = records.head; record != NULL; record = record->next)
@@ -1209,7 +1209,7 @@ EndPrepare(GlobalTransaction gxact)
 	 * checkpoint starting after this will certainly see the gxact as a
 	 * candidate for fsyncing.
 	 */
-	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
+	MyProc->delayChkpt = false;
 
 	/*
 	 * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2276,8 +2276,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	START_CRIT_SECTION();
 
 	/* See notes in RecordTransactionCommit */
-	Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
-	MyProc->delayChkpt |= DELAY_CHKPT_START;
+	MyProc->delayChkpt = true;
 
 	/*
 	 * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2325,7 +2324,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	TransactionIdCommitTree(xid, nchildren, children);
 
 	/* Checkpoint can proceed now */
-	MyProc->delayChkpt &= ~DELAY_CHKPT_START;
+	MyProc->delayChkpt = false;
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c5e7261921d..e8523693c0a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1335,9 +1335,9 

Re: Temporary file access API

2022-04-08 Thread Antonin Houska
Robert Haas  wrote:

> On Tue, Mar 8, 2022 at 6:12 AM Antonin Houska  wrote:
> > Thanks for your comments, the initial version is attached here.
> 
> I've been meaning to look at this thread for some time but have not
> found enough time to do that until just now. And now I have
> questions...
> 
> 1. Supposing we accepted this, how widely do you think that we could
> adopt it? I see that this patch set adapts a few places to use it and
> that's nice, but I have a feeling there's a lot more places that are
> making use of system calls directly, or through some other
> abstraction, than just this. I'm not sure that we actually want to use
> something like this everywhere, but what would be the rule for
> deciding where to use it and where not to use
> it? If the plan for this facility is just to adapt these two
> particular places to use it, that doesn't feel like enough to be
> worthwhile.

Admittedly I viewed the problem from the perspective of the TDE, so I haven't
spent much time looking for other opportunities. Now, with the stats collector
using shared memory, even one of the use cases implemented here no longer
exists. I need to do more research.

Do you think that the use of a system call is a problem in itself (e.g.
because the code looks less simple if read/write is used in some places and
fread/fwrite elsewhere; of course read/write is mandatory in special cases
like WAL, heap pages, etc.), or is the problem that the system calls are
used too frequently? I suppose only the latter.

Anyway, I'm not sure there are *many* places where system calls are used too
frequently. Instead, the coding tends to be such that the information is first
assembled in memory and then written to the file at once. So the value of the
(buffered) stream is that it makes the code simpler (it eliminates the need to
prepare the data in memory). That's what I tried to do for reorderbuffer.c and
pgstat.c in my patch.
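
For what it's worth, the contrast being described is roughly the following
(a generic libc illustration only, not the patch's BufFileStream API):

#include <stdio.h>
#include <string.h>

/* Style 1: assemble the record in memory, then write it out in one call. */
static void
write_assembled(FILE *f, const char *key, const char *value)
{
    char buf[256];

    snprintf(buf, sizeof(buf), "%s = %s\n", key, value);
    fwrite(buf, 1, strlen(buf), f);
}

/* Style 2: rely on the buffered stream and emit the pieces directly. */
static void
write_streamed(FILE *f, const char *key, const char *value)
{
    fputs(key, f);
    fputs(" = ", f);
    fputs(value, f);
    fputc('\n', f);
}

int
main(void)
{
    write_assembled(stdout, "assembled", "one write call");
    write_streamed(stdout, "streamed", "libc batches the syscalls");
    return 0;
}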

A related question is whether we should try to replace some uses of the libc
stream (FILE *) in some places. You seem to suggest that in [1]. One example
is snapmgr.c:ExportSnapshot(), if we also implement output formatting. Of
course there are places where (FILE *) cannot be replaced because, besides
regular files, the code needs to work with stdin/stdout in general. (Parsing
of configuration files falls into this category, but that doesn't matter
because the bison-generated parser seems to implement buffering anyway.)

> 2. What about frontend code? Most frontend code won't examine data
> files directly, but at least pg_controldata, pg_checksums, and
> pg_resetwal are exceptions.

If the frequency of using system calls is the problem, then I wouldn't change
these, because the ControlFileData structure needs to be initialized in memory
anyway and then written at once. And pg_checksums reads whole blocks
anyway. I'll take a closer look.

> 3. How would we extend this to support encryption? Add an extra
> argument to BufFileStreamInit(V)FD passing down the encryption
> details?

Yes.

> There are some smaller things about the patch with which I'm not 100%
> comfortable, but I'd like to start by understanding the big picture.

Thanks!

[1] 
https://www.postgresql.org/message-id/CA+TgmoYGjN_f=fcerx49bzjhng+gocty+a+xhnrwcvvdy8u...@mail.gmail.com

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: Add spin_delay() implementation for Arm in s_lock.h

2022-04-08 Thread Blake, Geoff
Thanks for all the help Tom!

On 4/6/22, 6:07 PM, "Tom Lane"  wrote:


"Blake, Geoff"  writes:
> Hi Tom, Andres,
> Any additional feedback for this patch?

I did some more research and testing:

* Using a Mac with the M1 Pro chip (marginally beefier than the M1
I was testing on before), I think I can see some benefit in the
test case I proposed upthread.  It's marginal though.

* On a Raspberry Pi 3B+, there's no outside-the-noise difference.

* ISB doesn't exist in pre-V7 ARM, so it seems prudent to restrict
the patch to ARM64.  I doubt any flavor of ARM32 would be able to
benefit anyway.  (Googling finds that MariaDB made this same
choice not long ago [1].)

So what we've got is that there seems to be benefit at high
core counts, and it at least doesn't hurt at lower ones.
That's good enough for me, so pushed.

regards, tom lane

[1] https://jira.mariadb.org/browse/MDEV-25807
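
For reference, the technique amounts to roughly the following (a sketch
only, not the committed s_lock.h hunk; the function name here is made up):

/*
 * On ARM64, execute an ISB while spinning; elsewhere fall back to a no-op.
 */
static inline void
my_spin_delay(void)
{
#if defined(__aarch64__)
    __asm__ __volatile__(" isb; \n");
#else
    /* no-op on other architectures */
#endif
}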



Re: [Proposal] vacuumdb --schema only

2022-04-08 Thread Gilles Darold

On 08/04/2022 at 02:46, Justin Pryzby wrote:

On Wed, Apr 06, 2022 at 07:43:42PM +0200, Gilles Darold wrote:

Thanks for the review, all these changes are available in new version v6
of the patch and attached here.

This is failing in CI (except on macos, which is strangely passing).
http://cfbot.cputube.org/gilles-darold.html

https://api.cirrus-ci.com/v1/artifact/task/5379693443547136/log/src/bin/scripts/tmp_check/log/regress_log_100_vacuumdb

not ok 59 - vacuumdb --schema "Foo" postgres exit code 0

#   Failed test 'vacuumdb --schema "Foo" postgres exit code 0'
#   at t/100_vacuumdb.pl line 151.
not ok 60 - vacuumdb --schema schema only: SQL found in server log

#   Failed test 'vacuumdb --schema schema only: SQL found in server log'
#   at t/100_vacuumdb.pl line 151.
#   '2022-04-06 18:15:36.313 UTC [34857][not initialized] 
[[unknown]][:0] LOG:  connection received: host=[local]
# 2022-04-06 18:15:36.314 UTC [34857][client backend] [[unknown]][3/2801:0] 
LOG:  connection authorized: user=postgres database=postgres 
application_name=100_vacuumdb.pl
# 2022-04-06 18:15:36.318 UTC [34857][client backend] 
[100_vacuumdb.pl][3/2802:0] LOG:  statement: SELECT 
pg_catalog.set_config('search_path', '', false);
# 2022-04-06 18:15:36.586 UTC [34857][client backend] [100_vacuumdb.pl][:0] 
LOG:  disconnection: session time: 0:00:00.273 user=postgres database=postgres 
host=[local]
# '
# doesn't match '(?^:VACUUM "Foo".bar)'


I'm surprised because make check does not report errors running on
Ubuntu 20.04 and CentOS 8:



t/010_clusterdb.pl  ok
t/011_clusterdb_all.pl  ok
t/020_createdb.pl . ok
t/040_createuser.pl ... ok
t/050_dropdb.pl ... ok
t/070_dropuser.pl . ok
t/080_pg_isready.pl ... ok
t/090_reindexdb.pl  ok
t/091_reindexdb_all.pl  ok
t/100_vacuumdb.pl . ok
t/101_vacuumdb_all.pl . ok
t/102_vacuumdb_stages.pl .. ok
t/200_connstr.pl .. ok
All tests successful.
Files=13, Tests=233, 17 wallclock secs ( 0.09 usr  0.02 sys + 6.63 cusr  
2.68 csys =  9.42 CPU)

Result: PASS


In tmp_check/log/regress_log_100_vacuumdb:

# Running: vacuumdb --schema "Foo" postgres
vacuumdb: vacuuming database "postgres"
ok 59 - vacuumdb --schema "Foo" postgres exit code 0
ok 60 - vacuumdb --schema schema only: SQL found in server log

In PG log:

2022-04-08 11:01:44.519 CEST [17223] 100_vacuumdb.pl LOG: statement: 
RESET search_path;
2022-04-08 11:01:44.519 CEST [17223] 100_vacuumdb.pl LOG: statement: 
WITH listed_objects (object_oid, column_list) AS (
  VALUES ('"Foo"'::pg_catalog.regnamespace::pg_catalog.oid, 
NULL::pg_catalog.text)

    )
    SELECT c.relname, ns.nspname, listed_objects.column_list FROM 
pg_catalog.pg_class c
 JOIN pg_catalog.pg_namespace ns ON c.relnamespace 
OPERATOR(pg_catalog.=) ns.oid
 LEFT JOIN pg_catalog.pg_class t ON c.reltoastrelid 
OPERATOR(pg_catalog.=) t.oid
 JOIN listed_objects ON listed_objects.object_oid 
OPERATOR(pg_catalog.=) ns.oid

 WHERE c.relkind OPERATOR(pg_catalog.=) ANY (array['r', 'm'])
 ORDER BY c.relpages DESC;
2022-04-08 11:01:44.521 CEST [17223] 100_vacuumdb.pl LOG: statement: 
SELECT pg_catalog.set_config('search_path', '', false);
2022-04-08 11:01:44.521 CEST [17223] 100_vacuumdb.pl LOG: statement: 
VACUUM "Foo".bar;


And if I run the command manually:

$ /usr/local/pgsql/bin/vacuumdb -e -h localhost --schema '"Foo"' -d 
contrib_regress -U postgres

SELECT pg_catalog.set_config('search_path', '', false);
vacuumdb: vacuuming database "contrib_regress"
RESET search_path;
WITH listed_objects (object_oid, column_list) AS (
  VALUES ('"Foo"'::pg_catalog.regnamespace::pg_catalog.oid, 
NULL::pg_catalog.text)

)
SELECT c.relname, ns.nspname, listed_objects.column_list FROM 
pg_catalog.pg_class c
 JOIN pg_catalog.pg_namespace ns ON c.relnamespace 
OPERATOR(pg_catalog.=) ns.oid
 LEFT JOIN pg_catalog.pg_class t ON c.reltoastrelid 
OPERATOR(pg_catalog.=) t.oid
 JOIN listed_objects ON listed_objects.object_oid 
OPERATOR(pg_catalog.=) ns.oid

 WHERE c.relkind OPERATOR(pg_catalog.=) ANY (array['r', 'm'])
 ORDER BY c.relpages DESC;
SELECT pg_catalog.set_config('search_path', '', false);

VACUUM "Foo".bar;


$ echo $?
0

I don't know what happened on cfbot; investigating...


--
Gilles Darold





Re: Support logical replication of DDLs

2022-04-08 Thread Amit Kapila
On Tue, Mar 29, 2022 at 9:47 AM Dilip Kumar  wrote:
> >
> > The idea is to force skipping any direct data population (which can
> > potentially cause data inconsistency on the subscriber)
> > in CREATE AS and SELECT INTO command on the subscriber by forcing the
> > skipData flag in the intoClause of the parsetree after
> > the logical replication worker parses the command. The data sync will
> > be taken care of by the DML replication after the DDL replication
> > finishes.
>
> Okay, something like that should work, I am not sure it is the best
> design though.
>

Even if this works, how will we make an ALTER TABLE statement work where
it needs to rewrite the table? There, too, I think we can face a
similar problem if we directly send the statement: the table will be
updated once due to the DDL statement and then again due to the table
rewrite, as that will have separate WAL.
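
For reference, the skipData idea quoted above amounts to roughly this on
the apply side (an illustrative fragment, not code from the patch; the
surrounding worker code and error handling are omitted):

if (IsA(parsetree, CreateTableAsStmt))
{
    CreateTableAsStmt *ctas = (CreateTableAsStmt *) parsetree;

    /* Create the relation structure only; let the subsequent DML
     * replication populate it, to avoid inconsistent data. */
    ctas->into->skipData = true;
}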

Another somewhat unrelated problem I see with this work is how to avoid
recursion of the same command between nodes (when the involved nodes
replicate DDLs). For DMLs, we can avoid that via replication origins,
as is being done in the proposed patch [1], but I am not sure how we will
deal with that here.

[1] - https://commitfest.postgresql.org/38/3610/

-- 
With Regards,
Amit Kapila.




Re: pg14 psql broke \d datname.nspname.relname

2022-04-08 Thread Robert Haas
On Thu, Apr 7, 2022 at 11:40 PM Greg Stark  wrote:
> That doesn't seem to be entirely inconsistent with what Tom describes.
> Instead of "throw an error" the function would return an error and
> possibly some extra info which the caller would use to handle the
> error appropriately.

I don't personally see how we're going to come out ahead with that
approach, but if you or Tom or someone else wants to put something
together, that's fine with me. I'm not stuck on this approach, I just
don't see how we come out ahead with the type of thing you're talking
about. I mean, we could return the error text, but it only goes to a
handful of places, so it just doesn't really seem like a win over what
the patch is already doing.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Support logical replication of DDLs

2022-04-08 Thread Amit Kapila
On Thu, Mar 17, 2022 at 3:36 AM Alvaro Herrera  wrote:
>
> Did you see some old code I wrote towards this goal?
> https://www.postgresql.org/message-id/20150215044814.gl3...@alvh.no-ip.org
> The intention was that DDL would produce some JSON blob that accurately
> describes the DDL that was run;
>

I have read that thread and found one of your emails [1] where you
seem to be saying that a JSON representation is not required for BDR.
Will going via the JSON-blob route make this project easier/better in
some way?

> the caller can acquire that and use it
> to produce working DDL that doesn't depend on runtime conditions.
>

For runtime conditions, one of the things you mentioned in that
thread is to add the schema name in the statement at the required places,
which this patch deals with in a different way by explicitly sending
it along with the DDL statement. The other cases where we might need
deparsing are ALTER TABLE type cases (where we need to rewrite the
table), where we may want to send a different DDL. I haven't analyzed
this, but I think it is better to have a list of all the places where we
need deparsing and the best way to deal with each. The simpler cases seem
to be working with the approach proposed by this patch, but I am not sure
if it will work for all kinds of cases.

[1] - 
https://www.postgresql.org/message-id/20150504185721.GB2523%40alvh.no-ip.org

-- 
With Regards,
Amit Kapila.




Re: generic plans and "initial" pruning

2022-04-08 Thread David Rowley
On Fri, 8 Apr 2022 at 17:49, Amit Langote  wrote:
> Attached updated patch with these changes.

Thanks for making the changes.  I started looking over this patch but
really feel like it needs quite a few more iterations of what we've
just been doing to get it into proper committable shape. There seems
to be only about 40 mins to go before the freeze, so it seems very
unrealistic that it could be made to work.

I started trying to take a serious look at it this evening, but I feel
like I just failed to get into it deep enough to make any meaningful
improvements.  I'd need more time to study the problem before I could
build up a proper opinion on how exactly I think it should work.

Anyway, I've attached a small patch that's just a few things I
adjusted, plus some questions, while reading over your v13 patch.  Some of
these are just me questioning your code (see the XXX comments) and some I
think are improvements. Feel free to take the hunks that you see fit
and drop anything you don't.

David
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 05cc99df8f..5ee978937d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -121,6 +121,8 @@ static void EvalPlanQualStart(EPQState *epqstate, Plan 
*planTree);
  *
  * Note: Partitioned tables mentioned in PartitionedRelPruneInfo nodes that
  * drive the pruning will be locked before doing the pruning.
+ *
+ * 
  */
 PartitionPruneResult *
 ExecutorDoInitialPruning(PlannedStmt *plannedstmt, ParamListInfo params)
diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 3037742b8d..e9ca6bc55f 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1707,6 +1707,7 @@ ExecInitPartitionPruning(PlanState *planstate,
Assert(n_total_subplans > 0);
*initially_valid_subplans = bms_add_range(NULL, 0,

  n_total_subplans - 1);
+   return prunestate;
}
 
/*
@@ -1714,14 +1715,15 @@ ExecInitPartitionPruning(PlanState *planstate,
 * that were removed above due to initial pruning.  No need to do this 
if
 * no steps were removed.
 */
-   if (bms_num_members(*initially_valid_subplans) < n_total_subplans)
+   if (prunestate &&
+   bms_num_members(*initially_valid_subplans) < n_total_subplans)
{
/*
 * We can safely skip this when !do_exec_prune, even though that
 * leaves invalid data in prunestate, because that data won't be
 * consulted again (cf initial Assert in 
ExecFindMatchingSubPlans).
 */
-   if (prunestate && prunestate->do_exec_prune)
+   if (prunestate->do_exec_prune)
PartitionPruneFixSubPlanMap(prunestate,

*initially_valid_subplans,

n_total_subplans);
@@ -1751,7 +1753,8 @@ ExecPartitionDoInitialPruning(PlannedStmt *plannedstmt, 
ParamListInfo params,
Bitmapset*valid_subplan_offs;
 
/*
-* A temporary context to allocate stuff needded to run the pruning 
steps.
+* A temporary context to for memory allocations required while 
execution
+* partition pruning steps.
 */
tmpcontext = AllocSetContextCreate(CurrentMemoryContext,
   
"initial pruning working data",
@@ -1765,11 +1768,12 @@ ExecPartitionDoInitialPruning(PlannedStmt *plannedstmt, 
ParamListInfo params,
pdir = CreatePartitionDirectory(CurrentMemoryContext, false);
 
/*
-* We don't yet have a PlanState for the parent plan node, so must 
create
-* a standalone ExprContext to evaluate pruning expressions, equipped 
with
-* the information about the EXTERN parameters that the caller passed 
us.
-* Note that that's okay because the initial pruning steps do not 
contain
-* anything that requires the execution to have started.
+* We don't yet have a PlanState for the parent plan node, so we must
+* create a standalone ExprContext to evaluate pruning expressions,
+* equipped with the information about the EXTERN parameters that the
+* caller passed us.  Note that that's okay because the initial pruning
+* steps do not contain anything that requires the execution to have
+* started.
 */
econtext = CreateStandaloneExprContext();
econtext->ecxt_param_list_info = params;
@@ -1874,7 +1878,6 @@ CreatePartitionPruneState(PlanState *planstate,
PartitionedRelPruneInfo *pinfo

Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Magnus Hagander
On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander  wrote:

> On Tue, Mar 29, 2022 at 10:06 PM David Rowley 
> wrote:
>
>> On Wed, 30 Mar 2022 at 02:38, Robert Haas  wrote:
>> > I think WARNING is fine. After all, the parameter is called
>> > "jit_warn_above_fraction".
>>
>> I had a think about this patch.  I guess it's a little similar to
>> checkpoint_warning. The good thing about the checkpoint_warning is
>> that in the LOG message we give instructions about how the DBA can fix
>> the issue, i.e increase max_wal_size.
>>
>> With the proposed patch I see there is no hint about what might be
>> done to remove/reduce the warnings.  I imagine that's because it's not
>> all that clear which GUC should be changed. In my view, likely
>> jit_above_cost is the most relevant but there is also
>> jit_inline_above_cost, jit_optimize_above_cost, jit_tuple_deforming
>> and jit_expressions which are relevant too.
>>
>> If we go with this patch,  the problem I see here is that the amount
>> of work the JIT compiler must do for a given query depends mostly on
>> the number of expressions that must be compiled in the query (also to
>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
>> jit_tuple_deforming and jit_expressions). The DBA does not really have
>> much control over the number of expressions in the query.  All he or
>> she can do to get rid of the warning is something like increase
>> jit_above_cost.  After a few iterations of that, the end result is
>> that jit_above_cost is now high enough that JIT no longer triggers
>> for, say, that query to that table with 1000 partitions where no
>> plan-time pruning takes place.  Is that really a good thing? It likely
>> means that we just rarely JIT anything at all!
>>
>
> I don't agree with the conclusion of that.
>
> What the parameter would be useful for is to be able to tune those costs
> (or just turn it off) *for that individual query*. That doesn't mean you
> "rarely JIT anything atll", it just means you rarely JIT that particular
> query.
>
> In fact, my goal is to specifically make people do that and *not* just
> turn off JIT globally.
>
>
> I'd much rather see us address the costing problem before adding some
>> warning, especially a warning where it's not clear how to make it go
>> away.
>>
>
> The easiest way would be to add a HINT that says turn off jit for this
> particular query or something?
>
> I do agree that if we can make  "spending too much time on JIT vs query
> runtime" go away completely, then there is no need for a parameter like
> this.
>
> I still think the warning is useful. And I think it may stay useful even
> after we have made the JIT costing smarter -- though that's not certain of
> course.
>
>
This patch is still sitting at "ready for committer".

As an example, I have added such a hint in the attached.

I still stand by the view that this patch is better than nothing. Sure, I would
love for us to adapt the JIT costing model and algorithm to make this not a
problem. And once we've done that, we should remove the parameter again.

It's not on by default, and it's trivial to remove in the future.
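
For context, usage would presumably look something like this (a sketch; the
fraction and the role name are made up, and only the jit_warn_above_fraction
name comes from this patch):

-- warn whenever JIT consumes more than a quarter of total query runtime
SET jit_warn_above_fraction = 0.25;

-- once a culprit workload is identified, tune JIT for it specifically
-- rather than disabling it globally, e.g.:
ALTER ROLE reporting_user SET jit_above_cost = 500000;
-- or, for a single query within a transaction:
SET LOCAL jit = off;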


Yes, we're right up at the deadline. I'd still like to get it in, so I'd
really appreciate some further voices :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6e3e27bed7..fa35fe04e8 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6718,6 +6718,21 @@ local0.*/var/log/postgresql

  
 
+ 
+  jit_warn_above_fraction (floating point)
+  
+   jit_warn_above_fraction configuration parameter
+  
+  
+   
+
+ Causes a warning to be written to the log if the time spent on JIT (see )
+ goes above this fraction of the total query runtime.
+ A value of 0 (the default) disables the warning.
+
+   
+ 
+
  
   log_startup_progress_interval (integer)
   
diff --git a/src/backend/tcop/fastpath.c b/src/backend/tcop/fastpath.c
index d429aa4663..a1bff893a3 100644
--- a/src/backend/tcop/fastpath.c
+++ b/src/backend/tcop/fastpath.c
@@ -196,6 +196,7 @@ HandleFunctionRequest(StringInfo msgBuf)
 	struct fp_info *fip;
 	bool		callit;
 	bool		was_logged = false;
+	int			msecs;
 	char		msec_str[32];
 
 	/*
@@ -305,7 +306,7 @@ HandleFunctionRequest(StringInfo msgBuf)
 	/*
 	 * Emit duration logging if appropriate.
 	 */
-	switch (check_log_duration(msec_str, was_logged))
+	switch (check_log_duration(&msecs, msec_str, was_logged))
 	{
 		case 1:
 			ereport(LOG,
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 95dc2e2c83..d1dd2273ad 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -998,7 +998,9 @@ exec_simple_query(const char *query_string)
 	bool		save_log_statement_stats = log_statement_stats;
 	bool		was_logged = f

Re: Parallel Full Hash Join

2022-04-08 Thread Thomas Munro
On Wed, Jan 12, 2022 at 10:30 AM Melanie Plageman
 wrote:
> On Fri, Nov 26, 2021 at 3:11 PM Thomas Munro  wrote:
> > #3 0x009cf57e in ExceptionalCondition (conditionName=0x29cae8
> > "BarrierParticipants(&accessor->shared->batch_barrier) == 1",
> > errorType=, fileName=0x2ae561 "nodeHash.c",
> > lineNumber=lineNumber@entry=2224) at assert.c:69
> > No locals.
> > #4 0x0071575e in ExecParallelScanHashTableForUnmatched
> > (hjstate=hjstate@entry=0x80a60a3c8,
> > econtext=econtext@entry=0x80a60ae98) at nodeHash.c:2224
>
> I believe this assert can be safely removed.

Agreed.

I was looking at this with a view to committing it, but I need more
time.  This will be at the front of my queue when the tree reopens.
I'm trying to find the tooling I had somewhere that could let you test
attaching and detaching at every phase.

The attached version is just pgindent'd.
From e7453cae9b2a686d57f967fd41533546d463dd0c Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 2 Oct 2020 15:53:44 +1300
Subject: [PATCH v11 1/3] Fix race condition in parallel hash join batch
 cleanup.

With unlucky timing and parallel_leader_participation off, PHJ could
attempt to access per-batch state just as it was being freed.  There was
code intended to prevent that by checking for a cleared pointer, but it
was racy.  Fix, by introducing an extra barrier phase.  The new phase
PHJ_BUILD_RUNNING means that it's safe to access the per-batch state to
find a batch to help with, and PHJ_BUILD_DONE means that it is too late.
The last to detach will free the array of per-batch state as before, but
now it will also atomically advance the phase at the same time, so that
late attachers can avoid the hazard.  This mirrors the way per-batch
hash tables are freed (see phases PHJ_BATCH_PROBING and PHJ_BATCH_DONE).

The build barrier must make it to PHJ_BUILD_DONE before shared resources
can be safely destroyed. This works fine in most cases with the addition
of another synchronization point. However, when the inner side is empty,
the build barrier will only make it to PHJ_BUILD_HASHING_INNER before
workers attempt to detach from the hashtable. In the case of the empty
inner optimization, advance the build barrier to PHJ_BUILD_RUNNING
before proceeding to cleanup. See batch 0 batch barrier fast forward in
ExecParallelHashJoinSetUpBatches() for precedent.

Revealed by a build farm failure, where BarrierAttach() failed a sanity
check assertion, because the memory had been clobbered by dsa_free().

This should eventually be back-patched to all supported releases, but
the failure is rare and the previous attempt at this was reverted, so
let's do this in master only for now, ahead of some other changes that
will move things around a bit.

Author: Thomas Munro 
Author: Melanie Plageman 
Reported-by: Michael Paquier 
Discussion: https://postgr.es/m/20200929061142.GA29096%40paquier.xyz
---
 src/backend/executor/nodeHash.c | 49 +-
 src/backend/executor/nodeHashjoin.c | 54 -
 src/include/executor/hashjoin.h |  3 +-
 3 files changed, 73 insertions(+), 33 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 3510a4247c..d7d1d77ed1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -333,14 +333,21 @@ MultiExecParallelHash(HashState *node)
 	hashtable->nbuckets = pstate->nbuckets;
 	hashtable->log2_nbuckets = my_log2(hashtable->nbuckets);
 	hashtable->totalTuples = pstate->total_tuples;
-	ExecParallelHashEnsureBatchAccessors(hashtable);
+
+	/*
+	 * Unless we're completely done and the batch state has been freed, make
+	 * sure we have accessors.
+	 */
+	if (BarrierPhase(build_barrier) < PHJ_BUILD_DONE)
+		ExecParallelHashEnsureBatchAccessors(hashtable);
 
 	/*
 	 * The next synchronization point is in ExecHashJoin's HJ_BUILD_HASHTABLE
-	 * case, which will bring the build phase to PHJ_BUILD_DONE (if it isn't
-	 * there already).
+	 * case, which will bring the build phase to PHJ_BUILD_RUNNING (if it
+	 * isn't there already).
 	 */
 	Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
+		   BarrierPhase(build_barrier) == PHJ_BUILD_RUNNING ||
 		   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
 }
 
@@ -624,7 +631,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 		/*
 		 * The next Parallel Hash synchronization point is in
 		 * MultiExecParallelHash(), which will progress it all the way to
-		 * PHJ_BUILD_DONE.  The caller must not return control from this
+		 * PHJ_BUILD_RUNNING.  The caller must not return control from this
 		 * executor node between now and then.
 		 */
 	}
@@ -3065,14 +3072,11 @@ ExecParallelHashEnsureBatchAccessors(HashJoinTable hashtable)
 	}
 
 	/*
-	 * It's possible for a backend to start up very late so that the whole
-	 * join is finished and the shm state for tracking batches has already
-	 * been freed by ExecHashTableDetach().  In that 

Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Stephen Frost
Greetings,

On Fri, Apr 8, 2022 at 07:27 Magnus Hagander  wrote:

> On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander 
> wrote:
>
>> On Tue, Mar 29, 2022 at 10:06 PM David Rowley 
>> wrote:
>>
>>> On Wed, 30 Mar 2022 at 02:38, Robert Haas  wrote:
>>> > I think WARNING is fine. After all, the parameter is called
>>> > "jit_warn_above_fraction".
>>>
>>> I had a think about this patch.  I guess it's a little similar to
>>> checkpoint_warning. The good thing about the checkpoint_warning is
>>> that in the LOG message we give instructions about how the DBA can fix
>>> the issue, i.e increase max_wal_size.
>>>
>>> With the proposed patch I see there is no hint about what might be
>>> done to remove/reduce the warnings.  I imagine that's because it's not
>>> all that clear which GUC should be changed. In my view, likely
>>> jit_above_cost is the most relevant but there is also
>>> jit_inline_above_cost, jit_optimize_above_cost, jit_tuple_deforming
>>> and jit_expressions which are relevant too.
>>>
>>> If we go with this patch,  the problem I see here is that the amount
>>> of work the JIT compiler must do for a given query depends mostly on
>>> the number of expressions that must be compiled in the query (also to
>>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
>>> jit_tuple_deforming and jit_expressions). The DBA does not really have
>>> much control over the number of expressions in the query.  All he or
>>> she can do to get rid of the warning is something like increase
>>> jit_above_cost.  After a few iterations of that, the end result is
>>> that jit_above_cost is now high enough that JIT no longer triggers
>>> for, say, that query to that table with 1000 partitions where no
>>> plan-time pruning takes place.  Is that really a good thing? It likely
>>> means that we just rarely JIT anything at all!
>>>
>>
>> I don't agree with the conclusion of that.
>>
>> What the parameter would be useful for is to be able to tune those costs
>> (or just turn it off) *for that individual query*. That doesn't mean you
>> "rarely JIT anything atll", it just means you rarely JIT that particular
>> query.
>>
>> In fact, my goal is to specifically make people do that and *not* just
>> turn off JIT globally.
>>
>>
>> I'd much rather see us address the costing problem before adding some
>>> warning, especially a warning where it's not clear how to make it go
>>> away.
>>>
>>
>> The easiest way would be to add a HINT that says turn off jit for this
>> particular query or something?
>>
>> I do agree that if we can make  "spending too much time on JIT vs query
>> runtime" go away completely, then there is no need for a parameter like
>> this.
>>
>> I still think the warning is useful. And I think it may stay useful even
>> after we have made the JIT costing smarter -- though that's not certain of
>> course.
>>
>>
> This patch is still sitting at "ready for committer".
>
> As an example, I have added such a hint in the attached.
>
> I still stand  by that this patch is better than nothing. Sure, I would
> love for us to adapt the JIT costing model and algorithm to make this not a
> problem. And once we've done that, we should remove the parameter again.
>
> It's not on by default, and it's trivial to remove in the future.
>
>
> Yes, we're right up at the deadline. I'd still like to get it in, so I'd
> really appreciate some further voices :)
>

Looks reasonable to me, so +1. The default has it off, so I seriously
doubt it’ll cause any issues, and it’ll be very handy on large and busy
systems with lots of queries for finding those that have a serious amount of
time being spent in JIT (and hopefully it will help folks avoid just turning
jit off across the board, since that’s worse - we need more data on jit and
need to work on improving it, not everyone ending up turning it off).

Thanks!

Stephen

>


Re: Last day of commitfest

2022-04-08 Thread Magnus Hagander
On Thu, Apr 7, 2022 at 3:59 AM Julien Rouhaud  wrote:

>
> > * JIT counters in pg_stat_statements (Magnus Hagander)
> >  Feedback from Dmitry Dolgov and Julien Rouhaud
>
> Note that the code looks good and no one disagreed with the proposed
> fields.
>
> The only remaining problem is a copy/pasto in the docs so nothing
> critical.  I personally think that it would be very good to have so maybe
> Magnus will push it today (which would probably instantly break the other
> pg_stat_statements patches that are now Ready for Committer), and if not I
> think it should go to the next commitfest instead.
>

Dang, I missed that one while looking at the other jit patch.

It did already conflict with the patch that Michael applied, but I'll try
to clean that up quickly and apply it with this fix.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Support logical replication of DDLs

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-08, Amit Kapila wrote:

> On Thu, Mar 17, 2022 at 3:36 AM Alvaro Herrera  
> wrote:
> >
> > Did you see some old code I wrote towards this goal?
> > https://www.postgresql.org/message-id/20150215044814.gl3...@alvh.no-ip.org
> > The intention was that DDL would produce some JSON blob that accurately
> > describes the DDL that was run;
> 
> I have read that thread and found one of your emails [1] where you
> seem to be saying that JSON representation is not required for BDR.
> Will in some way going via JSON blob way make this project
> easier/better?

I don't know if replication support will be easier by using JSON; I just
think that JSON makes the overall feature more easily usable for other
purposes.

I am not familiar with BDR replication nowadays.

> For runtime conditions, one of the things you have mentioned in that
> thread is to add schema name in the statement at the required places
> which this patch deals with in a different way by explicitly sending
> it along with the DDL statement.

Hmm, ok.  The point of the JSON-blob route is that the publisher sends a
command representation that can be parsed/processed/transformed
arbitrarily by the subscriber using generic rules; it should be trivial
to use a JSON tool to change schema A to schema B in any arbitrary DDL
command, and produce another working DDL command without having to know
how to write that command specifically.  So if I have a rule that
"schema A there is schema B here", all DDL commands can be replayed with
no further coding (without having to rely on getting the run-time
search_path correct.)

-- 
Álvaro Herrera   48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El sudor es la mejor cura para un pensamiento enfermo" (Bardia)




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Matthias van de Meent
On Fri, 8 Apr 2022 at 01:01, Andres Freund  wrote:
>
> Hi,
>
> On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> > We should definitely increase MaxHeapTuplesPerPage before too long,
> > for a variety of reasons that I have talked about in the past. Its
> > current value is 291 on all mainstream platforms, a value that's
> > derived from accidental historic details -- which predate HOT.
>
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using
> several KB of stack at the current value already (*), but if it
> grows
> further...

Yeah, I think we should definitely support more line pointers on a
heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
the current value is the physical limit for heap tuples, as we have at
most 1 heap tuple per line pointer, and thus MaxHeapTuplesPerPage itself
won't change. A macro MaxHeapLinePointersPerPage would probably be
more useful, which could be as follows (assuming we don't want to
allow filling a page with effectively only dead line pointers):

#define MaxHeapLinePointersPerPage \
   ((int) (((BLCKSZ - SizeOfPageHeaderData) / \
 (MAXALIGN(SizeofHeapTupleHeader) + 2 * sizeof(ItemIdData))) * 2))

This accounts for the worst case of one redirect + one min-sized live
heap tuple, and fills the page with it. Although it is impossible to
actually put a page in such a state, that would be the worst case of
live line pointers on a page.
For the default BLCKSZ of 8kB, that results in 510 line pointers
used-but-not-dead, an increase of ~ 70% over what's currently
available.
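
As a quick sanity check of those numbers (a back-of-the-envelope sketch
only, assuming the usual 8kB BLCKSZ and a 64-bit build, i.e.
SizeOfPageHeaderData = 24, MAXALIGN(SizeofHeapTupleHeader) = 24 and
sizeof(ItemIdData) = 4):

/*
 * current:  (8192 - 24) / (24 + 4)          -> 291 (MaxHeapTuplesPerPage)
 * proposed: (8192 - 24) / (24 + 2 * 4) * 2  -> 255 * 2 = 510 line pointers
 * (integer division throughout)
 */

which matches the 291 and 510 figures above.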

-Matthias




Re: Extract epoch from Interval weird behavior

2022-04-08 Thread Peter Eisentraut

On 24.02.22 03:35, Joseph Koshakow wrote:

However when executing EXTRACT we first truncate
DAYS_PER_YEAR to an integer, and then multiply it
by the total years in the Interval
/* this always fits into int64 */
secs_from_day_month = ((int64) DAYS_PER_YEAR * (interval->month / MONTHS_PER_YEAR) +
                       (int64) DAYS_PER_MONTH * (interval->month % MONTHS_PER_YEAR) +
                       interval->day) * SECS_PER_DAY;

Is this truncation on purpose? It seems like
EXTRACT is not accounting for leap years in
its calculation.


This was not intentional.  The cast is only to make the multiplication 
happen in int64; it didn't mean to drop any fractional parts.



Oops, I sent that to the wrong email. If this isn't intended, I've created a patch
that fixes it, with the following two open questions:
  * DAYS_PER_YEAR_NUM is recalculated every time. Is there any way
to convert a float directly to a numeric to avoid this?


We really wanted to avoid doing calculations in numeric as much as 
possible.  So we should figure out a different way to write this.  The 
attached patch works for me.  It's a bit ugly since it hardcodes some 
factors.  Maybe we can rephrase it a bit more elegantly.

From 0b7222beb5260c710d79c9a0573c3b39a64acf1b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 8 Apr 2022 13:16:25 +0200
Subject: [PATCH v1] Fix extract epoch from interval calculation

The new numeric code for extract epoch from interval accidentally
truncated the DAYS_PER_YEAR value to an integer, leading to results
that mismatched the floating-point interval_part calculations.

The commit a2da77cdb4661826482ebf2ddba1f953bc74afe4 that introduced
this actually contains the regression test change that this reverts.
I suppose this was missed at the time.

Reported-by: Joseph Koshakow 
Discussion: 
https://www.postgresql.org/message-id/flat/CAAvxfHd5n%3D13NYA2q_tUq%3D3%3DSuWU-CufmTf-Ozj%3DfrEgt7pXwQ%40mail.gmail.com
---
 src/backend/utils/adt/timestamp.c  | 6 +++---
 src/test/regress/expected/interval.out | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/adt/timestamp.c 
b/src/backend/utils/adt/timestamp.c
index 1c0bf0aa5c..2677d27632 100644
--- a/src/backend/utils/adt/timestamp.c
+++ b/src/backend/utils/adt/timestamp.c
@@ -5288,9 +5288,9 @@ interval_part_common(PG_FUNCTION_ARGS, bool retnumeric)
int64   val;
 
/* this always fits into int64 */
-   secs_from_day_month = ((int64) DAYS_PER_YEAR * 
(interval->month / MONTHS_PER_YEAR) +
-  (int64) 
DAYS_PER_MONTH * (interval->month % MONTHS_PER_YEAR) +
-  
interval->day) * SECS_PER_DAY;
+   secs_from_day_month = ((int64) (4 * DAYS_PER_YEAR) * 
(interval->month / MONTHS_PER_YEAR) +
+  (int64) (4 * 
DAYS_PER_MONTH) * (interval->month % MONTHS_PER_YEAR) +
+  (int64) 4 * 
interval->day) * SECS_PER_DAY/4;
 
/*---
 * result = secs_from_day_month + interval->time / 
1'000'000
diff --git a/src/test/regress/expected/interval.out 
b/src/test/regress/expected/interval.out
index e4b1246f45..8e2d535543 100644
--- a/src/test/regress/expected/interval.out
+++ b/src/test/regress/expected/interval.out
@@ -953,11 +953,11 @@ SELECT f1,
  @ 1 min   |   0 |   0.000 |   0.00 |  
1 |0 |   0 | 0 |   1 |0 |  0 |   0 |  0 |   
  60.00
  @ 5 hours |   0 |   0.000 |   0.00 |  
0 |5 |   0 | 0 |   1 |0 |  0 |   0 |  0 |  
18000.00
  @ 10 days |   0 |   0.000 |   0.00 |  
0 |0 |  10 | 0 |   1 |0 |  0 |   0 |  0 | 
864000.00
- @ 34 years|   0 |   0.000 |   0.00 |  
0 |0 |   0 | 0 |   1 |   34 |  3 |   0 |  0 | 
1072224000.00
+ @ 34 years|   0 |   0.000 |   0.00 |  
0 |0 |   0 | 0 |   1 |   34 |  3 |   0 |  0 | 
1072958400.00
  @ 3 mons  |   0 |   0.000 |   0.00 |  
0 |0 |   0 | 3 |   2 |0 |  0 |   0 |  0 |
7776000.00
  @ 14 secs ago |   -1400 |  -14000.000 | -14.00 |  
0 |0 |   0 | 0 |   1 |0 |  0 |   0 |  0 |   
 -14.00
  @ 1 day 2 hours 3 mins 4 secs | 400 |4000.000 |   4.00 |  
3 |2 |   1 | 0 |   1 |0 |  0 |   0 |  0 |  
93784.00
- @ 6 years | 

Re: generic plans and "initial" pruning

2022-04-08 Thread Amit Langote
Hi David,

On Fri, Apr 8, 2022 at 8:16 PM David Rowley  wrote:
> On Fri, 8 Apr 2022 at 17:49, Amit Langote  wrote:
> > Attached updated patch with these changes.
> Thanks for making the changes.  I started looking over this patch but
> really feel like it needs quite a few more iterations of what we've
> just been doing to get it into proper committable shape. There seems
> to be only about 40 mins to go before the freeze, so it seems very
> unrealistic that it could be made to work.

Yeah, totally understandable.

> I started trying to take a serious look at it this evening, but I feel
> like I just failed to get into it deep enough to make any meaningful
> improvements.  I'd need more time to study the problem before I could
> build up a proper opinion on how exactly I think it should work.
>
> Anyway. I've attached a small patch that's just a few things I
> adjusted or questions while reading over your v13 patch.  Some of
> these are just me questioning your code (See XXX comments) and some I
> think are improvements. Feel free to take the hunks that you see fit
> and drop anything you don't.

Thanks a lot for compiling those.

Most of the changes looked fine to me except for a couple of typos, so I've
adopted them into the attached new version, even though I know it's
too late to try to apply it.  Re the XXX comments:

+ /* XXX why would pprune->rti_map[i] ever be zero here??? */

Yeah, no, there can't be; I was perhaps being overly paranoid.

+ * XXX is it worth doing a bms_copy() on glob->minLockRelids if
+ * glob->containsInitialPruning is true?. I'm slighly worried that the
+ * Bitmapset could have a very long empty tail resulting in excessive
+ * looping during AcquireExecutorLocks().
+ */

I guess I trust your instincts about bitmapset operation efficiency
and what you've written here makes sense.  It's typical for leaf
partitions to have been appended toward the tail end of rtable and I'd
imagine their indexes would be in the tail words of minLockRelids.  If
copying the bitmapset removes those useless words, I don't see why we
shouldn't do that.  So added:

+ /*
+ * It seems worth doing a bms_copy() on glob->minLockRelids if we deleted
+ * a bit from it just above, to prevent empty tail bits resulting in
+ * inefficient looping during AcquireExecutorLocks().
+ */
+ if (glob->containsInitialPruning)
+ glob->minLockRelids = bms_copy(glob->minLockRelids);

Not 100% sure about the comment I wrote.

-- 
Amit Langote
EDB: http://www.enterprisedb.com


v14-0001-Optimize-AcquireExecutorLocks-to-skip-pruned-par.patch
Description: Binary data


Re: Expose JIT counters/timing in pg_stat_statements

2022-04-08 Thread Magnus Hagander
On Tue, Mar 8, 2022 at 4:08 AM Julien Rouhaud  wrote:

> On Mon, Mar 07, 2022 at 01:40:34PM +0100, Magnus Hagander wrote:
> >
> > I wonder if there might be an interesting middle ground, or if that is
> > making it too much. That is, we could have an
> > Option 3:
> > jit_count
> > total_jit_time - for sum of functions+inlining+optimization+emission time
> > min_jit_time - for sum of functions+inlining+optimization+emission time
> > max_jit_time - for sum of functions+inlining+optimization+emission time
> > mean_jit_time - for sum of functions+inlining+optimization+emission time
> > stddev_jit_time - for sum of functions+inlining+optimization+emission
> time
> > jit_functions
> > jit_generation_time
> > jit_inlining_count
> > jit_inlining_time
> > jit_optimization_count
> > jit_optimization_time
> > jit_emission_count
> > jit_emission_time
> >
> > That is, we'd get the more detailed timings across the total time, but
> > not on the details. But that might be overkill.
>
> I also thought about it but it seems overkill.  pg_stat_statements view is
> already very big, and I think that the JIT time should be somewhat stable,
> at
> least compared to how much a query execution time can vary depending on the
> parameters.  This approach would also be a bit useless if you change the
> costing of underlying JIT operation.
>
> > But -- here's an updated patched based on Option 2.
>
> Thanks!
>
> Code-wide, the patch looks good.  For the doc, it seems that you documented
> jit_inlining_count three times rather than documenting
> jit_optimization_count
> and jit_emission_count.
>

Oops, thanks and fixed.


I don't think we can add tests there, and having a test for every new
> counter
> being >= 0 seems entirely useless, however there should be a new test
> added for
> the "oldextversions" test to make sure that there's no issue with old SQL
> / new
> shlib compatibility.  And looking at it I see that it was already missed
> for
> version 1.9 :(
>

Indeed. Fixed here.

Michael had already applied a patch that took us to 1.10 and added that
test, so I've just updated it here. I don't think we normally bump the
version twice in the same day, so I just merged the SQL script changes as
well.

PFA a "final" version for the CI to run.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index 2813eb16cb..efb2049ecf 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -197,44 +197,52 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
 -- New functions and views for pg_stat_statements in 1.10
 AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
 \d pg_stat_statements
-View "public.pg_stat_statements"
-   Column|   Type   | Collation | Nullable | Default 
--+--+---+--+-
- userid  | oid  |   |  | 
- dbid| oid  |   |  | 
- toplevel| boolean  |   |  | 
- queryid | bigint   |   |  | 
- query   | text |   |  | 
- plans   | bigint   |   |  | 
- total_plan_time | double precision |   |  | 
- min_plan_time   | double precision |   |  | 
- max_plan_time   | double precision |   |  | 
- mean_plan_time  | double precision |   |  | 
- stddev_plan_time| double precision |   |  | 
- calls   | bigint   |   |  | 
- total_exec_time | double precision |   |  | 
- min_exec_time   | double precision |   |  | 
- max_exec_time   | double precision |   |  | 
- mean_exec_time  | double precision |   |  | 
- stddev_exec_time| double precision |   |  | 
- rows| bigint   |   |  | 
- shared_blks_hit | bigint   |   |  | 
- shared_blks_read| bigint   |   |  | 
- shared_blks_dirtied | bigint   |   |  | 
- shared_blks_written | bigint   |   |  | 
- local_blks_hit  | bigint   |   |  | 
- local_blks_read | bigint   |   |  | 
- local_blks_dirtied  | bigint   |   |  | 
- local_blks_written  | bigint   |   |  | 
- temp_blks_read  | bigint   |   |  | 
- temp_blks_written   | bigint   |   |  | 
- blk_read_t

Re: shared-memory based stats collector

2022-04-08 Thread Ranier Vilela
Hi,

Per Coverity.

pgstat_reset_entry does not check whether the lock was really acquired.
I think shared_stat_reset_contents could then be called without holding
the lock; isn't that an issue?

regards,

Ranier Vilela


0001-avoid-reset-stats-without-lock.patch
Description: Binary data


Re: PROXY protocol support

2022-04-08 Thread Magnus Hagander
On Sat, Apr 2, 2022 at 12:17 AM wilfried roset 
wrote:

> Hi,
>
> I've been able to test the patch. Here is a recap of the experimentation.
>
> # Setup
>
> All tests have been done with 3 VMs (PostgreSQL, HAproxy, psql client) on
> Debian 11 communicating over private network.
> * PostgreSQL have been built with proxy_protocol_11.patch applied on
> master branch (465ab24296).
> * psql client is from postgresql-client-13 from Debian 11 repository.
> * HAproxy version used is 2.5.5-1~bpo11+1 installed from
> https://haproxy.debian.net
>
> # Configuration
>
> PostgresSQL has been configured to listen only on its private IP. To enable
> proxy protocol support `proxy_port` has been configured to `5431` and
> `proxy_servers` to `10.0.0.0/24` . `log_connections`
> has been turned on to make
> sure the correct IP address is logged. `log_min_duration_statement` has
> been
> configured to 0 to log all queries. Finally `log_destination` has been
> configured to `csvlog`.
>
> pg_hba.conf is like this:
>
>   local   all all trust
>   hostall all 127.0.0.1/32trust
>   hostall all ::1/128 trust
>   local   replication all trust
>   hostreplication all 127.0.0.1/32trust
>   hostreplication all ::1/128 trust
>   hostall all 10.0.0.208/32   md5
>
> Where 10.0.0.208 is the IP host the psql client's VM.
>
> HAproxy has two frontends, one for proxy protocol (port 5431) and one for
> regular TCP traffic. The configuration looks like this:
>
>   listen postgresql
>   bind 10.0.0.222:5432
>   server pg 10.0.0.253:5432 check
>
>   listen postgresql_proxy
>   bind 10.0.0.222:5431
>   server pg 10.0.0.253:5431 send-proxy-v2
>
> Where 10.0.0.222 is the IP of HAproxy's VM and 10.0.0.253 is the IP of
> PostgreSQL's VM.
>
> # Tests
>
> * from psql's vm to haproxy on port 5432 (no proxy protocol)
>   --> connection denied by pg_hba.conf, as expected
>
> * from psql's vm to postgresql's VM on port 5432 (no proxy protocol)
>   --> connection success with psql's vm ip in logfile and pg_stat_activity
>
> * from psql's vm to postgresql's VM on port 5431 (proxy protocol)
>   --> unable to open a connection, as expected
>
> * from psql's vm to haproxy on port 5431 (proxy protocol)
>   --> connection success with psql's vm ip in logfile and pg_stat_activity
>
> I've also tested without proxy protocol enable (and pg_hba.conf updated
> accordingly), PostgreSQL behave as expected.
>
> # Conclusion
>
> From my point of view the documentation is clear enough and the feature
> works
> as expected.


Hi!

Thanks for this review and testing!

I think it could do with at least one more look-over at the source code
level as well at this point, though, since it's been sitting around for a
while, so it won't make it in for this deadline. But hopefully I can get it
in early in the next cycle!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: SQL/JSON: functions

2022-04-08 Thread Justin Pryzby
On Thu, Mar 31, 2022 at 04:25:58PM -0400, Andrew Dunstan wrote:
> No code chunks left, only a documentation patch which should land

Documentation review for a6baa4bad.

> Construct a JSON the provided strings:

a JSON what ?
*from* the provided strings ?

> Construct a JSON from the provided values various types:

should say "a JSON scalar" ?
*of* various types ?

> Construct a JSON object from the provided key/value pairs of various types:

For comparison, that one looks ok.

+  JSON_EXISTS function checks whether the provided
+   JSON_VALUE function extracts a value from the provided
+   JSON_QUERY function extracts an 
SQL/JSON
+  JSON_TABLE function queries JSON 
data
+ JSON_TABLE uses the
+  JSON_SERIALIZE function transforms a SQL/JSON value

I think all these should all begin with "THE >...< function ...", like the
others do.

+To use other types, you must create the CAST from 
json for this type.
=> create a cast from json to this type.

+Values can be null, but not keys.
I think it's clearer to say "..but keys cannot."

+  For any scalar other than a number or a Boolean the text

Boolean COMMA the text

+ The path name must be unique and cannot coincide with column names.
Maybe say "name must be unique and distinct from the column names."

+  ... If you specify a GROUP BY
+  or an ORDER BY clause, this function returns a 
separate JSON object
+  for each table row.

"for each table row" sounds inaccurate or imprecise.  The SELECT docs say this:
| GROUP BY will condense into a single row all selected rows that share the 
same values for the grouped expressions

BTW, the documentation references look a little like OIDs...
Does someone already have an SNMP-based doc browser ?
| For details, see Section 9.16.3.4.2.




Re: SQL/JSON: functions

2022-04-08 Thread Andrew Dunstan


On 4/8/22 08:02, Justin Pryzby wrote:
> On Thu, Mar 31, 2022 at 04:25:58PM -0400, Andrew Dunstan wrote:
>> No code chunks left, only a documentation patch which should land
> Documentation review for a6baa4bad.
>
>> Construct a JSON the provided strings:
> a JSON what ?
> *from* the provided strings ?
>
>> Construct a JSON from the provided values various types:
> should say "a JSON scalar" ?
> *of* various types ?
>
>> Construct a JSON object from the provided key/value pairs of various types:
> For comparison, that one looks ok.
>
> +  JSON_EXISTS function checks whether the provided
> +   JSON_VALUE function extracts a value from the 
> provided
> +   JSON_QUERY function extracts an 
> SQL/JSON
> +  JSON_TABLE function queries 
> JSON data
> + JSON_TABLE uses the
> +  JSON_SERIALIZE function transforms a SQL/JSON 
> value
>
> I think all these should all begin with "THE >...< function ...", like the
> others do.
>
> +To use other types, you must create the CAST from 
> json for this type.
> => create a cast from json to this type.
>
> +Values can be null, but not keys.
> I think it's clearer to say "..but keys cannot."
>
> +  For any scalar other than a number or a Boolean the text
>
> Boolean COMMA the text
>
> + The path name must be unique and cannot coincide with column names.
> Maybe say "name must be unique and distinct from the column names."
>
> +  ... If you specify a GROUP BY
> +  or an ORDER BY clause, this function returns a 
> separate JSON object
> +  for each table row.
>
> "for each table row" sounds inaccurate or imprecise.  The SELECT docs say 
> this:
> | GROUP BY will condense into a single row all selected rows that share the 
> same values for the grouped expressions
>
> BTW, the documentation references look a little like OIDs...
> Does someone already have an SNMP-based doc browser ?
> | For details, see Section 9.16.3.4.2.



Many thanks, useful.

I already had a couple of these items on my list but I ran out of time
before tiredness overcame me last night.

I'm planning on removing some of that stuff that generates the last
complaint if I can do it without too much violence.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread David Rowley
On Fri, 8 Apr 2022 at 23:27, Magnus Hagander  wrote:
>
>
>
> On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander  wrote:
>>
>> On Tue, Mar 29, 2022 at 10:06 PM David Rowley  wrote:
>>>
>>> If we go with this patch,  the problem I see here is that the amount
>>> of work the JIT compiler must do for a given query depends mostly on
>>> the number of expressions that must be compiled in the query (also to
>>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
>>> jit_tuple_deforming and jit_expressions). The DBA does not really have
>>> much control over the number of expressions in the query.  All he or
>>> she can do to get rid of the warning is something like increase
>>> jit_above_cost.  After a few iterations of that, the end result is
>>> that jit_above_cost is now high enough that JIT no longer triggers
>>> for, say, that query to that table with 1000 partitions where no
>>> plan-time pruning takes place.  Is that really a good thing? It likely
>>> means that we just rarely JIT anything at all!
>>
>>
>> I don't agree with the conclusion of that.
>>
>> What the parameter would be useful for is to be able to tune those costs (or 
>> just turn it off) *for that individual query*. That doesn't mean you "rarely 
>> JIT anything atll", it just means you rarely JIT that particular query.

I just struggle to imagine that anyone is going to spend much effort
tuning a warning parameter per query.  I imagine they're far more
likely to just ramp it up to only catch some high percentile problems
or (more likely) just not bother with it.  It seems more likely
that if anyone was to tune anything per query here it would be
jit_above_cost, since that actually might have an effect on the
performance of the query, rather than if it spits out some warning
message or not.  ISTM that if the user knows what to set it to per
query, then there's very little point in having a warning as we'd be
alerting them to something they already know about.

I looked in the -general list to see if we could get some common
explanations to give us an idea of the most common reason for high JIT
compilation time. It seems that the plans were never simple. [1] seems
due to a complex plan. I'm basing that off the "Functions: 167". I
didn't catch the full plan. From what I can tell, [2] seems to be due
to "lots of empty tables", so assuming the clamping at 1 page is
causing issues there.  I think both of those cases could be resolved
by building the costing the way I mentioned.  I admit that 2 cases is
not a very large sample size.

David

[1] 
https://www.postgresql.org/message-id/flat/CAPL5KHq8zfWPzueCemXw4c%2BU568PoDfqo3wBDNm3KAyvybdaMQ%40mail.gmail.com#35aca8b42c3862f44b6be5b260c1a109
[2] 
https://www.postgresql.org/message-id/flat/CAHOFxGo5xJt02RmwAWrtv2K0jcqqxG-cDiR8FQbvb0WxdKhcgw%40mail.gmail.com#12d91822e869a2e22ca830cb5632f549




Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> The new krb_user_ccache is a lot closer to 'global', though it's
> specifically for user-authenticated backends (allowing the postmaster
> and other things like replication connections to use whatever the
> credential cache is set to by the administrator on startup), but that
> seems like it makes sense to me- generally you're not going to want
> regular user backends to be accessing the credential cache of the
> 'postgres' unix account on the server.

Added an explicit 'environment' option to allow for, basically, existing
behavior, where we don't mess with the environment variable at all,
though I kept the default as MEMORY since I don't think it's really
typical that folks actually want regular user backends to inherit the
credential cache of the server.

Added a few more tests and updated the documentation too.  Sadly, it seems
we've missed the deadline for v15, though, for lack of feedback on these.
Would really like to get some other folks commenting as these are new
pg_hba and postgresql.conf options being added.

Thanks!

Stephen
From bd248c3fd82d04d3c12bf6c777f861134a45a101 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 7 Apr 2022 15:34:39 -0400
Subject: [PATCH] Add support for Kerberos credential delegation

Accept GSSAPI/Kerberos delegated credentials.  With this, a user could
authenticate to PostgreSQL using Kerberos credentials, delegate
credentials to the PostgreSQL server, and then the PostgreSQL server
could use those credentials to connect to another service, such as with
postgres_fdw or dblink or theoretically any other authenticated
connection which is able to use delegated credentials.

If an administrator prefers to not allow credentials to be delegated to
the server, they can be disallowed using a new pg_hba option for gss
called 'allow_cred_delegation'.

A new server GUC has also been introduced to allow an administrator to
control what the kerberos credential cache is configured to for user
authenticated backends, krb_user_ccache.  This defaults to MEMORY:,
which is where delegated credentials are stored (and is otherwise empty,
avoiding the risk of an administrator's credentials on the server being
mistakenly picked up and used).

Original patch by: Peifeng Qiu, whacked around some by me.
Reviewed-by: Jacob Champion
Discussion: https://postgr.es/m/co1pr05mb8023cc2cb575e0faad7df4f8a8...@co1pr05mb8023.namprd05.prod.outlook.com
---
 contrib/dblink/dblink.c   |   6 +-
 .../postgres_fdw/expected/postgres_fdw.out|   2 +-
 contrib/postgres_fdw/option.c |   3 +
 doc/src/sgml/client-auth.sgml |  13 ++
 doc/src/sgml/config.sgml  |  28 
 doc/src/sgml/libpq.sgml   |  19 +++
 src/backend/libpq/auth.c  |  27 +++-
 src/backend/libpq/be-gssapi-common.c  |  51 +++
 src/backend/libpq/be-secure-gssapi.c  |  19 ++-
 src/backend/libpq/hba.c   |  19 +++
 src/backend/utils/adt/hbafuncs.c  |   4 +
 src/backend/utils/init/postinit.c |   8 +-
 src/backend/utils/misc/guc.c  |  15 ++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/libpq/auth.h  |   1 +
 src/include/libpq/be-gssapi-common.h  |   3 +
 src/include/libpq/hba.h   |   1 +
 src/include/libpq/libpq-be.h  |   3 +
 src/interfaces/libpq/exports.txt  |   1 +
 src/interfaces/libpq/fe-auth.c|  12 +-
 src/interfaces/libpq/fe-connect.c |  12 ++
 src/interfaces/libpq/fe-secure-gssapi.c   |   3 +-
 src/interfaces/libpq/libpq-fe.h   |   1 +
 src/interfaces/libpq/libpq-int.h  |   1 +
 src/test/kerberos/Makefile|   3 +
 src/test/kerberos/t/001_auth.pl   | 128 --
 src/test/perl/PostgreSQL/Test/Utils.pm|  27 
 27 files changed, 391 insertions(+), 20 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index a06d4bd12d..e5b70e084e 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -2643,7 +2643,7 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
 {
 	if (!superuser())
 	{
-		if (!PQconnectionUsedPassword(conn))
+		if (!(PQconnectionUsedPassword(conn) || PQconnectionUsedGSSAPI(conn)))
 		{
 			PQfinish(conn);
 			ReleaseExternalFD();
@@ -2652,8 +2652,8 @@ dblink_security_check(PGconn *conn, remoteConn *rconn)
 
 			ereport(ERROR,
 	(errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
-	 errmsg("password is required"),
-	 errdetail("Non-superuser cannot connect if the server does not request a password."),
+	 errmsg("password or GSSAPI is required"),
+	 errdetail("Non-superuser cannot connect if the server does not request a password or use GSSAPI."),
 	 errhint("Target server's authentication method must be changed.")));

Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Magnus Hagander
On Fri, Apr 8, 2022 at 2:19 PM David Rowley  wrote:

> On Fri, 8 Apr 2022 at 23:27, Magnus Hagander  wrote:
> >
> >
> >
> > On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander 
> wrote:
> >>
> >> On Tue, Mar 29, 2022 at 10:06 PM David Rowley 
> wrote:
> >>>
> >>> If we go with this patch,  the problem I see here is that the amount
> >>> of work the JIT compiler must do for a given query depends mostly on
> >>> the number of expressions that must be compiled in the query (also to
> >>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
> >>> jit_tuple_deforming and jit_expressions). The DBA does not really have
> >>> much control over the number of expressions in the query.  All he or
> >>> she can do to get rid of the warning is something like increase
> >>> jit_above_cost.  After a few iterations of that, the end result is
> >>> that jit_above_cost is now high enough that JIT no longer triggers
> >>> for, say, that query to that table with 1000 partitions where no
> >>> plan-time pruning takes place.  Is that really a good thing? It likely
> >>> means that we just rarely JIT anything at all!
> >>
> >>
> >> I don't agree with the conclusion of that.
> >>
> >> What the parameter would be useful for is to be able to tune those
> costs (or just turn it off) *for that individual query*. That doesn't mean
> you "rarely JIT anything atll", it just means you rarely JIT that
> particular query.
>
> I just struggle to imagine that anyone is going to spend much effort
> tuning a warning parameter per query.  I imagine they're far more
> likely to just ramp it up to only catch some high percentile problems
> or just (more likely) just not bother with it.  It seems more likely
> that if anyone was to tune anything per query here it would be
> jit_above_cost, since that actually might have an affect on the
> performance of the query, rather than if it spits out some warning
> message or not.  ISTM that if the user knows what to set it to per
> query, then there's very little point in having a warning as we'd be
> alerting them to something they already know about.
>

I would not expect people to tune the *warning* at a query level. If
anything, then yes, they would tune either jit_above_cost or just
jit=off. But the idea is that you can do that on a per-query level instead of
globally.


I looked in the -general list to see if we could get some common
> explanations to give us an idea of the most common reason for high JIT
> compilation time. It seems that the plans were never simple. [1] seems
> due to a complex plan. I'm basing that off the "Functions: 167". I
> didn't catch the full plan. From what I can tell, [2] seems to be due
> to "lots of empty tables", so assuming the clamping at 1 page is
> causing issues there.  I think both of those cases could be resolved
> by building the costing the way I mentioned.  I admit that 2 cases is
> not a very large sample size.
>

Again, I am very much for improvements of the costing model. This is in no
way intended to be a replacement for that. It's intended to be a stop-gap.

What I see much of today are things like
https://dba.stackexchange.com/questions/264955/handling-performance-problems-with-jit-in-postgres-12
or
https://dev.to/xenatisch/cascade-of-doom-jit-and-how-a-postgres-update-led-to-70-failure-on-a-critical-national-service-3f2a

The bottom line is that people end up with recommendations to turn off JIT
globally more or less by default, because there's no really useful way today
to figure out when it causes problems vs when it helps.

The addition to pg_stat_statements I pushed a short while ago would help
with that. But I think having a warning like this would also be useful. As
a stop-gap measure, yes, but we really don't know when we will have an
improved costing model for it. I hope you're right and that we can have it
by 16, and then I will definitely advocate for removing the warning again
if it works.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Mark all GUC variable as PGDLLIMPORT

2022-04-08 Thread Robert Haas
On Wed, Apr 6, 2022 at 7:56 PM Michael Paquier  wrote:
> On Wed, Apr 06, 2022 at 12:57:29AM +0700, John Naylor wrote:
> > For these two patches, I'd say a day or two after feature freeze is a
> > reasonable goal.
>
> Yeah.  For patches as invasive as the PGDLLIMPORT business and the
> frontend error refactoring, I am also fine to have two exceptions with
> the freeze deadline.

Done now.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-04-08 Thread Justin Pryzby
On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
> I have committed the patch after modifying it as such.  (I think we
> can improve these later, if necessary.)

This patch seems to be causing the planner to crash.
Here's a query reduced from sqlsmith.

| explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= 
pg_trigger_depth();

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at 
../../../../src/include/nodes/pg_list.h:151
151 return l ? l->length : 0;
(gdb) bt
#0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at 
../../../../src/include/nodes/pg_list.h:151
#1  0x55b43968af89 in mark_async_capable_plan 
(plan=plan@entry=0x7f4219ed93b0, path=path@entry=0x7f4219e89538) at 
createplan.c:1132
#2  0x55b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0, 
best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at 
createplan.c:1329
#3  0x55b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0, 
best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at 
createplan.c:421
#4  0x55b43968f974 in create_projection_plan 
(root=root@entry=0x55b43affb2b0, best_path=best_path@entry=0x7f4219ed0f60, 
flags=flags@entry=1) at createplan.c:2039
#5  0x55b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0, 
best_path=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:433
#6  0x55b439690221 in create_plan (root=root@entry=0x55b43affb2b0, 
best_path=) at createplan.c:348
#7  0x55b4396a1451 in standard_planner (parse=0x55b43af05e28, 
query_string=, cursorOptions=2048, boundParams=0x0) at 
planner.c:413
#8  0x55b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28, 
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) 
at planner.c:277
#9  0x55b439790c78 in pg_plan_query 
(querytree=querytree@entry=0x55b43af05e28, 
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) 
at postgres.c:883
#10 0x55b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528, 
query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();", 
cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0) 
at postgres.c:975
#11 0x55b439791239 in exec_simple_query 
(query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM 
information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();") at 
postgres.c:1169
#12 0x55b439793183 in PostgresMain (dbname=, 
username=) at postgres.c:4542
#13 0x55b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at 
postmaster.c:4489
#14 0x55b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0) at 
postmaster.c:4217
#15 0x55b4396e9e4a in ServerLoop () at postmaster.c:1791
#16 0x55b4396eb401 in PostmasterMain (argc=7, argv=) at 
postmaster.c:1463
#17 0x55b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202

Actually, the original query failed like this:
#2  0x55b4398e9f90 in ExceptionalCondition 
(conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus == 
SUBQUERY_SCAN_UNKNOWN", errorType=errorType@entry=0x55b43994b00b 
"FailedAssertion", 
#3  0x55b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at 
setrefs.c:1367




Commitfest Closed

2022-04-08 Thread Greg Stark
It has reached 2022-04-08 Anywhere on Earth[*] so I believe that means
Postgres 15 Feature Freeze is in effect (modulo a couple patches that
were held until the end of the commitfest to make merging easier).

I've marked the commitfest closed and will be moving any patches that
didn't receive feedback over to the next commitfest. I think this is
most of the remaining patches though there may be a few Waiting for
Author patches that can be Returned with Feedback or even Rejected.
I'll do the Ready for Committer patches last to allow for the
stragglers held back.

It's always frustrating seeing patches get ignored but on the plus
side nearly 100 patches are marked Committed and a lot of patches did
get feedback.

Thanks to all the reviewers and committers who put a lot of work in,
especially in the last two weeks. I especially want to thank Andres
who showed me how to use the cfbot to check on patch statuses and did
a lot of work doing that until I was up to speed.

[*] https://www.timeanddate.com/time/zones/aoe

-- 
greg




Re: Commitfest Closed

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-08, Greg Stark wrote:

> Thanks to all the reviewers and committers who put a lot of work in,
> especially in the last two weeks. I especially want to thank Andres
> who showed me how to use the cfbot to check on patch statuses and did
> a lot of work doing that until I was up to speed.

Thanks for herding through the CF!

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-08 Thread Magnus Hagander
On Fri, Apr 8, 2022 at 2:42 PM Robert Haas  wrote:

> On Wed, Apr 6, 2022 at 7:56 PM Michael Paquier 
> wrote:
> > On Wed, Apr 06, 2022 at 12:57:29AM +0700, John Naylor wrote:
> > > For these two patches, I'd say a day or two after feature freeze is a
> > > reasonable goal.
> >
> > Yeah.  For patches as invasive as the PGDLLIMPORT business and the
> > frontend error refactoring, I am also fine to have two exceptions with
> > the freeze deadline.
>
> Done now.
>

\o/

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Extract epoch from Interval weird behavior

2022-04-08 Thread Tom Lane
Peter Eisentraut  writes:
> We really wanted to avoid doing calculations in numeric as much as 
> possible.  So we should figure out a different way to write this.  The 
> attached patch works for me.  It's a bit ugly since it hardcodes some 
> factors.  Maybe we can rephrase it a bit more elegantly.

I think it's fine but needs some commentary.  Maybe about like
"To do this calculation in integer arithmetic even though
DAYS_PER_YEAR is fractional, multiply everything by 4
and then divide by 4 again at the end.  This relies on
DAYS_PER_YEAR being a multiple of 0.25 and on SECS_PER_DAY
being a multiple of 4."

BTW, it might be good to parenthesize as

(... big calculation ...) * (SECS_PER_DAY/4)

to eliminate any question of whether the value could overflow
before the final division by 4.
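
For what it's worth, here is a minimal standalone sketch of that trick,
plugging in the stock values (DAYS_PER_YEAR = 365.25, DAYS_PER_MONTH = 30,
MONTHS_PER_YEAR = 12, SECS_PER_DAY = 86400) and the 34-year interval from
the regression test; it's just an illustration, not the timestamp.c code
itself:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t months = 34 * 12;	/* interval of 34 years */
	int64_t days = 0;

	/*
	 * 4 * 365.25 = 1461 and 4 * 30 = 120 are exact integers, and
	 * 86400 / 4 = 21600, so the whole computation stays in int64
	 * with no fractional truncation.
	 */
	int64_t secs = ((int64_t) 1461 * (months / 12) +
					(int64_t) 120 * (months % 12) +
					(int64_t) 4 * days) * (86400 / 4);

	printf("%lld\n", (long long) secs);	/* prints 1072958400 */
	return 0;
}

That agrees with the corrected expected output in the regression test,
and the final multiply by (86400 / 4) matches the parenthesization
suggested above.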

regards, tom lane




Re: Support logical replication of DDLs

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 7:34 AM Alvaro Herrera  wrote:
> > For runtime conditions, one of the things you have mentioned in that
> > thread is to add schema name in the statement at the required places
> > which this patch deals with in a different way by explicitly sending
> > it along with the DDL statement.
>
> Hmm, ok.  The point of the JSON-blob route is that the publisher sends a
> command representation that can be parsed/processed/transformed
> arbitrarily by the subscriber using generic rules; it should be trivial
> to use a JSON tool to change schema A to schema B in any arbitrary DDL
> command, and produce another working DDL command without having to know
> how to write that command specifically.  So if I have a rule that
> "schema A there is schema B here", all DDL commands can be replayed with
> no further coding (without having to rely on getting the run-time
> search_path correct.)

Yeah, that was a really nice aspect of that approach.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Robert Haas
On Thu, Apr 7, 2022 at 7:01 PM Andres Freund  wrote:
> On 2022-04-04 19:24:22 -0700, Peter Geoghegan wrote:
> > We should definitely increase MaxHeapTuplesPerPage before too long,
> > for a variety of reasons that I have talked about in the past. Its
> > current value is 291 on all mainstream platforms, a value that's
> > derived from accidental historic details -- which predate HOT.
>
> I'm on-board with that - but I think we should rewrite a bunch of places that
> use MaxHeapTuplesPerPage sized-arrays on the stack first. It's not great using
> several KB of stack at the current value already (*), but if it
> grows
> further...

I agree that the value of 291 is pretty much accidental, but it also
seems fairly generous to me. The bigger you make it, the more space
you can waste. I must have missed (or failed to understand) previous
discussions about why raising it would be a good idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: why pg_walfile_name() cannot be executed during recovery?

2022-04-08 Thread Bharath Rupireddy
On Thu, Apr 7, 2022 at 9:07 PM Robert Haas  wrote:
>
> On Thu, Apr 7, 2022 at 9:32 AM Bharath Rupireddy
>  wrote:
> > I spent some time today to allow pg_walfile_{name, name_offset} run in
> > recovery. Timeline ID is computed while in recovery as follows - WAL
> > receiver's last received and flushed WAL record's TLI if it's
> > streaming, otherwise the last replayed WAL record's TLI. This way,
> > these functions can be used on standby or PITR server or even in crash
> > recovery if the server opens up for read-only connections.
>
> I don't think this is a good definition. Suppose I ask for
> pg_walfile_name() using an older LSN. With this approach, we're going
> to get a filename based on the idea that the TLI that was in effect
> back then is the same one as the TLI that is in effect now, which
> might not be true. For example, suppose that the current TLI is 2 and
> it branched off of timeline 1 at 10/0. If I ask for
> pg_walfile_name('F/0'), it's going to give me the name of a WAL file
> that has never existed. That seems bad.
>
> It's also worth noting that there's a bit of a definitional problem
> here. If in the same situation, I ask for pg_walfile_name('11/0'),
> it's going to give me a filename based on TLI 2, but there's also a
> WAL file for that LSN with TLI 1. How do we know which one the user
> wants? Perhaps one idea would be to say that the relevant TLI is the
> one which was in effect at the time that LSN was replayed. If we do
> that, what about future LSNs? We could assume that for future LSNs,
> the TLI should be the same as the current TLI, but maybe that's also
> misleading, because recovery_target_timeline could be set.

Fundamental question - should the pg_walfile_{name, name_offset} check
whether the file with the computed WAL file name exists on the server
right now or ever existed earlier? Right now, they don't do that, see
[1].

I think we can make the functions more robust:
pg_walfile_{name, name_offset}(lsn, check_if_file_exists = false, tli
= invalid_timelineid) - when check_if_file_exists is true, check that
the computed WAL file exists, and when a valid tli is provided, use it
in computing the WAL file name. When tli isn't provided, continue to
use the insert TLI on the primary, and in recovery use the TLI as
proposed in my patch. Perhaps it can also do (as Michael suggested)
this: if check_if_file_exists is true, tli isn't provided, and there's
timeline history, then it can go look at all the timelines and check
whether the file exists with the name computed for each historic TLI.

> I think it's really important to start by being precise about the
> question that we think pg_walfile_name() ought to be answering. If we
> don't know that, then we really can't say what TLI it should be using.
> It's not hard to make the function return SOME answer using SOME TLI,
> but then it's not clear that the answer is the right one for any
> particular purpose. And in that case the function is more dangerous
> than useful, because people will write code that uses it to do stuff,
> and then that stuff won't actually work correctly under all
> circumstances.

Yes, once we agree on the semantics of these functions, having better
documentation will help.

Thoughts?

[1]
postgres=# select * from pg_walfile_name('5/dfdf');
 pg_walfile_name
--
 00010005
(1 row)
postgres=# select * from pg_walfile_name_offset('5/dfdf');
file_name | file_offset
--+-
 00010005 |   57311
(1 row)

Regards,
Bharath Rupireddy.




Re: Support for grabbing multiple consecutive values with nextval()

2022-04-08 Thread Greg Stark
On Sun, 27 Feb 2022 at 07:09, Jille Timmermans  wrote:
>
> Hi,
>
> First time PostgreSQL contributor here :)

I wish I had noticed this patch during the CF. It seems like a nice
self-contained feature that could have been easily reviewed and
committed and it's always good to see first-time contributions.
Hopefully it'll get committed early in the next cycle.


-- 
greg




Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Robert Haas
On Wed, Apr 6, 2022 at 8:15 AM Robert Haas  wrote:
> On Tue, Apr 5, 2022 at 8:43 PM Thom Brown  wrote:
> > I share your discomfort with the wording.  How about:
> >
> > WAL records must be kept on standby until they are ready to be applied.
> > Therefore, longer delays will result in a greater accumulation of WAL files,
> > increasing disk space requirements for the standby's pg_wal
> > directory.
>
> Looks awesome.

Here that is in patch form. I feel that the feature freeze should not
preclude committing this documentation improvement, but if someone
feels otherwise, then I will leave this until the tree reopens.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v1-0001-docs-Note-the-recovery_min_apply_delay-bloats-pg_.patch
Description: Binary data


Re: Add parameter jit_warn_above_fraction

2022-04-08 Thread Stephen Frost
Greetings,

* Magnus Hagander (mag...@hagander.net) wrote:
> On Fri, Apr 8, 2022 at 2:19 PM David Rowley  wrote:
> > On Fri, 8 Apr 2022 at 23:27, Magnus Hagander  wrote:
> > > On Wed, Mar 30, 2022 at 3:04 PM Magnus Hagander 
> > wrote:
> > >>
> > >> On Tue, Mar 29, 2022 at 10:06 PM David Rowley 
> > wrote:
> > >>>
> > >>> If we go with this patch,  the problem I see here is that the amount
> > >>> of work the JIT compiler must do for a given query depends mostly on
> > >>> the number of expressions that must be compiled in the query (also to
> > >>> a lesser extent jit_inline_above_cost, jit_optimize_above_cost,
> > >>> jit_tuple_deforming and jit_expressions). The DBA does not really have
> > >>> much control over the number of expressions in the query.  All he or
> > >>> she can do to get rid of the warning is something like increase
> > >>> jit_above_cost.  After a few iterations of that, the end result is
> > >>> that jit_above_cost is now high enough that JIT no longer triggers
> > >>> for, say, that query to that table with 1000 partitions where no
> > >>> plan-time pruning takes place.  Is that really a good thing? It likely
> > >>> means that we just rarely JIT anything at all!
> > >>
> > >>
> > >> I don't agree with the conclusion of that.
> > >>
> > >> What the parameter would be useful for is to be able to tune those
> > costs (or just turn it off) *for that individual query*. That doesn't mean
> > you "rarely JIT anything atll", it just means you rarely JIT that
> > particular query.
> >
> > I just struggle to imagine that anyone is going to spend much effort
> > tuning a warning parameter per query.  I imagine they're far more
> > likely to just ramp it up to only catch some high percentile problems
> > or just (more likely) just not bother with it.  It seems more likely
> > that if anyone was to tune anything per query here it would be
> > jit_above_cost, since that actually might have an affect on the
> > performance of the query, rather than if it spits out some warning
> > message or not.  ISTM that if the user knows what to set it to per
> > query, then there's very little point in having a warning as we'd be
> > alerting them to something they already know about.
> 
> I would not expect people to tune the *warning* at a query level. If
> anything, then ys, they would tune the either jit_above_cost or just
> jit=off. But the idea being you can do that on a per query level instead of
> globally.

Yeah, exactly, this is about having a busy system and wanting to know
which queries are spending a lot of time doing JIT relative to the query
time, so that you can go adjust your JIT parameters or possibly disable
JIT for those queries (or maybe bring those cases to -hackers and try to
help make our costing better).
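
Just to spell out what "relative to the query time" means here, a rough
sketch of the check the proposed GUC implies (variable names are made up
for illustration; this is not the actual patch):

	/* warn when JIT consumed more than the configured fraction of runtime */
	if (jit_warn_above_fraction > 0.0 &&
		total_jit_time > total_query_time * jit_warn_above_fraction)
		ereport(WARNING,
				(errmsg("JIT compilation took %.1f%% of total query time",
						100.0 * total_jit_time / total_query_time)));

That is, it's a per-query relative threshold, not an absolute time cutoff.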

> > I looked in the -general list to see if we could get some common
> > explanations to give us an idea of the most common reason for high JIT
> > compilation time. It seems that the plans were never simple. [1] seems
> > due to a complex plan. I'm basing that off the "Functions: 167". I
> > didn't catch the full plan. From what I can tell, [2] seems to be due
> > to "lots of empty tables", so assuming the clamping at 1 page is
> > causing issues there.  I think both of those cases could be resolved
> > by building the costing the way I mentioned.  I admit that 2 cases is
> > not a very large sample size.
> 
> Again, I am very much for improvements of the costing model. This is in no
> way intended to be a replacement for that. It's intended to be a stop-gap.

Not sure I'd say it's a 'stop-gap', as it's really very similar, imv
anyway, to log_min_duration_statement: you want to know what queries are
taking a lot of time but you can't log all of them.

> What I see much of today are things like
> https://dba.stackexchange.com/questions/264955/handling-performance-problems-with-jit-in-postgres-12
> or
> https://dev.to/xenatisch/cascade-of-doom-jit-and-how-a-postgres-update-led-to-70-failure-on-a-critical-national-service-3f2a
> 
> The bottom line is that people end up with recommendations to turn off JIT
> globally more or less by default. Because there's no real useful way today
> to figure out when it causes problems vs when it helps.

Yeah, that's frustrating.

> The addition to pg_stat_statements I pushed a short while ago would help
> with that. But I think having a warning like this would also be useful. As
> a stop-gap measure, yes, but we really don't know when we will have an
> improved costing model for it. I hope you're right and that we can have it
> by 16, and then I will definitely advocate for removing the warning again
> if it works.

Having this in pg_stat_statements is certainly helpful but having a
warning also is.  I don't think we have to address this in only one way.
A lot faster to flip this guc and then look in the logs on a busy system
than to install pg_stat_statements, restart the cluster once you get
permission to do so, and then query it.

Thanks,

Stephen

Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

2022-04-08 Thread Bharath Rupireddy
On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
 wrote:
>
> On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
>  wrote:
> >
> > Hi,
> >
> > I'm thinking if there's a way in core postgres to achieve $subject. In
> > reality, the sync/async standbys can either be closer/farther (which
> > means sync/async standbys can receive WAL at different times) to
> > primary, especially in cloud HA environments with primary in one
> > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > $subject may not be possible on dev systems (say, for testing some HA
> > features) unless we can inject a delay in WAL senders before sending
> > WAL.
> >
> > How about having two developer-only GUCs {async,
> > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > delay sending WAL by {async, sync}_wal_sender_delay
> > milliseconds/seconds? Although, I can't think of any immediate use, it
> > will be useful someday IMO, say for features like [1], if it gets in.
> > With this set of GUCs, one can even add core regression tests for HA
> > features.
> >
> > Thoughts?
>
> I think this is a common problem people run into. One way to
> simulate network delay is what you suggest, yes. But I was wondering
> if there are tools/libraries that can help us to do that. Googling
> gives OS specific tools but nothing like a C or perl library which can
> be used for this purpose.

Thanks. IMO, non-postgres tools (if they even exist) to simulate
network delays may not be reliable or easy to use, say, for adding some
TAP tests for HA features. Especially in the cloud world, usage of those
external tools may not even be possible. With the developer-only GUCs
being proposed here in this thread, it's pretty easy to simulate what we
want; the only extra caution needed is to not let others (probably
non-superusers) set and misuse these developer-only GUCs. I think that's
true for all the existing developer-only GUCs as well.

Thoughts?

Regards,
Bharath Rupireddy.




Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Magnus Hagander
On Fri, Apr 8, 2022 at 3:36 PM Robert Haas  wrote:

> On Wed, Apr 6, 2022 at 8:15 AM Robert Haas  wrote:
> > On Tue, Apr 5, 2022 at 8:43 PM Thom Brown  wrote:
> > > I share your discomfort with the wording.  How about:
> > >
> > > WAL records must be kept on standby until they are ready to be applied.
> > > Therefore, longer delays will result in a greater accumulation of WAL
> files,
> > > increasing disk space requirements for the standby's
> pg_wal
> > > directory.
> >
> > Looks awesome.
>
> Here that is in patch form. I feel that the feature freeze should not
> preclude committing this documentation improvement, but if someone
> feels otherwise, then I will leave this until the tree reopens.
>

We normally allow documentation and bug fixes after the feature freeze.
(It's only in the "we're about to wrap the release right now"-freeze that
we have to avoid those)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: GSoC: New and improved website for pgjdbc (JDBC)

2022-04-08 Thread Dave Cramer
Joseph



On Thu, 7 Apr 2022 at 17:49, Joseph Ho  wrote:

> Hello,
>
> I am Joseph Ho, a senior at Dr Norman Bethune Collegiate Institute
> interested in going into computer science. I am interested in working to
> create and improve the website for pgjdbc during GSoC 2022.
>
> I am wondering how the draft proposal should be made. Will I need to
> submit a web design of the new and improved website or will I need to
> submit something else? Also, am I able to use a web framework of my choice
> or is there one that you prefer that we use?
>

You should register on the GSoC site Contributor Registration | Google
Summer of Code 

The draft proposal can just be a document at this point which outlines your
ideas

Currently the site is built using Jekyll, and I see no good reason to
change. What I am looking for is to update the version of Jekyll to latest
and make the site cleaner with a new design.

You should be aware that proposals have to be in by the 19th of April


Regards,

Dave



Re: Can we automatically add elapsed times to tap test log?

2022-04-08 Thread Andrew Dunstan

On 4/7/22 19:55, Andrew Dunstan wrote:
> On 4/7/22 17:58, Andres Freund wrote:
>> Hi,
>>
>> On 2022-04-07 17:45:09 -0400, Tom Lane wrote:
>>> Andres Freund  writes:
 On 2022-04-07 17:21:09 -0400, Tom Lane wrote:
> I too think that the elapsed time is useful.  I'm less convinced
> that the time-of-day marker is useful.
 I think it'd be quite useful if it had more precision - it's a pita to
 correlate regress_log_* output with server logs.
>>> Fair point.  Maybe we could keep the timestamp (with ms precision
>>> if possible) and then the parenthetical bit is time-since-last-line
>>> (also with ms precision)?  I think that would more or less satisfy
>>> both uses.
>> Would work for me...
>>
> All doable. Time::HiRes gives us a higher resolution timer. I'll post a
> new version in a day or two.


New version attached.


Sample traces:


andrew@emma:log $ egrep '^\[[0-9][0-9]:[00-9][0-9]:' 
regress_log_020_pg_receivewal | tail -n 15
[09:22:45.031](0.000s) ok 30 # skip postgres was not built with LZ4 support
[09:22:45.032](0.000s) ok 31 # skip postgres was not built with LZ4 support
[09:22:45.296](0.265s) ok 32 - streaming some WAL
[09:22:45.297](0.001s) ok 33 - check that previously partial WAL is now complete
[09:22:45.298](0.001s) ok 34 - check stream dir permissions
[09:22:45.298](0.000s) # Testing pg_receivewal with slot as starting streaming 
point
[09:22:45.582](0.284s) ok 35 - pg_receivewal fails with non-existing slot: exit 
code not 0
[09:22:45.583](0.001s) ok 36 - pg_receivewal fails with non-existing slot: 
matches
[09:22:45.618](0.036s) ok 37 - WAL streamed from the slot's restart_lsn
[09:22:45.619](0.001s) ok 38 - WAL from the slot's restart_lsn has been archived
[09:22:46.597](0.978s) ok 39 - Stream some wal after promoting, resuming from 
the slot's position
[09:22:46.598](0.001s) ok 40 - WAL segment 0001000B archived 
after timeline jump
[09:22:46.598](0.000s) ok 41 - WAL segment 0002000C archived 
after timeline jump
[09:22:46.598](0.000s) ok 42 - timeline history file archived after timeline 
jump
[09:22:46.599](0.001s) 1..42


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
diff --git a/src/test/perl/PostgreSQL/Test/SimpleTee.pm b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
index bb9d79a755..d92c31a891 100644
--- a/src/test/perl/PostgreSQL/Test/SimpleTee.pm
+++ b/src/test/perl/PostgreSQL/Test/SimpleTee.pm
@@ -10,10 +10,31 @@
 # method is currently implemented; that's all we need. We don't want to
 # depend on IO::Tee just for this.
 
+# The package is enhanced to add timestamp and elapsed time decorations to
+# the log file traces sent through this interface from Test::More.
+
 package PostgreSQL::Test::SimpleTee;
 use strict;
 use warnings;
 
+use Time::HiRes qw(time);
+
+my $last_time;
+
+BEGIN { $last_time = time; }
+
+sub _time_str
+{
+	my $tm = time;
+	my $diff = $tm - $last_time;
+	$last_time = $tm;
+	my ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) =
+	  localtime($tm);
+	my $usec = int(1000 * ($tm - int($tm)));
+	return sprintf("[%.2d:%.2d:%.2d.%.3d](%.3fs) ",
+		$hour, $min, $sec, $usec, $diff);
+}
+
 sub TIEHANDLE
 {
 	my $self = shift;
@@ -24,10 +45,16 @@ sub PRINT
 {
 	my $self = shift;
 	my $ok   = 1;
+	# The first file argument passed to tiehandle in PostgreSQL::Test::Utils is
+	# the original stdout, which is what PROVE sees. Additional decorations
+	# confuse it, so only put out the time string on files after the first.
+	my $skip = 1;
+	my $ts = _time_str;
 	for my $fh (@$self)
 	{
-		print $fh @_ or $ok = 0;
+		print $fh ($skip ? "" : $ts), @_ or $ok = 0;
 		$fh->flush   or $ok = 0;
+		$skip = 0;
 	}
 	return $ok;
 }


Re: why pg_walfile_name() cannot be executed during recovery?

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 9:31 AM Bharath Rupireddy
 wrote:
> Fundamental question - should the pg_walfile_{name, name_offset} check
> whether the file with the computed WAL file name exists on the server
> right now or ever existed earlier? Right now, they don't do that, see
> [1].

I don't think that checking whether the file exists is the right
approach. However, I do think that it's important to be precise about
which TLI is going to be used. I think it would be reasonable to
redefine this function (on both the primary and the standby) so that
the TLI that is used is the one that was in effect at the time record
at the given LSN was either written or replayed. Then, you could
potentially use this function to figure out whether you still have the
WAL files that are needed to replay up to some previous point in the
WAL stream. However, what about the segments where we switched from
one TLI to the next in the middle of the segment? There, you probably
need both the old and the new segments, or maybe if you're trying to
stream them you only need the new one because we have some weird
special case that will send the segment from the new timeline when the
segment from the old timeline is requested. So you couldn't just call
this function on one LSN per segment and call it good, and it wouldn't
necessarily be the case that the filenames you got back were exactly
the ones you needed.

So I'm not entirely sure this proposal is good enough, but it at least
would have the advantage of meaning that the filename you get back is
one that existed at some point in time and somebody used it for
something.
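
For concreteness, the file name in question is a pure function of (timeline
ID, LSN, segment size), roughly as in the sketch below; the helper name is
made up and this is illustrative only, but it shows that the whole question
above is which TLI gets plugged in:

#include "postgres.h"

#include "access/xlog.h"
#include "access/xlog_internal.h"

/*
 * Illustrative only: compute the WAL file name corresponding to an LSN on a
 * given timeline.  "fname" must have room for MAXFNAMELEN bytes.
 */
static void
walfile_name_for_tli(TimeLineID tli, XLogRecPtr lsn, char *fname)
{
	XLogSegNo	segno;

	/* Which segment does this LSN fall into? */
	XLByteToSeg(lsn, segno, wal_segment_size);

	/* Format the 24-character name using the timeline we were handed. */
	XLogFileName(fname, tli, segno, wal_segment_size);
}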

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Atomic rename feature for Windows.

2022-04-08 Thread Greg Stark
On Thu, 9 Dec 2021 at 23:36, Tom Lane  wrote:
>
> I'm not for dropping support for some platform just because it's old.

I guess I'll have to spin up the Vax again :)




GSOC: New and improved website for pgjdbc (JDBC) (2022)

2022-04-08 Thread S.R Keshav



postgreSql_ New and improved website for pgjdbc (JDBC) (2022).pdf
Description: Adobe PDF document


Re: remove more archiving overhead

2022-04-08 Thread Robert Haas
On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart  wrote:
> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
> > Yes.  I found that a crash at an unfortunate moment can produce multiple
> > links to the same file in pg_wal, which seemed bad independent of archival.
> > By fixing that (i.e., switching from durable_rename_excl() to
> > durable_rename()), we not only avoid this problem, but we also avoid trying
> > to archive a file the server is concurrently writing.  Then, after a crash,
> > the WAL file to archive should either not exist (which is handled by the
> > archiver) or contain the same contents as any preexisting archives.
>
> I moved the fix for this to a new thread [0] since I think it should be
> back-patched.  I've attached a new patch that only contains the part
> related to reducing archiving overhead.

While we're now after the feature freeze and thus this will need to
wait for v16, it looks like a reasonable change to me.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: REINDEX blocks virtually any queries but some prepared queries.

2022-04-08 Thread Frédéric Yhuel




On 4/8/22 02:22, Michael Paquier wrote:

On Thu, Apr 07, 2022 at 05:29:36PM +0200, Guillaume Lelarge wrote:

On Thu, Apr 7, 2022 at 15:44, Frédéric Yhuel wrote:

On 4/7/22 14:40, Justin Pryzby wrote:
Thank you Justin! I applied your fixes in the v2 patch (attached).


v2 patch sounds good.


The location of the new sentence and its wording seem fine to me.  So
no objections from me to add what's suggested, as suggested.  I'll
wait for a couple of days first.



Thank you Michael.


Indeed ;) That being said, REINDEX CONCURRENTLY could give you an
invalid index, so sometimes you may be tempted to go for a simpler
REINDEX, especially if you believe that the SELECTs won't be blocked.


Agreed.


There are many factors to take into account: one is more expensive
than the other in terms of resources and has downsides, downsides that are
compensated by the reduction in the lock requirements.  There are
cases where REINDEX is a must-have, as CONCURRENTLY does not support
catalog indexes, and these tend to be easily noticed when corruption
spreads around.


Indeed!

Best regards,
Frédéric




Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Robert Haas
On Thu, Apr 7, 2022 at 2:30 PM Nathan Bossart  wrote:
> Presently, WAL recycling uses durable_rename_excl(), which notes that a
> crash at an unfortunate moment can result in two links to the same file.
> My testing [1] demonstrated that it was possible to end up with two links
> to the same file in pg_wal after a crash just before unlink() during WAL
> recycling.  Specifically, the test produced links to the same file for the
> current WAL file and the next one because the half-recycled WAL file was
> re-recycled upon restarting.  This seems likely to lead to WAL corruption.

Wow, that's bad.

> The attached patch prevents this problem by using durable_rename() instead
> of durable_rename_excl() for WAL recycling.  This removes the protection
> against accidentally overwriting an existing WAL file, but there shouldn't
> be one.

I see that durable_rename_excl() has the following comment: "Similar
to durable_rename(), except that this routine tries (but does not
guarantee) not to overwrite the target file." If those are the desired
semantics, we could achieve them more simply and more safely by just
trying to stat() the target file and then, if it's not found, call
durable_rename(). I think that would be a heck of a lot safer than
what this function is doing right now.
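
To make that concrete, a minimal sketch of the stat-then-durable_rename()
approach could look like the following; the function name and the choice to
report an error when the target exists are my own assumptions, not code from
any patch:

#include "postgres.h"

#include <errno.h>
#include <sys/stat.h>

#include "storage/fd.h"

/*
 * Illustrative sketch: durably rename oldfile to newfile, but refuse to
 * clobber an existing target.  There is still a window between the stat()
 * and the rename(), so this only "tries" not to overwrite.
 */
static int
durable_rename_if_absent(const char *oldfile, const char *newfile, int elevel)
{
	struct stat st;

	if (stat(newfile, &st) == 0)
	{
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not rename \"%s\" to \"%s\": file exists",
						oldfile, newfile)));
		return -1;
	}
	if (errno != ENOENT)
	{
		ereport(elevel,
				(errcode_for_file_access(),
				 errmsg("could not stat file \"%s\": %m", newfile)));
		return -1;
	}

	/* Target is absent, so a plain durable rename is all that is needed. */
	return durable_rename(oldfile, newfile, elevel);
}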

I'd actually be in favor of nuking durable_rename_excl() from orbit
and putting the file-exists tests in the callers. Otherwise, someone
might assume that it actually has the semantics that its name
suggests, which could be pretty disastrous. If we don't want to do
that, then I'd changing to do the stat-then-durable-rename thing
internally, so we don't leave hard links lying around in *any* code
path. Perhaps that's the right answer for the back-branches in any
case, since there could be third-party code calling this function.

Your proposed fix is OK if we don't want to do any of that stuff, but
personally I'm much more inclined to blame durable_rename_excl() for
being horrible than I am to blame the calling code for using it
improvidently.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Thom Brown
On Fri, 8 Apr 2022, 14:36 Robert Haas,  wrote:

> On Wed, Apr 6, 2022 at 8:15 AM Robert Haas  wrote:
> > On Tue, Apr 5, 2022 at 8:43 PM Thom Brown  wrote:
> > > I share your discomfort with the wording.  How about:
> > >
> > > WAL records must be kept on standby until they are ready to be applied.
> > > Therefore, longer delays will result in a greater accumulation of WAL
> files,
> > > increasing disk space requirements for the standby's
> pg_wal
> > > directory.
> >
> > Looks awesome.
>
> Here that is in patch form. I feel that the feature freeze should not
> preclude committing this documentation improvement, but if someone
> feels otherwise, then I will leave this until the tree reopens.
>

Thanks. This doesn't include my self-correction:

s/kept on standby/kept on the standby/

Thom

>


Size of pg_rewrite (Was: Report checkpoint progress with pg_stat_progress_checkpoint)

2022-04-08 Thread Matthias van de Meent
On Sat, 19 Mar 2022 at 01:15, Andres Freund  wrote:
> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>
> pg_rewrite is the second biggest relation in an empty database already...

Yeah, that's not great. Thanks for nerd-sniping me into looking into
how views and pg_rewrite rules work, that was very interesting and I
learned quite a lot.

# Immediately potential, limited to progress views

I noticed that the CASE-WHEN (used in translating progress stage index
to stage names) in those progress reporting views can be more
efficiently described (althoug with slightly worse behaviour around
undefined values) using text array lookups (as attached). That
resulted in somewhat smaller rewrite entries for the progress views
(toast compression was good old pglz):

template1=# SELECT sum(octet_length(ev_action)),
SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE
ev_class::regclass::text LIKE '%progress%';

master:
  sum  |  sum
---+---
 97277 | 19956
patched:
  sum  |  sum
---+---
 77069 | 18417

So this seems like a nice improvement of 20% uncompressed / 7% compressed.

I tested various cases of phase number to text translations: `CASE ..
WHEN`; `(ARRAY[]::text[])[index]` and `('{}'::text[])[index]`. See
results below:

postgres=# create or replace view arrayliteral_view as select
(ARRAY['a','b','c','d','e','f']::text[])[index] as name from tst
s(index);
CREATE VIEW
postgres=# create or replace view stringcast_view as select
('{a,b,c,d,e,f}'::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view split_stringcast_view as select
(('{a,b,' || 'c,d,e,f}')::text[])[index] as name from tst s(index);
CREATE VIEW
postgres=# create or replace view case_view as select case index when
0 then 'a' when 1 then 'b' when 2 then 'c' when 3 then 'd' when 4 then
'e' when 5 then 'f' end as name from tst s(index);
CREATE VIEW


postgres=# postgres=# select ev_class::regclass::text,
octet_length(ev_action), pg_column_size(ev_action) from pg_rewrite
where ev_class in ('arrayliteral_view'::regclass::oid,
'case_view'::regclass::oid, 'split_stringcast_view'::regclass::oid,
'stringcast_view'::regclass::oid);
   ev_class| octet_length | pg_column_size
---+--+
 arrayliteral_view | 3311 |   1322
 stringcast_view   | 2610 |   1257
 case_view | 5170 |   1412
 split_stringcast_view | 2847 |   1350

It seems to me that we could consider replacing the CASE statements
with array literals and lookups if we really value our template
database size. But, as text literal concatenations don't seem to get
constant folded before storing them in the rules table, this rewrite
of the views would result in long lines in the system_views.sql file,
or we'd have to deal with the additional overhead of the append
operator and cast nodes.

# Future work; nodeToString / readNode, all rewrite rules

Additionally, we might want to consider other changes like default (or
empty value) elision in nodeToString, if that is considered a
reasonable option and if we really want to reduce the size of the
pg_rewrite table.

I think a lot of space can be recovered from that: A manual removal of
what seemed to be fields with default values (and the removal of all
query location related fields) in the current definition of
pg_stat_progress_create_index reduces its uncompressed size from
23226B raw and 4204B compressed to 13821B raw and 2784B compressed,
for an on-disk space saving of 33% for this view's ev_action.

Do note, however, that that would add significant branching in the
nodeToString and readNode code, which might slow down that code
significantly. I'm not planning on working on that; but in my opinion
that is a viable path to reducing the size of new database catalogs.
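
For illustration, the kind of default-eliding variant of the outfuncs.c field
macros that such a change would need might look roughly like this (the macro
name is made up; like the existing WRITE_* macros it assumes "str" and "node"
are in scope, and readfuncs.c would have to learn to tolerate missing fields):

/* Emit the field only when it differs from its default value. */
#define WRITE_INT_FIELD_SKIP_DEFAULT(fldname, defval) \
	do { \
		if (node->fldname != (defval)) \
			appendStringInfo(str, " :" CppAsString(fldname) " %d", \
							 node->fldname); \
	} while (0)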


-Matthias

PS. attached patch is not to be considered complete - it is a minimal
example of the array literal form. It fails regression tests because I
didn't bother updating or including the regression tests on system
views.
Index: src/backend/catalog/system_views.sql
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===
diff --git a/src/backend/catalog/system_views.sql 
b/src/backend/catalog/system_views.sql
--- a/src/backend/catalog/system_views.sql  (revision 
32723e5fabcc7db1bf4e897baaf0d251b500c1dc)
+++ b/src/backend/catalog/system_views.sql  (date 1649160138886)
@@ -1120,13 +1120,7 @@
 SELECT
 S.pid AS pid, S.datid AS datid, D.datname AS datname,
 CAST(S.relid AS oid) AS relid,
-CASE S.param1 WHEN 0 THEN 'initializing'
-  WHEN 1 THEN 'acquiring sample rows'
-  WHEN 2 THEN 'acquiring inherited sample rows'
-  WHEN 3 THEN 'computing statistics'
-  WHEN 4 THEN 'computing

Re: Mingw task for Cirrus CI

2022-04-08 Thread Melih Mutlu
Hi Andrew,

> You should set MSYSTEM=UCRT64 in the environment section. Given that,
> there should be no need to specify a --host= setting for configure.
>

It's set to UCRT64 on the docker image side [1]. I didn't know --host isn't
necessary in the UCRT64 environment. I'll remove it then.

 [1]
 
https://github.com/anarazel/pg-vm-images/blob/main/docker/windows_ci_mingw64#L11


Best,
Melih


Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 8:21 AM Stephen Frost  wrote:
> Added an explicit 'environment' option to allow for, basically, existing
> behavior, where we don't mess with the environment variable at all,
> though I kept the default as MEMORY since I don't think it's really
> typical that folks actually want regular user backends to inherit the
> credential cache of the server.
>
> Added a few more tests and updated the documentation too.  Sadly, seems
> we've missed the deadline for v15 though for lack of feedback on these.
> Would really like to get some other folks commenting as these are new
> pg_hba and postgresql.conf options being added.

Hi,

I don't think this patch is quite baked enough to go in even if the
deadline hadn't formally passed, but I'm happy to offer a few opinions
... especially if we can also try to sort out a plan for getting that
wider-checksums thing you mentioned done for v16.

 + /* gssencmode is also libpq option, same to above. */
+ {"gssencmode", UserMappingRelationId, true},

I really hate names like this that are just a bunch of stuff strung
together with no punctuation and some arbitrary abbreviations thrown
in for good measure. But since the libpq parameter already exists it's
hard to argue we should do anything else here.

+  allow_cred_delegation

First, I again recommend not choosing words at random to abbreviate.
"delegate_credentials" would be shorter and clearer. Second, I think
we need to decide whether we envision just having one parameter here
for every kind of credential delegation that libpq might ever support,
or whether this is really something specific to GSS. If the latter,
the name should mention GSS.

I also suggest that the default value of this option should be false,
rather than true. I would be unhappy if ssh started defaulting to
ForwardAgent=yes, because that's less secure and I don't want my
credentials shared with random servers without me making a choice to
do that. Similarly here I think we should default to the more secure
option.

+  
+   
+Sets the location of the Kerberos credential cache to be used for
+regular user backends which go through authentication.  The default is
+MEMORY:, which is where delegated credentials
+are stored (and is otherwise empty).  Care should be used when changing
+this value- setting it to a file-based credential cache will mean that
+user backends could potentially use any credentials stored to access
+other systems.
+If this parameter is set to an empty string, then the variable will be
+ explicit un-set and the system-dependent default is used, which may be a
+ file-based credential cache with the same caveats as previously
+ mentioned.  If the special value 'environment' is used, then the variable
+ is left untouched and will be whatever was set in the environment at
+ startup time.

"MEMORY:" seems like a pretty weird choice of arbitrary string. Is it
supposed to look like a Windows drive letter or pseudo-device, or
what? I'm not sure exactly what's better here, but I just think this
doesn't look like anything else we've got today. And then we've got a
second special environment, "environment", which looks completely
different: now it's lower-case and without the colon. And then empty
string is special too.

I wonder whether we really need quite this many cases. But if we do they
probably need better and more consistent naming.

The formatting here also looks weird.

+#ifndef PG_KRB_USER_CCACHE
+#define PG_KRB_USER_CCACHE "MEMORY:"
+#endif

At the risk of stating the obvious, the general idea of a #define is
that you define things in one place and then use the defined symbol
rather than the original value everywhere. This patch takes the
less-useful approach of defining two different symbols for the same
string in different files. This one has this #ifndef/#endif guard here
which I think it probably shouldn't have, since the choice of string
probably shouldn't be compile-time configurable, but it also won't
work, because there's no similar guard in the other file.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Mingw task for Cirrus CI

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-07, Andres Freund wrote:

> Since dash won't help us to get the build time down sufficiently, and the
> tests don't pass without a separate build tree, I looked at what makes
> config/prep_buildtree so slow.

Maybe we can replace prep_buildtree with a Perl script.  Surely that
should be faster.

-- 
Álvaro HerreraBreisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Sallah, I said NO camels! That's FIVE camels; can't you count?"
(Indiana Jones)




Re: Mark all GUC variable as PGDLLIMPORT

2022-04-08 Thread Julien Rouhaud
On Fri, Apr 08, 2022 at 03:04:18PM +0200, Magnus Hagander wrote:
> On Fri, Apr 8, 2022 at 2:42 PM Robert Haas  wrote:
> 
> > On Wed, Apr 6, 2022 at 7:56 PM Michael Paquier 
> > wrote:
> > > On Wed, Apr 06, 2022 at 12:57:29AM +0700, John Naylor wrote:
> > > > For these two patches, I'd say a day or two after feature freeze is a
> > > > reasonable goal.
> > >
> > > Yeah.  For patches as invasive as the PGDLLIMPORT business and the
> > > frontend error refactoring, I am also fine to have two exceptions with
> > > the freeze deadline.
> >
> > Done now.
> >
> 
> \o/

Woohoo!  Thanks a lot!




Re: Commitfest Closed

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 5:58 AM Alvaro Herrera  wrote:
> Thanks for herding through the CF!

+1

-- 
Peter Geoghegan




Re: [COMMITTERS] pgsql: Allow time delayed standbys and recovery

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 10:45 AM Thom Brown  wrote:
> Thanks. This doesn't include my self-correction:
>
> s/kept on standby/kept on the standby/

Here is v2, endeavoring to rectify that oversight.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


v2-0001-docs-Note-the-recovery_min_apply_delay-bloats-pg_.patch
Description: Binary data


Re: [Proposal] vacuumdb --schema only

2022-04-08 Thread Gilles Darold

On 08/04/2022 at 02:46, Justin Pryzby wrote:

On Wed, Apr 06, 2022 at 07:43:42PM +0200, Gilles Darold wrote:

Thanks for the review, all these changes are available in new version v6
of the patch and attached here.

This is failing in CI (except on macos, which is strangely passing).
http://cfbot.cputube.org/gilles-darold.html

https://api.cirrus-ci.com/v1/artifact/task/5379693443547136/log/src/bin/scripts/tmp_check/log/regress_log_100_vacuumdb

not ok 59 - vacuumdb --schema "Foo" postgres exit code 0

#   Failed test 'vacuumdb --schema "Foo" postgres exit code 0'
#   at t/100_vacuumdb.pl line 151.
not ok 60 - vacuumdb --schema schema only: SQL found in server log

#   Failed test 'vacuumdb --schema schema only: SQL found in server log'
#   at t/100_vacuumdb.pl line 151.
#   '2022-04-06 18:15:36.313 UTC [34857][not initialized] 
[[unknown]][:0] LOG:  connection received: host=[local]
# 2022-04-06 18:15:36.314 UTC [34857][client backend] [[unknown]][3/2801:0] 
LOG:  connection authorized: user=postgres database=postgres 
application_name=100_vacuumdb.pl
# 2022-04-06 18:15:36.318 UTC [34857][client backend] 
[100_vacuumdb.pl][3/2802:0] LOG:  statement: SELECT 
pg_catalog.set_config('search_path', '', false);
# 2022-04-06 18:15:36.586 UTC [34857][client backend] [100_vacuumdb.pl][:0] 
LOG:  disconnection: session time: 0:00:00.273 user=postgres database=postgres 
host=[local]
# '
# doesn't match '(?^:VACUUM "Foo".bar)'



OK, got it with the help of rjuju. Actually it was compiling fine using
gcc, but clang gives some warnings. Fixing these warnings makes CI happy.



Attached v7 of the patch that should pass cfbot.

--
Gilles Darold
diff --git a/doc/src/sgml/ref/vacuumdb.sgml b/doc/src/sgml/ref/vacuumdb.sgml
index 956c0f01cb..0de001ef24 100644
--- a/doc/src/sgml/ref/vacuumdb.sgml
+++ b/doc/src/sgml/ref/vacuumdb.sgml
@@ -39,6 +39,40 @@ PostgreSQL documentation
dbname
   
 
+  
+   vacuumdb
+   connection-option
+   option
+
+   
+
+ 
+   
+
+ 
+  -n
+  --schema
+ 
+ schema
+
+   
+
+   
+
+ 
+  -N
+  --exclude-schema
+ 
+ schema
+
+   
+ 
+
+   
+
+   dbname
+  
+
   
vacuumdb
connection-option
@@ -244,6 +278,28 @@ PostgreSQL documentation
   
  
 
+ 
+  -n schema
+  --schema=schema
+  
+   
+Clean or analyze all tables in schema only.
+Multiple schemas can be vacuumed by writing multiple -n switches.
+   
+  
+ 
+
+ 
+  -N schema
+  --exclude-schema=schema
+  
+   
+Clean or analyze all tables NOT in schema.
+Multiple schemas can be excluded from the vacuum by writing multiple -N switches.
+   
+  
+ 
+
  
   --no-index-cleanup
   
@@ -619,6 +675,14 @@ PostgreSQL documentation
 $ vacuumdb --analyze --verbose --table='foo(bar)' xyzzy
 
 
+   
+To clean all tables in the foo and bar schemas
+only in a database named xyzzy:
+
+$ vacuumdb --schema='foo' --schema='bar' xyzzy
+
+
+
  
 
  
diff --git a/src/bin/scripts/t/100_vacuumdb.pl b/src/bin/scripts/t/100_vacuumdb.pl
index 96a818a3c1..7bbfb97246 100644
--- a/src/bin/scripts/t/100_vacuumdb.pl
+++ b/src/bin/scripts/t/100_vacuumdb.pl
@@ -103,6 +103,8 @@ $node->safe_psql(
   CREATE TABLE funcidx (x int);
   INSERT INTO funcidx VALUES (0),(1),(2),(3);
   CREATE INDEX i0 ON funcidx ((f1(x)));
+  CREATE SCHEMA "Foo";
+  CREATE TABLE "Foo".bar(id int);
 |);
 $node->command_ok([qw|vacuumdb -Z --table="need""q(uot"(")x") postgres|],
 	'column list');
@@ -146,5 +148,15 @@ $node->issues_sql_like(
 	[ 'vacuumdb', '--min-xid-age', '2147483001', 'postgres' ],
 	qr/GREATEST.*relfrozenxid.*2147483001/,
 	'vacuumdb --table --min-xid-age');
+$node->issues_sql_like(
+	[ 'vacuumdb', '--schema', '"Foo"', 'postgres' ],
+	qr/VACUUM "Foo".bar/,
+	'vacuumdb --schema schema only');
+$node->command_fails(
+	[ 'vacuumdb',   '-n', 'pg_catalog', '-t', 'pg_class', 'postgres' ],
+	'cannot vacuum all tables in schema(s) and specific table(s) at the same time');
+$node->command_fails(
+	[ 'vacuumdb',   '-n', 'pg_catalog', '-N',  '"Foo"', 'postgres' ],
+	'cannot use option -n | --schema and -N | --exclude-schema at the same time');
 
 done_testing();
diff --git a/src/bin/scripts/t/101_vacuumdb_all.pl b/src/bin/scripts/t/101_vacuumdb_all.pl
index 1dcf411767..b122c995b1 100644
--- a/src/bin/scripts/t/101_vacuumdb_all.pl
+++ b/src/bin/scripts/t/101_vacuumdb_all.pl
@@ -15,5 +15,8 @@ $node->issues_sql_like(
 	[ 'vacuumdb', '-a' ],
 	qr/statement: VACUUM.*statement: VACUUM/s,
 	'vacuum all databases');
+$node->command_fails(
+	[ 'vacuumdb', '-a',  '-n', 'pg_catalog' ],
+	'cannot vacuum specific schema(s) in all databases');
 
 done_testing();
diff --git a/src/bin/scripts/vacuumdb.c b/src/bin/scripts/vacuumdb.c
index 4f6917fd39..f118b05169 100644
--- a/src/bin/scripts/vacuumdb.c
+++ b/src/bin/scripts/vacu

Re: Commitfest Closed

2022-04-08 Thread Julien Rouhaud
On Fri, Apr 08, 2022 at 08:09:16AM -0700, Peter Geoghegan wrote:
> On Fri, Apr 8, 2022 at 5:58 AM Alvaro Herrera  wrote:
> > Thanks for herding through the CF!
> 
> +1

+1!




Re: Mingw task for Cirrus CI

2022-04-08 Thread Melih Mutlu
> On windows that makes prep_buildtree go from 42.4s to 5.8s for me.
>

I applied Andres's faster prep_buildtree changes and triggered some cirrus
runs.

Without these changes, preparing the build tree was taking around 42.3s
(sometimes even more) [1].
It seems like with these changes it drops to around 8s [2].

[1] https://cirrus-ci.com/task/6562493345562624
[2] https://cirrus-ci.com/task/4836843802853376

Best,
Melih


Re: Size of pg_rewrite

2022-04-08 Thread Dagfinn Ilmari Mannsåker
Matthias van de Meent  writes:

> But, as text literal concatenations don't seem to get constant folded
> before storing them in the rules table, this rewrite of the views
> would result in long lines in the system_views.sql file, or we'd have
> to deal with the additional overhead of the append operator and cast
> nodes.

There is no need to use the concatenation operator to split array
constants across multiple lines. Newlines are fine either inside the
string (between array elements), or between two string literals
(which become one string constant at parse time).

https://www.postgresql.org/docs/current/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS

ilmari@[local]:5432 ~=# select
ilmari@[local]:5432 ~-# '{foo,
ilmari@[local]:5432 ~'# bar}'::text[],
ilmari@[local]:5432 ~-# '{bar,'
ilmari@[local]:5432 ~-# 'baz}'::text[];
┌───┬───┐
│   text│   text│
├───┼───┤
│ {foo,bar} │ {bar,baz} │
└───┴───┘
(1 row)


- ilmari




Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Fri, Apr 8, 2022 at 8:21 AM Stephen Frost  wrote:
> > Added an explicit 'environment' option to allow for, basically, existing
> > behavior, where we don't mess with the environment variable at all,
> > though I kept the default as MEMORY since I don't think it's really
> > typical that folks actually want regular user backends to inherit the
> > credential cache of the server.
> >
> > Added a few more tests and updated the documentation too.  Sadly, seems
> > we've missed the deadline for v15 though for lack of feedback on these.
> > Would really like to get some other folks commenting as these are new
> > pg_hba and postgresql.conf options being added.
> 
> I don't think this patch is quite baked enough to go in even if the
> deadline hadn't formally passed, but I'm happy to offer a few opinions
> ... especially if we can also try to sort out a plan for getting that
> wider-checksums thing you mentioned done for v16.

Sure.

>  + /* gssencmode is also libpq option, same to above. */
> + {"gssencmode", UserMappingRelationId, true},
> 
> I really hate names like this that are just a bunch of stuff strung
> together with no punctuation and some arbitrary abbreviations thrown
> in for good measure. But since the libpq parameter already exists it's
> hard to argue we should do anything else here.

Well, yeah.

> +  allow_cred_delegation
> 
> First, I again recommend not choosing words at random to abbreviate.
> "delegate_credentials" would be shorter and clearer. Second, I think
> we need to decide whether we envision just having one parameter here
> for every kind of credential delegation that libpq might ever support,
> or whether this is really something specific to GSS. If the latter,
> the name should mention GSS.

delegate_credentials seems to imply that the server has some kind of
control over the act of delegating credentials, which isn't really the
case.  The client has to decide to delegate credentials and it does that
independent of the server- the server side just gets to either accept
those delegated credentials, or ignore them. 

In terms of having a prefix, this is certainly something that I'd like
to see SSPI support added for as well (perhaps that can be in v16 too)
and so it's definitely not GSS-exclusive among the authentication
methods that we have today.  In that sense, this option falls into the
same category as 'include_realm' and 'krb_realm' in that it applies to
more than one, but not all, of our authentication methods.

> I also suggest that the default value of this option should be false,
> rather than true. I would be unhappy if ssh started defaulting to
> ForwardAgent=yes, because that's less secure and I don't want my
> credentials shared with random servers without me making a choice to
> do that. Similarly here I think we should default to the more secure
> option.

This is a bit backwards from how it works though- this option is about
if the server will accept delegated credentials, not if the client sends
them.  If your client was set to ForwardAgent=yes, would you be happy if
the server's default was AllowAgentForwarding=no?  (At least on the
system I'm looking at, the current default is AllowAgentForwarding=yes
in sshd_config).

Regarding the client side, it is the case that GSSAPIDelegateCredentials
in ssh defaults to no, so it seems like the next iteration of the patch
should probably include a libpq option similar to that ssh_config
option.  As I mentioned before, users already can decide if they'd like
proxyable credentials or not when they kinit, though more generally this
is set as an environment-wide policy, but we can add an option and
disable it by default.

> +  
> +   
> +Sets the location of the Kerberos credential cache to be used for
> +regular user backends which go through authentication.  The default 
> is
> +MEMORY:, which is where delegated credentials
> +are stored (and is otherwise empty).  Care should be used when 
> changing
> +this value- setting it to a file-based credential cache will mean 
> that
> +user backends could potentially use any credentials stored to access
> +other systems.
> +If this parameter is set to an empty string, then the variable will 
> be
> + explicit un-set and the system-dependent default is used, which may be a
> + file-based credential cache with the same caveats as previously
> + mentioned.  If the special value 'environment' is used, then the variable
> + is left untouched and will be whatever was set in the environment at
> + startup time.
> 
> "MEMORY:" seems like a pretty weird choice of arbitrary string. Is it
> supposed to look like a Windows drive letter or pseudo-device, or
> what? I'm not sure exactly what's better here, but I just think this
> doesn't look like anything else we've got today. And then we've got a
> second special environment, "environment", which looks complet

Re: Atomic rename feature for Windows.

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 10:12 AM Greg Stark  wrote:
> On Thu, 9 Dec 2021 at 23:36, Tom Lane  wrote:
> > I'm not for dropping support for some platform just because it's old.
>
> I guess I'll have to spin up the Vax again :)

This is a pretty good summary of what's wrong with our current
deprecation policy. Like Tom, I kind of hate removing support for old
systems. But I've also come to realize that we often end up supporting
systems because there's one PostgreSQL developer who has access and
sets up a buildfarm member ... which tends to mean that we support all
the stuff that lots of people are using, plus a pretty random subset
of older systems that do funny things and most people can't access to
debug any problems that may occur. And that's kind of annoying.

(I don't have a specific proposal for what to do about it.)

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Size of pg_rewrite

2022-04-08 Thread Matthias van de Meent
On Fri, 8 Apr 2022 at 17:20, Dagfinn Ilmari Mannsåker  wrote:
>
> Matthias van de Meent  writes:
>
> > But, as text literal concatenations don't seem to get constant folded
> > before storing them in the rules table, this rewrite of the views
> > would result in long lines in the system_views.sql file, or we'd have
> > to deal with the additional overhead of the append operator and cast
> > nodes.
>
> There is no need to use the concatenation operator to split array
> constants across multiple lines. Newlines are fine either inside the
> string (between array elements), or between two string string literals
> (which become one string constant at parse time).

Ah, neat, that saves some long lines in the system_views file. I had
already tried the "auto-concatenate two consecutive string literals",
but that attempt failed in initdb, so now I'm not sure what happened
there.

Thanks!

-Matthias




Re: Kerberos delegation support in libpq and postgres_fdw

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 11:29 AM Stephen Frost  wrote:
> > +  allow_cred_delegation
> >
> > First, I again recommend not choosing words at random to abbreviate.
> > "delegate_credentials" would be shorter and clearer. Second, I think
> > we need to decide whether we envision just having one parameter here
> > for every kind of credential delegation that libpq might ever support,
> > or whether this is really something specific to GSS. If the latter,
> > the name should mention GSS.
>
> delegate_credentials seems to imply that the server has some kind of
> control over the act of delegating credentials, which isn't really the
> case.  The client has to decide to delegate credentials and it does that
> independent of the server- the server side just gets to either accept
> those delegated credentials, or ignore them.

Oh ... I thought this was a libpq parameter to control the client
behavior. I guess I didn't read it carefully enough.

> Regarding the client side, it is the case that GSSAPIDelegateCredentials
> in ssh defaults to no, so it seems like the next iteration of the patch
> should probably include a libpq option similar to that ssh_config
> option.  As I mentioned before, users already can decide if they'd like
> proxyable credentials or not when they kinit, though more generally this
> is set as a environment-wide policy, but we can add an option and
> disable it by default.

+1.

> This isn't actually something we have a choice in, really, it's from the
> Kerberos library.  MEMORY is the library's in-memory credential cache.
> Other possible values are FILE:/some/file, DIR:/some/dir, API:, and
> others.  Documentation is available here:
> https://web.mit.edu/kerberos/krb5-1.12/doc/basic/ccache_def.html

Well, I was just going by the fact that this string ("MEMORY:") seems
to be interpreted in our code, not in the library.

> > I wonder whether we really need quite this many cases. But if we do they
> > probably need better and more consistent naming.
>
> I wouldn't want to end up with values that could end up conflicting with
> real values that a user might want to specify, so the choice of
> 'environment' and empty-value were specifically chosen to avoid that
> risk.  If we're worried that doing so isn't sufficient or is too
> confusing, the better option would likely be to have another GUC that
> controls if we unset, ignore, or set the value to what the other GUC
> says to set it to.  I'm fine with that if you agree.

Yeah, I thought of that, and it might be the way to go. I wasn't too
sure we needed the explicit-unset behavior as an option, but I defer
to you on that.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Commitfest Closed

2022-04-08 Thread Greg Stark
I moved to next CF almost all the Needs Review and Waiting on Author patches.

The remaining ones are either:

1) Bug fixes, Documentation, or testing patches that we may want to
make Open Issues

2) Patches that look like we may want to mark Rejected or Returned
with Feedback and start a new discussion

3) Patches whose email history confused me, such as where multiple
patches are under discussion

I also haven't gone through the Ready for Committer patches yet. I'll
do that at the end of the day.

Incidentally I marked a lot of the Waiting on Author patches as Needs
Review before moving to the next CF because generally I think they
were only Waiting on Author because of the cfbot failures and they
were waiting on design feedback.

Also, as another aside, I find a lot of the patches that haven't been
reviewed were patches that were posted without any specific concerns
or questions. That tends to imply the author thinks the patch is ready
and just waiting on a comprehensive review which is a daunting task.

I would suggest that if you're an author posting a WIP and you have some
specific uncertainties about the patch, asking about
them would encourage reviewers to dive in and help you make progress.




Re: Atomic rename feature for Windows.

2022-04-08 Thread Greg Stark
On Fri, 8 Apr 2022 at 11:30, Robert Haas  wrote:
>
> On Fri, Apr 8, 2022 at 10:12 AM Greg Stark  wrote:
> > On Thu, 9 Dec 2021 at 23:36, Tom Lane  wrote:
> > > I'm not for dropping support for some platform just because it's old.
> >
> > I guess I'll have to spin up the Vax again :)
>
> This is a pretty good summary of what's wrong with our current
> deprecation policy.

I didn't intend it that way but, ok.

> Like Tom, I kind of hate removing support for old
> systems. But I've also come to realize that we often end up supporting
> systems because there's one PostgreSQL developer who has access and
> sets up a buildfarm member ... which tends to mean that we support all
> the stuff that lots of people are using, plus a pretty random subset
> of older systems that do funny things and most people can't access to
> debug any problems that may occur. And that's kind of annoying.

Generally I think supporting older systems that do funny things is
helpful in avoiding problems that either 1) can happen on newer
systems, but rarely, 2) can happen on other systems that people are
using but we don't know about and aren't testing, or 3) can happen on
future systems or future compilers and we might not even find out
about.

But that's useful for some things and not for others. Like, it's
useful to be sure we don't have odd dependencies on timing quirks of
the specific machines that are currently common, or depend on gcc/llvm
compiler behaviour that isn't guaranteed. But less so for supporting
some quirky filesystem behaviour on Windows 8 that newer Windows
doesn't have and Unix guarantees not to have. (Or supporting non-IEEE
Vax FP now that we've decided we just don't any more).

-- 
greg




Re: API stability [was: pgsql: Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.]

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 4:47 AM Markus Wanner
 wrote:
> I agree with Michael, it would be nice to not duplicate the code, but
> use a common underlying method.  A modified patch is attached.

I don't think this is better, but I don't think it's worth arguing
about, either, so I'll do it this way if nobody objects.

Meanwhile, I've committed the patch for master to master.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Atomic rename feature for Windows.

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 11:45 AM Greg Stark  wrote:
> But that's useful for some things and not for others. Like, it's
> useful to be sure we don't have odd dependencies on timing quirks of
> the specific machines that are currently common, or depend on gcc/llvm
> compiler behaviour that isn't guaranteed. But less so for supporting
> some quirky filesystem behaviour on Windows 8 that newer Windows
> doesn't have and Unix guarantees not to have. (Or supporting non-IEEE
> Vax FP now that we've decided we just don't any more).

Yeah, exactly.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: shared-memory based stats collector

2022-04-08 Thread Andres Freund
Hi, 

On April 8, 2022 4:49:48 AM PDT, Ranier Vilela  wrote:
>Hi,
>
>Per Coverity.
>
>pgstat_reset_entry does not check if the lock really blocked.
>I think if shared_stat_reset_contents is called without the lock,
>isn't that an issue?

I don't think so - the nowait parameter is set to false, so the lock
acquisition is blocking.

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Temporary file access API

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 4:54 AM Antonin Houska  wrote:
> Do you think that the use of a system call is a problem itself (e.g. because
> the code looks less simple if read/write is used somewhere and fread/fwrite
> elsewhere; of course of read/write is mandatory in special cases like WAL,
> heap pages, etc.)  or is the problem that the system calls are used too
> frequently? I suppose only the latter.

I'm not really super-interested in reducing the number of system
calls. It's not a dumb thing in which to be interested and I know that
for example Thomas Munro is very interested in it and has done a bunch
of work in that direction just to improve performance. But for me the
attraction of this is mostly whether it gets us closer to TDE.

And that's why I'm asking these questions about adopting it in
different places. I kind of thought that your previous patches needed
to encrypt, I don't know, 10 or 20 different kinds of files. So I was
surprised to see this patch touching the handling of only 2 kinds of
files. If we consolidate the handling of let's say 15 of 20 cases into
a single mechanism, we've really moved the needle in the right
direction -- but consolidating the handling of 2 of 20 cases, or
whatever the real numbers are, isn't very exciting.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: remove more archiving overhead

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 10:20:27AM -0400, Robert Haas wrote:
> On Thu, Apr 7, 2022 at 6:23 PM Nathan Bossart  
> wrote:
>> On Thu, Feb 24, 2022 at 09:55:53AM -0800, Nathan Bossart wrote:
>> > Yes.  I found that a crash at an unfortunate moment can produce multiple
>> > links to the same file in pg_wal, which seemed bad independent of archival.
>> > By fixing that (i.e., switching from durable_rename_excl() to
>> > durable_rename()), we not only avoid this problem, but we also avoid trying
>> > to archive a file the server is concurrently writing.  Then, after a crash,
>> > the WAL file to archive should either not exist (which is handled by the
>> > archiver) or contain the same contents as any preexisting archives.
>>
>> I moved the fix for this to a new thread [0] since I think it should be
>> back-patched.  I've attached a new patch that only contains the part
>> related to reducing archiving overhead.
> 
> While we're now after the feature freeze and thus this will need to
> wait for v16, it looks like a reasonable change to me.

Dang, just missed it.  Thanks for taking a look.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Size of pg_rewrite (Was: Report checkpoint progress with pg_stat_progress_checkpoint)

2022-04-08 Thread Andres Freund
Hi, 

On April 8, 2022 7:52:07 AM PDT, Matthias van de Meent 
 wrote:
>On Sat, 19 Mar 2022 at 01:15, Andres Freund  wrote:
>> pg_rewrite without pg_stat_progress_checkpoint: 745472, with: 753664
>>
>> pg_rewrite is the second biggest relation in an empty database already...
>
>Yeah, that's not great. Thanks for nerd-sniping me into looking into
>how views and pg_rewrite rules work, that was very interesting and I
>learned quite a lot.

Thanks for looking!


># Immediately potential, limited to progress views
>
>I noticed that the CASE-WHEN (used in translating progress stage index
>to stage names) in those progress reporting views can be more
>efficiently described (although with slightly worse behaviour around
>undefined values) using text array lookups (as attached). That
>resulted in somewhat smaller rewrite entries for the progress views
>(toast compression was good old pglz):
>
>template1=# SELECT sum(octet_length(ev_action)),
>SUM(pg_column_size(ev_action)) FROM pg_rewrite WHERE
>ev_class::regclass::text LIKE '%progress%';
>
>master:
>  sum  |  sum
>---+---
> 97277 | 19956
>patched:
>  sum  |  sum
>---+---
> 77069 | 18417
>
>So this seems like a nice improvement of 20% uncompressed / 7% compressed.
>
>I tested various cases of phase number to text translations: `CASE ..
>WHEN`; `(ARRAY[]::text[])[index]` and `('{}'::text[])[index]`. See
>results below:
>
>postgres=# create or replace view arrayliteral_view as select
>(ARRAY['a','b','c','d','e','f']::text[])[index] as name from tst
>s(index);
>CREATE VIEW
>postgres=# create or replace view stringcast_view as select
>('{a,b,c,d,e,f}'::text[])[index] as name from tst s(index);
>CREATE VIEW
>postgres=# create or replace view split_stringcast_view as select
>(('{a,b,' || 'c,d,e,f}')::text[])[index] as name from tst s(index);
>CREATE VIEW
>postgres=# create or replace view case_view as select case index when
>0 then 'a' when 1 then 'b' when 2 then 'c' when 3 then 'd' when 4 then
>'e' when 5 then 'f' end as name from tst s(index);
>CREATE VIEW
>
>
>postgres=# postgres=# select ev_class::regclass::text,
>octet_length(ev_action), pg_column_size(ev_action) from pg_rewrite
>where ev_class in ('arrayliteral_view'::regclass::oid,
>'case_view'::regclass::oid, 'split_stringcast_view'::regclass::oid,
>'stringcast_view'::regclass::oid);
>   ev_class| octet_length | pg_column_size
>---+--+
> arrayliteral_view | 3311 |   1322
> stringcast_view   | 2610 |   1257
> case_view | 5170 |   1412
> split_stringcast_view | 2847 |   1350
>
>It seems to me that we could consider replacing the CASE statements
>with array literals and lookups if we really value our template
>database size. But, as text literal concatenations don't seem to get
>constant folded before storing them in the rules table, this rewrite
>of the views would result in long lines in the system_views.sql file,
>or we'd have to deal with the additional overhead of the append
>operator and cast nodes.

My inclination is that the mapping functions should be C functions. There's
really no point in doing it in SQL and it comes at a noticeable price. And, if
done in C, we can fix mistakes in minor releases, which we can't in SQL.
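
A minimal sketch of such a C-level mapping function, written as it might look
in a loadable module (the function name is made up, and the phase names are
taken loosely from the ANALYZE progress view quoted earlier, so treat them as
illustrative):

#include "postgres.h"

#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

/* Phase names; the order must match the numeric phase values. */
static const char *const analyze_phase_names[] = {
	"initializing",
	"acquiring sample rows",
	"acquiring inherited sample rows",
	"computing statistics",
	"computing extended statistics",
	"finalizing analyze"
};

PG_FUNCTION_INFO_V1(pg_stat_progress_analyze_phase_name);

Datum
pg_stat_progress_analyze_phase_name(PG_FUNCTION_ARGS)
{
	int64		phase = PG_GETARG_INT64(0);

	if (phase < 0 || phase >= (int64) lengthof(analyze_phase_names))
		PG_RETURN_NULL();

	PG_RETURN_TEXT_P(cstring_to_text(analyze_phase_names[phase]));
}

The view would then call such a function on S.param1 instead of carrying the
whole CASE expression in its stored rewrite rule.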


># Future work; nodeToString / readNode, all rewrite rules
>
>Additionally, we might want to consider other changes like default (or
>empty value) elision in nodeToString, if that is considered a
>reasonable option and if we really want to reduce the size of the
>pg_rewrite table.
>
>I think a lot of space can be recovered from that: A manual removal of
>what seemed to be fields with default values (and the removal of all
>query location related fields) in the current definition of
>pg_stat_progress_create_index reduces its uncompressed size from
>23226B raw and 4204B compressed to 13821B raw and 2784B compressed,
>for an on-disk space saving of 33% for this view's ev_action.
>
>Do note, however, that that would add significant branching in the
>nodeToString and readNode code, which might slow down that code
>significantly. I'm not planning on working on that; but in my opinion
>that is a viable path to reducing the size of new database catalogs.

We should definitely be careful about that. I do agree that there's a lot of 
efficiency to be gained in the serialization format. Once we have the automatic 
node func generation in place, we could have one representation for human 
consumption, and one for density...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-04-08 Thread Zhihong Yu
On Fri, Apr 8, 2022 at 5:43 AM Justin Pryzby  wrote:

> On Wed, Apr 06, 2022 at 03:58:29PM +0900, Etsuro Fujita wrote:
> > I have committed the patch after modifying it as such.  (I think we
> > can improve these later, if necessary.)
>
> This patch seems to be causing the planner to crash.
> Here's a query reduced from sqlsmith.
>
> | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1
> <= pg_trigger_depth();
>
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at
> ../../../../src/include/nodes/pg_list.h:151
> 151 return l ? l->length : 0;
> (gdb) bt
> #0  0x55b4396a2edf in trivial_subqueryscan (plan=0x7f4219ed93b0) at
> ../../../../src/include/nodes/pg_list.h:151
> #1  0x55b43968af89 in mark_async_capable_plan 
> (plan=plan@entry=0x7f4219ed93b0,
> path=path@entry=0x7f4219e89538) at createplan.c:1132
> #2  0x55b439691924 in create_append_plan (root=root@entry=0x55b43affb2b0,
> best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at
> createplan.c:1329
> #3  0x55b43968fa21 in create_plan_recurse (root=root@entry=0x55b43affb2b0,
> best_path=best_path@entry=0x7f4219ed0cb8, flags=flags@entry=0) at
> createplan.c:421
> #4  0x55b43968f974 in create_projection_plan 
> (root=root@entry=0x55b43affb2b0,
> best_path=best_path@entry=0x7f4219ed0f60, flags=flags@entry=1) at
> createplan.c:2039
> #5  0x55b43968fa6f in create_plan_recurse (root=root@entry=0x55b43affb2b0,
> best_path=0x7f4219ed0f60, flags=flags@entry=1) at createplan.c:433
> #6  0x55b439690221 in create_plan (root=root@entry=0x55b43affb2b0,
> best_path=) at createplan.c:348
> #7  0x55b4396a1451 in standard_planner (parse=0x55b43af05e28,
> query_string=, cursorOptions=2048, boundParams=0x0) at
> planner.c:413
> #8  0x55b4396a19c1 in planner (parse=parse@entry=0x55b43af05e28,
> query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
> cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
> at planner.c:277
> #9  0x55b439790c78 in pg_plan_query 
> (querytree=querytree@entry=0x55b43af05e28,
> query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
> cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
> at postgres.c:883
> #10 0x55b439790d54 in pg_plan_queries (querytrees=0x55b43afdd528,
> query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();",
> cursorOptions=cursorOptions@entry=2048, boundParams=boundParams@entry=0x0)
> at postgres.c:975
> #11 0x55b439791239 in exec_simple_query
> (query_string=query_string@entry=0x55b43af04c40 "SELECT 1 FROM
> information_schema.constraint_column_usage WHERE 1 > pg_trigger_depth();")
> at postgres.c:1169
> #12 0x55b439793183 in PostgresMain (dbname=,
> username=) at postgres.c:4542
> #13 0x55b4396e6af7 in BackendRun (port=port@entry=0x55b43af2ffe0) at
> postmaster.c:4489
> #14 0x55b4396e9c03 in BackendStartup (port=port@entry=0x55b43af2ffe0)
> at postmaster.c:4217
> #15 0x55b4396e9e4a in ServerLoop () at postmaster.c:1791
> #16 0x55b4396eb401 in PostmasterMain (argc=7, argv=) at
> postmaster.c:1463
> #17 0x55b43962b4df in main (argc=7, argv=0x55b43aeff0c0) at main.c:202
>
> Actually, the original query failed like this:
> #2  0x55b4398e9f90 in ExceptionalCondition
> (conditionName=conditionName@entry=0x55b439a61238 "plan->scanstatus ==
> SUBQUERY_SCAN_UNKNOWN", errorType=errorType@entry=0x55b43994b00b
> "FailedAssertion",
> #3  0x55b4396a2ecf in trivial_subqueryscan (plan=0x55b43b59cac8) at
> setrefs.c:1367
>

Hi,
I logged the value of plan->scanstatus before the assertion :

2022-04-08 16:20:59.601 UTC [26325] LOG:  scan status 0
2022-04-08 16:20:59.601 UTC [26325] STATEMENT:  explain SELECT 1 FROM
information_schema.constraint_column_usage WHERE 1 <= pg_trigger_depth();
2022-04-08 16:20:59.796 UTC [26296] LOG:  server process (PID 26325) was
terminated by signal 11: Segmentation fault

It seems its value was SUBQUERY_SCAN_UNKNOWN.

Still trying to find out the cause for the crash.


Re: pgsql: Add TAP test for archive_cleanup_command and recovery_end_comman

2022-04-08 Thread Tom Lane
Andres Freund  writes:
> On 2022-04-07 13:57:45 -0400, Tom Lane wrote:
>> Yeah, with only one instance it could just be cosmic rays or something.

Not cosmic rays: skink has shown the same symptom three times running.
Looks like maybe the archive_cleanup_command itself is doing something
it shouldn't?

regards, tom lane




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent
 wrote:
> Yeah, I think we should definately support more line pointers on a
> heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
> the current value is the physical limit for heap tuples, as we have at
> most 1 heap tuple per line pointer and thus the MaxHeapTuplesPerPage
> won't change. A macro MaxHeapLinePointersPerPage would probably be
> more useful, which could be as follows (assuming we don't want to
> allow filling a page with effectively only dead line pointers):

That's a good point. Sounds like it might be the right approach.

I suppose that it will depend on how much use of MaxHeapTuplesPerPage
remains once it is split in two like this.
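
For reference, the current ceiling comes from charging every slot for a full
(aligned) tuple header plus its line pointer; the definition in
access/htup_details.h is essentially:

#define MaxHeapTuplesPerPage	\
	((int) ((BLCKSZ - SizeOfPageHeaderData) / \
			(MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

With the default 8kB BLCKSZ that works out to 291, whereas charging only
sizeof(ItemIdData) per slot would allow roughly 2000, so a separate
line-pointer bound would presumably land somewhere in between.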

-- 
Peter Geoghegan




Re: How to simulate sync/async standbys being closer/farther (network distance) to primary in core postgres?

2022-04-08 Thread SATYANARAYANA NARLAPURAM
On Fri, Apr 8, 2022 at 6:44 AM Bharath Rupireddy <
bharath.rupireddyforpostg...@gmail.com> wrote:

> On Wed, Apr 6, 2022 at 4:30 PM Ashutosh Bapat
>  wrote:
> >
> > On Tue, Apr 5, 2022 at 9:23 PM Bharath Rupireddy
> >  wrote:
> > >
> > > Hi,
> > >
> > > I'm thinking if there's a way in core postgres to achieve $subject. In
> > > reality, the sync/async standbys can either be closer/farther (which
> > > means sync/async standbys can receive WAL at different times) to
> > > primary, especially in cloud HA environments with primary in one
> > > Availability Zone(AZ)/Region and standbys in different AZs/Regions.
> > > $subject may not be possible on dev systems (say, for testing some HA
> > > features) unless we can inject a delay in WAL senders before sending
> > > WAL.
>

Simulation will be helpful even for end customers, to simulate faults in
production environments during availability zone/disaster recovery drills.



> > >
> > > How about having two developer-only GUCs {async,
> > > sync}_wal_sender_delay? When set, the async and sync WAL senders will
> > > delay sending WAL by {async, sync}_wal_sender_delay
> > > milliseconds/seconds? Although, I can't think of any immediate use, it
> > > will be useful someday IMO, say for features like [1], if it gets in.
> > > With this set of GUCs, one can even add core regression tests for HA
> > > features.
>

I would suggest doing this at the slot level, instead of two GUCs that
control the behavior of all the slots (physical/logical). Something like
"pg_suspend_replication_slot" and "pg_resume_replication_slot"?
Alternatively, a GUC on the standby side instead of the primary, so that the wal
receiver stops responding to the wal sender? This helps achieve the same as
above, but the granularity is now at the individual replica level.

Thanks,
Satya


Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> I see that durable_rename_excl() has the following comment: "Similar
> to durable_rename(), except that this routine tries (but does not
> guarantee) not to overwrite the target file." If those are the desired
> semantics, we could achieve them more simply and more safely by just
> trying to stat() the target file and then, if it's not found, call
> durable_rename(). I think that would be a heck of a lot safer than
> what this function is doing right now.

IIUC it actually does guarantee that you won't overwrite the target file
when HAVE_WORKING_LINK is defined.  If not, it provides no guarantees at
all.  Using stat() before rename() would therefore weaken this check for
systems with working link(), but it'd probably strengthen it for systems
without a working link().

> I'd actually be in favor of nuking durable_rename_excl() from orbit
> and putting the file-exists tests in the callers. Otherwise, someone
> might assume that it actually has the semantics that its name
> suggests, which could be pretty disastrous. If we don't want to do
> that, then I'd changing to do the stat-then-durable-rename thing
> internally, so we don't leave hard links lying around in *any* code
> path. Perhaps that's the right answer for the back-branches in any
> case, since there could be third-party code calling this function.

I think there might be another problem.  The man page for rename() seems to
indicate that overwriting an existing file also introduces a window where
the old and new path are hard links to the same file.  This isn't a problem
for the WAL files because we should never be overwriting an existing one,
but I wonder if it's a problem for other code paths.  My guess is that many
code paths that overwrite an existing file are first writing changes to a
temporary file before atomically replacing the original.  Those paths are
likely okay, too, as you can usually just discard any existing temporary
files.

> Your proposed fix is OK if we don't want to do any of that stuff, but
> personally I'm much more inclined to blame durable_rename_excl() for
> being horrible than I am to blame the calling code for using it
> improvidently.

I do agree that it's worth examining this stuff a bit closer.  I've
frequently found myself trying to reason about all the different states
that callers of these functions can produce, so any changes that help
simplify matters are a win in my book.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 6:17 AM Robert Haas  wrote:
> I agree that the value of 291 is pretty much accidental, but it also
> seems fairly generous to me. The bigger you make it, the more space
> you can waste. I must have missed (or failed to understand) previous
> discussions about why raising it would be a good idea.

What do you mean about wasting space? Wasting space on the stack? I
can't imagine you meant wasting space on the page, since being able to
accommodate more items on each heap page seems like it would be
strictly better, barring any unintended weird FSM issues.

As far as I know the only real downside to increasing it is the impact
on tidbitmap.c. Increasing the number of valid distinct TID values
might have a negative impact on performance during bitmap scans, which
will need to be managed. However, I don't think that increased stack
space usage will be a problem, with a little work. It either won't
matter at all (e.g. an array of offset numbers on the stack still
won't be very big), or it can be fixed locally where it turns out to
matter (like in lazy_scan_prune).

We used to routinely use MaxOffsetNumber for arrays of item offset
numbers. I cut down on that in the B-Tree code, reducing it to
MaxIndexTuplesPerPage (which is typically 407) in a few places. So
anything close to our current MaxIndexTuplesPerPage ought to be fine
for most individual arrays stored on the stack.
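
For context, the "accidental" 291 discussed above is just what the existing
macro arithmetic produces for 8kB pages; roughly, from access/htup_details.h
(quoted from memory, check the header for the authoritative form):

    #define MaxHeapTuplesPerPage \
        ((int) ((BLCKSZ - SizeOfPageHeaderData) / \
                (MAXALIGN(SizeofHeapTupleHeader) + sizeof(ItemIdData))))

It assumes every line pointer has a minimal (zero-column) heap tuple
attached: (8192 - 24) / (24 + 4) = 291 for the default block size.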

-- 
Peter Geoghegan




Re: Defer selection of asynchronous subplans until the executor initialization stage

2022-04-08 Thread Etsuro Fujita
Hi,

On Fri, Apr 8, 2022 at 9:43 PM Justin Pryzby  wrote:
> This patch seems to be causing the planner to crash.
> Here's a query reduced from sqlsmith.
>
> | explain SELECT 1 FROM information_schema.constraint_column_usage WHERE 1 <= 
> pg_trigger_depth();
>
> Program terminated with signal SIGSEGV, Segmentation fault.

Reproduced.  Will look into this.

Thanks for the report!

Best regards,
Etsuro Fujita




Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 09:53:12AM -0700, Nathan Bossart wrote:
> On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
>> I'd actually be in favor of nuking durable_rename_excl() from orbit
>> and putting the file-exists tests in the callers. Otherwise, someone
>> might assume that it actually has the semantics that its name
>> suggests, which could be pretty disastrous. If we don't want to do
>> that, then I'd changing to do the stat-then-durable-rename thing
>> internally, so we don't leave hard links lying around in *any* code
>> path. Perhaps that's the right answer for the back-branches in any
>> case, since there could be third-party code calling this function.
> 
> I think there might be another problem.  The man page for rename() seems to
> indicate that overwriting an existing file also introduces a window where
> the old and new path are hard links to the same file.  This isn't a problem
> for the WAL files because we should never be overwriting an existing one,
> but I wonder if it's a problem for other code paths.  My guess is that many
> code paths that overwrite an existing file are first writing changes to a
> temporary file before atomically replacing the original.  Those paths are
> likely okay, too, as you can usually just discard any existing temporary
> files.

Ha, so there are only a few callers of durable_rename_excl() in the
PostgreSQL tree.  One is basic_archive.c, which is already doing a stat()
check.  IIRC I only used durable_rename_excl() here to handle the case
where multiple servers are writing archives to the same location.  If that
happened, the archiver process would begin failing.  If a crash left two
hard links to the same file around, we will silently succeed the next time
around thanks to the compare_files() check.  Besides the WAL installation
code, the only other callers are in timeline.c, and both note that the use
of durable_rename_excl() is for "paranoidly trying to avoid overwriting an
existing file (there shouldn't be one)."

So AFAICT basic_archive.c is the only caller with a strong reason for using
durable_rename_excl(), and even that might not be worth keeping it around.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: trigger example for plsample

2022-04-08 Thread Mark Wong
On Thu, Apr 07, 2022 at 06:29:53PM -0400, Tom Lane wrote:
> Chapman Flack  writes:
> > v4 looks good to me.
> 
> Pushed with very minor editorialization.  Mainly, I undid the
> decision to stop printing the function source text, on the
> grounds that (1) it falsified the comment immediately above,
> and (2) if you have to print it anyway to avoid compiler warnings,
> you're just creating confusing inconsistency between the two
> handler functions.

Sounds good to me, thanks!

Regards,
Mark




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 9:44 AM Peter Geoghegan  wrote:
> On Fri, Apr 8, 2022 at 4:38 AM Matthias van de Meent
>  wrote:
> > Yeah, I think we should definately support more line pointers on a
> > heap page, but abusing MaxHeapTuplesPerPage for that is misleading:
> > the current value is the physical limit for heap tuples, as we have at
> > most 1 heap tuple per line pointer and thus the MaxHeapTuplesPerPage
> > won't change. A macro MaxHeapLinePointersPerPage would probably be
> > more useful, which could be as follows (assuming we don't want to
> > allow filling a page with effectively only dead line pointers):
>
> That's a good point. Sounds like it might be the right approach.
>
> I suppose that it will depend on how much use of MaxHeapTuplesPerPage
> remains once it is split in two like this.

Thinking about this some more, I wonder if it would make sense to
split MaxHeapTuplesPerPage into two new constants (a true
MaxHeapTuplesPerPage, plus MaxHeapLinePointersPerPage), for the
reasons discussed, but also as a way of getting a *smaller* effective
MaxHeapTuplesPerPage than 291 in some contexts only.

There are some ways in which the current MaxHeapTuplesPerPage isn't
enough, but also some ways in which it is excessive. It might be
useful if PageGetHeapFreeSpace() usually considered a heap page to
have no free space if the number of tuples with storage (or some cheap
proxy thereof) was about 227, which is the largest number of distinct
heap tuples that can *plausibly* ever be stored on an 8KiB page (it
ignores zero column tables). Most current PageGetHeapFreeSpace()
callers (including VACUUM) would continue to call that function in the
same way as today, and get this lower limit.

A few of the existing PageGetHeapFreeSpace() callers could store more
line pointers than that (MaxHeapLinePointersPerPage, which might be
510 in practice) -- but only those involved in updates. The overall
idea is to recognize that free space is not interchangeable -- updates
should have some kind of advantage over plain inserts when it comes to
the space on the page of the tuple that they're updating.

We might even want to make our newly defined, lower
MaxHeapTuplesPerPage into a tunable storage param.
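
A sketch of how that split could look at the code level (the function name
is invented, PageGetMaxOffsetNumber() stands in as the "cheap proxy" for the
number of tuples with storage, and MaxHeapLinePointersPerPage is the
proposed macro from upthread):

    static Size
    PageGetHeapFreeSpaceCapped(Page page, bool for_update)
    {
        /* updates get the full line pointer budget; everyone else stops at ~227 */
        int         cap = for_update ? MaxHeapLinePointersPerPage : 227;

        if (PageGetMaxOffsetNumber(page) >= cap)
            return 0;

        return PageGetHeapFreeSpace(page);
    }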

-- 
Peter Geoghegan




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 12:57 PM Peter Geoghegan  wrote:
> What do you mean about wasting space? Wasting space on the stack? I
> can't imagine you meant wasting space on the page, since being able to
> accommodate more items on each heap page seems like it would be
> strictly better, barring any unintended weird FSM issues.

I meant wasting space in the page. I think that's a real concern.
Imagine you allowed 1000 line pointers per page. Each one consumes 2
bytes. So now you could have ~25% of each page in the table storing
dead line pointers. That sounds awful, and just running VACUUM won't
fix it once it's happened, because the still-live line pointers are
likely to be at the end of the line pointer array and thus truncating
it won't necessarily be possible.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Frontend error logging style

2022-04-08 Thread Tom Lane
Daniel Gustafsson  writes:
> On 30 Mar 2022, at 00:38, Tom Lane  wrote:
>> Feel free to work on a followup editing patch though.

> Thats my plan, once this lands I'll rebase the comments on top of your work 
> and
> we can have a separate discussion around them then.

The main patch is pushed now.  I addressed the complaint Peter had
about the messages with "Check your installation" pseudo-hints
by getting rid of them; I concur with your observation that those
hints were basically useless.  I also fixed the one place where the
message should clearly be "could not close" not "could not write".
Mostly didn't yield to temptation anywhere else.

One other loose end is bothering me: I stuck with logging.h's
original choice to put "if (likely())" or "if (unlikely())"
conditionals into the macros, but I rather suspect that that's
just a waste.  I think we should put a centralized level check
into logging.c, and get rid of at least the "if (likely())"
checks, because those are going to succeed approximately 100.0%
of the time.  Maybe there's an argument for keeping the unlikely()
ones.

regards, tom lane




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 12:04 PM Robert Haas  wrote:
> I meant wasting space in the page. I think that's a real concern.
> Imagine you allowed 1000 line pointers per page. Each one consumes 2
> bytes. So now you could have ~25% of each page in the table storing
> dead line pointers. That sounds awful, and just running VACUUM won't
> fix it once it's happened, because the still-live line pointers are
> likely to be at the end of the line pointer array and thus truncating
> it won't necessarily be possible.

I see. That's a legitimate concern, though one that I believe can be
addressed. I have learned to dread any kind of bloat that's
irreversible, no matter how minor it might seem when seen as an
isolated event, so I'm certainly sympathetic to these concerns. You
can make a similar argument in favor of a higher
MaxHeapLinePointersPerPage limit, though -- and that's why I believe
an increase of some kind makes sense. The argument goes like this:

What if we miss the opportunity to systematically keep successor
versions of a given logical row on the same heap page over time, due
only to the current low MaxHeapLinePointersPerPage limit of 291? If we
had only been able to "absorb" just a few extra versions in the short
term, we would have had stability (in the sense of being able to
preserve locality among related logical rows) in the long term. We
could have kept everything together, if only we didn't overreact to
what were actually short term, rare perturbations.

-- 
Peter Geoghegan




Re: avoid multiple hard links to same WAL file after a crash

2022-04-08 Thread Nathan Bossart
On Fri, Apr 08, 2022 at 10:38:03AM -0400, Robert Haas wrote:
> I'd actually be in favor of nuking durable_rename_excl() from orbit
> and putting the file-exists tests in the callers. Otherwise, someone
> might assume that it actually has the semantics that its name
> suggests, which could be pretty disastrous. If we don't want to do
> that, then I'd changing to do the stat-then-durable-rename thing
> internally, so we don't leave hard links lying around in *any* code
> path. Perhaps that's the right answer for the back-branches in any
> case, since there could be third-party code calling this function.

I've attached a patch that simply removes durable_rename_excl() and
replaces existing calls with durable_rename().  I noticed that Andres
expressed similar misgivings about durable_rename_excl() last year [0] [1].
I can create a stat-then-durable-rename version of this for back-patching
if that is still the route we want to go.

[0] https://postgr.es/me/20210318014812.ds2iz4jz5h7la6un%40alap3.anarazel.de
[1] https://postgr.es/m/20210318023004.gz2aejhze2kkkqr2%40alap3.anarazel.de

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From d3c633e19555dc0cf98207ad5e7c08ab9ce85dc0 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Fri, 8 Apr 2022 11:48:17 -0700
Subject: [PATCH v2 1/1] Remove durable_rename_excl().

durable_rename_excl() attempts to avoid overwriting any existing
files by using link() and unlink(), but it falls back to rename()
on some platforms (e.g., Windows), which offers no such overwrite
protection.  Most callers use durable_rename_excl() just in case
there is an existing file, but in practice there shouldn't be one.
basic_archive uses it to avoid overwriting an archive concurrently
created by another server, but as mentioned above, it will still
overwrite files on some platforms.

Furthermore, failures during durable_rename_excl() can result in
multiple hard links to the same file.  My testing demonstrated that
it was possible to end up with two links to the same file in pg_wal
after a crash just before unlink() during WAL recycling.
Specifically, the test produced links to the same file for the
current WAL file and the next one because the half-recycled WAL
file was re-recycled upon restarting.  This seems likely to lead to
WAL corruption.

This change removes durable_rename_excl() and replaces all existing
calls with durable_rename().  This removes the protection against
accidentally overwriting an existing file, but some platforms are
already living without it, and ordinarily there shouldn't be one.

Author: Nathan Bossart
Reviewed-by: Robert Haas
Discussion: https://postgr.es/m/20220407182954.GA1231544%40nathanxps13
---
 contrib/basic_archive/basic_archive.c |  5 ++-
 src/backend/access/transam/timeline.c | 14 +-
 src/backend/access/transam/xlog.c |  8 +---
 src/backend/storage/file/fd.c | 63 ---
 src/include/pg_config_manual.h|  7 ---
 src/include/storage/fd.h  |  1 -
 6 files changed, 7 insertions(+), 91 deletions(-)

diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c
index e7efbfb9c3..ed33854c57 100644
--- a/contrib/basic_archive/basic_archive.c
+++ b/contrib/basic_archive/basic_archive.c
@@ -281,9 +281,10 @@ basic_archive_file_internal(const char *file, const char *path)
 
 	/*
 	 * Sync the temporary file to disk and move it to its final destination.
-	 * This will fail if destination already exists.
+	 * Note that this will overwrite any existing file, but this is only
+	 * possible if someone else created the file since the stat() above.
 	 */
-	(void) durable_rename_excl(temp, destination, ERROR);
+	(void) durable_rename(temp, destination, ERROR);
 
 	ereport(DEBUG1,
 			(errmsg("archived \"%s\" via basic_archive", file)));
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index be21968293..128f754e87 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -441,12 +441,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, newTLI);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -519,12 +514,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Now move the completed history file into place with its final name.
 	 */
 	TLHistoryFilePath(path, tli);
-
-	/*
-	 * Perform the rename using link if available, paranoidly trying to avoid
-	 * overwriting an existing file (there shouldn't be one).
-	 */
-	durable_rename_excl(tmppath, path, ERROR);
+	durable_rename(tmppath, path, ERROR);
 }

Re: Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2022-04-08 Thread Nathan Bossart
On Wed, Mar 30, 2022 at 09:21:30AM -0700, Nathan Bossart wrote:
> Here is an updated patch set.

rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From a92ccf47c9c334eea5b8d07b8fcab7031181c37e Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 15 Feb 2022 09:40:53 -0800
Subject: [PATCH v12 1/2] make more use of get_dirent_type()

---
 src/backend/access/heap/rewriteheap.c |  4 +-
 .../replication/logical/reorderbuffer.c   | 12 +++---
 src/backend/replication/logical/snapbuild.c   |  5 +--
 src/backend/replication/slot.c|  4 +-
 src/backend/storage/file/copydir.c| 21 +++---
 src/backend/storage/file/fd.c | 20 +
 src/backend/utils/misc/guc-file.l | 42 +++
 src/timezone/pgtz.c   |  8 +---
 8 files changed, 48 insertions(+), 68 deletions(-)

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 2a53826736..d64d7aae2e 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -113,6 +113,7 @@
 #include "access/xact.h"
 #include "access/xloginsert.h"
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "lib/ilist.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1213,7 +1214,6 @@ CheckPointLogicalRewriteHeap(void)
 	mappings_dir = AllocateDir("pg_logical/mappings");
 	while ((mapping_de = ReadDir(mappings_dir, "pg_logical/mappings")) != NULL)
 	{
-		struct stat statbuf;
 		Oid			dboid;
 		Oid			relid;
 		XLogRecPtr	lsn;
@@ -1227,7 +1227,7 @@ CheckPointLogicalRewriteHeap(void)
 			continue;
 
 		snprintf(path, sizeof(path), "pg_logical/mappings/%s", mapping_de->d_name);
-		if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+		if (get_dirent_type(path, mapping_de, false, ERROR) != PGFILETYPE_REG)
 			continue;
 
 		/* Skip over files that cannot be ours. */
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5adc016d44..63ef55f3f7 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -91,6 +91,7 @@
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/catalog.h"
+#include "common/file_utils.h"
 #include "lib/binaryheap.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -4408,15 +4409,10 @@ ReorderBufferCleanupSerializedTXNs(const char *slotname)
 {
 	DIR		   *spill_dir;
 	struct dirent *spill_de;
-	struct stat statbuf;
 	char		path[MAXPGPATH * 2 + 12];
 
 	sprintf(path, "pg_replslot/%s", slotname);
 
-	/* we're only handling directories here, skip if it's not ours */
-	if (lstat(path, &statbuf) == 0 && !S_ISDIR(statbuf.st_mode))
-		return;
-
 	spill_dir = AllocateDir(path);
 	while ((spill_de = ReadDirExtended(spill_dir, path, INFO)) != NULL)
 	{
@@ -4464,6 +4460,7 @@ StartupReorderBuffer(void)
 {
 	DIR		   *logical_dir;
 	struct dirent *logical_de;
+	char		path[MAXPGPATH * 2 + 12];
 
 	logical_dir = AllocateDir("pg_replslot");
 	while ((logical_de = ReadDir(logical_dir, "pg_replslot")) != NULL)
@@ -4476,6 +4473,11 @@ StartupReorderBuffer(void)
 		if (!ReplicationSlotValidateName(logical_de->d_name, DEBUG2))
 			continue;
 
+		/* we're only handling directories here, skip if it's not ours */
+		sprintf(path, "pg_replslot/%s", logical_de->d_name);
+		if (get_dirent_type(path, logical_de, false, ERROR) != PGFILETYPE_DIR)
+			continue;
+
 		/*
 		 * ok, has to be a surviving logical slot, iterate and delete
 		 * everything starting with xid-*
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 83fca8a77d..df50abfd98 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -123,6 +123,7 @@
 #include "access/heapam_xlog.h"
 #include "access/transam.h"
 #include "access/xact.h"
+#include "common/file_utils.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "replication/logical.h"
@@ -1947,15 +1948,13 @@ CheckPointSnapBuild(void)
 		uint32		hi;
 		uint32		lo;
 		XLogRecPtr	lsn;
-		struct stat statbuf;
 
 		if (strcmp(snap_de->d_name, ".") == 0 ||
 			strcmp(snap_de->d_name, "..") == 0)
 			continue;
 
 		snprintf(path, sizeof(path), "pg_logical/snapshots/%s", snap_de->d_name);
-
-		if (lstat(path, &statbuf) == 0 && !S_ISREG(statbuf.st_mode))
+		if (get_dirent_type(path, snap_de, false, ERROR) != PGFILETYPE_REG)
 		{
 			elog(DEBUG1, "only regular files expected: %s", path);
 			continue;
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index c35ea7c35b..5ee68e71b8 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -41,6 +41,7 @@
 
 #include "access/transam.h"
 #include "access/xlog_internal.h"
+#include "common/file_utils.h"
 #include "common/string.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -1485,7 +1486,6 @@ StartupReplicatio

Re: [PATCH] Tracking statements entry timestamp in pg_stat_statements

2022-04-08 Thread Andrei Zubkov
Hi,

I've rebased this patch so that it can be applied after 57d6aea00fc.

v14 attached
--
regards, Andrei
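
Based on the description in the attached patch, usage might look roughly
like this (sketch only; the authoritative definitions are in the patch and
its documentation changes):

    -- Reset only the min/max statistics of every entry, keeping the
    -- cumulative counters; the reset time is recorded per entry.
    SELECT pg_stat_statements_reset(0, 0, 0, minmax_only => true);

    -- See when each entry was first tracked and when its min/max stats
    -- were last reset.
    SELECT queryid, stats_since, minmax_stats_since
    FROM pg_stat_statements
    ORDER BY stats_since DESC
    LIMIT 10;
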
From 6c541f3001d952e72e5d865fde09de3fb4f36d10 Mon Sep 17 00:00:00 2001
From: Andrei Zubkov 
Date: Fri, 8 Apr 2022 23:12:55 +0300
Subject: [PATCH] pg_stat_statements: Track statement entry timestamp

This patch adds stats_since and minmax_stats_since columns to the
pg_stat_statements view and pg_stat_statements() function. The new min/max reset
mode for the pg_stat_stetments_reset() function is controlled by the
parameter minmax_only.
stat_since column is populated with the current timestamp when a new statement
is added to the pg_stat_statements hashtable. It provides clean information
about statistics collection time interval for each statement. Besides it can be
used by sampling solutions to detect situations when a statement was evicted and
stored again between samples.
Such sampling solution could derive any pg_stat_statements statistic values for
an interval between two samples with the exception of all min/max statistics. To
address this issue this patch adds the ability to reset min/max statistics
independently of the statement reset using the new minmax_only parameter
of the pg_stat_statements_reset(userid oid, dbid oid, queryid bigint,
minmax_only boolean) function. Timestamp of such reset is stored in the
minmax_stats_since field for each statement.
pg_stat_statements_reset() function now returns the timestamp of a reset as a
result.

Discussion:
https://www.postgresql.org/message-id/flat/72e80e7b160a6eb189df9ef6f068cce3765d37f8.camel%40moonset.ru
---
 .../expected/oldextversions.out   | 118 +++---
 .../expected/pg_stat_statements.out   | 361 +-
 .../pg_stat_statements--1.9--1.10.sql |  16 +-
 .../pg_stat_statements/pg_stat_statements.c   | 128 +--
 .../pg_stat_statements/sql/oldextversions.sql |   7 +-
 .../sql/pg_stat_statements.sql| 149 +++-
 doc/src/sgml/pgstatstatements.sgml|  66 +++-
 7 files changed, 637 insertions(+), 208 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/oldextversions.out b/contrib/pg_stat_statements/expected/oldextversions.out
index efb2049ecff..0634d73bc03 100644
--- a/contrib/pg_stat_statements/expected/oldextversions.out
+++ b/contrib/pg_stat_statements/expected/oldextversions.out
@@ -138,7 +138,7 @@ SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
 
 -- New function pg_stat_statement_info, and new function
 -- and view for pg_stat_statements introduced in 1.9
-AlTER EXTENSION pg_stat_statements UPDATE TO '1.9';
+ALTER EXTENSION pg_stat_statements UPDATE TO '1.9';
 SELECT pg_get_functiondef('pg_stat_statements_info'::regproc);
pg_get_functiondef
 -
@@ -194,55 +194,79 @@ SELECT count(*) > 0 AS has_data FROM pg_stat_statements;
  t
 (1 row)
 
+\d pg_stat_statements_info
+  View "public.pg_stat_statements_info"
+   Column|   Type   | Collation | Nullable | Default 
+-+--+---+--+-
+ dealloc | bigint   |   |  | 
+ stats_reset | timestamp with time zone |   |  | 
+
+SELECT pg_get_functiondef('pg_stat_statements_reset'::regproc);
+   pg_get_functiondef   
+
+ CREATE OR REPLACE FUNCTION public.pg_stat_statements_reset(userid oid DEFAULT 0, dbid oid DEFAULT 0, queryid bigint DEFAULT 0)+
+  RETURNS void +
+  LANGUAGE c   +
+  PARALLEL SAFE STRICT +
+ AS '$libdir/pg_stat_statements', $function$pg_stat_statements_reset_1_7$function$ +
+ 
+(1 row)
+
+SET SESSION AUTHORIZATION pg_read_all_stats;
+SELECT pg_stat_statements_reset();
+ERROR:  permission denied for function pg_stat_statements_reset
+RESET SESSION AUTHORIZATION;
 -- New functions and views for pg_stat_statements in 1.10
 AlTER EXTENSION pg_stat_statements UPDATE TO '1.10';
 \d pg_stat_statements
-  View "public.pg_stat_statements"
- Column |   Type   | Collation | Nullable | Default 
-+--+---+--+-
- userid | oid  |   

Re: Frontend error logging style

2022-04-08 Thread Tom Lane
I wrote:
> One other loose end is bothering me: I stuck with logging.h's
> original choice to put "if (likely())" or "if (unlikely())"
> conditionals into the macros, but I rather suspect that that's
> just a waste.  I think we should put a centralized level check
> into logging.c, and get rid of at least the "if (likely())"
> checks, because those are going to succeed approximately 100.0%
> of the time.  Maybe there's an argument for keeping the unlikely()
> ones.

Concretely, something like the attached.  As a simple check,
I looked at the compiled size of pg_dump.  It went from

   text    data     bss     dec     hex filename
 380298    4008    1384  385690   5e29a /home/postgres/testversion/bin/pg_dump

to

   text    data     bss     dec     hex filename
 374954    4008    1384  380346   5cdba src/bin/pg_dump/pg_dump

for a savings of about 5K or 1.5%.  Not a huge amount, but
not nothing either, especially considering that the existing
coding isn't buying us anything.

regards, tom lane

diff --git a/src/bin/pg_dump/pg_backup_utils.h b/src/bin/pg_dump/pg_backup_utils.h
index 5b1c51554d..8173bb93cf 100644
--- a/src/bin/pg_dump/pg_backup_utils.h
+++ b/src/bin/pg_dump/pg_backup_utils.h
@@ -34,8 +34,7 @@ extern void exit_nicely(int code) pg_attribute_noreturn();
 /* In pg_dump, we modify pg_fatal to call exit_nicely instead of exit */
 #undef pg_fatal
 #define pg_fatal(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
 		exit_nicely(1); \
 	} while(0)
 
diff --git a/src/common/logging.c b/src/common/logging.c
index 18d6669f27..8a061f46b4 100644
--- a/src/common/logging.c
+++ b/src/common/logging.c
@@ -223,6 +223,10 @@ pg_log_generic_v(enum pg_log_level level, enum pg_log_part part,
 	Assert(fmt);
 	Assert(fmt[strlen(fmt) - 1] != '\n');
 
+	/* Do nothing if log level is too low. */
+	if (level < __pg_log_level)
+		return;
+
 	/*
 	 * Flush stdout before output to stderr, to ensure sync even when stdout
 	 * is buffered.
diff --git a/src/include/common/logging.h b/src/include/common/logging.h
index e213bb70d0..35c7c7b976 100644
--- a/src/include/common/logging.h
+++ b/src/include/common/logging.h
@@ -104,48 +104,39 @@ void		pg_log_generic_v(enum pg_log_level level, enum pg_log_part part,
  * pg_log_generic[_v] directly, except perhaps in error interface code.
  */
 #define pg_log_error(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_error_detail(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_DETAIL, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_DETAIL, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_error_hint(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_HINT, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_HINT, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_warning(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_WARNING)) \
-			pg_log_generic(PG_LOG_WARNING, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_WARNING, PG_LOG_PRIMARY, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_warning_detail(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_WARNING)) \
-			pg_log_generic(PG_LOG_WARNING, PG_LOG_DETAIL, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_WARNING, PG_LOG_DETAIL, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_warning_hint(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_WARNING)) \
-			pg_log_generic(PG_LOG_WARNING, PG_LOG_HINT, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_WARNING, PG_LOG_HINT, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_info(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_INFO)) \
-			pg_log_generic(PG_LOG_INFO, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_INFO, PG_LOG_PRIMARY, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_info_detail(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_INFO)) \
-			pg_log_generic(PG_LOG_INFO, PG_LOG_DETAIL, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_INFO, PG_LOG_DETAIL, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_info_hint(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_INFO)) \
-			pg_log_generic(PG_LOG_INFO, PG_LOG_HINT, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_INFO, PG_LOG_HINT, __VA_ARGS__); \
 	} while(0)
 
 #define pg_log_debug(...) do { \
@@ -167,8 +158,7 @@ void		pg_log_generic_v(enum pg_log_level level, enum pg_log_part part,
  * A common shortcut: pg_log_error() and immediately exit(1).
  */
 #define pg_fatal(...) do { \
-		if (likely(__pg_log_level <= PG_LOG_ERROR)) \
-			pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
+		pg_log_generic(PG_LOG_ERROR, PG_LOG_PRIMARY, __VA_ARGS__); \
 		exit(1); \
 	} while(0)
 


Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Robert Haas
On Fri, Apr 8, 2022 at 3:31 PM Peter Geoghegan  wrote:
> What if we miss the opportunity to systematically keep successor
> versions of a given logical row on the same heap page over time, due
> only to the current low MaxHeapLinePointersPerPage limit of 291? If we
> had only been able to "absorb" just a few extra versions in the short
> term, we would have had stability (in the sense of being able to
> preserve locality among related logical rows) in the long term. We
> could have kept everything together, if only we didn't overreact to
> what were actually short term, rare perturbations.

Hmm. I wonder if we could teach the system to figure out which of
those things is happening. In the case that I'm worried about, when
we're considering growing the line pointer array, either the line
pointers will be dead or the line pointers will be used but the tuples
to which they point will be dead. In the case you describe here, there
should be very few dead tuples or line pointers in the page. Maybe
when the number of line pointers starts to get big, we refuse to add
more without checking the number of dead tuples and dead line pointers
and verifying that those numbers are still small. Or, uh, something.
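
One way to spell that heuristic, as a sketch (the name and thresholds are
invented for illustration; no such function exists):

    /*
     * Allow growing the line pointer array past soft_limit only while the
     * page has few dead or unused entries, i.e. while the extra pointers
     * are actually carrying live tuples.
     */
    static bool
    can_extend_lp_array(Page page, OffsetNumber soft_limit, int max_dead)
    {
        OffsetNumber    maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber    off;
        int             ndead = 0;

        if (maxoff < soft_limit)
            return true;

        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            ItemId      lp = PageGetItemId(page, off);

            if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
                ndead++;
        }

        return ndead <= max_dead;
    }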

One fly in the ointment is that if we refuse to expand the line
pointer array, we might extend the relation instead, which is another
kind of bloat and thus not great. But if the line pointer array is
somehow filling up with tons of dead tuples, we're going to have to
extend the relation anyway. I suspect that in some circumstances it's
better to just accept that outcome and hope that it leads to some
pages becoming empty, thus allowing their line pointer arrays to be
reset.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Pre-allocating WAL files

2022-04-08 Thread Nathan Bossart
On Thu, Mar 17, 2022 at 04:12:12PM -0700, Nathan Bossart wrote:
> It seems unlikely that this will be committed for v15, so I've adjusted the
> commitfest entry to v16 and moved it to the next commitfest.

rebased

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 3781795f9b4e448df6bdd24d5cd7c0743b5e2944 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 10 Nov 2021 18:35:14 +
Subject: [PATCH v8 1/2] Move WAL segment creation logic to its own function.

---
 src/backend/access/transam/xlog.c | 103 +--
 src/backend/storage/file/fd.c | 114 ++
 src/include/storage/fd.h  |   1 +
 3 files changed, 116 insertions(+), 102 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a7814d4019..87d71e2008 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2918,11 +2918,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	 bool *added, char *path)
 {
 	char		tmppath[MAXPGPATH];
-	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
-	int			save_errno;
 
 	Assert(logtli != 0);
 
@@ -2952,106 +2950,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	elog(DEBUG2, "creating and filling new WAL file");
 
 	snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
-
-	unlink(tmppath);
-
-	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not create file \"%s\": %m", tmppath)));
-
-	memset(zbuffer.data, 0, XLOG_BLCKSZ);
-
-	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-	save_errno = 0;
-	if (wal_init_zero)
-	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
-
-		/*
-		 * Zero-fill the file.  With this setting, we do this the hard way to
-		 * ensure that all the file space has really been allocated.  On
-		 * platforms that allow "holes" in files, just seeking to the end
-		 * doesn't allocate intermediate space.  This way, we know that we
-		 * have all the space and (after the fsync below) that all the
-		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
-		 * O_DSYNC will be sufficient to sync future writes to the log file.
-		 */
-
-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
-		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int			iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
-			{
-save_errno = errno;
-break;
-			}
-
-			i += iovcnt;
-		}
-	}
-	else
-	{
-		/*
-		 * Otherwise, seeking to the end and writing a solitary byte is
-		 * enough.
-		 */
-		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
-		{
-			/* if write didn't set errno, assume no disk space */
-			save_errno = errno ? errno : ENOSPC;
-		}
-	}
-	pgstat_report_wait_end();
-
-	if (save_errno)
-	{
-		/*
-		 * If we fail to make the file, delete it to release disk space
-		 */
-		unlink(tmppath);
-
-		close(fd);
-
-		errno = save_errno;
-
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not write to file \"%s\": %m", tmppath)));
-	}
-
-	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
-	{
-		int			save_errno = errno;
-
-		close(fd);
-		errno = save_errno;
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m", tmppath)));
-	}
-	pgstat_report_wait_end();
-
-	if (close(fd) != 0)
-		ereport(ERROR,
-(errcode_for_file_access(),
- errmsg("could not close file \"%s\": %m", tmppath)));
+	CreateEmptyWalSegment(tmppath);
 
 	/*
 	 * Now move the segment into place with its final name.  Cope with
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 14b77f2861..4efc46460e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3891,3 +3891,117 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
 
 	return sum;
 }
+
+/*
+ * CreateEmptyWalSegment
+ *
+ * Create a new file that can be used as a new WAL segment.  The caller is
+ * responsible for installing the new file in pg_wal.
+ */
+void
+CreateEmptyWalSegment(const char *path)
+{
+	PGAlignedXLogBlock zbuffer;
+	int			fd;
+	int			save_errno;
+
+	unlink(path);
+
+	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
+	fd = BasicOpenFile(path, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not create 

Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 09:17:40 -0400, Robert Haas wrote:
> I agree that the value of 291 is pretty much accidental, but it also
> seems fairly generous to me. The bigger you make it, the more space
> you can waste. I must have missed (or failed to understand) previous
> discussions about why raising it would be a good idea.

It's not hard to hit scenarios where pages are effectively unusable, because
they have close to 291 dead items, without autovacuum triggering (or
autovacuum just taking a while). You basically just need updates / deletes to
concentrate in a certain range of the table and have indexing that prevents
HOT updates. Because the overall percentage of dead tuples is low, no
autovacuum is triggered, yet a range of the table contains little but dead
items.  At which point you basically waste 7k bytes (1164 bytes for dead items
IIRC) until a vacuum finally kicks in - way more than what you'd waste if
the number of line items were limited at e.g. 2 x MaxHeapTuplesPerPage

This has become a bit more pronounced with vacuum skipping index cleanup when
there's "just a few" dead items - if all your updates concentrate in a small
region, 2% of the whole relation size isn't actually that small.
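
For what it's worth, such pages are easy to spot with pageinspect (assuming
the extension is installed; the table name and block number are
placeholders):

    -- lp_flags = 3 is LP_DEAD: a dead item only a vacuum can reclaim
    SELECT count(*) FILTER (WHERE lp_flags = 3) AS dead_items,
           count(*)                             AS total_items
    FROM heap_page_items(get_raw_page('mytable', 0));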


I wonder if we could reduce the real-world space wastage of the line pointer
array, if we changed the logic about which OffsetNumbers to use during
inserts / updates and made a few tweaks to pruning.

1) It's kind of OK for heap-only tuples to get a high OffsetNumber - we can
   reclaim them during pruning once they're dead. They don't leave behind a
   dead item that's unreclaimable until the next vacuum with an index cleanup
   pass.

2) Arguably the OffsetNumber of a redirect target can be changed. It might
   break careless uses of WHERE ctid = ... though (which likely are already
   broken, just harder to hit).

These leads me to a few potential improvements:

a) heap_page_prune_opt() should take the number of used items into account
   when deciding whether to prune. Right now we trigger hot pruning based on
   the number of items only if PageGetMaxOffsetNumber(page) >=
   MaxHeapTuplesPerPage. But because it requires a vacuum to reclaim an ItemId
   used for a root tuple, we should trigger HOT pruning when it might lower
   which OffsetNumbers get used.

b) heap_page_prune_opt() should be triggered in more paths. E.g. when
   inserting / updating, we should prune if it allows us to avoid using a high
   OffsetNumber.

c) What if we left some percentage of ItemIds unused, when looking for the
   OffsetNumber of a new HOT row version? That'd make it more likely for
   non-HOT updates and inserts to fit onto the page, without permanently
   increasing the size of the line pointer array.

d) If we think 2) is acceptable, we could move the targets of redirects to
   make space for new root tuples, without increasing the permanent size of
   the line pointer array.

Crazy?

Greetings,

Andres Freund




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Andres Freund
Hi,

On 2022-04-08 15:04:37 -0400, Robert Haas wrote:
> I meant wasting space in the page. I think that's a real concern.
> Imagine you allowed 1000 line pointers per page. Each one consumes 2
> bytes.

It's 4 bytes per line pointer, right?

struct ItemIdData {
unsigned int   lp_off:15;/* 0: 0  4 */
unsigned int   lp_flags:2;   /* 0:15  4 */
unsigned int   lp_len:15;/* 0:17  4 */

/* size: 4, cachelines: 1, members: 3 */
/* last cacheline: 4 bytes */
};

Or am I confusing myself somehow?


I do wish the length of the tuple weren't in ItemIdData, but part of the
tuple, so we'd not waste this space for dead items (I think it'd also simplify
more code than it'd complicate). But ...

- Andres




Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 2:18 PM Andres Freund  wrote:
> It's 4 bytes per line pointer, right?

Yeah, it's 4 bytes in Postgres. Most other DB systems only need 2
bytes, which is implemented in exactly the way that you're imagining.
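
As a sketch of how a 2-byte line pointer could work (emphatically not
PostgreSQL's layout: it assumes the tuple length moves into the tuple header
and the offset is stored in MAXALIGN quanta, which needs at most 12 bits
even for 32kB pages):

    typedef struct SlimItemIdData
    {
        uint16      lp_off_quanta:14,   /* offset from page start, in 8-byte units */
                    lp_flags:2;         /* LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD */
    } SlimItemIdData;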

-- 
Peter Geoghegan




Re: MERGE bug report

2022-04-08 Thread Alvaro Herrera
On 2022-Apr-06, Richard Guo wrote:

> That's right. The varattno is set to zero for whole-row Var. And in this
> case these whole-row Vars are not included in the targetlist.
> 
> Attached is an attempt for the fix.

Wow, this is very interesting.  I was surprised that this patch was
necessary at all -- I mean, if wholerow refs don't work, then why do
references to any other columns work?  The answer is that parse_merge.c
is already setting up the subplan's targetlist by expanding all vars of
the source relation.  I then remembered that in Simon's (or Pavan's)
original coding, parse_merge.c had a hack to include a var with the
source's wholerow in that targetlist, which I had later removed ...

I eventually realized that there's no need for parse_merge.c to expand
the source rel at all, and indeed it's wasteful: we can just let
preprocess_targetlist include the vars that are referenced by either
quals or each action's targetlist instead.  That led me to the attached
patch, which is not commit-quality yet but it should show what I have in
mind.

I added a test query to tickle this problematic case.

Another point, not completely connected to this bug but appearing in the
same function, is that we have some redundant code: we can just let the
stanza for UPDATE/DELETE do the identity columns dance.  This saves a
few lines in the MERGE-specific stanza there, which was doing exactly
the same thing.  (There's a difference in the "inh" test, but I think
that was just outdated.)

I also discovered that the comment for fix_join_expr needed an update,
since it doesn't mention MERGE, and it does mention all other situations
in which it is used.  Added that too.


This patch also drops the PVC flags for "aggregates, window functions and
placeholder vars".  That was relevant and correct when only the qual of
each action was being handled (i.e., Richard's patch).  Now that we're
also handling the action's targetlist, I think I need to put the PVC
flags back.  But no tests broke, which probably means we also need some
additional test cases.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 6ea3505646..fa5969bbd5 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -2763,6 +2763,10 @@ search_indexed_tlist_for_sortgroupref(Expr *node,
  *	  to-be-updated relation) alone. Correspondingly inner_itlist is to be
  *	  EXCLUDED elements, outer_itlist = NULL and acceptable_rel the target
  *	  relation.
+ * 4) MERGE.  In this case, references to the source relation are to be
+ *replaced with INNER_VAR references, and target Vars (the to-be-
+ *modified relation) are left alone. So inner_itlist is to be
+ *    the source relation and acceptable_rel the target relation.
  *
  * 'clauses' is the targetlist or list of join clauses
  * 'outer_itlist' is the indexed target list of the outer join relation,
diff --git a/src/backend/optimizer/prep/preptlist.c b/src/backend/optimizer/prep/preptlist.c
index 99ab3d7559..c1c3067365 100644
--- a/src/backend/optimizer/prep/preptlist.c
+++ b/src/backend/optimizer/prep/preptlist.c
@@ -107,14 +107,15 @@ preprocess_targetlist(PlannerInfo *root)
 		root->update_colnos = extract_update_targetlist_colnos(tlist);
 
 	/*
-	 * For non-inherited UPDATE/DELETE, register any junk column(s) needed to
-	 * allow the executor to identify the rows to be updated or deleted.  In
-	 * the inheritance case, we do nothing now, leaving this to be dealt with
-	 * when expand_inherited_rtentry() makes the leaf target relations.  (But
-	 * there might not be any leaf target relations, in which case we must do
-	 * this in distribute_row_identity_vars().)
+	 * For non-inherited UPDATE/DELETE/MERGE, register any junk column(s)
+	 * needed to allow the executor to identify the rows to be updated or
+	 * deleted.  In the inheritance case, we do nothing now, leaving this to
+	 * be dealt with when expand_inherited_rtentry() makes the leaf target
+	 * relations.  (But there might not be any leaf target relations, in which
+	 * case we must do this in distribute_row_identity_vars().)
 	 */
-	if ((command_type == CMD_UPDATE || command_type == CMD_DELETE) &&
+	if ((command_type == CMD_UPDATE || command_type == CMD_DELETE ||
+		 command_type == CMD_MERGE) &&
 		!target_rte->inh)
 	{
 		/* row-identity logic expects to add stuff to processed_tlist */
@@ -125,23 +126,15 @@ preprocess_targetlist(PlannerInfo *root)
 	}
 
 	/*
-	 * For MERGE we need to handle the target list for the target relation,
-	 * and also target list for each action (only INSERT/UPDATE matter).
+	 * For MERGE we also need to handle the target list for each INSERT and
+	 * UPDATE action separately.  In addition, we examine the qual of each
+	 * action and add any Vars there (other than those of the target rel) to
+	 * the subplan targetlist.
 	 */
 	if (command_type == CMD_MERGE)
 	{
 		ListCel

Re: Lowering the ever-growing heap->pd_lower

2022-04-08 Thread Peter Geoghegan
On Fri, Apr 8, 2022 at 2:06 PM Andres Freund  wrote:
> It's not hard to hit scenarios where pages are effectively unusable, because
> they have close to 291 dead items, without autovacuum triggering (or
> autovacuum just taking a while).

I think that this is mostly a problem with HOT-updates, and regular
updates to a lesser degree. Deletes seem less troublesome.

I find that it's useful to think in terms of the high watermark number
of versions required for a given logical row over time. It's probably
quite rare for most individual logical rows to truly require more than
2 or 3 versions per row at the same time, to serve queries. Even in
update-heavy tables. And without doing anything fancy with the
definition of HeapTupleSatisfiesVacuum(). There are important
exceptions, certainly, but overall I think that we're still not doing
good enough with these easier cases.

The high watermark number of versions is probably going to be
significantly greater than the typical number of versions for the same
row. So maybe we give up on keeping a row on its original heap block
today, all because of a once-off (or very rare) event where we needed
slightly more extra space for only a fraction of a second.

The tell-tale sign of these kinds of problems can sometimes be seen
with synthetic, rate-limited benchmarks. If it takes a very long time
for the problem to grow, but nothing about the workload really ever
changes, then that suggests problems that have this quality.  The
probability of any given logical row being moved to another heap block
is very low. And yet it is inevitable that many (even all) will, given
enough time, given enough opportunities to get unlucky.

> This has become a bit more pronounced with vacuum skipping index cleanup when
> there's "just a few" dead items - if all your updates concentrate in a small
> region, 2% of the whole relation size isn't actually that small.

The 2% threshold was chosen based on the observation that it was below
the effective threshold where autovacuum just won't ever launch
anything on a moderate sized table (unless you set
autovacuum_vacuum_scale_factor to something absurdly low). The real
problem is that IMV. That's why I think that we need to drive it based
primarily on page-level characteristics. While effectively ignoring
pages that are all-visible when deciding if enough bloat is present to
necessitate vacuuming.
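
As a rough worked example with stock settings (autovacuum_vacuum_threshold =
50, autovacuum_vacuum_scale_factor = 0.2), a 10-million-row table is not
autovacuumed until about 50 + 0.2 * 10,000,000, i.e. roughly 2 million rows,
are dead -- even if all of those dead rows are concentrated in a range of
pages that hit the 291-item ceiling long ago.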

> 1) It's kind of OK for heap-only tuples to get a high OffsetNumber - we can
>reclaim them during pruning once they're dead. They don't leave behind a
>dead item that's unreclaimable until the next vacuum with an index cleanup
>pass.

I like the general direction here, but this particular idea doesn't
seem like a winner.

> 2) Arguably the OffsetNumber of a redirect target can be changed. It might
>break careless uses of WHERE ctid = ... though (which likely are already
>broken, just harder to hit).

That makes perfect sense to me, though.

> a) heap_page_prune_opt() should take the number of used items into account
>when deciding whether to prune. Right now we trigger hot pruning based on
>the number of items only if PageGetMaxOffsetNumber(page) >=
>MaxHeapTuplesPerPage. But because it requires a vacuum to reclaim an ItemId
>used for a root tuple, we should trigger HOT pruning when it might lower
>which OffsetNumber get used.

Unsure about this.

> b) heap_page_prune_opt() should be triggered in more paths. E.g. when
>inserting / updating, we should prune if it allows us to avoid using a high
>OffsetNumber.

Unsure about this too.

I prototyped a design that gives individual backends soft ownership of
heap blocks that were recently allocated, and later prunes the heap
page when it fills [1]. Useful for aborted transactions, where it
preserves locality -- leaving aborted tuples behind makes their space
ultimately reused for unrelated inserts, which is bad. But eager
pruning allows the inserter to leave behind more or less pristine heap
pages, which don't need to be pruned later on.

> c) What if we left some percentage of ItemIds unused, when looking for the
>OffsetNumber of a new HOT row version? That'd make it more likely for
>non-HOT updates and inserts to fit onto the page, without permanently
>increasing the size of the line pointer array.

That sounds promising.

[1] 
https://postgr.es/m/cah2-wzm-vhveqyth8hlyyho2wdg8ecrm0upqjwjap6bovfe...@mail.gmail.com
--
Peter Geoghegan



