Re: track_planning causing performance regression

2020-06-29 Thread Julien Rouhaud
Hi,

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins  wrote:
>
> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
> ~45% performance drop [2] at high DB connection counts (when compared with 
> v12.3)
>
> Disabling pg_stat_statements.track_planning (which is 'On' by default)
> brings the TPS numbers up to v12.3 levels.
>
> The inflection point (in this test-case) is 128 Connections, beyond which the
> TPS numbers are consistently low. Looking at the mailing list [1], this issue
> didn't surface earlier possibly since the regression is trivial at low 
> connection counts.
>
> It would be great if this could be optimized further, or track_planning
> disabled (by default) so as to not trip users upgrading from v12 with 
> pg_stat_statement
> enabled (but otherwise not particularly interested in track_planning).
>
> These are some details around the above test:
>
> pgbench: scale - 100 / threads - 16
> test-duration - 30s each
> server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
> client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the 
> DB server - Same AZ)
> v12 - REL_12_STABLE (v12.3)
> v13Beta1 - REL_13_STABLE (v13Beta1)
> max_connections = 1
> shared_preload_libraries = 'pg_stat_statements'
> shared_buffers 128MB

I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.

I disagree with the conclusion though.  It seems to me that if you
really have a workload that consists of these few queries and want
to get better performance, you'll use a connection pooler and/or
prepared statements anyway, which will make this overhead disappear
entirely and will also yield an even bigger performance improvement.
A quick test using pgbench -M prepared, with track_planning enabled
and still way too many connections, already shows a 25% improvement
over -M simple without track_planning.




Re: [PATCH] COPY command's data format option allows only lowercase csv, text or binary

2020-06-29 Thread Michael Paquier
On Thu, Jun 25, 2020 at 11:07:33AM +0900, Michael Paquier wrote:
> 0d8c9c1 has introduced some in parse_basebackup_options() for the
> new manifest option, and fe30e7e for AlterType(), no?

Please forget this one.  I had a moment of brain fade.  Those have
been added for the option values, and for the option names we use
strcmp() directly, so I am not actually seeing a code path on HEAD
where we use pg_strcasecmp() for something coming only from the parser.
--
Michael




Re: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread Kyotaro Horiguchi
Hello.

At Mon, 29 Jun 2020 04:35:11 +, "higuchi.dais...@fujitsu.com" 
 wrote in 
> Hi,
> 
> I found the bug about archive_timeout parameter.
> There is the case archive_timeout parameter is ignored after recovery works.
...
> [Problem]
> When the value of archive_timeout is smaller than that of checkpoint_timeout 
> and recovery works, archive_timeout is ignored in the first WAL archiving.
> Once WAL is archived, the archive_timeout seems to be valid after that.
...
> Currently, cur_timeout is set according to only checkpoint_timeout when it is 
> during recovery.
> Even during recovery, the cur_timeout should be calculated including 
> archive_timeout as well as checkpoint_timeout, I think.
> I attached the patch to solve this problem.

Unfortunately the diff command in your test script doesn't show me
anything, but I can understand what you are thinking is a problem,
maybe.  But the patch doesn't seem to be the fix for the issue.

Archiving works independently of that parameter. Completed WAL
segments are immediately marked as ".ready" and the archiver does its
task right away, independently of the checkpointer. The parameter, as
described in the documentation, forces the server to switch to a new
WAL segment file periodically so that it can be archived; that is, it
works only on the primary.  On the other hand, on a standby a WAL
segment is not marked as ".ready" until some data for the *next*
segment arrives. So the patch is not the fix for the issue.

If the primary switched to a new segment and archived it but the
standby didn't archive the same segment immediately, you could force
it to do so by writing something on the primary.

Anyway, the attached patch would resolve your problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fec10780e132c1d284c66355c5215c284c16204d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 29 Jun 2020 16:12:01 +0900
Subject: [PATCH] Let complete segment archived immediately on standby

walreceiver marks a completed segment as ".ready" after any data for
the next segment comes. So standby can archive a WAL segment later
than the primary archives the same segment.  Let walreceiver archive a
segment as soon as it is completed.
---
 src/backend/replication/walreceiver.c | 77 +++
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..06c1e3cbe4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -902,44 +902,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		if (recvFile < 0)
 		{
 			bool		use_existent;
 
-			/*
-			 * fsync() and close current file before we switch to next one. We
-			 * would otherwise have to reopen this file to fsync it later
-			 */
-			if (recvFile >= 0)
-			{
-				char		xlogfname[MAXFNAMELEN];
-
-				XLogWalRcvFlush(false);
-
-				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-				/*
-				 * XLOG segment files will be re-read by recovery in startup
-				 * process soon, so we don't advise the OS to release cache
-				 * pages associated with the file like XLogFileClose() does.
-				 */
-				if (close(recvFile) != 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not close log segment %s: %m",
-									xlogfname)));
-
-				/*
-				 * Create .done file forcibly to prevent the streamed segment
-				 * from being archived later.
-				 */
-				if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-					XLogArchiveForceDone(xlogfname);
-				else
-					XLogArchiveNotify(xlogfname);
-			}
-			recvFile = -1;
-
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
@@ -985,6 +951,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		buf += byteswritten;
 
 		LogstreamResult.Write = recptr;
+
+		/*
+		 * Close the current WAL segment if it is completed then let the file
+		 * be archived if needed.
+		 */
+		if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		{
+			char		xlogfname[MAXFNAMELEN];
+
+			Assert (recvFile >= 0);
+
+			/*
+			 * fsync() and close current file before we switch to next one. We
+			 * would otherwise have to reopen this file to fsync it later
+			 */
+			XLogWalRcvFlush(false);
+
+			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+			/*
+			 * XLOG segment files will be re-read by recovery in startup
+			 * process soon, so we don't advise the OS to release cache
+			 * pages associated with the file like XLogFileClose() does.
+			 */
+			if (close(recvFile) != 0)
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not close log segment %s: %m",
+								xlogfname)));
+
+			/*
+			 * Create .done file forcibly to prevent the streamed segment
+			 * from being archived later.
+			 */
+			if (XLogArch

Re: POC and rebased patch for CSN based snapshots

2020-06-29 Thread Andrey V. Lepikhov

On 6/12/20 2:41 PM, movead...@highgo.ca wrote:

Hello hackers,

Currently, I do some changes based on the last version:
1. Catch up to the current  commit (c2bd1fec32ab54).
2. Add regression and document.
3. Add support to switch from xid-base snapshot to csn-base snapshot,
and the same with standby side.


Some remarks on your patch:
1. The variable last_max_csn can be an atomic variable.
2. GenerateCSN() routine: consider the case where csn < csnState->last_max_csn.
This is the case when someone has changed the system clock. I think a
WARNING should be written to the log file then. (Maybe we could also
synchronize with a time server.) A minimal sketch of points 1 and 2
follows after this list.
3. What about the global snapshot xmin? In the pgpro version of the patch we
had a GlobalSnapshotMapXmin() routine to maintain a circular buffer of
oldestXmins for several seconds into the past. This buffer allows shifting
oldestXmin into the past when a backend is importing a global transaction.
Otherwise, old versions of tuples that are needed by this transaction
can be recycled by other processes (vacuum, HOT, etc).
How do you implement protection from local pruning? I saw
SNAP_DESYNC_COMPLAIN, but it is not used anywhere.
4. The current version of the patch does not apply cleanly to current
master.
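
A minimal sketch of points 1 and 2; the CSNShmemData layout, the csnState
variable and GenerateCSN's signature are assumed for illustration, not
taken from the patch:

#include "postgres.h"
#include "port/atomics.h"
#include "utils/timestamp.h"

typedef struct CSNShmemData
{
    pg_atomic_uint64 last_max_csn;   /* highest CSN handed out so far */
} CSNShmemData;

static CSNShmemData *csnState;

static uint64
GenerateCSN(void)
{
    uint64      csn = (uint64) GetCurrentTimestamp();
    uint64      last = pg_atomic_read_u64(&csnState->last_max_csn);

    if (csn <= last)
    {
        /* the system clock stood still or moved backwards */
        ereport(WARNING,
                (errmsg("system clock moved backwards while generating a CSN")));
        csn = last + 1;
    }

    /*
     * Publish the new maximum.  A compare-and-exchange loop (or keeping the
     * existing lock around this code) would be needed for full correctness
     * under concurrency; omitted here for brevity.
     */
    pg_atomic_write_u64(&csnState->last_max_csn, csn);

    return csn;
}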


--
regards,
Andrey Lepikhov
Postgres Professional




Re: pg_bsd_indent compiles bytecode

2020-06-29 Thread Michael Paquier
On Sat, Jun 27, 2020 at 06:54:04PM -0400, Tom Lane wrote:
> Having now checked, there isn't any such problem.  No .bc files are
> getting built except in src/backend and in other modules that feed
> into the backend, such as src/timezone and most of contrib.
> 
> I do see .bc files getting built for pg_bsd_indent, as Alvaro reported.
> Seems like it must be a bug in the pgxs make logic, not anything more
> generic.

Yeah, and I think that it is caused by the following bit:
ifeq ($(with_llvm), yes)
all: $(addsuffix .bc, $(MODULES)) $(patsubst %.o,%.bc, $(OBJS))
endif

Shouldn't the latter part be ignored if PROGRAM is used?
--
Michael




Re: TAP tests and symlinks on Windows

2020-06-29 Thread Michael Paquier
On Fri, Jun 26, 2020 at 02:00:37PM +0200, Peter Eisentraut wrote:
> We should be more accurate about things like this:
> 
> +# The following tests test symlinks. Windows may not have symlinks, so
> +# skip there.
> 
> The issue isn't whether Windows has symlinks, since all versions of Windows
> supported by PostgreSQL do (AFAIK).  The issue is only whether the Perl
> installation that runs the tests has symlink support.  And that is only
> necessary if the test itself wants to create or inspect symlinks.  For
> example, there are existing tests involving tablespaces that work just fine
> on Windows.

Check.  Indeed that sounds confusing.

> Relatedly, your patch ends up skipping the tests on MSYS2, even though Perl
> supports symlinks there out of the box.

Do you think that it would be enough to use what Andrew has mentioned
in [1]?  I don't have a MSYS2 installation, so I am unfortunately not
able to confirm that, but I would just move the check to TestLib.pm
and save it in an extra variable.

[1]: 
https://www.postgresql.org/message-id/6c5ffed0-20ee-8878-270f-ab56b7023...@2ndquadrant.com
--
Michael




Re: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread Kyotaro Horiguchi
At Mon, 29 Jun 2020 16:41:11 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> Anyway, the attached patch would resolve your problem.

I found another issue related to my last patch.

For the current master (and older versions), if walreceiver is signaled
to exit just after a segment is completed, walreceiver exits without
marking the last segment as ".ready".  After restart, it doesn't
remember that it never notified the last segment, so that segment ends
up missing from the archive. I think this is really a bug.

With the patch, that failure won't happen.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-06-29 Thread Bharath Rupireddy
Thanks for the responses.

>
> If this is just going to apply to postgres_fdw why not just have that module 
> provide a function "disconnect_open_sessions()" or the like that does this 
> upon user command?  I suppose there would be some potential value to having 
> this be set per-user but that wouldn't preclude the usefulness of a function. 
>   And by having a function the usefulness of the GUC seems reduced.
>

The idea of having module-specific functions to remove cached entries
seems like a good one. However, users would have to call these functions
frequently to clean up the cached entries in a long-lasting session. That
may not always be possible if the sessions come from an application
rather than from a psql-like client, which is the more frequent scenario
in customer use cases. In that case users might have to change the
application code that issues queries to the postgres server so that it
also calls these functions.

Assuming that the server/session configuration happens well before the
user application starts to submit actual database queries, a GUC helps to
avoid making such function calls mid-session: the GUC is simply set to on
if connections should be cached, or to off if they should not be.

>
> On a related note is there any entanglement here with the supplied dblink 
> and/or dblink_fdw [1] modules as they do provide connect and disconnect 
> functions and also leverages postgres_fdw (or dblink_fdw if specified, which 
> brings us back to the question of whether this option should be respected by 
> that FDW).
>

I found that dblink also has the connection caching concept, and it
provides a function, dblink_disconnect(), to disconnect/remove cached
connections, taking the connection name as its input.
IMO, this solution seems a bit problematic, as explained in my first
response in this mail.
The postgres_fdw connection caching and the dblink connection caching
have no link at all. Please point it out if I'm missing anything here.
But this GUC could probably be extended from a bool to an enum of type
config_enum_entry and be used for dblink as well, which keeps it
extensible. Please let me know if this is okay, so that I can code it.
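
A rough sketch of what such an enum GUC could look like; the GUC name,
the enum values and where the definition would live are illustrative
assumptions, not taken from the patch:

#include "postgres.h"
#include "utils/guc.h"

typedef enum ConnCacheMode
{
    CONN_CACHE_OFF,
    CONN_CACHE_POSTGRES_FDW,
    CONN_CACHE_DBLINK,
    CONN_CACHE_ALL
} ConnCacheMode;

static const struct config_enum_entry conn_cache_options[] = {
    {"off", CONN_CACHE_OFF, false},
    {"postgres_fdw", CONN_CACHE_POSTGRES_FDW, false},
    {"dblink", CONN_CACHE_DBLINK, false},
    {"all", CONN_CACHE_ALL, false},
    {NULL, 0, false}
};

static int  connection_cache_mode = CONN_CACHE_ALL;

static void
define_connection_cache_guc(void)
{
    DefineCustomEnumVariable("foreign_connection_cache.mode",   /* assumed name */
                             "Selects which modules cache connections to remote servers.",
                             NULL,
                             &connection_cache_mode,
                             CONN_CACHE_ALL,
                             conn_cache_options,
                             PGC_USERSET,
                             0,
                             NULL, NULL, NULL);
}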

>
> Otherwise, I would imagine that having multiple queries execute before 
> wanting to drop the connection would be desirable so at minimum a test case 
> that does something like:
>
> SELECT count(*) FROM remote.tbl1;
> -- connection still open
> SET enable_connectioncache TO false;
> SELECT count(*) FROM remote.tbl2;
> -- now it was closed
>
> Or maybe even better, have the close action happen on a transaction boundary.
>

This is a valid scenario, as the same connection can be used multiple
times within the same transaction. The initial patch I attached above
already covers this point: the decision to cache or not cache the
connection happens at the end of the main transaction, i.e. in
pgfdw_xact_callback().

>
> And if it doesn't just apply to postgres_fdw (or at least doesn't have to) 
> then the description text should be less specific.
>

If we agree on a generic GUC for postgres_fdw, dblink and so on, I will
change the description and documentation accordingly.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com





Re: Add Information during standby recovery conflicts

2020-06-29 Thread Masahiko Sawada
On Thu, 18 Jun 2020 at 16:28, Drouvot, Bertrand  wrote:
>
> Hi hackers,
>
> I've attached a patch to add information during standby recovery conflicts.
> The motive behind is to be able to get some information:
>
> On the apply side
> On the blocker(s) side
>
> Motivation:
>
> When a standby recovery conflict occurs it could be useful to get more 
> information to be able to dive deep on the root cause and find a way to 
> avoid/mitigate new occurrences.

I think this is a good feature. Like log_lock_waits, it will help the
users to investigate recovery conflict issues.

>
> Adding this information would make the investigations easier, it could help 
> answering questions like:
>
> On which LSN was the WAL apply blocked?
> What was the purpose of the bocked WAL record?
> On which relation (if any) was the blocked WAL record related to?
> What was the blocker(s) doing?
> When did the blocker(s) started their queries (if any)?
> What was the blocker(s) waiting for? on which wait event?
>
> Technical context and proposal:
>
> There is 2 points in this patch:
>
> Add the information about the blocked WAL record. This is done in standby.c 
> (ResolveRecoveryConflictWithVirtualXIDs, ResolveRecoveryConflictWithDatabase, 
> StandbyTimeoutHandler)

I think we already have the information about the WAL record being
applied in errcontext.

I wonder if we can show the recovery conflict information in the main
LOG message, the blocker information in errdetail, and use errcontext
with regard to WAL record information. For example:

LOG:  process 500 waiting for recovery conflict on snapshot
DETAIL:  conflicting transaction ids: 100, 200, 300
CONTEXT:  WAL redo at 0/3001970 for Heap2/CLEAN: remxid 506

> Outcome Example:
>
> 2020-06-15 06:48:23.774 UTC [6971] LOG:  wal record apply is blocked by 2 
> connection(s), reason: User query might have needed to see row versions that 
> must be removed.
> 2020-06-15 06:48:23.774 UTC [6971] LOG:  blocked wal record rmgr: Heap2, lsn: 
> 0/038E2678, received at: 2020-06-15 06:48:23.774098+00, desc: CLEAN, 
> relation: rel 1663/13586/16652 fork main blk 0
> 2020-06-15 06:48:54.773 UTC [7088] LOG:  about to interrupt pid: 7088, 
> backend_type: client backend, state: active, wait_event_type: Timeout, 
> wait_event: PgSleep, query_start: 2020-06-15 06:48:14.87672+00

I'm concerned that the above information about the process that is about
to be interrupted is very detailed, and I'm not sure it will be helpful
for users. If the blocker is waiting on some lock, that information
should already be logged by log_lock_waits. Also, the canceled backend
will emit an ERROR log with the message "canceling statement due to
conflict with recovery", and its pid can be logged via the log line
prefix. From that, the user can tell which backend was canceled by a
recovery conflict and what query it was running.

Probably we need to consider having a time threshold (or a boolean to
turn this on/off) for emitting this information to the server logs.
Otherwise we will end up writing information for every conflict,
cluttering the log needlessly.
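
A minimal sketch of such a threshold check; the GUC names, the threshold
and the helper are hypothetical, not from the proposed patch:

#include "postgres.h"
#include "miscadmin.h"
#include "utils/timestamp.h"

/* hypothetical settings, not part of the patch */
static bool log_recovery_conflict_waits = true;
static int  recovery_conflict_log_threshold_ms = 1000;

static void
maybe_log_recovery_conflict(TimestampTz waitStart)
{
    /* emit the LOG/DETAIL lines only after waiting longer than the threshold */
    if (log_recovery_conflict_waits &&
        TimestampDifferenceExceeds(waitStart, GetCurrentTimestamp(),
                                   recovery_conflict_log_threshold_ms))
        ereport(LOG,
                (errmsg("process %d is waiting for recovery conflict on snapshot",
                        MyProcPid),
                 errdetail("Blocker information would go here.")));
}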

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: track_planning causing performance regression

2020-06-29 Thread Fujii Masao




On 2020/06/29 16:05, Julien Rouhaud wrote:

Hi,

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins  wrote:


During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows


Thanks for the benchmark!



~45% performance drop [2] at high DB connection counts (when compared with 
v12.3)


That's bad :(




Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low 
connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with 
pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).


Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.
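
A minimal sketch of that idea; this is not pg_stat_statements' actual
layout, and the struct and function names are illustrative only:

#include "postgres.h"
#include "storage/spin.h"

typedef struct pgssCountersPerKind
{
    int64       calls;
    double      total_time;
    int64       shared_blks_hit;    /* ... and every other pgss counter ... */
} pgssCountersPerKind;

typedef struct pgssEntrySketch
{
    /* one lock and one counter set per kind: [0] = plan, [1] = exec */
    slock_t     mutex[2];
    pgssCountersPerKind counters[2];
} pgssEntrySketch;

static void
pgss_store_sketch(pgssEntrySketch *entry, int kind,
                  double elapsed_ms, int64 blks_hit)
{
    volatile pgssEntrySketch *e = entry;

    /* planning and execution of the same query use different spinlocks */
    SpinLockAcquire(&e->mutex[kind]);
    e->counters[kind].calls += 1;
    e->counters[kind].total_time += elapsed_ms;
    e->counters[kind].shared_blks_hit += blks_hit;
    SpinLockRelease(&e->mutex[kind]);
}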



These are some details around the above test:

pgbench: scale - 100 / threads - 16
test-duration - 30s each
server - 96 vCPUs / 768GB - r5.24xl (AWS EC2 instance)
client - 72 vCPUs / 144GB - c5.18xl (AWS EC2 instance) (co-located with the DB 
server - Same AZ)
v12 - REL_12_STABLE (v12.3)
v13Beta1 - REL_13_STABLE (v13Beta1)
max_connections = 1
shared_preload_libraries = 'pg_stat_statements'
shared_buffers 128MB


I can't reproduce this on my laptop, but I can certainly believe that
running the same 3 queries using more connections than available cores
will lead to extra overhead.

I disagree with the conclusion though.  It seems to me that if you
really have a workload that consists of these few queries and want
to get better performance, you'll use a connection pooler and/or
prepared statements anyway, which will make this overhead disappear
entirely and will also yield an even bigger performance improvement.
A quick test using pgbench -M prepared, with track_planning enabled
and still way too many connections, already shows a 25% improvement
over -M simple without track_planning.


I understand your point. But IMO the default setting should basically
be the safer value, i.e., off, at least until the problem disappears.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: ModifyTable overheads in generic plans

2020-06-29 Thread Amit Langote
On Mon, Jun 29, 2020 at 10:39 AM David Rowley  wrote:
>
> On Sat, 27 Jun 2020 at 00:36, Amit Langote  wrote:
> > 2. ExecCheckRTPerms(): checks permissions of *all* partitions before
> > executing the plan tree, but maybe it's okay to check only the ones
> > that will be accessed
>
> I don't think it needs to be quite as complex as that.
> expand_single_inheritance_child will set the
> RangeTblEntry.requiredPerms to 0, so we never need to check
> permissions on a partition.  The overhead of permission checking when
> there are many partitions is just down to the fact that
> ExecCheckRTPerms() loops over the entire rangetable and calls
> ExecCheckRTEPerms for each one.  ExecCheckRTEPerms() does have very
> little work to do when requiredPerms is 0, but the loop itself and the
> function call overhead show up when you remove the other bottlenecks.

I had forgotten that we set requiredPerms to 0 for the inheritance child tables.

> I have a patch somewhere that just had the planner add the RTindexes
> with a non-zero requiredPerms and set that in the plan so that
> ExecCheckRTPerms could just look at the ones that actually needed
> something checked.   There's a slight disadvantage there that for
> queries to non-partitioned tables that we need to build a Bitmapset
> that has all items from the rangetable.  That's likely a small
> overhead, but not free, so perhaps there is a better way.

I can't think of anything for this that doesn't involve having one
more list of RTEs or bitmapset of RT indexes in PlannedStmt.
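
A minimal sketch of that approach; "permCheckRelids" is a hypothetical
field the planner would fill with the RT indexes whose requiredPerms is
non-zero, and ExecCheckRTEPerms() stands in for the existing per-RTE
check in execMain.c:

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "nodes/parsenodes.h"
#include "parser/parsetree.h"

extern bool ExecCheckRTEPerms(RangeTblEntry *rte);

static bool
ExecCheckRTPermsSketch(List *rangeTable, Bitmapset *permCheckRelids)
{
    int         rti = -1;

    /* visit only the range table entries that actually need a check */
    while ((rti = bms_next_member(permCheckRelids, rti)) >= 0)
    {
        RangeTblEntry *rte = rt_fetch(rti, rangeTable);

        if (!ExecCheckRTEPerms(rte))
            return false;       /* caller reports the ACL error */
    }

    return true;
}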



--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: track_planning causing performance regression

2020-06-29 Thread Julien Rouhaud
On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
 wrote:
>
> On 2020/06/29 16:05, Julien Rouhaud wrote:
> > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins  wrote:
> >>
> >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
>
> Thanks for the benchmark!
>
>
> >> ~45% performance drop [2] at high DB connection counts (when compared with 
> >> v12.3)
>
> That's bad :(
>
>
> >>
> >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
> >> brings the TPS numbers up to v12.3 levels.
> >>
> >> The inflection point (in this test-case) is 128 Connections, beyond which 
> >> the
> >> TPS numbers are consistently low. Looking at the mailing list [1], this 
> >> issue
> >> didn't surface earlier possibly since the regression is trivial at low 
> >> connection counts.
> >>
> >> It would be great if this could be optimized further, or track_planning
> >> disabled (by default) so as to not trip users upgrading from v12 with 
> >> pg_stat_statement
> >> enabled (but otherwise not particularly interested in track_planning).
>
> Your benchmark result seems to suggest that the cause of the problem is
> the contention of per-query spinlock in pgss_store(). Right?
> This lock contention is likely to happen when multiple sessions run
> the same queries.
>
> One idea to reduce that lock contention is to separate per-query spinlock
> into two; one is for planning, and the other is for execution. pgss_store()
> determines which lock to use based on the given "kind" argument.
> To make this idea work, also every pgss counters like shared_blks_hit
> need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also quite increase the shared memory consumption.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.
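
A minimal sketch of the atomics idea; the names are illustrative only,
and pg_stat_statements does not do this today:

#include "postgres.h"
#include "port/atomics.h"

typedef struct pgssAtomicCounters
{
    pg_atomic_uint64 calls;
    pg_atomic_uint64 total_time_us; /* keep time as integer microseconds */
} pgssAtomicCounters;

static inline void
pgss_bump(pgssAtomicCounters *c, uint64 elapsed_us)
{
    /*
     * Each field is updated atomically on its own, but the row as a whole
     * is no longer updated under a single lock -- hence the per-row
     * consistency caveat mentioned above.
     */
    pg_atomic_fetch_add_u64(&c->calls, 1);
    pg_atomic_fetch_add_u64(&c->total_time_us, (int64) elapsed_us);
}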




Re: track_planning causing performance regression

2020-06-29 Thread Fujii Masao




On 2020/06/29 18:17, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
 wrote:


On 2020/06/29 16:05, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins  wrote:


During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows


Thanks for the benchmark!



~45% performance drop [2] at high DB connection counts (when compared with 
v12.3)


That's bad :(




Disabling pg_stat_statements.track_planning (which is 'On' by default)
brings the TPS numbers up to v12.3 levels.

The inflection point (in this test-case) is 128 Connections, beyond which the
TPS numbers are consistently low. Looking at the mailing list [1], this issue
didn't surface earlier possibly since the regression is trivial at low 
connection counts.

It would be great if this could be optimized further, or track_planning
disabled (by default) so as to not trip users upgrading from v12 with 
pg_stat_statement
enabled (but otherwise not particularly interested in track_planning).


Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.


This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?


Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?



It'll also quite increase the shared memory consumption.


Yes.



I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.


Yeah, we can consider more improvements against this issue.
But I'm afraid these (maybe including my idea) basically should
be items for v14...

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: track_planning causing performance regression

2020-06-29 Thread Julien Rouhaud
On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
 wrote:
>
> >> Your benchmark result seems to suggest that the cause of the problem is
> >> the contention of per-query spinlock in pgss_store(). Right?
> >> This lock contention is likely to happen when multiple sessions run
> >> the same queries.
> >>
> >> One idea to reduce that lock contention is to separate per-query spinlock
> >> into two; one is for planning, and the other is for execution. pgss_store()
> >> determines which lock to use based on the given "kind" argument.
> >> To make this idea work, also every pgss counters like shared_blks_hit
> >> need to be separated into two, i.e., for planning and execution.
> >
> > This can probably remove some overhead, but won't it eventually hit
> > the same issue when multiple connections try to plan the same query,
> > given the number of different queries and very low execution runtime?
>
> Yes. But maybe we can expect that the idea would improve
> the performance to the near same level as v12?

A POC patch should be easy to do and see how much it solves this
problem.  However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.

> > I'm wondering if we could instead use atomics to store the counters.
> > The only downside is that we won't guarantee per-row consistency
> > anymore, which may be problematic.
>
> Yeah, we can consider more improvements against this issue.
> But I'm afraid these (maybe including my idea) basically should
> be items for v14...

Yes, that's clearly not something I'd vote to push in v13 at this point.




Re: track_planning causing performance regression

2020-06-29 Thread Fujii Masao




On 2020/06/29 18:53, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
 wrote:



Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.


This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?


Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?


A POC patch should be easy to do and see how much it solves this
problem.  However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.


Agreed. +1 to change that default to off.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: update substring pattern matching syntax

2020-06-29 Thread Peter Eisentraut

On 2020-06-28 08:13, Fabien COELHO wrote:

v2 patches apply cleanly, compile, global check ok, citext check ok, doc
gen ok. No further comments.


committed, thanks


As I did not find an entry in the CF, I did nothing about tagging it
"ready".


Right, I had not registered it yet.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: estimation problems for DISTINCT ON with FDW

2020-06-29 Thread Bharath Rupireddy
> It doesn't cost out the plan of pushing the DISTINCT ON down to the foreign 
> side, which is probably the best way to run the query.  I guess it makes 
> sense that FDW machinery in general doesn't want to try to push a PostgreSQL 
> specific construct.

I think you are right: the DISTINCT operation is not being pushed to
the remote side (I may be wrong here; just for info, I looked at the
remote SQL in EXPLAIN (VERBOSE) output for the query to find this out),
and the same goes for the estimates. There might be problems pushing
DISTINCT to remote servers when the FDW is used for sharding
configurations. But when the FDW is used for non-sharded configurations,
such as simply fetching existing data from another remote postgres
server, oracle, hadoop or some other remote database engine that
supports the DISTINCT operation, it would be good to push it to the
remote side both for explains/estimates and for the actual queries
themselves, to reduce the data transferred from the remote database
server to the local postgres server.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Asymmetry in opening and closing indices for partition routing

2020-06-29 Thread Ashutosh Bapat
On Mon, 22 Jun 2020 at 23:22, Alvaro Herrera 
wrote:

>
> I'm not sure that expecting the relcache entry's refcount drops to zero
> at the right time is a good approach; that may cause leaks some other
> place might have refcounts you're not expecting (say, an open cursor
> that's not fully read).
>

My proposal was to maintain a refcount counting the number of times an
index is opened in ResultRelInfo itself, not to rely on the relcache ref
count. But I think that would be overkill. Please read ahead.


>
> (I'm not terribly worried about refcount leakage as a theoretical
> concern, since the ResourceOwner mechanism will warn us about that if it
> happens.)
>
> > 2. Throw an error in ExecOpenIndices if all the arrays are present. We
> will
> > need to check leaf_part_rri->ri_IndexRelationDescs == NULL in
> > ExecInitPartitionInfo().
>
> This sounds like a job for an assert rather than an error.
>

I agree. Here's a patch to add an Assert in ExecOpenIndices(). I ran
make check with this patch and the assertion didn't trip. I think this
will be a good step forward.

-- 
Best Wishes,
Ashutosh
From 638e922bebd9ee4657e40daa63b61ccac4bd205c Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Thu, 25 Jun 2020 18:50:52 +0530
Subject: [PATCH] Assert if ExecOpenIndices() is called twice on same result
 relation.

If ExecOpenIndices() is called more than once for the result relation,
it overwrites the arrays for indices, leaking memory. Assert would catch
this in development.

Ashutosh Bapat.
---
 src/backend/executor/execIndexing.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1862af621b..a94b98b045 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -158,6 +158,13 @@ ExecOpenIndices(ResultRelInfo *resultRelInfo, bool speculative)
 	RelationPtr relationDescs;
 	IndexInfo **indexInfoArray;
 
+	/*
+	 * This function can be called only once per ResultRelInfo, otherwise, we
+	 * will leak memory for arrays and also leak index relcache references.
+	 */
+	Assert(!resultRelInfo->ri_IndexRelationDescs);
+	Assert(!resultRelInfo->ri_IndexRelationInfo);
+
 	resultRelInfo->ri_NumIndices = 0;
 
 	/* fast path if no indexes */
-- 
2.17.1



Bug with indexes on whole-row expressions

2020-06-29 Thread Laurenz Albe
CREATE TABLE boom (a integer, b integer);

-- index on whole-row expression
CREATE UNIQUE INDEX ON boom ((boom));

INSERT INTO boom VALUES
   (1, 2),
   (1, 3);

ALTER TABLE boom DROP b;

TABLE boom;

 a 
---
 1
 1
(2 rows)

REINDEX TABLE boom;
ERROR:  could not create unique index "boom_boom_idx"
DETAIL:  Key ((boom.*))=((1)) is duplicated.

The problem here is that there *is* a "pg_depend" entry for the
index, but it only depends on the whole table, not on specific columns.

I have been thinking what would be the right approach to fix this:

1. Don't fix it, because it is an artificial corner case.
   (But I can imagine someone trying to exclude duplicate rows with
   a unique index.)

2. Add code that checks if there is an index with a whole-row reference
   in the definition before dropping a column.
   That feels like a wart for a special case.

3. Forbid indexes on whole-row expressions.
   After all, you can do the same with an index on all the columns.
   That would open the question what to do about upgrading old databases
   that might have such indexes today.

4. Add dependencies on all columns whenever a whole-row expression
   is used in an index.
   That would need special processing for pg_upgrade.

I'd like to hear your opinions.
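
For illustration, a rough sketch of the dependency-recording part of
option 4, with assumed names for the index address and heap relation;
where this would hook into index_create() and how pg_upgrade would be
handled are left open:

#include "postgres.h"
#include "catalog/dependency.h"
#include "catalog/objectaddress.h"
#include "catalog/pg_class.h"
#include "utils/rel.h"

static void
record_whole_row_column_deps(ObjectAddress indexAddr, Relation heapRel)
{
    int         attno;

    for (attno = 1; attno <= RelationGetNumberOfAttributes(heapRel); attno++)
    {
        ObjectAddress referenced;

        /* depend on each column, so DROP COLUMN has to deal with the index */
        ObjectAddressSubSet(referenced, RelationRelationId,
                            RelationGetRelid(heapRel), attno);
        recordDependencyOn(&indexAddr, &referenced, DEPENDENCY_AUTO);
    }
}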

Yours,
Laurenz Albe





Re: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread Fujii Masao




On 2020/06/29 16:41, Kyotaro Horiguchi wrote:

Hello.

At Mon, 29 Jun 2020 04:35:11 +, "higuchi.dais...@fujitsu.com" 
 wrote in

Hi,

I found the bug about archive_timeout parameter.
There is the case archive_timeout parameter is ignored after recovery works.

...

[Problem]
When the value of archive_timeout is smaller than that of checkpoint_timeout 
and recovery works, archive_timeout is ignored in the first WAL archiving.
Once WAL is archived, the archive_timeout seems to be valid after that.

...

Currently, cur_timeout is set according to only checkpoint_timeout when it is 
during recovery.
Even during recovery, the cur_timeout should be calculated including 
archive_timeout as well as checkpoint_timeout, I think.
I attached the patch to solve this problem.


Unfortunately the diff command in your test script doesn't show me
anything, but I can understand what you are thinking is a problem,
maybe.  But the patch doesn't seem to be the fix for the issue.

Archiving works independently of that parameter. Completed WAL
segments are immediately marked as ".ready" and the archiver does its
task right away, independently of the checkpointer. The parameter, as
described in the documentation, forces the server to switch to a new
WAL segment file periodically so that it can be archived; that is, it
works only on the primary.  On the other hand, on a standby a WAL
segment is not marked as ".ready" until some data for the *next*
segment arrives. So the patch is not the fix for the issue.


Are the problem you're describing and the one Daisuke-san reported really
the same? The reported problem seems to be that the checkpointer can
sleep on its latch for more than archive_timeout just after recovery and
cannot switch WAL files promptly even when necessary.

The cause of this problem is that the checkpointer's sleep time is calculated
from both checkpoint_timeout and archive_timeout during normal running,
but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
patch tries to change that so that it's calculated from both of them even
during recovery. No?

-		if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
+		if (XLogArchiveTimeout > 0)
 		{
 			elapsed_secs = now - last_xlog_switch_time;
-			if (elapsed_secs >= XLogArchiveTimeout)
+			if (elapsed_secs >= XLogArchiveTimeout && !RecoveryInProgress())
 				continue;		/* no sleep for us ... */
 			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);

last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
large and cur_timeout can be negative. Isn't this problematic?
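
For illustration, a minimal sketch of the clamping the proposed change
would at least need, using the variable names from the diff above; taking
them as parameters is just to keep the sketch self-contained:

#include "postgres.h"
#include "pgtime.h"

static int
clamp_archive_sleep(int cur_timeout, pg_time_t now,
                    pg_time_t last_xlog_switch_time, int XLogArchiveTimeout)
{
    int         elapsed_secs = (int) (now - last_xlog_switch_time);

    /* never let a stale last_xlog_switch_time drive the timeout negative */
    if (XLogArchiveTimeout > 0)
        cur_timeout = Min(cur_timeout,
                          Max(1, XLogArchiveTimeout - elapsed_secs));

    return cur_timeout;
}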

As another approach, what about waking the checkpointer up at the end of
recovery like we already do for walsenders?

Regards,


--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: bugfix: invalid bit/varbit input causes the log file to be unreadable

2020-06-29 Thread Quan Zongliang

Good.

I tested it, and it looks fine.

Thank you.


On 2020/6/29 1:10 上午, Tom Lane wrote:

I wrote:

Even granting the premise, the proposed patch seems like a significant
decrease in user-friendliness for typical cases.  I'd rather see us
make an effort to print one valid-per-the-DB-encoding character.
Now that we can rely on snprintf to count %s restrictions in bytes,
I think something like this should work:
  errmsg("\"%.*s\" is not a valid binary digit",
 pg_mblen(sp), sp)));
But the real problem is that this is only the tip of the iceberg.
You didn't even hit all the %c usages in varbit.c.

I went through all the %c format sequences in the backend to see which
ones could use this type of fix.  There were not as many as I'd expected,
but still a fair number.  (I skipped cases where the input was coming from
the catalogs, as well as some non-user-facing debug printouts.)  That
leads to the attached patch, which seems to do the job without breaking
anything that works today.

regards, tom lane

PS: I failed to resist the temptation to improve some shoddy error
messages nearby in pageinspect/heapfuncs.c.



Re: POC: postgres_fdw insert batching

2020-06-29 Thread Ashutosh Bapat
On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
 wrote:

>
> FDW batching: 4584 ms
>
> So, rather nice improvement, I'd say ...

Very nice.

>
> Before I spend more time hacking on this, I have a couple open questions
> about the design, restrictions etc.
>
>
> 1) Extend the FDW API?
>
> In the patch, the batching is simply "injected" into the existing insert
> API method, i.e. ExecForeignInsert et al. I wonder if it'd be better to
> extend the API with a "batched" version of the method, so that we can
> easily determine whether the FDW supports batching or not - it would
> require changes in the callers, though. OTOH it might be useful for
> COPY, where we could do something similar to multi_insert (COPY already
> benefits from this patch, but it does not use the batching built-into
> COPY).

Amit Langote has pointed out a related patch being discussed on hackers at [1].

That patch introduces a new API. But if we can do it without
introducing a new API that will be good. FDWs which can support
batching can just modify their code and don't have to implement and
manage a new API. We already have a handful of those APIs.

>
> 2) What about the insert results?
>
> I'm not sure what to do about "result" status for the inserted rows. We
> only really "stash" the rows into a buffer, so we don't know if it will
> succeed or not. The patch simply assumes it will succeed, but that's
> clearly wrong, and it may result in reporting a wrong number or rows.

I didn't get this. We are executing an INSERT on the foreign server,
so we get the number of rows INSERTed from that server. We should just
add those up across batches. If there's a failure, it would abort the
transaction, local as well as remote.

>
> The patch also disables the batching when the insert has a RETURNING
> clause, because there's just a single slot (for the currently inserted
> row). I suppose a "batching" method would take an array of slots.
>

It will be a rare case when a bulk load also has a RETURNING clause.
So, we can live with this restriction. We should try to choose a
design which allows that restriction to be lifted in the future. But I
doubt that restriction will be a serious one.

>
> 3) What about the other DML operations (DELETE/UPDATE)?
>
> The other DML operations could probably benefit from the batching too.
> INSERT was good enough for a PoC, but having batching only for INSERT
> seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
> of quals, but likely doable.

Bulk INSERTs are more common in a sharded environment because of data
load in say OLAP systems. Bulk update/delete are rare, although not
that rare. So if an approach just supports bulk insert and not bulk
UPDATE/DELETE that will address a large number of usecases IMO. But if
we can make everything work together that would be good as well.

In your patch, I see that an INSERT statement with batch is
constructed as INSERT INTO ... VALUES (...), (...) as many values as
the batch size. That won't work as is for UPDATE/DELETE since we can't
pass multiple pairs of ctids and columns to be updated for each ctid
in one statement. Maybe we could build as many UPDATE/DELETE
statements as the size of a batch, but that would be ugly. What we
need is a feature like a batch prepared statement in libpq similar to
what JDBC supports
(https://mkyong.com/jdbc/jdbc-preparedstatement-example-batch-update/).
This will allow a single prepared statement to be executed with a
batch of parameters, each batch corresponding to one foreign DML
statement.
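
For illustration, a minimal sketch of how such a batched remote INSERT
string could be assembled; this is not postgres_fdw's actual deparse
code, and the table name and use of parameter placeholders are
assumptions:

#include "postgres.h"
#include "lib/stringinfo.h"

static char *
build_batched_insert(int nrows, int ncols)
{
    StringInfoData sql;
    int         row, col, p = 1;

    initStringInfo(&sql);
    appendStringInfoString(&sql, "INSERT INTO remote_tbl VALUES ");

    /* one parenthesized placeholder group per row in the batch */
    for (row = 0; row < nrows; row++)
    {
        appendStringInfoString(&sql, row > 0 ? ", (" : "(");
        for (col = 0; col < ncols; col++)
            appendStringInfo(&sql, "%s$%d", col > 0 ? ", " : "", p++);
        appendStringInfoChar(&sql, ')');
    }

    return sql.data;
}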

>
>
> 3) Should we do batching for COPY insteads?
>
> While looking at multi_insert, I've realized it's mostly exactly what
> the new "batching insert" API function would need to be. But it's only
> really used in COPY, so I wonder if we should just abandon the idea of
> batching INSERTs and do batching COPY for FDW tables.

I think this won't support RETURNING either. But if we could somehow
use the COPY protocol to send the data to the foreign server and yet
treat it as an INSERT, that might work. I think we have to find out
which performs better, COPY or batched INSERT.

>
> For cases that can replace INSERT with COPY this would be enough, but
> unfortunately it does nothing for DELETE/UPDATE so I'm hesitant to do
> this :-(

Agreed, if we want to support bulk UPDATE/DELETE as well.

>
>
> 4) Expected consistency?
>
> I'm not entirely sure what are the consistency expectations for FDWs.
> Currently the FDW nodes pointing to the same server share a connection,
> so the inserted rows might be visible to other nodes. But if we only
> stash the rows in a local buffer for a while, that's no longer true. So
> maybe this breaks the consistency expectations?
>
> But maybe that's OK - I'm not sure how the prepared statements/cursors
> affect this. I can imagine restricting the batching only to plans where
> this is not an issue (single FDW node or something), but it seems rather
> fragile and undesirable.

I think that area is grey. Depending up

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-29 Thread Dilip Kumar
On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila  wrote:
>
> On Mon, Jun 22, 2020 at 4:26 PM Amit Kapila  wrote:
> >
> > On Thu, Jun 18, 2020 at 9:02 PM Dilip Kumar  wrote:
> > >
> > > Yes, I have made the changes.  Basically, now I am only using the
> > > XLOG_XACT_INVALIDATIONS for generating all the invalidation messages.
> > > So whenever we are getting the new set of XLOG_XACT_INVALIDATIONS, we
> > > are directly appending it to the txn->invalidations.  I have tested
> > > the XLOG_INVALIDATIONS part but while sending this mail I realized
> > > that we could write some automated test for the same.
> > >
> >
> > Can you share how you have tested it?
> >
> > >  I will work on
> > > that soon.
> > >
> >
> > Cool, I think having a regression test for this will be a good idea.
> >
>
> Other than above tests, can we somehow verify that the invalidations
> generated at commit time are the same as what we do with this patch?
> We have verified with individual commands but it would be great if we
> can verify for the regression tests.

I have verified this using a few random test cases.  For verifying
this I have made some temporary code changes with an assert, as shown
below.  Basically, in DecodeCommit we call the
ReorderBufferAddInvalidations function only for assert checking.

-void
 ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
                               XLogRecPtr lsn, Size nmsgs,
-                              SharedInvalidationMessage *msgs)
+                              SharedInvalidationMessage *msgs, bool commit)
 {
        ReorderBufferTXN *txn;

        txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
-
+       if (commit)
+       {
+               Assert(txn->ninvalidations == nmsgs);
+               return;
+       }

The result is that for a normal local test it works fine.  But with the
regression suite, it hits the assert in many places, because if a
rollback of a subtransaction is involved, those invalidation messages
are not logged at commit time, whereas with command-time invalidation
they are logged.

As of now, I have only put an assert on the count.  If we need to verify
the exact messages, then we might need to somehow categorize the
invalidation messages, because the ordering of the messages will not be
the same.  For testing this we will have to arrange them by category,
i.e. relcache, catcache, and then we can compare them.
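
A rough sketch of that comparison idea, assuming both sides' messages are
available as arrays; a strict check would need a payload-aware comparator
for messages sharing an id:

#include "postgres.h"
#include "storage/sinval.h"

#include <stdlib.h>
#include <string.h>

static int
inval_msg_cmp(const void *a, const void *b)
{
    const SharedInvalidationMessage *ma = a;
    const SharedInvalidationMessage *mb = b;

    /* catcache ids are >= 0; relcache and friends use negative ids */
    if (ma->id != mb->id)
        return (ma->id < mb->id) ? -1 : 1;
    return memcmp(ma, mb, sizeof(SharedInvalidationMessage));
}

static bool
inval_sets_equivalent(SharedInvalidationMessage *a,
                      SharedInvalidationMessage *b, Size n)
{
    /* sort both arrays so the differing emission order doesn't matter */
    qsort(a, n, sizeof(SharedInvalidationMessage), inval_msg_cmp);
    qsort(b, n, sizeof(SharedInvalidationMessage), inval_msg_cmp);

    return memcmp(a, b, n * sizeof(SharedInvalidationMessage)) == 0;
}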


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: track_planning causing performance regression

2020-06-29 Thread Fujii Masao



On 2020/06/29 18:56, Fujii Masao wrote:



On 2020/06/29 18:53, Julien Rouhaud wrote:

On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
 wrote:



Your benchmark result seems to suggest that the cause of the problem is
the contention of per-query spinlock in pgss_store(). Right?
This lock contention is likely to happen when multiple sessions run
the same queries.

One idea to reduce that lock contention is to separate per-query spinlock
into two; one is for planning, and the other is for execution. pgss_store()
determines which lock to use based on the given "kind" argument.
To make this idea work, also every pgss counters like shared_blks_hit
need to be separated into two, i.e., for planning and execution.


This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?


Yes. But maybe we can expect that the idea would improve
the performance to the near same level as v12?


A POC patch should be easy to do and see how much it solves this
problem.  However I'm not able to reproduce the issue, and IMHO unless
we specifically want to be able to distinguish planner-time counters
from execution-time counters, I'd prefer to disable track_planning by
default than going this way, so that users with a sane usage won't
have to suffer from a memory increase.


Agreed. +1 to change that default to off.


Attached patch does this.

I also added the following to the description of each *_plan_time column
in the docs. IMO this is helpful for users when they see that those
columns report zero by default and try to understand why.

(if pg_stat_statements.track_planning is enabled, otherwise 
zero)

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index f615f8c2bf..c3f013860a 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -3,6 +3,7 @@ CREATE EXTENSION pg_stat_statements;
 -- simple and compound statements
 --
 SET pg_stat_statements.track_utility = FALSE;
+SET pg_stat_statements.track_planning = TRUE;
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index cef8bb5a49..65ac301b99 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -442,7 +442,7 @@ _PG_init(void)
 							 "Selects whether planning duration is tracked by pg_stat_statements.",
 							 NULL,
 							 &pgss_track_planning,
-							 true,
+							 false,
 							 PGC_SUSET,
 							 0,
 							 NULL,
diff --git a/contrib/pg_stat_statements/sql/pg_stat_statements.sql b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
index 75c10554a8..6ed8e38028 100644
--- a/contrib/pg_stat_statements/sql/pg_stat_statements.sql
+++ b/contrib/pg_stat_statements/sql/pg_stat_statements.sql
@@ -4,6 +4,7 @@ CREATE EXTENSION pg_stat_statements;
 -- simple and compound statements
 --
 SET pg_stat_statements.track_utility = FALSE;
+SET pg_stat_statements.track_planning = TRUE;
 SELECT pg_stat_statements_reset();
 
 SELECT 1 AS "int";
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index a13e28a84c..430d8bf07c 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -101,6 +101,8 @@
   
   
Number of times the statement was planned
+   (if pg_stat_statements.track_planning is enabled,
+   otherwise zero)
   
  
 
@@ -110,6 +112,8 @@
   
   
Total time spent planning the statement, in milliseconds
+   (if pg_stat_statements.track_planning is enabled,
+   otherwise zero)
   
  
 
@@ -119,6 +123,8 @@
   
   
Minimum time spent planning the statement, in milliseconds
+   (if pg_stat_statements.track_planning is enabled,
+   otherwise zero)
   
  
 
@@ -128,6 +134,8 @@
   
   
Maximum time spent planning the statement, in milliseconds
+   (if pg_stat_statements.track_planning is enabled,
+   otherwise zero)
   
  
 
@@ -137,6 +145,8 @@
   
   
Mean time spent planning the statement, in milliseconds
+   (if pg_stat_statements.track_planning

Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-29 Thread Amit Kapila
On Mon, Jun 29, 2020 at 10:26 AM Masahiko Sawada
 wrote:
>
> On Fri, 26 Jun 2020 at 17:53, Amit Kapila  wrote:
> >
> > On Fri, Jun 26, 2020 at 11:31 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Thu, 25 Jun 2020 at 19:35, Amit Kapila  wrote:
> > > >
> > > >
> > > > Today, I thought about it again, and if we consider the point that
> > > > logical_decoding_work_mem value doesn’t change much then having the
> > > > stats at slot-level would also allow computing
> > > > logical_decoding_work_mem based on stats.  Do you think it is a
> > > > reasonable assumption that users won't change
> > > > logical_decoding_work_mem for different processes (WALSender, etc.)?
> > >
> > > FWIW, if we use logical_decoding_work_mem as a threshold of starting
> > > of sending changes to a subscriber, I think there might be use cases
> > > where the user wants to set different logical_decoding_work_mem values
> > > to different wal senders. For example, setting a lower value to
> > > minimize the latency of synchronous logical replication to a near-site
> > > whereas setting a large value to minimize the amount of data sent to a
> > > far site.
> > >
> >
> > How does setting a large value can minimize the amount of data sent?
> > One possibility is if there are a lot of transaction aborts and
> > transactions are not large enough that they cross
> > logical_decoding_work_mem threshold but such cases shouldn't be many.
>
> Yeah, this is what I meant.
>
> I agree that it would not be a common case that the user sets
> different values for different processes. Based on that assumption, I
> also think having the stats at slot-level is a good idea.
>

Okay.

> But I might
> want to have the reset function.
>

I don't mind, but let's first see how the patch for the basic feature
looks and what is required to implement it.  Are you interested in
writing the patch for this work?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: tar-related code in PostgreSQL

2020-06-29 Thread Robert Haas
On Sun, Jun 28, 2020 at 11:24 AM Hamid Akhtar  wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:not tested
>
> The patch works perfectly.
>
> The new status of this patch is: Ready for Committer

Thanks, but this was committed on June 15th, as per my previous email.
Perhaps I forgot to update the CommitFest application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: track_planning causing performance regression

2020-06-29 Thread Julien Rouhaud
On Mon, Jun 29, 2020 at 1:14 PM Fujii Masao  wrote:
>
> On 2020/06/29 18:56, Fujii Masao wrote:
> >
> >
> > On 2020/06/29 18:53, Julien Rouhaud wrote:
> >> On Mon, Jun 29, 2020 at 11:38 AM Fujii Masao
> >>  wrote:
> >>>
> > Your benchmark result seems to suggest that the cause of the problem is
> > the contention of per-query spinlock in pgss_store(). Right?
> > This lock contention is likely to happen when multiple sessions run
> > the same queries.
> >
> > One idea to reduce that lock contention is to separate per-query 
> > spinlock
> > into two; one is for planning, and the other is for execution. 
> > pgss_store()
> > determines which lock to use based on the given "kind" argument.
> > To make this idea work, also every pgss counters like shared_blks_hit
> > need to be separated into two, i.e., for planning and execution.
> 
>  This can probably remove some overhead, but won't it eventually hit
>  the same issue when multiple connections try to plan the same query,
>  given the number of different queries and very low execution runtime?
> >>>
> >>> Yes. But maybe we can expect that the idea would improve
> >>> the performance to the near same level as v12?
> >>
> >> A POC patch should be easy to do and see how much it solves this
> >> problem.  However I'm not able to reproduce the issue, and IMHO unless
> >> we specifically want to be able to distinguish planner-time counters
> >> from execution-time counters, I'd prefer to disable track_planning by
> >> default than going this way, so that users with a sane usage won't
> >> have to suffer from a memory increase.
> >
> > Agreed. +1 to change that default to off.
>
> Attached patch does this.

Patch looks good to me.

> I also add the following into the description about each *_plan_time column
> in the docs. IMO this is helpful for users when they see that those columns
> report zero by default and try to understand why.
>
> (if pg_stat_statements.track_planning is enabled, 
> otherwise zero)

+1

Do you intend to wait for other input before pushing?  FWIW I'm still
not convinced that the exposed problem is representative of any
realistic workload.  I of course entirely agree with the other
documentation changes.




Re: Index Skip Scan (new UniqueKeys)

2020-06-29 Thread Dmitry Dolgov
> On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote:
>
> I just get the rough idea of patch, looks we have to narrow down the
> user cases where we can use this method. Consider the below example:

Hi

Not exactly narrowing it down, but rather getting rid of wrong usage of
skipping for index scans. Since skipping for index scans was added later than
for index-only scans, I can imagine there are still blind spots, so it's good
that you've looked. In this particular case, when the index expressions do not
fully cover the expressions the result needs to be distinct on, skipping just
doesn't have enough information and should not be used. I'll add it to the
next version, thanks!




RE: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread higuchi.dais...@fujitsu.com
Thank you for comments.

>Unfortunately the diff command in your test script doesn't show me
>anything, but I can understand what you are thinking is a problem,
>maybe.

I'm sorry, I might have confused you... Let me explain how to use my test
script. I use the diff command to check whether the archiver has started;
diff outputs nothing to stdout.
So please look at the times printed by the two date commands in the output of
my test script.
You can confirm that the difference between the two date results does not
match the archive_timeout setting of 10 seconds.
If my test script runs for a few minutes, it means the problem is reproduced.

>immediately independently from checkpointer. The parameter, as
>described in documentation, forces the server to switch to a new WAL
>segment file periodically so that it can be archived, that is, it
>works only on primary.

I confirmed that this problem occurs in a non-replication environment.
The problem occurs when the database tries to archive WAL during or after
archive recovery.
So your patch may be good for solving another problem, but unfortunately it
didn't fix mine.

Regards,
Daisuke, Higuchi





Re: Allow CURRENT_ROLE in GRANTED BY

2020-06-29 Thread Peter Eisentraut

On 2020-06-24 23:08, Alvaro Herrera wrote:

On 2020-Jun-24, Peter Eisentraut wrote:


I was checking some loose ends in SQL conformance, when I noticed: We
support GRANT role ... GRANTED BY CURRENT_USER, but we don't support
CURRENT_ROLE in that place, even though in PostgreSQL they are equivalent.
Here is a trivial patch to add that.


Hmm, since this adds to RoleSpec, this change makes every place that
uses that production also take CURRENT_ROLE, so we'd need to document in
all those places.  For example, alter_role.sgml, create_schema.sgml,
etc.


Good point.  Here is an updated patch that updates all the documentation 
places where CURRENT_USER is mentioned.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 3dea85cf0391828e02f901ab808dad31422ee30b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 24 Jun 2020 08:21:48 +0200
Subject: [PATCH v2] Allow CURRENT_ROLE where CURRENT_USER is accepted

In the particular case of GRANTED BY, this is specified in the SQL
standard.  Since in PostgreSQL, CURRENT_ROLE is equivalent to
CURRENT_USER, and CURRENT_USER is already supported here, adding
CURRENT_ROLE is trivial.  The other cases are PostgreSQL extensions,
but for the same reason it also makes sense there.
---
 doc/src/sgml/ref/alter_aggregate.sgml| 2 +-
 doc/src/sgml/ref/alter_collation.sgml| 2 +-
 doc/src/sgml/ref/alter_conversion.sgml   | 2 +-
 doc/src/sgml/ref/alter_database.sgml | 2 +-
 doc/src/sgml/ref/alter_domain.sgml   | 2 +-
 doc/src/sgml/ref/alter_event_trigger.sgml| 2 +-
 doc/src/sgml/ref/alter_foreign_data_wrapper.sgml | 2 +-
 doc/src/sgml/ref/alter_foreign_table.sgml| 2 +-
 doc/src/sgml/ref/alter_function.sgml | 2 +-
 doc/src/sgml/ref/alter_group.sgml| 1 +
 doc/src/sgml/ref/alter_language.sgml | 2 +-
 doc/src/sgml/ref/alter_large_object.sgml | 2 +-
 doc/src/sgml/ref/alter_materialized_view.sgml| 2 +-
 doc/src/sgml/ref/alter_opclass.sgml  | 2 +-
 doc/src/sgml/ref/alter_operator.sgml | 2 +-
 doc/src/sgml/ref/alter_opfamily.sgml | 2 +-
 doc/src/sgml/ref/alter_policy.sgml   | 2 +-
 doc/src/sgml/ref/alter_procedure.sgml| 2 +-
 doc/src/sgml/ref/alter_publication.sgml  | 2 +-
 doc/src/sgml/ref/alter_role.sgml | 2 ++
 doc/src/sgml/ref/alter_routine.sgml  | 2 +-
 doc/src/sgml/ref/alter_schema.sgml   | 2 +-
 doc/src/sgml/ref/alter_sequence.sgml | 2 +-
 doc/src/sgml/ref/alter_server.sgml   | 2 +-
 doc/src/sgml/ref/alter_statistics.sgml   | 2 +-
 doc/src/sgml/ref/alter_subscription.sgml | 2 +-
 doc/src/sgml/ref/alter_table.sgml| 2 +-
 doc/src/sgml/ref/alter_tablespace.sgml   | 2 +-
 doc/src/sgml/ref/alter_tsconfig.sgml | 2 +-
 doc/src/sgml/ref/alter_tsdictionary.sgml | 2 +-
 doc/src/sgml/ref/alter_type.sgml | 2 +-
 doc/src/sgml/ref/alter_user.sgml | 1 +
 doc/src/sgml/ref/alter_user_mapping.sgml | 4 ++--
 doc/src/sgml/ref/alter_view.sgml | 2 +-
 doc/src/sgml/ref/create_policy.sgml  | 2 +-
 doc/src/sgml/ref/create_schema.sgml  | 1 +
 doc/src/sgml/ref/create_tablespace.sgml  | 2 +-
 doc/src/sgml/ref/create_user_mapping.sgml| 4 ++--
 doc/src/sgml/ref/drop_owned.sgml | 2 +-
 doc/src/sgml/ref/drop_user_mapping.sgml  | 4 ++--
 doc/src/sgml/ref/grant.sgml  | 1 +
 doc/src/sgml/ref/reassign_owned.sgml | 4 ++--
 doc/src/sgml/ref/revoke.sgml | 1 +
 src/backend/parser/gram.y| 4 
 44 files changed, 52 insertions(+), 41 deletions(-)

diff --git a/doc/src/sgml/ref/alter_aggregate.sgml 
b/doc/src/sgml/ref/alter_aggregate.sgml
index 2ad3e0440b..95934a100f 100644
--- a/doc/src/sgml/ref/alter_aggregate.sgml
+++ b/doc/src/sgml/ref/alter_aggregate.sgml
@@ -23,7 +23,7 @@
 
 ALTER AGGREGATE name ( 
aggregate_signature ) RENAME TO 
new_name
 ALTER AGGREGATE name ( 
aggregate_signature )
-OWNER TO { new_owner | CURRENT_USER 
| SESSION_USER }
+OWNER TO { new_owner | CURRENT_ROLE 
| CURRENT_USER | SESSION_USER }
 ALTER AGGREGATE name ( 
aggregate_signature ) SET SCHEMA 
new_schema
 
 where aggregate_signature is:
diff --git a/doc/src/sgml/ref/alter_collation.sgml 
b/doc/src/sgml/ref/alter_collation.sgml
index bee6f0dd3c..af9ff2867b 100644
--- a/doc/src/sgml/ref/alter_collation.sgml
+++ b/doc/src/sgml/ref/alter_collation.sgml
@@ -24,7 +24,7 @@
 ALTER COLLATION name REFRESH VERSION
 
 ALTER COLLATION name RENAME TO 
new_name
-ALTER COLLATION name OWNER TO { 
new_owner | CURRENT_USER | SESSION_USER }
+ALTER COLLATION name OWNER TO { 
new_owner | CURRENT_ROLE | CURRENT_USER | 
SESSION_USER }
 ALTER COLL

Re: xid wraparound danger due to INDEX_CLEANUP false

2020-06-29 Thread Masahiko Sawada
On Sun, 28 Jun 2020 at 02:44, Peter Geoghegan  wrote:
>
> On Fri, Jun 26, 2020 at 10:15 PM Masahiko Sawada
>  wrote:
> > Regarding to the extent of the impact, this bug will affect the user
> > who turned vacuum_index_cleanup off or executed manually vacuum with
> > INDEX_CLEANUP off for a long time, after some vacuums. On the other
> > hand, the user who uses INDEX_CLEANUP off on the spot or turns
> > vacuum_index_cleanup off of the table from the start would not be
> > affected or less affected.
>
> I don't think that it's likely to cause too much trouble. It's already
> possible to leak deleted pages, if only because the FSM isn't crash
> safe. Actually, the nbtree README says this, and has since 2003:
>
> """
> (Note: if we find a deleted page with an extremely old transaction
> number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
> a later xid wraparound can't cause us to think the page is unreclaimable.
> But in more normal situations this would be a waste of a disk write.)
> """
>
> But, uh, isn't the btvacuumcleanup() call supposed to avoid
> wraparound? Who knows?!
>
> It doesn't seem like the recycling aspect of page deletion was
> rigorously designed, possibly because it's harder to test than page
> deletion itself. This is a problem that we should fix.

Agreed.

>
> > I apologize for writing this patch without enough consideration. I
> > should have been more careful as I learned the nbtree page recycle
> > strategy when discussing  vacuum_cleanup_index_scale_factor patch.
>
> While it's unfortunate that this was missed, let's not lose
> perspective. Anybody using the INDEX_CLEANUP feature (whether it's
> through a direct VACUUM, or by using the reloption) is already asking
> for an extreme behavior: skipping regular index vacuuming. I imagine
> that the vast majority of users that are in that position just don't
> care about the possibility of leaking deleted pages. They care about
> avoiding a real disaster from XID wraparound.

For the back branches, I'm considering how we should let users know about
this. For safety, we could recommend in the documentation and/or the release
notes that users avoid INDEX_CLEANUP false unless it's necessary to avoid
running out of XIDs. On the other hand, since leaving recyclable pages behind
is already possible, as you mentioned, I'm concerned this could confuse users
and needlessly cause unrest. I'm thinking about what we can do for users, in
addition to leaving a summary of this discussion as a source code comment.
What do you think?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread higuchi.dais...@fujitsu.com
Fujii-san, thank you for comments.

>The cause of this problem is that the checkpointer's sleep time is calculated
>from both checkpoint_timeout and archive_timeout during normal running,
>but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
>patch tries to change that so that it's calculated from both of them even
>during recovery. No?

Yes, it's exactly so.

>last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
>large and cur_timeout can be negative. Isn't this problematic?

Yes... My patch was missing this.
How about using the original archive_timeout value for calculating cur_timeout 
during recovery?

        if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
        {
            elapsed_secs = now - last_xlog_switch_time;
            if (elapsed_secs >= XLogArchiveTimeout)
                continue;       /* no sleep for us ... */
            cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
        }
+       else if (XLogArchiveTimeout > 0)
+           cur_timeout = Min(cur_timeout, XLogArchiveTimeout);

During recovery, cur_timeout is not calculated precisely because elapsed_secs
is not used.
However, after recovery completes, WAL archiving will start once the next
archive_timeout is reached.
I feel this is enough to solve the problem.
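
For context, in the checkpointer's main loop the value above starts from
checkpoint_timeout and ends up as the latch wait timeout, roughly like this
(a simplified sketch, not the exact checkpointer.c code):

cur_timeout = CheckPointTimeout - elapsed_secs;
/* ... archive_timeout clamp as quoted above ... */
(void) WaitLatch(MyLatch,
                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 cur_timeout * 1000L /* convert to ms */ ,
                 WAIT_EVENT_CHECKPOINTER_MAIN);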

>As another approach, what about waking the checkpointer up at the end of
>recovery like we already do for walsenders?

If the above solution is not good, I will consider this approach.

Regards,
Daisuke, Higuchi




Re: some more pg_dump refactoring

2020-06-29 Thread Peter Eisentraut

On 2020-06-25 08:58, Fabien COELHO wrote:

You changed the query strings to use "\n" instead of " ". I would not have
changed that, because it departs from the style around, and I do not think
it improves readability at the C code level.


This was the style that was introduced by 
daa9fe8a5264a3f192efa5ddee8fb011ad9da365.



However, on version < 8.4, ISTM that funcargs and funciargs are always
added: is this voluntary?.


That was a mistake.


Would it make sense to accumulate in the other direction, older to newer,
so that new attributes are added at the end of the select?


I think that could make sense, but the current style was introduced by 
daa9fe8a5264a3f192efa5ddee8fb011ad9da365.  Should we revise that?



Should array_to_string be pg_catalog.array_to_string? All other calls seem
to have an explicit schema.


It's not handled fully consistently in pg_dump.  But my understanding is 
that this is no longer necessary because pg_dump explicitly sets the 
search path before running.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 617adf2512484610e40f2178a46d8c0af9b746b4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 29 Jun 2020 15:03:07 +0200
Subject: [PATCH v2] pg_dump: Reorganize dumpFunc() and dumpAgg()

Similar to daa9fe8a5264a3f192efa5ddee8fb011ad9da365, instead of
repeating the almost same large query in each version branch, use one
query and add a few columns to the SELECT list depending on the
version.  This saves a lot of duplication.
---
 src/bin/pg_dump/pg_dump.c | 492 --
 1 file changed, 157 insertions(+), 335 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a41a3db876..f3ad1a5d9e 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -11847,171 +11847,88 @@ dumpFunc(Archive *fout, FuncInfo *finfo)
asPart = createPQExpBuffer();
 
/* Fetch function-specific details */
+   appendPQExpBuffer(query,
+ "SELECT\n"
+ "proretset,\n"
+ "prosrc,\n"
+ "probin,\n"
+ "provolatile,\n"
+ "proisstrict,\n"
+ "prosecdef,\n"
+ "(SELECT lanname FROM 
pg_catalog.pg_language WHERE oid = prolang) AS lanname,\n");
+
if (fout->remoteVersion >= 12)
-   {
-   /*
-* prosupport was added in 12
-*/
appendPQExpBuffer(query,
- "SELECT proretset, prosrc, 
probin, "
- 
"pg_catalog.pg_get_function_arguments(oid) AS funcargs, "
- 
"pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs, "
- 
"pg_catalog.pg_get_function_result(oid) AS funcresult, "
- "array_to_string(protrftypes, 
' ') AS protrftypes, "
- "prokind, provolatile, 
proisstrict, prosecdef, "
- "proleakproof, proconfig, 
procost, prorows, "
- "prosupport, proparallel, "
- "(SELECT lanname FROM 
pg_catalog.pg_language WHERE oid = prolang) AS lanname "
- "FROM pg_catalog.pg_proc "
- "WHERE oid = 
'%u'::pg_catalog.oid",
- finfo->dobj.catId.oid);
-   }
-   else if (fout->remoteVersion >= 11)
-   {
-   /*
-* prokind was added in 11
-*/
+ "prosupport,\n");
+   else
appendPQExpBuffer(query,
- "SELECT proretset, prosrc, 
probin, "
- 
"pg_catalog.pg_get_function_arguments(oid) AS funcargs, "
- 
"pg_catalog.pg_get_function_identity_arguments(oid) AS funciargs, "
- 
"pg_catalog.pg_get_function_result(oid) AS funcresult, "
- "array_to_string(protrftypes, 
' ') AS protrftypes, "
- "prokind, provolatile, 
proisstrict, prosecdef, "
- "proleakproof, proconfig, 
procost, prorows, "
- "'-' AS prosupport, 

Re: track_planning causing performance regression

2020-06-29 Thread Ants Aasma
On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud  wrote:

> On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao
>  wrote:
> >
> > On 2020/06/29 16:05, Julien Rouhaud wrote:
> > > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins 
> wrote:
> > >>
> > >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1
> shows
> >
> > Thanks for the benchmark!
> >
> >
> > >> ~45% performance drop [2] at high DB connection counts (when compared
> with v12.3)
> >
> > That's bad :(
> >
> >
> > >>
> > >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
> > >> brings the TPS numbers up to v12.3 levels.
> > >>
> > >> The inflection point (in this test-case) is 128 Connections, beyond
> which the
> > >> TPS numbers are consistently low. Looking at the mailing list [1],
> this issue
> > >> didn't surface earlier possibly since the regression is trivial at
> low connection counts.
> > >>
> > >> It would be great if this could be optimized further, or
> track_planning
> > >> disabled (by default) so as to not trip users upgrading from v12 with
> pg_stat_statement
> > >> enabled (but otherwise not particularly interested in track_planning).
> >
> > Your benchmark result seems to suggest that the cause of the problem is
> > the contention of per-query spinlock in pgss_store(). Right?
> > This lock contention is likely to happen when multiple sessions run
> > the same queries.
> >
> > One idea to reduce that lock contention is to separate per-query spinlock
> > into two; one is for planning, and the other is for execution.
> pgss_store()
> > determines which lock to use based on the given "kind" argument.
> > To make this idea work, also every pgss counters like shared_blks_hit
> > need to be separated into two, i.e., for planning and execution.
>
> This can probably remove some overhead, but won't it eventually hit
> the same issue when multiple connections try to plan the same query,
> given the number of different queries and very low execution runtime?
> It'll also quite increase the shared memory consumption.
>
> I'm wondering if we could instead use atomics to store the counters.
> The only downside is that we won't guarantee per-row consistency
> anymore, which may be problematic.
>


The problem looks to be that spinlocks behave terribly when the CPU is
overloaded and the spinlock is contended. A process holding the spinlock
might easily get scheduled out, leading to excessive spinning by everybody.
I think a simple thing to try would be to replace the spinlock with an LWLock.
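
As a rough illustration of what I mean (not a tested patch; the structs below
only loosely mirror pg_stat_statements' pgssEntry, the names are made up, and
lock allocation / tranche registration is omitted):

#include "postgres.h"
#include "storage/lwlock.h"

/* Illustrative only; the real Counters struct has many more fields. */
typedef struct DemoCounters
{
    int64       calls;
    double      total_time;
    int64       rows;
} DemoCounters;

typedef struct DemoEntry
{
    DemoCounters counters;
    LWLock     *lock;           /* would replace the per-entry slock_t mutex */
} DemoEntry;

static void
demo_update_counters(DemoEntry *entry, double total_time, int64 rows)
{
    /* A waiter sleeps instead of spinning if the lock holder is descheduled. */
    LWLockAcquire(entry->lock, LW_EXCLUSIVE);
    entry->counters.calls += 1;
    entry->counters.total_time += total_time;
    entry->counters.rows += rows;
    LWLockRelease(entry->lock);
}

The fast path is a bit heavier than a spinlock, but the behavior under an
overloaded CPU should be much better.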

I did a prototype patch that replaces spinlocks with futexes, but was not
able to find a workload where it mattered. We have done a great job at
eliminating spinlocks from contended code paths. Robins, perhaps you could
try it to see if it reduces the regression you are observing. The patch is
against v13 stable branch.

-- 
Ants Aasma
Senior Database Engineer    www.cybertec-postgresql.com
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 7fac0703419..56d45b7cfce 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -90,6 +90,7 @@ s_lock_stuck(const char *file, int line, const char *func)
 int
 s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 {
+#ifndef HAS_FUTEX
 	SpinDelayStatus delayStatus;
 
 	init_spin_delay(&delayStatus, file, line, func);
@@ -102,6 +103,8 @@ s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
 	finish_spin_delay(&delayStatus);
 
 	return delayStatus.delays;
+#endif
+	elog(FATAL, "Should not be called");
 }
 
 #ifdef USE_DEFAULT_S_UNLOCK
@@ -218,6 +221,71 @@ update_spins_per_delay(int shared_spins_per_delay)
 	return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
 }
 
+#ifdef HAS_FUTEX
+#include 
+#include 
+#include 
+
+static int
+futex(volatile uint32 *uaddr, int futex_op, int val,
+	  const struct timespec *timeout, int *uaddr2, int val3)
+{
+	return syscall(SYS_futex, uaddr, futex_op, val,
+   timeout, uaddr, val3);
+}
+
+int
+futex_lock(volatile slock_t *lock, uint32 current, const char *file, int line, const char *func)
+{
+	int i, s;
+	/*
+	 * First lets wait for a bit without involving the kernel, it is quite likely
+	 * the lock holder is still running.
+	 **/
+	if (likely(current < 2))
+	{
+		uint32 expected;
+		for (i = 0; i < DEFAULT_SPINS_PER_DELAY; i++)
+		{
+			SPIN_DELAY();
+			expected = lock->value;
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 1))
+return i;
+		}
+
+		while (expected != 2 && !pg_atomic_compare_exchange_u32(lock, &expected, 2)) {
+			if (expected == 0 && pg_atomic_compare_exchange_u32(lock, &expected, 2))
+return i;
+		}
+	}
+
+	/* At this point lock value is 2 and we will get waken up */
+	while (true)
+	{
+		uint32 expected = 0;
+		s = futex(&(lock->value), FUTEX_WAIT, 2, NULL, NULL, 0);
+		if (s == -1 && errno != EAGAIN)
+			elog(FATAL, "Futex wait failed with error: %m");
+
+		/* Maybe someone else was waiting too

Re: Why forbid "INSERT INTO t () VALUES ();"

2020-06-29 Thread Stephen Frost
Greetings,

* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> On 2020-06-25 18:07, Tom Lane wrote:
> >So I'm still not convinced we should do this.  "MySQL is incapable
> >of conforming to the standard" is a really lousy reason for us to do
> >something.
> 
> Conformance to the standard means that the syntax described in the standard
> behaves as specified in the standard.  It doesn't mean you can't have
> additional syntax that is not in the standard.

Agreed in general with the caveat that we don't want to support syntax
that the standard might decide later means something else.

For this case, however, I tend to agree with the other folks on this
thread who feel that we should add it- since it seems quite unlikely
that the standard folks would define this syntax to somehow mean
something else.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Fwd: PostgreSQL: WolfSSL support

2020-06-29 Thread Stephen Frost
Greetings,

* Felix Lechner (felix.lech...@lease-up.com) wrote:
> Attached please find a WIP patch for wolfSSL support in postgresql-12.

Would really be best to have this off of HEAD if we're going to be
looking at it rather than v12.  We certainly aren't going to add new
support for something new into the back-branches.

Further, I'd definitely suggest seeing how this plays with the patch to
add support for NSS which was posted recently to -hackers by Daniel.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: Fwd: PostgreSQL: WolfSSL support

2020-06-29 Thread Stephen Frost
Greetings,

* Christoph Berg (m...@debian.org) wrote:
> Re: Peter Eisentraut
> > What would be the advantage of using wolfSSL over OpenSSL?
> 
> Avoiding the OpenSSL-vs-GPL linkage problem with readline.

I'd further say "folks are interested in an alternative to OpenSSL" as
being a generally good reason to add support for alternatives, such as
the patch to add NSS support, which would also help with the GPL linkage
problem and add a higher FIPS rating option than OpenSSL.

Thanks,

Stephen


signature.asc
Description: PGP signature


Parallell hashjoin sometimes ignores temp_tablespaces

2020-06-29 Thread Magnus Hagander
If a database (a) has a default tablespace set, and (b) temp_tablespaces is
set to that same tablespace, temporary files can end up in the wrong place.

Reproduction:

CREATE TABLESPACE t LOCATION '/tmp/t';
CREATE DATABASE dumb TABLESPACE t;
\c dumb
SET temp_tablespaces=t;

At this point if you run a query with a parallel hash join in it, the
tempfiles go in base/pgsql_tmp instead of the temporary tablespace. For
example:

create table foo(bar int);
insert into foo select * from generate_series(1,100);
set parallel_tuple_cost =0;
set parallel_setup_cost =0;
set log_temp_files=0;
set client_min_messages ='log';
explain analyze select foo.bar,count(*) from foo inner join foo foo2 on
foo.bar=foo2.bar group by foo.bar;

Will trigger some temp files in the 't' tablespace and some in the
'pg_default' one.

I think the fix is the attached one (tested on version 11 which is what
$customer is using).  To me it looks like this may have been a copy/paste
error all the way back in 98e8b480532 which added default_tablespace back
in 2004. (And is in itself entirely unrelated to parallel hashjoin, but
that's where it got exposed at least in my case)

Thoughts?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 946777f48b..3105efe040 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -1354,7 +1354,7 @@ PrepareTempTablespaces(void)
 		 */
 		if (curoid == MyDatabaseTableSpace)
 		{
-			tblSpcs[numSpcs++] = InvalidOid;
+			tblSpcs[numSpcs++] = curoid;
 			continue;
 		}
 


Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Tomas Vondra

On Sun, Jun 28, 2020 at 06:39:40PM -0700, Peter Geoghegan wrote:

On Sat, Jun 27, 2020 at 3:41 AM Tomas Vondra
 wrote:

Well, there are multiple ideas discussed in this (sub)thread, one of
them being a per-query memory limit. That requires decisions how much
memory should different nodes get, which I think would need to be
cost-based.


A design like that probably makes sense. But it's way out of scope for
Postgres 13, and not something that should be discussed further on
this thread IMV.



100% agree


Of course, a simpler scheme like this would not require that. And maybe
introducing hash_mem is a good idea - I'm not particularly opposed to
that, actually. But I think we should not introduce separate memory
limits for each node type, which was also mentioned earlier.


I had imagined that hash_mem would apply to hash join and hash
aggregate only. A GUC that either represents a multiple of work_mem,
or an absolute work_mem-style KB value.



I'm not against having a hash_mem (and I'd vote for it to be a simple
work_mem-style value, not a multiple). Maybe we should have it; the
argument to allow hashagg (and perhaps hashjoin) to use more memory than
some other nodes seems convincing to me.

I'm just not sure which of the problems mentioned in this thread it
actually addresses ...


The problem of course is that hash_mem does not really solve the issue
discussed at the beginning of this thread, i.e. regressions due to
underestimates and unexpected spilling at execution time.


Like Andres, I suspect that that's a smaller problem in practice. A
hash aggregate that spills often has performance characteristics
somewhat like a group aggregate + sort, anyway. I'm worried about
cases where an *in-memory* hash aggregate is naturally far far faster
than other strategies, and yet we can't use it -- despite the fact
that Postgres 12 could "safely" do so. (It probably doesn't matter
whether the slow plan that you get in Postgres 13 is a hash aggregate
that spills, or something else -- this is not really a costing
problem.)



Not sure I follow. Which cases do you mean when you say that 12 could
safely do them, but 13 won't? I see the following two cases:


a) Planner in 12 and 13 disagree about whether the hash table will fit
into work_mem.

I don't quite see why this would be the case (given the same cardinality
estimates etc.), though. That is, if 12 says "will fit" I'd expect 13 to
end up with the same conclusion. But maybe 13 has higher per-tuple
overhead or something? I know we set aside some memory for BufFiles, but
not when we expect the whole hash table to fit into memory.


b) Planner believes the hash table will fit, due to underestimate.

On 12, we'd just let the hash table overflow, which may be a win when
there's enough RAM and the estimate is not "too wrong". But it may
easily end with a sad OOM.

On 13, we'll just start spilling. True - people tend to use conservative
work_mem values exactly because of cases like this (which is somewhat
futile as the underestimate may be arbitrarily wrong) and also because
they don't know how many work_mem instances the query will use.

So yeah, I understand why people may not want to increase work_mem too
much, and maybe hash_mem would be a way to get the "no spill" behavior.



Besides, hash_mem *can* solve that problem to some extent. Other cases
(cases where the estimates are so bad that hash_mem won't help) seem
like less of a concern to me. To some extent, that's the price you pay
to avoid the risk of an OOM.



True, avoiding the risk of OOM has its cost.


The thread is getting a rather confusing mix of proposals how to fix
that for v13 and proposals how to improve our configuration of memory
limits :-(


As I said to Amit in my last message, I think that all of the ideas
that are worth pursuing involve giving hash aggregate nodes license to
use more memory than other nodes. One variant involves doing so only
at execution time, while the hash_mem idea involves formalizing and
documenting that hash-based nodes are special -- and taking that into
account during both planning and execution.



Understood. I mostly agree with this.


Interesting. What is not entirely clear to me is how these databases
decide how much each node should get during planning. With separate
work_mem-like settings it's fairly obvious, but how do they do that with
a global limit (either per-instance or per-query)?


I don't know, but that seems like a much more advanced way of
approaching the problem. It isn't in scope here.



+1


Perhaps I'm not considering some unintended consequence of the planner
giving hash-based nodes extra memory "for free" in the common case
where hash_mem exceeds work_mem (by 2x, say). But my guess is that
that won't be a significant problem that biases the planner in some
obviously undesirable way.



My concern is how much more difficult these proposals would make it to
reason about memory usage. Maybe not much, I'm not sure.

I certainly agree it may be

Re: pgsql: Enable Unix-domain sockets support on Windows

2020-06-29 Thread Andrew Dunstan


On 6/28/20 4:33 AM, Peter Eisentraut wrote:
> On 2020-06-27 13:57, Amit Kapila wrote:
>> Fair enough, but what should be the behavior in the Windows versions
>> (<10) where Unix-domain sockets are not supported?
>
> You get an error about an unsupported address family, similar to
> trying to use IPv6 on a system that doesn't support it.
>
>> BTW, in which
>> format the path needs to be specified for unix_socket_directories?  I
>> tried with '/c/tmp', 'c:/tmp', 'tmp' but nothing seems to be working,
>> it gives me errors like: "could not create lock file
>> "/c/tmp/.s.PGSQL.5432.lock": No such file or directory" on server
>> start.  I am trying this on Win7 just to check what is the behavior of
>> this feature on it.
>
> Hmm, the only thing I remember about this now is that you need to use
> native Windows paths, meaning you can't just use /tmp under MSYS, but
> it needs to be something like C:\something.  But the error you have
> there is not even about the socket file but about the lock file, which
> is a normal file, so if that goes wrong, it might be an unrelated
> problem.
>


It needs to be a path from the Windows POV, not an Msys virtualized
path. So c:/tmp or just /tmp should work, but /c/tmp or similar probably
will not. The directory needs to exist. I just checked that this is
working, both in postgresql.conf and on the psql command line.


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Bruce Momjian
On Sun, Jun 28, 2020 at 05:40:16PM -0700, Peter Geoghegan wrote:
> I think any problem that might come up with the costing is best
> thought of as a distinct problem. This thread is mostly about the
> problem of users getting fewer in-memory hash aggregates compared to a
> previous release running the same application (though there has been
> some discussion of the other problem, too [1], but it's thought to be
> less serious).
> 
> The problem is that affected users were theoretically never entitled
> to the performance they came to rely on, and yet there is good reason
> to think that hash aggregate really should be entitled to more memory.
> They won't care that they were theoretically never entitled to that
> performance, though -- they *liked* the fact that hash agg could
> cheat. And they'll dislike the fact that this cannot be corrected by
> tuning work_mem, since that affects all node types that consume
> work_mem, not just hash aggregate -- that could cause OOMs for them.
> 
> There are two or three similar ideas under discussion that might fix
> the problem. They all seem to involve admitting that hash aggregate's
> "cheating" might actually have been a good thing all along (even
> though giving hash aggregate much much more memory than other nodes is
> terrible), and giving hash aggregate license to "cheat openly". Note
> that the problem isn't exactly a problem with the hash aggregate
> spilling patch. You could think of the problem as a pre-existing issue
> -- a failure to give more memory to hash aggregate, which really
> should be entitled to more memory. Jeff's patch just made the issue
> more obvious.

In thinking some more about this, I came up with two ideas. First, in
pre-PG 13, we didn't choose hash_agg if we thought it would spill, but
if we misestimated and it used more work_mem, we allowed it.  The effect
of this is that if we were close, but it went over, we allowed it just
for hash_agg.  Is this something we want to codify for all node types,
i.e., choose a non-spill node type if we need a lot more than work_mem,
but then let work_mem be a soft limit if we do choose it, e.g., allow
50% over work_mem in the executor for misestimation before spill?  My
point is, do we want to use a lower work_mem for planning and a higher
one in the executor before spilling.
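
To make that concrete, the kind of executor-side check I have in mind would
look something like this (purely illustrative -- the 50% slop, the function
name, and where it would get called from are all made up):

#include "postgres.h"
#include "miscadmin.h"

/* Allow actual usage to exceed the planner's work_mem budget by 50%. */
#define WORK_MEM_SLOP 1.5

static bool
exceeds_soft_limit(Size bytes_used)
{
    Size    soft_limit = (Size) (work_mem * 1024L * WORK_MEM_SLOP);

    return bytes_used > soft_limit;
}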

My second thought is from an earlier report that spilling is very
expensive, but smaller work_mem doesn't seem to hurt much.  Would we
achieve better overall performance by giving a few nodes a lot of memory
(and not spill those), and other nodes very little, rather than having
them all be the same size, and all spill?

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: bugfix: invalid bit/varbit input causes the log file to be unreadable

2020-06-29 Thread Tom Lane
Quan Zongliang  writes:
> I tested it, and it looks fine.

Pushed, thanks for reporting the issue!

regards, tom lane




Re: pgbench and timestamps (bounced)

2020-06-29 Thread Fabien COELHO


[Resent on hackers for CF registration, sorry for the noise]

Hello Tom,

The attached patch fixes some of the underlying problems reported, by delaying
the :var to $1 substitution to the last possible moment, so that it is known
which variables are actually defined. PREPARE-ing is also delayed until after
these substitutions are done.


It requires a mutex around the commands; I tried to provide a Windows
implementation, which may or may not work.


The attached patch fixes (2) & (3) for extended & prepared.

I have doubts about fixing (1), because it would be a significant behavioral
change and would require changing the replace routine significantly to check
for quoting, comments, and so on. This means that ':var' is currently still
broken under -M extended & prepared; I could only break it differently by
providing a nicer error message, and that would also break it under simple,
where it currently works. I'm not thrilled about spending effort on that.


The patches change the name of "parseQuery" to "makeVariablesParameters", 
because it was not actually parsing any query. Maybe the new name could be 
improved.


In passing, there was a bug in how NULL was passed, which I tried to fix
as well.


I don't often do much with pgbench and variables, but there are a few
things that surprise me here.
1) That pgbench replaces variables within single quotes, and;
2) that we still think it's a variable name when it starts with a digit, 
and;

3) We replace variables that are undefined.



Also (4) this only happens when in non-simple query mode --- the
example works fine without "-M prepared".


After looking around in the code, it seems like the core of the issue
is that pgbench.c's parseQuery() doesn't check whether a possible
variable name is actually defined, unlike assignVariables() which is
what does the same job in simple query mode.  So that explains the
behavioral difference.


Yes.


The reason for doing that probably was that parseQuery() is run when
the input file is read, so that relevant variables might not be set
yet.  We could fix that by postponing the work to be done at first
execution of the query, as is already the case for PQprepare'ing the
query.


Yep, done at first execution of the Command, so that variables are known.
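
To illustrate the rule in isolation, here is a standalone sketch (not the
patch code; the helper names and the query are made up, each occurrence gets
its own $n here, and point (1) about quoting is deliberately not handled):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Is name[0..len) one of the defined variable names? */
static int
is_defined(const char *name, int len, const char *defined[], int ndefined)
{
    for (int i = 0; i < ndefined; i++)
        if (strlen(defined[i]) == (size_t) len &&
            strncmp(defined[i], name, len) == 0)
            return 1;
    return 0;
}

/*
 * Rewrite ":name" into "$1", "$2", ... but only for names that are defined
 * and that start with a letter or underscore, so ':34' in '12:34' is kept.
 */
static int
rewrite_params(const char *query, const char *defined[], int ndefined,
               char *out, size_t outsz)
{
    int         nparams = 0;
    size_t      o = 0;
    const char *p = query;

    while (*p && o + 16 < outsz)
    {
        if (p[0] == ':' && (isalpha((unsigned char) p[1]) || p[1] == '_'))
        {
            const char *start = p + 1;
            const char *end = start;

            while (isalnum((unsigned char) *end) || *end == '_')
                end++;
            if (is_defined(start, (int) (end - start), defined, ndefined))
            {
                o += (size_t) snprintf(out + o, outsz - o, "$%d", ++nparams);
                p = end;
                continue;
            }
        }
        out[o++] = *p++;
    }
    out[o] = '\0';
    return nparams;
}

int
main(void)
{
    const char *defined[] = {"aid"};
    char        buf[256];

    rewrite_params("SELECT abalance FROM pgbench_accounts "
                   "WHERE aid = :aid AND ts < '12:34';",
                   defined, 1, buf, sizeof(buf));
    puts(buf);      /* ... WHERE aid = $1 AND ts < '12:34'; */
    return 0;
}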


Also, after further thought I realize that (1) absolutely is a bug
in the non-simple query modes, whatever you think about it in simple
mode.  The non-simple modes are trying to pass the variable values
as extended-query-protocol parameters, and the backend is not going
to recognize $n inside a literal as being a parameter.


Yep. See my comments above.


If we fixed (1) and (3) I think there wouldn't be any great need
to tighten up (2).


I did (2) but not (1), for now.

--
Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 8e728dc094..7436210fd4 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -998,15 +998,14 @@ pgbench  options  d
   
There is a simple variable-substitution facility for script files.
Variable names must consist of letters (including non-Latin letters),
-   digits, and underscores.
+   digits (but not on the first character), and underscores.
Variables can be set by the command-line -D option,
explained above, or by the meta commands explained below.
In addition to any variables preset by -D command-line options,
there are a few variables that are preset automatically, listed in
. A value specified for these
variables using -D takes precedence over the automatic presets.
-   Once set, a variable's
-   value can be inserted into a SQL command by writing
+   Once set, a variable's value can be inserted into a SQL command by writing
:variablename.  When running more than
one client session, each session has its own set of variables.
pgbench supports up to 255 variable uses in one
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 08a5947a9e..09ccf05db5 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -119,12 +119,24 @@ typedef int pthread_attr_t;
 
 static int	pthread_create(pthread_t *thread, pthread_attr_t *attr, void *(*start_routine) (void *), void *arg);
 static int	pthread_join(pthread_t th, void **thread_return);
+
+typedef HANDLE pthread_mutex_t;
+static void pthread_mutex_init(pthread_mutex_t *pm, void *unused);
+static void pthread_mutex_lock(pthread_mutex_t *pm);
+static void pthread_mutex_unlock(pthread_mutex_t *pm);
+static void pthread_mutex_destroy(pthread_mutex_t *pm);
+
 #elif defined(ENABLE_THREAD_SAFETY)
 /* Use platform-dependent pthread capability */
 #include 
 #else
 /* No threads implementation, use none (-j 1) */
 #define pthread_t void *
+#define pthread_mutex_t void *
+#define pthread_mutex_init(m, p)
+#define pthread_mutex_lock(m)
+#define pthread_mutex_unlock(m)
+#define pthread_mutex_destroy(m)
 #endif
 
 
@@ -422,7 +434,7 @@ typedef struct
 	instr_time	txn_begin;		/* used 

[PATCH] Better cleanup in TLS tests for -13beta2

2020-06-29 Thread Felix Lechner
Hi,

This patch removes two temporary files that are otherwise left behind. In
Debian, repeated builds fail because of them; we do not allow builds from
modified sources.

The first file was clearly an oversight. It was created separately. I
am not sure why the loop over @keys did not remove the second.

For the record, the error message from Debian is below. Lines 25-27
show the files that were left behind.

Kind regards,
Felix Lechner

lechner@4bba56c5a8a8:~/postgresql$ debuild
 dpkg-buildpackage -us -uc -ui
dpkg-buildpackage: info: source package postgresql-13
dpkg-buildpackage: info: source version 13~beta2-1
dpkg-buildpackage: info: source distribution experimental
dpkg-buildpackage: info: source changed by Christoph Berg 
 dpkg-source --before-build .
dpkg-buildpackage: info: host architecture amd64
 fakeroot debian/rules clean
dh clean
   debian/rules override_dh_auto_clean
make[1]: Entering directory '/home/lechner/postgresql'
rm -rf build
make[1]: Leaving directory '/home/lechner/postgresql'
   dh_autoreconf_clean
   dh_clean
 dpkg-source -b .
dpkg-source: info: using source format '3.0 (quilt)'
dpkg-source: info: building postgresql-13 using existing
./postgresql-13_13~beta2.orig.tar.bz2
dpkg-source: info: using patch list from debian/patches/series
dpkg-source: warning: ignoring deletion of file configure, use
--include-removal to override
dpkg-source: warning: ignoring deletion of file
src/include/pg_config.h.in, use --include-removal to override
dpkg-source: warning: ignoring deletion of file
doc/src/sgml/man-stamp, use --include-removal to override
dpkg-source: warning: ignoring deletion of file
doc/src/sgml/html-stamp, use --include-removal to override
dpkg-source: info: local changes detected, the modified files are:
 postgresql/src/test/ssl/ssl/client_tmp.key
 postgresql/src/test/ssl/ssl/client_wrongperms_tmp.key
dpkg-source: error: aborting due to unexpected upstream changes, see
/tmp/postgresql-13_13~beta2-1.diff.gy3ajb
dpkg-source: info: you can integrate the local changes with dpkg-source --commit
dpkg-buildpackage: error: dpkg-source -b . subprocess returned exit status 2
debuild: fatal error at line 1182:
dpkg-buildpackage -us -uc -ui failed
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -544,6 +544,8 @@ test_connect_fails($common_connstr, "ssl
 	qr/SSL error/, "intermediate client certificate is missing");
 
 # clean up
+unlink("ssl/client_wrongperms_tmp.key");
+unlink("ssl/client_tmp.key");
 foreach my $key (@keys)
 {
 	unlink("ssl/${key}_tmp.key");


Re: +(pg_lsn, int8) and -(pg_lsn, int8) operators

2020-06-29 Thread Asif Rehman
Hi,

The patch looks fine to me, however there is one hunk failing for the test 
case, so it needs to be rebased.

The new status of this patch is: Waiting on Author


Re: +(pg_lsn, int8) and -(pg_lsn, int8) operators

2020-06-29 Thread Fujii Masao



On 2020/06/30 1:03, Asif Rehman wrote:

Hi,

The patch looks fine to me, however there is one hunk failing for the test 
case, so it needs to be rebased.


Thanks for the check! Attached is the updated version of the patch.



The new status of this patch is: Waiting on Author


I will change the status back to Needs Review.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 49fb19ff91..7027758d28 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -4801,7 +4801,13 @@ SELECT * FROM pg_attribute
 standard comparison operators, like = and
 >.  Two LSNs can be subtracted using the
 - operator; the result is the number of bytes separating
-those write-ahead log locations.
+those write-ahead log locations.  Also the number of bytes can be
+added into and subtracted from LSN using the
++(pg_lsn,numeric) and
+-(pg_lsn,numeric) operators, respectively. Note that
+the calculated LSN should be in the range of pg_lsn type,
+i.e., between 0/0 and
+/.

   
 
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 5f23f2afac..1773fa292e 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -41,6 +41,7 @@
 #include "utils/guc.h"
 #include "utils/int8.h"
 #include "utils/numeric.h"
+#include "utils/pg_lsn.h"
 #include "utils/sortsupport.h"
 
 /* --
@@ -472,6 +473,7 @@ static void apply_typmod(NumericVar *var, int32 typmod);
 static bool numericvar_to_int32(const NumericVar *var, int32 *result);
 static bool numericvar_to_int64(const NumericVar *var, int64 *result);
 static void int64_to_numericvar(int64 val, NumericVar *var);
+static bool numericvar_to_uint64(const NumericVar *var, uint64 *result);
 #ifdef HAVE_INT128
 static bool numericvar_to_int128(const NumericVar *var, int128 *result);
 static void int128_to_numericvar(int128 val, NumericVar *var);
@@ -3692,6 +3694,30 @@ numeric_float4(PG_FUNCTION_ARGS)
 }
 
 
+Datum
+numeric_pg_lsn(PG_FUNCTION_ARGS)
+{
+   Numeric num = PG_GETARG_NUMERIC(0);
+   NumericVar  x;
+   XLogRecPtr  result;
+
+   if (NUMERIC_IS_NAN(num))
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot convert NaN to pg_lsn")));
+
+   /* Convert to variable format and thence to pg_lsn */
+   init_var_from_num(num, &x);
+
+   if (!numericvar_to_uint64(&x, (uint64 *) &result))
+   ereport(ERROR,
+   (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+errmsg("pg_lsn out of range")));
+
+   PG_RETURN_LSN(result);
+}
+
+
 /* --
  *
  * Aggregate functions
@@ -6742,6 +6768,78 @@ int64_to_numericvar(int64 val, NumericVar *var)
var->weight = ndigits - 1;
 }
 
+/*
+ * Convert numeric to uint64, rounding if needed.
+ *
+ * If overflow, return false (no error is raised).  Return true if okay.
+ */
+static bool
+numericvar_to_uint64(const NumericVar *var, uint64 *result)
+{
+   NumericDigit *digits;
+   int ndigits;
+   int weight;
+   int i;
+   uint64  val;
+   NumericVar  rounded;
+
+   /* Round to nearest integer */
+   init_var(&rounded);
+   set_var_from_var(var, &rounded);
+   round_var(&rounded, 0);
+
+   /* Check for zero input */
+   strip_var(&rounded);
+   ndigits = rounded.ndigits;
+   if (ndigits == 0)
+   {
+   *result = 0;
+   free_var(&rounded);
+   return true;
+   }
+
+   /* Check for negative input */
+   if (rounded.sign == NUMERIC_NEG)
+   {
+   free_var(&rounded);
+   return false;
+   }
+
+   /*
+* For input like 100, we must treat stripped digits as real. So
+* the loop assumes there are weight+1 digits before the decimal point.
+*/
+   weight = rounded.weight;
+   Assert(weight >= 0 && ndigits <= weight + 1);
+
+   /* Construct the result */
+   digits = rounded.digits;
+   val = digits[0];
+   for (i = 1; i <= weight; i++)
+   {
+   if (unlikely(pg_mul_u64_overflow(val, NBASE, &val)))
+   {
+   free_var(&rounded);
+   return false;
+   }
+
+   if (i < ndigits)
+   {
+   if (unlikely(pg_add_u64_overflow(val, digits[i], &val)))
+   {
+   free_var(&rounded);
+   return false;
+   }
+   }
+   }
+
+   free_va

Re: [HACKERS] Custom compression methods

2020-06-29 Thread Andres Freund
Hi,

On 2020-06-24 07:59:47 -0400, Robert Haas wrote:
> On Tue, Jun 23, 2020 at 4:00 PM Andres Freund  wrote:
> > https://postgr.es/m/20130621000900.GA12425%40alap2.anarazel.de is a
> > thread with more information / patches further along.
> >
> > I confused this patch with the approach in
> > https://www.postgresql.org/message-id/d8576096-76ba-487d-515b-44fdedba8bb5%402ndquadrant.com
> > sorry for that.  It obviously still differs by not having lower space
> > overhead (by virtue of not having a 4 byte 'va_cmid', but no additional
> > space for two methods, and then 1 byte overhead for 256 more), but
> > that's not that fundamental a difference.
> 
> Wait a minute. Are we saying there are three (3) dueling patches for
> adding an alternate TOAST algorithm? It seems like there is:
> 
> This "custom compression methods" thread - vintage 2017 - Original
> code by Nikita Glukhov, later work by Ildus Kurbangaliev
> The "pluggable compression support" thread - vintage 2013 - Andres Freund
> The "plgz performance" thread - vintage 2019 - Petr Jelinek
> 
> Anyone want to point to a FOURTH implementation of this feature?

To be clear, I don't think the 2013 patch should be considered as being
"in the running".

Greetings,

Andres Freund




Re: pg_bsd_indent compiles bytecode

2020-06-29 Thread Andres Freund
Hi,

On 2020-06-27 18:43:40 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I can test that with another program, but for some reason pg_bsd_indent
> > fails to build against 13/HEAD, but builds fine against 12. Not sure yet
> > what's up:
> 
> Huh.  Works here on RHEL8 ... what platform are you using?

That was on Debian unstable, but I don't think it's really related. The
issue turns out to be that gcc 10 changed the default from -fcommon to
-fno-common, and I had 13/HEAD set up to use gcc 10, but 12 to use gcc
9.

The way that pg_bsd_indent defines its variables isn't standard C, as
far as I can tell, which explains the errors I was getting. All the
individual files include indent_globs.h, which declares/defines a bunch
of variables. Since it doesn't use extern, they'll all end up as full
definitions in each .o when -fno-common is used (the default now), hence
the multiple definition errors. The only reason it works with -fcommon
is that they'll end up processed as weak symbols and 'deduplicated' at
link time.
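
To illustrate with a minimal example (my own, not pg_bsd_indent code):

/* globs.h -- what indent_globs.h effectively does: */
int     opt_foo;        /* tentative definition in every file that includes it */

/* a.c and b.c both do: */
#include "globs.h"

/*
 * With -fcommon (the old default) the duplicate opt_foo symbols get merged
 * at link time, so it links.  With gcc 10's -fno-common default the link
 * fails with "multiple definition of `opt_foo'".  The conforming fix is to
 * declare in the header and define exactly once:
 */

/* globs.h */
extern int  opt_foo;        /* declaration only */

/* globs.c */
int         opt_foo = 0;    /* single definition */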

Ick.

Greetings,

Andres Freund




Re: pg_bsd_indent compiles bytecode

2020-06-29 Thread Andres Freund
Hi,

On 2020-06-27 18:54:04 -0400, Tom Lane wrote:
> Having now checked, there isn't any such problem.  No .bc files are
> getting built except in src/backend and in other modules that feed
> into the backend, such as src/timezone and most of contrib.

> I do see .bc files getting built for pg_bsd_indent, as Alvaro reported.
> Seems like it must be a bug in the pgxs make logic, not anything more
> generic.

Yea. The issue is in pgxs.mk. So it does affect contrib/ modules that
use PROGRAM (e.g. contrib/pg_standby/pg_standby.bc is built), but not
other parts of the tree.

It's easy enough to fix by just adding an ifndef PROGRAM around the piece
adding the dependency to the .bc files:

ifeq ($(with_llvm), yes)
ifndef PROGRAM
all: $(addsuffix .bc, $(MODULES)) $(patsubst %.o,%.bc, $(OBJS))
endif # PROGRAM
endif # with_llvm

but it's not particularly pretty. But given the way pgxs.mk is
structured, I'm not sure there's really a great answer?

Greetings,

Andres Freund




Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Peter Geoghegan
On Mon, Jun 29, 2020 at 8:07 AM Tomas Vondra
 wrote:
> Not sure I follow. Which cases do you mean when you say that 12 could
> safely do them, but 13 won't? I see the following two cases:

> a) Planner in 12 and 13 disagree about whether the hash table will fit
> into work_mem.
>
> I don't quite see why this would be the case (given the same cardinality
> estimates etc.), though. That is, if 12 says "will fit" I'd expect 13 to
> end up with the same conclusion. But maybe 13 has higher per-tuple
> overhead or something? I know we set aside some memory for BufFiles, but
> not when we expect the whole hash table to fit into memory.

I have no reason to believe that the planner is any more or any less
likely to conclude that the hash table will fit in memory in v13 as
things stand (I don't know if the BufFile issue matters).

In general, grouping estimates probably aren't very good compared to
join estimates. I imagine that in either v12 or v13 the planner is
likely to incorrectly believe that it'll all fit in memory fairly
often. v12 was much too permissive about what could happen. But v13 is
too conservative.

> b) Planner believes the hash table will fit, due to underestimate.
>
> On 12, we'd just let the hash table overflow, which may be a win when
> there's enough RAM and the estimate is not "too wrong". But it may
> easily end with a sad OOM.

It might end up with an OOM on v12 due to an underestimate -- but
probably not! The fact that a hash aggregate is faster than a group
aggregate ameliorates the higher memory usage. You might actually use
less memory this way.

> On 13, we'll just start spilling. True - people tend to use conservative
> work_mem values exactly because of cases like this (which is somewhat
> futile as the underestimate may be arbitrarily wrong) and also because
> they don't know how many work_mem instances the query will use.
>
> So yeah, I understand why people may not want to increase work_mem too
> much, and maybe hash_mem would be a way to get the "no spill" behavior.

Andres wanted to increase the amount of memory that could be used at
execution time, without changing planning. You could say that hash_mem
is a more ambitious proposal than that. It's changing the behavior
across the board -- though in a way that makes sense anyway. It has
the additional benefit of making it more likely that an in-memory hash
aggregate will be used. That isn't a problem that we're obligated to
solve now, so this may seem odd. But if the more ambitious plan is
actually easier to implement and support, why not pursue it?

hash_mem seems a lot easier to explain and reason about than having
different work_mem budgets during planning and execution, which is
clearly a kludge. hash_mem makes sense generally, and more or less
solves the problems raised on this thread.

-- 
Peter Geoghegan




Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Bruce Momjian
On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:
> I have no reason to believe that the planner is any more or any less
> likely to conclude that the hash table will fit in memory in v13 as
> things stand (I don't know if the BufFile issue matters).
> 
> In general, grouping estimates probably aren't very good compared to
> join estimates. I imagine that in either v12 or v13 the planner is
> likely to incorrectly believe that it'll all fit in memory fairly
> often. v12 was much too permissive about what could happen. But v13 is
> too conservative.

FYI, we have improved planner statistics estimates for years, which must
have affected node spill behavior on many node types (except hash_agg),
and I don't remember any complaints about it.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Peter Geoghegan
On Mon, Jun 29, 2020 at 8:29 AM Bruce Momjian  wrote:
> Is this something we want to codify for all node types,
> i.e., choose a non-spill node type if we need a lot more than work_mem,
> but then let work_mem be a soft limit if we do choose it, e.g., allow
> 50% over work_mem in the executor for misestimation before spill?  My
> point is, do we want to use a lower work_mem for planning and a higher
> one in the executor before spilling.

Andres said something about doing that with hash aggregate, which I
can see an argument for, but I don't think that it would make sense
with most other nodes. In particular, sorts still perform almost as
well with only a fraction of the "optimal" memory.

> My second thought is from an earlier report that spilling is very
> expensive, but smaller work_mem doesn't seem to hurt much.

It's not really about the spilling itself IMV. It's the inability to
do hash aggregation in a single pass.

You can think of hashing (say for hash join or hash aggregate) as a
strategy that consists of a logical division followed by a physical
combination. Sorting (or sort merge join, or group agg), in contrast,
consists of a physical division and logical combination. As a
consequence, it can be a huge win to do everything in memory in the
case of hash aggregate. Whereas sort-based aggregation can sometimes
be slightly faster with external sorts due to CPU caching effects, and
because an on-the-fly merge in tuplesort can output the first tuple
before the tuples are fully sorted.

> Would we
> achieve better overall performance by giving a few nodes a lot of memory
> (and not spill those), and other nodes very little, rather than having
> them all be the same size, and all spill?

If the nodes that we give more memory to use it for a hash table, then yes.

-- 
Peter Geoghegan




Re: pg_bsd_indent compiles bytecode

2020-06-29 Thread Tom Lane
Andres Freund  writes:
> The way that pg_bsd_indent defines its variables isn't standard C, as
> far as I can tell, which explains the errors I was getting. All the
> individual files include indent_globs.h, which declares/defines a bunch
> of variables. Since it doesn't use extern, they'll all end up as full
> definitions in each .o when -fno-common is used (the default now), hence
> the multiple definition errors. The only reason it works with -fcommon
> is that they'll end up processed as weak symbols and 'deduplicated' at
> link time.

Ugh.  I agree that's pretty bogus, even if there's anything in the
C standard that allows it.  I'll put it on my to-do list.

regards, tom lane




Re: [PATCH] Better cleanup in TLS tests for -13beta2

2020-06-29 Thread Daniel Gustafsson
> On 29 Jun 2020, at 17:52, Felix Lechner  wrote:

> This patch removes two temporary files that are otherwise left behind. In
> Debian, repeated builds fail because of them; we do not allow builds from
> modified sources.

Aha, nice catch!

> The first file was clearly an oversight. It was created separately. I
> am not sure why the loop over @keys did not remove the second.

That's because it's created in the 002_scram.pl testsuite as well, but not
cleaned up there.

The proposed patch admittedly seems a bit like a hack, and the client_tmo.key
handling is wrong as mentioned above.  I propose that we instead add the key to
the @keys array and have clean up handle all files uniformly.  The attached
fixes both of the files.

cheers ./daniel



tls_tests_cleanup.patch
Description: Binary data




Re: [PATCH] Better cleanup in TLS tests for -13beta2

2020-06-29 Thread Tom Lane
Daniel Gustafsson  writes:
> The proposed patch admittedly seems a bit like a hack, and the client_tmo.key
> handling is wrong as mentioned above.  I propose that we instead add the key 
> to
> the @keys array and have clean up handle all files uniformly.  The attached
> fixes both of the files.

Hmm ... so I guess my reaction to both of these is "what guarantees
that we get to the part of the script that does the unlinks?".
I've certainly seen lots of TAP tests fail to complete.  Could we
do the cleanup in an END block or the like?  (I'm a poor enough
Perl programmer to be uncertain what's the best way, but I know
Perl has constructs like that.)

regards, tom lane




Re: [PATCH] Better cleanup in TLS tests for -13beta2

2020-06-29 Thread Daniel Gustafsson
> On 29 Jun 2020, at 21:27, Tom Lane  wrote:

> Hmm ... so I guess my reaction to both of these is "what guarantees
> that we get to the part of the script that does the unlinks?".
> I've certainly seen lots of TAP tests fail to complete.  Could we
> do the cleanup in an END block or the like?  (I'm a poor enough
> Perl programmer to be uncertain what's the best way, but I know
> Perl has constructs like that.)

If execution calls die() during testing, then we won't reach the cleanup
portion at the end, but we would if we did it as part of END, which is (unless my
memory is too fogged) guaranteed to be the last code to run before the
interpreter exits.

That being said, we do retain temporary files on such failures on purpose in
our TestLib since 88802e068017bee8cea7a5502a712794e761c7b5 and a few follow-up
commits since, should these be handled differently?  They are admittedly less
"unknown" as compared to other files as they are copies, but famous last words
have been spoken about bugs that can never happen.

cheers ./daniel



Re: [PATCH] Better cleanup in TLS tests for -13beta2

2020-06-29 Thread Tom Lane
Daniel Gustafsson  writes:
>> On 29 Jun 2020, at 21:27, Tom Lane  wrote:
>> Hmm ... so I guess my reaction to both of these is "what guarantees
>> that we get to the part of the script that does the unlinks?".

> That being said, we do retain temporary files on such failures on purpose in
> our TestLib since 88802e068017bee8cea7a5502a712794e761c7b5 and a few follow-up
> commits since, should these be handled differently?  They are admittedly less
> "unknown" as compared to other files as they are copies, but famous last words
> have been spoken about bugs that can never happen.

Oh, good point.  Objection withdrawn.

regards, tom lane




Re: Parallell hashjoin sometimes ignores temp_tablespaces

2020-06-29 Thread Daniel Gustafsson
> On 29 Jun 2020, at 17:02, Magnus Hagander  wrote:

> I think the fix is the attached one (tested on version 11 which is what 
> $customer is using).  To me it looks like this may have been a copy/paste 
> error all the way back in 98e8b480532 which added default_tablespace back in 
> 2004. (And is in itself entirely unrelated to parallel hashjoin, but that's 
> where it got exposed at least in my case)

Running through the repro and patch on HEAD I confirm that the attached fixes
the issue. +1 for the patch and a backpatch of it.

It would be nice to have a test covering temp_tablespaces, but it seems a tad
cumbersome to create a stable one.

cheers ./daniel



Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Tomas Vondra

On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:

On Mon, Jun 29, 2020 at 8:07 AM Tomas Vondra
 wrote:

Not sure I follow. Which cases do you mean when you say that 12 could
safely do them, but 13 won't? I see the following two cases:



a) Planner in 12 and 13 disagree about whether the hash table will fit
into work_mem.

I don't quite see why this would be the case (given the same cardinality
estimates etc.), though. That is, if 12 says "will fit" I'd expect 13 to
end up with the same conclusion. But maybe 13 has higher per-tuple
overhead or something? I know we set aside some memory for BufFiles, but
not when we expect the whole hash table to fit into memory.


I have no reason to believe that the planner is any more or any less
likely to conclude that the hash table will fit in memory in v13 as
things stand (I don't know if the BufFile issue matters).

In general, grouping estimates probably aren't very good compared to
join estimates. I imagine that in either v12 or v13 the planner is
likely to incorrectly believe that it'll all fit in memory fairly
often. v12 was much too permissive about what could happen. But v13 is
too conservative.



Can you give an example of what you mean by being too permissive or too
conservative? Do you mean the possibility of unlimited memory usage in
v12, and strict enforcement in v13?

IMO enforcing the work_mem limit (in v13) is right in principle, but I
do understand the concerns about unexpected regressions compared to v12.


b) Planner believes the hash table will fit, due to underestimate.

On 12, we'd just let the hash table overflow, which may be a win when
there's enough RAM and the estimate is not "too wrong". But it may
easily end with a sad OOM.


It might end up with an OOM on v12 due to an underestimate -- but
probably not! The fact that a hash aggregate is faster than a group
aggregate ameliorates the higher memory usage. You might actually use
less memory this way.



I don't understand what you mean by "less memory" when the whole issue
is significantly exceeding work_mem?

I don't think the OOM is the only negative performance here - using a
lot of memory also forces eviction of data from page cache (although
writing a lot of temporary files may have similar effect).


On 13, we'll just start spilling. True - people tend to use conservative
work_mem values exactly because of cases like this (which is somewhat
futile as the underestimate may be arbitrarily wrong) and also because
they don't know how many work_mem instances the query will use.

So yeah, I understand why people may not want to increase work_mem too
much, and maybe hash_work would be a way to get the "no spill" behavior.


Andres wanted to increase the amount of memory that could be used at
execution time, without changing planning. You could say that hash_mem
is a more ambitious proposal than that. It's changing the behavior
across the board -- though in a way that makes sense anyway. It has
the additional benefit of making it more likely that an in-memory hash
aggregate will be used. That isn't a problem that we're obligated to
solve now, so this may seem odd. But if the more ambitious plan is
actually easier to implement and support, why not pursue it?

hash_mem seems a lot easier to explain and reason about than having
different work_mem budgets during planning and execution, which is
clearly a kludge. hash_mem makes sense generally, and more or less
solves the problems raised on this thread.



I agree with this, and I'm mostly OK with having hash_mem. In fact, from
the proposals in this thread I like it the most - as long as it's used
both during planning and execution. It's a pretty clear solution.

It's not a perfect solution in the sense that it does not reintroduce
the v12 behavior exactly (i.e. we'll still spill after reaching
hash_mem), but that may be good enough.
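
To make that concrete, here is a minimal sketch (purely illustrative, not from
any posted patch) of the kind of helper both the planner and nodeAgg.c could
call instead of reading work_mem directly; the GUC name hash_mem, its -1
"not set" default, and the helper name get_hash_mem() are all assumptions:

extern int	work_mem;			/* existing GUC, in kB */
static int	hash_mem = -1;		/* assumed new GUC, in kB; -1 = use work_mem */

static int
get_hash_mem(void)
{
	/* fall back to work_mem so existing configurations keep their behavior */
	return (hash_mem < 0) ? work_mem : hash_mem;
}

Having cost estimation and the executor's memory limit for hash-based nodes
consult the same value is what keeps planning and execution consistent.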


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Tomas Vondra

On Mon, Jun 29, 2020 at 01:31:40PM -0400, Bruce Momjian wrote:

On Mon, Jun 29, 2020 at 10:20:14AM -0700, Peter Geoghegan wrote:

I have no reason to believe that the planner is any more or any less
likely to conclude that the hash table will fit in memory in v13 as
things stand (I don't know if the BufFile issue matters).

In general, grouping estimates probably aren't very good compared to
join estimates. I imagine that in either v12 or v13 the planner is
likely to incorrectly believe that it'll all fit in memory fairly
often. v12 was much too permissive about what could happen. But v13 is
too conservative.


FYI, we have improved planner statistics estimates for years, which must
have affected node spill behavior on many node types (except hash_agg),
and I don't remember any complaints about it.



I think misestimates for GROUP BY are quite common and very hard to fix.
Firstly, our ndistinct estimator may give pretty bad results depending
e.g. on how the table is correlated.

I've been running some TPC-H benchmarks, and for partsupp.ps_partkey our
estimate was 4338776, when the actual value is 1500, i.e. ~3.5x
higher. This was with statistics target increased to 1000. I can easily
imagine even worse estimates with lower values.

This ndistinct estimator is used even for extended statistics, so that
can't quite save us. Moreover, the grouping may be on top of a join, in
which case using ndistinct coefficients may not be possible :-(

So I think this is a quite real problem ...


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default setting for enable_hashagg_disk

2020-06-29 Thread Peter Geoghegan
On Mon, Jun 29, 2020 at 2:22 PM Tomas Vondra
 wrote:
> Can you give an example of what you mean by being too permissive or too
> conservative? Do you mean the possibility of unlimited memory usage in
> v12, and strict enforcement in v13?

Yes -- that's all I meant.

> IMO enforcing the work_mem limit (in v13) is right in principle, but I
> do understand the concerns about unexpected regressions compared to v12.

Yeah. Both of these two things are true at the same time.

> I don't understand what you mean by "less memory" when the whole issue
> is significantly exceeding work_mem?

I was just reiterating what I said a few times already: Not using an
in-memory hash aggregate when the amount of memory required is high
but not prohibitively high is penny wise, pound foolish. It's easy to
imagine this actually using up more memory when an entire workload is
considered. This observation does not apply to a system that only ever
has one active connection at a time, but who cares about that?

> I don't think the OOM is the only negative performance here - using a
> lot of memory also forces eviction of data from page cache (although
> writing a lot of temporary files may have similar effect).

True.

> I agree with this, and I'm mostly OK with having hash_mem. In fact, from
> the proposals in this thread I like it the most - as long as it's used
> both during planning and execution. It's a pretty clear solution.

Great.

It's not trivial to write the patch, since there are a few tricky
cases. For example, maybe there is some subtlety in nodeAgg.c with
AGG_MIXED cases. Is there somebody else that knows that code better
than I do that wants to have a go at writing a patch?

-- 
Peter Geoghegan




Re: track_planning causing performance regression

2020-06-29 Thread Peter Geoghegan
On Mon, Jun 29, 2020 at 1:55 AM Fujii Masao  wrote:
> > I disagree with the conclusion though.  It seems to me that if you
> > really have this workload that consists in these few queries and want
> > to get better performance, you'll anyway use a connection pooler
> > and/or use prepared statements, which will make this overhead
> > disappear entirely, and will also yield an even bigger performance
> > improvement.  A quick test using pgbench -M prepared, with
> > track_planning enabled, with still way too many connections already
> > shows a 25% improvement over the -M simple without track_planning.
>
> I understand your point. But IMO the default setting should basically be
> the safer value, i.e., off, at least until the problem disappears.

+1 -- this regression seems unacceptable to me.

-- 
Peter Geoghegan




Re: track_planning causing performance regression

2020-06-29 Thread Peter Geoghegan
On Mon, Jun 29, 2020 at 3:23 PM Peter Geoghegan  wrote:
> +1 -- this regression seems unacceptable to me.

I added an open item to track this.

Thanks
-- 
Peter Geoghegan




Re: track_planning causing performance regression

2020-06-29 Thread Andres Freund
Hi,

On 2020-06-29 09:05:18 +0200, Julien Rouhaud wrote:
> I can't reproduce this on my laptop, but I can certainly believe that
> running the same 3 queries using more connections than available cores
> will lead to extra overhead.

> I disagree with the conclusion though.  It seems to me that if you
> really have this workload that consists in these few queries and want
> to get better performance, you'll anyway use a connection pooler
> and/or use prepared statements, which will make this overhead
> disappear entirely, and will also yield an even bigger performance
> improvement.

It's extremely common to have times where there are more active
queries than CPUs. And a pooler won't avoid that fully, at least not
without drastically reducing overall throughput.

Greetings,

Andres Freund




Re: track_planning causing performance regression

2020-06-29 Thread Andres Freund
Hi,

On 2020-06-29 17:55:28 +0900, Fujii Masao wrote:
> One idea to reduce that lock contention is to separate the per-query spinlock
> into two: one for planning, and the other for execution. pgss_store()
> determines which lock to use based on the given "kind" argument.
> To make this idea work, every pgss counter like shared_blks_hit also
> needs to be separated into two, i.e., for planning and execution.

I suspect that the best thing would be to just turn the spinlock into an
lwlock. Spinlocks deal terribly with contention. I suspect it'd solve
the performance issue entirely. And it might even be possible, further
down the line, to just use a shared lock, and use atomics for the
counters.
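
For illustration only, the per-entry update in pgss_store() follows the usual
spinlock pattern today, and the suggestion amounts to swapping that for an
LWLock; the field names (mutex, lock) and the idea of a per-entry LWLock in
its own tranche are assumptions in this sketch, not a worked-out patch:

/* today: counters protected by a per-entry spinlock */
volatile pgssEntry *e = (volatile pgssEntry *) entry;

SpinLockAcquire(&e->mutex);
e->counters.calls[kind] += 1;
e->counters.total_time[kind] += total_time;
SpinLockRelease(&e->mutex);

/* sketch: per-entry LWLock, initialized once with LWLockInitialize() */
LWLockAcquire(&entry->lock, LW_EXCLUSIVE);
entry->counters.calls[kind] += 1;
entry->counters.total_time[kind] += total_time;
LWLockRelease(&entry->lock);

Unlike a spinlock, a contended LWLock puts waiters to sleep rather than
spinning, and a later step could take it in shared mode and bump the counters
with atomics, as suggested above.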

Greetings,

Andres Freund




Re: pgstattuple: Have pgstattuple_approx accept TOAST tables

2020-06-29 Thread Peter Eisentraut

On 2020-06-17 13:39, Laurenz Albe wrote:

On Fri, 2020-04-17 at 13:01 +0200, Peter Eisentraut wrote:

I alluded to this in [0], but it's better discussed in its own thread.

I think the check that makes pgstattuple_approx reject TOAST tables is a
mistake.  They have a visibility map and a free space map, and it works just
fine if the check is removed.

Attached is a patch to fix this and add some tests related to how
pgstattuple and pg_visibility accept TOAST tables for inspection.


[0]:
https://www.postgresql.org/message-id/dc35a398-37d0-75ce-07ea-1dd71d98f...@2ndquadrant.com


I gave the patch a spin, and it passes regression tests and didn't
cause any problems when I played with it.

No upgrade or dump considerations, of course.

This is a clear improvement.

I'll mark the patch as "ready for committer".


committed, thanks

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Open Item: Should non-text EXPLAIN always show properties?

2020-06-29 Thread David Rowley
On Fri, 26 Jun 2020 at 04:33, Tom Lane  wrote:
>
> Robert Haas  writes:
> > On Thu, Jun 25, 2020 at 8:42 AM James Coleman  wrote:
> >> Yesterday I'd replied [1] to Justin's proposal for this WRT
> >> incremental sort and expressed my opinion that including both
> >> unnecessarily (i.e., including disk when an in-memory sort was used)
> >> is undesirable and confusing and leads to shortcuts I believe to be
> >> bad habits when using the data programmatically.
>
> > +1.
>
> I think the policy about non-text output formats is "all applicable
> fields should be included automatically".  But the key word there is
> "applicable".  Are disk-sort numbers applicable when no disk sort
> happened?
>
> I think the right way to think about this is that we are building
> an output data structure according to a schema that should be fixed
> for any particular plan shape.  If event X happened zero times in
> a given execution, but it could have happened in a different execution
> of the same plan, then we should print X with a zero count.  If X
> could not happen period in this plan, we should omit X's entry.
>
> So the real question here is whether the disk vs memory decision is
> plan time vs run time.  AFAIK it's run time, which leads me to think
> we ought to print the zeroes.

I think that's a pretty good way of thinking about it.

For the HashAgg case, the plan could end up spilling, so based on what
you've said, we should be printing those zeros as some other execution
of the same plan could spill.
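
Concretely, that would mean emitting the spill-related counters
unconditionally in non-text formats, along the lines of this sketch (the
property labels and variable names here are assumptions, not the committed
code):

if (es->format != EXPLAIN_FORMAT_TEXT)
{
	/* always print these, even when this execution never spilled (value 0) */
	ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
	ExplainPropertyInteger("Disk Usage", "kB", diskUsedKb, es);
	ExplainPropertyInteger("HashAgg Batches", NULL, numBatches, es);
}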

If nobody objects to that very soon, then I'll go ahead and push the
changes for HashAgg's non-text EXPLAIN ANALYZE.

David




Re: Hybrid Hash/Nested Loop joins and caching results from subplans

2020-06-29 Thread David Rowley
On Fri, 12 Jun 2020 at 16:10, Andy Fan  wrote:
> I first tried to see if we can have a row estimate before the subplan
> is created, and it looks very complex.  The subplan is created during
> preprocess_qual_conditions; at that time we haven't even created the base
> RelOptInfo, to say nothing of the join rel, where the row estimation happens
> much later.
>
> Then I looked at whether we can delay the cache decision until we have the
> row estimate; ExecInitSubPlan may be a candidate.  At that point we can't add
> a new ResultCache node, but we can add a cost-based cache function to the
> SubPlan node.  However, the num_of_distinct values for the parameterized
> variable can't be calculated, which I still leave as an open issue.

I don't really like the idea of stuffing this feature into some
existing node type.  Doing so would seem pretty magical when looking
at an EXPLAIN ANALYZE. There is of course overhead to pulling tuples
through an additional node in the plan, but if you use that as an
argument then there's some room to argue that we should only have 1
executor node type to get rid of that overhead.

Tom mentioned in [1] that he's reconsidering his original thoughts on
leaving the AlternativeSubPlan selection decision until execution
time.  If that were done late in planning, as Tom mentioned, then it
would be possible to give a more accurate cost to the Result Cache as
we'd have built the outer plan by that time and would be able to
estimate the number of distinct calls to the correlated subplan. As
that feature is today we'd be unable to delay making the decision
until execution time as we don't have the required details to know how
many distinct calls there will be to the Result Cache node.

For now, I'm planning on changing things around a little in the Result
Cache node to allow faster deletions from the cache.  As of now, we
must perform 2 hash lookups to perform a single delete.  This is
because we must perform the lookup to fetch the entry from the MRU
list key, then an additional lookup in the hash delete code.  I plan
on changing the hash delete code to expose another function that
allows us to delete an item directly if we've already looked it up.
This should make a small reduction in the overheads of the node.
Perhaps if the overhead is very small (say < 1%) when the cache is of
no use then it might not be such a bad thing to just have a Result
Cache for correlated subplans regardless of estimates. With the TPCH
Q20 test, it appeared as if the overhead was 0.27% for that particular
subplan. A simpler subplan would execute more quickly, resulting in
the Result Cache overhead being a more significant portion of the
overall subquery execution. I'd need to perform a worst-case overhead
test to get an indication of what the percentage is.
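
As a rough sketch of the API addition described above (the names below are
assumptions; simplehash.h currently only generates deletion by key, which
repeats the hash lookup):

/* existing generated function: performs its own lookup before deleting */
bool		SH_DELETE(SH_TYPE *tb, SH_KEY_TYPE key);

/* hypothetical addition: delete an entry that has already been looked up */
void		SH_DELETE_ITEM(SH_TYPE *tb, SH_ELEMENT_TYPE *entry);

An eviction path that has already fetched the entry via the MRU list could
then unlink it from the list and call the second form directly, avoiding the
second lookup.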

David

[1] https://www.postgresql.org/message-id/1992952.1592785...@sss.pgh.pa.us




Re: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread Kyotaro Horiguchi
Oops! I misunderstood that.

At Mon, 29 Jun 2020 13:00:25 +, "higuchi.dais...@fujitsu.com" 
 wrote in 
> Fujii-san, thank you for comments.
> 
> >The cause of this problem is that the checkpointer's sleep time is calculated
> >from both checkpoint_timeout and archive_timeout during normal running,
> >but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
> >patch tries to change that so that it's calculated from both of them even
> >during recovery. No?
> 
> Yes, it's exactly so.
> 
> >last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
> >large and cur_timeout can be negative. Isn't this problematic?
> 
> Yes... My patch was missing this.

The patch also makes WaitLatch called with zero timeout, which causes
assertion failure.

> How about using the original archive_timeout value for calculating 
> cur_timeout during recovery?
> 
> 		if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
> 		{
> 			elapsed_secs = now - last_xlog_switch_time;
> 			if (elapsed_secs >= XLogArchiveTimeout)
> 				continue;		/* no sleep for us ... */
> 			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
> 		}
> +		else if (XLogArchiveTimeout > 0)
> +			cur_timeout = Min(cur_timeout, XLogArchiveTimeout);
> 
> During recovery, accurate cur_timeout is not calculated because elapsed_secs 
> is not used.
> However, after recovery is complete, WAL archiving will start by the time
> the next archive_timeout is reached.
> I felt that is enough to solve this problem.

That causes an unwanted change of cur_timeout during recovery.

> >As another approach, what about waking the checkpointer up at the end of
> >recovery like we already do for walsenders?

We don't want to change the checkpoint interval during recovery, which means
we cannot consider archive_timeout at the first checkpoint after
recovery ends. So I think that the suggestion from Fujii-san is the
right direction.

> If the above solution is not good, I will consider this approach.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread Fujii Masao




On 2020/06/30 9:14, Kyotaro Horiguchi wrote:

Opps! I misunderstood that.

At Mon, 29 Jun 2020 13:00:25 +, "higuchi.dais...@fujitsu.com" 
 wrote in

Fujii-san, thank you for comments.


The cause of this problem is that the checkpointer's sleep time is calculated
from both checkpoint_timeout and archive_timeout during normal running,
but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
patch tries to change that so that it's calculated from both of them even
during recovery. No?


Yes, it's exactly so.


last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
large and cur_timeout can be negative. Isn't this problematic?


Yes... My patch was missing this.


The patch also makes WaitLatch called with zero timeout, which causes
assertion failure.


How about using the original archive_timeout value for calculating cur_timeout 
during recovery?

 		if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
 		{
 			elapsed_secs = now - last_xlog_switch_time;
 			if (elapsed_secs >= XLogArchiveTimeout)
 				continue;		/* no sleep for us ... */
 			cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
 		}
+		else if (XLogArchiveTimeout > 0)
+			cur_timeout = Min(cur_timeout, XLogArchiveTimeout);

During recovery, accurate cur_timeout is not calculated because elapsed_secs is 
not used.


Yes, that's an idea. But I'm a bit concerned that this change makes the
checkpointer wake up more frequently than necessary during recovery,
which may increase the idle power consumption of the checkpointer during
recovery. Of course, this would not be so problematic in practice because
we can expect that archive_timeout is not that small. But it seems better to
avoid unnecessary wake-ups if we can easily do that.


However, after recovery is complete, WAL archiving will start by the time
the next archive_timeout is reached.
I felt that is enough to solve this problem.


That causes an unwanted change of cur_timeout during recovery.


As another approach, what about waking the checkpointer up at the end of
recovery like we already do for walsenders?


We don't want to change the checkpoint interval during recovery, which means
we cannot consider archive_timeout at the first checkpoint after
recovery ends. So I think that the suggestion from Fujii-san is the
right direction.


+1
If this idea has some problems, we can revisit Daisuke-san's idea.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-29 Thread Masahiko Sawada
On Mon, 29 Jun 2020 at 20:37, Amit Kapila  wrote:
>
> On Mon, Jun 29, 2020 at 10:26 AM Masahiko Sawada
>  wrote:
> >
> > On Fri, 26 Jun 2020 at 17:53, Amit Kapila  wrote:
> > >
> > > On Fri, Jun 26, 2020 at 11:31 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Thu, 25 Jun 2020 at 19:35, Amit Kapila  
> > > > wrote:
> > > > >
> > > > >
> > > > > Today, I thought about it again, and if we consider the point that
> > > > > logical_decoding_work_mem value doesn’t change much then having the
> > > > > stats at slot-level would also allow computing
> > > > > logical_decoding_work_mem based on stats.  Do you think it is a
> > > > > reasonable assumption that users won't change
> > > > > logical_decoding_work_mem for different processes (WALSender, etc.)?
> > > >
> > > > FWIW, if we use logical_decoding_work_mem as a threshold of starting
> > > > of sending changes to a subscriber, I think there might be use cases
> > > > where the user wants to set different logical_decoding_work_mem values
> > > > to different wal senders. For example, setting a lower value to
> > > > minimize the latency of synchronous logical replication to a near-site
> > > > whereas setting a large value to minimize the amount of data sent to a
> > > > far site.
> > > >
> > >
> > > How does setting a large value minimize the amount of data sent?
> > > One possibility is if there are a lot of transaction aborts and
> > > transactions are not large enough that they cross
> > > logical_decoding_work_mem threshold but such cases shouldn't be many.
> >
> > Yeah, this is what I meant.
> >
> > I agree that it would not be a common case that the user sets
> > different values for different processes. Based on that assumption, I
> > also think having the stats at slot-level is a good idea.
> >
>
> Okay.
>
> > But I might
> > want to have the reset function.
> >
>
> I don't mind, but let's first see how the patch for the basic feature
> looks and what is required to implement it.  Are you interested in
> writing the patch for this work?

Yes, I'll write the draft patch.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: More efficient RI checks - take 2

2020-06-29 Thread Andres Freund
Hi,

I was looking at this patch with Corey during a patch-review session. So
these are basically our "combined" comments.


On 2020-06-05 17:16:43 +0200, Antonin Houska wrote:
> From 6c1cb8ae7fbf0a8122d8c6637c61b9915bc57223 Mon Sep 17 00:00:00 2001
> From: Antonin Houska 
> Date: Fri, 5 Jun 2020 16:42:34 +0200
> Subject: [PATCH 1/5] Check for RI violation outside ri_PerformCheck().

Probably good to add a short comment to the commit explaining why you're
doing this.

The change makes sense to me. Unless somebody protests I think we should
just apply it regardless of the rest of the series - the code seems
clearer afterwards.


> From 6b09e5598553c8e57b4ef9342912f51adb48f8af Mon Sep 17 00:00:00 2001
> From: Antonin Houska 
> Date: Fri, 5 Jun 2020 16:42:34 +0200
> Subject: [PATCH 2/5] Changed ri_GenerateQual() so it generates the whole
>  qualifier.
> 
> This way we can use the function to reduce the amount of copy&pasted code a
> bit.


>  /*
> - * ri_GenerateQual --- generate a WHERE clause equating two variables
> + * ri_GenerateQual --- generate WHERE/ON clause.
> + *
> + * Note: to avoid unnecessary explicit casts, make sure that the left and
> + * right operands match eq_oprs expect (ie don't swap the left and right
> + * operands accidentally).
> + */
> +static void
> +ri_GenerateQual(StringInfo buf, char *sep, int nkeys,
> +				const char *ltabname, Relation lrel,
> +				const int16 *lattnums,
> +				const char *rtabname, Relation rrel,
> +				const int16 *rattnums,
> +				const Oid *eq_oprs,
> +				GenQualParams params,
> +				Oid *paramtypes)
> +{
> +	for (int i = 0; i < nkeys; i++)
> +	{
> +		Oid			ltype = RIAttType(lrel, lattnums[i]);
> +		Oid			rtype = RIAttType(rrel, rattnums[i]);
> +		Oid			lcoll = RIAttCollation(lrel, lattnums[i]);
> +		Oid			rcoll = RIAttCollation(rrel, rattnums[i]);
> +		char		paramname[16];
> +		char	   *latt,
> +				   *ratt;
> +		char	   *sep_current = i > 0 ? sep : NULL;
> +
> +		if (params != GQ_PARAMS_NONE)
> +			sprintf(paramname, "$%d", i + 1);
> +
> +		if (params == GQ_PARAMS_LEFT)
> +		{
> +			latt = paramname;
> +			paramtypes[i] = ltype;
> +		}
> +		else
> +			latt = ri_ColNameQuoted(ltabname, RIAttName(lrel, lattnums[i]));
> +
> +		if (params == GQ_PARAMS_RIGHT)
> +		{
> +			ratt = paramname;
> +			paramtypes[i] = rtype;
> +		}
> +		else
> +			ratt = ri_ColNameQuoted(rtabname, RIAttName(rrel, rattnums[i]));


Why do we need support for having params on left or right side, instead
of just having them on one side?


> +		ri_GenerateQualComponent(buf, sep_current, latt, ltype, eq_oprs[i],
> +								 ratt, rtype);
> +
> +		if (lcoll != rcoll)
> +			ri_GenerateQualCollation(buf, lcoll);
> +	}
> +}
> +
> +/*
> + * ri_GenerateQual --- generate a component of WHERE/ON clause equating two
> + * variables, to be AND-ed to the other components.
>   *
>   * This basically appends " sep leftop op rightop" to buf, adding casts
>   * and schema qualification as needed to ensure that the parser will select
> @@ -1828,17 +1802,86 @@ quoteRelationName(char *buffer, Relation rel)
>   * if they aren't variables or parameters.
>   */
>  static void
> -ri_GenerateQual(StringInfo buf,
> -				const char *sep,
> -				const char *leftop, Oid leftoptype,
> -				Oid opoid,
> -				const char *rightop, Oid rightoptype)
> +ri_GenerateQualComponent(StringInfo buf,
> +						 const char *sep,
> +						 const char *leftop, Oid leftoptype,
> +						 Oid opoid,
> +						 const char *rightop, Oid rightoptype)
>  {
> -	appendStringInfo(buf, " %s ", sep);
> +	if (sep)
> +		appendStringInfo(buf, " %s ", sep);
>  	generate_operator_clause(buf, leftop, leftoptype, opoid,
>  							 rightop, rightoptype);
>  }

Why is this handled inside ri_GenerateQualComponent() instead of
ri_GenerateQual()? Especially because the latter now has code to pass in
a different sep into ri_GenerateQualComponent().


> +/*
> + * ri_ColNameQuoted() --- return column name, with both table and column name

Re: A patch for get origin from commit_ts.

2020-06-29 Thread Madan Kumar
Hello hackers,

We already have pg_xact_commit_timestamp() that returns the timestamp of
the commit. It may be better to have one single function returning both
timestamp and origin for a given transaction ID.

A second thing is that TransactionIdGetCommitTsData() was introduced in
core (73c986add). It has only one caller, pg_xact_commit_timestamp(), which
passes RepOriginId as NULL, making the last argument to
TransactionIdGetCommitTsData() dead code in core.

Quick code search shows that it is getting used by pglogical (caller:
https://sources.debian.org/src/pglogical/2.3.2-1/pglogical_conflict.c/?hl=509#L509).
CCing Craig Ringer and Petr Jelinek for the inputs.

Warm Regards,
Madan Kumar K
"There is no Elevator to Success. You have to take the Stairs"




Re: pg_bsd_indent compiles bytecode

2020-06-29 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> The way that pg_bsd_indent defines its variables isn't standard C, as
>> far as I can tell, which explains the errors I was getting. All the
>> individual files include indent_globs.h, which declares/defines a bunch
>> of variables. Since it doesn't use extern, they'll all end up as full
>> definitions in each .o when -fno-common is used (the default now), hence
>> the multiple definition errors. The only reason it works with -fcommon
>> is that they'll end up processed as weak symbols and 'deduplicated' at
>> link time.

> Ugh.  I agree that's pretty bogus, even if there's anything in the
> C standard that allows it.  I'll put it on my to-do list.

I pushed the attached patch to the pg_bsd_indent repo.  Perhaps Piotr
would like to absorb it into upstream.

I don't intend to mark pg_bsd_indent with a new release number for
this --- for people who successfully compiled, it's the same as before.

regards, tom lane

commit acb2f0a7f3689805b954ea19a927b5021fc69409
Author: Tom Lane 
Date:   Mon Jun 29 21:19:16 2020 -0400

Avoid duplicate declarations of bsdindent's global variables.

Arrange for all the variable declarations in indent_globs.h to look
like "extern" declarations to every .c file except indent.c.
This prevents linker failure due to duplicated global variables when
the code is built with -fno-common, which is soon to be gcc's default.

The method of temporarily #define'ing "extern" to empty is a hack,
no doubt, but it avoids requiring a duplicate set of variable
definitions, so it seemed like the best way.

Discussion: https://postgr.es/m/20200629165051.xlfqhstajf6yn...@alap3.anarazel.de

diff --git a/indent.c b/indent.c
index 62d4d01..d3a0ece 100644
--- a/indent.c
+++ b/indent.c
@@ -49,6 +49,10 @@ static char sccsid[] = "@(#)indent.c	5.17 (Berkeley) 6/7/93";
 #include 
 #include 
 #include 
+
+/* Tell indent_globs.h to define our global variables here */
+#define DECLARE_INDENT_GLOBALS 1
+
 #include "indent_globs.h"
 #include "indent_codes.h"
 #include "indent.h"
diff --git a/indent_globs.h b/indent_globs.h
index d018af1..398784b 100644
--- a/indent_globs.h
+++ b/indent_globs.h
@@ -50,9 +50,16 @@
 #define true  1
 #endif
 
+/*
+ * Exactly one calling file should define this symbol.  The global variables
+ * will be defined in that file, and just referenced elsewhere.
+ */
+#ifdef DECLARE_INDENT_GLOBALS
+#define extern
+#endif
 
-FILE   *input;		/* the fid for the input file */
-FILE   *output;		/* the output file */
+extern FILE *input;		/* the fid for the input file */
+extern FILE *output;		/* the output file */
 
 #define CHECK_SIZE_CODE(desired_size) \
 	if (e_code + (desired_size) >= l_code) { \
@@ -106,94 +113,94 @@ FILE   *output;		/* the output file */
 	s_token = tokenbuf + 1; \
 	}
 
-char   *labbuf;		/* buffer for label */
-char   *s_lab;		/* start ... */
-char   *e_lab;		/* .. and end of stored label */
-char   *l_lab;		/* limit of label buffer */
+extern char *labbuf;		/* buffer for label */
+extern char *s_lab;		/* start ... */
+extern char *e_lab;		/* .. and end of stored label */
+extern char *l_lab;		/* limit of label buffer */
 
-char   *codebuf;		/* buffer for code section */
-char   *s_code;		/* start ... */
-char   *e_code;		/* .. and end of stored code */
-char   *l_code;		/* limit of code section */
+extern char *codebuf;		/* buffer for code section */
+extern char *s_code;		/* start ... */
+extern char *e_code;		/* .. and end of stored code */
+extern char *l_code;		/* limit of code section */
 
-char   *combuf;		/* buffer for comments */
-char   *s_com;		/* start ... */
-char   *e_com;		/* ... and end of stored comments */
-char   *l_com;		/* limit of comment buffer */
+extern char *combuf;		/* buffer for comments */
+extern char *s_com;		/* start ... */
+extern char *e_com;		/* ... and end of stored comments */
+extern char *l_com;		/* limit of comment buffer */
 
 #define token s_token
-char   *tokenbuf;		/* the last token scanned */
-char	   *s_token;
-char   *e_token;
-char	   *l_token;
+extern char *tokenbuf;		/* the last token scanned */
+extern char *s_token;
+extern char *e_token;
+extern char *l_token;
 
-char   *in_buffer;		/* input buffer */
-char	   *in_buffer_limit;	/* the end of the input buffer */
-char   *buf_ptr;		/* ptr to next character to be taken from
+extern char *in_buffer;		/* input buffer */
+extern char *in_buffer_limit;	/* the end of the input buffer */
+extern char *buf_ptr;		/* ptr to next character to be taken from
  * in_buffer */
-char   *buf_end;		/* ptr to first after last char in in_buffer */
+extern char *buf_end;		/* ptr to first after last char in in_buffer */
 
-char    sc_buf[sc_size];	/* input text is saved here when looking for
+extern char  sc_buf[sc_size];	/* input text is saved here when looking for
  * the brace after an if, w

Re: pg_bsd_indent compiles bytecode

2020-06-29 Thread Tom Lane
Andres Freund  writes:
> It's easy enough to fix by just adding a ifndef PROGRAM around the piece
> adding the depency to the .bc files:

> ifeq ($(with_llvm), yes)
> ifndef PROGRAM
> all: $(addsuffix .bc, $(MODULES)) $(patsubst %.o,%.bc, $(OBJS))
> endif # PROGRAM
> endif # with_llvm

> but it's not particularly pretty. But given the way pgxs.mk is
> structured, I'm not sure there's really a great answer?

Yeah.  The only other plausible alternative I see is like this:

ifeq ($(with_llvm), yes)
ifdef MODULES
all: $(addsuffix .bc, $(MODULES))
endif # MODULES
ifdef MODULE_big
all: $(patsubst %.o,%.bc, $(OBJS))
endif # MODULE_big
endif # with_llvm

which might be a little nicer because it squares better with,
eg, the install/uninstall rules.  But it's not much better.

regards, tom lane




Re: POC: postgres_fdw insert batching

2020-06-29 Thread Etsuro Fujita
On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat
 wrote:
> On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
>  wrote:

> > 3) What about the other DML operations (DELETE/UPDATE)?
> >
> > The other DML operations could probably benefit from the batching too.
> > INSERT was good enough for a PoC, but having batching only for INSERT
> > seems somewhat asmymetric. DELETE/UPDATE seem more complicated because
> > of quals, but likely doable.
>
> Bulk INSERTs are more common in a sharded environment because of data
> load in say OLAP systems. Bulk update/delete are rare, although not
> that rare. So if an approach just supports bulk insert and not bulk
> UPDATE/DELETE that will address a large number of usecases IMO. But if
> we can make everything work together that would be good as well.

In most cases, I think the entire UPDATE/DELETE operations would be
pushed down to the remote side by DirectModify.  So, I'm not sure we
really need the bulk UPDATE/DELETE.

> > 3) Should we do batching for COPY instead?
> >
> > While looking at multi_insert, I've realized it's mostly exactly what
> > the new "batching insert" API function would need to be. But it's only
> > really used in COPY, so I wonder if we should just abandon the idea of
> > batching INSERTs and do batching COPY for FDW tables.

> I think we have to find out which performs
> better, COPY or batch INSERT.

Maybe I'm missing something, but I think the COPY patch [1] seems more
promising to me, because 1) it would not get the remote side's planner
and executor involved, and 2) the data would be loaded more
efficiently by multi-insert on the remote side.  (Yeah, COPY doesn't
support RETURNING, but it's rare that RETURNING is needed in a bulk
load, as you mentioned.)

> [1] 
> https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru

Best regards,
Etsuro Fujita




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-06-29 Thread Masahiko Sawada
On Mon, 22 Jun 2020 at 14:56, Bharath Rupireddy
 wrote:
>
> Hi,
>
> When a query on foreign table is executed from a local session using
> postgres_fdw, as expected the local postgres backend opens a
> connection which causes a remote session/backend to be opened on the
> remote postgres server for query execution.
>
> One observation is that, even after the query is finished, the remote
> session/backend still persists on the remote postgres server. Upon
> researching, I found that there is a concept of Connection Caching for
> the remote connections made using postgres_fdw. Local backend/session
> can cache up to 8 different connections per backend. This caching is
> useful as it avoids the cost of reestablishing new connections per
> foreign query.
>
> However, at times, there may be situations where the long lasting
> local sessions may execute very few foreign queries and remaining all
> are local queries, in this scenario, the remote sessions opened by the
> local sessions/backends may not be useful as they remain idle and eat
> up the remote server connections capacity. This problem gets even
> worse(though this use case is a bit imaginary) if all of
> max_connections(default 100 and each backend caching 8 remote
> connections) local sessions open remote sessions and they are cached
> in the local backend.
>
> I propose to have a new session level GUC called
> "enable_connectioncache"(name can be changed if it doesn't correctly
> mean the purpose) with the default value being true which means that
> all the remote connections are cached. If set to false, the
> connections are not cached and so are remote sessions closed by the local 
> backend/session at
> the end of each remote transaction.

I've not looked at your patch deeply, but if this problem is only about
postgres_fdw, I think we should improve postgres_fdw rather than add a
GUC to the core. Not all FDW plugins use a connection cache, and
postgres_fdw's connection cache is implemented within postgres_fdw
itself, so I think we should focus on improving postgres_fdw. I
also think it's not a good design for the core to manage connections to
remote servers connected via FDW. I wonder if we can add a
postgres_fdw option for this purpose, say keep_connection [on|off].
That way, we can set it per server so that remote connections to a
particular server don't remain idle.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-29 Thread Amit Kapila
On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar  wrote:
>
> On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila  wrote:
> >
> >
> > Other than above tests, can we somehow verify that the invalidations
> > generated at commit time are the same as what we do with this patch?
> > We have verified with individual commands but it would be great if we
> > can verify for the regression tests.
>
> I have verified this using a few random test cases.  For verifying
> this I have made some temporary code changes with an assert as shown
> below.  Basically, on DecodeCommit we call
> ReorderBufferAddInvalidations function only for an assert checking.
>
> -void
>  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
>                                XLogRecPtr lsn, Size nmsgs,
> -                              SharedInvalidationMessage *msgs)
> +                              SharedInvalidationMessage *msgs, bool commit)
>  {
>      ReorderBufferTXN *txn;
>
>      txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> -
> +    if (commit)
> +    {
> +        Assert(txn->ninvalidations == nmsgs);
> +        return;
> +    }
>
> The result is that for a normal local test it works fine.  But with the
> regression suite, it hit an assert in many places because if the
> rollback of the subtransaction is involved then at commit time
> those invalidation messages are not logged, whereas with command-time
> invalidation they are logged.
>

Yeah, somehow, we need to ignore rollback to savepoint tests and
verify for others.

> As of now, I have only put an assert on the count.  If we need to verify
> the exact messages then we might need to somehow categorize the
> invalidation messages, because the ordering of the messages will not be
> the same.  For testing this we will have to arrange them by category,
> i.e. relcache, catcache, and then we can compare them.
>

Can't we do this by verifying that each message at commit time exists
in the list of invalidation messages we have collected via processing
XLOG_XACT_INVALIDATIONS?
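
For instance, something along these lines in the assert-only check (a sketch
under the assumption that the commit-time messages arrive in msgs[] and the
accumulated ones are in txn->invalidations; not meant as committable code):

#ifdef USE_ASSERT_CHECKING
	/* every commit-time message must already have been accumulated */
	for (int i = 0; i < nmsgs; i++)
	{
		bool		found = false;

		for (int j = 0; j < txn->ninvalidations; j++)
		{
			/* bytewise comparison; padding in the union would need care */
			if (memcmp(&msgs[i], &txn->invalidations[j],
					   sizeof(SharedInvalidationMessage)) == 0)
			{
				found = true;
				break;
			}
		}
		Assert(found);
	}
#endif

That sidesteps the ordering problem, since it doesn't require the two lists to
be arranged the same way.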

One additional question on patch
v30-0003-Extend-the-output-plugin-API-with-stream-methods:
+static void
+stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
+ XLogRecPtr apply_lsn)
{
..
..
+ state.report_location = apply_lsn;
..
..
+ ctx->write_location = apply_lsn;
..
}

Can't we name the last parameter as 'commit_lsn' as that is how
documentation in the patch spells it and it sounds more appropriate?
Also, is there a reason for assigning report_location and
write_location differently than what we do in commit_cb_wrapper?
Basically, assign those as txn->final_lsn and txn->end_lsn
respectively.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-29 Thread Amit Kapila
On Tue, Jun 30, 2020 at 6:38 AM Masahiko Sawada
 wrote:
>
> On Mon, 29 Jun 2020 at 20:37, Amit Kapila  wrote:
> >
> > On Mon, Jun 29, 2020 at 10:26 AM Masahiko Sawada
> >  wrote:
> > >
> > > I agree that it would not be a common case that the user sets
> > > different values for different processes. Based on that assumption, I
> > > also think having the stats at slot-level is a good idea.
> > >
> >
> > Okay.
> >
> > > But I might
> > > want to have the reset function.
> > >
> >
> > I don't mind, but let's first see how the patch for the basic feature
> > looks and what is required to implement it.  Are you interested in
> > writing the patch for this work?
>
> Yes, I'll write the draft patch.
>

Great, thanks.  One thing we can consider is that instead of storing
the stats directly in the slot we can consider sending it to stats
collector as suggested by Tomas.  Basically that can avoid contention
around slots (See discussion in email [1]).  I have not evaluated any
of the approaches in detail so you can let us know the advantage of
one over another.  Now, you might be already considering this but I
thought it is better to share what I have in mind rather than saying
that later once you have the draft patch ready.

[1] - 
https://www.postgresql.org/message-id/20200623101831.it6lzwbm37xwquco%40development

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




RE: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

2020-06-29 Thread higuchi.dais...@fujitsu.com
>> We don't want to change the checkpoint interval during recovery, which means
>> we cannot consider archive_timeout at the first checkpoint after
>> recovery ends. So I think that the suggestion from Fujii-san is the
>> right direction.
>+1
>If this idea has some problems, we can revisit Daisuke-san's idea.

Thanks for your comments. 
Ok, I will work on the fix to wake the checkpointer up at the end of recovery.

Regards,
Daisuke, Higuchi


Re: Backpatch b61d161c14 (Introduce vacuum errcontext ...)

2020-06-29 Thread Amit Kapila
On Fri, Jun 26, 2020 at 9:19 AM Amit Kapila  wrote:
>
> On Fri, Jun 26, 2020 at 7:25 AM Justin Pryzby  wrote:
> >
> >
> > Comments:
> >
> > > * The index name is saved only during this phase and restored 
> > > immediately
> >
> > => I wouldn't say "only" since it's saved during lazy_vacuum: index AND 
> > cleanup.
> >
> > >update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *oldpos, int 
> > >phase,
> >
> > => You called your struct "LVSavedErrInfo" but the variables are still 
> > called
> > "pos".  I would call it olderrinfo or just old.
> >
>
> Fixed both of the above comments. I used the variable name as saved_err_info.
>
> > Also, this doesn't (re)rename the "cbarg" stuff that Alvaro didn't like, 
> > which
> > was my 0001 patch.
> >
>
> If I am not missing anything then that change was in
> lazy_cleanup_index and after this patch, it won't be required because
> we are using a different variable name.
>
> I have combined both the patches now.
>

I am planning to push this tomorrow if there are no further
suggestions/comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pgsql: Enable Unix-domain sockets support on Windows

2020-06-29 Thread Amit Kapila
On Mon, Jun 29, 2020 at 8:48 PM Andrew Dunstan
 wrote:
>
>
> On 6/28/20 4:33 AM, Peter Eisentraut wrote:
> > On 2020-06-27 13:57, Amit Kapila wrote:
> >> Fair enough, but what should be the behavior in the Windows versions
> >> (<10) where Unix-domain sockets are not supported?
> >
> > You get an error about an unsupported address family, similar to
> > trying to use IPv6 on a system that doesn't support it.
> >
> >> BTW, in which
> >> format the path needs to be specified for unix_socket_directories?  I
> >> tried with '/c/tmp', 'c:/tmp', 'tmp' but nothing seems to be working,
> >> it gives me errors like: "could not create lock file
> >> "/c/tmp/.s.PGSQL.5432.lock": No such file or directory" on server
> >> start.  I am trying this on Win7 just to check what is the behavior of
> >> this feature on it.
> >
> > Hmm, the only thing I remember about this now is that you need to use
> > native Windows paths, meaning you can't just use /tmp under MSYS, but
> > it needs to be something like C:\something.  But the error you have
> > there is not even about the socket file but about the lock file, which
> > is a normal file, so if that goes wrong, it might be an unrelated
> > problem.
> >
>
>
> It needs to be a path from the Windows POV, not an Msys virtualized
> path. So c:/tmp or just /tmp should work, but /c/tmp or similar probably
> will not. The directory needs to exist. I just checked that this is
> working, both in postgresql.conf and on the psql command line.
>

Okay, thanks for the verification.  I was trying to see its behavior
on Win7 or a similar environment where this feature is not supported
to see if we get the appropriate error message.  If by any chance, you
have access to such an environment, then it might be worth trying on
such an environment once.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Use of "long" in incremental sort code

2020-06-29 Thread David Rowley
Hi,

I noticed the incremental sort code makes use of the long datatype a
few times, e.g in TuplesortInstrumentation and
IncrementalSortGroupInfo.  (64-bit windows machines have sizeof(long)
== 4).  I understand that the values are in kilobytes and it would
take 2TB to cause them to wrap. Nevertheless, I think it would be
better to choose a better-suited type. work_mem is still limited to
2GB on 64-bit Windows machines, so perhaps there's some argument that
it does not matter about fields that related to in-memory stuff, but
the on-disk fields are wrong.  The in-memory fields likely raise the
bar further for fixing the 2GB work_mem limit on Windows.

Maybe Size would be better for the in-memory fields and uint64 for the
on-disk fields?
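
For illustration, the kind of change being suggested for the disk-tracking
counters would look roughly like this (field layout paraphrased from
tuplesort.h, not an actual patch):

 typedef struct TuplesortInstrumentation
 {
 	TuplesortMethod sortMethod;		/* sort algorithm used */
 	TuplesortSpaceType spaceType;	/* type of space spaceUsed represents */
-	long		spaceUsed;			/* space consumption, in kB */
+	int64		spaceUsed;			/* space consumption, in kB */
 } TuplesortInstrumentation;

int64 (or Size for the purely in-memory counters) keeps the full range on
64-bit Windows, where long is only 32 bits wide.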

David




Re: [PATCH] Better cleanup in TLS tests for -13beta2

2020-06-29 Thread Michael Paquier
On Mon, Jun 29, 2020 at 03:51:48PM -0400, Tom Lane wrote:
> Daniel Gustafsson  writes:
>> That being said, we do retain temporary files on such failures on purpose in
>> our TestLib since 88802e068017bee8cea7a5502a712794e761c7b5 and a few follow-up
>> commits since, should these be handled differently?  They are admittedly less
>> "unknown" as compared to other files as they are copies, but famous last words
>> have been spoken about bugs that can never happen.
> 
> Oh, good point.  Objection withdrawn.

I looked at the patch, and can confirm that client_wrongperms_tmp.key
remains around after running 001_ssltests.pl, and client_tmp.key after
running 002_scram.pl.  The way the patch does its cleanup looks fine
to me, so I'll apply and backpatch where necessary, if there are no
objections of course.
--
Michael


signature.asc
Description: PGP signature


Re: Use of "long" in incremental sort code

2020-06-29 Thread Tom Lane
David Rowley  writes:
> I noticed the incremental sort code makes use of the long datatype a
> few times, e.g in TuplesortInstrumentation and
> IncrementalSortGroupInfo.  (64-bit windows machines have sizeof(long)
> == 4).  I understand that the values are in kilobytes and it would
> take 2TB to cause them to wrap. Nevertheless, I think it would be
> better to choose a better-suited type. work_mem is still limited to
> 2GB on 64-bit Windows machines, so perhaps there's some argument that
> it does not matter about fields that related to in-memory stuff, but
> the on-disk fields are wrong.  The in-memory fields likely raise the
> bar further for fixing the 2GB work_mem limit on Windows.

> Maybe Size would be better for the in-memory fields and uint64 for the
> on-disk fields?

There is a fairly widespread issue that memory-size-related GUCs and
suchlike variables are limited to represent sizes that fit in a "long".
Although Win64 is the *only* platform where that's an issue, maybe
it's worth doing something about.  But we shouldn't just fix the sort
code, if we do do something.

(IOW, I don't agree with doing a fix that doesn't also fix work_mem.)

regards, tom lane




Re: POC: postgres_fdw insert batching

2020-06-29 Thread Ashutosh Bapat
On Tue, 30 Jun 2020 at 08:47, Etsuro Fujita  wrote:

> On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat
>  wrote:
> > On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
> >  wrote:
>
> > > 3) What about the other DML operations (DELETE/UPDATE)?
> > >
> > > The other DML operations could probably benefit from the batching too.
> > > INSERT was good enough for a PoC, but having batching only for INSERT
> > > seems somewhat asymmetric. DELETE/UPDATE seem more complicated because
> > > of quals, but likely doable.
> >
> > Bulk INSERTs are more common in a sharded environment because of data
> > load in say OLAP systems. Bulk update/delete are rare, although not
> > that rare. So if an approach just supports bulk insert and not bulk
> > UPDATE/DELETE that will address a large number of use cases IMO. But if
> > we can make everything work together that would be good as well.
>
> In most cases, I think the entire UPDATE/DELETE operations would be
> pushed down to the remote side by DirectModify.  So, I'm not sure we
> really need the bulk UPDATE/DELETE.
>

That may not be true for a partitioned table whose partitions are foreign
tables. Esp. given the work that Amit Langote is doing [1]. It really
depends on the ability of postgres_fdw to detect that the DML modifying
each of the partitions can be pushed down. That may not come easily.


>
> > > 3) Should we do batching for COPY instead?
> > >
> > > While looking at multi_insert, I've realized it's mostly exactly what
> > > the new "batching insert" API function would need to be. But it's only
> > > really used in COPY, so I wonder if we should just abandon the idea of
> > > batching INSERTs and do batching COPY for FDW tables.
>
> > I think we have to find out which performs
> > better, COPY or batch INSERT.
>
> Maybe I'm missing something, but I think the COPY patch [1] seems more
> promising to me, because 1) it would not get the remote side's planner
> and executor involved, and 2) the data would be loaded more
> efficiently by multi-insert on the remote side.  (Yeah, COPY doesn't
> support RETURNING, but it's rare that RETURNING is needed in a bulk
> load, as you mentioned.)
>
> > [1]
> https://www.postgresql.org/message-id/flat/3d0909dc-3691-a576-208a-90986e55489f%40postgrespro.ru
>
> Best regards,
> Etsuro Fujita
>

[1]
https://www.postgresql.org/message-id/ca+hiwqhphdqddn48ycehynnniahh78rwcrv1rex65-fszgb...@mail.gmail.com
-- 
Best Wishes,
Ashutosh


Re: Use of "long" in incremental sort code

2020-06-29 Thread David Rowley
On Tue, 30 Jun 2020 at 16:20, Tom Lane  wrote:
> There is a fairly widespread issue that memory-size-related GUCs and
> suchlike variables are limited to represent sizes that fit in a "long".
> Although Win64 is the *only* platform where that's an issue, maybe
> it's worth doing something about.  But we shouldn't just fix the sort
> code, if we do do something.
>
> (IOW, I don't agree with doing a fix that doesn't also fix work_mem.)

I raised it mostly because this new-to-PG13 code is making the problem worse.

If we're not going to change the in-memory fields, then shouldn't we
at least change the ones for disk space tracking?

David




Re: truncating timestamps on arbitrary intervals

2020-06-29 Thread John Naylor
In v9, I've simplified the patch somewhat to make it easier for future
work to build on.

- When truncating on month-or-greater intervals, require the origin to
align on month. This removes the need to handle weird corner cases
that have no straightforward behavior.
- Remove hackish and possibly broken code to allow origin to be after
the input timestamp. The default origin is Jan 1, 1 AD, so only AD
dates will behave correctly by default. This is not enforced for now,
since it may be desirable to find a way to get this to work in a nicer
way.
- Rebase docs over PG13 formatting changes.
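
To make the first point above concrete, here is a rough sketch of the flooring
arithmetic for month-or-greater strides under the constraints just described
(origin aligned on a month boundary, input not earlier than the origin); this
is not the patch's code and the names are made up:

/*
 * Floor (year, month) to the nearest stride boundary at or before it,
 * counting whole months from a month-aligned origin.  Assumes the input
 * is not earlier than the origin (only AD dates, per the note above
 * about the default origin).
 */
static void
truncate_to_month_stride(int year, int month,	/* input; month is 1..12 */
						 int origin_year, int origin_month,
						 int stride_months,
						 int *out_year, int *out_month)
{
	int			months_since_origin;
	int			floored;

	months_since_origin = (year - origin_year) * 12 + (month - origin_month);
	floored = months_since_origin - (months_since_origin % stride_months);

	*out_year = origin_year + (origin_month - 1 + floored) / 12;
	*out_month = (origin_month - 1 + floored) % 12 + 1;
}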



--
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v9-datetrunc-interval.patch
Description: Binary data


Re: A patch for get origin from commit_ts.

2020-06-29 Thread Michael Paquier
On Mon, Jun 29, 2020 at 06:17:27PM -0700, Madan Kumar wrote:
> We already have pg_xact_commit_timestamp() that returns the timestamp of
> the commit. It may be better to have one single function returning both
> timestamp and origin for a given transaction ID.
> 
> A second thing is that TransactionIdGetCommitTsData() was introduced in
> core (73c986add). It has only one caller, pg_xact_commit_timestamp(), which
> passes RepOriginId as NULL, making the last argument to
> TransactionIdGetCommitTsData() dead code in core.
> 
> Quick code search shows that it is getting used by pglogical (caller:
> https://sources.debian.org/src/pglogical/2.3.2-1/pglogical_conflict.c/?hl=509#L509).
> CCing Craig Ringer and Petr Jelinek for the inputs.

Another question that popped up while doing this review is what the use case
would be for exposing this information at the SQL level, knowing that logical
replication has existed since 10.  Having dead code in the backend tree is of
course not a good idea, so that is also an argument for simplifying
TransactionIdGetCommitTsData().  Now, pglogical has its own function,
pglogical_xact_commit_timestamp_origin(), to get the replication origin, so
providing an equivalent in core that returns one row with two fields would be
nice for pglogical, as that function would then no longer be necessary.  As
mentioned by Madan, the portion of pglogical code that uses
TransactionIdGetCommitTsData() relies on it to resolve update conflicts
(first-win/last-win logic, at a quick glance).
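
To sketch what such an extended SQL function might look like: the code below is
only an illustration, built on the existing TransactionIdGetCommitTsData();
the function name, the two-column composite return type, and the choice to
expose RepOriginId as an OID-like value are assumptions of mine, not a
committed design.

/*
 * Sketch of a SQL-callable function returning both the commit timestamp
 * and the replication origin for a transaction ID.  Assumes a pg_proc
 * entry declaring a two-column composite result.
 */
#include "postgres.h"

#include "access/commit_ts.h"
#include "access/htup_details.h"
#include "fmgr.h"
#include "funcapi.h"
#include "replication/origin.h"
#include "utils/timestamp.h"

Datum
pg_xact_commit_timestamp_origin(PG_FUNCTION_ARGS)
{
	TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
	TimestampTz ts;
	RepOriginId nodeid;
	TupleDesc	tupdesc;
	Datum		values[2] = {0, 0};
	bool		nulls[2] = {false, false};

	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");

	if (TransactionIdGetCommitTsData(xid, &ts, &nodeid))
	{
		values[0] = TimestampTzGetDatum(ts);
		values[1] = ObjectIdGetDatum((Oid) nodeid);	/* RepOriginId is a uint16 */
	}
	else
		nulls[0] = nulls[1] = true;

	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}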

I am adding Peter E in CC for an opinion, the last commits of
pglogical are from him.
--
Michael


signature.asc
Description: PGP signature


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-29 Thread Dilip Kumar
On Tue, Jun 30, 2020 at 9:20 AM Amit Kapila  wrote:
>
> On Mon, Jun 29, 2020 at 4:24 PM Dilip Kumar  wrote:
> >
> > On Mon, Jun 22, 2020 at 4:30 PM Amit Kapila  wrote:
> > >
> > >
> > > Other than above tests, can we somehow verify that the invalidations
> > > generated at commit time are the same as what we do with this patch?
> > > We have verified with individual commands but it would be great if we
> > > can verify for the regression tests.
> >
> > I have verified this using a few random test cases.  To verify this,
> > I made some temporary code changes with an assert, as shown below.
> > Basically, in DecodeCommit we call the ReorderBufferAddInvalidations
> > function only for assert checking.
> >
> > -void
> >  ReorderBufferAddInvalidations(ReorderBuffer *rb, TransactionId xid,
> >   XLogRecPtr
> > lsn, Size nmsgs,
> > -
> > SharedInvalidationMessage *msgs)
> > +
> > SharedInvalidationMessage *msgs, bool commit)
> >  {
> > ReorderBufferTXN *txn;
> >
> > txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> > -
> > +   if (commit)
> > +   {
> > +   Assert(txn->ninvalidations == nmsgs);
> > +   return;
> > +   }
> >
> > The result is that a normal local test works fine.  But with the
> > regression suite, it hits asserts in many places because, when a
> > rollback of a subtransaction is involved, those invalidation messages
> > are not logged at commit time, whereas with command-time invalidation
> > they are logged.
> >
>
> Yeah, somehow, we need to ignore rollback to savepoint tests and
> verify for others.

Yeah, I have run the regression suite and I can see a lot of failures.
Maybe we can somehow examine the diffs and confirm that all the failures
are due to rollback to savepoint only.  I will work on this.

>
> > As of now, I have only put an assert on the count.  If we need to verify
> > the exact messages then we might need to somehow categorize the
> > invalidation messages, because the ordering of the messages will not be
> > the same.  For testing this we will have to arrange them by category,
> > i.e. relcache, catcache, and then we can compare them.
> >
>
> Can't we do this by verifying that each message at commit time exists
> in the list of invalidation messages we have collected via processing
> XLOG_XACT_INVALIDATIONS?

Let me figure out the easiest way to test this.
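
One simple way to do that check (a throwaway sketch for testing only, not
proposed code) would be a containment test over the two message arrays,
ignoring ordering:

/*
 * Verify that every invalidation message recorded at commit time also
 * appears in the set accumulated from XLOG_XACT_INVALIDATIONS records.
 * Ordering is ignored; the comparison uses memcmp on the whole
 * SharedInvalidationMessage, which assumes no uninitialized padding.
 */
#include "postgres.h"
#include "storage/sinval.h"

static bool
invals_are_subset(const SharedInvalidationMessage *commit_msgs, int ncommit,
				  const SharedInvalidationMessage *collected, int ncollected)
{
	for (int i = 0; i < ncommit; i++)
	{
		bool		found = false;

		for (int j = 0; j < ncollected; j++)
		{
			if (memcmp(&commit_msgs[i], &collected[j],
					   sizeof(SharedInvalidationMessage)) == 0)
			{
				found = true;
				break;
			}
		}
		if (!found)
			return false;
	}
	return true;
}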

>
> One additional question on patch
> v30-0003-Extend-the-output-plugin-API-with-stream-methods:
> +static void
> +stream_commit_cb_wrapper(ReorderBuffer *cache, ReorderBufferTXN *txn,
> + XLogRecPtr apply_lsn)
> {
> ..
> ..
> + state.report_location = apply_lsn;
> ..
> ..
> + ctx->write_location = apply_lsn;
> ..
> }
>
> Can't we name the last parameter as 'commit_lsn' as that is how
> documentation in the patch spells it and it sounds more appropriate?

You are right commit_lsn seems more appropriate here.

> Also, is there a reason for assigning report_location and
> write_location differently than what we do in commit_cb_wrapper?
> Basically, assign those as txn->final_lsn and txn->end_lsn
> respectively.

Yes, I think it should be handled in the same way as commit_cb_wrapper,
because before calling ReorderBufferStreamCommit in ReorderBufferCommit
we properly update the final_lsn as well as the end_lsn.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: refactoring basebackup.c

2020-06-29 Thread Dipesh Pandit
Hi,

I have repeated the experiment with 8K block size and found that the
results are not varying much after applying the patch.
Please find the details below.

*Backup type*: local backup using pg_basebackup
*Data size*: Around 200GB (200 tables - each table around 1.05 GB)
*TAR_SEND_SIZE value*: 8kb

*Server details:*
RAM: 500 GB
CPU details: Architecture: x86_64, CPU op-mode(s): 32-bit, 64-bit, Byte Order: Little Endian, CPU(s): 128
Filesystem: ext4

*Results:*

Iteration   Without refactor patch     With refactor patch
1st run     real 10m19.001s            real 9m45.291s
            user  1m37.895s            user 1m23.192s
            sys   8m33.008s            sys  8m14.993s
2nd run     real  9m33.970s            real 9m30.560s
            user  1m19.490s            user 1m22.124s
            sys   8m6.062s             sys  8m0.979s
3rd run     real  9m19.327s            real 8m59.241s
            user  1m21.772s            user 1m19.001s
            sys   7m50.613s            sys  7m32.645s
4th run     real  9m56.873s            real 9m52.290s
            user  1m22.370s            user 1m22.175s
            sys   8m27.054s            sys  8m23.052s
5th run     real  9m45.343s            real 9m49.633s
            user  1m23.113s            user 1m23.122s
            sys   8m15.418s            sys  8m19.240s

Later I connected with Suraj to validate the experiment details and
confirmed that the setup and steps followed in this experiment are exactly
the same as in the previous one.

Thanks,
Dipesh

On Thu, May 14, 2020 at 7:50 AM Suraj Kharage <suraj.khar...@enterprisedb.com> wrote:

> Hi,
>
> On Wed, May 13, 2020 at 7:49 PM Robert Haas  wrote:
>
>>
>> So the patch came out slightly faster at 8kB and slightly slower in the
>> other tests. That's kinda strange. I wonder if it's just noise. How much do
>> the results vary run to run?
>>
> It is not varying much except for 8kB run. Please see below details for
> both runs of each scenario.
>
>                        8kB               32kB (default)    128kB             1024kB
> Without refactor patch
>   1st run   real       10m50.924s        8m36.245s         7m8.690s          18m16.898s
>             user        1m29.774s        1m8.471s          0m54.840s          1m39.105s
>             sys         9m13.058s        7m21.520s         6m1.725s           9m42.803s
>   2nd run   real       10m22.718s        8m44.455s         6m54.299s         18m3.511s
>             user        1m23.629s        1m7.896s          0m55.690s          1m38.197s
>             sys         8m51.410s        7m28.909s         5m46.502s          9m36.517s
> With refactor patch
>   1st run   real       10m11.350s        8m56.226s         7m26.678s         19m5.218s
>             user        1m25.038s        1m9.774s          0m54.833s          1m44.122s
>             sys         8m39.226s        7m41.032s         6m20.057s         10m17.623s
>   2nd run   real       11m30.500s        9m4.103s          7m26.713s         18m17.230s
>             user        1m45.221s        1m6.893s          0m54.868s          1m42.749s
>             sys         9m37.815s        7m49.393s         6m19.652s          9m53.704s
>
>
> --
> --
>
> Thanks & Regards,
> Suraj kharage,
> EnterpriseDB Corporation,
> The Postgres Database Company.
>


Re: track_planning causing performance regression

2020-06-29 Thread Fujii Masao



On 2020/06/29 22:23, Ants Aasma wrote:

On Mon, 29 Jun 2020 at 12:17, Julien Rouhaud <rjuju...@gmail.com> wrote:

    On Mon, Jun 29, 2020 at 10:55 AM Fujii Masao <masao.fu...@oss.nttdata.com> wrote:
 >
 > On 2020/06/29 16:05, Julien Rouhaud wrote:
     > > On Mon, Jun 29, 2020 at 7:49 AM Tharakan, Robins <tha...@amazon.com> wrote:
 > >>
     > >> During fully-cached SELECT-only test using pgbench, Postgres v13Beta1 shows
 >
 > Thanks for the benchmark!
 >
 >
     > >> ~45% performance drop [2] at high DB connection counts (when compared with v12.3)
 >
 > That's bad :(
 >
 >
 > >>
 > >> Disabling pg_stat_statements.track_planning (which is 'On' by default)
 > >> brings the TPS numbers up to v12.3 levels.
 > >>
     > >> The inflection point (in this test-case) is 128 Connections, beyond which the
     > >> TPS numbers are consistently low. Looking at the mailing list [1], this issue
     > >> didn't surface earlier possibly since the regression is trivial at low connection counts.
     > >>
     > >> It would be great if this could be optimized further, or track_planning
     > >> disabled (by default) so as to not trip users upgrading from v12 with pg_stat_statement
     > >> enabled (but otherwise not particularly interested in track_planning).
 >
 > Your benchmark result seems to suggest that the cause of the problem is
 > the contention of per-query spinlock in pgss_store(). Right?
 > This lock contention is likely to happen when multiple sessions run
 > the same queries.
 >
 > One idea to reduce that lock contention is to separate per-query spinlock
     > into two; one is for planning, and the other is for execution.  pgss_store()
 > determines which lock to use based on the given "kind" argument.
 > To make this idea work, also every pgss counters like shared_blks_hit
 > need to be separated into two, i.e., for planning and execution.

This can probably remove some overhead, but won't it eventually hit
the same issue when multiple connections try to plan the same query,
given the number of different queries and very low execution runtime?
It'll also quite increase the shared memory consumption.

I'm wondering if we could instead use atomics to store the counters.
The only downside is that we won't guarantee per-row consistency
anymore, which may be problematic.



The problem looks to be that spinlocks are terrible with overloaded CPU and a 
contended spinlock. A process holding the spinlock might easily get scheduled 
out leading to excessive spinning by everybody. I think a simple thing to try 
would be to replace the spinlock with LWLock.


Yes. Attached is the POC patch that replaces per-counter spinlock with LWLock.



I did a prototype patch that replaces spinlocks with futexes, but was not able 
to find a workload where it mattered.


I'm not familiar with futex, but could you tell me why you used futex instead
of LWLock that we already have? Is futex portable?


We have done a great job at eliminating spinlocks from contended code paths. 
Robins, perhaps you could try it to see if it reduces the regression you are 
observing.


Yes. Also we need to check that this change doesn't increase performance
overhead in other workloads.
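
As a side note on the atomics idea mentioned earlier in the thread, below is a
minimal sketch of what per-counter atomics could look like.  The struct and
field names are illustrative only; the real pgss counters also include doubles
such as timing totals, which is one practical complication, and per-row
consistency across fields is lost, as Julien pointed out.

/*
 * Illustrative only: lock-free accumulation of two 64-bit counters using
 * PostgreSQL's atomics, instead of taking a per-entry spinlock.
 */
#include "postgres.h"
#include "port/atomics.h"

typedef struct CountersAtomic
{
	pg_atomic_uint64 calls;				/* number of executions */
	pg_atomic_uint64 total_time_us;		/* total time, in microseconds */
} CountersAtomic;

static void
counters_init(CountersAtomic *c)
{
	pg_atomic_init_u64(&c->calls, 0);
	pg_atomic_init_u64(&c->total_time_us, 0);
}

static void
counters_accum(CountersAtomic *c, uint64 time_us)
{
	/* each field is updated independently; no lock is held */
	pg_atomic_fetch_add_u64(&c->calls, 1);
	pg_atomic_fetch_add_u64(&c->total_time_us, time_us);
}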

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
b/contrib/pg_stat_statements/pg_stat_statements.c
index cef8bb5a49..aa506f6c11 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -39,7 +39,7 @@
  * in an entry except the counters requires the same.  To look up an entry,
  * one must hold the lock shared.  To read or update the counters within
  * an entry, one must hold the lock shared or exclusive (so the entry doesn't
- * disappear!) and also take the entry's mutex spinlock.
+ * disappear!) and also take the entry's partition lock.
  * The shared state variable pgss->extent (the next free spot in the external
  * query-text file) should be accessed only while holding either the
  * pgss->mutex spinlock, or exclusive lock on pgss->lock.  We use the mutex to
@@ -115,6 +115,11 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100;
 
 #define JUMBLE_SIZE				1024	/* query serialization buffer size */
 
+#define PGSS_NUM_LOCK_PARTITIONS()		(pgss_max)
+#define PGSS_HASH_PARTITION_LOCK(key)	\
+	(&(pgss->base + \
+	   (get_hash_value(pgss_hash, key) % PGSS_NUM_LOCK_PARTITIONS()))->lock)
+
 /*
  * Extension version number, for supporting older extension versions' objects
  */
@@ -207,7 +212,7 @@ typedef struct pgssEntry
 	Size		query_offset;	/* query text offset in external file */
 	int			query

Re: POC: postgres_fdw insert batching

2020-06-29 Thread Amit Langote
On Tue, Jun 30, 2020 at 1:22 PM Ashutosh Bapat
 wrote:
> On Tue, 30 Jun 2020 at 08:47, Etsuro Fujita  wrote:
>> On Mon, Jun 29, 2020 at 7:52 PM Ashutosh Bapat
>>  wrote:
>> > On Sun, Jun 28, 2020 at 8:40 PM Tomas Vondra
>> >  wrote:
>>
>> > > 3) What about the other DML operations (DELETE/UPDATE)?
>> > >
>> > > The other DML operations could probably benefit from the batching too.
>> > > INSERT was good enough for a PoC, but having batching only for INSERT
>> > > seems somewhat asymmetric. DELETE/UPDATE seem more complicated because
>> > > of quals, but likely doable.
>> >
>> > Bulk INSERTs are more common in a sharded environment because of data
>> > load in say OLAP systems. Bulk update/delete are rare, although not
>> > that rare. So if an approach just supports bulk insert and not bulk
>> > UPDATE/DELETE that will address a large number of usecases IMO. But if
>> > we can make everything work together that would be good as well.
>>
>> In most cases, I think the entire UPDATE/DELETE operations would be
>> pushed down to the remote side by DirectModify.  So, I'm not sure we
>> really need the bulk UPDATE/DELETE.
> That may not be true for a partitioned table whose partitions are foreign
> tables, especially given the work that Amit Langote is doing [1]. It really
> depends on postgres_fdw's ability to detect that the DML modifying each of the
> partitions can be pushed down, and that may not come easily.

While it's true that how to accommodate the DirectModify API in the
new inherited update/delete planning approach is an open question on
that thread, I would eventually like to find an answer to that.  That
is, that work shouldn't result in losing the foreign partition's
ability to use DirectModify API to optimize updates/deletes.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: A patch for get origin from commit_ts.

2020-06-29 Thread movead...@highgo.ca

>> A second thing is that TransactionIdGetCommitTsData() was introduced in
>> core (73c986add). It has only one caller, pg_xact_commit_timestamp(), which
>> passes RepOriginId as NULL, making the last argument to
>> TransactionIdGetCommitTsData() dead code in core.
>>
>> Quick code search shows that it is getting used by pglogical (caller:
>> https://sources.debian.org/src/pglogical/2.3.2-1/pglogical_conflict.c/?hl=509#L509).
>> CCing Craig Ringer and Petr Jelinek for the inputs.
 
>Another question that has popped up when doing this review is what
>would be the use-case of adding this information at SQL level knowing
>that logical replication exists since 10?  Having dead code in the
>backend tree is not a good idea of course, so we can also have as
>argument to simplify TransactionIdGetCommitTsData().  Now, pglogical
>has pglogical_xact_commit_timestamp_origin() to get the replication
>origin with its own function so providing an extending equivalent
>returning one row with two fields would be nice for pglogical so as
>this function is not necessary.  As mentioned by Madan, the portion of
>the code using TransactionIdGetCommitTsData() relies on it for
>conflicts of updates (the first win, last win logic at quick glance).

Thanks for the explanation. The origin stored in commit_ts seems unused; I just
wanted to know why it appears there. It's OK to close this issue if we do not
want to touch it now.

I am more interested in the origin recorded in WAL: if data comes from logical
replication or a manually set origin, many WAL records carry a 'RepOriginId',
and the 'RepOriginId' in 'xact' WAL records could help with filtering, yet it
is likewise dead code. So can you help me understand the reason, or the
historical background, for that?




Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: refactoring basebackup.c

2020-06-29 Thread Suraj Kharage
On Tue, Jun 30, 2020 at 10:45 AM Dipesh Pandit 
wrote:

> Hi,
>
> I have repeated the experiment with 8K block size and found that the
> results are not varying much after applying the patch.
> Please find the details below.
>
>
> Later I connected with Suraj to validate the experiment details and found
> that the setup and steps followed are exactly the same in this
> experiment when compared with the previous experiment.
>
>
Thanks Dipesh.
It looks like the results do not vary much in your run, since you
followed the same steps.
One of my 8kB runs that took more time than the others was probably just
noise at that time.

-- 
--

Thanks & Regards,
Suraj kharage,



edbpostgres.com


warnings for invalid function casts

2020-06-29 Thread Peter Eisentraut
Some time ago, there were some discussions about gcc warnings produced 
by -Wcast-function-type [0].  To clarify, while that thread seemed to 
imply that the warnings appear by default in some compiler version, this 
is not the case AFAICT, and the warnings are entirely optional.


So I took a look at what it would take to fix all the warnings and came 
up with the attached patch.


There are three subplots:

1. Changing the return type of load_external_function() and 
lookup_external_function() from PGFunction to a generic pointer type, 
which is what the discussion in [0] started out about.


2. There is a bit of cheating in dynahash.c.  The keycopy field is 
declared as a function pointer that returns a pointer to the 
destination, to match the signature of memcpy(), but then we assign 
strlcpy() to it, which returns size_t.  Even though we never use the 
return value, I'm not sure whether this could break if size_t and 
pointers are of different sizes, which in turn is very unlikely.  (A 
small standalone illustration of this pattern follows the list below.)


3. Finally, there is some nonsense necessary in plpython, which is 
annoying but otherwise uninteresting.
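
For reference, here is a small standalone example of the warning that points 1
and 2 are about, and of the generic-pointer idiom gcc documents for silencing
it.  This is not the PostgreSQL code itself; the type and function names are
made up.

/*
 * Compile with: gcc -Wcast-function-type -c example.c   (gcc 8 or later)
 * The first assignment should draw -Wcast-function-type because the
 * return types differ (size_t vs. void *); casting through the generic
 * "void (*) (void)" type is the documented way to suppress it.  Calling
 * through the mismatched pointer works in practice when the return value
 * is ignored, but is formally undefined behavior, which is the concern
 * raised above for dynahash's keycopy.
 */
#include <stddef.h>
#include <string.h>

typedef void *(*keycopy_fn) (void *dest, const void *src, size_t n);

static size_t
my_strlcpy(char *dst, const char *src, size_t n)
{
	size_t		len = strlen(src);

	if (n > 0)
	{
		size_t		copy = (len >= n) ? n - 1 : len;

		memcpy(dst, src, copy);
		dst[copy] = '\0';
	}
	return len;
}

int
main(void)
{
	keycopy_fn	keycopy;
	char		buf[8];

	keycopy = (keycopy_fn) my_strlcpy;						/* warns */
	keycopy = (keycopy_fn) (void (*) (void)) my_strlcpy;	/* no warning */

	keycopy(buf, "hello", sizeof(buf));		/* return value ignored */
	return 0;
}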


Is there anything we want to pursue further here?


[0]: 
https://www.postgresql.org/message-id/flat/20180206200205.f5kvbyn6jawtzi6s%40alap3.anarazel.de


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 7212f3e13107bf262dd0e0c4baeb2f8170ca6073 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 30 Jun 2020 01:19:56 +0200
Subject: [PATCH] Fix -Wcast-function-type warnings

Discussion: 
https://www.postgresql.org/message-id/flat/20180206200205.f5kvbyn6jawtzi6s%40alap3.anarazel.de
---
 configure | 91 +++
 configure.in  |  2 +
 src/backend/utils/fmgr/dfmgr.c| 14 ++---
 src/backend/utils/fmgr/fmgr.c |  2 +-
 src/backend/utils/hash/dynahash.c | 11 +++-
 src/include/fmgr.h|  8 ++-
 src/pl/plpython/plpy_plpymodule.c | 14 ++---
 7 files changed, 123 insertions(+), 19 deletions(-)

diff --git a/configure b/configure
index 2feff37fe3..9907637e31 100755
--- a/configure
+++ b/configure
@@ -5643,6 +5643,97 @@ if test 
x"$pgac_cv_prog_CXX_cxxflags__Wimplicit_fallthrough_3" = x"yes"; then
 fi
 
 
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports 
-Wcast-function-type, for CFLAGS" >&5
+$as_echo_n "checking whether ${CC} supports -Wcast-function-type, for 
CFLAGS... " >&6; }
+if ${pgac_cv_prog_CC_cflags__Wcast_function_type+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CFLAGS=$CFLAGS
+pgac_save_CC=$CC
+CC=${CC}
+CFLAGS="${CFLAGS} -Wcast-function-type"
+ac_save_c_werror_flag=$ac_c_werror_flag
+ac_c_werror_flag=yes
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv_prog_CC_cflags__Wcast_function_type=yes
+else
+  pgac_cv_prog_CC_cflags__Wcast_function_type=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_c_werror_flag=$ac_save_c_werror_flag
+CFLAGS="$pgac_save_CFLAGS"
+CC="$pgac_save_CC"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 
$pgac_cv_prog_CC_cflags__Wcast_function_type" >&5
+$as_echo "$pgac_cv_prog_CC_cflags__Wcast_function_type" >&6; }
+if test x"$pgac_cv_prog_CC_cflags__Wcast_function_type" = x"yes"; then
+  CFLAGS="${CFLAGS} -Wcast-function-type"
+fi
+
+
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CXX} supports 
-Wcast-function-type, for CXXFLAGS" >&5
+$as_echo_n "checking whether ${CXX} supports -Wcast-function-type, for 
CXXFLAGS... " >&6; }
+if ${pgac_cv_prog_CXX_cxxflags__Wcast_function_type+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  pgac_save_CXXFLAGS=$CXXFLAGS
+pgac_save_CXX=$CXX
+CXX=${CXX}
+CXXFLAGS="${CXXFLAGS} -Wcast-function-type"
+ac_save_cxx_werror_flag=$ac_cxx_werror_flag
+ac_cxx_werror_flag=yes
+ac_ext=cpp
+ac_cpp='$CXXCPP $CPPFLAGS'
+ac_compile='$CXX -c $CXXFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CXX -o conftest$ac_exeext $CXXFLAGS $CPPFLAGS $LDFLAGS 
conftest.$ac_ext $LIBS >&5'
+ac_compiler_gnu=$ac_cv_cxx_compiler_gnu
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_cxx_try_compile "$LINENO"; then :
+  pgac_cv_prog_CXX_cxxflags__Wcast_function_type=yes
+else
+  pgac_cv_prog_CXX_cxxflags__Wcast_function_type=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+ac_ext=c
+ac_cpp='$CPP $CPPFLAGS'
+ac_compile='$CC -c $CFLAGS $CPPFLAGS conftest.$ac_ext >&5'
+ac_link='$CC -o conftest$ac_exeext $CFLAGS $CPPFLAGS $LDFLAGS conftest.$ac_ext 
$LIBS >&5'
+ac_compiler_gnu=$ac_cv_c_compiler_gnu
+
+ac_cxx_werror_flag=$ac_save_cxx_werror_flag
+CXXFLAGS="$pgac_save_CXXFLAGS"
+CXX="$pgac_save_CXX"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: 
$pgac_cv_prog_CXX_cxxflags__Wcast_function_type" >&5