Recovery performance of standby for multiple concurrent truncates on large tables
Hello hackers,

Recently, the problem of improving the performance of multiple DROP/TRUNCATE TABLE commands in a single transaction with large shared_buffers (as shown below) was solved by commit b416691.

BEGIN;
truncate tbl001;
...
truncate tbl050;
COMMIT;

However, we have a customer that needs to execute multiple concurrent TRUNCATEs (40~50) on different large tables (as shown below) in a shorter amount of time. This case is not covered by the previous commit's improvement.

BEGIN;
truncate tbl001;
COMMIT;
...
BEGIN;
truncate tbl050;
COMMIT;

[Problem]
Currently, when the standby replays the WAL of a TRUNCATE/DROP TABLE, each record leads to a separate scan of the whole shared buffer pool to check whether the table being deleted is cached there. Moreover, if shared_buffers is large (e.g. 300GB) and the primary server fails during replay, it can take a long while for the standby to complete recovery.

[Idea]
In the current implementation, the replay of each TRUNCATE/DROP TABLE scans the whole shared buffer pool. One approach (though the idea is not really developed yet) is to improve recovery by delaying the shared buffer scan and invalidation (DropRelFileNodeBuffers) until after the next checkpoint (after failover completion). The replay of TRUNCATE/DROP TABLE would just make the checkpointer process remember which relations should be invalidated in the shared buffer pool during the subsequent checkpoint. The checkpointer then scans the shared buffer pool only once to invalidate the buffers of all relations that were dropped or truncated.

However, this is still a rough idea, so I am not sure if it's feasible. I would like to know if the community has advice or alternative solutions on how to work around this.

Any insights, advice, feedback? Thank you in advance.

Regards,
Kirk Jamison
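For reference, the per-relation cost described above comes from the shape of the invalidation scan. A simplified sketch of it follows (buffer header locking, fork numbers and the block-number cut-off are omitted, so this is not the exact source); with N truncated relations, this whole loop runs N times during recovery:

/*
 * Simplified sketch of the per-relation buffer invalidation discussed
 * above; not compilable as-is, locking and fork/block checks omitted.
 */
static void
drop_rel_buffers_sketch(RelFileNode rnode)
{
	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);

		/* unlocked pre-check; the real code re-checks under the header lock */
		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
			InvalidateBuffer(bufHdr);
	}
}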
Re: Recovery performance of standby for multiple concurrent truncates on large tables
Hi,

On 2018-07-10 07:05:12 +, Jamison, Kirk wrote:
> Hello hackers,
>
> Recently, the problem on improving performance of multiple drop/truncate
> tables in a single transaction with large shared_buffers (as shown below) was
> solved by commit b416691.
> BEGIN;
> truncate tbl001;
> ...
> truncate tbl050;
> COMMIT;
>
> However, we have a customer that needs to execute multiple concurrent
> TRUNCATEs (40~50) on different large tables (as shown below) in a shorter
> amount of time. This one is not covered by the previous commit's improvement.
> BEGIN;
> truncate tbl001;
> COMMIT;
> ...
> BEGIN;
> truncate tbl050;
> COMMIT;
>
> [Problem]
> Currently, when the standby recovers the WAL of TRUNCATE/DROP TABLE, it leads
> to separate scans of the whole shared buffer in sequence to check whether or
> not the table to be deleted is cached in the shared buffer. Moreover, if the
> size of shared_buffers is large (i.e. 300GB) and the primary server fails
> during the replay, it would take a long while for the standby to complete
> recovery.

If you do so sequentially on the primary, I'm not clear as to why you
think the issue is bigger in recovery?

> [Idea]
> Since in the current implementation, the replay of each TRUNCATE/DROP TABLE
> scans the whole shared buffer.
> One approach (though idea is not really developed yet) is to improve the
> recovery by delaying the shared buffer scan and invalidation
> (DropRelFileNodeBuffers) and to put it after the next checkpoint (after
> failover completion). The replay of TRUNCATE/DROP TABLE just make the
> checkpointer process remember what relations should be invalidated in the
> shared buffer during subsequent checkpoint. The checkpointer then scans the
> shared buffer only once to invalidate the buffers of relations that was
> dropped and truncated.

I think you'd run into a lot of very hairy details with this approach.
Consider what happens if client processes need fresh buffers and need to
write out a victim buffer. You'll need to know that the relevant buffer is
actually invalid. Thus the knowledge about the "delayed" drops would need
to be in shared buffers and scanned on every dirty buffer writeout.

> However, this is still a rough idea, so I am not sure if it's feasible. I
> would like to know if the community has advice or other alternative solutions
> on how to work around this.
> Any insights, advice, feedback?

I personally think we should rather just work towards an ordered buffer
mapping implementation.

Greetings,

Andres Freund
Re: Non-reserved replication slots and slot advancing
Hi,

On 2018-07-10 09:54:28 +0900, Michael Paquier wrote:
> > > Also, ERRCODE_FEATURE_NOT_SUPPORTED doesn't quite seem right. How about
> > > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE?
> >
> > +1 to both of Andres' suggestions.
>
> Those indeed sound better. What do you think of the attached?

Looks generally good. One remark:

> +	/* A slot whose restart_lsn has never been reserved cannot be advanced */
> +	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
> +	{
> +		ReplicationSlotRelease();
> +		ereport(ERROR,
> +				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +				 errmsg("cannot advance replication slot that has not previously reserved WAL")));
> +	}

Why is the ReplicationSlotRelease() needed here? Shouldn't the error
handling code do so automatically?

Greetings,

Andres Freund
Re: Non-reserved replication slots and slot advancing
On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote: > Why is the ReplicationSlotRelease() needed here? Souldn't the error > handling code do so automatically? Oh, indeed. I didn't know that PostgresMain was doing some cleanup here. There is a second place where this can be removed, which comes from the original commit which added the advance function. An updated version is attached. Do you have other comments? -- Michael diff --git a/contrib/test_decoding/expected/slot.out b/contrib/test_decoding/expected/slot.out index 21e9d56f73..2737a8a301 100644 --- a/contrib/test_decoding/expected/slot.out +++ b/contrib/test_decoding/expected/slot.out @@ -131,3 +131,20 @@ SELECT pg_drop_replication_slot('regression_slot1'); ERROR: replication slot "regression_slot1" does not exist SELECT pg_drop_replication_slot('regression_slot2'); ERROR: replication slot "regression_slot2" does not exist +-- slot advance with physical slot, error with non-reserved slot +SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3'); +slot_name +-- + regression_slot3 +(1 row) + +SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN +ERROR: invalid target wal lsn +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error +ERROR: cannot advance replication slot that has not previously reserved WAL +SELECT pg_drop_replication_slot('regression_slot3'); + pg_drop_replication_slot +-- + +(1 row) + diff --git a/contrib/test_decoding/sql/slot.sql b/contrib/test_decoding/sql/slot.sql index 706340c1d8..24cdf7155d 100644 --- a/contrib/test_decoding/sql/slot.sql +++ b/contrib/test_decoding/sql/slot.sql @@ -68,3 +68,9 @@ SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot1', 'test_ -- both should error as they should be dropped on error SELECT pg_drop_replication_slot('regression_slot1'); SELECT pg_drop_replication_slot('regression_slot2'); + +-- slot advance with physical slot, error with non-reserved slot +SELECT slot_name FROM pg_create_physical_replication_slot('regression_slot3'); +SELECT pg_replication_slot_advance('regression_slot3', '0/0'); -- invalid LSN +SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error +SELECT pg_drop_replication_slot('regression_slot3'); diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 3ed9021c2f..4851bc2e24 100644 --- a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -9867,7 +9867,8 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx The address (LSN) of oldest WAL which still might be required by the consumer of this slot and thus won't be - automatically removed during checkpoints. + automatically removed during checkpoints. NULL + if the LSN of this slot has never been reserved. diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 2806e1076c..c92f5f0789 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -483,6 +483,12 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) /* Acquire the slot so we "own" it */ ReplicationSlotAcquire(NameStr(*slotname), true); + /* A slot whose restart_lsn has never been reserved cannot be advanced */ + if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn)) + ereport(ERROR, +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("cannot advance replication slot that has not previously reserved WAL"))); + /* * Check if the slot is not moving backwards. 
Physical slots rely simply * on restart_lsn as a minimum point, while logical slots have confirmed @@ -495,14 +501,11 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) minlsn = MyReplicationSlot->data.restart_lsn; if (moveto < minlsn) - { - ReplicationSlotRelease(); ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot move slot to %X/%X, minimum is %X/%X", + errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X", (uint32) (moveto >> 32), (uint32) moveto, (uint32) (minlsn >> 32), (uint32) minlsn))); - } /* Do the actual slot update, depending on the slot type */ if (OidIsValid(MyReplicationSlot->data.database))
RE: hostorder and failover_timeout for libpq
Hello Ildar,

I have a question about the failover_timeout parameter. Which would be better: having libpq handle the retry wait through this parameter, or controlling connection retries on the application side?

Also, I am not sure whether the random host selection provided by the hostorder parameter will actually have a good effect on load balancing. Please let me know if there are examples.

I am sorry if these points were already examined in the previous thread; I haven't read it yet.

Regards,
Aya Iwata
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Tue, Jul 10, 2018 at 12:26 AM Fujii Masao wrote:
> On Sun, Jul 8, 2018 at 11:48 AM, Haribabu Kommi wrote:
> >
> > On Fri, Jul 6, 2018 at 3:22 AM Fujii Masao wrote:
> >>
> >> On Wed, Jul 4, 2018 at 7:12 PM, Haribabu Kommi <kommi.harib...@gmail.com>
> >> wrote:
> >> >
> >> > Update patch attached.
> >>
> >> + if (userid != 0 && dbid != 0 && queryid != 0)
> >>
> >> UINT64CONST() should be used for the constant for queryid?
> >
> > OK.
> >
> >> It's rare case, but 0 can be assigned as valid queryid. Right?
> >
> > But for normal queries, in post parse analyze function, the queryID
> > is calculated and it set to 1, in case if the calculation becomes 0.
> > But for the utility statements, the calculation is done using the
> > pgss_hash_string() function. I am not sure whether this function
> > can return 0.
>
> Though I've not read the whole code of pgss_hash_string(), ISTM that
> the function can return 0. Otherwise, the following code is unnecessary
> after queryid is assigned by hash_any_extended(),
> in pgss_post_parse_analyze().
>
> /*
>  * If we are unlucky enough to get a hash of zero, use 1 instead, to
>  * prevent confusion with the utility-statement case.
>  */
> if (query->queryId == UINT64CONST(0))
>     query->queryId = UINT64CONST(1);
>
> > If yes, then we may need same handling to utility statements
> > similar like normal statements but with queryID as 2 for utility statements.
>
> That's possible, but I think that it's better to get rid of such corner
> case at all.

QueryID 2 is now used when the generated hash value is 0 for a utility statement, to distinguish it from normal statements, and the UINT64CONST macro is used as per the earlier comment.

Updated patch attached. This patch needs to be applied on top of the ACL permissions revert patch, which is also in this thread.

Regards,
Haribabu Kommi
Fujitsu Australia

0002-pg_stat_statements_reset-to-reset-specific-query-use_v4.patch
0001-Revoke-pg_stat_statements_reset-permissions_v2.patch
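To illustrate the handling described above, the utility-statement path could mirror the existing normal-statement special case roughly like this (a sketch only, not the actual patch; the variable names are assumptions):

/*
 * Hedged sketch of the utility-statement handling discussed above, not
 * the actual patch: map a zero hash to 2 so that utility statements stay
 * distinguishable from normal statements (which map zero to 1).
 */
queryId = pgss_hash_string(querytext, query_len);	/* names assumed */
if (queryId == UINT64CONST(0))
	queryId = UINT64CONST(2);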
Re: Non-reserved replication slots and slot advancing
On 2018-07-10 16:59:07 +0900, Michael Paquier wrote:
> On Tue, Jul 10, 2018 at 12:26:30AM -0700, Andres Freund wrote:
> > Why is the ReplicationSlotRelease() needed here? Souldn't the error
> > handling code do so automatically?
>
> Oh, indeed. I didn't know that PostgresMain was doing some cleanup
> here. There is a second place where this can be removed, which comes
> from the original commit which added the advance function.

Gna, it's almost like the original code wasn't properly reviewed.

> An updated version is attached. Do you have other comments?

Looks sane, without having tested it.

> 	if (moveto < minlsn)
> -	{
> -		ReplicationSlotRelease();
> 		ereport(ERROR,
> 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> -				 errmsg("cannot move slot to %X/%X, minimum is %X/%X",
> +				 errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X",
> 						(uint32) (moveto >> 32), (uint32) moveto,
> 						(uint32) (minlsn >> 32), (uint32) minlsn)));
> -	}

If you're touching this, I'd also change the errcode here.

Greetings,

Andres Freund
[Tiny Debug Issue] Undefined Reference problem encountered during compile
First of all, thank you for reading this mail.

As mentioned in the title, I get an undefined reference at compile time. What puzzles me is that I did add a header file for the function.

In more detail, I want to use the TransactionIdEquals() function in the freelist.c file, so I included the header file "access/xact.h" for it. However, when running make, the compiler emits an "implicit declaration of function" warning and then an undefined reference error at link time.

How could I overcome this problem? :(
RE: How can we submit code patches that implement our (pending) patents?
From: Nico Williams [mailto:n...@cryptonector.com]
> On Sat, Jul 07, 2018 at 10:20:35AM -0700, Andres Freund wrote:
> > It's entirely possible to dual license contributions and everything. Why
> > are you making such aggressive statements about a, so far, apparently
> > good faith engagement?
>
> One problem is that many contributors would not want to be tainted by
> knowledge of the patents in question (since that leads to triple
> damages).
>
> How would you protect contributors and core developers from tainting?

IIUC, you are concerned about the possibility that PG developers would read the patent document (not the PG source code), and unconsciously use the patented algorithm in other software that's not related to PostgreSQL. That could only be avoided by not reading the patent document...

BTW, are you confident that the current PostgreSQL doesn't contain any patented code? As far as I know, the PostgreSQL development process doesn't have a step to check for patent and copyright infringement.

> One possible answer is that you wouldn't. But that might reduce the
> size of the community, or lead to a fork.

Yes, that's one unfortunate future, which of course I don't want to happen. I believe PostgreSQL should accept patented code to support its further evolution, because PostgreSQL is now popular, influential software that many organizations want to join.

Regards
Takayuki Tsunakawa
Re: [Tiny Debug Issue] Undefined Reference problem encountered during compile
Hi,

On Tue, Jul 10, 2018 at 10:19 AM, 이재훈 wrote:
>
> I want to use TransactionIdEquals() function in freelist.c file.
> Then, I included a header file, "access/xact.h" for the function.
>
> However, in make command, compiler send implicit declaration of function
> warning and then undefined reference at linking time.
>
> How could I overcome this problem? :(

This macro is actually defined in access/transam.h.
Re: [Tiny Debug Issue] Undefined Reference problem encountered during compile
On Tue, Jul 10, 2018 at 05:19:25PM +0900, 이재훈 wrote:
> I want to use TransactionIdEquals() function in freelist.c file.
> Then, I included a header file, "access/xact.h" for the function.

This is in access/transam.h.
--
Michael
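A minimal usage sketch (the transaction id variables below are made-up placeholders, not taken from the original question):

#include "access/transam.h"		/* provides the TransactionIdEquals() macro */

/* hypothetical snippet: xid1 and xid2 are placeholder TransactionId values */
if (TransactionIdEquals(xid1, xid2))
	elog(DEBUG1, "same transaction");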
Re: Non-reserved replication slots and slot advancing
On Tue, Jul 10, 2018 at 01:16:30AM -0700, Andres Freund wrote:
> On 2018-07-10 16:59:07 +0900, Michael Paquier wrote:
> > 	if (moveto < minlsn)
> > -	{
> > -		ReplicationSlotRelease();
> > 		ereport(ERROR,
> > 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > -				 errmsg("cannot move slot to %X/%X, minimum is %X/%X",
> > +				 errmsg("cannot advance replication slot to %X/%X, minimum is %X/%X",
> > 						(uint32) (moveto >> 32), (uint32) moveto,
> > 						(uint32) (minlsn >> 32), (uint32) minlsn)));
> > -	}
>
> If you're touching this, I'd also change the errcode here.

Yep, let's change that as well. If you want to look at that stuff more deeply, please feel free. Otherwise I could always push what I have now.
--
Michael
RE: How can we submit code patches that implement our (pending) patents?
From: Markus Wanner [mailto:markus.wan...@2ndquadrant.com]
> equally sure there are well intended ones as well. For example, I'd
> expect patent pools (including the Open Invention Network, cited by the
> OP) to hire non-IANAL personnel who know Legalese well enough to setup
> valid contracts (between participating companies).

I think I'll consult Open Invention Network on this issue, since I haven't received any reply from SFLC.

> I certainly like the (future) patent holder coming forth to offer a
> grant a lot better than the one who doesn't (but still holds the
> patent). I'm missing the appreciation for that former strategy in this
> thread and fear we're setting a precedent for the latter one, instead.

Me too.

Regards
Takayuki Tsunakawa
Re: How can we submit code patches that implement our (pending) patents?
Hi

On Tue, Jul 10, 2018 at 9:29 AM, Tsunakawa, Takayuki <tsunakawa.ta...@jp.fujitsu.com> wrote:
> From: Markus Wanner [mailto:markus.wan...@2ndquadrant.com]
> > equally sure there are well intended ones as well. For example, I'd
> > expect patent pools (including the Open Invention Network, cited by the
> > OP) to hire non-IANAL personnel who know Legalese well enough to setup
> > valid contracts (between participating companies).
>
> I think I'll consult Open Invention Network on this issue, since I haven't
> received any reply from SFLC.

SFLC have acted as the project's counsel in the past, so I'm not surprised they aren't talking to you; you won't be a known contact to them as a PG contributor, and as a Fujitsu employee there would likely be a conflict of interest for them to talk to you.

> > I certainly like the (future) patent holder coming forth to offer a
> > grant a lot better than the one who doesn't (but still holds the
> > patent). I'm missing the appreciation for that former strategy in this
> > thread and fear we're setting a precedent for the latter one, instead.
>
> Me too.
>
> Regards
> Takayuki Tsunakawa

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Recovery performance of standby for multiple concurrent truncates on large tables
On Tue, Jul 10, 2018 at 10:05 AM Jamison, Kirk wrote:
> Since in the current implementation, the replay of each TRUNCATE/DROP
> TABLE scans the whole shared buffer.
>
> One approach (though idea is not really developed yet) is to improve the
> recovery by delaying the shared buffer scan and invalidation
> (DropRelFileNodeBuffers) and to put it after the next checkpoint (after
> failover completion). The replay of TRUNCATE/DROP TABLE just make the
> checkpointer process remember what relations should be invalidated in the
> shared buffer during subsequent checkpoint. The checkpointer then scans the
> shared buffer only once to invalidate the buffers of relations that was
> dropped and truncated.

How about using the background writer for this? It seems to me that the main reason to invalidate buffers would be to free them up for buffer allocation, which is precisely the task of the background writer. When adding a filenode to be invalidated, take note of the bgwriter position and add the filenode to a queue. As the bgwriter advances, it checks each buffer tag against a hash table of filenodes being invalidated. When the background writer has completed a loop, it can remove the invalidated filenode. When the bgwriter falls behind the clock sweep and there are filenodes to invalidate, it should run the invalidation scan instead of skipping ahead. If there are already too many filenodes being invalidated, then whoever is trying to add a new one gets to run the invalidation scan until something can be evicted.

--
Ants Aasma
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com/
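For illustration only, the buffer-tag check described above might look roughly like this; every identifier here (the hash table, struct layout and function name) is invented for the sketch and nothing below exists in the tree:

/*
 * Illustrative sketch of the idea above; all names are invented.
 * A small shared hash table keyed by RelFileNode holds the filenodes
 * whose buffers still need to be invalidated, and the clock sweep
 * consults it for every buffer it passes.
 */
static HTAB *pending_invalidations;		/* hypothetical shared hash table */

static bool
buffer_has_pending_invalidation(BufferDesc *bufHdr)
{
	return hash_search(pending_invalidations,
					   &bufHdr->tag.rnode,
					   HASH_FIND, NULL) != NULL;
}

/*
 * In the bgwriter loop (sketch): if buffer_has_pending_invalidation()
 * returns true, invalidate the buffer instead of considering it for
 * cleaning; once the sweep has made a full circle past the position noted
 * when a filenode was queued, that filenode can be removed from the table.
 */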
Re: [PATCH] Improve geometric types
> The version number restriction isn't strictly needed. I only
> suggested it because it'd match the #if that wraps the code that's
> actually using those macros, introduced by commit cec8394b5ccd. That
> was presumably done because versions >= 1800 (= Visual Studio 2013)
> have their own definitions of isinf() and isnan(), and I guess that
> our definitions were probably breaking stuff on that compiler.

Now I understand what you mean. win32_port.h defines isnan(x) as _isnan(x) if (_MSC_VER < 1800). It doesn't look right to have the definition in there without including the header that _isnan() is coming from. I am preparing an additional patch to add the include and remove it from the files where it was obviously put only to work around this problem.
Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()
On 09.07.18 15:49, Heikki Linnakangas wrote: > The way > forward is to test if we can get the same performance benefit from > switching to CMPXCHG16B, and keep the WAL format unchanged. I have implemented this. I didn't see any performance benefit using the benchmark that Alexander Kuzmenkov outlined earlier in the thread. (But I also didn't see a great benefit for the originally proposed patch, so maybe I'm not doing this right or this particular hardware doesn't benefit from it as much.) I'm attaching the patches and scripts here if someone else wants to do more testing. The first patch adds a zoo of 128-bit atomics support. It's consistent with (a.k.a. copy-and-pasted from) the existing 32- and 64-bit set, but it's not the complete set, only as much as was necessary for this exercise. The second patch then makes use of that in the WAL code under discussion. pgbench invocations were: pgbench -i -I t bench pgbench -n -c $N -j $N -M prepared -f pgbench-wal-cas.sql -T 60 bench for N from 1 to 32. Note: With gcc (at least versions 7 and 8) you need to use some non-default -march setting to get 128-bit atomics to work. (Otherwise the configure test fails and the fallback implementation is used.) I have found the minimum to be -march=nocona. But different -march settings actually affect the benchmark performance, so be sure to test the baseline with the same -march setting. Recommended configure invocation: ./configure ... CC='gcc -march=whatever' clang appears to work out of the box. Also, curiously my gcc installations provide 128-bit __sync_val_compare_and_swap() but not 128-bit __atomic_compare_exchange_n() in spite of what the documentation indicates. So independent of whether this approach actually provides any benefit, the 128-bit atomics support seems a bit wobbly. (I'm also wondering why we are using __sync_val_compare_and_swap() rather than __sync_bool_compare_and_swap(), since all we're doing with the return value is emulate the bool version anyway.) -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 3359d5d5dd828bc0317ca28474ebe5dd931d44e8 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Wed, 4 Jul 2018 22:13:58 +0200 Subject: [PATCH 1/2] Add some int128 atomics support --- config/c-compiler.m4 | 31 + configure | 65 ++ configure.in | 2 + src/backend/port/atomics.c | 51 +++ src/include/pg_config.h.in | 8 +++ src/include/port/atomics.h | 49 ++ src/include/port/atomics/fallback.h| 30 + src/include/port/atomics/generic-gcc.h | 46 + src/include/port/atomics/generic.h | 91 ++ 9 files changed, 373 insertions(+) diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 9731a517de..a4aecd3e61 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -606,6 +606,21 @@ if test x"$pgac_cv_gcc_sync_int64_cas" = x"yes"; then AC_DEFINE(HAVE_GCC__SYNC_INT64_CAS, 1, [Define to 1 if you have __sync_val_compare_and_swap(int64 *, int64, int64).]) fi])# PGAC_HAVE_GCC__SYNC_INT64_CAS +# PGAC_HAVE_GCC__SYNC_INT128_CAS +# - +# Check if the C compiler understands __sync_compare_and_swap() for 128bit +# types, and define HAVE_GCC__SYNC_INT128_CAS if so. 
+AC_DEFUN([PGAC_HAVE_GCC__SYNC_INT128_CAS], +[AC_CACHE_CHECK(for builtin __sync int128 atomic operations, pgac_cv_gcc_sync_int128_cas, +[AC_LINK_IFELSE([AC_LANG_PROGRAM([], + [PG_INT128_TYPE lock = 0; + __sync_val_compare_and_swap(&lock, 0, (PG_INT128_TYPE) 37);])], + [pgac_cv_gcc_sync_int128_cas="yes"], + [pgac_cv_gcc_sync_int128_cas="no"])]) +if test x"$pgac_cv_gcc_sync_int128_cas" = x"yes"; then + AC_DEFINE(HAVE_GCC__SYNC_INT128_CAS, 1, [Define to 1 if you have __sync_val_compare_and_swap(int128 *, int128, int128).]) +fi])# PGAC_HAVE_GCC__SYNC_INT128_CAS + # PGAC_HAVE_GCC__ATOMIC_INT32_CAS # - # Check if the C compiler understands __atomic_compare_exchange_n() for 32bit @@ -638,6 +653,22 @@ if test x"$pgac_cv_gcc_atomic_int64_cas" = x"yes"; then AC_DEFINE(HAVE_GCC__ATOMIC_INT64_CAS, 1, [Define to 1 if you have __atomic_compare_exchange_n(int64 *, int64 *, int64).]) fi])# PGAC_HAVE_GCC__ATOMIC_INT64_CAS +# PGAC_HAVE_GCC__ATOMIC_INT128_CAS +# - +# Check if the C compiler understands __atomic_compare_exchange_n() for 128bit +# types, and define HAVE_GCC__ATOMIC_INT128_CAS if so. +AC_DEFUN([PGAC_HAVE_GCC__ATOMIC_INT128_CAS], +[AC_CACHE_CHECK(for builtin __atomic int128 atomic operations, pgac_cv_gcc_atomic_int128_cas, +[AC_LINK_IFELSE([AC_LANG_PROGRAM([], + [PG_INT128_TYPE val = 0; + PG_INT128_TYPE expect = 0; + __atomic_compare_exchange_n(&val, &expect, 37, 0, __ATOMIC_SEQ_CST, __ATOMIC_RELAXED);])], + [pgac_cv_gcc_atomic_int128_
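As an aside on the __sync_val_compare_and_swap() versus __sync_bool_compare_and_swap() remark above, a standalone sketch of the emulation in question (not taken from the patch; requires gcc/clang with 16-byte __sync builtins enabled, e.g. -mcx16 or -march=nocona as noted earlier) would be:

#include <stdbool.h>

/*
 * Sketch only: emulating the "bool" flavour of compare-and-swap on top of
 * the "val" flavour, which is all the return value is currently used for.
 */
static inline bool
cas128_bool(volatile __int128 *ptr, __int128 *expected, __int128 newval)
{
	__int128	old = __sync_val_compare_and_swap(ptr, *expected, newval);

	if (old == *expected)
		return true;			/* the swap happened */
	*expected = old;			/* report the value actually observed */
	return false;
}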
Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record
On 10.07.2018 06:45, Andres Freund wrote:
> Hi,
>
> On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote:
> > This functionality is needed in practice when we have to determine a
> > recovery time of specific backup.
>
> What do you mean by "recovery time of specific backup"?

Recovery time here is the point in time at which the backup of the PostgreSQL database instance was made. When performing database recovery, we want to know what point in time the restored database will correspond to. This functionality is about improving the usability of the pg_basebackup and pg_probackup utilities.

> > This code developed in compatibility with WAL segments, which do not have
> > a timestamp in a XLOG_BACKUP_END record.
>
> I don't understand what "compatibility with WAL segments" could mean? And
> how are WAL segments related to "XLOG_BACKUP_END record", except as to how
> every WAL record is related? Are you thinking about the switch records?

In this case 'compatibility' means that the patched postgres code (pg_basebackup, pg_probackup, pg_waldump etc.) will still correctly read WAL segments whose XLOG_BACKUP_END records do not contain a timestamp field.

> Greetings,
> Andres Freund

--
Andrey Lepikhov
Postgres Professional: https://postgrespro.com
The Russian Postgres Company
Costing bug in hash join logic for semi joins
There is a costing bug in the hash join logic that seems to have been introduced by the patch related to the inner_unique enhancements (commit 9c7f5229ad68d7e0e4dd149e3f80257893e404d4). Specifically, "hashjointuples", which tracks the number of matches for the hash clauses, is computed incorrectly in the inner_unique scenario. This affects a lot of semi-joins that are turned into inner joins with a unique inner side.

The function final_cost_hashjoin() has special handling to cost semi/anti joins to account for stopping early after the first match. The commit above extended that handling to the inner_unique scenario as well. However, the "hashjointuples" computation was not adjusted to handle this new scenario, so it incorrectly steps into the anti-join logic and counts unmatched rows. The fix is to handle the inner_unique case when computing "hashjointuples". Here is an outline of the code that shows the bug:

void
final_cost_hashjoin(PlannerInfo *root, HashPath *path,
                    JoinCostWorkspace *workspace,
                    JoinPathExtraData *extra)
{
    ...
    /* CPU costs */
    if (path->jpath.jointype == JOIN_SEMI ||
        path->jpath.jointype == JOIN_ANTI ||
        extra->inner_unique)
    {
        ...
        /* Get # of tuples that will pass the basic join */
        if (path->jpath.jointype == JOIN_SEMI)
            hashjointuples = outer_matched_rows;
        else
            hashjointuples = outer_path_rows - outer_matched_rows;
    }
    else
    {
        ...
    }
}

Thanks,
RK (Salesforce)
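A sketch of the kind of fix described above (illustrative only, not necessarily the exact committed change) would make only the anti-join case count unmatched rows:

        /*
         * Sketch of the suggested fix: SEMI joins and inner_unique joins stop
         * after the first match, so they emit the matched outer rows; only
         * ANTI joins should count the unmatched outer rows.
         */
        if (path->jpath.jointype == JOIN_ANTI)
            hashjointuples = outer_path_rows - outer_matched_rows;
        else
            hashjointuples = outer_matched_rows;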
user-friendliness improvement of pageinspect
Hi,

Why does heap_page_item() of the pageinspect extension not provide better user-friendliness? My test table has the following data, and when I look at the attributes (t_attrs) I see bytea values instead of a more intuitive representation, ideally using the same types as the original table.

# select * from test1;
 a |  b
---+------
 1 | text
(1 row)

# SELECT lp as tuple, t_xmin, t_xmax, t_attrs[1] as a, t_attrs[2] as b
  FROM heap_page_item_attrs(get_raw_page('test1', 0), 'test1');
 tuple | t_xmin | t_xmax |   a    |      b
-------+--------+--------+--------+--------------
     1 |    587 |      0 | \x0100 | \x0b74657874
(1 row)

I would prefer something like this instead:

# SELECT lp as tuple, t_xmin, t_xmax, t_attrs[1] as a, t_attrs[2] as b
  FROM heap_page_item_attrs(get_raw_page('test1', 0), 'test1');
 tuple | t_xmin | t_xmax | a |  b
-------+--------+--------+---+------
     1 |    587 |      0 | 1 | text
(1 row)

Looking forward to a solution.
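One possible direction (just a sketch of the general approach, not an actual proposal or patch) would be for the function to run each decoded attribute through its type's output function instead of returning the raw bytes; "atttypid" and "attvalue" below are made-up placeholders for whatever the function has already decoded:

	/*
	 * Sketch only: render a decoded attribute with its type's output
	 * function instead of returning a raw bytea.  "attvalue" is assumed to
	 * be a Datum of type "atttypid".
	 */
	Oid			typoutput;
	bool		typisvarlena;
	char	   *textual;

	getTypeOutputInfo(atttypid, &typoutput, &typisvarlena);
	textual = OidOutputFunctionCall(typoutput, attvalue);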
Re: [HACKERS] PoC plpgsql - possibility to force custom or generic plan
On 23.01.18 17:08, Pavel Stehule wrote:
> attached updated patch

This appears to be the patch of record in this thread. I think there is general desire for adding a setting like this, and the implementation is simple enough.

One change perhaps: how about naming the default setting "auto" instead of "default"? That makes it clearer what it does.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Costing bug in hash join logic for semi joins
On 10 July 2018 at 11:44, RK wrote:
> There is a costing bug in hash join logic seems to have been introduced by
> the patch related to inner_unique enhancements(commit:
> 9c7f5229ad68d7e0e4dd149e3f80257893e404d4). Specifically, "hashjointuples"
> which tracks the number of matches for hash clauses is computed wrong for
> inner unique scenario. This leads to lot of semi-joins incorrectly turn to
> inner joins with unique on inner side. Function "final_cost_hashjoin" has
> special handling to cost semi/anti joins to account for early stop after the
> first match. This is enhanced by the above said commit to be used for
> inner_unique scenario also. However, "hashjointuples" computation is not
> fixed to handle this new scenario which is incorrectly stepping into the
> anti join logic and assuming unmatched rows. Fix is to handle inner_unique
> case when computing "hashjointuples". Here is the outline of the code that
> shows the bug.

Thanks for the analysis and the report. I agree the code is wrong. Looks simple enough just to handle the semi and unique joins in the else clause and make the if handle the ANTI join. I've done that in the attached.

Also on reading the comment above, it looks slightly incorrect. To me, it looks like it's applying a twentieth of the cost and not a tenth as the comment claims. I couldn't resist updating that too.

I didn't feel the need to add any regression tests around this. It seems like an unlikely bug to ever appear again.

--
David Rowley  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

fix_unique_hash_join_costing.patch
Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.
On Tue, Jul 10, 2018 at 9:08 AM, Etsuro Fujita wrote:
> (2018/07/09 20:43), Ashutosh Bapat wrote:
>>> I don't have any numbers right now, so that is nothing but a concern. But as
>>> I said in a previous email, in the approach I proposed, we don't need to
>>> spend extra cycles where partitioning is not involved. I think that is a
>>> good thing in itself. No?
>>
>> At the cost of having targetlist being type inconsistent. I don't have
>> any testcase either to show that that's a problem in practice. So,
>> it's a trade-off of a concern vs concern.
>
> As I said before, I don't see any issue in having such a targetlist, but I
> might be missing something, so I'd like to discuss a bit more about that.
> Could you tell me the logic/place in the PG code where you think the problem
> might occur. (IIRC, you mentioned something about that before (pathkeys? or
> index-only scans?), but sorry, I don't understand that.)

IIUC, an index-only scan will be used when all the required columns are covered by an index. If there is an index on a whole-row reference of the parent, it will be translated into a ConvertRowtypeExpr of the child when the corresponding index on the child is created. If the targetlist doesn't have the ConvertRowtypeExpr, as with your patch, the planner won't be able to use such an index on the child table. But I couldn't create an index with a whole-row reference in it, so I think this isn't possible right now; it could become a problem in the future if we allow creating an index on a whole-row reference.

A pathkey points to an equivalence class, which contains equivalence members. A parent equivalence class member containing a whole-row reference gets translated into a child equivalence member containing a ConvertRowtypeExpr. At places in the planner we match equivalence members to targetlist entries. This matching will fail unexpectedly when the ConvertRowtypeExpr is removed from a child's targetlist. But again, I couldn't reproduce a problem where such a mismatch arises.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
RE: Locking B-tree leafs immediately in exclusive mode
On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote:
> > Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) same
> > as you did.
>
> OK. But not that c5.18xlarge is 72-VCPU machine, which AFAIK is
> close to the performance of 36 physical cores.

Thanks for pointing that out. I referred to /proc/cpuinfo and understood there are 36 physical cores.

> In this case it also looks like we observed 1% regression. Despite 1%
> may seem to be very small, I think we should clarify whether it really
> exists. I have at least two hypothesis about this.
>
> 1) There is no real regression, observed difference of TPS is less
> than error of measurements. In order to check that we need to retry
> the experiment multiple times. Also, if you run benchmark on master
> before patched version (or vice versa) you should also try to swap the
> order to make sure there is no influence of the order of benchmarks.
> 2) If we consider relation between TPS and number of clients, TPS is
> typically growing with increasing number of clients until reach some
> saturation value. After the saturation value, there is some
> degradation of TPS. If patch makes some latency lower, that my cause
> saturation to happen earlier. In order to check that, we need run
> benchmarks with various number of clients and draw a graph: TPS
> depending on clients.
>
> So, may I ask you to make more experiments in order to clarify the
> observed regression?

I experimented with 2), changing the clients parameter across 18, 36, 54 and 72.

While doing the experiment, I realized that the results of pgbench with 36 clients improve after executing pgbench with 72 clients. I don't know why this occurs, but anyway, in this experiment, I executed pgbench with 72 clients before the other pgbench runs (e.g. -c 72, -c 18, -c 36, -c 54, -c 72).

I ran the experiments against master and patched in varying order (e.g. master, patched, patched, master, master, patched, patched, master).

# results of changing clients (18, 36, 54, 72 clients)
master,  -c 18 -j 18: Ave 400410 TPS (407615,393942,401845,398241)
master,  -c 36 -j 36: Ave 415616 TPS (411939,400742,424855,424926)
master,  -c 54 -j 54: Ave 378734 TPS (401646,354084,408044,351163)
master,  -c 72 -j 72: Ave 360864 TPS (367718,360029,366432,349277)
patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678)
patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840)
patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727)
patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553)

It seems the saturation point is between 18 and 36 clients, so I also experimented with 27 clients.

# results of changing clients (27 clients)
master,  -c 27 -j 27: Ave 416756 TPS (423512,424241,399241,420030)
patched, -c 27 -j 27: Ave 413568 TPS (410187,404291,420152,419640)

I created a graph and attached it to this mail ("detecting saturation.png"). Referring to the graph, the patched version's saturation happens earlier than master's, as you expected. But even the patched version's nearly saturated TPS value shows a small regression from master's, I think.

Are there other experiments to do about this?

Yoshikazu Imai
Re: Locking B-tree leafs immediately in exclusive mode
On Mon, Jul 9, 2018 at 8:18 PM Peter Geoghegan wrote:
> On Mon, Jul 9, 2018 at 9:43 AM, Alexander Korotkov wrote:
> > In this case it also looks like we observed 1% regression. Despite 1%
> > may seem to be very small, I think we should clarify whether it really
> > exists. I have at least two hypothesis about this.
> >
> > 1) There is no real regression, observed difference of TPS is less
> > than error of measurements. In order to check that we need to retry
> > the experiment multiple times. Also, if you run benchmark on master
> > before patched version (or vice versa) you should also try to swap the
> > order to make sure there is no influence of the order of benchmarks.
> > 2) If we consider relation between TPS and number of clients, TPS is
> > typically growing with increasing number of clients until reach some
> > saturation value. After the saturation value, there is some
> > degradation of TPS. If patch makes some latency lower, that my cause
> > saturation to happen earlier. In order to check that, we need run
> > benchmarks with various number of clients and draw a graph: TPS
> > depending on clients.
> >
> > So, may I ask you to make more experiments in order to clarify the
> > observed regression?
>
> It would be nice to actually see script_duplicated.sql. I don't know
> exactly what the test case was.

BTW, the contents of script_duplicated.sql were posted by Imai, Yoshikazu in the message body:

> # script_duplicated.sql
> INSERT INTO unordered VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: Locking B-tree leafs immediately in exclusive mode
On Tue, Jul 10, 2018 at 2:19 PM Imai, Yoshikazu wrote: > On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote: > > > Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) > > > same as you did. > > > > OK. But not that c5.18xlarge is 72-VCPU machine, which AFAIK is > > close to the performance of 36 physical cores. > > Thanks for pointing that. I referred to /proc/cpuinfo and understood there > are 36 physical cores. > > > In this case it also looks like we observed 1% regression. Despite 1% > > may seem to be very small, I think we should clarify whether it really > > exists. I have at least two hypothesis about this. > > > > 1) There is no real regression, observed difference of TPS is less > > than error of measurements. In order to check that we need to retry > > the experiment multiple times. Also, if you run benchmark on master > > before patched version (or vice versa) you should also try to swap the > > order to make sure there is no influence of the order of benchmarks. > > 2) If we consider relation between TPS and number of clients, TPS is > > typically growing with increasing number of clients until reach some > > saturation value. After the saturation value, there is some > > degradation of TPS. If patch makes some latency lower, that my cause > > saturation to happen earlier. In order to check that, we need run > > benchmarks with various number of clients and draw a graph: TPS > > depending on clients. > > > > So, may I ask you to make more experiments in order to clarify the > > observed regression? > > I experimented 2) with changing clients parameter with 18, 36, 54, 72. > While doing experiment, I realized that results of pgbench with 36 clients > improve after executing pgbench with 72 clients. > I don't know why this occurs, but anyway, in this experiment, I executed > pgbench with 72 clients before executing other pgbenchs. (e.g. -c 72, -c 18, > -c 36, -c 54, -c 72) > I tested experiments to master and patched unorderly(e.g. master, patched, > patched, master, master, patched, patched, master) > > # results of changing clients(18, 36, 54, 72 clients) > master, -c 18 -j 18: Ave 400410 TPS (407615,393942,401845,398241) > master, -c 36 -j 36: Ave 415616 TPS (411939,400742,424855,424926) > master, -c 54 -j 54: Ave 378734 TPS (401646,354084,408044,351163) > master, -c 72 -j 72: Ave 360864 TPS (367718,360029,366432,349277) > patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678) > patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840) > patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727) > patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553) > > It may seem saturation is between 18 and 36 clients, so I also experimented > with 27 clients. > > # results of changing clients(27 clients) > master, -c 27 -j 27: Ave 416756 (423512,424241,399241,420030) TPS > patched, -c 27 -j 27: Ave 413568 (410187,404291,420152,419640) TPS > > I created a graph and attached in this mail("detecting saturation.png"). > Referring to a graph, patched version's saturation happens earlier than > master's one as you expected. > But even the patched version's nearly saturated TPS value has small > regression from the master's one, I think. > > Is there another experiments to do about this? Thank you for the experiments! It seems that there is real regression here... BTW, which script were you using in this benchmark: script_unordered.sql or script_duplicated.sql? 
-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Excessive PostmasterIsAlive calls slow down WAL redo
On 27/06/18 08:26, Thomas Munro wrote: On Wed, Apr 25, 2018 at 6:23 PM, Thomas Munro wrote: On Tue, Apr 24, 2018 at 7:37 PM, Michael Paquier wrote: Thomas, trying to understand here... Why this place for the signal initialization? Wouldn't InitPostmasterChild() be a more logical place as we'd want to have this logic caught by all other processes? Yeah, you're right. Here's one like that. Amazingly, due to the way the project schedules fell and thanks to the help of a couple of very supportive FreeBSD committers and reviewers, the new PROC_PDEATHSIG_CTL kernel facility[1] landed in a production release today, beating the Commitfest by several days. My question is whether this patch set is enough, or people (Andres?) want to go further and actually kill the postmaster death pipe completely. My kqueue patch would almost completely kill it on BSDs as it happens, but for Linux there are a number of races and complications to plug to do that IIUC. I don't immediately see what there is to gain by doing that work (assuming we reuse WaitEventSet objects in enough places), and we'll always need to maintain the pipe code for other OSes anyway. So I'm -0.5 on doing that, even though the technical puzzle is appealing... [1] https://www.freebsd.org/cgi/man.cgi?query=procctl&apropos=0&sektion=2&manpath=FreeBSD+11.2-RELEASE&arch=default&format=html Yeah, I don't think we can kill the pipe completely. As long as we still need it for the other OSes, I don't see much point in eliminating it on Linux and BSDs either. I'd rather keep the code similar across platforms. Looking at the patch: The 'postmaster_possibly_dead' flag is not reset anywhere. So if a process receives a spurious death signal, even though postmaster is still alive, PostmasterIsAlive() will continue to use the slow path. postmaster_possibly_dead needs to be marked as 'volatile', no? The autoconf check for PR_SET_PDEATHSIG seems slightly misplaced. And I think we can simplify it with AC_CHECK_HEADER(). I'd also like to avoid adding code to c.h for this, that seems too global. After some kibitzing, I ended up with the attached. It fixes the postmaster_possible_dead issues mentioned above, and moves around the autoconf and #ifdef logic a bit to make it a bit nicer, at least in my opinion. I don't have a FreeBSD machine at hand, so I didn't try fixing that patch. - Heikki >From 9418ed472d113a80bf0fbb209efb0835767a5c50 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Tue, 10 Jul 2018 14:31:48 +0300 Subject: [PATCH 1/1] Use signals for postmaster death on Linux. Linux provides a way to ask for a signal when your parent process dies. Use that to make PostmasterIsAlive() very cheap. 
Author: Thomas Munro, based on a suggestion from Andres Freund Reviewed-By: Michael Paquier Discussion: https://postgr.es/m/7261eb39-0369-f2f4-1bb5-62f3b6083b5e%40iki.fi Discussion: https://postgr.es/m/20180411002643.6buofht4ranhei7k%40alap3.anarazel.de --- configure | 2 +- configure.in | 2 +- src/backend/storage/ipc/latch.c| 6 +- src/backend/storage/ipc/pmsignal.c | 123 + src/backend/utils/init/miscinit.c | 4 ++ src/include/pg_config.h.in | 3 + src/include/pg_config.h.win32 | 3 + src/include/storage/pmsignal.h | 33 +- 8 files changed, 157 insertions(+), 19 deletions(-) diff --git a/configure b/configure index 1bc465b942..41e0e1cf34 100755 --- a/configure +++ b/configure @@ -12494,7 +12494,7 @@ $as_echo "#define HAVE_STDBOOL_H 1" >>confdefs.h fi -for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h +for ac_header in atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/prctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h do : as_ac_Header=`$as_echo "ac_cv_header_$ac_header" | $as_tr_sh` ac_fn_c_check_header_mongrel "$LINENO" "$ac_header" "$as_ac_Header" "$ac_includes_default" diff --git a/configure.in b/configure.in index a6b3b88cfa..1e76c9ee46 100644 --- a/configure.in +++ b/configure.in @@ -1260,7 +1260,7 @@ AC_SUBST(UUID_LIBS) AC_HEADER_STDBOOL -AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h sys/un.h termios.h ucred.h utime.h wchar.h wctype.h]) +AC_CHECK_HEADERS([atomic.h crypt.h dld.h fp_class.h getopt.h ieeefp.h ifaddrs.h langinfo.h mbarrier.h poll.h sys/epoll.h sys/ipc.h sys/prctl.h sys/pstat.h sys/resource.h sys/select.h sys/sem.h sys/shm.h sys/sockio.h sys/tas.h
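On the two review points above (postmaster_possibly_dead never being reset, and the flag needing to be volatile), a rough sketch of the intended fast path might look like this; PostmasterIsAliveInternal() is a made-up name for the existing slow pipe check, not something taken from the patch:

/*
 * Rough sketch of the fast path under discussion, not the actual patch.
 * The death signal handler only sets the flag; the expensive check runs
 * only when the flag is set, and a spurious signal resets it.
 */
static volatile sig_atomic_t postmaster_possibly_dead = false;

bool
PostmasterIsAlive(void)
{
	if (!postmaster_possibly_dead)
		return true;					/* cheap common case */

	if (PostmasterIsAliveInternal())	/* hypothetical: the old pipe check */
	{
		postmaster_possibly_dead = false;	/* false alarm, reset the flag */
		return true;
	}
	return false;
}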
Re: shared-memory based stats collector
Hello. Thanks for the opinions. At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund wrote in <20180706201036.awheoi6tk556x...@alap3.anarazel.de> > Hi, > > On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote: > > *If* we can provide the snapshots view of them without too much overhead I > > think it's worth looking into that while *also* proviiding a lower overhead > > interface for those that don't care about it. > > I don't see how that's possible without adding significant amounts of > complexity and probably memory / cpu overhead. The current stats already > are quite inconsistent (often outdated, partially updated, messages > dropped when busy) - I don't see what we really gain by building > something MVCC like in the "new" stats subsystem. > > > > If it ends up that keeping the snapshots become too much overhead in either > > in performance or code-maintenance, then I agree can probably drop that. > > But we should at least properly investigate the cost. > > I don't think it's worthwhile to more than think a bit about it. There's > fairly obvious tradeoffs in complexity here. Trying to get there seems > like a good way to make the feature too big. Agreed. Well, if we allow to lose consistency in some extent for improved performance and smaller footprint, relaxing the consistency of database stats can reduce footprint further especially on a cluster with so many databases. Backends are interested only in the residing database and vacuum doesn't cache stats at all. A possible problem is vacuum and stats collector can go into a race condition. I'm not sure but I suppose it is not worse than being involved in an IO congestion. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From 889237d7381d334df3a35e1fa5350298352fb9fe Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Fri, 29 Jun 2018 16:41:04 +0900 Subject: [PATCH 1/6] sequential scan for dshash Add sequential scan feature to dshash. --- src/backend/lib/dshash.c | 138 +++ src/include/lib/dshash.h | 23 +++- 2 files changed, 160 insertions(+), 1 deletion(-) diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c index b46f7c4cfd..5b133226ac 100644 --- a/src/backend/lib/dshash.c +++ b/src/backend/lib/dshash.c @@ -592,6 +592,144 @@ dshash_memhash(const void *v, size_t size, void *arg) return tag_hash(v, size); } +/* + * dshash_seq_init/_next/_term + * Sequentially scan trhough dshash table and return all the + * elements one by one, return NULL when no more. + * + * dshash_seq_term should be called if and only if the scan is abandoned + * before completion; if dshash_seq_next returns NULL then it has already done + * the end-of-scan cleanup. + * + * On returning element, it is locked as is the case with dshash_find. + * However, the caller must not release the lock. The lock is released as + * necessary in continued scan. + * + * As opposed to the equivalent for dynanash, the caller is not supposed to + * delete the returned element before continuing the scan. + * + * If consistent is set for dshash_seq_init, the whole hash table is + * non-exclusively locked. Otherwise a part of the hash table is locked in the + * same mode (partition lock). 
+ */ +void +dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table, +bool consistent) +{ + status->hash_table = hash_table; + status->curbucket = 0; + status->nbuckets = ((size_t) 1) << hash_table->control->size_log2; + status->curitem = NULL; + status->curpartition = -1; + status->consistent = consistent; + + /* + * Protect all partitions from modification if the caller wants a + * consistent result. + */ + if (consistent) + { + int i; + + for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i) + { + Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i))); + + LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED); + } + } + ensure_valid_bucket_pointers(hash_table); +} + +void * +dshash_seq_next(dshash_seq_status *status) +{ + dsa_pointer next_item_pointer; + + if (status->curitem == NULL) + { + Assert (status->curbucket == 0); + Assert(!status->hash_table->find_locked); + + /* first shot. grab the first item. */ + next_item_pointer = status->hash_table->buckets[status->curbucket]; + status->hash_table->find_locked = true; + } + else + next_item_pointer = status->curitem->next; + + /* Move to the next bucket if we finished the current bucket */ + while (!DsaPointerIsValid(next_item_pointer)) + { + if (++status->curbucket >= status->nbuckets) + { + /* all buckets have been scanned. finsih. */ + dshash_seq_term(status); + return NULL; + } + Assert(status->hash_table->find_locked); + + next_item_pointer = status->hash_table->buckets[status->curbucket]; + + /* + * we need a lock on the scanning partition even if the caller don't + * requested a consistent snapshot. + */ + if (!status->consistent && DsaPointerIsValid(next_item_pointer)) + { + dshash_table_item *i
Re: EXPLAIN of Parallel Append
On Mon, Jul 9, 2018 at 7:00 PM, Jesper Pedersen wrote:
> On 07/07/2018 01:08 AM, Amit Kapila wrote:
>> On Wed, Mar 14, 2018 at 8:58 PM, Jesper Pedersen wrote:
>>>
>>> Parallel Append's ntuples is 1, but given nloops is 3 you end up with the
>>> slightly confusing "(actual ... rows=0 loops=3)".
>>>
>> The number of rows displayed is total_rows / loops due to which you
>> are seeing these numbers. This behavior is the same for all parallel
>> nodes, nothing specific to Parallel Append.
>
> Thanks !
>
> Maybe something like the attached patch for the documentation is needed.

> -performance characteristics of the plan.
> +performance characteristics of the plan. Note, that the parallel nodes
> +may report zero rows returned due internal calculations when one or more
> +rows are actually being returned.

typo. /due/due to

I think it is quite unclear what you mean by internal calculations. It would be great if you can come up with something similar to how we have already explained this for nested loop joins [1]; you can add what you have written at the end of that paragraph, after explaining the actual calculation (in your example, ntuples = 1 averaged over loops = 3 is displayed as rows=0). This has been a common source of confusion ever since parallel query was developed, so if you can write some nice example and text, it would be really helpful.

[1] - https://www.postgresql.org/docs/devel/static/using-explain.html#USING-EXPLAIN-ANALYZE

Refer to this text on that link:
"In some query plans, it is possible for a subplan node to be executed more than once. For example, the inner index scan will be executed once per outer row in the above nested-loop plan. In such cases, the loops value reports the total number of executions of the node, and the actual time and rows values shown are averages per-execution. This is done to make the numbers comparable with the way that the cost estimates are shown. Multiply by the loops value to get the total time actually spent in the node."

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: shared-memory based stats collector
On 07/10/2018 02:07 PM, Kyotaro HORIGUCHI wrote:
> Hello. Thanks for the opinions.
>
> At Fri, 6 Jul 2018 13:10:36 -0700, Andres Freund wrote in <20180706201036.awheoi6tk556x...@alap3.anarazel.de>
>> Hi,
>>
>> On 2018-07-06 22:03:12 +0200, Magnus Hagander wrote:
>>> *If* we can provide the snapshots view of them without too much overhead I
>>> think it's worth looking into that while *also* proviiding a lower overhead
>>> interface for those that don't care about it.
>>
>> I don't see how that's possible without adding significant amounts of
>> complexity and probably memory / cpu overhead. The current stats already
>> are quite inconsistent (often outdated, partially updated, messages
>> dropped when busy) - I don't see what we really gain by building
>> something MVCC like in the "new" stats subsystem.
>>
>>> If it ends up that keeping the snapshots become too much overhead in either
>>> in performance or code-maintenance, then I agree can probably drop that.
>>> But we should at least properly investigate the cost.
>>
>> I don't think it's worthwhile to more than think a bit about it. There's
>> fairly obvious tradeoffs in complexity here. Trying to get there seems
>> like a good way to make the feature too big.
>
> Agreed.
>
> Well, if we allow to lose consistency in some extent for improved
> performance and smaller footprint, relaxing the consistency of database
> stats can reduce footprint further especially on a cluster with so many
> databases. Backends are interested only in the residing database and
> vacuum doesn't cache stats at all. A possible problem is vacuum and stats
> collector can go into a race condition. I'm not sure but I suppose it is
> not worse than being involved in an IO congestion.

As someone who regularly analyzes stats collected from user systems, I think there's certainly some value with keeping the snapshots reasonably consistent. But I agree it doesn't need to be perfect, and some level of inconsistency is acceptable (and the amount of complexity/overhead needed to maintain perfect consistency seems rather excessive here).

There's one more reason why attempts to keep stats snapshots "perfectly" consistent are likely doomed to fail - the messages are sent over UDP, which does not guarantee delivery etc. So there's always some level of possible inconsistency even with "perfectly consistent" snapshots.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Hi Dave, I am working on a feature to support the INCLUDE clause of indexes in PG-11. As per the documentation https://www.postgresql.org/docs/11/static/sql-createindex.html, columns listed in the INCLUDE clause cannot also be present as index key columns. But I find different behaviour for the below queries, which are logically identical. CREATE TABLE some_table ( id serial primary key, first_name character varying(45), last_name character varying ) CREATE INDEX ind1 ON public.some_table USING btree (id) INCLUDE(id) TABLESPACE pg_default; This query fails with the error ERROR: included columns must not intersect with key columns CREATE INDEX ind1 ON public.some_table USING btree (id asc nulls last) INCLUDE(id) TABLESPACE pg_default; This query passes and the index is created. Kindly let me know if I am missing anything. -- Thanks and Regards, Aditya Toshniwal Software Engineer | EnterpriseDB Software Solutions | Pune "Don't Complain about Heat, Plant a tree"
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Hi Team, Please ignore the name after "Hi" in the previous mail. :/ The potential bug is as mentioned in the mail below. On Tue, Jul 10, 2018 at 6:37 PM, Aditya Toshniwal < aditya.toshni...@enterprisedb.com> wrote: > Hi Dave, > > I am working on a feature to support INCLUDE clause of index in PG-11. As > per the documentation https://www.postgresql.org/docs/11/static/sql- > createindex.html, columns listed in INCLUDE clause cannot also be present > as index key columns. But I find different behaviour for below queries > which are logically identical. > > CREATE TABLE some_table > ( > id serial primary key, > first_name character varying(45), > last_name character varying > ) > > CREATE INDEX ind1 > ON public.some_table USING btree > (id) > INCLUDE(id) > TABLESPACE pg_default; > > This query fails with error > ERROR: included columns must not intersect with key columns > > CREATE INDEX ind1 > ON public.some_table USING btree > (id asc nulls last) > INCLUDE(id) > TABLESPACE pg_default; > > This query passes and index is created. > > Kindly let me know if I am missing anything. > > -- > Thanks and Regards, > Aditya Toshniwal > Software Engineer | EnterpriseDB Software Solutions | Pune > "Don't Complain about Heat, Plant a tree" > -- Thanks and Regards, Aditya Toshniwal Software Engineer | EnterpriseDB Software Solutions | Pune "Don't Complain about Heat, Plant a tree"
Re: Test for trigger condition accessing system attributes
On 10/05/18 11:58, Ashutosh Bapat wrote: Hi, I was investigating the cases when the system attributes are accessed beyond the scans. After investigating set_plan_references(), I thought that we never access system attributes beyond scans. This lead me to assume that EEOP_INNER/OUTER_SYSVAR are not needed since we do not access system attributes from an inner or outer slot. I removed the defintions and code using those and ran regression. All the tests passed. So, I was about to conclude that my assumption is correct. But then looking at TriggerEnabled() I realised that we also (ab?)use INNER/OUTER Vars for OLD/NEW tuples for trigger condition. If the WHEN condition in CREATE TRIGGER command refers to a system attribute, we will end up having INNER/OUTER var refering a system attribute, thus exercising code for EEOP_INNER/OUTER_SYSVAR. Here's patch containing a testcase exercizing that code using EEOP_INNER/OUTER_SYSVAR. Pushed with some small changes. Thanks! - Heikki
Re: [PATCH] Add support for ON UPDATE/DELETE actions on ALTER CONSTRAINT
On Wed, Mar 7, 2018 at 11:49 PM, Matheus de Oliveira wrote: > > > On 3 Mar 2018 19:32, "Peter Eisentraut" > wrote: > > On 2/20/18 10:10, Matheus de Oliveira wrote: >> Besides that, there is a another change in this patch on current ALTER >> CONSTRAINT about deferrability options. Previously, if the user did >> ALTER CONSTRAINT without specifying an option on deferrable or >> initdeferred, it was implied the default options, so this: >> >> ALTER TABLE tbl >> ALTER CONSTRAINT con_name; >> >> Was equivalent to: >> >> ALTER TABLE tbl >> ALTER CONSTRAINT con_name NOT DEFERRABLE INITIALLY IMMEDIATE; > > Oh, that seems wrong. Probably, it shouldn't even accept that syntax > with an empty options list, let alone reset options that are not > mentioned. > > > Yeah, it felt really weird when I noticed it. And I just noticed while > reading the source. > > Can > > you prepare a separate patch for this issue? > > > I can do that, no problem. It'll take awhile though, I'm on a trip and will > be home around March 20th. Matheus, When do you think you can provide the patch for the bug fix? Also, the patch you originally posted doesn't apply cleanly. Can you please post a rebased version? The patch contains 70 odd lines of test SQL and 3600 odd lines of output. The total patch is 4200 odd lines. I don't think it will be acceptable to add such a huge output to the regression test. You will need to provide a patch with a much smaller output addition, and maybe a smaller test as well. > > You think this should be applied to all versions that support ALTER > CONSTRAINT, right? I think so. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Test for trigger condition accessing system attributes
Thanks Heikki. On Tue, Jul 10, 2018 at 6:47 PM, Heikki Linnakangas wrote: > On 10/05/18 11:58, Ashutosh Bapat wrote: >> >> Hi, >> I was investigating the cases when the system attributes are accessed >> beyond the scans. After investigating set_plan_references(), I thought >> that we never access system attributes beyond scans. This lead me to >> assume that EEOP_INNER/OUTER_SYSVAR are not needed since we do not >> access system attributes from an inner or outer slot. I removed the >> defintions and code using those and ran regression. All the tests >> passed. So, I was about to conclude that my assumption is correct. But >> then looking at TriggerEnabled() I realised that we also (ab?)use >> INNER/OUTER Vars for OLD/NEW tuples for trigger condition. If the WHEN >> condition in CREATE TRIGGER command refers to a system attribute, we >> will end up having INNER/OUTER var refering a system attribute, thus >> exercising code for EEOP_INNER/OUTER_SYSVAR. >> >> Here's patch containing a testcase exercizing that code using >> EEOP_INNER/OUTER_SYSVAR. > > > Pushed with some small changes. Thanks! > > - Heikki -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: Retrieve memory size allocated by libpq
Attached is an updated patch of PQresultMemsize(). The implementation is unchanged, but I added SGML documentation about this new function. I'd be pleased about comments to adding this to libpq. -- Kind Regards, Lars From 766a7f2381ae6c9d442a7359cabc58515186f8c4 Mon Sep 17 00:00:00 2001 From: Lars Kanis Date: Sat, 23 Jun 2018 19:34:11 +0200 Subject: [PATCH] libpq: Add function PQresultMemsize() This function retrieves the number of bytes allocated for a given result. That can be used to instruct the GC about the memory consumed behind a wrapping object and for diagnosing memory consumption. This is an alternative approach to customizable malloc/realloc/free functions as discussed here: https://www.postgresql.org/message-id/flat/20170828172834.GA71455%40TC.local#20170828172834.GA71455@TC.local --- doc/src/sgml/libpq.sgml | 28 src/interfaces/libpq/exports.txt | 1 + src/interfaces/libpq/fe-exec.c | 14 +- src/interfaces/libpq/libpq-fe.h | 1 + src/interfaces/libpq/libpq-int.h | 2 ++ 5 files changed, 45 insertions(+), 1 deletion(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index d67212b831..c573c79ae3 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -6384,6 +6384,34 @@ void *PQresultAlloc(PGresult *res, size_t nBytes); + + + PQresultMemsize + + PQresultMemsize + + + + + + Retrieves the number of bytes allocated for a PGresult object. + +size_t PQresultMemsize(const PGresult *res); + + + + + The number of bytes includes the memory allocated for the PGresult itself, + memory to store data from the server, required internal metadata of a + PGresult object and data allocated by PQresultAlloc. + That is to say all memory which gets freed by PQclear. + + This information can be used for diagnosing memory consumption and to instruct + a garbage collector about the memory consumed behind a wrapping object. 
+ + + + PQlibVersion diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt index d6a38d0df8..0b50dddbb7 100644 --- a/src/interfaces/libpq/exports.txt +++ b/src/interfaces/libpq/exports.txt @@ -172,3 +172,4 @@ PQsslAttribute169 PQsetErrorContextVisibility 170 PQresultVerboseErrorMessage 171 PQencryptPasswordConn 172 +PQresultMemsize 173 diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c index 4c0114c514..064c7a693c 100644 --- a/src/interfaces/libpq/fe-exec.c +++ b/src/interfaces/libpq/fe-exec.c @@ -166,6 +166,7 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status) result->curBlock = NULL; result->curOffset = 0; result->spaceLeft = 0; + result->memsize = sizeof(PGresult); if (conn) { @@ -215,6 +216,12 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status) return result; } +size_t +PQresultMemsize(const PGresult *res) +{ + return res->memsize; +} + /* * PQsetResultAttrs * @@ -567,9 +574,11 @@ pqResultAlloc(PGresult *res, size_t nBytes, bool isBinary) */ if (nBytes >= PGRESULT_SEP_ALLOC_THRESHOLD) { - block = (PGresult_data *) malloc(nBytes + PGRESULT_BLOCK_OVERHEAD); + size_t alloc_size = nBytes + PGRESULT_BLOCK_OVERHEAD; + block = (PGresult_data *) malloc(alloc_size); if (!block) return NULL; + res->memsize += alloc_size; space = block->space + PGRESULT_BLOCK_OVERHEAD; if (res->curBlock) { @@ -594,6 +603,7 @@ pqResultAlloc(PGresult *res, size_t nBytes, bool isBinary) block = (PGresult_data *) malloc(PGRESULT_DATA_BLOCKSIZE); if (!block) return NULL; + res->memsize += PGRESULT_DATA_BLOCKSIZE; block->next = res->curBlock; res->curBlock = block; if (isBinary) @@ -711,6 +721,7 @@ PQclear(PGresult *res) res->errFields = NULL; res->events = NULL; res->nEvents = 0; + res->memsize = 0; /* res->curBlock was zeroed out earlier */ /* Free the PGresult structure itself */ @@ -927,6 +938,7 @@ pqAddTuple(PGresult *res, PGresAttValue *tup, const char **errmsgp) realloc(res->tuples, newSize * sizeof(PGresAttValue *)); if (!newTuples) return false; /* malloc or realloc failed */ + res->memsize += (newSize - res->tupArrSize) * sizeof(PGresAttValue *); res->tupArrSize = newSize; res->tuples = newTuples; } diff --git a/src/interfaces/libpq/libpq-fe.h b/src/interfaces/libpq/libpq-fe.h index ed9c806861..4fd9a4fcda 100644 --- a/src/interfaces/libpq/libpq-fe.h +++ b/src/interfaces/libpq/libpq-fe.h @@ -491,6 +491,7 @@ extern int PQgetlength(const PGresult *res, int tup_num, int field_num); extern int PQgetisnull(const PGresult *res, int tup_num, int field_num); extern int PQnparams(const PGresult *res); extern Oid PQparamtype(const PGresult *res, int param_num); +extern size_t PQresultMemsize(const PGresult *res); /* Describe prepared statements and portals */ extern PGresult *PQdescribePrepared(PGconn *conn, const cha
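To illustrate the intended use, here is a minimal hypothetical caller (not part of the patch; the helper name and usage are made up) showing how an application or driver wrapper could read the figure before reporting it to a garbage collector or a monitoring hook:

#include <stdio.h>
#include <libpq-fe.h>

/* Hypothetical example: report the memory held by a result set. */
static void
report_result_size(PGconn *conn, const char *query)
{
	PGresult   *res = PQexec(conn, query);

	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("result holds %zu bytes for %d rows\n",
			   PQresultMemsize(res), PQntuples(res));

	/* PQclear() releases everything counted by PQresultMemsize(). */
	PQclear(res);
}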
Re: Locking B-tree leafs immediately in exclusive mode
On 2018/07/10 20:36, Alexander Korotkov wrote: > On Tue, Jul 10, 2018 at 2:19 PM Imai, Yoshikazu > wrote: >> On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote: Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) same as you did. >>> >>> OK. But not that c5.18xlarge is 72-VCPU machine, which AFAIK is >>> close to the performance of 36 physical cores. >> >> Thanks for pointing that. I referred to /proc/cpuinfo and understood there >> are 36 physical cores. >> >>> In this case it also looks like we observed 1% regression. Despite 1% >>> may seem to be very small, I think we should clarify whether it really >>> exists. I have at least two hypothesis about this. >>> >>> 1) There is no real regression, observed difference of TPS is less >>> than error of measurements. In order to check that we need to retry >>> the experiment multiple times. Also, if you run benchmark on master >>> before patched version (or vice versa) you should also try to swap the >>> order to make sure there is no influence of the order of benchmarks. >>> 2) If we consider relation between TPS and number of clients, TPS is >>> typically growing with increasing number of clients until reach some >>> saturation value. After the saturation value, there is some >>> degradation of TPS. If patch makes some latency lower, that my cause >>> saturation to happen earlier. In order to check that, we need run >>> benchmarks with various number of clients and draw a graph: TPS >>> depending on clients. >>> >>> So, may I ask you to make more experiments in order to clarify the >>> observed regression? >> >> I experimented 2) with changing clients parameter with 18, 36, 54, 72. >> While doing experiment, I realized that results of pgbench with 36 clients >> improve after executing pgbench with 72 clients. >> I don't know why this occurs, but anyway, in this experiment, I executed >> pgbench with 72 clients before executing other pgbenchs. (e.g. -c 72, -c 18, >> -c 36, -c 54, -c 72) >> I tested experiments to master and patched unorderly(e.g. master, patched, >> patched, master, master, patched, patched, master) >> >> # results of changing clients(18, 36, 54, 72 clients) >> master, -c 18 -j 18: Ave 400410 TPS (407615,393942,401845,398241) >> master, -c 36 -j 36: Ave 415616 TPS (411939,400742,424855,424926) >> master, -c 54 -j 54: Ave 378734 TPS (401646,354084,408044,351163) >> master, -c 72 -j 72: Ave 360864 TPS (367718,360029,366432,349277) >> patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678) >> patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840) >> patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727) >> patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553) >> >> It may seem saturation is between 18 and 36 clients, so I also experimented >> with 27 clients. >> >> # results of changing clients(27 clients) >> master, -c 27 -j 27: Ave 416756 (423512,424241,399241,420030) TPS >> patched, -c 27 -j 27: Ave 413568 (410187,404291,420152,419640) TPS >> >> I created a graph and attached in this mail("detecting saturation.png"). >> Referring to a graph, patched version's saturation happens earlier than >> master's one as you expected. >> But even the patched version's nearly saturated TPS value has small >> regression from the master's one, I think. >> >> Is there another experiments to do about this? > > Thank you for the experiments! It seems that there is real regression > here... 
BTW, which script were you using in this benchmark: > script_unordered.sql or script_duplicated.sql? Sorry, I forgot to write that. I used script_unordered.sql. Yoshikazu Imai
Re: How can we submit code patches that implement our (pending) patents?
"Tsunakawa, Takayuki" writes: > From: Nico Williams [mailto:n...@cryptonector.com] >> ... But that might reduce the >> size of the community, or lead to a fork. > Yes, that's one unfortunate future, which I don't want to happen of > course. I believe PostgreSQL should accept patent for further > evolution, because PostgreSQL is now a popular, influential software > that many organizations want to join. The core team has considered this matter, and has concluded that it's time to establish a firm project policy that we will not accept any code that is known to be patent-encumbered. The long-term legal risks and complications involved in doing that seem insurmountable, given the community's amorphous legal nature and the existing Postgres license wording (neither of which are open for negotiation here). Furthermore, Postgres has always been very friendly to creation of closed-source derivatives, but it's hard to see how inclusion of patented code would not cause serious problems for those. The potential benefits of accepting patented code just don't seem to justify trying to navigate these hazards. regards, tom lane
Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation
I didn't see any hackers thread linked to this CF entry. Hence I am sending this mail through the CF app. The patch looks good to me. It applies cleanly, compiles cleanly and make check passes. I think you could avoid the condition + /* Do not override parent's NOT NULL constraint. */ + if (restdef->is_not_null) + coldef->is_not_null = restdef->is_not_null; by rewriting this line as coldef->is_not_null = coldef->is_not_null || restdef->is_not_null; The comment looks a bit odd either way since we are changing coldef->is_not_null based on restdef->is_not_null. That's because of the nature of boolean variables. Maybe a bit of explanation is needed. On a side note, I see coldef->constraints = restdef->constraints; Shouldn't we merge the constraints instead of just overwriting those? -- Best Wishes, Ashutosh
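For what it's worth, a hypothetical variant of that hunk — assuming this is the MergeAttributes() code path — that merges rather than overwrites, combining both suggestions above:

/* Hypothetical rewrite, not the proposed patch: combine the parent's and
 * the partition's settings instead of conditionally overwriting them. */
coldef->is_not_null = coldef->is_not_null || restdef->is_not_null;
coldef->constraints = list_concat(coldef->constraints,
								  restdef->constraints);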
Re: Non-reserved replication slots and slot advancing
On 2018-Jul-10, Michael Paquier wrote: > Yep, let's change that as well. If you want to look at that stuff more > deeply, please feel free. Otherwise I could always push what I have > now. I say please push already. We can push more fixes later if they are needed. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PoC: full merge join on comparison clause
On Tue, Jul 10, 2018 at 12:05 AM, Alexander Kuzmenkov wrote: > On 07/09/2018 04:12 PM, Ashutosh Bapat wrote: >> >> On Fri, Jul 6, 2018 at 6:31 PM, Ashutosh Bapat >> wrote: >>> >>> I will continue reviewing the patches. >>> >> Here are some more review comments > > > > Ashutosh, > > Many thanks for the review, I'm glad that we are continuing with this patch. > I'm working on your comments now, will post the updated version this week. While updating the patches, please consider adding some comments as to why only a single inequality clause is supported. I didn't see comments in the patch explaining that. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Aditya Toshniwal writes: > I am working on a feature to support INCLUDE clause of index in PG-11. As > per the documentation https://www.postgresql.org/docs/11/static/ > sql-createindex.html, columns listed in INCLUDE clause cannot also be > present as index key columns. But I find different behaviour for below > queries which are logically identical. I wonder why there is any such restriction at all. We have never attempted to prevent the creation of "silly" indexes, eg regression=# create table some_table (id int); CREATE TABLE regression=# create index on some_table (id,id); CREATE INDEX regression=# \d+ some_table Table "public.some_table" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description +-+---+--+-+-+--+- id | integer | | | | plain | | Indexes: "some_table_id_id1_idx" btree (id, id) So my inclination is to rip out the "must not intersect" test altogether, not try to make it a bit smarter. regards, tom lane
Re: [HACKERS] WAL logging problem in 9.4.3?
Thanks for picking this up! (I hope this gets through the email filters this time, sending a shell script seems to be difficult. I also trimmed the CC list, if that helps.) On 04/07/18 07:59, Michael Paquier wrote: Hence I propose the patch attached which disables the TRUNCATE and COPY optimizations for two cases, which are the ones actually causing problems. One solution has been presented by Simon here for COPY, which is to disable the optimization when there are no blocks on a relation with wal_level = minimal: https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com For back-patching, I find that really appealing. This fails in the case that there are any WAL-logged changes to the table while the COPY is running. That can happen at least if the table has an INSERT trigger, that performs operations on the same table, and the COPY fires the trigger. That scenario is covered by the little bash script I posted earlier in this thread (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached is a new version of that script, updated to make it work with v11. The second thing that the patch attached does is to tweak ExecuteTruncateGuts so as the TRUNCATE optimization never runs for wal_level = minimal. If we go down that route, let's at least keep the TRUNCATE optimization for temporary and unlogged tables. - Heikki #!/bin/bash set -e export PGDATABASE=postgres PATH=./bin:$PATH cat > /tmp/test-copydata <> data-minimal/postgresql.conf echo "wal_level=minimal" >> data-minimal/postgresql.conf # CREATE, INSERT, COPY, crash. # # If COPY inserts to the existing block, and is not WAL-logged, replaying # the implicit FPW of the INSERT record will destroy the COPY data. pg_ctl -D data-minimal -w start psql <= 16384; INSERT INTO test1 VALUES ('inserted row'); \copy test1 FROM '/tmp/test-copydata' COMMIT; EOF pg_ctl -D data-minimal stop -m immediate sleep 1 pg_ctl -D data-minimal -w start echo "Should have 4 rows:" psql -c "SELECT * FROM test1" psql -c "DROP TABLE test1" > /dev/null # cleanup # CREATE, COPY, crash. Trigger in COPY that inserts more to same table. # # If the INSERTS from the trigger go to the same block we're copying to, # and the INSERTs are WAL-logged, WAL replay will fail when it tries to # replay the WAL record but the "before" image doesn't match, because not # all changes were WAL-logged. #pg_ctl -D data-minimal -w start psql <= 16384; \copy test1 FROM '/tmp/test-copydata' COMMIT; EOF pg_ctl -D data-minimal stop -m immediate sleep 1 pg_ctl -D data-minimal -w start echo "Should have 6 rows (3 original and 3 inserted by trigger):" psql -c "SELECT * FROM test1" psql -c "DROP TABLE test1" > /dev/null # cleanup psql -c "DROP FUNCTION test1_beforetrig()" > /dev/null # cleanup # CREATE, TRUNCATE, COPY, crash. # # If we skip WAL-logging of the COPY, replaying the TRUNCATE record destroy # the newly inserted data. #pg_ctl -D data-minimal -w start psql <= 16384; TRUNCATE test1; SELECT relname, relfilenode from pg_class where relfilenode >= 16384; \copy test1 FROM '/tmp/test-copydata' COMMIT; EOF pg_ctl -D data-minimal stop -m immediate sleep 1 pg_ctl -D data-minimal -w start echo "Should have 3 rows:" psql -c "SELECT * FROM test1"
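A rough sketch of the kind of test being suggested for ExecuteTruncateGuts() — illustrative only, not the actual patch, and the helper name is invented:

#include "postgres.h"
#include "utils/rel.h"

/*
 * Illustrative sketch only: may the "assign a new relfilenode and skip
 * WAL" TRUNCATE optimization be used for this relation?
 */
static bool
truncate_can_skip_wal(Relation rel)
{
	/*
	 * Temporary and unlogged relations never WAL-log their contents, so
	 * the optimization is always safe for them.
	 */
	if (!RelationNeedsWAL(rel))
		return true;

	/*
	 * For permanent relations, be conservative and take the WAL-logged
	 * path, per the proposed fix for wal_level = minimal.
	 */
	return false;
}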
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
On Tue, Jul 10, 2018 at 6:37 PM, Aditya Toshniwal wrote: > Hi Dave, > > I am working on a feature to support INCLUDE clause of index in PG-11. As > per the documentation > https://www.postgresql.org/docs/11/static/sql-createindex.html, columns > listed in INCLUDE clause cannot also be present as index key columns. But I > find different behaviour for below queries which are logically identical. > > > CREATE INDEX ind1 > ON public.some_table USING btree > (id asc nulls last) > INCLUDE(id) > TABLESPACE pg_default; > > This query passes and index is created. > > Kindly let me know if I am missing anything. > Seems like a bug to me. I think the problem is that, while checking whether the INCLUDE column intersects with the index key or not, it compares the "IndexElem" of the INCLUDE column with the "IndexElem" of the index key. So if any field of the "IndexElem" differs, the columns are considered non-intersecting, and in this example the ordering options do not match. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
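To make that suspicion concrete, here is an illustrative sketch — not the actual DefineIndex()/parser code, and the helper is hypothetical — of why comparing column names rather than whole IndexElem nodes would catch both forms of the statement:

#include "postgres.h"
#include "nodes/parsenodes.h"

/*
 * Hypothetical check: do a key column and an INCLUDE column name the same
 * plain column?  Ordering options (ASC/DESC, NULLS FIRST/LAST) are
 * deliberately ignored, unlike a whole-node comparison.
 */
static bool
index_elems_intersect(IndexElem *key, IndexElem *included)
{
	/* A NULL name means an expression column; skip those here. */
	if (key->name == NULL || included->name == NULL)
		return false;

	return strcmp(key->name, included->name) == 0;
}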
Re: [HACKERS] PoC: full merge join on comparison clause
I tried to fix the things you mentioned and improve the comments. Among other changes, there is now a description of how merge join works with inequalities at the top of nodeMergejoin.c. It also explains why we only support one inequality clause. Some particular points: On 07/06/2018 04:01 PM, Ashutosh Bapat wrote: -StrategyNumber opstrategy = mergestrategies[iClause]; +StrategyNumber sort_strategy = mergestrategies[iClause]; -intop_strategy; +intjoin_strategy; I don't see a reason why should we change the name of variable here. These are operator strategies and there's no need to change their names. The name change is introducing unnecessary diffs. These variables have different meaning but their names differ only with an underscore. When I had to change this function, I made mistakes because of this. I'd keep the descriptive names to avoid further confusion. Should this be a separate patch? I think we should write a wrapper around MJCompare which interprets the result rather than changing MJCompare itself. OR at least change the name of MJCompare. Renamed the function to MJTestTuples to reflect that it decides whether we join tuples or advance either side. - * for get_mergejoin_opfamilies(). + * for get_equality_opfamilies(). I think we should leave mergejoin word in there or at least indicate that these are btree opfamilies so that we don't confuse it with hash equality operator families. Renamed these to get_btree_equality_opfamilies() and get_btree_comparison_opfamilies(). +if (parent->mj_Ineq_Present) +elog(ERROR, "inequality mergejoin clause must be the last one"); + IIUC, this never happens. If it really happens, we have created a path which can not be used practically. That should never happen. It will help to add a comment here clarifying that situation. This is just a cross-check for the planner. Added a comment. We should probably use a separate error code for internal errors as opposed to user errors, but I'm not sure if we have one, I see just elog(ERROR) being used everywhere. +boolhave_inequality = have_inequality_mergeclause(mergeclauses); There will be many paths created with different ordering of pathkeys. So, instead of calling have_inequality_mergeclause() for each of those paths, it's better to save this status in the path itself when creating the path. I removed this function altogether, because we can just check the last merge clause. When we cost the path, we already have a proper mergejoinable list of clauses, so if there is an inequality clause, it's the last one. /* Make a pathkey list with this guy first */ if (l != list_head(all_pathkeys)) +{ +if (have_inequality && l == list_tail(all_pathkeys)) +/* Inequality merge clause must be the last, we can't move it */ +break; + I am kind of baffled by this change. IIUC the way we create different orderings of pathkeys here, we are just rotating the pathkeys in circular order. This means there is exactly one ordering of pathkeys where the pathkey corresponding to the inequality clause is the last one. This code does not rotate the pathkeys circularly, but puts each of them in the first position, and keeps the rest in the original order. Say, if we have three equality pathkeys, and one inequality pathkey at the end (let's denote them as E1, E2, E3, IE), the permutations it tries will be like this: E1 E2 E3 IE E2 E1 E3 IE E3 E1 E2 IE Does this sound right? 
/* Might have no mergeclauses */ if (nClauses == 0) return NIL; +{ +List *ineq_clauses = find_inequality_clauses(mergeclauses); + +if (list_length(ineq_clauses) > 1) +return NIL; Without this patch, when there is an inequality clause with FULL JOIN, we will not create a merge join path because select_mergejoin_clauses() will set mergejoin_allowed to false. This means that we won't call sort_inner_and_outer(). I think this patch also has to do the same i.e. when there are more than one inequality clauses, select_mergejoin_clauses() should set mergejoin_allowed to false in case of a FULL JOIN since merge join machinary won't be able to handle that case. If we do that, we could arrange extra.mergeclause_list such that the inequality clause is always at the end thus finding inequality clause would be easy. I changed select_mergejoin_clauses() to filter multiple inequality clauses and disable join if needed. Now we can use extra inequalities as join filter, if it's not full join. I didn't reorder extra.mergeclause_list there, because this order is ignored later. select_outer_pathkeys_for_merge() chooses the order of pathkeys using some heuristics, and then find_mergeclauses_for_outer_pathkeys() reorders the clauses accordingly. -- Alexander Kuzmenkov Postgres Professional: http://www.postgr
Re: [PG-11] Potential bug related to INCLUDE clause of CREATE INDEX
Hi! > On 10 Jul 2018, at 17:54, Tom Lane wrote: > > Aditya Toshniwal writes: >> I am working on a feature to support INCLUDE clause of index in PG-11. As >> per the documentation https://www.postgresql.org/docs/11/static/ >> sql-createindex.html, columns listed in INCLUDE clause cannot also be >> present as index key columns. But I find different behaviour for below >> queries which are logically identical. > > I wonder why there is any such restriction at all. We have never > attempted to prevent the creation of "silly" indexes [...] So my inclination > is to rip out the "must not intersect" test altogether, > not try to make it a bit smarter It seems to me a valid way of reaching completely consistent validation behavior. But there are some other validation steps that seem useful: e.g. "ERROR: including column does not support ASC/DESC options" and "ERROR: including column does not support NULLS FIRST/LAST options". IMHO it is not a bug. CREATE INDEX ON some_table(id+0) INCLUDE (id); or some similar tricks will work anyway. Best regards, Andrey Borodin.
Re: Let's remove DSM_IMPL_NONE.
committed On 10.07.18 08:49, Kyotaro HORIGUCHI wrote: > Thank you for the notice. > > At Mon, 9 Jul 2018 12:30:22 +0300, Arthur Zakirov > wrote in <20180709093021.GA9309@zakirov.localdomain> >> Hello, >> >> On Mon, Jul 09, 2018 at 06:07:24PM +0900, Kyotaro HORIGUCHI wrote: >>> The new version v4 is changed in the following points. >>> >>> - Don't renumber the DSM_IMPL symbols, just removed _NONE. >>> >>> - Rewrote the pointed comment. >>> >>> - Revised the commit message removing a mention to an >>> already-committed patch. >>> >>> - (and rebased) >> >> Just a little note. In parallel.sgml it is still mentioned that >> dynamic_shared_memory_type accepts 'none' value: >> >>> must be set to a >>> value other than none. > > Oops! Thanks. Just removed it altogether and I didn't find > another instance. > > regards. > -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [PATCH] kNN for SP-GiST
Hi! > I'm reviewing this patch. Currently I'm trying to understand sp-gist scan > deeeper, but as for now have some small notices. I've passed through the code one more time. Here are a few more questions: 1. The logic behind the division of the patch into steps was last described on 2017-01-30, but ISTM the actual steps have changed since then? May I ask you to write a bit about the steps of the patchset? 2. The patch leaves contribs intact. Do extensions with sp-gist opclasses need to update their behavior somehow to be used as-is? Or to support the new functionality? 3. There is a patch about predicate locking in SP-GiST [0]. Is this KNN patch theoretically compatible with predicate locking? Seems like it is, I just want to note that this functionality may exist. 4. The scan state now has scanStack and queue. Maybe it's better to name them scanStack and scanQueue, or stack and queue? Best regards, Andrey Borodin. [0] https://commitfest.postgresql.org/14/1215/
Re: [HACKERS] plpgsql - additional extra checks
2018-07-09 21:44 GMT+02:00 Alvaro Herrera : > > + ereport(errlevel, > > (errcode(ERRCODE_TOO_MANY_ > ROWS), > >errmsg("query returned > more than one row"), > > - errdetail ? > errdetail_internal("parameters: %s", errdetail) : 0)); > > + errdetail ? > errdetail_internal("parameters: %s", errdetail) : 0, > > + use_errhint ? > errhint("too_many_rows check of extra_%s is active.", > > + > too_many_rows_level == ERROR ? "errors" : "warnings") : 0)); > > Please write this in a way that results in less translatable messages, > and don't build setting names at runtime. Concretely I suggest this: > > errhint(too_many_rows_level == ERROR ? > gettext_noop("%s check of extra_errors is active.") : > gettext_noop("%s check of extra_warnings is active."), > "too_many_rows"); > > This way, > 1. only two messages need translation, not one per type of warning/error > 2. the translator doesn't need to scratch their head to figure out what >the second %s is > 3. they don't have to worry about introducing a typo in the string >"too_many_rows" or the strings "extra_errors", "extra_warnings". >(Laugh all you want. It's a real problem). > > If you can add a /* translator: */ comment to indicate what the first %s > is, all the better. I'm just not sure *where* it needs to go. I'm not > 100% sure the gettext_noop() is really needed either. > > > + ereport(strict_ > multiassignment_level, > > + > (errcode(ERRCODE_DATATYPE_MISMATCH), > > + > errmsg("Number of source and target fields in assignment do not match."), > > Please, no uppercase in errmsg(), and no ending period. > Thank you for notes. Tomorrow morning I'll spend few hours in train and I'll send updated patch Regards Pavel > -- > Álvaro Herrerahttps://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >
Re: [PATCH] Timestamp for a XLOG_BACKUP_END WAL-record
On Tue, Jul 10, 2018 at 6:41 PM, Andrey V. Lepikhov wrote: > > > On 10.07.2018 06:45, Andres Freund wrote: >> >> Hi, >> >> On 2018-07-10 06:41:32 +0500, Andrey V. Lepikhov wrote: >>> >>> This functionality is needed in practice when we have to determine a >>> recovery time of specific backup. >> >> >> What do you mean by "recovery time of specific backup"? >> > > recovery time - is a time point where backup of PostgreSQL database instance > was made. > Performing database recovery, we want to know what point in time the > restored database will correspond to. > This functionality refers to improving the usability of pg_basebackup and > pg_probackup utilities. Why don't you use a backup history file for that purpose? Regards, -- Fujii Masao
_isnan() on Windows
The isnan() function is evidently not present in <math.h> on Windows before Visual Studio 2013. We define it in win32_port.h using _isnan(). However, _isnan() is also not present there; it is in <float.h>. The patch is attached to include <float.h> from win32_port.h. Thanks to Thomas Munro for pointing this out to me [1]. It is hard to notice this issue without testing the changes on Windows. [1] https://www.postgresql.org/message-id/CAEepm%3D3dE0w%3D_dOcELpPum6suze2NZwc%3DAV4T_xYrDUoHbZJvA%40mail.gmail.com win32_port_float_h_v00.patch Description: Binary data
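Since the patch itself is attached as a binary file, here is a sketch of what such a change is presumed to look like (the actual patch may differ in detail; _MSC_VER 1800 corresponds to Visual Studio 2013):

/* Presumed shape of the win32_port.h fix (illustrative only). */
#if defined(_MSC_VER) && _MSC_VER < 1800
#include <float.h>				/* declares _isnan() */
#define isnan(x) _isnan(x)
#endif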
Re: _isnan() on Windows
On 2018-Jul-10, Emre Hasegeli wrote: > isnan() function is evidently not present in <math.h> on Windows > before Visual Studio 2013. We define it in win32_port.h using > _isnan(). However _isnan() is also not present. It is in <float.h>. > The patch is attached to include <float.h> from win32_port.h. > > Thanks to Thomas Munro for pointing this out to me [1]. It is hard to > notice this issue without testing the changes on Windows. Oh, it looks like commits 33a7101281c6, 8e211f539146, 86dbbf20d849 (probably others) papered over this by the expedient of adding #include <float.h> to random .c files rather than your patch, which seems the proper fix. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-Jul-09, Tom Lane wrote: > Suppose you did > > create domain overint as int; > create table pt (a overint) partition by range (a); > create table pt1 partition of pt for values from (0) to (100); > > and the system took it, and then you did > > alter domain overint add check (value > 100); > > What happens now? It scans the table to check whether any values violate that condition, and raises an error if they do: alvherre=# alter domain overint add check (value > 100); ERROR: column "a" of table "ppt1" contains values that violate the new constraint This looks sensible behavior to me. Now if you don't have any values that violate the new constraint, then the constraint can be created just fine, and you now have a partition that can never accept any values. But that doesn't seem like a terrible problem. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-May-08, Amit Langote wrote: > In HEAD, since we already fixed that case in e5dcbb88a15d [1] which is a > different piece of code anyway, the patch only serves to improve the > deparse output emitted by ruleutils.c for partition constraint expressions > where pseudo-type partition key is involved. The change can be seen in > the updated test output for create_table test. Actually, even in 11/master it also fixes this case: alvherre=# explain update p set a = a || a where a = '{1}'; QUERY PLAN ── Update on p (cost=0.00..54.03 rows=14 width=38) Update on p1 Update on p2 -> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) Filter: (a = '{1}'::integer[]) -> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) Filter: (a = '{1}'::integer[]) (7 filas) Because UPDATE uses the predtest.c prune code, not partprune. So it's not just some ruleutils beautification. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
Alvaro Herrera writes: > On 2018-Jul-09, Tom Lane wrote: >> Suppose you did >> >> create domain overint as int; >> create table pt (a overint) partition by range (a); >> create table pt1 partition of pt for values from (0) to (100); >> >> and the system took it, and then you did >> >> alter domain overint add check (value > 100); >> >> What happens now? > It scans the table to check whether any values violate that condition, > and raises an error if they do: > alvherre=# alter domain overint add check (value > 100); > ERROR: column "a" of table "ppt1" contains values that violate the new > constraint > This looks sensible behavior to me. And what about those partition bound values? They are now illegal for the domain, so I would expect a dump/reload to fail, regardless of whether there are any values in the table. regards, tom lane
Re: New GUC to sample log queries
On 06/27/2018 11:13 PM, Adrien Nayrat wrote: >> 3) Is it intentional to only sample with log_min_duration_statement and >> not also with log_duration? It seems like it should affect both. In >> both cases, the name is too generic. Something called "log_sample_rate" >> should sample **everything**. > I do not think it could be useful to sample other case such as log_duration. > > But yes, the GUC is confusing and I am not comfortable to introduce a new GUC > in > my initial patch. > > Maybe we should adapt current GUC with something like : > > log_min_duration_statement = ,> > This give : > > log_min_duration_statement = 0,0.1 > > Equivalent to : > log_min_duration_statement = 0 > log_sample_rate = 0.1 > > Thought? > After reflection it seems a bad idea : * it breaks compatibility with external tools * it introduce a kind of "composite" GUC which may add complexity to use. For example in pg_settings view. What do you think of : log_min_duration_statement_sample ? Is it too long? I saw a few days ago this error on http://commitfest.cputube.org postgres.sgml:5202: element xref: validity error : IDREF attribute linkend references an unknown ID "log_min_duration_statement" Patch attached with fix on linkend marker Regards, -- Adrien commit d388b1e31926c2f5f12aa5453cc4df6e7d9e5f60 Author: anayrat Date: Tue Jul 10 19:58:16 2018 +0200 Add a new GUC, log_sample_rate, to log a fraction of queries. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index e307bb4e8e..72791ad2a6 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -5190,6 +5190,26 @@ local0.*/var/log/postgresql + + log_sample_rate (real) + + log_sample_rate configuration parameter + + + + + Causes logging only a fraction of the statements when is used. The default is + 1, meaning log all statements longer than + log_min_duration_statement. Setting this to zero + disables logging, same as setting + log_min_duration_statement to + -1. Using log_sample_rate is + helpful when the traffic is too high to log all queries. + + + + diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index f4133953be..5590f9a9d4 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -2115,7 +2115,8 @@ check_log_statement(List *stmt_list) /* * check_log_duration - * Determine whether current command's duration should be logged + * Determine whether current command's duration should be logged. + * If log_sample_rate < 1.0, log only a sample. * * Returns: * 0 if no logging is needed @@ -2137,6 +2138,7 @@ check_log_duration(char *msec_str, bool was_logged) int usecs; int msecs; bool exceeded; + bool in_sample; TimestampDifference(GetCurrentStatementStartTimestamp(), GetCurrentTimestamp(), @@ -2153,7 +2155,15 @@ check_log_duration(char *msec_str, bool was_logged) (secs > log_min_duration_statement / 1000 || secs * 1000 + msecs >= log_min_duration_statement))); - if (exceeded || log_duration) + /* + * Do not log if log_sample_rate = 0. + * Log a sample if log_sample_rate <= 1 and avoid unecessary random() + * call if log_sample_rate = 1. 
+ */ + in_sample = log_sample_rate != 0 && + (log_sample_rate == 1 || random() <= log_sample_rate * MAX_RANDOM_VALUE); + + if (exceeded && in_sample || log_duration) { snprintf(msec_str, 32, "%ld.%03d", secs * 1000 + msecs, usecs % 1000); diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 17292e04fe..daf1ae3af1 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -455,6 +455,7 @@ int log_min_messages = WARNING; int client_min_messages = NOTICE; int log_min_duration_statement = -1; int log_temp_files = -1; +double log_sample_rate = 1.0; int trace_recovery_messages = LOG; int temp_file_limit = -1; @@ -3257,6 +3258,17 @@ static struct config_real ConfigureNamesReal[] = NULL, NULL, NULL }, + { + {"log_sample_rate", PGC_SUSET, LOGGING_WHEN, + gettext_noop("Fraction of statements to log."), + gettext_noop("1.0 logs all statements."), + NULL + }, + &log_sample_rate, + 1.0, 0.0, 1.0, + NULL, NULL, NULL + }, + /* End-of-list marker */ { {NULL, 0, 0, NULL, NULL}, NULL, 0.0, 0.0, 0.0, NULL, NULL, NULL diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample index 657c3f81f8..b8ebbe9f7f 100644 --- a/src/backend/utils/misc/postgresql.conf.sample +++ b/src/backend/utils/misc/postgresql.conf.sample @@ -446,6 +446,8 @@ # statements running at least this number # of milliseconds +#log_sample_rate = 1 # Fraction of logged statements. 1 means log all + # statements. # - What to Log - diff --git a/src/include/utils/guc.h b/src/include/utils
Re: Desirability of client-side expressions in psql?
> > >psql> \if :i >= 5 > > I think we're ok with that so long as none of the operators or values has a \ in it. What barriers do you see to re-using the pgbench grammar?
Re: user-friendliness improvement of pageinspect
On Tue, Jul 10, 2018 at 12:41 AM, 杨杰 wrote: > Why does the heap_page_item () of the pageinspect extension not consider > providing better user-friendliness? > > My test table has the following data, and when I look at the t_data I see > data of type bytea instead of a more intuitive type, even the same type as > the original table. tuple_data_split() can do this (split the data into individual attributes). If you want a friendlier, more visual way of getting this information, then you may find pg_hexedit helpful: https://github.com/petergeoghegan/pg_hexedit -- Peter Geoghegan
Re: Allow COPY's 'text' format to output a header
On 4 July 2018 at 22:44, Simon Muller wrote: > I noticed through the patch tester link at http://commitfest.cputube.org/ > that my patch caused a file_fdw test to fail (since I previously tested > only with "make check" and not with "make check-world"). > > This v2 patch should fix that. > This patch just fixes a newline issue introduced in my previous patch. text_header_v3.patch Description: Binary data
Re: Reduce amount of WAL generated by CREATE INDEX for gist, gin and sp-gist
On 28/02/18 18:03, Anastasia Lubennikova wrote: Hi, I want to propose a bunch of patches which allow to reduce WAL traffic generated by CREATE INDEX for GiST, GIN and SP-GiST. Similarly to b-tree and RUM, we can now log index pages of other access methods only once in the end of indexbuild process. Makes sense. Is there any scenario where the current method is better? In theory, doing another pass through the whole index, to create the WAL records, requires more I/O. But in practice, it seems that the reduction in WAL is almost certainly a win. Implementation is based on generic_xlog. Why? I think we should just add a log_relation() function in xloginsert.c directly, alongside log_newpage_buffer(). This makes the assumption that all the pages in these indexes used the standard page layout. I think that's a valid assumption, but needs at least a comment on that. And perhaps an Assert, to check that pd_lower/upper look sane. As a further optimization, would it be a win to WAL-log multiple pages in each record? This leaves the XLOG_*_CREATE_INDEX WAL record types unused, BTW. - Heikki
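To make the suggestion concrete, here is a very rough sketch of what a log_relation() helper in xloginsert.c could look like — the name and shape are assumptions, not existing code — built on the existing full-page primitive and relying on the standard page layout:

#include "postgres.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Illustrative sketch: WAL-log every page of a finished index in one
 * pass, instead of going through generic_xlog during the build.
 */
void
log_relation(Relation rel, ForkNumber forknum, bool page_std)
{
	BlockNumber nblocks = RelationGetNumberOfBlocksInFork(rel, forknum);
	BlockNumber blkno;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		Buffer		buf;
		Page		page;
		XLogRecPtr	recptr;

		buf = ReadBufferExtended(rel, forknum, blkno, RBM_NORMAL, NULL);
		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
		page = BufferGetPage(buf);

		/* Emit a full-page image; assumes the standard page layout. */
		recptr = log_newpage(&rel->rd_node, forknum, blkno, page, page_std);

		/* Stamp the page LSN so redo ordering works; details glossed over. */
		PageSetLSN(page, recptr);
		MarkBufferDirty(buf);

		UnlockReleaseBuffer(buf);
	}
}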
Re: Failure assertion in GROUPS mode of window function in current HEAD
Masahiko Sawada writes: > BTW, I found an another but related issue. We can got an assertion > failure also by the following query. > =# select sum(c) over (partition by c order by c groups between 1 > preceding and current row) from test; Oh, good point, that's certainly legal per spec, so we'd need to do something about it. The proximate cause of the problem, I think, is that the planner throws away the "order by c" as being redundant because it already sorted by c for the partitioning requirement. So we end up with ordNumCols = 0 even though the query had ORDER BY. This breaks not only GROUPS cases, but also RANGE OFFSET cases, because the executor expects to have an ordering column. I thought for a bit about fixing that by forcing the planner to emit the ordering column for RANGE OFFSET even if it's redundant. But it's hard to make the existing logic in get_column_info_for_window do that --- it can't tell which partitioning column the ordering column was redundant with, and even if it could, that column might've been of a different data type. So eventually I just threw away that redundant-key-elimination logic altogether. I believe that we never thought it was really useful as an optimization, but way back when window functions were put in, we didn't have (or at least didn't think about) a way to identify the partitioning/ordering columns without reference to the input pathkeys. With this patch, WindowAggPath.winpathkeys is not used for anything anymore. I'm inclined to get rid of it, though I didn't do so here. (If we keep it, we at least need to adjust the comment in relation.h that claims createplan.c needs it.) The other issue here is that nodeWindowAgg's setup logic is not quite consistent with update_frameheadpos and update_frametailpos about when to create tuplestore read pointers and slots. We might've prevented all the inconsistent cases with the parser and planner fixes, but it seems best to make them really consistent anyway, so I changed that. Draft patch attached. regards, tom lane diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c index 968d5d3..0db193b 100644 *** a/src/backend/executor/nodeWindowAgg.c --- b/src/backend/executor/nodeWindowAgg.c *** begin_partition(WindowAggState *winstate *** 1079,1084 --- 1079,1085 { WindowAgg *node = (WindowAgg *) winstate->ss.ps.plan; PlanState *outerPlan = outerPlanState(winstate); + int frameOptions = winstate->frameOptions; int numfuncs = winstate->numfuncs; int i; *** begin_partition(WindowAggState *winstate *** 1143,1150 * If the frame head is potentially movable, or we have an EXCLUSION * clause, we might need to restart aggregation ... */ ! if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING) || ! (winstate->frameOptions & FRAMEOPTION_EXCLUSION)) { /* ... so create a mark pointer to track the frame head */ agg_winobj->markptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); --- 1144,1151 * If the frame head is potentially movable, or we have an EXCLUSION * clause, we might need to restart aggregation ... */ ! if (!(frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING) || ! (frameOptions & FRAMEOPTION_EXCLUSION)) { /* ... so create a mark pointer to track the frame head */ agg_winobj->markptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); *** begin_partition(WindowAggState *winstate *** 1182,1202 /* * If we are in RANGE or GROUPS mode, then determining frame boundaries ! * requires physical access to the frame endpoint rows, except in * degenerate cases. 
We create read pointers to point to those rows, to * simplify access and ensure that the tuplestore doesn't discard the ! * endpoint rows prematurely. (Must match logic in update_frameheadpos ! * and update_frametailpos.) */ winstate->framehead_ptr = winstate->frametail_ptr = -1; /* if not used */ ! if ((winstate->frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS)) && ! node->ordNumCols != 0) { ! if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING)) winstate->framehead_ptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); ! if (!(winstate->frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING)) winstate->frametail_ptr = tuplestore_alloc_read_pointer(winstate->buffer, 0); } --- 1183,1206 /* * If we are in RANGE or GROUPS mode, then determining frame boundaries ! * requires physical access to the frame endpoint rows, except in certain * degenerate cases. We create read pointers to point to those rows, to * simplify access and ensure that the tuplestore doesn't discard the ! * endpoint rows prematurely. (Must create pointers in exactly the same ! * cases that update_frameheadpos and update_frametailpos need them.) *
Re: no partition pruning when partitioning using array type
On 2018-Jul-10, Alvaro Herrera wrote: > alvherre=# explain update p set a = a || a where a = '{1}'; > QUERY PLAN > ── > Update on p (cost=0.00..54.03 rows=14 width=38) >Update on p1 >Update on p2 >-> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) >-> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) > (7 filas) > > Because UPDATE uses the predtest.c prune code, not partprune. So it's > not just some ruleutils beautification. I added this test, modified some comments, and pushed. Thanks for the patch. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-Jul-10, Tom Lane wrote: > And what about those partition bound values? They are now illegal > for the domain, so I would expect a dump/reload to fail, regardless > of whether there are any values in the table. Hmm, true. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: no partition pruning when partitioning using array type
On 2018-Jul-09, Tom Lane wrote: > Alvaro Herrera writes: > > However, if we take out the > > expression_planner() and replace it with a call to > > strip_implicit_coercions(), not only it magically starts working, but > > also the regression tests start failing with the attached diff, which > > seems a Good Thing to me. > > Why would you find that to be a good thing? The prohibition against > mutable coercions seems like something we need here, for more or less > the same reason in the domain example. By the way, while playing with a partition on type money and replacing expression_planner() with strip_implicit_coercions(), the stored partition bounds are completely broken -- they end up as literals of type integer rather than money, so any insert at all into the partition fails (even if the value is nominally the same). So clearly it's not a change we want. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: _isnan() on Windows
Alvaro Herrera writes: > On 2018-Jul-10, Emre Hasegeli wrote: >> isnan() function is evidently not present in <math.h> on Windows >> before Visual Studio 2013. We define it in win32_port.h using >> _isnan(). However _isnan() is also not present. It is in <float.h>. >> The patch is attached to include <float.h> from win32_port.h. >> >> Thanks to Thomas Munro for pointing this out to me [1]. It is hard to >> notice this issue without testing the changes on Windows. > Oh, it looks like commits 33a7101281c6, 8e211f539146, 86dbbf20d849 > (probably others) papered over this by the expedient of adding #include > <float.h> to random .c files rather than your patch, which seems the > proper fix. I disagree --- including <float.h> in c.h, as this would have us do, seems like a huge expansion of the visibility of that header. Moreover, doing that only on Windows seems certain to lead to missing-header problems in the reverse direction, ie patches developed on Windows will fail elsewhere. I think we should stick with the existing coding convention of including that only in the specific .c files that need it. regards, tom lane
Re: _isnan() on Windows
On 2018-Jul-10, Tom Lane wrote: > Alvaro Herrera writes: > > Oh, it looks like commits 33a7101281c6, 8e211f539146, 86dbbf20d849 > > (probably others) papered over this by the expedient of adding #include > > <float.h> to random .c files rather than your patch, which seems the > > proper fix. > > I disagree --- including <float.h> in c.h, as this would have us do, > seems like a huge expansion of the visibility of that header. Moreover, > doing that only on Windows seems certain to lead to missing-header > problems in the reverse direction, ie patches developed on Windows > will fail elsewhere. I don't think so, because that's only done for MSVC older than 2013. Nobody uses that for new development anymore. Downloading versions prior to 2017 is difficult enough already. > I think we should stick with the existing coding convention of including > that only in the specific .c files that need it. It seems saner to contain the ugliness to a port-specific file, where it won't pollute two dozen files for no reason. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: GiST VACUUM
I'm now looking at the first patch in this series, to allow completely empty GiST pages to be recycled. I've got some questions: --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate) GISTInsertStack *item; OffsetNumber downlinkoffnum; + if(GistPageIsDeleted(stack->page)) + { + UnlockReleaseBuffer(stack->buffer); + xlocked = false; + state.stack = stack = stack->parent; + continue; + } downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate); iid = PageGetItemId(stack->page, downlinkoffnum); idxtuple = (IndexTuple) PageGetItem(stack->page, iid); This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted. Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this? The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty. I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST. If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good. - Heikki
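As a concrete illustration of the missing leaf-page handling mentioned above — hypothetically, roughly the same dance the quoted hunk does for internal pages, moved into the leaf-page branch of gistdoinsert():

/* Hypothetical sketch for the leaf-page branch of gistdoinsert():
 * if the leaf we descended to was deleted concurrently by VACUUM,
 * back off to the parent and retry the descent from there. */
if (GistPageIsDeleted(stack->page))
{
	UnlockReleaseBuffer(stack->buffer);
	xlocked = false;
	state.stack = stack = stack->parent;
	continue;
}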
Re: patch to allow disable of WAL recycling
Thanks to everyone who took the time to look at the patch and send me feedback. I'm happy to work on improving the documentation of this new tunable to clarify when it should be used and the implications. I'm trying to understand more specifically what else needs to be done next. To summarize, I think the following general concerns were brought up.

1) Disabling WAL recycling could have a negative performance impact on a COW filesystem if all WAL files could be kept in the filesystem cache.
2) Disabling WAL recycling reduces reliability, even on COW filesystems.
3) Using something like posix_fadvise to reload recycled WAL files into the filesystem cache is better even for a COW filesystem.
4) There are "several" other purposes for WAL recycling which this tunable would impact.
5) A WAL recycling tunable is too specific and a more general solution is needed.
6) Need more performance data.

For #1, #2 and #3, I don't understand these concerns. It would be helpful if these could be more specific.

For #4, can anybody enumerate these other purposes for WAL recycling?

For #5, perhaps I am making an incorrect assumption about what the original response was requesting, but I understand that WAL recycling is just one aspect of WAL file creation/allocation. However, the creation of a new WAL file is not a problem we've ever observed. In general, any modern filesystem should do a good job of caching recently accessed files. We've never observed a problem with the allocation of a new WAL file slightly before it is needed. The problem we have observed is specifically around WAL file recycling, when we have to access old files that are long gone from the filesystem cache. The semantics around recycling seem pretty crisp as compared to some other tunable which would completely change how WAL files are created. Given that a change like that is also much more intrusive, it seems better to provide a tunable to disable WAL recycling vs. some other kind of tunable for which we can't articulate any improvement except in the recycling scenario.

For #6, there is no feasible way for us to recreate our workload on other operating systems or filesystems. Can anyone expand on what performance data is needed?

I'd like to restate the original problem we observed. When PostgreSQL decides to reuse an old WAL file whose contents have been evicted from the cache (because they haven't been used in hours), this turns what should be a workload bottlenecked by synchronous write performance (which can be well-optimized with an SSD log device) into a random read workload (which is much more expensive for any system). What's significantly worse is that we saw this on synchronous standbys. When that happened, the WAL receiver was blocked on a random read from disk, and since it's single-threaded, all write queries on the primary stop until the random read finishes. This is particularly bad for us when the sync is doing other I/O (e.g., for an autovacuum or a database backup) that causes disk reads to take hundreds of milliseconds.

To summarize, recycling old WAL files seems like an optimization designed for certain filesystems that allocate disk blocks up front. Given that the existing behavior is already filesystem-specific, are there specific reasons why we can't provide a tunable to disable this behavior for filesystems which don't behave that way?

Thanks again, Jerry

On Tue, Jun 26, 2018 at 7:35 AM, Jerry Jelinek wrote: > Hello All, > > Attached is a patch to provide an option to disable WAL recycling. 
We have > found that this can help performance by eliminating read-modify-write > behavior on old WAL files that are no longer resident in the filesystem > cache. There is a lot more detail on the background of the motivation for > this in the following thread. > > https://www.postgresql.org/message-id/flat/CACukRjO7DJvub8e2AijOayj8BfKK3 > XXBTwu3KKARiTr67M3E3w%40mail.gmail.com#CACukRjO7DJvub8e2AijOayj8BfKK3 > xxbtwu3kkaritr67m3...@mail.gmail.com > > A similar change has been tested against our 9.6 branch that we're > currently running, but the attached patch is against master. > > Thanks, > Jerry > >
Re: _isnan() on Windows
Alvaro Herrera writes: > On 2018-Jul-10, Tom Lane wrote: >> I disagree --- including <float.h> in c.h, as this would have us do, >> seems like a huge expansion of the visibility of that header. Moreover, >> doing that only on Windows seems certain to lead to missing-header >> problems in the reverse direction, ie patches developed on Windows >> will fail elsewhere. > I don't think so, because that's only done for MSVC older than 2013. > Nobody uses that for new development anymore. Hm. OK, maybe it's all right given that. I'm still a bit worried about downsides, but no doubt the buildfarm will tell us. regards, tom lane
Re: patch to allow disable of WAL recycling
On 07/10/2018 01:15 PM, Jerry Jelinek wrote: Thanks to everyone who took the time to look at the patch and send me feedback. I'm happy to work on improving the documentation of this new tunable to clarify when it should be used and the implications. I'm trying to understand more specifically what else needs to be done next. To summarize, I think the following general concerns were brought up. For #6, there is no feasible way for us to recreate our workload on other operating systems or filesystems. Can anyone expand on what performance data is needed? I think a simple way to prove this would be to run BenchmarkSQL against PostgreSQL in a default configuration with pg_xlog/pg_wal on a filesystem that is COW (zfs) and then run another test where pg_xlog/pg_wal is patched with your patch and new behavior and then run the test again. BenchmarkSQL is a more thorough benchmarking tool than something like pg_bench and is very easy to set up. The reason you would use a default configuration is because it will cause a huge amount of wal churn, although a test with a proper wal configuration would also be good. Thanks, JD -- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc *** A fault and talent of mine is to tell it exactly how it is. *** PostgreSQL centered full stack support, consulting and development. Advocate: @amplifypostgres || Learn: https://postgresconf.org * Unless otherwise stated, opinions are my own. *
Re: patch to allow disable of WAL recycling
On 2018-Jul-10, Jerry Jelinek wrote: > 2) Disabling WAL recycling reduces reliability, even on COW filesystems. I think the problem here is that WAL recycling on normal filesystems helps protect the case where the filesystem gets full. If you remove it, that protection goes out the window. You can claim that people need to make sure to have available disk space, but this does become a problem in practice. I think the thing to do is verify what happens with recycling off when the disk gets full; is it possible to recover afterwards? Is there any corrupt data? What happens if the disk gets full just as the new WAL file is being created -- is there a Postgres PANIC or something? As I understand, with recycling on it is easy (?) to recover, there is no PANIC crash, and no data corruption results. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Problem with tupdesc in jsonb_to_recordset
Hi, I've found out that currently in some situations jsonb_to_recordset can lead to a crash. Minimal example that I've managed to create looks like this:

CREATE OR REPLACE FUNCTION test(data JSONB) RETURNS INTEGER AS $$
DECLARE
    test_var int;
BEGIN
    WITH jsonb_values AS (
        SELECT
            (SELECT SUM(value)
             FROM jsonb_to_recordset(element #> '{values}') AS q(value INTEGER)) AS value_sum
        FROM jsonb_array_elements(data) AS element
    )
    SELECT SUM(value_sum) FROM jsonb_values INTO test_var;
    RETURN test_var;
END;
$$ LANGUAGE plpgsql;

And then:

=# SELECT test('[
    { "values": [ { "value": 1 }, { "value": 3 } ] },
    { "values": [ { "value": 1 }, { "value": 3 } ] }
]' :: JSONB);
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

After brief investigation it looks like an issue with tupdesc from the function cache:

    if (!cache)
    {
        fcinfo->flinfo->fn_extra = cache =
            MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt, sizeof(*cache));
        //...
    rsi->setDesc = cache->c.io.composite.tupdesc;

Then after the first call of populate_recordset_worker rsi->setDesc is being reset since we never increased tdrefcount and the next call will use wrong cache data. Apparently, it can be fixed by incrementing a tdrefcount (see the attached patch).

diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index e358b5a..9250646 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3728,6 +3728,7 @@ populate_recordset_worker(FunctionCallInfo fcinfo, const char *funcname,
 
 	rsi->setResult = state->tuple_store;
 	rsi->setDesc = cache->c.io.composite.tupdesc;
+	rsi->setDesc->tdrefcount = 1;
 
 	PG_RETURN_NULL();
 }
Re: cost_sort() improvements
On 06/28/2018 06:47 PM, Teodor Sigaev wrote: > Hi! > > ... > > - cost for i-th columns: > Ci * log2(NGi) => Ci * log2(N/G(i-1)) > So, here I believe, that i-th comparison function will be called only > inside > group which number is defined by (i-1) leading columns. Note, following > discussion [1] I add multiplier 1.5 here to be closer to worst case when > groups are significantly differ. Right now there is no way to > determine how > differ they could be. Note, this formula describes cost of first > column too > except difference with multiplier.

One more thought about estimating the worst case - I wonder if simply multiplying the per-tuple cost by 1.5 is the right approach. It does not seem particularly principled, and it's trivially simple to construct counter-examples defeating it (imagine columns with 99% of the rows having the same value, and then many values in exactly one row - that will defeat the ndistinct-based group size estimates).

The reason why constructing such counter-examples is so simple is that this relies entirely on the ndistinct stats, ignoring the other stats we already have. I think this might leverage the per-column MCV lists (and eventually multi-column MCV lists, if that ever gets committed).

We're estimating the number of tuples in a group (after fixing values in the preceding columns), because that's what determines the number of comparisons we need to make at a given stage. The patch does this by estimating the average group size, by estimating ndistinct of the preceding columns G(i-1), and computing ntuples/G(i-1).

But consider that we also have MCV lists for each column - if there is a very common value, it should be there. So let's say M(i) is the frequency of the most common value in the i-th column, determined either from the MCV list or as 1/ndistinct for that column. Then if m(i) is the minimum of M(i) for the first i columns, the largest possible group of values when comparing values in the i-th column is

    N * m(i-1)

This seems like a better way to estimate the worst case. I don't know if this should be used instead of NG(i), or if those two estimates should be combined in some way.

I think estimating the largest group we need to sort should be helpful for the incremental sort patch, so I'm adding Alexander to this thread.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
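To make that concrete, a small worked example with made-up numbers (purely illustrative, not measured data):

    m(i) = minimum of M(j) over j <= i
    worst-case group size at column i ~= N * m(i-1)

    Example: N = 1,000,000 rows. Column 1 has one value covering 99% of
    the rows plus many values that appear in exactly one row, so
    M(1) = 0.99 while ndistinct(1) is on the order of 10,000. The
    ndistinct-based average group size for column 2 is
    N / ndistinct(1) ~= 100 tuples, but the largest group that actually
    has to be compared on column 2 is N * m(1) = 990,000 tuples, roughly
    four orders of magnitude larger. This is exactly the 99%-duplicates
    counter-example above, and no fixed multiplier such as 1.5 can cover
    that gap.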
Re: Optimze usage of immutable functions as relation
On 16/05/18 13:47, Aleksandr Parfenov wrote: Hello, I reworked a patch to make more stable in different cases. I decided to use simplify_function instead of eval_const_expression to prevent inlining of the function. The only possible outputs of the simplify_function are Const node and NULL. Hmm. That's still a bit inefficient, we'll try to simplify the function on every reference to it. We already simplify functions in function RTEs, but we currently do it after preprocessing all other expressions in the query. See subquery_planner(), around comment "/* Also need to preprocess expressions within RTEs */". If we moved that so that we simplify expressions in the range table first, then in the code that you're adding to eval_const_expression_mutator(), you could just check if the function expression is already a Const. But stepping back a bit, it's a bit weird that we're handling this differently from VALUES and other subqueries. The planner knows how to do this trick for simple subqueries: postgres=# explain select * from tenk1, (select abs(100)) as a (a) where unique1 < a; QUERY PLAN --- Seq Scan on tenk1 (cost=0.00..483.00 rows=100 width=248) Filter: (unique1 < 100) (2 rows) Note that it not only evaluated the function into a constant, but also got rid of the join. For a function RTE, however, it can't do that: postgres=# explain select * from tenk1, abs(100) as a (a) where unique1 < a; QUERY PLAN --- Nested Loop (cost=0.00..583.01 rows= width=248) Join Filter: (tenk1.unique1 < a.a) -> Function Scan on a (cost=0.00..0.01 rows=1 width=4) -> Seq Scan on tenk1 (cost=0.00..458.00 rows=1 width=244) (4 rows) Could we handle this in pull_up_subqueries(), similar to the RTE_SUBQUERY and RTE_VALUES cases? - Heikki
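To illustrate the suggestion, a hypothetical sketch (not the submitted patch, and glossing over details such as rowtype-returning functions and varattno mapping): once function RTEs are simplified before the other expressions, the mutator could substitute a Var referencing such an RTE along these lines:

    /*
     * Hypothetical sketch: if the Var points at a function RTE whose single
     * function expression was already reduced to a Const by the earlier RTE
     * preprocessing pass, just substitute that constant.
     */
    if (rte->rtekind == RTE_FUNCTION && list_length(rte->functions) == 1)
    {
        RangeTblFunction *rtf = (RangeTblFunction *) linitial(rte->functions);

        if (IsA(rtf->funcexpr, Const))
            return (Node *) copyObject(rtf->funcexpr);
    }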
Re: Jsonb transform for pl/python
Peter Eisentraut writes: > On 6/23/18 01:44, Nikita Glukhov wrote: >> We are simply trying first to convert numeric to int64 if it does not have >> digits after the decimal point, and then construct Python Int instead of >> Decimal. Standard Python json.loads() does the same for exact integers. > We just had a very similar but not identical discussion in the thread > about the PL/Perl transforms, where we rejected the proposal. Other > comments on this one? I can sympathize with the speed concern, but the proposed patch produces a functional change (ie, you get a different kind of Python object, with different behaviors, in some cases). That seems to me to be a good reason to reject it. The fact that it makes the behavior vary depending on the local width of "long" is also a problem, although I think that could be fixed. More generally, I'd be happier with a patch that sped things up for non-integers too. I don't know whether Python exposes enough internals of type Decimal to make that practical, though. One idea for doing it at arm's length is to compute the Decimal value using numeric-digit-at-a-time arithmetic, roughly

    x := 0;
    foreach(digit, numeric)
        x.fma(1, digit);
    x.scaleb(appropriate-exponent);

In principle that's probably faster than string conversion, but not sure by how much. regards, tom lane
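For reference, the arithmetic such a loop has to reproduce, assuming numeric's base-10000 digit array d_0 .. d_{n-1} with weight w (sign and display scale ignored here):

    value = sum over i of d_i * 10000^(w - i)
          = (((d_0 * 10000 + d_1) * 10000 + ...) * 10000 + d_{n-1}) * 10000^(w - n + 1)

so the Decimal could be accumulated one NBASE digit at a time in Horner fashion and scaled exactly once at the end, which is what the fma/scaleb sketch is aiming at. Whether that actually beats string conversion would still need to be measured.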
Re: Optimze usage of immutable functions as relation
Heikki Linnakangas writes: > But stepping back a bit, it's a bit weird that we're handling this > differently from VALUES and other subqueries. The planner knows how to > do this trick for simple subqueries: > postgres=# explain select * from tenk1, (select abs(100)) as a (a) where > unique1 < a; > QUERY PLAN > --- > Seq Scan on tenk1 (cost=0.00..483.00 rows=100 width=248) > Filter: (unique1 < 100) > (2 rows) > Note that it not only evaluated the function into a constant, but also > got rid of the join. For a function RTE, however, it can't do that: > postgres=# explain select * from tenk1, abs(100) as a (a) where unique1 < a; > QUERY PLAN > --- > Nested Loop (cost=0.00..583.01 rows= width=248) > Join Filter: (tenk1.unique1 < a.a) > -> Function Scan on a (cost=0.00..0.01 rows=1 width=4) > -> Seq Scan on tenk1 (cost=0.00..458.00 rows=1 width=244) > (4 rows) > Could we handle this in pull_up_subqueries(), similar to the > RTE_SUBQUERY and RTE_VALUES cases? Perhaps. You could only do it for non-set-returning functions, which isn't the typical use of function RTEs, which is probably why we've not thought hard about it before. I'm not sure what would need to happen for lateral functions. Also to be considered, if it's not foldable to a constant, is whether we're risking evaluating it more times than before. regards, tom lane
Re: Desirability of client-side expressions in psql?
Hello Corey, psql> \if :i >= 5 I think we're ok with that so long as none of the operators or values has a \ in it. What barriers do you see to re-using the pgbench grammar? The pgbench expression grammar mimics SQL expression grammar, on integers, floats, booleans & NULL. I'm unsure about some special cases in psql (`shell command`, 'text' "identifier"). They can be forbidden on a new command (\let), but what happens on "\if ..." which I am afraid allows them is unclear. -- Fabien.
CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
I poked into the odd behavior reported in bug #15251: https://www.postgresql.org/message-id/152967245839.1266.6939666809369185...@wrigleys.postgresql.org The reason for the poor plan chosen when the caller hasn't got select privilege on the child table is that statistic_proc_security_check() decides not to allow use of the stats for the child table. There are two functions for which it decides that, because they aren't leakproof: textregexeq() and btint4cmp(). Now, it's probably necessary that we consider textregexeq() leaky, since it might spit up errors about its pattern argument. But btint4cmp? It seems pretty silly that the integer comparison operators are marked leakproof while their underlying comparison function isn't. (The reason we're hitting this is that calc_arraycontsel() finds the datatype's default btree comparison function and tries to use that to estimate selectivity. While marking btint4cmp leakproof doesn't completely fix the misestimation, it goes a long way in this example.) I propose to run through the system operator classes, find any for which the comparison function isn't marked leakproof but the operators are, and fix them. This is clearly appropriate for HEAD and maybe it's not too late to force an initdb for v11 --- thoughts? Another question that could be raised is why we are refusing to use stats for a child table when the caller has select on the parent. It's completely trivial to extract data from a child table if you have select on the parent, so it seems like we are checking the wrong table's privileges. regards, tom lane
possible issue with array type created for every heap relation composite type
Hello, In Postgres 8.3, it was decided that an array type would be automatically created for every heap relation's composite type. Reference thread: https://www.postgresql.org/message-id/flat/20070302234016.GF3665%40fetter.org The possible issue I would like to note is related to how the array type is named in makeArrayTypeName() function. The composite type will take the heap relation's relname as its typname and the array type will usually be named with an underscore prepended (after first attempting to use the relname and hitting typname collision with the composite type). If the typname with the underscore prepended is already taken, the logic is to continue prepending underscores until there are no typname collisions (truncating the end of the typname if typname gets past NAMEDATALEN of 64). It is possible that enough heap relations with similar relnames could cause more and more typname collisions until you end up with typnames that primarily consist of underscores or not being able to construct a typname because we have reached a typname consisting of all underscores (which can cause table creation failure). This is more likely to happen when heap relnames are similar and of string length 63+. I can see this possibly being an issue with table partitioning if a user decides on partition names that reach string length 63+ (assuming user uses exactly 63 characters or does not hit relation already exists from relname truncation). This may also become an issue if we ever decide to do automated partition naming instead of having the user naming the partitions manually. Is this an issue we should be worrying about? When and how often are array types of heap relation's composite type used (I can only think of doing array_agg on table rows)? Should we consider coding up "CREATE TYPE foo ARRAY OF bar" as suggested in the past in the above referenced hackers thread? Attached are some SQL examples describing this issue. Thanks, Jimmy typname_collision.sql Description: Binary data typname_collision_with_partitioning.sql Description: Binary data
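As a rough paraphrase of the naming loop described above (purely illustrative pseudo-C, not the actual makeArrayTypeName() source; typeName and typeNamespace are assumed inputs):

    /*
     * Illustrative paraphrase of the behavior described above: keep
     * prepending an underscore and truncating to NAMEDATALEN-1 bytes until
     * the name no longer collides with an existing type in the namespace.
     */
    char        arr[NAMEDATALEN];
    char        tmp[NAMEDATALEN];

    strlcpy(arr, typeName, sizeof(arr));
    for (;;)
    {
        snprintf(tmp, sizeof(tmp), "_%s", arr);     /* prepend, truncating */
        strlcpy(arr, tmp, sizeof(arr));

        if (!SearchSysCacheExists2(TYPENAMENSP,
                                   CStringGetDatum(arr),
                                   ObjectIdGetDatum(typeNamespace)))
            break;                                  /* name is free */

        if (strspn(arr, "_") == strlen(arr))
            elog(ERROR, "could not form array type name for type \"%s\"",
                 typeName);                         /* all underscores: give up */
    }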
Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
On 2018-Jul-10, Tom Lane wrote: > I propose to run through the system operator classes, find any for which > the comparison function isn't marked leakproof but the operators are, > and fix them. This is clearly appropriate for HEAD and maybe it's not > too late to force an initdb for v11 --- thoughts? on initdb in v11, see https://postgr.es/m/cakjs1f9cqosks9jvcbkga2mdn-24ypwc9xlzfdnsmjjmupt...@mail.gmail.com -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: possible issue with array type created for every heap relation composite type
Jimmy Yih writes: > The possible issue I would like to note is related to how the array type is > named in makeArrayTypeName() function. The composite type will take the > heap relation's relname as its typname and the array type will usually be > named with an underscore prepended (after first attempting to use the > relname and hitting typname collision with the composite type). If the > typname with the underscore prepended is already taken, the logic is to > continue prepending underscores until there are no typname collisions > (truncating the end of the typname if typname gets past NAMEDATALEN of > 64). It is possible that enough heap relations with similar relnames could > cause more and more typname collisions until you end up with typnames that > primarily consist of underscores or not being able to construct a typname > because we have reached a typname consisting of all underscores (which can > cause table creation failure). We've never heard a field report of this happening, so I'm not terribly concerned about it. regards, tom lane
Re: CVE-2017-7484-induced bugs, or, btree cmp functions are not leakproof?
Alvaro Herrera writes: > on initdb in v11, see > https://postgr.es/m/cakjs1f9cqosks9jvcbkga2mdn-24ypwc9xlzfdnsmjjmupt...@mail.gmail.com [confused...] I recall a thread mentioning that we might need initdb in v11, but that one doesn't seem to be it? regards, tom lane
Re: Non-reserved replication slots and slot advancing
On Tue, Jul 10, 2018 at 09:51:27AM -0400, Alvaro Herrera wrote: > On 2018-Jul-10, Michael Paquier wrote: > I say please push already. We can push more fixes later if they are > needed. OK, I have pushed what I have now, with all the suggestions from upthread included. -- Michael
TRUNCATE tables referenced by FKs on partitioned tables
$subject is broken:

create table prim (a int primary key);
create table partfk (a int references prim) partition by range (a);
create table partfk1 partition of partfk for values from (0) to (100);
create table partfk2 partition of partfk for values from (100) to (200);

You can't truncate prim on its own. This is expected.

alvherre=# truncate table prim;
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

However, you can't do it even if you try to include partfk in the mix:

alvherre=# truncate table prim, partfk;
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

Trying to list all the partitions individually is pointless:

alvherre=# truncate table prim, partfk, partfk1, partfk2;
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

CASCADE is also useless:

alvherre=# truncate table prim cascade;
NOTICE: truncate cascades to table "partfk"
NOTICE: truncate cascades to table "partfk1"
NOTICE: truncate cascades to table "partfk2"
ERROR: cannot truncate a table referenced in a foreign key constraint
DETALLE: Table "partfk" references "prim".
SUGERENCIA: Truncate table "partfk" at the same time, or use TRUNCATE ... CASCADE.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: shared-memory based stats collector
On 2018-07-10 14:52:13 +0200, Tomas Vondra wrote: > There's one more reason why attempts to keep stats snapshots "perfectly" > consistent are likely doomed to fail - the messages are sent over UDP, which > does not guarantee delivery etc. So there's always some level of possible > inconsistency even with "perfectly consistent" snapshots. FWIW, I don't see us continuing to do so if we go for a shared hashtable for stats. - Andres
Shared buffer access rule violations?
Hi, In order to make changes to a shared buffer, one must hold a pin on it and the content lock in exclusive mode. This rule seems to be followed in most places, but there are a few exceptions. One can find several PageInit() calls with no content lock held. See, for example:

    fill_seq_with_data()
    vm_readbuf()
    fsm_readbuf()

Moreover, fsm_vacuum_page() performs "((FSMPage) PageGetContents(page))->fp_next_slot = 0;" without the content lock. There may be more, but I want to know if these can be treated as violations before moving ahead. Asim
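For reference, a minimal sketch of the rule as stated above (pin the buffer, take the content lock exclusively, then modify); the relation and block number are placeholders, and WAL-logging/critical-section details are omitted:

    /* Pin the buffer; ReadBuffer returns it pinned. */
    Buffer      buf = ReadBuffer(rel, blkno);
    Page        page;

    /* Take the content lock in exclusive mode before touching the page. */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

    page = BufferGetPage(buf);
    PageInit(page, BufferGetPageSize(buf), 0);  /* modification is now legal */
    MarkBufferDirty(buf);

    /* Real code would WAL-log the change inside a critical section here. */

    UnlockReleaseBuffer(buf);                   /* drop content lock and pin */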
Re: no partition pruning when partitioning using array type
On 2018/07/11 3:18, Alvaro Herrera wrote: > On 2018-May-08, Amit Langote wrote: > >> In HEAD, since we already fixed that case in e5dcbb88a15d [1] which is a >> different piece of code anyway, the patch only serves to improve the >> deparse output emitted by ruleutils.c for partition constraint expressions >> where pseudo-type partition key is involved. The change can be seen in >> the updated test output for create_table test. > > Actually, even in 11/master it also fixes this case: > > alvherre=# explain update p set a = a || a where a = '{1}'; > QUERY PLAN > ── > Update on p (cost=0.00..54.03 rows=14 width=38) >Update on p1 >Update on p2 >-> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) >-> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) > Filter: (a = '{1}'::integer[]) > (7 filas) > > Because UPDATE uses the predtest.c prune code, not partprune. So it's > not just some ruleutils beautification. That's true. Shame I totally missed that. Thanks, Amit
Re: Concurrency bug in UPDATE of partition-key
On 2018-Jul-09, Amit Kapila wrote: > Alvaro, > > Can you please comment whether this addresses your concern? I was thinking that it would be a matter of passing the tuple slot to EvalPlanQual for it to fill, rather than requiring it to fill some other slot obtained from who-knows-where, but I realize now that that's nigh impossible. Thanks for the explanation and patience. What bothers me about this whole business is how ExecBRDeleteTriggers and ExecDelete are now completely different from their sibling routines, but maybe there's no helping that. Please move the output arguments to the end of the argument lists; also, it would be great if you added commentary about ExecDelete's other undocumented arguments (tupleDeleted in particular) while you're in the vicinity. Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: How can we submit code patches that implement our (pending) patents?
From: Dave Page [mailto:dp...@pgadmin.org] > SFLC have acted as the project's counsel in the past, so I'm not surprised > they aren't talking to you; you won't be a known contact to them as a PG > contributor, and as a Fujitsu employee there would likely be a conflict > of interest for them to talk to you. I see. Then I hope for some reply saying so, so that I can give up my hope that I might get a good idea from them... Who is a known contact to SFLC in the PostgreSQL community? Can we expect a response from SFLC if he/she contacts them? Regards Takayuki Tsunakawa
Re: _isnan() on Windows
On Tue, Jul 10, 2018 at 04:23:42PM -0400, Tom Lane wrote: > Alvaro Herrera writes: >> On 2018-Jul-10, Tom Lane wrote: >>> I disagree --- including <float.h> in c.h, as this would have us do, >>> seems like a huge expansion of the visibility of that header. Moreover, >>> doing that only on Windows seems certain to lead to missing-header >>> problems in the reverse direction, ie patches developed on Windows >>> will fail elsewhere. > >> I don't think so, because that's only done for MSVC older than 2013. >> Nobody uses that for new development anymore. > > Hm. OK, maybe it's all right given that. I'm still a bit worried > about downsides, but no doubt the buildfarm will tell us. It seems to me that this patch is a good idea. Any objections if I take care of it? I have a Windows VM with only MSVC 2015 if I recall correctly though... -- Michael
Re: no partition pruning when partitioning using array type
On 2018/07/11 4:48, Alvaro Herrera wrote: > On 2018-Jul-10, Alvaro Herrera wrote: > >> alvherre=# explain update p set a = a || a where a = '{1}'; >> QUERY PLAN >> ── >> Update on p (cost=0.00..54.03 rows=14 width=38) >>Update on p1 >>Update on p2 >>-> Seq Scan on p1 (cost=0.00..27.02 rows=7 width=38) >> Filter: (a = '{1}'::integer[]) >>-> Seq Scan on p2 (cost=0.00..27.02 rows=7 width=38) >> Filter: (a = '{1}'::integer[]) >> (7 filas) >> >> Because UPDATE uses the predtest.c prune code, not partprune. So it's >> not just some ruleutils beautification. > > I added this test, modified some comments, and pushed. > > Thanks for the patch. Thank you for committing. Regards, Amit
RE: How can we submit code patches that implement our (pending) patents?
From: Tom Lane [mailto:t...@sss.pgh.pa.us] > The core team has considered this matter, and has concluded that it's > time to establish a firm project policy that we will not accept any code > that is known to be patent-encumbered. The long-term legal risks and > complications involved in doing that seem insurmountable, given the > community's amorphous legal nature and the existing Postgres license > wording (neither of which are open for negotiation here). Furthermore, > Postgres has always been very friendly to creation of closed-source > derivatives, but it's hard to see how inclusion of patented code would > not cause serious problems for those. The potential benefits of > accepting patented code just don't seem to justify trying to navigate > these hazards. That decision is very unfortunate for me as a corporate employee on one hand, but on the other, as one developer, the firm cleanness looks a bit bright to me. As a practical matter, when and where are you planning to post the project policy? How would you check for and prevent patented code? Regards Takayuki Tsunakawa
Re: Excessive PostmasterIsAlive calls slow down WAL redo
On Tue, Jul 10, 2018 at 11:39 PM, Heikki Linnakangas wrote: > The 'postmaster_possibly_dead' flag is not reset anywhere. So if a process > receives a spurious death signal, even though postmaster is still alive, > PostmasterIsAlive() will continue to use the slow path. +1 > postmaster_possibly_dead needs to be marked as 'volatile', no? +1 > The autoconf check for PR_SET_PDEATHSIG seems slightly misplaced. And I > think we can simplify it with AC_CHECK_HEADER(). I'd also like to avoid > adding code to c.h for this, that seems too global. +1, much nicer, thanks. > After some kibitzing, I ended up with the attached. It fixes the > postmaster_possible_dead issues mentioned above, and moves around the > autoconf and #ifdef logic a bit to make it a bit nicer, at least in my > opinion. Thanks, that looks good to me. I added your name as co-author and pushed to master. I also made a couple of minor cosmetic changes in PostmasterDeathSignalInit() to make the follow-up patch prettier (#if defined() instead of #ifdef, and a signum variable because I later need its address). > I don't have a FreeBSD machine at hand, so I didn't try fixing that > patch. I updated the FreeBSD version to use the header test approach you showed, and pushed that too. FWIW the build farm has some FreeBSD animals with and without PROC_PDEATHSIG_CTL. I suppose it's possible that we might want to reconsider the choice of signal in the future (SIGINFO or SIGPWR). (Random archeological note: TIL that Linux stole this from Irix (RIP), but it had PR_TERMCHILD instead of PR_SET_PDEATHSIG.) -- Thomas Munro http://www.enterprisedb.com
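As a standalone illustration of the mechanism under discussion (Linux-specific; the signal number and variable names here are arbitrary choices for the example), the child asks the kernel to deliver a signal when its parent exits, and the handler only sets a flag so the expensive liveness check can be skipped until the flag is raised:

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/prctl.h>

    /* analogous to the postmaster_possibly_dead flag discussed above */
    static volatile sig_atomic_t parent_possibly_dead = 0;

    static void
    death_handler(int signo)
    {
        parent_possibly_dead = 1;   /* cheap; a definitive check can follow */
    }

    int
    main(void)
    {
        pid_t       pid = fork();

        if (pid == 0)
        {
            /* child: ask for SIGUSR1 when the parent process exits */
            signal(SIGUSR1, death_handler);
            if (prctl(PR_SET_PDEATHSIG, SIGUSR1) < 0)
            {
                perror("prctl");
                exit(1);
            }

            /*
             * A real implementation would also re-check getppid() here, to
             * close the race where the parent exited before prctl() ran.
             */
            while (!parent_possibly_dead)
                sleep(1);           /* sleep is cut short when the signal arrives */

            printf("parent appears to be gone\n");
            return 0;
        }

        sleep(1);                   /* parent exits shortly; child gets SIGUSR1 */
        return 0;
    }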
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 6 July 2018 at 21:25, Kato, Sho wrote:
> 2. 11beta2 + patch1 + patch2
>
> patch1: Allow direct lookups of AppendRelInfo by child relid
> commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
> patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
>
>  part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
> ----------+-------------+-------------+----------------+----------------+----------------
>       100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
>       200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
>       400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
>       800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
>      1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
>      3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
>      6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Hi, Thanks a lot for benchmarking this. Just a note to say that the "Allow direct lookups of AppendRelInfo by child relid" patch is already in master. It's much more relevant to be testing with master than pg11. This patch is not intended for pg11. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: user-friendliness improvement of pageinspect
On Tue, Jul 10, 2018 at 6:44 PM, Yang Jie wrote: > my question is not split the data into individual attributes. > I want to see the data in the table, but I don't want to be a bytea type. What's wrong with simply using an SQL query? -- Peter Geoghegan
Re: user-friendliness improvement of pageinspect
Thank you for your answer. my question is not split the data into individual attributes. I want to see the data in the table, but I don't want to be a bytea type. On July 11, 2018, 02:48, Peter Geoghegan wrote: On Tue, Jul 10, 2018 at 12:41 AM, Yang Jie wrote: Why does the heap_page_item () of the pageinspect extension not consider providing better user-friendliness? My test table has the following data, and when I look at the t_data I see data of type bytea instead of a more intuitive type, even the same type as the original table. tuple_data_split() can do this (split the data into individual attributes). If you want a friendlier, more visual way of getting this information, then you may find pg_hexedit helpful: https://github.com/petergeoghegan/pg_hexedit -- Peter Geoghegan
Re: no partition pruning when partitioning using array type
On 2018/07/11 4:50, Alvaro Herrera wrote: > On 2018-Jul-10, Tom Lane wrote: > >> And what about those partition bound values? They are now illegal >> for the domain, so I would expect a dump/reload to fail, regardless >> of whether there are any values in the table. > > Hmm, true. There is a patch to overhaul how partition bound values are parsed: https://commitfest.postgresql.org/18/1620/ With that patch, the error you reported upthread goes away (that is, one can successfully create partitions), but the problem that Tom mentioned above then appears. What's the solution here then? Prevent domains as partition key? Thanks, Amit
Re: [HACKERS] SERIALIZABLE with parallel query
On Mon, Jul 2, 2018 at 3:12 PM, Masahiko Sawada wrote: > On Fri, Jun 29, 2018 at 7:28 PM, Thomas Munro > wrote: >> On Thu, Jun 28, 2018 at 7:55 PM, Masahiko Sawada >> wrote: >>> I'd like to test and review this patches but they seem to conflict >>> with current HEAD. Could you please rebase them? >> >> Hi Sawada-san, >> >> Thanks! Rebased and attached. The only changes are: the LWLock >> tranche is now shown as "serializable_xact" instead of "sxact" (hmm, >> LWLock tranches have lower_case_names_with_underscores, but individual >> LWLocks have CamelCaseName...), and ShareSerializableXact() no longer >> does Assert(!IsParallelWorker()). These changes are based on the last >> feedback from Robert. >> > > Thank you! Will look at patches. >

I looked at these patches. The latest patch can build without any errors and warnings and passes all regression tests. I don't see critical bugs, but there are some random comments.

+	/*
+	 * If the leader in a parallel query earler stashed a partially
+	 * released SERIALIZABLEXACT for final clean-up at end of transaction
+	 * (because workers might still have been accessing it), then it's
+	 * time to restore it.
+	 */

There is a typo. s/earler/earlier/

Should we add a test to check that the write-skew[1] anomaly doesn't happen even in parallel mode?

- * We aren't acquiring lightweight locks for the predicate lock or lock
+ * We acquire an LWLock in the case of parallel mode, because worker
+ * backends have access to the leader's SERIALIZABLEXACT. Otherwise,
+ * we aren't acquiring lightweight locks for the predicate lock or lock

Both "LWLock" and "lightweight lock" are used. Maybe it's better to unify the spelling.

@@ -2231,9 +2234,12 @@ PrepareTransaction(void)
 	/*
 	 * Mark serializable transaction as complete for predicate locking
 	 * purposes. This should be done as late as we can put it and still allow
-	 * errors to be raised for failure patterns found at commit.
+	 * errors to be raised for failure patterns found at commit. This is not
+	 * appropriate for parallel workers however, because we aren't committing
+	 * the leader's transaction and its serializable state will live on.
 	 */
-	PreCommit_CheckForSerializationFailure();
+	if (!IsParallelWorker())
+		PreCommit_CheckForSerializationFailure();

This code assumes that parallel workers could prepare a transaction. Is that expected behavior of parallel query? There is an assertion !IsInParallelMode() at the beginning of that function though.

+    /*
+     * If the transaction is committing, but it has been partially released
+     * already, then treat this as a roll back. It was marked as rolled back.
+     */
+    if (isCommit && SxactIsPartiallyReleased(MySerializableXact))
+        isCommit = false;
+

Isn't it better to add an assertion to check that MySerializableXact->flags has the SXACT_FLAG_ROLLED_BACK flag, for safety?

[1] https://en.wikipedia.org/wiki/Snapshot_isolation#Definition

Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: user-friendliness improvement of pageinspect
On Tue, Jul 10, 2018 at 6:55 PM, Yang Jie wrote: > Simple queries are fine, but pageinspect can query previous data. If you > return the same result as a simple query, this is similar to the oracle > flashback version query. Please don't top-post. That may be true in some limited sense, but querying old data like that seems totally impractical. You might do something like that for forensic purposes, as a once off thing, but you didn't say anything about your high level goals here. -- Peter Geoghegan
Re: user-friendliness improvement of pageinspect
Simple queries are fine, but pageinspect can query previous data. If you return the same result as a simple query, this is similar to the oracle flashback version query. On 7/11/2018 09:51,Peter Geoghegan wrote: On Tue, Jul 10, 2018 at 6:44 PM, Yang Jie wrote: my question is not split the data into individual attributes. I want to see the data in the table, but I don't want to be a bytea type. What's wrong with simply using an SQL query? -- Peter Geoghegan