documentation about explicit locking

2018-07-04 Thread Amit Langote
Hi.

I wonder why we mention on the following page that CREATE COLLATION
requires SHARE ROW EXCLUSIVE lock

https://www.postgresql.org/docs/devel/static/explicit-locking.html

I know that's the lock taken on the pg_collation catalog, but do we need
to mention locks taken by a DDL command on the catalogs it affects?  All
other commands mentioned on the page require specifying the table name
that the lock will be taken on.
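
For reference, the catalog lock can be observed directly (a quick sketch;
the collation name is arbitrary):

  BEGIN;
  CREATE COLLATION test_coll FROM "C";
  SELECT relation::regclass, mode
    FROM pg_locks
   WHERE relation = 'pg_collation'::regclass;
  -- shows ShareRowExclusiveLock on pg_collation
  ROLLBACK;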

Thanks,
Amit




RE: Global shared meta cache

2018-07-04 Thread Ideriha, Takeshi
>-Original Message-
>From: AJG [mailto:ay...@gera.co.nz]
>Sent: Wednesday, June 27, 2018 3:21 AM
>To: pgsql-hack...@postgresql.org
>Subject: Re: Global shared meta cache
>
>Ideriha, Takeshi wrote
>> 2) benchmarked 3 times for each condition and got the average TPS.
>>
>>                                  | master branch | prototype | proto/master (%)
>> ---------------------------------+---------------+-----------+-----------------
>>  pgbench -c48 -T60 -Msimple -S   | 131297        | 130541    | 101%
>>  pgbench -c48 -T60 -Msimple      | 4956          | 4965      | 95%
>>  pgbench -c48 -T60 -Mprepared -S | 129688        | 132538    | 97%
>>  pgbench -c48 -T60 -Mprepared    | 5113          | 4615      | 84%
>>
>> 001_global_meta_cache.patch (6K)
>
>
>Hello,
>Apologies for the question. I thought I would just double check the
>percentages that have been presented.
>Is the percentage calculation correct?
>#1 and #3 look inverted to me (they say lower when they should be higher
>and vice versa), and #2 and #4 look generally incorrect (the percentages
>look much larger than they should be based on the numbers).
>
>I.e. for Msimple -S the prototype had slightly worse TPS performance
>(130541) versus master (131297). I would expect the percentage to be e.g.
>99%, not 101%.
>
>But I may be misunderstanding something :)
>
>Also, Msimple is 4956 master versus 4965 prototype. Just a 9 TPS change, a
>very slight improvement in TPS, but the percentage provided is 95%. I
>would expect it to be just over 100%?
>Again, maybe I'm not understanding, and hoping it is just my error :)
>
>
>
>
Hi, 
Thank you for the comments, and sorry for the late reply.
Thanks to you, I noticed I made a mistake.
As you pointed out, my calculation is wrong.
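
For the record, computing the ratio as prototype / master gives, for example:

  SELECT round(130541.0 / 131297 * 100, 1);  -- 99.4%, not 101%
  SELECT round(4965.0 / 4956 * 100, 1);      -- 100.2%, not 95%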

I also need to change some settings of postgresql.conf and pgbench. 
So I'm going to measure performance again and submit the result.

Regards,
Takeshi Ideriha




Re: Failed assertion due to procedure created with SECURITY DEFINER option

2018-07-04 Thread Peter Eisentraut
On 03.07.18 19:20, Andres Freund wrote:
> On 2018-06-29 10:19:17 -0700, Andres Freund wrote:
>> Hi,
>>
>> On 2018-06-29 13:56:12 +0200, Peter Eisentraut wrote:
>>> On 6/29/18 13:07, amul sul wrote:
>>>> This happens because in the fmgr_security_definer() function we are
>>>> changing the global variable SecurityRestrictionContext, and in
>>>> StartTransaction() we insist it should be zero, which is the problem.
>>>
>>> Hmm, what is the reason for this insistence?
>>
>> Because it's supposed to be reset by AbortTransaction(), after an error.
> 
> Does that make sense Peter?
> 
> I've added this thread to the open items list.

Proposed fix attached.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From d050e000ef2fa3856da274512ff3a111970cd180 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 4 Jul 2018 09:26:19 +0200
Subject: [PATCH] Prohibit transaction commands in security definer procedures

Starting and aborting transactions in security definer procedures
doesn't work.  StartTransaction() insists that the security context
stack is empty, so this would currently cause a crash, and
AbortTransaction() resets it.  This could be made to work by
reorganizing the code, but right now we just prohibit it.

Reported-by: amul sul 
Discussion: 
https://www.postgresql.org/message-id/flat/CAAJ_b96Gupt_LFL7uNyy3c50-wbhA68NUjiK5%3DrF6_w%3Dpq_T%3DQ%40mail.gmail.com
---
 doc/src/sgml/ref/create_procedure.sgml  |  6 ++
 src/backend/commands/functioncmds.c |  9 +
 src/pl/plpgsql/src/expected/plpgsql_transaction.out | 12 
 src/pl/plpgsql/src/sql/plpgsql_transaction.sql  | 13 +
 4 files changed, 40 insertions(+)

diff --git a/doc/src/sgml/ref/create_procedure.sgml 
b/doc/src/sgml/ref/create_procedure.sgml
index f3c3bb006c..6c1de34b01 100644
--- a/doc/src/sgml/ref/create_procedure.sgml
+++ b/doc/src/sgml/ref/create_procedure.sgml
@@ -203,6 +203,12 @@ Parameters
   conformance, but it is optional since, unlike in SQL, this feature
   applies to all procedures not only external ones.
  
+
+ 
+  A SECURITY DEFINER procedure cannot execute
+  transaction control statements (for example, COMMIT
+  and ROLLBACK, depending on the language).
+ 
 

 
diff --git a/src/backend/commands/functioncmds.c 
b/src/backend/commands/functioncmds.c
index 8864d9ae44..2cadc2e7b2 100644
--- a/src/backend/commands/functioncmds.c
+++ b/src/backend/commands/functioncmds.c
@@ -2244,6 +2244,15 @@ ExecuteCallStmt(CallStmt *stmt, ParamListInfo params, 
bool atomic, DestReceiver
if (!heap_attisnull(tp, Anum_pg_proc_proconfig, NULL))
callcontext->atomic = true;
 
+   /*
+* In security definer procedures, we can't allow transaction commands.
+* StartTransaction() insists that the security context stack is empty,
+* and AbortTransaction() resets the security context.  This could be
+* reorganized, but right now it doesn't work.
+*/
+   if (((Form_pg_proc) GETSTRUCT(tp))->prosecdef)
+   callcontext->atomic = true;
+
/*
 * Expand named arguments, defaults, etc.
 */
diff --git a/src/pl/plpgsql/src/expected/plpgsql_transaction.out 
b/src/pl/plpgsql/src/expected/plpgsql_transaction.out
index 7f008ac57e..1967b3fc53 100644
--- a/src/pl/plpgsql/src/expected/plpgsql_transaction.out
+++ b/src/pl/plpgsql/src/expected/plpgsql_transaction.out
@@ -130,6 +130,18 @@ $$;
 CALL transaction_test5();
 ERROR:  invalid transaction termination
 CONTEXT:  PL/pgSQL function transaction_test5() line 3 at COMMIT
+-- SECURITY DEFINER currently disallows transaction statements
+CREATE PROCEDURE transaction_test5b()
+LANGUAGE plpgsql
+SECURITY DEFINER
+AS $$
+BEGIN
+COMMIT;
+END;
+$$;
+CALL transaction_test5b();
+ERROR:  invalid transaction termination
+CONTEXT:  PL/pgSQL function transaction_test5b() line 3 at COMMIT
 TRUNCATE test1;
 -- nested procedure calls
 CREATE PROCEDURE transaction_test6(c text)
diff --git a/src/pl/plpgsql/src/sql/plpgsql_transaction.sql 
b/src/pl/plpgsql/src/sql/plpgsql_transaction.sql
index eddc518bb6..2bdca8e165 100644
--- a/src/pl/plpgsql/src/sql/plpgsql_transaction.sql
+++ b/src/pl/plpgsql/src/sql/plpgsql_transaction.sql
@@ -116,6 +116,19 @@ CREATE PROCEDURE transaction_test5()
 CALL transaction_test5();
 
 
+-- SECURITY DEFINER currently disallows transaction statements
+CREATE PROCEDURE transaction_test5b()
+LANGUAGE plpgsql
+SECURITY DEFINER
+AS $$
+BEGIN
+COMMIT;
+END;
+$$;
+
+CALL transaction_test5b();
+
+
 TRUNCATE test1;
 
 -- nested procedure calls

base-commit: fc057b2b8fc3c3556d9f8cc0195c622aaad92c9e
-- 
2.18.0



Re: Test-cases for deferred constraints in plpgsql_transaction.sql

2018-07-04 Thread Peter Eisentraut
On 02.07.18 17:11, Ashutosh Sharma wrote:
> Firstly, it would test if the ROLLBACK works as expected when used
> with the deferred constraints in plpgsql procedures. Secondly, it
> would test if the COMMIT/ROLLBACK works as expected for deferrable
> constraints which was initially immediate type but, later it got set
> to deferred using SET CONSTRAINTS command. I just raised this point
> because i couldn't find such test anywhere in plpgsl_transaction.sql
> file. Please let me know if i am missing something here. Thanks.

Deferred constraints operate on a level below PL/pgSQL.  PL/pgSQL just
calls the internal commit and rollback functions, and those handle the
rest.  So I don't think we gain much by testing all the functionality
that is associated with transactions again in each procedural language.
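
For reference, the kind of test being discussed would be along these lines
(a sketch; the table and procedure names are made up):

  CREATE TABLE dc_test (a int PRIMARY KEY DEFERRABLE INITIALLY IMMEDIATE);

  CREATE PROCEDURE dc_proc()
  LANGUAGE plpgsql
  AS $$
  BEGIN
      SET CONSTRAINTS ALL DEFERRED;
      INSERT INTO dc_test VALUES (1), (1);  -- duplicates; check is deferred
      ROLLBACK;                             -- violation is never raised
  END;
  $$;

  CALL dc_proc();
  SELECT count(*) FROM dc_test;  -- 0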

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Recovery performance of DROP DATABASE with many tablespaces

2018-07-04 Thread Jamison, Kirk
Hi, Fujii-san

I came across this post and got interested in it, so I tried to apply and
test the patch, but I am not sure if I did it correctly. I set up
master-slave sync with shared_buffers = 200GB, max_locks_per_transaction = 2,
and 1 DB with 500 table partitions spread evenly across 5 tablespaces.

After dropping the DB, with or without the patch, there was no difference
in recovery performance, so maybe I made a mistake somewhere. But anyway,
here are the results.

==WITHOUT PATCH===
[200GB shared buffers]
DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
2018/07/04_13:35:00.161
dropdb
2018/07/04_13:35:05.591 5.591 sec

[200GB shared_buffers]
DROPDB (including DROP TABLE and DROP TABLESPACE)
real    3m19.717s
user    0m0.001s
sys     0m0.001s

==WITH PATCH===
[200GB shared_buffers]
DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
2018/07/04_14:19:47.128
dropdb
2018/07/04_14:19:53.177 6.049 sec

[200GB shared_buffers]
DROPDB (included the DROP TABLE and DROP TABLESPACE commands)
real    3m51.834s
user    0m0.001s
sys     0m0.002s

Just in case, do you also have some performance test numbers/cases that
show the recovery performance improvement when dropping a database that
contains multiple tablespaces?

Regards,
Kirk Jamison


Re: [HACKERS] Restricting maximum keep segments by repslots

2018-07-04 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 26 Jun 2018 16:26:59 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180626.162659.223208514.horiguchi.kyot...@lab.ntt.co.jp>
> The previous patch files don't have version numbers, so I labeled the
> attached latest version v2.
> 
> 
> v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch
>   The body of the limiting feature
> 
> v2-0002-Add-monitoring-aid-for-max_replication_slots.patch
>   Shows the status of WAL retention in the pg_replication_slots view
> 
> v2-0003-TAP-test-for-the-slot-limit-feature.patch
>   TAP test for this feature
> 
> v2-0004-Documentation-for-slot-limit-feature.patch
>   Documentation, as the name.

Travis (the test_decoding test) showed that GetOldestXLogFileSegNo,
added by 0002, forgets to close the temporarily opened pg_wal
directory. This is the fixed version, v3.
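
For anyone trying the patches: assuming the GUC added by 0001 is exposed
as max_slot_wal_keep_size (in MB units, with 0 meaning no limit, per the
default in the patch), it can be set like any other GUC, e.g.:

  ALTER SYSTEM SET max_slot_wal_keep_size = '1GB';
  SELECT pg_reload_conf();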

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5ef0e221ee29a185743576cbdc93ca79f649d1d3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Dec 2017 21:20:20 +0900
Subject: [PATCH 1/4] Add WAL releaf vent for replication slots

Adds a capability to limit the number of segments kept by replication
slots by a GUC variable.
---
 src/backend/access/transam/xlog.c | 116 +-
 src/backend/utils/misc/guc.c  |  11 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h |   1 +
 4 files changed, 108 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0981657801..959d18e029 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -105,6 +105,7 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			max_slot_wal_keep_size_mb = 0;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -861,6 +862,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 static void LocalSetXLogInsertAllowed(void);
 static void CreateEndOfRecoveryRecord(void);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
+static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
@@ -9377,6 +9379,74 @@ CreateRestartPoint(int flags)
 	return true;
 }
 
+/*
+ * Returns the minimum segment number the next checkpoint must leave, considering
+ * wal_keep_segments, replication slots and max_slot_wal_keep_size.
+ */
+static XLogSegNo
+GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr)
+{
+	uint64 keepSegs;
+	XLogSegNo currSeg;
+	XLogSegNo tailSeg;
+	uint64 slotlimitbytes;
+	uint64 slotlimitfragment;
+	uint64 currposoff;
+	XLogRecPtr slotpos = minSlotPtr;
+	XLogSegNo	slotSeg;
+
+	Assert(wal_keep_segments >= 0);
+	Assert(max_slot_wal_keep_size_mb >= 0);
+
+	XLByteToSeg(currpos, currSeg, wal_segment_size);
+	XLByteToSeg(slotpos, slotSeg, wal_segment_size);
+
+	/*
+	 * If wal_keep_segments keeps more segments than the slot, slotpos is no
+	 * longer useful. Don't perform subtraction to keep values positive.
+	 */
+	if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments)
+		slotpos = InvalidXLogRecPtr;
+
+	/* slots aren't useful, consider only wal_keep_segments */
+	if (slotpos == InvalidXLogRecPtr)
+	{
+		/* avoid underflow, don't go below 1 */
+		if (currSeg <= wal_keep_segments)
+			return 1;
+
+		return currSeg - wal_keep_segments;
+	}
+
+	/* just return slotSeg if we don't put a limit */
+	if (max_slot_wal_keep_size_mb == 0)
+		return slotSeg;
+
+	/*
+	 * Slot limit is defined and slot gives the oldest segment to keep,
+	 * calculate the oldest segment that should not be removed
+	 */
+	slotlimitbytes = 1024 * 1024 * max_slot_wal_keep_size_mb;
+	slotlimitfragment = XLogSegmentOffset(slotlimitbytes,
+ wal_segment_size);
+	currposoff = XLogSegmentOffset(currpos, wal_segment_size);
+	keepSegs = wal_keep_segments +
+		ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
+	if (currposoff < slotlimitfragment)
+		keepSegs++;
+
+	/*
+	 * calculate the oldest segment that is kept by wal_keep_segments and
+	 * max_slot_wal_keep_size.
+	 */
+	if (currSeg <= keepSegs)
+		tailSeg = 1;
+	else
+		tailSeg = currSeg - keepSegs;
+
+	return tailSeg;
+}
+
 /*
  * Retreat *logSegNo to the last segment that we need to retain because of
  * either wal_keep_segments or replication slots.
@@ -9389,33 +9459,37 @@ static void
 KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
 {
 	XLogSegNo	segno;
-	XLogRecPtr	keep;
+	XLogRecPtr	slotminptr = InvalidXLogRecPtr;
+	XLogSegNo	minSegNo;
+	XLogSegNo	slotSegNo;
 
 	XLByteToSeg(recptr, segno, wal_segment_size);
-	keep = XLogGetReplicationSlotMinimumLSN();
 
-	/* compute limit for wal_keep_segments first */
-	if (wal_keep_

Re: shared-memory based stats collector

2018-07-04 Thread Kyotaro HORIGUCHI
Hello. This is a new version that fixes the Windows build.

At Tue, 03 Jul 2018 19:01:44 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180703.190144.222427588.horiguchi.kyot...@lab.ntt.co.jp>
> Hello. Thanks for the comment.
> 
> At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas  wrote 
> in 
> > Copying the whole hash table kinds of sucks, partly because of the
> > time it will take to copy it, but also because it means that memory
> > usage is still O(nbackends * ntables).  Without looking at the patch,
> > I'm guessing that you're doing that because we need a way to show each
> > transaction a consistent snapshot of the data, and I admit that I
> > don't see another obvious way to tackle that problem.  Still, it would
> > be nice if we had a better idea.
> 
> The consistency here means "repeatable read" of an object's stats
> entry, not a snapshot covering all objects. We don't need to copy
> all the entries at once following this definition. The attached
> version makes a cache entry only for requested objects.
> 
> In addition to that, vacuum doesn't require even repeatable-read
> consistency, so we don't need to cache the entries at all.
> backend_get_tab_entry now returns an isolated (that means
> not-stored-in-hash) palloc'ed copy without making a local copy in
> that case.
> 
> As a result, this version behaves as follows.
> 
> - Stats collector stores the results in shared memory.
> 
> - In backend, cache is created only for requested objects and
>   lasts for the transaction.
> 
> - Vacuum directly reads the shared stats and doesn't create a
>   local copy.
> 
> The non-behavioral difference from the v1 is the follows.
> 
> - snapshot feature of dshash is removed.

This version includes some additional patches. 0003 removes
PG_STAT_TMP_DIR, which affects pg_stat_statements, pg_basebackup
and pg_rewind. Among them, pg_stat_statements gets a build failure
because it uses the directory to save query texts. 0005 is a new
patch that moves the file to the permanent stats directory. With
this change pg_basebackup and pg_rewind no longer ignore the
query text file.

I haven't explicitly mentioned it before, but
dynamic_shared_memory_type = none prevents the server from
starting. This patch does not provide a fallback path for that
case. I'm expecting that 'none' will be removed in v12.

v3-0001-sequential-scan-for-dshash.patch
  - Functionally same with v2. got cosmetic changes.

v3-0002-Change-stats-collector-to-an-axiliary-process.patch
  - Fixed for Windows build.

v3-0003-dshash-based-stats-collector.patch
  - Cosmetic changes from v2.

v3-0004-Documentation-update.patch
  - New patch in v3 containing documentation edits.

v3-0005-Let-pg_stat_statements-not-to-use-PG_STAT_TMP_DIR.patch
  - New patch with a tentative change to pg_stat_statements.

v3-0006-Remove-pg_stat_tmp-exclusion-from-pg_resetwal.patch
  - New patch with a tentative change for pg_rewind.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 319535ed679495f9e86d417b1a23bb30ed80495f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/6] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 138 +++
 src/include/lib/dshash.h |  23 +++-
 2 files changed, 160 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..5b133226ac 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -592,6 +592,144 @@ dshash_memhash(const void *v, size_t size, void *arg)
 	return tag_hash(v, size);
 }
 
+/*
+ * dshash_seq_init/_next/_term
+ *   Sequentially scan through a dshash table and return all the
+ *   elements one by one; return NULL when there are no more.
+ *
+ * dshash_seq_term should be called if and only if the scan is abandoned
+ * before completion; if dshash_seq_next returns NULL then it has already done
+ * the end-of-scan cleanup.
+ *
+ * The returned element is locked, as is the case with dshash_find.
+ * However, the caller must not release the lock. The lock is released as
+ * necessary as the scan continues.
+ *
+ * As opposed to the equivalent for dynahash, the caller is not supposed to
+ * delete the returned element before continuing the scan.
+ *
+ * If consistent is set for dshash_seq_init, the whole hash table is
+ * non-exclusively locked. Otherwise a part of the hash table is locked in the
+ * same mode (partition lock).
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+bool consistent)
+{
+	status->hash_table = hash_table;
+	status->curbucket = 0;
+	status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+	status->curitem = NULL;
+	status->curpartition = -1;
+	status->consistent = consistent;
+
+	/*
+	 * Protect all partitions from modification if the caller wants a
+	 * consistent result.
+	 */
+	if (consistent)
+	{
+		int i;
+
+		for (i = 0; i < DSHASH_NUM_PARTITIONS; ++

Re: CREATE TABLE .. LIKE .. EXCLUDING documentation

2018-07-04 Thread Peter Eisentraut
On 02.07.18 10:38, Daniel Gustafsson wrote:
>> On 29 Jun 2018, at 18:44, Tom Lane  wrote:
> 
>> +1 for shortening it as proposed by Peter.  The existing arrangement
>> made sense when it was first written, when there were only about three
>> individual options IIRC.  Now it's just confusing, especially since you
>> can't tell very easily whether any of the individual options were
>> intentionally omitted from the list.  It will not get better with
>> more options, either.
> 
> Marking this "Waiting for Author", awaiting an updated version expanded
> with the above comment.

I ended up rewriting that whole section a bit to give it more structure.
I included all the points discussed in this thread.
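
For reference, the options under discussion combine like this (a sketch;
table names are made up):

  CREATE TABLE src (a int PRIMARY KEY, b text DEFAULT 'x');
  -- copy everything:
  CREATE TABLE copy_all (LIKE src INCLUDING ALL);
  -- copy everything except indexes:
  CREATE TABLE copy_most (LIKE src INCLUDING ALL EXCLUDING INDEXES);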

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-04 Thread Thomas Munro
On Wed, Jul 4, 2018 at 4:35 PM, Michael Paquier  wrote:
> I wanted to comment on that this morning but forgot as my mind was
> driven away by another problem.  What if you used the Julien-Rouhaud's
> method of a custom script with only ";" used as query and -c?  This
> won't run any queries, and will stress authentication.

Ok, I tried "pgbench -c 8 -j 8 -T 60 --connect -f empty.sql postgres"
where empty.sql contains just ";", and the numbers were noisy but
around 1160 TPS with or without a patch.

On Wed, Jul 4, 2018 at 4:35 PM, Andres Freund  wrote:
> On 2018-07-04 16:25:18 +1200, Thomas Munro wrote:
>>   PerformAuthentication(MyProcPort);
>> + AcceptInvalidationMessages();
>>   InitializeSessionUserId(username, useroid);
>
> FWIW, a comment explaining why it's done there seems appropriate.

Done.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Call-AcceptInvalidationMessages-after-authenticat-v2.patch
Description: Binary data


[PATCH] btree_gist: fix union implementation for variable length columns

2018-07-04 Thread Pavel Raiskup
Hi hackers,

while I tried to debug 'gcc -fstack-protector -O3' problems in [1], I noticed
that gbt_var_union() mistreats the first vector element.  Patch is attached.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1544349
Pavel
>From 4e4f0fe8d2f74a85f37abdf095ab8aecf9329596 Mon Sep 17 00:00:00 2001
From: Pavel Raiskup 
Date: Wed, 4 Jul 2018 11:39:38 +0200
Subject: [PATCH] Fix btree_gist "union" for variable length types

gbt_var_union() did not expect that the first key in the entryvec
could represent a leaf node, while it often does.  So try to call
f_l2n() even for the first element, the same as we do for all the
other elements in gbt_var_bin_union().

This bug could create wrong indexes, so it is suggested to REINDEX
them.

diff --git a/contrib/btree_gist/btree_bit.c b/contrib/btree_gist/btree_bit.c
index 2225244ded..432b788204 100644
*** a/contrib/btree_gist/btree_bit.c
--- b/contrib/btree_gist/btree_bit.c
***
*** 75,81  gbt_bitcmp(const void *a, const void *b, Oid collation, FmgrInfo *flinfo)
  static bytea *
  gbt_bit_xfrm(bytea *leaf)
  {
! 	bytea	   *out = leaf;
  	int			sz = VARBITBYTES(leaf) + VARHDRSZ;
  	int			padded_sz = INTALIGN(sz);
  
--- 75,81 
  static bytea *
  gbt_bit_xfrm(bytea *leaf)
  {
! 	bytea	   *out;
  	int			sz = VARBITBYTES(leaf) + VARHDRSZ;
  	int			padded_sz = INTALIGN(sz);
  
diff --git a/contrib/btree_gist/btree_utils_var.c b/contrib/btree_gist/btree_utils_var.c
index 670c879e77..6d8299d9b5 100644
*** a/contrib/btree_gist/btree_utils_var.c
--- b/contrib/btree_gist/btree_utils_var.c
***
*** 230,252  gbt_var_node_truncate(const GBT_VARKEY *node, int32 cpf_length, const gbtree_vin
  }
  
  
  
  void
  gbt_var_bin_union(Datum *u, GBT_VARKEY *e, Oid collation,
    const gbtree_vinfo *tinfo, FmgrInfo *flinfo)
  {
! 	GBT_VARKEY_R eo = gbt_var_key_readable(e);
  	GBT_VARKEY_R nr;
  
- 	if (eo.lower == eo.upper)	/* leaf */
- 	{
- 		GBT_VARKEY *tmp;
- 
- 		tmp = gbt_var_leaf2node(e, tinfo, flinfo);
- 		if (tmp != e)
- 			eo = gbt_var_key_readable(tmp);
- 	}
- 
  	if (DatumGetPointer(*u))
  	{
  		GBT_VARKEY_R ro = gbt_var_key_readable((GBT_VARKEY *) DatumGetPointer(*u));
--- 230,258 
  }
  
  
+ static GBT_VARKEY_R
+ gbt_var_key_readable_resolved(GBT_VARKEY *key, const gbtree_vinfo *tinfo,
+ 			  FmgrInfo *flinfo)
+ {
+ 	GBT_VARKEY_R out = gbt_var_key_readable(key);
+ 	if (out.lower == out.upper) /* leaf */
+ 	{
+ 		GBT_VARKEY *tmp;
+ 		tmp = gbt_var_leaf2node(key, tinfo, flinfo);
+ 		if (tmp != key)
+ 			out = gbt_var_key_readable(tmp);
+ 	}
+ 	return out;
+ }
+ 
  
  void
  gbt_var_bin_union(Datum *u, GBT_VARKEY *e, Oid collation,
    const gbtree_vinfo *tinfo, FmgrInfo *flinfo)
  {
! 	GBT_VARKEY_R eo = gbt_var_key_readable_resolved(e, tinfo, flinfo);
  	GBT_VARKEY_R nr;
  
  	if (DatumGetPointer(*u))
  	{
  		GBT_VARKEY_R ro = gbt_var_key_readable((GBT_VARKEY *) DatumGetPointer(*u));
***
*** 333,339  gbt_var_union(const GistEntryVector *entryvec, int32 *size, Oid collation,
  	*size = sizeof(GBT_VARKEY);
  
  	cur = (GBT_VARKEY *) DatumGetPointer(entryvec->vector[0].key);
! 	rk = gbt_var_key_readable(cur);
  	out = PointerGetDatum(gbt_var_key_copy(&rk));
  
  	for (i = 1; i < numranges; i++)
--- 339,345 
  	*size = sizeof(GBT_VARKEY);
  
  	cur = (GBT_VARKEY *) DatumGetPointer(entryvec->vector[0].key);
! 	rk = gbt_var_key_readable_resolved(cur, tinfo, flinfo);
  	out = PointerGetDatum(gbt_var_key_copy(&rk));
  
  	for (i = 1; i < numranges; i++)


Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-04 Thread Ashutosh Bapat
On Fri, Jun 29, 2018 at 6:21 PM, Etsuro Fujita
 wrote:
> (2018/06/22 22:54), Ashutosh Bapat wrote:
>>
>> I have started reviewing the patch.
>
>
> Thanks for the review!
>
>> +   if (enable_partitionwise_join && rel->top_parent_is_partitioned)
>> +   {
>> +       build_childrel_tlist(root, rel, childrel, 1, &appinfo);
>> +   }
>>
>> Why do we need rel->top_parent_is_partitioned? If a relation is
>> partitioned (if (rel->part_scheme), it's either the top parent or is
>> partition of some other partitioned table. In either case this
>> condition will be true.
>
>
> This would be needed to avoid unnecessarily applying build_childrel_tlist to
> child rels of a partitioned table for which we don't consider partitionwise
> join.  Consider:
>
> postgres=# create table lpt (c1 int, c2 text) partition by list (c1);
> CREATE TABLE
> postgres=# create table lpt_p1 partition of lpt for values in (1);
> CREATE TABLE
> postgres=# create table lpt_p2 (c1 int check (c1 in (2)), c2 text);
> CREATE TABLE
> postgres=# create table test (c1 int, c2 text);
> CREATE TABLE
> postgres=# explain verbose select * from (select * from lpt union all select
> * from lpt_p2) ss(c1, c2) inner join test on (ss.c1 = test.c1);

--- plan clipped

>
> In this example, the top parent is not a partitioned table and we don't need
> to consider partitionwise join for that, so we don't need to apply that to
> the child rel lpt_p1 of the partitioned table lpt (and the table lpt_p2).

FWIW, the test is not sufficient. In the above example, a simple
select * from lpt inner join test where lpt.c1 = test.c1 would not use
the partition-wise join technique. Why not avoid build_childrel_tlist()
in that case as well? Worse, we may change the criteria for using
partition-wise join in the future; e.g. the above case could use
partition-wise join by 1. merging lpt_p1 into the corresponding partition
of lpt so that ss is partitioned and 2. repartitioning test or joining
test with each partition of lpt separately. In those cases the changes
you are doing here would need to be reverted and put somewhere else.
There's already a patch out there to reverse the order of join and
grouping. How would this work with that?

In general I think set_append_rel_size() or build_simple_rel() is not
the place where we should decide whether a given relation will
participate in a PWJ. Also, we should not selectively add a
ConvertRowtypeExpr on top of the whole-row Var of some child relations;
we should do it for all the child rels or shouldn't do it at all. An
in-between state will produce a lot of confusion for any further
optimization. Whenever we change the code around partition-wise
operations in the future, we will have to check whether or not a given
child rel has its whole-row Var embedded in a ConvertRowtypeExpr. As I
have mentioned earlier, I am also not comfortable with the targetlists
of child relations being type-inconsistent with that of the parent;
type consistency is a fundamental rule of inheritance. Worse, the patch
keeps them inconsistent during path creation and makes them consistent
at the time of creating plans. A child's whole-row Var has the datatype
of the child, whereas that of the parent has the datatype of the parent.
A ConvertRowtypeExpr is used to fix this inconsistency. That's why I
chose pull_var_clause() as the place to fix the problem, rather than
fixing the ConvertRowtypeExpr in the targetlist itself.

I am also not comfortable with just avoiding ConvertRowtypeExprs in the
targetlist while leaving them as-is in the conditions, ECs, etc. This
means that searching for a ConvertRowtypeExpr, e.g. for creating pathkeys
from the targetlist, will fail at path creation time but succeed at plan
creation time.

This is turning out to be more invasive than my approach of fixing it
through PVC.

>
>> +   /* No work if the child plan is an Append or MergeAppend */
>> +   if (IsA(subplan, Append) || IsA(subplan, MergeAppend))
>> +   return;
>>
>> Why? Probably it's related to the answer to the first question, But I
>> don't see the connection. Given that partition-wise join will be
>> applicable level by level, we need to recurse in
>> adjust_subplan_tlist().
>
>
> I don't think so; I think if the child plan is an Append or MergeAppend, the
> tlist for each subplan of the Append or MergeAppend would have already been
> adjusted while create_plan_recurse before we are called here.

Ok. Thanks for the clarification. I think we should add a comment.

>
>> +   /* The child plan should be able to do projection */
>> +   Assert(is_projection_capable_plan(subplan));
>> +
>> Again why? A MergeAppend's subplan could be a Sort plan, which will
>> not be projection capable.
>
>
> IIUC, since we are called here right after create_plan_recurse, the child
> plan would be a scan or join unless it's neither Append nor MergeAppend.  So
> it should be projection-capable.  Maybe I'm missing something, though.

That's not true. add_paths_to_append_rel() adds sort paths on top of scans
if necessary and creates a merge append path.

-- 
Best Wishes,
Ashutosh Bapat

Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-04 Thread Etsuro Fujita

(2018/07/02 18:46), Etsuro Fujita wrote:

(2018/06/22 23:58), Robert Haas wrote:

And, in general, it seems to me that we want
to produce the right outputs at the lowest possible level of the plan
tree. For instance, suppose that one of the relations being scanned
is not parallel-safe but the others are. Then, we could end up with a
plan like this:

Append
-> Seq Scan on temp_rela
-> Gather
   -> Parallel Seq Scan on thing1
-> Gather
   -> Parallel Seq Scan on thing2

If a projection is required to convert the row type expression, we
certainly want that to get pushed below the Gather, not to happen at
the Gather level itself.


IIUC, we currently don't consider such a plan for partition-wise join;
we'll only consider gathering partial paths for the parent appendrel.


While updating the patch, I noticed that I'm wrong here.  IIUC,
apply_scanjoin_target_to_paths would allow us to consider such an Append
for a partitioned relation when scanjoin_target_parallel_safe = false, as
it generates new Append paths by recursively adjusting all partitions, for
which we call generate_gather_paths in that case, as shown below:


    /*
     * If the scan/join target is not parallel-safe, partial paths cannot
     * generate it.
     */
    if (!scanjoin_target_parallel_safe)
    {
        /*
         * Since we can't generate the final scan/join target, this is our
         * last opportunity to use any partial paths that exist.  We don't do
         * this in the case where the target is parallel-safe, since we will
         * be able to generate superior paths by doing it after the final
         * scan/join target has been applied.
         *
         * Note that this may invalidate rel->cheapest_total_path, so we must
         * not rely on it after this point without first calling set_cheapest.
         */
        generate_gather_paths(root, rel, false);

        /* Can't use parallel query above this level. */
        rel->partial_pathlist = NIL;
        rel->consider_parallel = false;
    }

I haven't produced a test case where the plan is an Append with Gather
subplans, but I'm not sure that it's a good thing to allow us to consider
such a plan, because in that plan each Gather would try to grab its own
pool of workers.  Am I missing something?


Best regards,
Etsuro Fujita



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-04 Thread Ashutosh Bapat
On Wed, Jul 4, 2018 at 3:36 PM, Etsuro Fujita
 wrote:
>
> I don't produce a test case where the plan is an Append with Gather
> subplans, but I'm not sure that it's a good thing to allow us to consider
> such a plan because in that plan, each Gather would try to grab its own pool
> of workers.  Am I missing something?

If the overall join cannot use parallelism, having individual child
joins use parallel query might turn out to be efficient even if done
one child at a time. Parallel append drastically reduces the cases
where something like this could be useful, but I don't think we can
theoretically eliminate the need for such a plan.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Pluggable Storage - Andres's take

2018-07-04 Thread Haribabu Kommi
On Tue, Jul 3, 2018 at 5:06 PM Andres Freund  wrote:

> Hi,
>
> As I've previously mentioned I had planned to spend some time to polish
> Haribabu's version of the pluggable storage patch and rebase it on the
> vtable based slot approach from [1]. While doing so I found more and
> more things that I previously hadn't noticed. I started rewriting things
> into something closer to what I think we want architecturally.
>

Thanks for the deep review and changes.


> The current state of my version of the patch is *NOT* ready for proper
> review (it doesn't even pass all tests, there's FIXME / elog()s).  But I
> think it's getting close enough to it's eventual shape that more eyes,
> and potentially more hands on keyboards, can be useful.
>

I will try to update it to make sure that it passes all the tests and also
try to
reduce the FIXME's.


> The most fundamental issues I had with Haribabu's last version from [2]
> are the following:
>
> - The use of TableTuple, a typedef from void *, is bad from multiple
>   fronts. For one it reduces just about all type safety. There were
>   numerous bugs in the patch where things were just cast from HeapTuple
>   to TableTuple to HeapTuple (and even to TupleTableSlot).  I think it's
>   a really, really bad idea to introduce a vague type like this for
>   development purposes alone, it makes it way too hard to refactor -
>   essentially throwing the biggest benefit of type safe languages out of
>   the window.
>

My earlier intention was to remove the HeapTuple usage entirely and
replace it with a slot everywhere outside the tableam. But it ended up
with TableTuple before reaching that stage, because of the heavy use of
HeapTuple.


>   Additionally I think it's also the wrong approach architecturally. We
>   shouldn't assume that a tuple can efficiently be represented as a
>   single palloc'ed chunk. In fact, we should move *away* from relying on
>   that so much.
>
>   I've thus removed the TableTuple type entirely.
>

Thanks for the changes. I haven't checked the code yet; so for now,
whenever a HeapTuple is required it will be generated from the slot?


> - Previous verions of the patchset exposed Buffers in the tableam.h API,
>   performed buffer locking / pinning / ExecStoreTuple() calls outside of
>   it.  That is wrong in my opinion, as various AMs will deal very
>   differently with buffer pinning & locking. The relevant logic is
>   largely moved within the AM.  Bringing me to the next point:
>
>
> - tableam exposed various operations based on HeapTuple/TableTuple's
>   (and their Buffers). This all need be slot based, as we can't
>   represent the way each AM will deal with this.  I've largely converted
>   the API to be slot based.  That has some fallout, but I think largely
>   works.  Lots of outdated comments.
>

Yes, I agree with you.


> - I think the move of the indexing from outside the table layer into the
>   storage layer isn't a good idea. It lead to having to pass EState into
>   the tableam, a callback API to perform index updates, etc.  This seems
>   to have at least partially been triggered by the speculative insertion
>   codepaths.  I've reverted this part of the changes.  The speculative
>   insertion / confirm codepaths are now exposed to tableam.h - I think
>   that's the right thing because we'll likely want to have that
>   functionality across more than a single tuple in the future.
>
>
> - The visibility functions relied on the *caller* performing buffer
>   locking. That's not a great idea, because generic code shouldn't know
>   about the locking scheme a particular AM needs.  I've changed the
>   external visibility functions to instead take a slot, and perform the
>   necessary locking inside.
>

When I first moved all the visibility functions into tableam, I found this
problem, and it will be good if the API takes care of buffer locking etc.


> - There were numerous tableam callback uses inside heapam.c - that makes
>   no sense, we know what the storage is therein.  The relevant
>
>
> - The integration between index lookups and heap lookups based on the
>   results on a index lookup was IMO too tight.  The index code dealt
>   with heap tuples, which isn't great.  I've introduced a new concept, a
>   'IndexFetchTableData' scan. It's initialized when building an index
>   scan, and provides the necessary state (say current heap buffer), to
>   do table lookups from within a heap.
>

I agree that it will be good with the new concept from index to access the
heap.


> - The am of relations required for bootstrapping was set to 0 - I don't
>   think that's a good idea. I changed it so it's set to the heap AM as
>   well.
>
>
> - HOT was encoded in the API in a bunch of places. That doesn't look
>   right to me. I tried to improve a bit on that, but I'm not yet quite
>   sure I like it. Needs written explanation & arguments...
>
>
> - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
>   a higher memory usage, because the re

Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-07-04 Thread Haribabu Kommi
On Mon, Jul 2, 2018 at 6:42 PM Sergei Kornilov  wrote:

> Hello
>

Thanks for the review.


> I found SELECT pg_stat_statements_reset(NULL,NULL,s.queryid) in tests and
> it pass tests, but i wonder how it works. Should not we check the NULL
> through PG_ARGISNULL macro before any PG_GETARG_*? According
> src/include/fmgr.h
> > * If function is not marked "proisstrict" in pg_proc, it must check for
> > * null arguments using this macro.  Do not try to GETARG a null argument!
>

Thanks for checking, Added.


> > pg_stat_statements_reset(userid Oid, dbid Oid, queryid bigint) returns
> void
> And you forgot to change return type in docs (and description of return
> value)
>

Corrected and also added details of the returns value.

Update patch attached.
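
For reference, the new signatures can be exercised like this (a sketch
based on the patch's regression test):

  -- reset everything; now returns the number of entries removed:
  SELECT pg_stat_statements_reset();
  -- reset only the entries matching a particular queryid:
  SELECT pg_stat_statements_reset(NULL, NULL, s.queryid)
    FROM pg_stat_statements s
   WHERE s.query = 'SELECT $1 AS "int"';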

Regards,
Haribabu Kommi
Fujitsu Australia
From f3d4c1a319ceefafabd45039b4b3447fb126e91c Mon Sep 17 00:00:00 2001
From: Hari Babu 
Date: Fri, 29 Jun 2018 01:17:41 +1000
Subject: [PATCH] pg_stat_statements_reset to reset specific query/user/db
 statistics

pg_stat_statements_reset() can now accept input parameters
userid, dbid and queryid. Based on the specified input values,
the matching statistics entries are released. The new behaviour is
useful for getting fresh statistics for a specific
query/user/database for observation, without resetting all the
existing statistics. If no parameters are passed, it releases all
entries, as per the old behaviour.

Backward compatibility change: the pg_stat_statements_reset()
function now returns the number of statement entries that were reset.
---
 contrib/pg_stat_statements/Makefile|   8 +-
 .../expected/pg_stat_statements.out| 134 -
 .../pg_stat_statements--1.5--1.6.sql   |  24 
 contrib/pg_stat_statements/pg_stat_statements.c| 120 +-
 .../pg_stat_statements/pg_stat_statements.control  |   2 +-
 .../pg_stat_statements/sql/pg_stat_statements.sql  |  45 +++
 doc/src/sgml/pgstatstatements.sgml |  16 ++-
 7 files changed, 303 insertions(+), 46 deletions(-)
 create mode 100644 contrib/pg_stat_statements/pg_stat_statements--1.5--1.6.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 39b368b70e..f0942f829e 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,10 +4,10 @@ MODULE_big = pg_stat_statements
 OBJS = pg_stat_statements.o $(WIN32RES)
 
 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.4--1.5.sql \
-	pg_stat_statements--1.3--1.4.sql pg_stat_statements--1.2--1.3.sql \
-	pg_stat_statements--1.1--1.2.sql pg_stat_statements--1.0--1.1.sql \
-	pg_stat_statements--unpackaged--1.0.sql
+DATA = pg_stat_statements--1.4.sql pg_stat_statements--1.5--1.6.sql \
+	pg_stat_statements--1.4--1.5.sql pg_stat_statements--1.3--1.4.sql \
+	pg_stat_statements--1.2--1.3.sql pg_stat_statements--1.1--1.2.sql \
+	pg_stat_statements--1.0--1.1.sql pg_stat_statements--unpackaged--1.0.sql
 PGFILEDESC = "pg_stat_statements - execution statistics of SQL statements"
 
 LDFLAGS_SL += $(filter -lm, $(LIBS))
diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 5318c3550c..4e2ef7719a 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -6,7 +6,7 @@ SET pg_stat_statements.track_utility = FALSE;
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
 --
- 
+   10
 (1 row)
 
 SELECT 1 AS "int";
@@ -122,7 +122,7 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
 --
- 
+   12
 (1 row)
 
 -- utility "create table" should not be shown
@@ -216,7 +216,7 @@ SET pg_stat_statements.track = 'none';
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
 --
- 
+   10
 (1 row)
 
 SELECT 1 AS "one";
@@ -243,7 +243,7 @@ SET pg_stat_statements.track = 'top';
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
 --
- 
+2
 (1 row)
 
 DO LANGUAGE plpgsql $$
@@ -303,7 +303,7 @@ SET pg_stat_statements.track = 'all';
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
 --
- 
+7
 (1 row)
 
 -- we drop and recreate the functions to avoid any caching funnies
@@ -361,7 +361,7 @@ SET pg_stat_statements.track_utility = TRUE;
 SELECT pg_stat_statements_reset();
  pg_stat_statements_reset 
 --
- 
+6
 (1 row)
 
 SELECT 1;
@@ -395,4 +395,126 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
  SELECT pg_stat_statements_reset() | 1 |1
 (8 rows

Re: [HACKERS] WAL logging problem in 9.4.3?

2018-07-04 Thread Andrew Dunstan
On Wed, Jul 4, 2018 at 12:59 AM, Michael Paquier  wrote:
> On Fri, Mar 30, 2018 at 10:06:46AM +0900, Kyotaro HORIGUCHI wrote:
>> Hello.  I found that c203d6cf81 hit this and this is the rebased
>> version on the current master.
>
> Okay, as this is visibly the oldest item in this commit fest, Andrew has
> asked me to look at a solution which would allow us to definitely close
> the loop for all maintained branches.  In consequence, I have been
> looking at this problem.  Here are my thoughts:
> - The set of errors reported on this thread are alarming, depending on
> the scenarios used, we could have "could not read file" stuff, or even
> data loss after WAL replay comes and wipes out everything.
> - Disabling completely the TRUNCATE optimization is definitely not cool,
> as there could be an impact for users.
> - Removing wal_level = minimal is not acceptable as well, as some people
> rely on this feature.
> - Rewriting the sync handling of heap relation files in an invasive way
> may be something to investigate and improve on HEAD (I am not really
> convinced about that actually for the optimizations discussed on this
> thread as this may result in more bugs than actual fixes), but that
> would do nothing for back-branches.
>
> Hence I propose the patch attached which disables the TRUNCATE and COPY
> optimizations for two cases, which are the ones actually causing
> problems.  One solution has been presented by Simon here for COPY, which
> is to disable the optimization when there are no blocks on a relation
> with wal_level = minimal:
> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com
> For back-patching, I find that really appealing.
>
> The second thing that the patch attached does is to tweak
> ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
> wal_level = minimal.
>
> Another thing that this patch adds is a set of regression tests to
> stress all the various scenarios presented on this thread, with table
> creation, INSERT, COPY and TRUNCATE running in the same transactions for
> both wal_level = minimal and replica, which make sure that there are no
> failures and no actual data loss.  The test is useful anyway, as no
> patch presented so far offered a way to easily test all the scenarios,
> except for a bash script present upthread, but that discarded some of
> the cases.
>
> I would propose that for a back-patch, except for the test which can go
> down easily to 9.6 but I have not tested that yet.
>


Many thanks for working on this.

+1 for these changes, even though the TRUNCATE fix looks perverse. If
anyone wants to propose further optimizations in this area, this would
at least give us a starting point which is correct.
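
For anyone following along, the failing scenarios have this shape (a
sketch; the data file path is a placeholder):

  -- with wal_level = minimal, in one transaction:
  BEGIN;
  CREATE TABLE t (a int);
  COPY t FROM '/path/to/data.csv';   -- optimization may skip WAL-logging
  TRUNCATE t;
  COPY t FROM '/path/to/data.csv';
  COMMIT;
  -- a crash followed by WAL replay could leave t unreadable or empty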

cheers

andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-04 Thread Etsuro Fujita

(2018/07/04 19:04), Ashutosh Bapat wrote:

On Fri, Jun 29, 2018 at 6:21 PM, Etsuro Fujita
  wrote:

(2018/06/22 22:54), Ashutosh Bapat wrote:

+   if (enable_partitionwise_join && rel->top_parent_is_partitioned)
+   {
+       build_childrel_tlist(root, rel, childrel, 1, &appinfo);
+   }

Why do we need rel->top_parent_is_partitioned? If a relation is
partitioned (if (rel->part_scheme), it's either the top parent or is
partition of some other partitioned table. In either case this
condition will be true.



This would be needed to avoid unnecessarily applying build_childrel_tlist to
child rels of a partitioned table for which we don't consider partitionwise
join.  Consider:

postgres=# create table lpt (c1 int, c2 text) partition by list (c1);
CREATE TABLE
postgres=# create table lpt_p1 partition of lpt for values in (1);
CREATE TABLE
postgres=# create table lpt_p2 (c1 int check (c1 in (2)), c2 text);
CREATE TABLE
postgres=# create table test (c1 int, c2 text);
CREATE TABLE
postgres=# explain verbose select * from (select * from lpt union all select
* from lpt_p2) ss(c1, c2) inner join test on (ss.c1 = test.c1);


--- plan clipped



In this example, the top parent is not a partitioned table and we don't need
to consider partitionwise join for that, so we don't need to apply that to
the child rel lpt_p1 of the partitioned table lpt (and the table lpt_p2).


FWIW, the test is not sufficient. In the above example, a simple
select * from lpt inner join test where lpt.c1 = test.c1 would not use
partition-wise join technique. Why not to avoid build_childrel_tlist()
in that case as well?


I might be misunderstanding your words, but in the above example the patch
doesn't apply build_childrel_tlist to lpt_p1 and lpt_p2.  The reason is
that we can then avoid adjusting the tlists for the corresponding subplans
at plan creation time so that whole-row Vars in the tlists are transformed
into CREs.  I think the overhead of the adjustment is not that big, but it
is not zero, so it is worth avoiding applying build_childrel_tlist to
partitions if the top parent won't participate in a partitionwise join at
all.



Worst we can change the criteria to use
partition-wise join in future e.g. above case would use partition-wise
join by 1. merging lpt_p1 into corresponding partition of lpt so that
ss is partitioned and 2. repartitioning test or joining test with each
partition of lpt separately. In those cases the changes you are doing
here need to be reverted and put somewhere else. There's already a
patch to reverse the order of join and grouping out there. How would
this work with that.


Interesting, but that seems like a matter of PG12+.


In general I think set_append_rel_size() or build_simple_rel() is not
the place where we should decide whether the given relation will
participate in a PWJ. Also we should not selectively add a
ConvertRowtypeExpr on top of a whole-row Var of some a child relation.
We should do it for all the child rels or shouldn't do it at all.


One thing I considered was to apply build_childrel_tlist to all the child
rels whether their top parent is a partitioned table or not, but as I
mentioned above, that would incur needless overhead for adjusting the
tlists of child rels that don't participate in a partitionwise join, such
as lpt_p1 and lpt_p2, which I think is not good.



An
in-between state will produce a hell lot of confusion for any further
optimization. Whenever we change the code around partition-wise
operations in future, we will have to check whether or not a given
child rel has its whole-row Var embedded in ConvertRowtypeExpr. As I
have mentioned earlier, I am also not comfortable with the targetlist
of child relations being type inconsistent with that of the parent,
which is a fundamental rule in inheritance. Worst keep them
inconsistent during path creation and make them consistent at the time
of creating plans. A child's whole-row Var has datatype of the child
where as that of parent has datatype of parent.


I don't see any critical issue here.  Could you elaborate a bit more on 
that point?



A ConvertRowtypeExpr
is used to fix this inconsistency. That's why I chose to use
pull_var_clause() as a place to fix the problem and not fix
ConvertRowtypeExpr in targetlist itself.


I think the biggest issue with the original patch you proposed is that we
spend extra cycles where partitioning is not involved; that is the main
reason why I think the original patch isn't the right way to go.



I am also not comfortable in just avoiding ConvertRowtypeExprs in
targetlist and leave them as is in the conditions and ECs etc. This
means searching a ConvertRowtypeExpr e.g. for creating pathkeys in
targetlist will fail at the time of path creation but will succeed at
the time of plan creation.

This is turning more invasive that my approach of fixing it through PVC.


Sorry, I don't understand this.  Could you show me places where the 
problem happens?



+   /* No work if

Locking considerations of REINDEX

2018-07-04 Thread Pavan Deolasee
The documentation [1] claims that REINDEX does not block readers on the
table.

"REINDEX is similar to a drop and recreate of the index in that the index
contents are rebuilt from scratch. However, the locking considerations are
rather different. REINDEX locks out writes but not reads of the index's
parent table. It also takes an exclusive lock on the specific index being
processed, which will block reads that attempt to use that index. In
contrast, DROP INDEX momentarily takes an exclusive lock on the parent
table, blocking both writes and reads. The subsequent CREATE INDEX locks
out writes but not reads; since the index is not there, no read will
attempt to use it, meaning that there will be no blocking but reads might
be forced into expensive sequential scans."

But AFAICS get_relation_info() tries to lock every index, and since REINDEX
will be holding an AEL on the index being reindexed, get_relation_info()
blocks. Since get_relation_info() is reached on every read path, wouldn't a
concurrent REINDEX pretty much block every read access to the table, even
though REINDEX is not holding an AEL on the table itself?
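
A quick way to observe this (a sketch, assuming a table t with an index
t_idx):

  -- session 1: REINDEX holds an AccessExclusiveLock until commit
  BEGIN;
  REINDEX INDEX t_idx;
  -- leave the transaction open

  -- session 2: the planner opens every index in get_relation_info(),
  -- so even a query that would never use t_idx blocks:
  SELECT count(*) FROM t;   -- waits for session 1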

I wonder if we just need to fix the docs, or if we actually regressed at
some point in history, or if we have a bug in the implementation. It
mostly seems like a case of wrongly written docs, even though in theory it
might be possible to skip an index being rebuilt. That could lead to
surprisingly worse plans getting chosen, leading to more trouble. Or maybe
someday we will have the ability for existing queries to continue to read
from the old physical index while new queries shift to the new index, with
the old index's storage eventually dropped when nobody can see it.

Thanks,
Pavan

[1] https://www.postgresql.org/docs/11/static/sql-reindex.html


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-04 Thread Ashutosh Bapat
On Wed, Jul 4, 2018 at 5:36 PM, Etsuro Fujita
 wrote:
> (2018/07/04 19:04), Ashutosh Bapat wrote:
>>
>> On Fri, Jun 29, 2018 at 6:21 PM, Etsuro Fujita
>>   wrote:
>>>
>>> (2018/06/22 22:54), Ashutosh Bapat wrote:

>>>> +   if (enable_partitionwise_join && rel->top_parent_is_partitioned)
>>>> +   {
>>>> +       build_childrel_tlist(root, rel, childrel, 1, &appinfo);
>>>> +   }
>>>>
>>>> Why do we need rel->top_parent_is_partitioned? If a relation is
>>>> partitioned (if (rel->part_scheme), it's either the top parent or is
>>>> partition of some other partitioned table. In either case this
>>>> condition will be true.
>>>
>>>
>>>
>>> This would be needed to avoid unnecessarily applying build_childrel_tlist
>>> to
>>> child rels of a partitioned table for which we don't consider
>>> partitionwise
>>> join.  Consider:
>>>
>>> postgres=# create table lpt (c1 int, c2 text) partition by list (c1);
>>> CREATE TABLE
>>> postgres=# create table lpt_p1 partition of lpt for values in (1);
>>> CREATE TABLE
>>> postgres=# create table lpt_p2 (c1 int check (c1 in (2)), c2 text);
>>> CREATE TABLE
>>> postgres=# create table test (c1 int, c2 text);
>>> CREATE TABLE
>>> postgres=# explain verbose select * from (select * from lpt union all
>>> select
>>> * from lpt_p2) ss(c1, c2) inner join test on (ss.c1 = test.c1);
>>
>>
>> --- plan clipped
>>
>>>
>>> In this example, the top parent is not a partitioned table and we don't
>>> need
>>> to consider partitionwise join for that, so we don't need to apply that
>>> to
>>> the child rel lpt_p1 of the partitioned table lpt (and the table lpt_p2).
>>
>>
>> FWIW, the test is not sufficient. In the above example, a simple
>> select * from lpt inner join test where lpt.c1 = test.c1 would not use
>> partition-wise join technique. Why not to avoid build_childrel_tlist()
>> in that case as well?
>
>
> I might misunderstand your words, but in the above example the patch doesn't
> apply build_childrel_tlist to lpt_p1 and lpt_p2.  The reason for that is
> because we can avoid adjusting the tlists for the corresponding subplans at
> plan creation time so that whole-row Vars in the tlists are transformed into
> CREs.  I think the overhead of the adjustment is not that big, but not zero,
> so it would be worth avoiding applying build_childrel_tlist to partitions if
> the top parent won't participate in a partitionwise-join at all.

I don't think that answers my question. When we join lpt with test,
your patch will apply build_childrel_tlist() to lpt_p1 and lpt_p2 even
though the join between lpt and test is not going to use partition-wise
join. Why? As per your explanation, the condition "if
(enable_partitionwise_join && rel->top_parent_is_partitioned)" is
used to avoid applying build_childrel_tlist() when partition-wise join
won't be possible. But it does not cover all the cases.

>
>> Worst we can change the criteria to use
>> partition-wise join in future e.g. above case would use partition-wise
>> join by 1. merging lpt_p1 into corresponding partition of lpt so that
>> ss is partitioned and 2. repartitioning test or joining test with each
>> partition of lpt separately. In those cases the changes you are doing
>> here need to be reverted and put somewhere else. There's already a
>> patch to reverse the order of join and grouping out there. How would
>> this work with that.
>
>
> Interesting, but that seems like a matter of PG12+.

Yes, and I don't think it's a good idea to change this fix for PG12+ again.

>
>> In general I think set_append_rel_size() or build_simple_rel() is not
>> the place where we should decide whether the given relation will
>> participate in a PWJ. Also we should not selectively add a
>> ConvertRowtypeExpr on top of a whole-row Var of some a child relation.
>> We should do it for all the child rels or shouldn't do it at all.
>
>
> One thing I considered was to apply build_childrel_tlist to all the child rels
> whether their top parent is a partitioned table or not, but as I mentioned
> above, that would incur needless overhead for adjusting the tlists of
> child rels that aren't involved in a partitionwise join, such as lpt_p1 and
> lpt_p2, which I think is not good.
>
>> An
>> in-between state will produce a hell of a lot of confusion for any further
>> optimization. Whenever we change the code around partition-wise
>> operations in future, we will have to check whether or not a given
>> child rel has its whole-row Var embedded in a ConvertRowtypeExpr. As I
>> have mentioned earlier, I am also not comfortable with the targetlist
>> of child relations being type-inconsistent with that of the parent;
>> type consistency is a fundamental rule in inheritance. Worse, we keep them
>> inconsistent during path creation and only make them consistent at the time
>> of creating plans. A child's whole-row Var has the datatype of the child,
>> whereas that of the parent has the datatype of the parent.
>
>
> I don't see any critical issue here.  Could you elaborate a bit more on that
> point?

I think brea

Re: Libpq support to connect to standby server as priority

2018-07-04 Thread Laurenz Albe
Haribabu Kommi wrote:
> On Wed, Jan 24, 2018 at 9:01 AM Jing Wang  wrote:
> > Hi All,
> > 
> > Recently I put a proposal to support 'prefer-read' parameter in 
> > target_session_attrs in libpq. Now I updated the patch with adding content 
> > in the sgml and regression test case.
> > 
> > Some people may have noticed there is already another patch 
> > (https://commitfest.postgresql.org/15/1148/ ) which looks similar with 
> > this. But I would say this patch is more complex than my proposal. 
> > 
> > It is better separate these 2 patches to consider.
> 
> I also feel prefer-read and read-only need to be treated as two different
> options.
> prefer-read is simpler to support than read-only.
> 
> Here I have attached an updated patch that is rebased onto the latest master
> and also fixes some of the corner scenarios.

The patch applies, builds and passes "make check-world".

I think the "prefer-read" functionality is desirable: it is exactly what you
need if you want to use replication for load balancing, and your application
supports different database connections for reading and writing queries.

"read-only" does not have a clear use case in my opinion.

With the patch, PostgreSQL behaves as expected if I have a primary and a 
standby and run:

  psql "host=/tmp,/tmp port=5433,5434 target_session_attrs=prefer-read"

But if I stop the standby (port 5434), libpq goes into an endless loop.

Concerning the code:

- The documentation needs some attention. Suggestion:

   If this parameter is set to prefer-read, connections
   where SHOW transaction_read_only returns on are preferred.
   If no such connection can be found, a connection that allows read-write
   transactions will be accepted.

- I think the construction with "read_write_host_index" makes the code even more
  complicated than it already is.

  What about keeping the first successful connection open and storing it in a
  variable if we are in "prefer-read" mode.
  If we get the read-only connection we desire, close that cached connection,
  otherwise use it.
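
  For illustration, a minimal client-side sketch of that idea (the function
  name and overall shape are my own, not the patch's; only stock libpq calls
  are used, and error handling is pared down to the essentials):

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>

    /* Try each conninfo in turn; return the first read-only server found,
     * keeping the first working read-write connection as a cached fallback. */
    static PGconn *
    connect_prefer_read(const char *conninfos[], int n)
    {
        PGconn *fallback = NULL;

        for (int i = 0; i < n; i++)
        {
            PGconn     *conn = PQconnectdb(conninfos[i]);
            PGresult   *res;
            bool        readonly;

            if (PQstatus(conn) != CONNECTION_OK)
            {
                PQfinish(conn);
                continue;
            }

            res = PQexec(conn, "SHOW transaction_read_only");
            readonly = (PQresultStatus(res) == PGRES_TUPLES_OK &&
                        strcmp(PQgetvalue(res, 0, 0), "on") == 0);
            PQclear(res);

            if (readonly)
            {
                if (fallback)
                    PQfinish(fallback);     /* cached conn is now surplus */
                return conn;
            }

            if (fallback == NULL)
                fallback = conn;            /* cache first read-write conn */
            else
                PQfinish(conn);
        }

        return fallback;                    /* read-write, or NULL if all failed */
    }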

Yours,
Laurenz Albe



Re: How to use public key file to encrypt data

2018-07-04 Thread Jeff Janes
On Tue, Jul 3, 2018 at 6:17 AM, ROS Didier  wrote:

> Hi
>
>I would like to know how to encrypt data with *physical*
> public key files. I can't find any documentation about this subject.
>
>Thanks in advance
>

This isn't really a suitable question for pgsql-hackers, as you aren't
talking about development of the PostgreSQL system itself, but rather usage
of that system.  That type of question should probably go to pgsql-general.

I don't know what it means for a key file to be "physical".  Does that mean
stored on disk?

The functions you get from the pgcrypto extension do not automatically detect
the format of the key file, so you have to de-armor keys explicitly if they
are stored in base64.  I wish it would detect that.  I just do it on the
fly.

Here is a program I have used for testing and benchmarking pgcrypto,
perhaps you would find it useful as an example.


Cheers,

Jeff


pgcrypto_public.pl
Description: Binary data


Re: Allow auto_explain to log to NOTICE

2018-07-04 Thread Andrew Dunstan
On Wed, Jun 20, 2018 at 2:06 PM, Daniel Gustafsson  wrote:
>> On 27 Apr 2018, at 04:24, Andres Freund  wrote:
>>
>> On 2018-04-27 11:52:18 +0930, Tom Dunstan wrote:
 I'd argue this should contain the non-error cases. It's just as
 reasonable to want to add this as a debug level or such.
>>>
>>> So all of warning, info, debug and debug1-5?
>>
>> Yea. Likely nobody will ever use debug5, but I don't see a point making
>> that judgement call here.
>
> Did you have a chance to hack up a new version of the patch with Andres’ 
> review
> comments?  I’m marking this patch as Waiting on Author for now based on the
> feedback so far in this thread.
>

Here is an updated version of the patch. Setting this to "needs review".

I haven't added tests for auto_explain - I think that would be useful
but it's a separate project.

cheers

andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


auto-explain-log-level-v2.patch
Description: Binary data


Re: Non-reserved replication slots and slot advancing

2018-07-04 Thread Alvaro Herrera
On 2018-Jul-04, Michael Paquier wrote:

> At the end, are their any objections into fixing the issue and
> tightening the advancing API?

None from me.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Failure assertion in GROUPS mode of window function in current HEAD

2018-07-04 Thread Masahiko Sawada
Hi,

I got an assertion failure when I use GROUPS mode and specify an offset
without an ORDER BY clause. The reproduction steps and the backtrace I
got are as follows.

=# create table test as select 1 as c;
=# select sum(c) over (partition by c groups between 1 preceding and
current row) from test;
TRAP: FailedAssertion("!(ptr >= 0 && ptr < state->readptrcount)",
File: "tuplestore.c", Line: 478)

#0  0x003b3ce32495 in raise () from /lib64/libc.so.6
#1  0x003b3ce33c75 in abort () from /lib64/libc.so.6
#2  0x00a2ef5a in ExceptionalCondition (conditionName=0xc99f38
"!(ptr >= 0 && ptr < state->readptrcount)", errorType=0xc99f22
"FailedAssertion",
fileName=0xc99ec0 "tuplestore.c", lineNumber=478) at assert.c:54
#3  0x00a7fa7d in tuplestore_select_read_pointer
(state=0x139e4a0, ptr=-1) at tuplestore.c:478
#4  0x007216cd in update_frameheadpos (winstate=0x137fc30) at
nodeWindowAgg.c:1655
#5  0x0071fb7f in eval_windowaggregates (winstate=0x137fc30)
at nodeWindowAgg.c:735
#6  0x00722ac8 in ExecWindowAgg (pstate=0x137fc30) at
nodeWindowAgg.c:2196
#7  0x006e5776 in ExecProcNodeFirst (node=0x137fc30) at
execProcnode.c:445
#8  0x006da945 in ExecProcNode (node=0x137fc30) at
../../../src/include/executor/executor.h:237
#9  0x006dd2fc in ExecutePlan (estate=0x137fa18,
planstate=0x137fc30, use_parallel_mode=false, operation=CMD_SELECT,
sendTuples=true,
numberTuples=0, direction=ForwardScanDirection, dest=0x138b098,
execute_once=true) at execMain.c:1726
#10 0x006daf2c in standard_ExecutorRun (queryDesc=0x13785b8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:363
#11 0x006dad48 in ExecutorRun (queryDesc=0x13785b8,
direction=ForwardScanDirection, count=0, execute_once=true) at
execMain.c:306
#12 0x008c7626 in PortalRunSelect (portal=0x12f3bd8,
forward=true, count=0, dest=0x138b098) at pquery.c:932
#13 0x008c72b4 in PortalRun (portal=0x12f3bd8,
count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x138b098, altdest=0x138b098,
completionTag=0x7fffaece38b0 "") at pquery.c:773
#14 0x008c1288 in exec_simple_query (
query_string=0x128e938 "select sum(c) over (partition by c groups
between 1 preceding and current row) from test;") at postgres.c:1122
#15 0x008c555e in PostgresMain (argc=1, argv=0x12b8258,
dbname=0x12b80d8 "postgres", username=0x12b80b0 "masahiko") at
postgres.c:4153
#16 0x00822c3c in BackendRun (port=0x12b0020) at postmaster.c:4361
#17 0x008223aa in BackendStartup (port=0x12b0020) at postmaster.c:4033
#18 0x0081e84b in ServerLoop () at postmaster.c:1706
#19 0x0081e17d in PostmasterMain (argc=5, argv=0x1289330) at
postmaster.c:1379
#20 0x007452d0 in main (argc=5, argv=0x1289330) at main.c:228

The cause of this assertion failure is that we attempted to select a
read pointer (framehead_ptr) that was never allocated. We allocate read
pointers for both the frame head and tail when beginning a new partition in
begin_partition(). The current code, shown below, doesn't allocate them
if the ORDER BY clause is omitted, but that doesn't match
update_frameheadpos() and update_frametailpos(), which always use their
read pointers in GROUPS offset mode.

if ((winstate->frameOptions & (FRAMEOPTION_RANGE | FRAMEOPTION_GROUPS)) &&
   node->ordNumCols != 0)
{
if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING))
winstate->framehead_ptr =
tuplestore_alloc_read_pointer(winstate->buffer, 0);
if (!(winstate->frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING))
winstate->frametail_ptr =
tuplestore_alloc_read_pointer(winstate->buffer, 0);
}

Since this issue relates only to GROUPS mode, it happens in PostgreSQL
11 or later. The attached patch fixes the issue and adds regression tests
for GROUPS mode without an ORDER BY clause.
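
For illustration, one way to make begin_partition() match what
update_frameheadpos() and update_frametailpos() expect would be roughly the
following (a sketch only; the attached patch is authoritative and may take a
different approach):

    /* In GROUPS offset mode the read pointers are always consulted, so
     * allocate them unconditionally; RANGE offset mode still requires an
     * ORDER BY column. */
    if ((winstate->frameOptions & FRAMEOPTION_GROUPS) ||
        ((winstate->frameOptions & FRAMEOPTION_RANGE) &&
         node->ordNumCols != 0))
    {
        if (!(winstate->frameOptions & FRAMEOPTION_START_UNBOUNDED_PRECEDING))
            winstate->framehead_ptr =
                tuplestore_alloc_read_pointer(winstate->buffer, 0);
        if (!(winstate->frameOptions & FRAMEOPTION_END_UNBOUNDED_FOLLOWING))
            winstate->frametail_ptr =
                tuplestore_alloc_read_pointer(winstate->buffer, 0);
    }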

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_groups_mode_window_func.patch
Description: Binary data


Re: Locking considerations of REINDEX

2018-07-04 Thread Peter Geoghegan
On Wed, Jul 4, 2018 at 5:08 AM, Pavan Deolasee  wrote:
> But AFAICS get_relation_info() tries to lock every index and since REINDEX
> will be holding an AEL on the index being reindexed, get_relation_info()
> blocks. Since get_relation_info() gets into every read path, wouldn't a
> concurrent REINDEX pretty much block every read access to the table, even if
> REINDEX not holding AEL on the table itself?

Not necessarily -- prepared statements may not block.

> I wonder if we just need to fix the docs, or if we actually regressed at some
> point in history, or if we have a bug in the implementation? It mostly
> seems like a case of wrongly written docs, even though in theory it might be
> possible to skip an index being rebuilt.

I still agree with this, though. The practical difference between
getting an AEL on the table and what REINDEX does is pretty much
zero.

-- 
Peter Geoghegan



Re: Legacy GiST invalid tuples

2018-07-04 Thread Tom Lane
Andrey Borodin  writes:
> There is a bunch of code in the current GiST implementation checking for
> GistTupleIsInvalid(). PostgreSQL since 9.1 does not create invalid tuples.
> Should we support these tuples forever?

The question is not whether we still create such tuples.  The reason
the code is still there is that an index that's been pg_upgraded from
before 9.1 might still contain such tuples.  We can't drop the support
unless either we provide logic to clean up invalid entries, or we're
willing to force users to REINDEX old GiST indexes to get rid of them
that way.  The latter seems like a pretty high price just to get rid of
some crufty old code.

regards, tom lane



Re: Legacy GiST invalid tuples

2018-07-04 Thread Andrey Borodin



> 4 июля 2018 г., в 19:22, Tom Lane  написал(а):
> 
> Andrey Borodin  writes:
>> There is a bunch of code in the current GiST implementation checking for
>> GistTupleIsInvalid(). PostgreSQL since 9.1 does not create invalid tuples.
>> Should we support these tuples forever?
> 
> The question is not whether we still create such tuples.  The reason
> the code is still there is that an index that's been pg_upgraded from
> before 9.1 might still contain such tuples.  We can't drop the support
> unless either we provide logic to clean up invalid entries, or we're
> willing to force users to REINDEX old GiST indexes to get rid of them
> that way.  The latter seems like a pretty high price just to get rid of
> some crufty old code.

Thanks, Tom!

So, I can create a script for pg_upgrade that will walk through each
old-enough[0] GiST index, scan for invalid tuples, and repair them. This
procedure seems quite trivial, but it will be more code than we have now.
Does that sound reasonable?

[0] Actually, I do not know how to determine which index is old enough.

Best regards, Andrey Borodin.




Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-07-04 Thread Sergei Kornilov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

I reviewed and tested the patch one more time, including a check against the
previous extension version. I have no more notes on the code. The patch has
tests and appropriate documentation changes.
I think the patch is ready for a committer.

regards, Sergei

The new status of this patch is: Ready for Committer


Re: Recovery performance of DROP DATABASE with many tablespaces

2018-07-04 Thread Fujii Masao
On Wed, Jul 4, 2018 at 4:47 PM, Jamison, Kirk  wrote:
> Hi, Fujii-san
>
> I came across this post and I got interested in it,
>  so I tried to apply/test the patch but I am not sure if I did it correctly.
> I set up master-slave sync, 200GB shared_buffers, 2
> max_locks_per_transaction,
> 1 DB with 500 table partitions shared evenly across 5 tablespaces.
>
> After dropping the db, with or without the patch,
> there was no difference in recovery performance when dropping the database,
> so maybe I made a mistake somewhere. But anyway, here are the results.
>
> ==WITHOUT PATCH===
> [200GB shared buffers]
> DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
> 2018/07/04_13:35:00.161
> dropdb
> 2018/07/04_13:35:05.591 5.591 sec
>
> [200GB shared_buffers]
> DROPDB (including DROP TABLE and DROP TABLESPACE)
> real3m19.717s
> user0m0.001s
> sys 0m0.001s
>
> ==WITH PATCH===
> [200GB shared_buffers]
> DROPDB only (skipped DROP TABLE and DROP TABLESPACE)
> 2018/07/04_14:19:47.128
> dropdb
> 2018/07/04_14:19:53.177 6.049 sec
>
> [200GB shared_buffers]
> DROPDB (included the DROP TABLE and DROP TABLESPACE commands)
> real3m51.834s
> user0m0.001s
> sys 0m0.002s
>
> Just in case, do you also have performance test numbers/cases
> to show the recovery perf improvement when dropping a database that contains
> multiple tablespaces?

Thanks for testing!

TBH, I have no numbers measured by such a test.
One question about your test: how did you measure the *recovery time*
of DROP DATABASE? Since it's *recovery* performance, it's basically not easy
to measure.

Regards,

-- 
Fujii Masao



Why B-Tree suffix truncation matters

2018-07-04 Thread Peter Geoghegan
I've been working on B-Tree suffix truncation, as part of a larger
effort to make all index tuples have unique keys by treating their
heap TID as a first class part of the key space, as part of an even
larger (and still ill-defined) effort to improve index vacuuming
(a.k.a. "retail index tuple deletion"). I posted some prototype
patches over on the "Making all nbtree entries unique by having heap
TIDs participate in comparisons" thread, and have been working on this
fairly solidly for the past few weeks.

I'm starting this new thread to discuss the benefits of B-Tree suffix
truncation, and to demonstrate how effective it can be at increasing
fan-out and space utilization with a real-world example. I haven't
explained why suffix truncation matters very well before now. Now that
I have a patch that works, I can just show the benefit directly. (By
the way, there are similar benefits to covering INCLUDE indexes, where
suffix truncation occurs all the time, and not just
opportunistically.)

Definition
==

To recap, suffix truncation is a classic optimization for B-Trees that
was first described by Bayer in 1977. We truncate away
non-distinguishing trailing attributes from pivot tuples to save space
in internal pages, improving fan-in. Pivot tuples are the index tuples
that don't point to the heap - they're only used to guide searches to
the correct leaf page. Since they only guide searches, they only need
those attributes up to and including the first attribute that
distinguished each half of a leaf page split at the time of the
original split. For example, if during a leaf page split the tuple to
the immediate left of the split point looks like this:

('USA', 'New Jersey', 'Wyckoff')

...and, if the tuple at the immediate right half looked like this:

('USA', 'New York', 'Albany')

...then we could get away with truncating to make the new left page's
high key (and right page's downlink) be represented as:

('USA', 'New York', <minus infinity>)

Where "<minus infinity>" is represented a little bit like a NULL
value, though with less overhead (we can use the same representation
as INCLUDE indexes do for pivot tuples). Don't make the mistake of
thinking of this as a form of compression -- it's actually a way
avoiding storing something that we simply don't need, and never could
need, that prematurely limits where tuples can be stored in the index.
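
To make that concrete, here is a toy model of picking the truncation point
during a leaf page split (plain C strings stand in for index attributes; the
real code has to use the per-attribute comparators, handle NULLs, and build
an actual pivot tuple, so treat this as a sketch of the idea only):

    #include <string.h>

    /*
     * Return how many leading attributes the new pivot needs in order to
     * separate the last tuple on the left half ("lastleft") from the first
     * tuple on the right half ("firstright").  Every attribute after the
     * first distinguishing one can be truncated away.
     */
    static int
    pivot_natts_needed(char **lastleft, char **firstright, int natts)
    {
        for (int i = 0; i < natts; i++)
        {
            if (strcmp(lastleft[i], firstright[i]) != 0)
                return i + 1;       /* first distinguishing attribute */
        }
        return natts;               /* tuples fully equal: keep everything */
    }

In the example above, lastleft = {"USA", "New Jersey", "Wyckoff"} and
firstright = {"USA", "New York", "Albany"} give 2, so 'Albany' never needs
to appear in the internal page at all.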

Example
===

It's well known that the number of internal pages in a B-Tree seldom
exceeds 1% of the total (2% would be considered hugely bloated). These
pages are what we're really interested in shrinking, so suffix
truncation might appear to be a micro-optimization of little value.
Who cares if I can shave 0.2% off the total size of the index? In
fact, it can be far more than that in cases involving *random*
insertions. Here is a simple motivating example, that uses the UK land
registry data [1]:

pg@~[31744]=# \dt+
                              List of relations
 Schema │            Name             │ Type  │ Owner │  Size   │ Description
────────┼─────────────────────────────┼───────┼───────┼─────────┼─────────────
 public │ land_registry_price_paid_uk │ table │ pg    │ 3005 MB │
(1 row)

Let's create a table with the same schema, and build a composite index
on (county, city, locality) against that table, for our random
insertions test:

pg@~[31744]=# create table land2 (like land_registry_price_paid_uk);
CREATE TABLE
pg@~[31744]=# create index composite on land2 (county, city, locality);
CREATE INDEX

The data we're about to insert looks like this:

pg@~[7002]=# select county, city, locality from
land_registry_price_paid_uk limit 15;
  county  │city │  locality
──┼─┼─
 LANCASHIRE   │ BURNLEY │ BURNLEY
 SLOUGH   │ SLOUGH  │ SLOUGH
 EAST SUSSEX  │ RYE │ RYE
 HAMPSHIRE│ FARNBOROUGH │ FARNBOROUGH
 GREATER LONDON   │ LONDON  │ LONDON
 LINCOLNSHIRE │ SKEGNESS│ SKEGNESS
 WREKIN   │ TELFORD │ TELFORD
 KENT │ CANTERBURY  │ CANTERBURY
 GREATER LONDON   │ LONDON  │ LONDON
 GREATER LONDON   │ LONDON  │ LEYTON
 NORTHAMPTONSHIRE │ CORBY   │ CORBY
 SURREY   │ GUILDFORD   │ GUILDFORD
 SURREY   │ LEATHERHEAD │ OXSHOTT
 LINCOLNSHIRE │ WISBECH │ TYDD GOTE
 WILTSHIRE│ PEWSEY  │ PEWSEY
(15 rows)

Let's do something that results in more or less random insertions into
the index:

pg@~[31744]=# insert into land2 select * from land_registry_price_paid_uk;
INSERT 0 21456045
Time: 155536.176 ms (02:35.536)
pg@~[31744]=# \di+
                      List of relations
 Schema │   Name    │ Type  │ Owner │ Table │  Size   │ Description
────────┼───────────┼───────┼───────┼───────┼─────────┼─────────────
 public │ composite │ index │ pg    │ land2 │ 1134 MB │
(1 row)

(Note that I actually VACUUM FREEZE'd land_registry_price_paid_uk
before inserting in both c

Re: Bulk Insert into PostgreSQL

2018-07-04 Thread Peter Geoghegan
On Tue, Jul 3, 2018 at 4:34 PM, Srinivas Karthik V
 wrote:
> @Peter: I was indexing the primary key of all the tables in tpc-ds. Some of
> the fact tables has multiple columns as part of the primary key. Also, most
> of them are numeric type.

Please see my mail to -hackers on suffix truncation:
https://postgr.es/m/CAH2-Wzn5XbCzk6u0GL+uPnCp1tbrp2pJHJ=3byt4yq0_zzh...@mail.gmail.com

Perhaps this is related in some way, since in both cases we're talking
about a composite index on varlena-type columns, where the types have
expensive comparisons.

-- 
Peter Geoghegan



Re: pgsql: Fix crash when ALTER TABLE recreates indexes on partitions

2018-07-04 Thread Alvaro Herrera
On 2018-Jun-30, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Fix crash when ALTER TABLE recreates indexes on partitions
> 
> So ... buildfarm member skink has been reporting a valgrind failure
> during initdb since this patch went in.  However, I'm unable to reproduce
> such a failure here, and it's less than obvious how this particular patch
> could have caused a problem, and skink's report is pretty useless because
> it's providing only a numeric stack trace.  Thoughts?

Actually, older branches are failing in the same way, so it's not my
patch that caused it to fail.  Something must have changed in the system ...
Andres?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: "Access privileges" is missing after pg_dumpall

2018-07-04 Thread Tom Lane
Prabhat Sahu  writes:
> I have taken pg_dumpall in pg-master and after restoring the dump I am not
> able to see the "Access privileges" as below:
> Same is reproducible in back branches as well, is this fine ?

Yes, it is, because the privileges are the same in both states.  In
one case you have an explicit representation of the default privileges,
in the other it's just default.

regards, tom lane



Re: Speedup of relation deletes during recovery

2018-07-04 Thread Fujii Masao
On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier  wrote:
> On Tue, Jul 03, 2018 at 04:13:15AM +0900, Fujii Masao wrote:
>> OK, so what about the attached patch?
>
> I have been looking at this patch, and this looks in good shape to me

Thanks for the review! So, committed.

> (please indent!).

Hmm... I failed to find an indentation issue in my patch. But anyway,
a future run of pgindent will fix it if it exists.

> +* Call smgrclose() in reverse order as when smgropen() is called.
> +* This trick enables remove_from_unowned_list() in smgrclose()
> +* to search the SMgrRelation from the unowned list,
> +* in O(1) performance.
>
> A nit here: with O(1) performance.

I changed the patch accordingly.
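
For readers following along, the trick under discussion is simply to close
in the reverse order of opening, roughly like this (a sketch; the array and
variable names are assumptions, not the committed code):

    /*
     * smgropen() pushes each SMgrRelation onto the head of the unowned
     * list, so closing in reverse order lets remove_from_unowned_list()
     * find every entry right at the head, with O(1) performance.
     */
    for (i = ndelrels - 1; i >= 0; i--)
        smgrclose(srels[i]);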

> Could it be possible to add an assertion so as isRedo is never enabled
> out of recovery?

I'm not sure if adding such an assertion to only the function that
the patch added is helpful. There are many functions that have something
like isRedo as an argument.

Regards,

-- 
Fujii Masao



Re: Legacy GiST invalid tuples

2018-07-04 Thread Alvaro Herrera
On 2018-Jul-04, Andrey Borodin wrote:

> Thanks, Tom!
> 
> So, I can create a script for pg_upgrade that will walk through each
> old-enough[0] GiST index, scan for invalid tuples, and repair them. This
> procedure seems quite trivial, but it will be more code than we have now.
> Does that sound reasonable?
> 
> [0] Actually, I do not know how to determine which index is old enough.

Requiring a scan of all indexes during pg_upgrade might increase the
upgrade time prohibitively for some sites, so I don't think that's a
good solution.

I think keeping track of which indexes might be old enough not to have
invalid tuples anymore is a good idea in the long run.  If we start
doing it in pg12, then by the time pg17 comes about and we abandon pg11
(the one without the cataloguing) then we can retire the code to support
invalid tuples.  Most people, at this point, say "naaah, this takes too long,
this project is useless" so the cataloguing is never done :-)

*If* we make it to 2023, that is.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Legacy GiST invalid tuples

2018-07-04 Thread Tom Lane
Alvaro Herrera  writes:
> Requiring a scan of all indexes during pg_upgrade might increase the
> upgrade time prohibitively for some sites, so I don't think that's a
> good solution.

Perhaps VACUUM could be taught to clean up invalid entries?  That
wouldn't help Andrey's unstated goal of being able to re-use the bits
for some other purpose in v12, but it might be practical to re-use
them sometime sooner than v17.

regards, tom lane



Re: hot_standby_feedback vs excludeVacuum and snapshots

2018-07-04 Thread Noah Misch
On Fri, Jun 08, 2018 at 11:03:38AM -0700, Andres Freund wrote:
> On 2018-06-08 09:23:02 +0100, Simon Riggs wrote:
> > For context, AEL locks are normally removed by COMMIT or ABORT.
> > StandbyReleaseOldLocks() is just a sweeper to catch anything that
> > didn't send an abort before it died, so it hardly ever activates. The
> > coding of StandbyReleaseOldLocks() is backwards... if it ain't in the
> > running list, then we remove it.
> > 
> > But that wasn't working correctly either, since as of 49bff5300d527 we
> > assigned AELs to subxids. Subxids weren't in the running list and so
> > AELs held by them would have been removed at the wrong time, an extant
> > bug in PG10. It looks to me like they would have been removed either
> > early or late, up to the next runningxact info record. They would be
> > removed, so no leakage, but the late timing wouldn't be noticeable for
> > tests or most usage, since it would look just like lock contention.
> > Early release might give same issue of block access to non-existent
> > block/file.
> 
> Yikes, that's kinda bad. It can also just cause plain crashes, no? The
> on-disk / catalog state isn't necessarily consistent during DDL, which
> is why we hold AE locks. At the very least it can cause corruption of
> in-use relcache entries and such.

Yes.  If I'm reading the commit message right, the committed code also
releases locks too early:

> commit 15378c1
> Author: Simon Riggs 
> AuthorDate: Sat Jun 16 14:03:29 2018 +0100
> Commit: Simon Riggs 
> CommitDate: Sat Jun 16 14:03:29 2018 +0100
> 
> Remove AELs from subxids correctly on standby
> 
> Issues relate only to subtransactions that hold AccessExclusiveLocks
> when replayed on standby.
> 
> Prior to PG10, aborting subtransactions that held an
> AccessExclusiveLock failed to release the lock until top level commit or
> abort. 49bff5300d527 fixed that.
> 
> However, 49bff5300d527 also introduced a similar bug where subtransaction
> commit would fail to release an AccessExclusiveLock, leaving the lock to
> be removed sometimes early and sometimes late. This commit fixes
> that bug also. Backpatch to PG10 needed.

Subtransaction commit is too early to release an arbitrary
AccessExclusiveLock.  The primary releases every AccessExclusiveLock at
top-level transaction commit, top-level transaction abort, or subtransaction
abort.  CommitSubTransaction() doesn't do that; it transfers locks to the
parent sub(xact).  Standby nodes can't safely remove an arbitrary lock earlier
than the primary would.
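
To spell out the invariant the standby has to mirror (purely illustrative
pseudo-C; the event and function names are invented, not PostgreSQL's actual
XactEvent values or standby.c entry points):

    switch (event)
    {
        case TOP_LEVEL_COMMIT:          /* the primary releases AELs here... */
        case TOP_LEVEL_ABORT:           /* ...and here... */
        case SUBXACT_ABORT:             /* ...and here */
            release_locks(xid);
            break;
        case SUBXACT_COMMIT:            /* locks move to the parent, not away */
            transfer_locks_to_parent(xid);
            break;
    }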



Re: Legacy GiST invalid tuples

2018-07-04 Thread Alvaro Herrera
On 2018-Jul-04, Tom Lane wrote:

> Alvaro Herrera  writes:
> > Requiring a scan of all indexes during pg_upgrade might increase the
> > upgrade time prohibitively for some sites, so I don't think that's a
> > good solution.
> 
> Perhaps VACUUM could be taught to clean up invalid entries?  That
> wouldn't help Andrey's unstated goal of being able to re-use the bits
> for some other purpose in v12, but it might be practical to re-use
> them sometime sooner than v17.

We tried to clean up HEAP_MOVED_IN / HEAP_MOVED_OFF a long time ago, but
gave up :-)  I recall Andres posted a write-up about adding columns to
pg_class to indicate "what version did last vacuum this whole table", as
a marker for features that are no longer needed.  Maybe that idea can be
reused for this purpose.  I'm thinking pg_upgrade can use its --check
phase to look for indexes marked as older than X (possibly containing
invalid tuples), so it instructs the user to run vacuum on it prior to
the upgrade.  I think the soonest this can work is to add the column in
pg12 so that it can be used to upgrade to pg13.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow COPY's 'text' format to output a header

2018-07-04 Thread Simon Muller
On 14 May 2018 at 08:35, Simon Muller  wrote:

> Okay, I've added this to the next commitfest at
> https://commitfest.postgresql.org/18/1629/.
>
> Thanks both Michael and David for the feedback so far.
>

I noticed through the patch tester link at http://commitfest.cputube.org/
that my patch caused a file_fdw test to fail (since I previously tested
only with "make check" and not with "make check-world").

This v2 patch should fix that.


text_header_v2.patch
Description: Binary data


Re: Legacy GiST invalid tuples

2018-07-04 Thread Andres Freund
Hi,

On 2018-07-04 16:43:19 -0400, Alvaro Herrera wrote:
> On 2018-Jul-04, Tom Lane wrote:
> 
> > Alvaro Herrera  writes:
> > > Requiring a scan of all indexes during pg_upgrade might increase the
> > > upgrade time prohibitively for some sites, so I don't think that's a
> > > good solution.
> > 
> > Perhaps VACUUM could be taught to clean up invalid entries?  That
> > wouldn't help Andrey's unstated goal of being able to re-use the bits
> > for some other purpose in v12, but it might be practical to re-use
> > them sometime sooner than v17.
> 
> We tried to clean up HEAP_MOVED_IN / HEAP_MOVED_OFF a long time ago, but
> gave up :-)  I recall Andres posted a write-up about adding columns to
> pg_class to indicate "what version did last vacuum this whole table", as
> a marker for features that are no longer needed.

Right. I plan to pick that up as soon as I've some free cycles. I'd not
object to somebody else working on it first though.


> Maybe that idea can be reused for this purpose.

Yes, I think that's explicitly in scope.


> I'm thinking pg_upgrade can use its --check phase to look for indexes
> marked as older than X (possibly containing invalid tuples), so it
> instructs the user to run vacuum on it prior to the upgrade.

I kinda wondered about making the feature back-patchable (by storing the
version in reloptions or such), but I think that'd just make it less
likely to get in.


> I think the soonest this can work is to add the column in pg12 so that
> it can be used to upgrade to pg13.

I don't think we can easily require that everyone pg_upgrading to v13+
upgrades to v12 first?

Greetings,

Andres Freund



Re: Should contrib modules install .h files?

2018-07-04 Thread Tom Lane
Andrew Gierth  writes:
> This installs to $(includedir_server)/$(MODULEDIR)/$MODULE/file.h
> (e.g. include/server/extension/hstore/hstore.h for an actual example),
> and errors if HEADERS_xxx is defined for anything that's not a module
> listed in MODULES or MODULE_big.

I've not studied this patch in any great detail, but the basic plan seems
sane.  However, don't we also need to teach the MSVC build system about
this?

regards, tom lane



Re: Legacy GiST invalid tuples

2018-07-04 Thread Alvaro Herrera
On 2018-Jul-04, Andres Freund wrote:

> > I think the soonest this can work is to add the column in pg12 so that
> > it can be used to upgrade to pg13.
> 
> I don't think we can easily require that everyone pg_upgrading to v13+
> upgrades to v12 first?

We've never done that, I don't know if we can get away with it.
Upgrades with pg_upgrade are not cheap enough for that, methinks.
Maybe I'm wrong, but people complain even at a *single* pg_upgrade --
seems everybody wants in-place upgrades nowadays ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Legacy GiST invalid tuples

2018-07-04 Thread Andres Freund
On 2018-07-04 17:02:01 -0400, Alvaro Herrera wrote:
> On 2018-Jul-04, Andres Freund wrote:
> 
> > > I think the soonest this can work is to add the column in pg12 so that
> > > it can be used to upgrade to pg13.
> > 
> > I don't think we can easily require that everyone pg_upgrading to v13+
> > upgrades to v12 first?
> 
> We've never done that, I don't know if we can get away with it.
> Upgrades with pg_upgrade are not cheap enough for that, methinks.
> Maybe I'm wrong, but people complain even at a *single* pg_upgrade --
> seems everybody wants in-place upgrades nowadays ...

Right. That's why I was wondering about making the "last full scan"
feature backpatchable...

Greetings,

Andres Freund



Re: Unexpected behavior of DROP VIEW/TABLE IF EXISTS

2018-07-04 Thread Tom Lane
Robert Haas  writes:
> Another possibility that would also seem to meet the OP's needs is to
> make it do this:

> DROP TABLE IF EXISTS X;
> NOTICE:  relation "X" is not a table, skipping

> His complaint was really that it generated an ERROR, IIUC.

While that would perhaps meet the OP's desires, it still seems like
a strange definition of "IF EXISTS" to me.  That's supposed to be an
escape hatch for "object does not exist", not for other error types.

The long and short of it is that I'm not dissatisfied with the way
this works now, and given the lack of previous complaints, not
many other people are either.  So I do not think this is a bug fix;
it's a definition disagreement.  I'm inclined to think that the
principle of "stare decisis" ought to apply here --- once we've let
a particular behavior stand for N release cycles, it should take
more than one or two people objecting to change it, for the sake of
backwards compatibility.

Also, based on other messages, it seems like what the OP wants is
to be sure that "CREATE TABLE X" will succeed afterwards, so that
failing to get rid of view X will not do what he needs.
There might be some merit in pursuing the DROP RELATION idea that
was floated earlier.  That seems analogous to DROP ROUTINE, which
is dealing with a related case and apparently is standards-compliant.

regards, tom lane



Re: Legacy GiST invalid tuples

2018-07-04 Thread Alvaro Herrera
On 2018-Jul-04, Andres Freund wrote:

> On 2018-07-04 17:02:01 -0400, Alvaro Herrera wrote:
> > On 2018-Jul-04, Andres Freund wrote:
> > 
> > > > I think the soonest this can work is to add the column in pg12 so that
> > > > it can be used to upgrade to pg13.
> > > 
> > > I don't think we can easily require that everyone pg_upgrading to v13+
> > > upgrades to v12 first?
> > 
> > We've never done that, I don't know if we can get away with it.
> > Upgrades with pg_upgrade are not cheap enough for that, methinks.
> > Maybe I'm wrong, but people complain even at a *single* pg_upgrade --
> > seems everybody wants in-place upgrades nowadays ...
> 
> Right. That's why I was wondering about making the "last full scan"
> feature backpatchable...

Maybe it's not so bad to make it a reloption in back branches and a
full-fledged column going forward?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: shared-memory based stats collector

2018-07-04 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas  wrote 
> in 
>> Copying the whole hash table kind of sucks, partly because of the
>> time it will take to copy it, but also because it means that memory
>> usage is still O(nbackends * ntables).  Without looking at the patch,
>> I'm guessing that you're doing that because we need a way to show each
>> transaction a consistent snapshot of the data, and I admit that I
>> don't see another obvious way to tackle that problem.  Still, it would
>> be nice if we had a better idea.

> The consistency here means "repeatable read" of an object's stats
> entry, not a snapshot covering all objects. We don't need to copy
> all the entries at once following this definition. The attached
> version makes a cache entry only for requested objects.

Uh, what?  That's basically destroying the long-standing semantics of
statistics snapshots.  I do not think we can consider that acceptable.
As an example, it would mean that scan counts for indexes would not
match up with scan counts for their tables.

regards, tom lane



Re: Unexpected behavior of DROP VIEW/TABLE IF EXISTS

2018-07-04 Thread David G. Johnston
On Wednesday, July 4, 2018, Tom Lane  wrote:
>
> Also, based on other messages, it seems like what the OP wants is
> to be sure that "CREATE TABLE X" will succeed afterwards, so that
> failing to get rid of view X will not do what he needs.
>

I read that, and agree that what should be possible, absent DROP RELATION, is
for someone to execute both DROP VIEW and DROP TABLE with the same name, in
any order, and not have the transaction abort with an error (when a table or
view of the same name already exists).  Having done that, the CREATE VIEW will
succeed, since either both were no-ops (no table or view of that name existed)
or one was guaranteed to succeed and the other was a no-op (a table or view
existed; it doesn't matter which).  As it stands now, you have to know whether
the existing object is a table or a view in order to get the order correct;
usually the table exists on the first run and the view on the second, so the
script has to be changed in between.

If this didn't involve an error mode the desire to leave things as-is would
be more understandable to me.

David J.


Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-04 Thread Tom Lane
Thomas Munro  writes:
> I'd like to do this to postinit.c:

> PerformAuthentication(MyProcPort);
> +   AcceptInvalidationMessages();
> InitializeSessionUserId(username, useroid);

> Any objections?

That seems like a *really* ad-hoc place to put it.  Why should it be
there, and not (say) somewhere inside InitializeSessionUserId, or maybe
(also?) inside PerformAuthentication?  Why do the existing call sites for
AcceptInvalidationMessages --- in particular, the ones associated with
lock acquisition --- not solve the problem already?

I'm not opposed to trying to make things better if we have a stale-cache
problem during init, just dubious that this is how to do it.

regards, tom lane



setproctitle_fast()

2018-07-04 Thread Thomas Munro
Hello hackers,

FreeBSD's setproctitle() is a bit slow because it contains a syscall
or two, so people often run PostgreSQL with update_process_title set
to off on that OS.  That makes the user experience not quite as nice
as on Linux.  As a weekend learn-me-some-kernel-hacking project I fixed
that and got the patch committed to FreeBSD 12, though I was asked to
use a new libc entry point, setproctitle_fast().  Here's a patch to teach
PostgreSQL about that.  It doesn't have much effect on small systems,
but it makes "pgbench -c 40 -j 40 -S -M prepared" do ~10% more
transactions per second on an AWS m4.10xlarge instance.
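
The guts of the change are just a compile-time fallback along these lines (a
sketch, not the patch itself; the HAVE_SETPROCTITLE_FAST macro and the
ps_buffer variable stand in for whatever the configure test and ps_status.c
actually use):

    /* Prefer the syscall-free variant where libc provides it. */
    #ifdef HAVE_SETPROCTITLE_FAST
        setproctitle_fast("%s", ps_buffer);
    #else
        setproctitle("%s", ps_buffer);
    #endif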

I'll park this patch here until the FreeBSD feature escapes in a
RELEASE version in a few months.

For anyone interested, https://wiki.postgresql.org/wiki/FreeBSD has
some notes on this and other such things.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Use-setproctitle_fast-to-update-the-ps-status-if-ava.patch
Description: Binary data


Re: Allow cancelling VACUUM of nbtrees with corrupted right links

2018-07-04 Thread Andres Freund
Hi,

On 2018-06-27 12:16:29 -0700, Andres Freund wrote:
> I think we should backpatch those checks - it's a fairly nasty situation
> for users to not be able to even drop an index without going to single
> user mode.

Did that back to 9.4 - before that page deletion and splits worked
differently enough that it seems higher risk to add the checks there.

Greetings,

Andres Freund



peripatus build failures....

2018-07-04 Thread Larry Rosenman
I noticed my buildfarm member peripatus hadn't been building due to me
missing a perl library.  After I fixed that I get failures on:

Buildfarm member peripatus failed on REL9_3_STABLE stage Make
Buildfarm member peripatus failed on REL9_4_STABLE stage Make
Buildfarm member peripatus failed on REL9_5_STABLE stage Make
Buildfarm member peripatus failed on REL9_6_STABLE stage Make
Buildfarm member peripatus failed on REL_10_STABLE stage Make

HEAD and REL_11_STABLE build fine.

These all fail on relocation issues in pgport(strcase*) and possibly
others. 

This is FreeBSD HEAD as of today, clang 6.0.1, lld as the linker. 

Can someone take a look at these?
I've tried making sure all the cache's are clear, etc, same results. 


Thanks.

-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106


signature.asc
Description: PGP signature


Re: Looks like we can enable AF_UNIX on Windows now

2018-07-04 Thread Noah Misch
On Wed, May 30, 2018 at 09:59:01AM +0800, Craig Ringer wrote:
> On 30 May 2018 at 09:53, Andres Freund  wrote:
> > On May 29, 2018 9:44:09 PM EDT, Craig Ringer  wrote:
> > >https://blogs.msdn.microsoft.com/commandline/2017/12/19/af_unix-comes-to-windows/
> > >
> > >The latest Windows 10 update shipped with AF_UNIX socket support for
> > >win32
> > >sockets.
> > >
> > >It's not that exciting because it doesn't support socketpair() or fd
> > >passing - yet. So really it doesn't offer us much more than we can
> > >already
> > >get with win32 named pipes. We can - and do - already get fd passing
> > >with
> > >DuplicateHandle anyway.
> > >
> > >Still, I thought it was interesting. We could probably just
> > >conditionally
> > >enable AF_UNIX sockets on new enough windows SDKs. Apparently if it's
> > >not
> > >supported by the OS runtime you get a graceful error.
> >
> > Last time I checked it didn't support transporting user identification
> > though. Which means not that much value would be added. Is that still the
> > case?
> 
> Right, so it is. I missed that.
> 
> They implemented unix sockets, except the interesting bits.

For use as a frontend/backend protocol transport, these are the interesting
bits:

1. User identification for peer auth
2. Enforcement of file modes from socket's ancestor directories
3. Compatibility with select() and other socket APIs

The article says the implementation has (2), and (3) seems likely.  For (1),
Windows already supports user identification over TCP, which PostgreSQL uses
to implement SSPI authentication.  I expect that to work equally well over
AF_UNIX, and adding a getpeereid() equivalent would not help much.

While enabling AF_UNIX sockets on Windows wouldn't achieve anything wondrous,
it would unblock check-world testing src/test/authentication on Windows.



Re: peripatus build failures....

2018-07-04 Thread Thomas Munro
On Thu, Jul 5, 2018 at 11:43 AM, Larry Rosenman  wrote:
> I noticed my buildfarm member peripatus hadn't been building due to me
> missing a perl library.  After I fixed that I get failures on:
>
> Buildfarm member peripatus failed on REL9_3_STABLE stage Make
> Buildfarm member peripatus failed on REL9_4_STABLE stage Make
> Buildfarm member peripatus failed on REL9_5_STABLE stage Make
> Buildfarm member peripatus failed on REL9_6_STABLE stage Make
> Buildfarm member peripatus failed on REL_10_STABLE stage Make
>
> HEAD and REL_11_STABLE build fine.
>
> These all fail on relocation issues in pgport(strcase*) and possibly
> others.
>
> This is FreeBSD HEAD as of today, clang 6.0.1, lld as the linker.
>
> Can someone take a look at these?
> I've tried making sure all the cache's are clear, etc, same results.

Seems to have something to do with the recent linker toolchain changes
in FreeBSD.  I don't actually understand what's going on here or why
it doesn't affect REL_11_STABLE and HEAD, but this seems to be some
kind of explanation + workaround (as used by the ports):

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219758
https://svnweb.freebsd.org/ports?view=revision&revision=456635

-- 
Thomas Munro
http://www.enterprisedb.com



Re: peripatus build failures....

2018-07-04 Thread Larry Rosenman
On Thu, Jul 05, 2018 at 12:30:37PM +1200, Thomas Munro wrote:
> On Thu, Jul 5, 2018 at 11:43 AM, Larry Rosenman  wrote:
> > I noticed my buildfarm member peripatus hadn't been building due to me
> > missing a perl library.  After I fixed that I get failures on:
> >
> > Buildfarm member peripatus failed on REL9_3_STABLE stage Make
> > Buildfarm member peripatus failed on REL9_4_STABLE stage Make
> > Buildfarm member peripatus failed on REL9_5_STABLE stage Make
> > Buildfarm member peripatus failed on REL9_6_STABLE stage Make
> > Buildfarm member peripatus failed on REL_10_STABLE stage Make
> >
> > HEAD and REL_11_STABLE build fine.
> >
> > These all fail on relocation issues in pgport(strcase*) and possibly
> > others.
> >
> > This is FreeBSD HEAD as of today, clang 6.0.1, lld as the linker.
> >
> > Can someone take a look at these?
> > I've tried making sure all the cache's are clear, etc, same results.
> 
> Seems to have something to do with the recent linker toolchain changes
> in FreeBSD.  I don't actually understand what's going on here or why
> it doesn't affect REL_11_STABLE and HEAD, but this seems to be some
> kind of explanation + workaround (as used by the ports):
> 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219758
> https://svnweb.freebsd.org/ports?view=revision&revision=456635
> 
I agree.  Is there an easy way I can add this workaround to peripatus'
source tree:

It may be that adding "LDFLAGS+= -Wl,-z,notext" (and removing LLD_UNSAFE) will 
let the port build with lld.

Thanks, Thomas.

-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106


signature.asc
Description: PGP signature


Re: Server crashed with dense_rank on partition table.

2018-07-04 Thread Andres Freund
On 2018-07-02 17:14:14 +0900, Amit Langote wrote:
> I studied this a bit and found a bug that's causing the crash.
> 
> The above mentioned commit has this hunk:
> 
> @@ -1309,6 +1311,9 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
>  PG_RETURN_INT64(rank);
> 
>  osastate = (OSAPerGroupState *) PG_GETARG_POINTER(0);
> +econtext = osastate->qstate->econtext;
> +if (!econtext)
> +osastate->qstate->econtext = econtext =
> CreateStandaloneExprContext();
> 
> In CreateStandaloneExprContext(), we have this:
> 
> econtext->ecxt_per_query_memory = CurrentMemoryContext;
> 
> /*
>  * Create working memory for expression evaluation in this context.
>  */
> econtext->ecxt_per_tuple_memory =
> AllocSetContextCreate(CurrentMemoryContext,
>   "ExprContext",
>   ALLOCSET_DEFAULT_SIZES);
> 
> I noticed when debugging the crashing query that CurrentMemoryContext is
> actually the per-tuple memory context of some expression context of the
> calling code, which would get reset before getting here again.  So, it's
> wrong of hypothetical_dense_rank_final to call CreateStandaloneExprContext
> without first switching to an actual per-query context.
> 
> Attached patch seems to fix the crash.

Thanks, that looks correct. Pushed!
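
For the archives, the shape of the fix is to switch into the aggregate's
per-query memory context around the ExprContext creation, roughly (a sketch
based on the analysis above, assuming the per-query context is reachable as
osastate->qstate->qcontext):

    MemoryContext oldcontext;

    oldcontext = MemoryContextSwitchTo(osastate->qstate->qcontext);
    osastate->qstate->econtext = econtext = CreateStandaloneExprContext();
    MemoryContextSwitchTo(oldcontext);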

- Andres



Re: peripatus build failures....

2018-07-04 Thread Larry Rosenman
On Wed, Jul 04, 2018 at 07:35:28PM -0500, Larry Rosenman wrote:
> On Thu, Jul 05, 2018 at 12:30:37PM +1200, Thomas Munro wrote:
> > On Thu, Jul 5, 2018 at 11:43 AM, Larry Rosenman  wrote:
> > > I noticed my buildfarm member peripatus hadn't been building due to me
> > > missing a perl library.  After I fixed that I get failures on:
> > >
> > > Buildfarm member peripatus failed on REL9_3_STABLE stage Make
> > > Buildfarm member peripatus failed on REL9_4_STABLE stage Make
> > > Buildfarm member peripatus failed on REL9_5_STABLE stage Make
> > > Buildfarm member peripatus failed on REL9_6_STABLE stage Make
> > > Buildfarm member peripatus failed on REL_10_STABLE stage Make
> > >
> > > HEAD and REL_11_STABLE build fine.
> > >
> > > These all fail on relocation issues in pgport(strcase*) and possibly
> > > others.
> > >
> > > This is FreeBSD HEAD as of today, clang 6.0.1, lld as the linker.
> > >
> > > Can someone take a look at these?
> > > I've tried making sure all the cache's are clear, etc, same results.
> > 
> > Seems to have something to do with the recent linker toolchain changes
> > in FreeBSD.  I don't actually understand what's going on here or why
> > it doesn't affect REL_11_STABLE and HEAD, but this seems to be some
> > kind of explanation + workaround (as used by the ports):
> > 
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219758
> > https://svnweb.freebsd.org/ports?view=revision&revision=456635
> > 
> I agree.  Is there an easy way I can add this workaround to peripatus'
> source tree:
> 
> It may be that adding "LDFLAGS+= -Wl,-z,notext" (and removing LLD_UNSAFE) 
> will let the port build with lld.
> 
> Thanks, Thomas.

BTW: If some hacker wants to play on this box, I'm happy to make an
account and give ssh access.  It's a playground.

-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106


signature.asc
Description: PGP signature


Re: Server crashed with dense_rank on partition table.

2018-07-04 Thread Amit Langote
On 2018/07/05 9:40, Andres Freund wrote:
> On 2018-07-02 17:14:14 +0900, Amit Langote wrote:
>> I studied this a bit and found a bug that's causing the crash.
>>
>> The above mentioned commit has this hunk:
>>
>> @@ -1309,6 +1311,9 @@ hypothetical_dense_rank_final(PG_FUNCTION_ARGS)
>>  PG_RETURN_INT64(rank);
>>
>>  osastate = (OSAPerGroupState *) PG_GETARG_POINTER(0);
>> +econtext = osastate->qstate->econtext;
>> +if (!econtext)
>> +osastate->qstate->econtext = econtext =
>> CreateStandaloneExprContext();
>>
>> In CreateStandaloneExprContext(), we have this:
>>
>> econtext->ecxt_per_query_memory = CurrentMemoryContext;
>>
>> /*
>>  * Create working memory for expression evaluation in this context.
>>  */
>> econtext->ecxt_per_tuple_memory =
>> AllocSetContextCreate(CurrentMemoryContext,
>>   "ExprContext",
>>   ALLOCSET_DEFAULT_SIZES);
>>
>> I noticed when debugging the crashing query that CurrentMemoryContext is
>> actually the per-tuple memory context of some expression context of the
>> calling code, which would get reset before getting here again.  So, it's
>> wrong of hypothetical_dense_rank_final to call CreateStandaloneExprContext
>> without first switching to an actual per-query context.
>>
>> Attached patch seems to fix the crash.
> 
> Thanks, that looks correct. Pushed!

Thank you.

Regards,
Amit




Re: peripatus build failures....

2018-07-04 Thread Thomas Munro
On Thu, Jul 5, 2018 at 12:35 PM, Larry Rosenman  wrote:
> I agree.  Is there an easy way I can add this workaround to peripatus'
> source tree:
>
> It may be that adding "LDFLAGS+= -Wl,-z,notext" (and removing LLD_UNSAFE) 
> will let the port build with lld.

Maybe something like this at the end of your build-farm.conf?

if ($branch =~ /^REL(9|_10)/)
{
$conf{config_env}->{"LDFLAGS"} = "...something something...";
}

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-07-04 Thread Robert Haas
On Tue, Jul 3, 2018 at 12:41 AM Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

> > rhaas=# drop table foo;
> > ERROR:  table "foo" does not exist
> > HINT: Try dropping a table with a different name that does exist, or
> > first create this table before trying to drop it.
>
> Again a wrong example and wrong comparison. I think I was clear about
> the problem when I wrote


I don't think this is a question of "right" vs. "wrong".  You are certainly
entitled to your opinion, but I believe that I am entitled to mine, too.

--
> When there was only a single way, i.e. table
> inheritance, to "inherit" things, users could probably guess that. But
> now there are multiple ways to inherit things, we have to help the user a
> bit more. The user might figure out that it's a partition of a table,
> so the constraint is inherited from the partitioned table. But it will
> help if we give a hint about from where the constraint was inherited
> like the error message itself reads like "can not drop constraint
> "p_a_check" on relation "p1" inherited from "partitioned table's name"
> OR a hint "you may drop constraint "p_a_check" on the partitioned
> table "partitioned table's name".
> --
>
> For some reason you have chosen to remove this from the email and
> commented on previous part of it.


Well, as far as I know, it's up to me which parts of your emails I want to
quote in my reply. I did read this part. It did not change my opinion.  My
fundamental objection to your proposal is that I think it is too wordy. I
think users will generally know whether they are using partitioning or
inheritance, and if they don't it's not hard to figure out.   I don't think
quoting only part of what you wrote made the quote misleading, but it did
allow me to express my opinion. I understand that you don't agree, which is
fine, but I stand by my position.

...Robert
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


RE: How can we submit code patches that implement our (pending) patents?

2018-07-04 Thread Tsunakawa, Takayuki
From: Craig Ringer [mailto:cr...@2ndquadrant.com]
> I'm assuming you don't want to offer a grant that lets anyone use them for
> anything. But if you have a really broad grant to PostgreSQL, all someone
> would have to do to inherit the grant is re-use some part of PostgreSQL.

Your assumption is right.  No scope is the same as no patent; it won't help to
defend the PostgreSQL community against rival companies/communities of other
DBMSs.  Or, I think we can set the scope to what OIN states.  Fortunately,
anyone can join OIN free of charge.


> I guess there's a middle ground somewhere that protects substantial
> derivatives and extracts but stops you using some Pg code snippets as a
> freebie license.

Are you assuming that developers want to use PG code snippets for
non-PostgreSQL or even non-DBMS software?  I believe that accepting patented
code from companies would be practically more useful for PostgreSQL's
enhancement and growth.  PostgreSQL is now mature software, and it can be more
corporate-friendly, like other software under the Apache License.


Regards
Takayuki Tsunakawa





Re: peripatus build failures....

2018-07-04 Thread Larry Rosenman
On Thu, Jul 05, 2018 at 12:56:49PM +1200, Thomas Munro wrote:
> On Thu, Jul 5, 2018 at 12:35 PM, Larry Rosenman  wrote:
> > I agree.  Is there an easy way I can add this work around to peripatus'
> > source tree:
> >
> > It may be that adding "LDFLAGS+= -Wl,-z,notext" (and removing LLD_UNSAFE) 
> > will let the port build with lld.
> 
> Maybe something like this at the end of your build-farm.conf?
> 
> if ($branch =~ /^REL(9|_10)/)
> {
> $conf{config_env}->{"LDFLAGS"} = "...something something...";
> }
> 

Good news.  I ran a quick build test on my checked-out FreeBSD ports
tree, and with Ed Maste's suggestion it builds.

Ed's suggestion:
remove LLD_UNSAFE, and add to the LDFLAGS+= line in the port
-Wl,-z,notext.

So, that is indeed a fix for us.  My question is:
how do we add this LDFLAG to configure et al. for FreeBSD >= 12 on the
PostgreSQL <= 11 branches?

I'm more than willing to try and generate a patch, but would like some
guidance.  I'm also willing to give access to my box.



-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106


signature.asc
Description: PGP signature


Re: Invisible Indexes

2018-07-04 Thread David Rowley
On 19 June 2018 at 09:56, Andres Freund  wrote:
> Be careful about that - currently it's not actually trivially possible
> to ever update pg_index rows. No, I'm not kidding
> you. pg_index.indcheckxmin relies on the pg_index row's xmin. If you
> have ALTER do a non-in-place update, you'll break things.

Couldn't we just add a dedicated xid field to pg_index to solve that,
one which does not change when the row is updated?

Or would it be insanely weird to just not allow setting or unsetting
this invisible flag if indcheckxmin is true?  I can't imagine there
will be many people adding an index and not wanting to use it while
it's still being created.  I think the use case here is mostly people
wanting to test dropping indexes before they go and remove that 1TB
index that will take days to build again if they're wrong.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Invisible Indexes

2018-07-04 Thread Peter Geoghegan
On Wed, Jul 4, 2018 at 6:26 PM, David Rowley
 wrote:
> Or would it be insanely weird to just not allow setting or unsetting
> this invisible flag if indcheckxmin is true?  I can't imagine there
> will be many people adding an index and not wanting to use it while
> it's still being created.  I think the use case here is mostly people
> wanting to test dropping indexes before they go and remove that 1TB
> index that will take days to build again if they're wrong.

I'm surprised that that use case wasn't the first one that everyone
thought of. I actually assumed that that's what Andrew had in mind
when reading his original message. I only realized later that it
wasn't.

-- 
Peter Geoghegan



Re: peripatus build failures....

2018-07-04 Thread Larry Rosenman
On Wed, Jul 04, 2018 at 08:19:48PM -0500, Larry Rosenman wrote:
> On Thu, Jul 05, 2018 at 12:56:49PM +1200, Thomas Munro wrote:
> > On Thu, Jul 5, 2018 at 12:35 PM, Larry Rosenman  wrote:
> > > I agree.  Is there an easy way I can add this work around to peripatus'
> > > source tree:
> > >
> > > It may be that adding "LDFLAGS+= -Wl,-z,notext" (and removing LLD_UNSAFE) 
> > > will let the port build with lld.
> > 
> > Maybe something like this at the end of your build-farm.conf?
> > 
> > if ($branch =~ /^REL(9|_10)/)
> > {
> > $conf{config_env}->{"LDFLAGS"} = "...something something...";
> > }
> > 
> 
> Good news: I ran a quick build test on my checked-out FreeBSD ports
> tree, and with Ed Maste's suggestion it builds.
> 
> Ed's suggestion:
> remove LLD_UNSAFE, and add to the LDFLAGS+= line in the port
> -Wl,-z,notext.
> 
> So that is indeed a fix for us.  My question is:
> how do we add this LDFLAG for FreeBSD >= 12 to PostgreSQL <= 11's configure
> et al.?
> 
> I'm more than willing to try and generate a patch, but would like some
> guidance.  I'm also willing to give access to my box.
> 
> 

I also filed FreeBSD pr
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229523 to make the
change in the FreeBSD port.


-- 
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: l...@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106




Re: PANIC during crash recovery of a recently promoted standby

2018-07-04 Thread Michael Paquier
On Mon, Jul 02, 2018 at 10:41:05PM +0900, Michael Paquier wrote:
> I am planning to finish wrapping this patch luckily on Wednesday JST
> time, or in the worst case on Thursday.  I got this problem on my mind
> for a couple of days now and I could not find a case where the approach
> taken could cause a problem.  Opinions are welcome.

Okay, pushed and back-patched.  Thanks to all who participated in the
thread!
--
Michael




Re: Old small commitfest items

2018-07-04 Thread Peter Geoghegan
On Mon, Jul 2, 2018 at 6:30 PM, Michael Paquier  wrote:
> On Mon, Jul 02, 2018 at 10:30:11AM -0400, Andrew Dunstan wrote:
>> 528 1146 Fix the optimization to skip WAL-logging on table created in
>> same transaction
>
> This has been around for an astonishing amount of time...  I don't
> recall all the details but rewriting most of the relation sync handling
> around heapam for a corner optimization is no fun.  There is no way that
> we could go so far as to eliminate wal_level = minimal, so I am wondering if
> we should not silently ignore the optimization if possible instead of
> throwing an error.  Perhaps logging a WARNING could make sense.

I don't know about any of that, but something has to give. How much
more time has to pass before we admit defeat? At a certain point, that
is the responsible thing to do.


-- 
Peter Geoghegan



Re: Invisible Indexes

2018-07-04 Thread David Rowley
On 5 July 2018 at 13:31, Peter Geoghegan  wrote:
> On Wed, Jul 4, 2018 at 6:26 PM, David Rowley
>  wrote:
>> Or would it be insanely weird to just not allow setting or unsetting
>> this invisible flag if indcheckxmin is true?  I can't imagine there
>> will be many people adding an index and not wanting to use it while
>> it's still being created.  I think the use case here is mostly people
>> wanting to test dropping indexes before they go and remove that 1TB
>> index that will take days to build again if they're wrong.
>
> I'm surprised that that use case wasn't the first one that everyone
> thought of. I actually assumed that that's what Andrew had in mind
> when reading his original message. I only realized later that it
> wasn't.

Hmm. Maybe I missed some other use case.  The mention of hypothetical
indexes seems a bit lost on this thread. Andrew's proposal mentions
that an invisible index will just not be considered by the planner.
I'd very much assume here that the index must exist on disk, and
there's not much hypothetical about that.

It seems to me that there would be exactly 1 place in the backend that
the new bool flag would need to be checked, and that's in
get_relation_info() to skip any indexes that are "invisible".  pg_dump
would, of course, need to know about this flag too.
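
Roughly, inside get_relation_info()'s loop over the relation's indexes it
would just be something like this (a sketch only; "indisvisible" stands in
for whatever the pg_index column ends up being called):

    /* sketch: the planner simply never collects invisible indexes */
    if (!index->indisvisible)
        continue;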

Like Andrew, I'm not much of a fan of the GUC idea.  Testing a plan
without an index could just be a BEGIN; ALTER INDEX; EXPLAIN;
ROLLBACK; operation. It seems much neater not to spread the properties
of an index all over the place when we have a perfectly good table to
store index properties in.  Unsure why Tom thinks that's ugly.

FWIW, I have also seen customers asking if they can test dropping an index
by setting indisready to false. Naturally, people are often a bit
scared to mess around with catalogue tables on a busy production server
without confirmation that it's fine.

Also, FWIW, I'd not bother with a CREATE INDEX syntax for this and
leave it to ALTER INDEX.  I also think that ENABLE/DISABLE is nicer
than VISIBLE/NOT VISIBLE.  Those are already unreserved words too.
Although, perhaps pg_dump would prefer us to have syntax for this in
CREATE INDEX since it could describe the new object in a single
statement.
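
Spelled out, the test-drop workflow would be something like this (a sketch;
the DISABLE syntax is only proposed here, and the object names are made up):

    BEGIN;
    ALTER INDEX big_tbl_idx DISABLE;              -- proposed syntax
    EXPLAIN SELECT * FROM big_tbl WHERE id = 42;  -- planner ignores the index
    ROLLBACK;                                     -- index is considered again

Since catalogue changes are transactional, the ROLLBACK restores the old
state, so nothing is at risk on a production server.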

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Invisible Indexes

2018-07-04 Thread Peter Geoghegan
On Wed, Jul 4, 2018 at 7:09 PM, David Rowley
 wrote:
> hmm. Maybe I missed any other use case.  The mention of hypothetical
> indexes seems a bit lost on this thread. Andrew's proposal mentions
> that an invisible index will just not be considered by the planner.
> I'd very much assume here that the index must exist on disk, and
> there's not much hypothetical about that.

+1

> Like Andrew, I'm not much of a fan of the GUC idea.  Testing a plan
> without an index could just be a BEGIN; ALTER INDEX; EXPLAIN;
> ROLLBACK; operation. It seems much neater not to spread the properties
> of an index all over the place when we have a perfectly good table to
> store index properties in.  Unsure why Tom thinks that's ugly.

I have to admit to not getting what's so ugly about it myself.

> FWIW I have also seen customers asking if they can test drop an index
> by setting indisready to false. Naturally, people are often a bit
> scared to confirm messing around with catalogue tables on a busy
> production server is fine.

That's very easy for me to understand. A large production application
can be complicated in a way that nobody can quite nail down. Often,
being sure that dropping an index won't have any ramifications is an
unobtainable luxury, because knowledge about how the app works isn't
centralized in one place. If it's a very large index, why even take a
very small chance?

-- 
Peter Geoghegan



Re: Old small commitfest items

2018-07-04 Thread Michael Paquier
On Wed, Jul 04, 2018 at 06:54:05PM -0700, Peter Geoghegan wrote:
> I don't know about any of that, but something has to give. How much
> more time has to pass before we admit defeat? At a certain point, that
> is the responsible thing to do.

Well, for this one it is not really complicated to avoid the failures
reported and the potential data losses if said optimizations,
which are actually broken, have their checks tightened a bit.  So I'd
rather not give up on this one if there are ways to prevent user-facing
problems.
--
Michael




Re: shared-memory based stats collector

2018-07-04 Thread Kyotaro HORIGUCHI
Hello.

At Wed, 04 Jul 2018 17:23:51 -0400, Tom Lane  wrote in 
<67470.1530739...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas  
> > wrote in 
> > 
> >> Copying the whole hash table kinds of sucks, partly because of the
> >> time it will take to copy it, but also because it means that memory
> >> usage is still O(nbackends * ntables).  Without looking at the patch,
> >> I'm guessing that you're doing that because we need a way to show each
> >> transaction a consistent snapshot of the data, and I admit that I
> >> don't see another obvious way to tackle that problem.  Still, it would
> >> be nice if we had a better idea.
> 
> > The consistency here means "repeatable read" of an object's stats
> > entry, not a snapshot covering all objects. We don't need to copy
> > all the entries at once following this definition. The attached
> > version makes a cache entry only for requested objects.
> 
> Uh, what?  That's basically destroying the long-standing semantics of
> statistics snapshots.  I do not think we can consider that acceptable.
> As an example, it would mean that scan counts for indexes would not
> match up with scan counts for their tables.

The current stats collector mechanism sends at most 8 table stats
in a single message. Split messages from multiple transactions
can reach the collector in shuffled order. The resulting snapshot
can be "inconsistent" if an INQUIRY message comes between such split
messages.  Of course a single message would be enough for common
transactions, but not for all.

Even though the inconsistency happens more frequently with this
patch, I don't think users expect such strict consistency of
table stats, especially on a busy system. And I believe it's a
good thing if users see more "useful" information in exchange for
the relaxed consistency. (The actual meaning of "useful" is out
of the current focus:p)
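
To be concrete, the per-object "repeatable read" is essentially this (a
minimal sketch; snapshot_hash and shared_lookup_tabstat are illustrative
names, not the patch's actual identifiers):

    typedef struct TabStatSnap
    {
        Oid                 relid;  /* hash key */
        PgStat_StatTabEntry stats;  /* counters copied at first access */
    } TabStatSnap;

    static PgStat_StatTabEntry *
    backend_get_tabstat(Oid relid)
    {
        bool         found;
        TabStatSnap *snap;

        /* snapshot_hash is a transaction-local hash, reset at end of xact */
        snap = hash_search(snapshot_hash, &relid, HASH_ENTER, &found);
        if (!found)
            snap->stats = *shared_lookup_tabstat(relid);
        return &snap->stats;    /* stable for the rest of the transaction */
    }

Each object is copied at most once per transaction, so the extra memory
grows with the number of objects actually looked at, not with ntables.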


Meanwhile, if we are to keep the practical consistency, a giant
lock is out of the question. So we need transactional stats of
some shape. It could be a whole-image snapshot, a regular MVCC
table, or maybe the current dshash with UNDO logs. Since there
are actually many states, storage is inevitably required to
reproduce each state.

I think the consensus is that the whole-image snapshot takes
too much memory, and MVCC is apparently too much for the purpose.

UNDO logs seem a bit promising. If we look at stats in a long
transaction, the required memory for UNDO information easily
reaches the same amount as the whole-image snapshot, but I
expect that case is not so common.

I'll consider that separately from the current patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Old small commitfest items

2018-07-04 Thread Peter Geoghegan
On Wed, Jul 4, 2018 at 7:53 PM, Michael Paquier  wrote:
> On Wed, Jul 04, 2018 at 06:54:05PM -0700, Peter Geoghegan wrote:
>> I don't know about any of that, but something has to give. How much
>> more time has to pass before we admit defeat? At a certain point, that
>> is the responsible thing to do.
>
> Well, for this one it is not really complicated to avoid the failures
> reported and the potential data losses if the so-said optimizations,
> which are actually broken, have their checks tightened a bit.  So I'd
> rather not give up on this one if there are ways to prevent user-facing
> problems.

I'm not suggesting that we should give up, or that we should not give
up. I think that timeboxing it is a good idea. In other words, the
question "How much more time has to pass before we admit defeat?" was
not a rhetorical question.

As things stand, we're not doing anything, which has a cost that adds
up as time goes on. Let's be realistic. If nobody is willing to do the
work, then a reasonable person must eventually conclude that that's
because it isn't worth doing.

-- 
Peter Geoghegan



Re: Non-reserved replication slots and slot advancing

2018-07-04 Thread Michael Paquier
On Wed, Jul 04, 2018 at 09:57:31AM -0400, Alvaro Herrera wrote:
> None from me.

Thanks Alvaro.  For now the patch uses the following error message:
+SELECT pg_replication_slot_advance('regression_slot3', '0/1'); -- error
+ERROR:  cannot move slot with non-reserved restart_lsn

Directly mentioning the column name of pg_replication_slots is confusing,
I think.  Here are some suggestions for perhaps better error messages:
1) cannot move unreserved slot.
2) cannot move slot which has never been reserved.

Any better ideas?
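
With wording 2, for instance, the check would look something like this
(just a sketch):

    if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("cannot move slot which has never been reserved")));
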
--
Michael




Re: Speedup of relation deletes during recovery

2018-07-04 Thread Michael Paquier
On Thu, Jul 05, 2018 at 03:10:33AM +0900, Fujii Masao wrote:
> On Tue, Jul 3, 2018 at 11:28 AM, Michael Paquier  wrote:
> Thanks for the review! So, committed.

Thanks.

>> (please indent!).
>
> Hmm.. I failed to find an indent issue in my patch... But anyway
> future execution of pgindent will fix that even if it exists.

md.c is telling a rather different story ;)

Well, spurious noise when touching the same files for a different patch
is annoying...  I have had that experience already with a couple of
commits.
--
Michael




Re: pgsql: Fix crash when ALTER TABLE recreates indexes on partitions

2018-07-04 Thread Andres Freund
On 2018-07-04 13:15:19 -0400, Alvaro Herrera wrote:
> On 2018-Jun-30, Tom Lane wrote:
> 
> > Alvaro Herrera  writes:
> > > Fix crash when ALTER TABLE recreates indexes on partitions
> > 
> > So ... buildfarm member skink has been reporting a valgrind failure
> > during initdb since this patch went in.  However, I'm unable to reproduce
> > such a failure here, and it's less than obvious how this particular patch
> > could have caused a problem, and skink's report is pretty useless because
> > it's providing only a numeric stack trace.  Thoughts?
> 
> Actually, older branches are failing in the same way, so it's not my
> patch that caused it to fail.  Something must have changed in the system ...
> Andres?

I spent a fair amount of time trying to dig into this. It appears to be
related to a binutils update. Everything built with a binutils version <
2.30.90.20180627-1 (in my case 2.30-22) works correctly with
valgrind. If rebuilt after an upgrade it doesn't.  Still trying to
narrow down further.

Greetings,

Andres Freund



Re: Remove mention in docs that foreign keys on partitioned tables are not supported

2018-07-04 Thread Ashutosh Bapat
On Thu, Jul 5, 2018 at 6:44 AM, Robert Haas  wrote:
>
> Well, as far as I know, it's up to me which parts of your emails I want to
> quote in my reply. I did read this part. It did not change my opinion.  My
> fundamental objection to your proposal is that I think it is too wordy. I
> think users will generally know whether they are using partitioning or
> inheritance, and if they don't it's not hard to figure out.   I don't think
> quoting only part of what you wrote made the quote misleading, but it did
> allow me to express my opinion.

I would usually assume that people have read the complete email before
replying. But when a reply raises a question without quoting the
portion of my mail which I think has the answer, I am confused.
There's no straightforward way to avoid that confusion, given that
this is written, turn-based communication, but I try to get clarity on
that confusion.

> I understand that you don't agree, which is
> fine, but I stand by my position.
>

I am fine with disagreement, now that I know why there's disagreement.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



RE: Speeding up INSERTs and UPDATEs to partitioned tables

2018-07-04 Thread Kato, Sho
Hi,

I tried to benchmark with 
v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch, but when I 
create the second partition, the server process gets a segmentation fault.

I don't know the cause, but it seems that an incorrect value is set in 
partdesc->boundinfo (the repeating 0x7f bytes below look like memory that 
was clobbered after being freed).

(gdb) p partdesc->boundinfo[0]
$6 = {strategy = 0 '\000', ndatums = 2139062142, datums = 0x7f7f7f7f7f7f7f7f, 
kind = 0x7f7f7f7f7f7f7f7f, indexes = 0x7f7f7f7f7f7f7f7f, null_index = 
2139062143, default_index = 2139062143}


$ psql postgres
psql (11beta2)
Type "help" for help.

postgres=# create table a(i int) partition by range(i);
CREATE TABLE
postgres=# create table a_1 partition of a for values from(1) to (200);
CREATE TABLE
postgres=# create table a_2 partition of a for values from(200) to (400);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!> 

2018-07-05 14:02:52.405 JST [60250] LOG:  server process (PID 60272) was 
terminated by signal 11: Segmentation fault
2018-07-05 14:02:52.405 JST [60250] DETAIL:  Failed process was running: create 
table a_2 partition of a for values from(200) to (400);

(gdb) bt
#0  0x00596e52 in get_default_oid_from_partdesc (partdesc=0x259e928) at 
partition.c:269
#1  0x00677355 in DefineRelation (stmt=0x259e610, relkind=114 'r', 
ownerId=10, typaddress=0x0, queryString=0x24d58b8 "create table a_2 partition 
of a for values from(200) to (400);") at tablecmds.c:832
#2  0x008b6893 in ProcessUtilitySlow (pstate=0x259e4f8, 
pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for 
values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, 
params=0x0, queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") 
at utility.c:1000
#3  0x008b66c2 in standard_ProcessUtility (pstmt=0x24d67d8, 
queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to 
(400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, 
queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at 
utility.c:920
#4  0x008b583b in ProcessUtility (pstmt=0x24d67d8, 
queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to 
(400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:360
#5  0x008b482c in PortalRunUtility (portal=0x253af38, pstmt=0x24d67d8, 
isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8, 
completionTag=0x7ffc05932330 "") at pquery.c:1178
#6  0x008b4a45 in PortalRunMulti (portal=0x253af38, isTopLevel=true, 
setHoldSnapshot=false, dest=0x24d6ac8, altdest=0x24d6ac8, 
completionTag=0x7ffc05932330 "") at pquery.c:1324
#7  0x008b3f7d in PortalRun (portal=0x253af38, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x24d6ac8, 
altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:799
#8  0x008adf16 in exec_simple_query (query_string=0x24d58b8 "create 
table a_2 partition of a for values from(200) to (400);") at postgres.c:1122
#9  0x008b21a5 in PostgresMain (argc=1, argv=0x24ff5b0, 
dbname=0x24ff410 "postgres", username=0x24d2358 "symfo") at postgres.c:4153
#10 0x008113f4 in BackendRun (port=0x24f73f0) at postmaster.c:4361
#11 0x00810b67 in BackendStartup (port=0x24f73f0) at postmaster.c:4033
#12 0x0080d0ed in ServerLoop () at postmaster.c:1706
#13 0x0080c9a3 in PostmasterMain (argc=1, argv=0x24d0310) at 
postmaster.c:1379
#14 0x00737392 in main (argc=1, argv=0x24d0310) at main.c:228

(gdb) disassemble
Dump of assembler code for function get_default_oid_from_partdesc:
   0x00596e0a <+0>:     push   %rbp
   0x00596e0b <+1>:     mov    %rsp,%rbp
   0x00596e0e <+4>:     mov    %rdi,-0x8(%rbp)
   0x00596e12 <+8>:     cmpq   $0x0,-0x8(%rbp)
   0x00596e17 <+13>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x00596e19 <+15>:    mov    -0x8(%rbp),%rax
   0x00596e1d <+19>:    mov    0x10(%rax),%rax
   0x00596e21 <+23>:    test   %rax,%rax
   0x00596e24 <+26>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x00596e26 <+28>:    mov    -0x8(%rbp),%rax
   0x00596e2a <+32>:    mov    0x10(%rax),%rax
   0x00596e2e <+36>:    mov    0x24(%rax),%eax
   0x00596e31 <+39>:    cmp    $0x,%eax
   0x00596e34 <+42>:    je     0x596e56 <get_default_oid_from_partdesc+76>
   0x00596e36 <+44>:    mov    -0x8(%rbp),%rax
   0x00596e3a <+48>:    mov    0x8(%rax),%rdx
   0x00596e3e <+52>:    mov    -0x8(%rbp),%rax
   0x00596e42 <+56>:    mov    0x10(%rax),%rax
   0x00596e46 <+60>:    mov    0x24(%rax),%eax
   0x00596e49 <+63>:    cltq
   0x00596e4b <+65>:    shl    $0x2,%rax
   0x00596e4f <+69>:    add    %rdx,%rax
=> 0x00596e52 <+72>:    mov    (%rax),%eax
   0x00596e54 <+74>:

Re: [HACKERS] Restricting maximum keep segments by repslots

2018-07-04 Thread Masahiko Sawada
On Wed, Jul 4, 2018 at 5:28 PM, Kyotaro HORIGUCHI
 wrote:
> Hello.
>
> At Tue, 26 Jun 2018 16:26:59 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20180626.162659.223208514.horiguchi.kyot...@lab.ntt.co.jp>
>> The previous patch files don't have a version number, so I let
>> the attached latest version be v2.
>>
>>
>> v2-0001-Add-WAL-releaf-vent-for-replication-slots.patch
>>   The body of the limiting feature
>>
>> v2-0002-Add-monitoring-aid-for-max_replication_slots.patch
>>   Shows the status of WAL retention in the pg_replication_slots view
>>
>> v2-0003-TAP-test-for-the-slot-limit-feature.patch
>>   TAP test for this feature
>>
>> v2-0004-Documentation-for-slot-limit-feature.patch
>>   Documentation, as the name.
>
> Travis (test_decoding test) showed that GetOldestXLogFileSegNo
> added by 0002 forgets to close temporarily opened pg_wal
> directory. This is the fixed version v3.
>

Thank you for updating the patch! I looked at the v3 patches. Here are
my review comments.

---
+   {"max_slot_wal_keep_size", PGC_SIGHUP, REPLICATION_SENDING,
+   gettext_noop("Sets the maximum size of extra
WALs kept by replication slots."),
+NULL,
+GUC_UNIT_MB
+   },
+   &max_slot_wal_keep_size_mb,
+   0, 0, INT_MAX,
+   NULL, NULL, NULL
+   },

Maybe MAX_KILOBYTES would be better instead of INT_MAX like wal_max_size.

---
Once the following WARNING has been emitted, the message is emitted again
whenever we execute CHECKPOINT, even if nothing has changed. Is that
expected behavior? I think it would be better to emit this message only
when we remove WAL segments that are required by slots.

WARNING:  some replication slots have lost required WAL segments
DETAIL:  The mostly affected slot has lost 153 segments.

---
+   Assert(wal_keep_segments >= 0);
+   Assert(max_slot_wal_keep_size_mb >= 0);

These assertions are not meaningful because these parameters are
ensured to be >= 0 by their definitions.

---
+/* slots aren't useful, consider only wal_keep_segments */
+if (slotpos == InvalidXLogRecPtr)
+{

Perhaps XLogRecPtrIsInvalid(slotpos) is better.

---
+if (slotpos != InvalidXLogRecPtr && currSeg <= slotSeg + wal_keep_segments)
+slotpos = InvalidXLogRecPtr;
+
+/* slots aren't useful, consider only wal_keep_segments */
+if (slotpos == InvalidXLogRecPtr)
+{

This logic is hard for me to read. slotpos can be any of: valid,
valid but then invalidated halfway through, or invalid from the
beginning of this function. Can we convert this logic to the following?

if (XLogRecPtrIsInvalid(slotpos) ||
currSeg <= slotSeg + wal_keep_segments)

---
+keepSegs = wal_keep_segments +
+ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

Why do we need to keep (wal_keep_segments + max_slot_wal_keep_size) WAL
segments? I think what this feature does is: if wal_keep_segments is
not useful (that is, slotSeg < (currSeg - wal_keep_segments)), then we
normally choose slotSeg as the lower boundary, but max_slot_wal_keep_size
restricts the lower boundary so that it doesn't get lower than that
threshold. So I thought what this function should calculate is
min(currSeg - wal_keep_segments, max(currSeg - max_slot_wal_keep_size,
slotSeg)).  I might be missing something, though.
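
In code form, what I have in mind is something like this (a sketch; the
names follow the discussion rather than the patch, and I assume currSeg is
large enough that the subtractions don't wrap):

    static XLogSegNo
    keep_lower_bound(XLogSegNo currSeg, XLogSegNo slotSeg,
                     XLogSegNo keepSegs,     /* wal_keep_segments */
                     XLogSegNo maxSlotSegs)  /* max_slot_wal_keep_size, in segments */
    {
        /* slots hold WAL back, but no further than max_slot_wal_keep_size */
        XLogSegNo   bound = Max(currSeg - maxSlotSegs, slotSeg);

        /* wal_keep_segments applies regardless of slots */
        return Min(currSeg - keepSegs, bound);
    }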

---
+SpinLockAcquire(&XLogCtl->info_lck);
+oldestSeg = XLogCtl->lastRemovedSegNo;
+SpinLockRelease(&XLogCtl->info_lck);

We can use XLogGetLastRemovedSegno() instead.
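
That is, simply:

    oldestSeg = XLogGetLastRemovedSegno();  /* takes info_lck internally */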

---
+xldir = AllocateDir(XLOGDIR);
+if (xldir == NULL)
+ereport(ERROR,
+(errcode_for_file_access(),
+ errmsg("could not open write-ahead log directory \"%s\": %m",
+XLOGDIR)));

Looking at other code that allocates a directory, we don't check xldir ==
NULL because that case is detected by ReadDir(), which raises the error
itself. So maybe we don't need to check it just after allocation.
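
That is, the usual pattern elsewhere in the tree is simply:

    struct dirent *xlde;

    xldir = AllocateDir(XLOGDIR);   /* no NULL check needed here */
    while ((xlde = ReadDir(xldir, XLOGDIR)) != NULL)
    {
        /* ... examine xlde->d_name ... */
    }
    FreeDir(xldir);

ReadDir() itself reports the could-not-open error when xldir is NULL.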

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center