Re: findTargetlistEntrySQL92() and COLLATE clause

2019-05-07 Thread Amit Langote
On 2019/04/27 0:02, Tom Lane wrote:
> Amit Langote  writes:
>> I couldn't find old discussions or source code comments about this, but
>> has someone encountered the following error and wondered whether it's
>> working that way for a reason?
> 
>> select a::text, b from foo order by 1, 2 collate "C";
>> ERROR:  collations are not supported by type integer
>> LINE 1: select a::text, b from foo order by 1, 2 collate "C";
>>  ^
> 
> The reason it works that way is that *anything* except a bare integer
> constant is treated according to SQL99 rules (that is, it's an ordinary
> expression) not SQL92 rules.  I do not think we should get into weird
> bastard combinations of SQL92 and SQL99 rules, because once you do,
> there is no principled way to decide what anything means.  Who's to say
> whether "ORDER BY 1 + 2" means to take column 1 and add 2 to it and then
> sort, or maybe to add columns 1 and 2 and sort on the sum, or whatever?

Ah, OK.  Thanks for the explanation.

> IOW, -1 on treating COLLATE as different from other sorts of expressions
> here.  There's no principle that can justify that.

In contrast to your example above, maybe the COLLATE case is less
ambiguous in terms of what ought to be done?
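
For reference, here is a minimal SQL sketch of the current behavior and the
workaround, assuming a table shaped like the one in the original report:

create table foo (a int, b text);

-- fails: "2 collate "C"" is treated as an ordinary expression (an integer
-- constant with a COLLATE clause), not as a reference to output column 2
select a::text, b from foo order by 1, 2 collate "C";

-- works: name the column instead of using the SQL92 ordinal
select a::text, b from foo order by 1, b collate "C";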

Thanks,
Amit





Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-07 Thread Heikki Linnakangas

On 05/05/2019 01:38, Peter Geoghegan wrote:

On Fri, Mar 1, 2019 at 3:59 PM Peter Geoghegan  wrote:

 /*
  * Perform the same check on this internal level that
  * _bt_mark_page_halfdead performed on the leaf level.
  */
 if (_bt_is_page_halfdead(rel, *rightsib))



I thought that internal pages were never half-dead after Postgres 9.4.
If that happens, then the check within _bt_pagedel() will throw an
ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
this internal level _bt_is_page_halfdead() check contain a "can't
happen" error, or even a simple assertion?


I think that we should get rid of this code on HEAD shortly, because
it's effectively dead code. You don't have to know anything about
B-Trees to see why this must be true: VACUUM is specifically checking
if an internal page is half-dead here, even though it's already
treating half-dead internal pages as ipso facto corrupt in higher
level code (it's the first thing we check in _bt_pagedel()). This is
clearly contradictory. If there is a half-dead internal page, then
there is no danger of VACUUM not complaining loudly that you need to
REINDEX. This has been true since the page deletion overhaul that went
into 9.4.


I don't understand that reasoning. Yes, _bt_pagedel() will complain if 
it finds a half-dead internal page. But how does that mean that 
_bt_lock_branch_parent() can't encounter one?


- Heikki




New EXPLAIN option: ALL

2019-05-07 Thread David Fetter
Folks,

It can get a little tedious turning on (or off) all the boolean
options to EXPLAIN, so please find attached a shortcut.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
From 8de75e81eb7161c712e2af97b0619a69f33c6b91 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Tue, 7 May 2019 00:26:29 -0700
Subject: [PATCH v1] Add an ALL option to EXPLAIN


which starts all boolean options at once.

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 385d10411f..ade3793c5f 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -43,6 +43,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] statementboolean ]
 TIMING [ boolean ]
 SUMMARY [ boolean ]
+ALL [ boolean ]
 FORMAT { TEXT | XML | JSON | YAML }
 
  
@@ -224,6 +225,16 @@ ROLLBACK;
 

 
+   
+ALL
+
+ 
+  Include (or exclude) all of the above settings before examining the rest
+  of the options. It defaults to FALSE.
+ 
+
+   
+

 FORMAT
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a6c6de78f1..966f4566c7 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -151,6 +151,26 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt, const char *queryString,
 	bool		summary_set = false;
 
 	/* Parse options list. */
+	/* First, see whether ALL is set */
+	foreach(lc, stmt->options)
+	{
+		DefElem	   *opt = (DefElem *) lfirst(lc);
+
+		if (strcmp(opt->defname, "all") == 0)
+		{
+			es->analyze = defGetBoolean(opt);
+			es->verbose = defGetBoolean(opt);
+			es->costs = defGetBoolean(opt);
+			es->buffers = defGetBoolean(opt);
+			es->settings = defGetBoolean(opt);
+			timing_set = true;
+			es->timing = defGetBoolean(opt);
+			summary_set = true;
+			es->timing = defGetBoolean(opt);
+		}
+	}
+
+	/* If you add another boolean option here, remember to add it above, too */
 	foreach(lc, stmt->options)
 	{
 		DefElem*opt = (DefElem *) lfirst(lc);





copy-past-o comment in lock.h

2019-05-07 Thread John Naylor
Attached is an attempt to match surrounding code. More broadly,
though, it seems the "ID info" comments belong with the SET_LOCKTAG_*
macros rather than with the LockTagType enum members.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


locktag-comment.patch
Description: Binary data


Re: Pluggable Storage - Andres's take

2019-05-07 Thread Rafia Sabih
On Mon, 6 May 2019 at 16:14, Andres Freund  wrote:
>
> Hi,
>
> On May 6, 2019 3:40:55 AM PDT, Rafia Sabih  wrote:
> >On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas 
> >wrote:
> >>
> >> On 08/04/2019 20:37, Andres Freund wrote:
> >> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> >> There's a little bug in index-only scan executor node, where it
> >mixes up the
> >> >> slots to hold a tuple from the index, and from the table. That
> >doesn't cause
> >> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy
> >AM, which
> >> >> uses a virtual slot, it caused warnings like this from index-only
> >scans:
> >> >
> >> > Hm. That's another one that I think I had fixed previously :(, and
> >then
> >> > concluded that it's not actually necessary for some reason. Your
> >fix
> >> > looks correct to me.  Do you want to commit it? Otherwise I'll look
> >at
> >> > it after rebasing zheap, and checking it with that.
> >>
> >> I found another slot type confusion bug, while playing with zedstore.
> >In
> >> an Index Scan, if you have an ORDER BY key that needs to be
> >rechecked,
> >> so that it uses the reorder queue, then it will sometimes use the
> >> reorder queue slot, and sometimes the table AM's slot, for the scan
> >> slot. If they're not of the same type, you get an assertion:
> >>
> >> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> >> "execExprInterp.c", Line: 1905)
> >>
> >> Attached is a test for this, again using the toy table AM, extended
> >to
> >> be able to test this. And a fix.
> >>
> >> >> Attached is a patch with the toy implementation I used to test
> >this. I'm not
> >> >> suggesting we should commit that - although feel free to do that
> >if you
> >> >> think it's useful - but it shows how I bumped into these issues.
> >> >
> >> > Hm, probably not a bad idea to include something like it. It seems
> >like
> >> > we kinda would need non-stub implementation of more functions for
> >it to
> >> > test much / and to serve as an example.  I'm mildy inclined to just
> >do
> >> > it via zheap / externally, but I'm not quite sure that's good
> >enough.
> >>
> >> Works for me.
> >>
> >> >> +static Size
> >> >> +toyam_parallelscan_estimate(Relation rel)
> >> >> +{
> >> >> +ereport(ERROR,
> >> >> +(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> >> + errmsg("function %s not implemented yet",
> >__func__)));
> >> >> +}
> >> >
> >> > The other stubbed functions seem like we should require them, but I
> >> > wonder if we should make the parallel stuff optional?
> >>
> >> Yeah, that would be good. I would assume it to be optional.
> >>
> >I was trying the toyam patch and on make check it failed with
> >segmentation fault at
> >
> >static void
> >toyam_relation_set_new_filenode(Relation rel,
> > char persistence,
> > TransactionId *freezeXid,
> > MultiXactId *minmulti)
> >{
> > *freezeXid = InvalidTransactionId;
> >
> >Basically, on running create table t (i int, j int) using toytable,
> >leads to this segmentation fault.
> >
> >Am I missing something here?
>
> I assume you got compiler warnings compiling it? The API for some callbacks 
> changed a bit.
>
Oh yeah it does.


-- 
Regards,
Rafia Sabih




Re: Pluggable Storage - Andres's take

2019-05-07 Thread Rafia Sabih
On Mon, 6 May 2019 at 22:39, Ashwin Agrawal  wrote:
>
> On Mon, May 6, 2019 at 7:14 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On May 6, 2019 3:40:55 AM PDT, Rafia Sabih  
> > wrote:
> > >I was trying the toyam patch and on make check it failed with
> > >segmentation fault at
> > >
> > >static void
> > >toyam_relation_set_new_filenode(Relation rel,
> > > char persistence,
> > > TransactionId *freezeXid,
> > > MultiXactId *minmulti)
> > >{
> > > *freezeXid = InvalidTransactionId;
> > >
> > >Basically, on running create table t (i int, j int) using toytable,
> > >leads to this segmentation fault.
> > >
> > >Am I missing something here?
> >
> > I assume you got compiler warnings compiling it? The API for some callbacks 
> > changed a bit.
>
> Attached patch gets toy table AM implementation to match latest master API.
> The patch builds on top of patch from Heikki in [1].
> Compiles and works but the test still continues to fail with WARNING
> for issue mentioned in [1]
>
Thanks Ashwin, this works fine with the mentioned warnings of course.

-- 
Regards,
Rafia Sabih




Re: New EXPLAIN option: ALL

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 09:30:47AM +0200, David Fetter wrote:
> Folks,
> 
> It can get a little tedious turning on (or off) all the boolean
> options to EXPLAIN, so please find attached a shortcut.
> 
> Best,
> David.

It helps to have a working patch for this.
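
For illustration, the intended usage with the attached patch looks like this
(the table name is only a placeholder):

EXPLAIN (ALL) SELECT * FROM some_table;

-- ALL is applied first, so individual options can still override it
EXPLAIN (ALL, TIMING off) SELECT * FROM some_table;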

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
From d62669c5b1d22c18949ad5d6ae3f6b23b49819e3 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Tue, 7 May 2019 00:26:29 -0700
Subject: [PATCH v2] Add an ALL option to EXPLAIN


which starts all boolean options at once.

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 385d10411f..ade3793c5f 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -43,6 +43,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] statementboolean ]
 TIMING [ boolean ]
 SUMMARY [ boolean ]
+ALL [ boolean ]
 FORMAT { TEXT | XML | JSON | YAML }
 
  
@@ -224,6 +225,16 @@ ROLLBACK;
 

 
+   
+ALL
+
+ 
+  Include (or exclude) all of the above settings before examining the rest
+  of the options. It defaults to FALSE.
+ 
+
+   
+

 FORMAT
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a6c6de78f1..7ad73e4482 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -151,6 +151,26 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt, const char *queryString,
 	bool		summary_set = false;
 
 	/* Parse options list. */
+	/* First, see whether ALL is set */
+	foreach(lc, stmt->options)
+	{
+		DefElem	   *opt = (DefElem *) lfirst(lc);
+
+		if (strcmp(opt->defname, "all") == 0)
+		{
+			es->analyze = defGetBoolean(opt);
+			es->verbose = defGetBoolean(opt);
+			es->costs = defGetBoolean(opt);
+			es->buffers = defGetBoolean(opt);
+			es->settings = defGetBoolean(opt);
+			timing_set = true;
+			es->timing = defGetBoolean(opt);
+			summary_set = true;
+			es->timing = defGetBoolean(opt);
+		}
+	}
+
+	/* If you add another boolean option here, remember to add it above, too */
 	foreach(lc, stmt->options)
 	{
 		DefElem*opt = (DefElem *) lfirst(lc);
@@ -194,6 +214,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt, const char *queryString,
 opt->defname, p),
 		 parser_errposition(pstate, opt->location)));
 		}
+		else if (strcmp(opt->defname, "all") == 0)
+			; /* Do nothing */
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3dc0e8a4fb..4a69f9711c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10704,6 +10704,7 @@ explain_option_elem:
 explain_option_name:
 			NonReservedWord			{ $$ = $1; }
 			| analyze_keyword		{ $$ = "analyze"; }
+			| ALL	{ $$ = "all"; }
 		;
 
 explain_option_arg:
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bcddc7601e..8177f9c6da 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2862,9 +2862,9 @@ psql_completion(const char *text, int start, int end)
 		 * one word, so the above test is correct.
 		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "BUFFERS",
+			COMPLETE_WITH("ALL", "ANALYZE", "VERBOSE", "COSTS", "BUFFERS",
 		  "TIMING", "SUMMARY", "FORMAT");
-		else if (TailMatches("ANALYZE|VERBOSE|COSTS|BUFFERS|TIMING|SUMMARY"))
+		else if (TailMatches("ALL|ANALYZE|VERBOSE|COSTS|BUFFERS|TIMING|SUMMARY"))
 			COMPLETE_WITH("ON", "OFF");
 		else if (TailMatches("FORMAT"))
 			COMPLETE_WITH("TEXT", "XML", "JSON", "YAML");
@@ -2874,7 +2874,8 @@ psql_completion(const char *text, int start, int end)
 	  "VERBOSE");
 	else if (Matches("EXPLAIN", "(*)") ||
 			 Matches("EXPLAIN", "VERBOSE") ||
-			 Matches("EXPLAIN", "ANALYZE", "VERBOSE"))
+			 Matches("EXPLAIN", "ANALYZE", "VERBOSE") ||
+			 Matches("EXPLAIN", "ALL"))
 		COMPLETE_WITH("SELECT", "INSERT", "DELETE", "UPDATE", "DECLARE");
 
 /* FETCH && MOVE */





Re: Inconsistent error message wording for REINDEX CONCURRENTLY

2019-05-07 Thread Michael Paquier
On Sun, May 05, 2019 at 05:45:53PM -0400, Tom Lane wrote:
> In the other place, checking IsSystemNamespace isn't even
> approximately the correct way to proceed, since it fails to reject
> reindexing system catalogs' toast tables.

Good point.  I overlooked that part.  It is easy enough to have a test
which fails for a catalog index, a catalog table, a toast table on a
system catalog and a toast index on a system catalog.  However I don't
see a way to test directly that a toast relation or index on a
non-catalog relation works as we cannot run REINDEX CONCURRENTLY
within a function, and it is not possible to save the toast relation
name as a psql variable.  Perhaps somebody has a trick?

> After looking around a bit, I propose that we invent
> "IsCatalogRelationOid(Oid reloid)" (not wedded to that name), which
> is a wrapper around IsCatalogClass() that does the needful syscache
> lookup for you.  Aside from this use-case, it could be used in
> sepgsql/dml.c, which I see is also using
> IsSystemNamespace(get_rel_namespace(oid)) for the wrong thing.

Hmmm.  A wrapper on top of IsCatalogClass() implies that we would need
to open the related relation directly in the new function so that it is
possible to grab its pg_class entry.  We could assume that the function
takes a ShareLock all the time, but that's not going to be true most
of the time, and the recent discussions around lock upgrades stress me
a bit; I'd rather not add new race conditions or upgrade hazards.
We could have an extra argument with the lock mode, but we have
nothing in catalog.c of that kind, and that does not feel consistent
with the current interface.  In the end I have made the choice to not
reinvent the world, and just get a Relation from the parent table when
dealing with an index relkind, so that IsCatalogRelation() is used for
the check.

What do you think about the updated patch attached?  I have removed
the tests from reindex_catalog.sql, and added more coverage into
create_index.sql.
--
Michael
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 663407c65d..bcc9a09370 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2729,6 +2729,12 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 /* Open relation to get its indexes */
 heapRelation = table_open(relationOid, ShareUpdateExclusiveLock);
 
+/* A system catalog cannot be reindexed concurrently */
+if (IsCatalogRelation(heapRelation))
+	ereport(ERROR,
+			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			 errmsg("cannot reindex a system catalog concurrently")));
+
 /* Add all the valid indexes of relation to list */
 foreach(lc, RelationGetIndexList(heapRelation))
 {
@@ -2813,18 +2819,18 @@ ReindexRelationConcurrently(Oid relationOid, int options)
 		case RELKIND_INDEX:
 			{
 Oid			heapId = IndexGetRelation(relationOid, false);
+Relation	heapRelation;
 
-/* A shared relation cannot be reindexed concurrently */
-if (IsSharedRelation(heapId))
-	ereport(ERROR,
-			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("concurrent reindex is not supported for shared relations")));
+/* Open relation to check it */
+heapRelation = table_open(heapId, ShareUpdateExclusiveLock);
 
 /* A system catalog cannot be reindexed concurrently */
-if (IsSystemNamespace(get_rel_namespace(heapId)))
+if (IsCatalogRelation(heapRelation))
 	ereport(ERROR,
 			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-			 errmsg("concurrent reindex is not supported for catalog relations")));
+			 errmsg("cannot reindex a system catalog concurrently")));
+
+table_close(heapRelation, NoLock);
 
 /* Save the list of relation OIDs in private context */
 oldcontext = MemoryContextSwitchTo(private_context);
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 326dc44177..7447a1d196 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -2092,10 +2092,15 @@ BEGIN;
 REINDEX TABLE CONCURRENTLY concur_reindex_tab;
 ERROR:  REINDEX CONCURRENTLY cannot run inside a transaction block
 COMMIT;
-REINDEX TABLE CONCURRENTLY pg_database; -- no shared relation
-ERROR:  concurrent index creation on system catalog tables is not supported
-REINDEX TABLE CONCURRENTLY pg_class; -- no catalog relations
-ERROR:  concurrent index creation on system catalog tables is not supported
+REINDEX TABLE CONCURRENTLY pg_class; -- no catalog relation
+ERROR:  cannot reindex a system catalog concurrently
+REINDEX INDEX CONCURRENTLY pg_class_oid_index; -- no catalog index
+ERROR:  cannot reindex a system catalog concurrently
+-- These are the toast table and index from pg_authid
+REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_1260; -- no toast table
+ERROR:  cannot reindex a system catalog concurrently
REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_1260_index; -- no toast index

Re: copy-past-o comment in lock.h

2019-05-07 Thread Michael Paquier
On Tue, May 07, 2019 at 03:41:50PM +0800, John Naylor wrote:
> Attached is an attempt to match surrounding code. More broadly,
> though, it seems the "ID info" comments belong with the SET_LOCKTAG_*
> macros rather than with the LockTagType enum members.

+   LOCKTAG_SPECULATIVE_TOKEN,  /* for speculative insertion */
+   /* ID info for a speculative token is TRANSACTION info + token */
Shouldn't the first comment be just "speculative insertion"?  And the
second one "ID info for a speculative insertion is transaction ID +
its speculative insert counter"?
--
Michael




Re: New vacuum option to do only freezing

2019-05-07 Thread Masahiko Sawada
On Fri, May 3, 2019 at 12:09 AM Robert Haas  wrote:
>
> On Tue, Apr 16, 2019 at 4:00 PM Tom Lane  wrote:
> > I wrote:
> > > I'm thinking that we really need to upgrade vacuum's reporting totals
> > > so that it accounts in some more-honest way for pre-existing dead
> > > line pointers.  The patch as it stands has made the reporting even more
> > > confusing, rather than less so.
> >
> > Here's a couple of ideas about that:
> >
> > 1. Ignore heap_page_prune's activity altogether, on the grounds that
> > it's just random chance that any cleanup done there was done during
> > VACUUM and not some preceding DML operation.  Make tups_vacuumed
> > be the count of dead line pointers removed.  The advantage of this
> > way is that tups_vacuumed would become independent of previous
> > non-VACUUM pruning activity, as well as preceding index-cleanup-disabled
> > VACUUMs.  But maybe it's hiding too much information.
> >
> > 2. Have heap_page_prune count, and add to tups_vacuumed, only HOT
> > tuples that it deleted entirely.  The action of replacing a DEAD
> > root tuple with a dead line pointer doesn't count for anything.
> > Then also add the count of dead line pointers removed to tups_vacuumed.
> >
> > Approach #2 is closer to the way we've defined tups_vacuumed up to
> > now, but I think it'd be more realistic in cases where previous
> > pruning or index-cleanup-disabled vacuums have left lots of dead
> > line pointers.
> >
> > I'm not especially wedded to either of these --- any other ideas?
>
> I think it's almost impossible to have clear reporting here with only
> a single counter.  There are two clearly-distinct cleanup operations
> going on here: (1) removing tuples from pages, and (2) making dead
> line pointers unused so that they can be reused for new tuples.  They
> happen in equal quantity when there are no HOT updates: pruning makes
> dead tuples into dead line pointers, and then index vacuuming allows
> the dead line pointers to be set unused.  But if there are HOT
> updates, intermediate tuples in each HOT chain are removed from the
> page but the line pointers are directly set to unused, so VACUUM could
> remove a lot of tuples but not need to make very many dead line
> pointers unused.  On the other hand, the opposite could also happen:
> maybe lots of single-page HOT-pruning has happened prior to VACUUM, so
> VACUUM has lots of dead line pointers to make unused but removes very
> few tuples because that's already been done.
>
> For the most part, tups_vacuumed seems to be intending to count #1
> rather than #2. While the comments for heap_page_prune and
> heap_prune_chain are not as clear as might be desirable, it appears to
> me that those functions are counting tuples removed from a page,
> ignoring everything that might happen to line pointers -- so using the
> return value of this function is consistent with wanting to count #1.
> However, there's one place that seems slightly unclear about this,
> namely this comment:
>
> /*
>  * DEAD item pointers are to be vacuumed normally; but we don't
>  * count them in tups_vacuumed, else we'd be double-counting (at
>  * least in the common case where heap_page_prune() just freed up
>  * a non-HOT tuple).
>  */
>
> If we're counting tuples removed from pages, then it's not merely that
> we would be double-counting, but that we would be counting completely
> the wrong thing.  However, as far as I can see, that's just an issue
> with the comments; the code is in all cases counting tuple removals,
> not dead line pointers marked unused.
>
> If I understand correctly, your first proposal amounts to redefining
> tups_vacuumed to count #2 rather than #1, and your second proposal
> amounts to making tups_vacuumed count #1 + #2 rather than #1.  I
> suggest a third option: have two counters.  tups_vacuum can continue
> to count #1, with just a comment adjustment. And then we can add a
> second counter which is incremented every time lazy_vacuum_page does
> ItemIdSetUnused, which will count #2.
>
> Thoughts?

I agree with having another counter. Please note that non-HOT-updated
tuples that became dead after HOT pruning (that is, 'tupgone' tuples)
are set to unused directly in lazy_vacuum_page. Therefore we would not
need to increment tups_vacuumed in the tupgone case if we increment
the new counter in lazy_vacuum_page.

As for the contents of the VACUUM VERBOSE report, it could be worth
discussing whether reporting the number of deleted line pointers would be
helpful for users. Reporting the number of line pointers we left behind
might be even more helpful for users.
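
For reference, a minimal usage sketch of the option this reporting applies
to, assuming the option name as committed for v12 (the table name is a
placeholder):

VACUUM (INDEX_CLEANUP false, VERBOSE) some_table;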

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

2019-05-07 Thread Amit Langote
On 2019/04/27 3:57, Tom Lane wrote:
> Alvaro Herrera  writes:
>> Um, this one doesn't apply because of yesterday's 87259588d0ab.
> 
> Before we spend too much time on minutiae, we should ask ourselves whether
> this patch is even going in the right direction.  I'm not sure.
> 
> One point is that if we simply adopt the old index as-is, we won't see
> any updates in its metadata.  An example is that if we have an index
> on a varchar(10) column, and we alter the column to varchar(12),
> the current behavior is to generate a new index that agrees with that:
> 
> regression=# create table pp(f1 varchar(10) unique);
> CREATE TABLE
> regression=# \d pp_f1_key
>   Index "public.pp_f1_key"
>  Column | Type  | Key? | Definition 
> +---+--+
>  f1 | character varying(10) | yes  | f1
> unique, btree, for table "public.pp"
> 
> regression=# alter table pp alter column f1 type varchar(12);
> ALTER TABLE
> regression=# \d pp_f1_key
>   Index "public.pp_f1_key"
>  Column | Type  | Key? | Definition 
> +---+--+
>  f1 | character varying(12) | yes  | f1
> unique, btree, for table "public.pp"
> 
> With this patch, I believe, the index column will still claim to be
> varchar(10).

You're right, that's what happens.

> Is that OK?  It might not actually break anything
> right now, but at the very least it's likely to be confusing.
> Also, it'd essentially render the declared types/typmods of index
> columns untrustworthy, which seems like something that would come
> back to bite us.

That's definitely misleading.

Still, I think it'd be nice if we didn't have to do full-blown
DefineIndex() in this case if only to update the pg_attribute tuples of
the index relation.  Maybe we could update them directly in the ALTER
COLUMN TYPE's code path?
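
For reference, the stale index-column metadata under discussion can be seen
directly in pg_attribute (using Tom's pp_f1_key example from above):

select attname, atttypid::regtype, atttypmod
from pg_attribute
where attrelid = 'pp_f1_key'::regclass;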

Thanks,
Amit





Re: New EXPLAIN option: ALL

2019-05-07 Thread Sergei Kornilov
Hi

I liked this idea.

+   timing_set = true;
+   es->timing = defGetBoolean(opt);
+   summary_set = true;
+   es->timing = defGetBoolean(opt);

second es->timing should be es->summary, right?

regards, Sergei




Re: copy-past-o comment in lock.h

2019-05-07 Thread John Naylor
On Tue, May 7, 2019 at 4:00 PM Michael Paquier  wrote:
>
> On Tue, May 07, 2019 at 03:41:50PM +0800, John Naylor wrote:
> > Attached is an attempt to match surrounding code. More broadly,
> > though, it seems the "ID info" comments belong with the SET_LOCKTAG_*
> > macros rather than with the LockTagType enum members.
>
> +   LOCKTAG_SPECULATIVE_TOKEN,  /* for speculative insertion */
> +   /* ID info for a speculative token is TRANSACTION info + token */
> Shouldn't the first comment be just "speculative insertion"?

That's probably better.

> And the
> second one "ID info for a speculative insertion is transaction ID +
> its speculative insert counter"?

I was just going by the variable name at hand, but more precision may be good.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-05-07 Thread Masahiko Sawada
On Mon, Apr 8, 2019 at 7:29 PM Masahiko Sawada  wrote:
>
> On Mon, Apr 8, 2019 at 7:22 PM Fujii Masao  wrote:
> >
> > On Mon, Apr 8, 2019 at 5:30 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Apr 8, 2019 at 5:15 PM Fujii Masao  wrote:
> > > >
> > > > On Mon, Apr 8, 2019 at 3:58 PM Julien Rouhaud  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 8, 2019 at 8:01 AM Fujii Masao  
> > > > > wrote:
> > > > > >
> > > > > > 2019年4月8日(月) 14:22 Tsunakawa, Takayuki 
> > > > > > :
> > > > > >>
> > > > > >> From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> > > > > >> > "vacuum_truncate" gets my vote too.
> > > > > >>
> > > > > >> +1
> > > > > >
> > > > > >
> > > > > > +1
> > > > > > ISTM that we have small consensus to
> > > > > > use "vacuum_truncate".
> > > > >
> > > > > I'm also fine with this name, and I also saw reports that this option
> > > > > is already needed in some production workload, as Andres explained.
> > > >
> > > > OK, so I pushed the "vacuum_truncate" version of the patch.
> > >
> > > Thank you!
> > >
> > > "TRUNCATE" option for vacuum command should be added to the open items?
> >
> > Yes, I think.
>
> Added.
>
> > Attached is the patch which adds TRUNCATE option into VACUUM.
>
> Thank you for the patch! I will review it.
>

Sorry for the delay. I've reviewed the patch and it looks good to me.
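
For reference, a minimal sketch of how the committed reloption and the
command-level option under review are used (the table name is a placeholder):

ALTER TABLE some_table SET (vacuum_truncate = off);  -- reloption already committed
VACUUM (TRUNCATE off) some_table;                    -- option added by the attached patch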

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: set relispartition when attaching child index

2019-05-07 Thread Amit Langote
On 2019/04/26 23:12, Alvaro Herrera wrote:
> On 2019-Apr-25, Amit Langote wrote:
> 
>> BTW, this will need to be back-patched to 11.
> 
> Done, thanks for the patch.  I added the test in master, but obviously
> it doesn't work in pg11, so I just verified manually that relispartition
> is set correctly.

Thank you.

> I don't think it's worth doing more, though there are
> other things that are affected by a bogus relispartition marking for an
> index (example: creating the index in the last partition that didn't
> have it, should mark the index on parent valid; I think that would fail
> to propagate to upper levels correctly.)

Hmm, I couldn't see any misbehavior for this example:

create table p (a int, b int) partition by list (a);
create table p1 partition of p for values in (1) partition by list (b);
create table p11 partition of p1 for values in (1);
create index on only p (a);
create index on only p1 (a);
alter index p_a_idx attach partition p1_a_idx ;

select relname, relispartition from pg_class where relname like 'p%idx';
 relname  │ relispartition
──┼
 p_a_idx  │ f
 p1_a_idx │ t
(2 rows)


\d p
 Table "public.p"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │  │
 b  │ integer │   │  │
Partition key: LIST (a)
Indexes:
"p_a_idx" btree (a) INVALID
Number of partitions: 1 (Use \d+ to list them.)


create index on p11 (a);
alter index p1_a_idx attach partition p11_a_idx ;
select relname, relispartition from pg_class where relname like 'p%idx';
  relname  │ relispartition
───┼
 p_a_idx   │ f
 p1_a_idx  │ t
 p11_a_idx │ t
(3 rows)

\d p
 Table "public.p"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │  │
 b  │ integer │   │  │
Partition key: LIST (a)
Indexes:
"p_a_idx" btree (a)
Number of partitions: 1 (Use \d+ to list them.)

Maybe, because the code path we fixed has nothing to do with this case?

Thanks,
Amit





Re: New EXPLAIN option: ALL

2019-05-07 Thread Rafia Sabih
On Tue, 7 May 2019 at 09:30, David Fetter  wrote:
>
> Folks,
>
> It can get a little tedious turning on (or off) all the boolean
> options to EXPLAIN, so please find attached a shortcut.
>

I don't understand this; would you mind explaining a bit, maybe with an
example of how you want it to work?
-- 
Regards,
Rafia Sabih




Statistical aggregate functions are not working with PARTIAL aggregation

2019-05-07 Thread Rajkumar Raghuwanshi
Hi,
As this issue is reproducible without partition-wise aggregate also,
changing email subject from "Statistical aggregate functions are not
working with partitionwise aggregate " to "Statistical aggregate functions
are not working with PARTIAL aggregation".

original reported test case and discussion can be found at below link.
https://www.postgresql.org/message-id/flat/CAKcux6%3DuZEyWyLw0N7HtR9OBc-sWEFeByEZC7t-KDf15FKxVew%40mail.gmail.com

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


On Fri, May 3, 2019 at 5:26 PM Jeevan Chalke 
wrote:

>
>
> On Fri, May 3, 2019 at 2:56 PM Rajkumar Raghuwanshi <
> rajkumar.raghuwan...@enterprisedb.com> wrote:
>
>> Hi,
>>
>> On PG-head, Some of statistical aggregate function are not giving correct
>> output when enable partitionwise aggregate while same is working on v11.
>>
>
> I had a quick look over this and observed that something is broken with
> PARTIAL aggregation.
>
> I can reproduce the same issue with a larger dataset, which results in a
> parallel scan.
>
> CREATE TABLE tbl1(a int2,b float4) partition by range(a);
> create table tbl1_p1 partition of tbl1 for values from (minvalue) to (0);
> create table tbl1_p2 partition of tbl1 for values from (0) to (maxvalue);
> insert into tbl1 select i%2, i from generate_series(1, 100) i;
>
> # SELECT regr_count(b, a) FROM tbl1;
>  regr_count
> 
>   0
> (1 row)
>
> postgres:5432 [120536]=# explain SELECT regr_count(b, a) FROM tbl1;
>QUERY
> PLAN
>
> 
>  Finalize Aggregate  (cost=15418.08..15418.09 rows=1 width=8)
>->  Gather  (cost=15417.87..15418.08 rows=2 width=8)
>  Workers Planned: 2
>  ->  Partial Aggregate  (cost=14417.87..14417.88 rows=1 width=8)
>->  Parallel Append  (cost=0.00..11091.62 rows=443500
> width=6)
>  ->  Parallel Seq Scan on tbl1_p2  (cost=0.00..8850.00
> rows=442500 width=6)
>  ->  Parallel Seq Scan on tbl1_p1  (cost=0.00..24.12
> rows=1412 width=6)
> (7 rows)
>
> postgres:5432 [120536]=# set max_parallel_workers_per_gather to 0;
> SET
> postgres:5432 [120536]=# SELECT regr_count(b, a) FROM tbl1;
>  regr_count
> 
> 100
> (1 row)
>
> After looking further, it seems that it got broken by following commit:
>
> commit a9c35cf85ca1ff72f16f0f10d7ddee6e582b62b8
> Author: Andres Freund 
> Date:   Sat Jan 26 14:17:52 2019 -0800
>
> Change function call information to be variable length.
>
>
> This commit is too big to understand quickly, and thus I could not get to
> the exact cause.
>
> Thanks
>
>
>> below are some of examples.
>>
>> CREATE TABLE tbl(a int2,b float4) partition by range(a);
>> create table tbl_p1 partition of tbl for values from (minvalue) to (0);
>> create table tbl_p2 partition of tbl for values from (0) to (maxvalue);
>> insert into tbl values (-1,-1),(0,0),(1,1),(2,2);
>>
>> --when partitionwise aggregate is off
>> postgres=# SELECT regr_count(b, a) FROM tbl;
>>  regr_count
>> 
>>   4
>> (1 row)
>> postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
>>  regr_avgx | regr_avgy
>> ---+---
>>0.5 |   0.5
>> (1 row)
>> postgres=# SELECT corr(b, a) FROM tbl;
>>  corr
>> --
>> 1
>> (1 row)
>>
>> --when partitionwise aggregate is on
>> postgres=# SET enable_partitionwise_aggregate = true;
>> SET
>> postgres=# SELECT regr_count(b, a) FROM tbl;
>>  regr_count
>> 
>>   0
>> (1 row)
>> postgres=# SELECT regr_avgx(b, a), regr_avgy(b, a) FROM tbl;
>>  regr_avgx | regr_avgy
>> ---+---
>>|
>> (1 row)
>> postgres=# SELECT corr(b, a) FROM tbl;
>>  corr
>> --
>>
>> (1 row)
>>
>> Thanks & Regards,
>> Rajkumar Raghuwanshi
>> QMG, EnterpriseDB Corporation
>>
>
>
> --
> Jeevan Chalke
> Technical Architect, Product Development
> EnterpriseDB Corporation
> The Enterprise PostgreSQL Company
>
>


Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-05-07 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 7 May 2019 14:39:55 +0530, Rajkumar Raghuwanshi 
 wrote in 

> Hi,
> As this issue is reproducible without partition-wise aggregate also,
> changing email subject from "Statistical aggregate functions are not
> working with partitionwise aggregate " to "Statistical aggregate functions
> are not working with PARTIAL aggregation".
> 
> original reported test case and discussion can be found at below link.
> https://www.postgresql.org/message-id/flat/CAKcux6%3DuZEyWyLw0N7HtR9OBc-sWEFeByEZC7t-KDf15FKxVew%40mail.gmail.com

The immediate reason for the behavior seems to be that
EEOP_AGG_STRICT_INPUT_CHECK_ARGS regards the non-existent second
argument as null.

In the invalid deserialfn_oid case in ExecBuildAggTrans, args[1] is
initialized using the second argument of the function (int8pl() in
this case), so the correct numTransInputs here is 1, not 2.

The attached patch makes at least the test case work correctly, and
this seems to be the only instance of this issue.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 0fb31f5c3d..9f83dbe51e 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -3009,8 +3009,9 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			{
 /*
  * Start from 1, since the 0th arg will be the transition
- * value
+ * value. Exclude it from numTransInputs.
  */
+pertrans->numTransInputs--;
 ExecInitExprRec(source_tle->expr, state,
 &trans_fcinfo->args[argno + 1].value,
 &trans_fcinfo->args[argno + 1].isnull);


Typos and wording in jsonpath-exec.c

2019-05-07 Thread Daniel Gustafsson
Spotted two minor typos when skimming through the code, and a sentence about
the return value which seemed a bit odd since executeJsonPath() can exit via
ereport().  The attached diff fixes the typos and suggests a new wording.

cheers ./daniel



typo-jsonpath_exec.diff
Description: Binary data


RE: SQL statement PREPARE does not work in ECPG

2019-05-07 Thread Matsumura, Ryo
Hi Meskes-san

There are two points.

(1)
I attach a new patch. Please review it.

  - The preprocessor replaces any prepared_name with "$0" and changes it to
    an input variable for PREPARE with a type list and EXECUTE with a
    parameter list.  $0 is replaced in ecpg_build_params().  This makes it
    possible to leave the ECPGdo interface unchanged.
  - Host variables can be used in the parameter list of EXECUTE (a minimal
    sketch of the target statements follows below).
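
A minimal sketch of the SQL-level statements being targeted (statement name,
types, and values are placeholders):

PREPARE st (int, text) AS SELECT $1, $2;
EXECUTE st (1, 'foo');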

(2)
I found some bugs (two types). I didn't fix them and only avoid bison error.

Type1. Bugs or intentional unsupported features.
  - EXPLAIN EXECUTE
  - CREATE TABLE AS with using clause

  e.g.
EXPLAIN EXECUTE st;  /* It has not been supported before.*/

 ExplainableStmt:
 ExecuteStmt
 {
  - $$ = $1;
  + $$ = $1.name;   /* only a workaround for the bison error */
 }

Type2. In a multi-byte encoding environment, part of a character in the
message can be cut off.

  It may be very difficult to fix. I will pretend I didn't see it for a while.

  [ecpglib/error.c]
snprintf(sqlca->sqlerrm.sqlerrmc, sizeof(sqlca->sqlerrm.sqlerrmc), "%s on 
line %d", message, line);
sqlca->sqlerrm.sqlerrml = strlen(sqlca->sqlerrm.sqlerrmc);
ecpg_log("raising sqlstate %.*s (sqlcode %ld): %s\n",
 (int) sizeof(sqlca->sqlstate), sqlca->sqlstate, sqlca->sqlcode, 
sqlca->sqlerrm.sqlerrmc);

Regards
Ryo Matsumura


ecpg_prepare_as_v1_2.patch
Description: ecpg_prepare_as_v1_2.patch


Re: accounting for memory used for BufFile during hash joins

2019-05-07 Thread Tomas Vondra

On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

Do we actually check how many duplicates are there during planning?


Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.



I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin. That's not quite what I
had in mind - I was thinking more about something along the lines of "Find
the largest group of duplicate values, and disable hash join if it can't
fit into work_mem at all."

Of course, if the input estimates are off, that may not work too well. It
would certainly not have helped the query failing with OOM, because that
was a case of severe underestimation.

Or did you mean some other piece of code that I have missed?
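
For reference, the per-column duplicate information the planner has available
can be inspected like this (table and column names are placeholders):

select most_common_vals, most_common_freqs, n_distinct
from pg_stats
where tablename = 'some_table' and attname = 'some_column';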


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [PATCH v1] Add a way to supply stdin to TAP tests

2019-05-07 Thread Andrew Dunstan


On 5/6/19 10:42 PM, David Fetter wrote:
> On Tue, May 07, 2019 at 11:05:32AM +0900, Kyotaro HORIGUCHI wrote:
>> Hi.
>>
>> At Sun, 28 Apr 2019 17:07:16 +0200, David Fetter  wrote in 
>> <20190428150716.gp28...@fetter.org>
>>> Our test coverage needs all the help it can get.
>>>
>>> This patch, extracted from another by Fabian Coelho, helps move things
>>> in that direction.
>>>
>>> I'd like to argue that it's not a new feature, and that it should be
>>> back-patched as far as possible.
>> The comment for the parameter "in".
>>
>> +# - in: standard input
>>
>> Perhaps this is "string to be fed to standard input". This also
>> can be a I/O reference but we don't care that?
> OK
>
>> +$in = '' if not defined $in;
>>
>> run($cmd, '<', \undef) seems to work, maybe assuming "<
>> /dev/null", which might be better?
> Is /dev/null a thing on Windows?



Not as such, although there is NUL (see src/include/port.h).


However, I don't think we should be faking anything here. I think it
would be better to avoid setting $in if not supplied and then have this:


if (defined($in))
{
    IPC::Run::run($cmd, '<', \$in, '>', \$stdout, '2>', \$stderr);
}
else
{
    IPC::Run::run($cmd, '>', \$stdout, '2>', \$stderr);
}



cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-07 Thread Tom Lane
Amit Kapila  writes:
> On Mon, May 6, 2019 at 8:57 PM Andres Freund  wrote:
>> On 2019-05-06 11:10:15 -0400, Robert Haas wrote:
>>> I think it's legitimate to question whether sending additional
>>> invalidation messages as part of the design of this feature is a good
>>> idea.

>> I don't think it's an actual problem. We'd only do so when creating an
>> FSM, or when freeing up additional space that'd otherwise not be visible
>> to other backends.

> The other place we need to consider for this is when one of the
> backends updates its map (due to unavailability of space in the
> existing set of pages).  We can choose not to send invalidation in
> this case, but then different backends need to identify the same thing
> themselves and reconstruct the map again.

I'm inclined to wonder why bother with invals at all.  The odds are
quite good that no other backend will care (which, I imagine, is the
reasoning behind why the original patch was designed like it was).
A table that has a lot of concurrent write activity on it is unlikely
to stay small enough to not have a FSM for long.

The approach I'm imagining here is not too different from Robert's
"just search the table's pages every time" straw-man.  Backends would
cache the results of their own searches, but not communicate about it.

regards, tom lane




Re: PG12, PGXS and linking pgfeutils

2019-05-07 Thread Tom Lane
Ian Barwick  writes:
> Commit cc8d4151 [*] introduced a dependency between some functions in
> libpgcommon and libpgfeutils,

This seems rather seriously broken.  I do not think the answer is to
create a global dependency on libpgfeutils.

regards, tom lane




Re: accounting for memory used for BufFile during hash joins

2019-05-07 Thread Tomas Vondra

On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:

On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
 wrote:

On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
>Seems expensive for large numbers of slices -- you need to join the
>outer batch against each inner slice.

Nope, that's not how it works. It's the array of batches that gets
sliced, not the batches themselves.


Sorry, I read only the description and not the code, and got confused
about that.  So, I see three separate but related problems:

A.  Broken escape valve:  sometimes we generate a huge number of
batches while trying to split up many duplicates, because of the
presence of other more uniformly distributed keys.  We could fix that
with (say) a 95% rule.
B.  Lack of good alternative execution strategy when the escape valve
is triggered.  A batch cannot be split effectively, but cannot fit in
work_mem, so for now we decide to ignore work_mem.
C.  Unmetered explosion of batches and thus BufFiles, probably usually
caused by problem A, but theoretically also due to a real need for
partitions.



Right. I don't think a single solution addressing all those issues exists.
It's more likely we need multiple improvements.


>But I wonder how we'd deal with outer joins, as Tom Lane asked in
>another thread:
>
>https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us

That seems unrelated - we slice the array of batches, to keep memory
needed for BufFile under control. The hash table remains intact, so
there's no issue with outer joins.


Right, sorry, my confusion.  I thought you were describing
https://en.wikipedia.org/wiki/Block_nested_loop.  (I actually think we
can make that work for left outer joins without too much fuss by
writing out a stream of match bits to a new temporary file.  Googling,
I see that MySQL originally didn't support BNL for outer joins and
then added some match flag propagation thing recently.)



Possibly; I'm not against implementing that, although I don't have a very
good idea what the benefits of BNL joins are (performance-wise). In any
case, I think it's entirely unrelated to hash joins.


I agree we should relax the 0%/100% split condition, and disable the
growth sooner. But I think we should also re-evaluate that decision
after a while - the data set may be correlated in some way, in which
case we may disable the growth prematurely. It may not reduce memory
usage now, but it may help in the future.

It's already an issue, but it would be even more likely if we disabled
growth e.g. with just 5%/95% splits.

FWIW I believe this is mostly orthogonal issue to what's discussed in
this thread.


But isn't problem A the root cause of problem C, in most cases?  There
must also be "genuine" cases of problem C that would occur even if we
fix that, of course: someone has small work_mem, and data that can be
effectively partitioned to fit it, but it just takes a huge number of
partitions to do it.  So that we don't behave badly in those cases, I
agree with you 100%: we should fix the memory accounting to count
BufFile overheads as you are proposing, and then I guess ideally
switch to our alternative strategy (BNL or sort-merge or ...) when we
see that BufFiles are wasting to much work_mem and its time to try
something else.  It seems you don't actually have one of those cases
here, though?



Maybe. Or maybe not. I don't have enough data to make such judgements
about the causes in general. We have one query from pgsql-performance.
There might be more, but IMO that's probably a biased data set.

But even that reported query actually is not a case where A causes C.
The outer side of the hash join was significantly underestimated (34619
vs. 113478127) due to highly-correlated conditions.

And in that case it's trivial to cause nbatch explosion even with perfect
data sets with no duplicates (so no escape valve failure).



I think we should fix problem A.  Then handle problem C by accounting
for BufFiles, and figure out a way to switch to our alternative
strategy (currently: ignore work_mem), when we think that creating
more BufFiles will be futile (not sure exactly what the rule there
should be).  And then work on fixing B properly with a good strategy.
Here's a straw-man idea: we could adopt BNL, and then entirely remove
our repartitioning code.  If the planner's number of partitions turns
out to be not enough, we'll just handle it using BNL loops.



Yeah, something like that.

I think we can fix A by relaxing the escape valve condition, and then
rechecking it once in a while. So we fill work_mem, realize it didn't
actually reduce the batch size significantly and disable nbatch growth.
But at the same time we increase the threshold to 2x work_mem, and after
reaching it we "consider" a nbatch increase.  That is, we walk the batch
and see how many tuples would move if we increased nbatch (that should be
fairly cheap) - if it helps, great, enable growth and split the batch. If
not, double the threshold again.

Re: accounting for memory used for BufFile during hash joins

2019-05-07 Thread Tom Lane
Tomas Vondra  writes:
> On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:
>> Tomas Vondra  writes:
>>> Do we actually check how many duplicates are there during planning?

>> Certainly that's part of the planner's cost estimates ... but it's
>> only as good as the planner's statistical knowledge.

> I'm looking at the code, and the only place where I see code dealing with
> MCVs (probably the best place for info about duplicate values) is
> estimate_hash_bucketsize in final_cost_hashjoin.

What I'm thinking of is this bit in final_cost_hashjoin:

/*
 * If the bucket holding the inner MCV would exceed work_mem, we don't
 * want to hash unless there is really no other alternative, so apply
 * disable_cost.  (The executor normally copes with excessive memory usage
 * by splitting batches, but obviously it cannot separate equal values
 * that way, so it will be unable to drive the batch size below work_mem
 * when this is true.)
 */
if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
   inner_path->pathtarget->width) >
(work_mem * 1024L))
startup_cost += disable_cost;

It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.

regards, tom lane




Re: jsonpath

2019-05-07 Thread Tom Lane
Alexander Korotkov  writes:
> Attached patchset contains revised commit messages.  I'm going to
> commit this if there are no objections.

Sorry for slow response --- I was tied up with release preparations.

The -5 patches look pretty good.  A couple of nits:

@@ -774,9 +749,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, JsonPathItem *jsp,
 {
 RETURN_ERROR(ereport(ERROR,
  (errcode(ERRCODE_JSON_ARRAY_NOT_FOUND),
-  errmsg(ERRMSG_JSON_ARRAY_NOT_FOUND),
-  errdetail("jsonpath array accessor can "
-"only be applied to an array";
+  errdetail("jsonpath array accessor can only be applied to an array";
 }
 break;

I think you forgot to s/errdetail/errmsg/ in this one.  Likewise at:

+  errdetail("jsonpath wildcard member 
accessor can only be applied to an object";


Also, I saw two places where you missed removing a trailing period
from an errmsg:

+  errmsg("left operand of jsonpath operator %s is 
not a single numeric value.",
+ jspOperationName(jsp->type);

+  errmsg("right operand of jsonpath operator %s is 
not a single numeric value.",
+ jspOperationName(jsp->type);


I'd suggest making a quick pass for other instances of the same mistakes,
just in case.  I'm good with the wording on everything now.

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-07 Thread Tom Lane
Andres Freund  writes:
> I for sure thought I earlier had an idea that'd actually work. But
> either I've lost it, or it didn't actually work. But perhaps somebody
> else can come up with something based on the above strawman ideas?

Both of those ideas fail if an autovacuum starts up after you're
done looking.  I still think the only way you could make this
reliable enough for the buildfarm is to do it in a TAP test that's
set up a cluster with autovacuum disabled.  Whether it's worth the
cycles to do so is pretty unclear, since that wouldn't be a terribly
real-world test environment.  (I also wonder whether the existing
TAP tests for reindexdb don't provide largely the same coverage.)

My advice is to let it go until we have time to work on getting rid
of the deadlock issues.  If we're successful at that, it might be
possible to re-enable these tests in the regular regression environment.

regards, tom lane




Re: make \d pg_toast.foo show its indices

2019-05-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Rafia Sabih  writes:
> > On Fri, 3 May 2019 at 16:27, Justin Pryzby  wrote:
> >> Thanks - what about also showing the associated non-toast table ?
> 
> > IMHO, what makes more sense is to show the name of associated toast
> > table in the \dt+ of the normal table.
> 
> I'm not for that: it's useless information in at least 99.44% of cases.

I don't think I'd put it in \dt+, but the toast table is still
pg_toast.pg_toast_{relOid}, right?  What about showing the OID of the
table in the \d output, eg:

=> \d comments
   Table "public.comments" (50788)
 Column  |  Type   | Collation | Nullable |   Default

etc?

> Possibly it is useful in the other direction as Justin suggests.
> Not sure though --- generally, if you're looking at a specific
> toast table, you already know which table is its parent.  But
> maybe confirmation is a good thing.

As mentioned elsewhere, there are certainly times when you don't know
that info and if you're looking at the definition of a TOAST table,
which isn't terribly complex, it seems like a good idea to go ahead and
include the table it's the TOAST table for.
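
For reference, the reverse lookup that \d would save is currently a manual
catalog query, something like this (the toast table name is just an example):

select oid::regclass as parent
from pg_class
where reltoastrelid = 'pg_toast.pg_toast_16384'::regclass;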

> That seems off-topic for this thread though.  I agree with the
> stated premise that \d on a toast table should show all the same
> information \d on a regular table would.

+1

Thanks!

Stephen




Re: accounting for memory used for BufFile during hash joins

2019-05-07 Thread Tomas Vondra

On Tue, May 07, 2019 at 10:42:36AM -0400, Tom Lane wrote:

Tomas Vondra  writes:

On Mon, May 06, 2019 at 11:18:28PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

Do we actually check how many duplicates are there during planning?



Certainly that's part of the planner's cost estimates ... but it's
only as good as the planner's statistical knowledge.



I'm looking at the code, and the only place where I see code dealing with
MCVs (probably the best place for info about duplicate values) is
estimate_hash_bucketsize in final_cost_hashjoin.


What I'm thinking of is this bit in final_cost_hashjoin:

   /*
* If the bucket holding the inner MCV would exceed work_mem, we don't
* want to hash unless there is really no other alternative, so apply
* disable_cost.  (The executor normally copes with excessive memory usage
* by splitting batches, but obviously it cannot separate equal values
* that way, so it will be unable to drive the batch size below work_mem
* when this is true.)
*/
   if (relation_byte_size(clamp_row_est(inner_path_rows * innermcvfreq),
  inner_path->pathtarget->width) >
   (work_mem * 1024L))
   startup_cost += disable_cost;

It's certainly likely that that logic needs improvement in view of this
discussion --- I was just pushing back on the claim that we weren't
considering the issue at all.



Ah, this code is new in 11, and I was looking at code from 10 for some
reason. I don't think we can do much better than this, except perhaps
falling back to (1/ndistinct) when there's no MCV available.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: make \d pg_toast.foo show its indices

2019-05-07 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> Rafia Sabih  writes:
>>> IMHO, what makes more sense is to show the name of associated toast
>>> table in the \dt+ of the normal table.

>> I'm not for that: it's useless information in at least 99.44% of cases.

> I don't think I'd put it in \dt+, but the toast table is still
> pg_toast.pg_toast_{relOid}, right?  What about showing the OID of the
> table in the \d output, eg:
> => \d comments
>Table "public.comments" (50788)

Not unless you want to break every regression test that uses \d.
Instability of the output is also a reason not to show the
toast table's name in the parent's \d[+].

>> Possibly it is useful in the other direction as Justin suggests.
>> Not sure though --- generally, if you're looking at a specific
>> toast table, you already know which table is its parent.  But
>> maybe confirmation is a good thing.

> As mentioned elsewhere, there are certainly times when you don't know
> that info and if you're looking at the definition of a TOAST table,
> which isn't terribly complex, it seems like a good idea to go ahead and
> include the table it's the TOAST table for.

I'm not against putting that info into the result of \d on the toast
table.

regards, tom lane




Re: make \d pg_toast.foo show its indices

2019-05-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> Rafia Sabih  writes:
> >>> IMHO, what makes more sense is to show the name of associated toast
> >>> table in the \dt+ of the normal table.
> 
> >> I'm not for that: it's useless information in at least 99.44% of cases.
> 
> > I don't think I'd put it in \dt+, but the toast table is still
> > pg_toast.pg_toast_{relOid}, right?  What about showing the OID of the
> > table in the \d output, eg:
> > => \d comments
> >Table "public.comments" (50788)
> 
> Not unless you want to break every regression test that uses \d.
> Instability of the output is also a reason not to show the
> toast table's name in the parent's \d[+].

So we need a way to turn it off.  That doesn't seem like it'd be hard to
implement and the information is certainly quite useful.
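
(For reference, the manual lookup this would save is roughly the following;
the TOAST table name here is hypothetical:)

  -- given a TOAST table, find the ordinary table it belongs to
  SELECT p.oid, p.relname
  FROM pg_class p
  WHERE p.reltoastrelid = 'pg_toast.pg_toast_50788'::regclass;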

Thanks,

Stephen




Re: New EXPLAIN option: ALL

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 09:30:47 +0200, David Fetter wrote:
> It can get a little tedious turning on (or off) all the boolean
> options to EXPLAIN, so please find attached a shortcut.

I'm not convinced this is a good idea - it seems likely that we'll add
conflicting options at some point, and then this will just be a pain in
the neck.

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 09:34:42 -0400, Tom Lane wrote:
> I'm inclined to wonder why bother with invals at all.  The odds are
> quite good that no other backend will care (which, I imagine, is the
> reasoning behind why the original patch was designed like it was).
> A table that has a lot of concurrent write activity on it is unlikely
> to stay small enough to not have a FSM for long.

But when updating the free space for the first four blocks, we're going
to either have to do an smgrexists() to check whether somebody
concurrently created the FSM, or we might not update an existing FSM. An
inval seems much better.
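
(As an aside, whether a table has grown an FSM fork at all is easy to see
from SQL, though the backend-side check would of course be smgrexists() or
relcache state rather than this; table name hypothetical:)

  -- returns 0 while the free space map fork does not exist yet
  SELECT pg_relation_size('some_small_table', 'fsm');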

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 10:50:19 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I for sure thought I earlier had an idea that'd actually work. But
> > either I've lost it, or it didn't actually work. But perhaps somebody
> > else can come up with something based on the above strawman ideas?
> 
> Both of those ideas fail if an autovacuum starts up after you're
> done looking.

Well, that's why I had proposed to basically to first lock pg_class, and
then wait for other sessions. Which'd be fine, except that it'd also
create deadlock risks :(.


> My advice is to let it go until we have time to work on getting rid
> of the deadlock issues.  If we're successful at that, it might be
> possible to re-enable these tests in the regular regression environment.

Yea, that might be right. I'm planning to leave the tests in until a
bunch of the open REINDEX issues are resolved. Not super likely that
it'd break something, but probably worth anyway?

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-07 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-07 09:34:42 -0400, Tom Lane wrote:
>> I'm inclined to wonder why bother with invals at all.

> But when updating the free space for the first four blocks, we're going
> to either have to do an smgrexists() to check whether somebody
> concurrently created the FSM, or we might not update an existing FSM. An
> inval seems much better.

I do not think sinval messaging is going to be sufficient to avoid that
problem.  sinval is only useful to tell you about changes if you first
take a lock strong enough to guarantee that no interesting change is
happening while you hold the lock.  We are certainly not going to let
writes take an exclusive lock, so I don't see how we could be certain
that we've seen an sinval message telling us about FSM status change.

This seems tied into the idea we've occasionally speculated about
of tracking relation sizes in shared memory to avoid lseek calls.
If we could solve the problems around that, it'd provide a cheap(er)
way to detect whether an FSM should exist or not.

A different way of looking at it is that the FSM data is imprecise
by definition, therefore it doesn't matter that much if some backend
is slow to realize that the FSM exists.  That still doesn't lead me
to think that sinval messaging is a component of the solution though.

regards, tom lane




Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 12:25:43 +0900, Michael Paquier wrote:
> On Tue, May 07, 2019 at 12:07:56PM +0900, Michael Paquier wrote:
> > Now, what do we do about the potential deadlock issues in
> > WaitForOlderSnapshots?  The attached is an isolation test able to
> > reproduce the deadlock within WaitForOlderSnapshots() with two
> > parallel REINDEX CONCURRENTLY.  I'd like to think that the best way to
> > do that would be to track in vacuumFlags the backends running a
> > REINDEX and just exclude them from GetCurrentVirtualXIDs() because
> > we don't actually care about missing index entries in this case like
> > VACUUM.  But it looks also to me that is issue is broader and goes
> > down to utility commands which can take a lock on a table which cannot
> > be run in transaction blocks, hence code paths used by CREATE INDEX
> > CONCURRENTLY and DROP INDEX CONCURRENTLY could also cause a similar
> > deadlock, no?
> 
> More to the point, one can just do that without REINDEX:
> - session 1:
> create table aa (a int);
> begin;
> lock aa in row exclusive mode;
> - session 2:
> create index concurrently aai on aa(a); --blocks
> - session 3:
> create index concurrently aai2 on aa(a); --blocks
> - session 1:
> commit;
> 
> Then session 2 deadlocks while session 3 finishes correctly.  I don't
> know if this is a class of problems we'd want to address for v12, but
> if we do then CIC (and DROP INDEX CONCURRENTLY?) could benefit from
> it.

This seems like a pre-existing issue to me. We probably should improve
that, but I don't think it has to be tied to 12.

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 12:04:11 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-07 09:34:42 -0400, Tom Lane wrote:
> >> I'm inclined to wonder why bother with invals at all.
> 
> > But when updating the free space for the first four blocks, we're going
> > to either have to do an smgrexists() to check whether somebody
> > concurrently created the FSM, or we might not update an existing FSM. An
> > inval seems much better.
> 
> I do not think sinval messaging is going to be sufficient to avoid that
> problem.  sinval is only useful to tell you about changes if you first
> take a lock strong enough to guarantee that no interesting change is
> happening while you hold the lock.  We are certainly not going to let
> writes take an exclusive lock, so I don't see how we could be certain
> that we've seen an sinval message telling us about FSM status change.

Sure, but it'll be pretty darn close, rather than there basically not
being any limit except backend lifetime to how long we might not notice
that we'd need to switch to the on-disk FSM.


> This seems tied into the idea we've occasionally speculated about
> of tracking relation sizes in shared memory to avoid lseek calls.
> If we could solve the problems around that, it'd provide a cheap(er)
> way to detect whether an FSM should exist or not.

I'm back on working on a patch that provides that, FWIW. Not yet really
spending time on it, but I've re-skimmed through the code I'd previously
written.

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-07 Thread Tom Lane
Andres Freund  writes:
> Yea, that might be right. I'm planning to leave the tests in until a
> bunch of the open REINDEX issues are resolved. Not super likely that
> it'd break something, but probably worth anyway?

The number of deadlock failures is kind of annoying, so I'd rather remove
the tests from HEAD sooner than later.  What issues around that do you
think remain that these tests would be helpful for?

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 12:07:37 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Yea, that might be right. I'm planning to leave the tests in until a
> > bunch of the open REINDEX issues are resolved. Not super likely that
> > it'd break something, but probably worth anyway?
> 
> The number of deadlock failures is kind of annoying, so I'd rather remove
> the tests from HEAD sooner than later.  What issues around that do you
> think remain that these tests would be helpful for?

I was wondering about
https://postgr.es/m/20190430151735.wi52sxjvxsjvaxxt%40alap3.anarazel.de
but perhaps it's too unlikely to break anything the tests would detect
though.

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-07 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-07 12:04:11 -0400, Tom Lane wrote:
>> I do not think sinval messaging is going to be sufficient to avoid that
>> problem.  sinval is only useful to tell you about changes if you first
>> take a lock strong enough to guarantee that no interesting change is
>> happening while you hold the lock.  We are certainly not going to let
>> writes take an exclusive lock, so I don't see how we could be certain
>> that we've seen an sinval message telling us about FSM status change.

> Sure, but it'll be pretty darn close, rather than there basically not
> being any limit except backend lifetime to how long we might not notice
> that we'd need to switch to the on-disk FSM.

Why do you think there's no limit?  We ordinarily do
RelationGetNumberOfBlocks at least once per query on a table, and
I should think we could fix things so that a "free" side-effect of
that is to get the relcache entry updated with whether an FSM
ought to exist or not.  So I think at worst we'd be one query behind.

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-07 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-07 12:07:37 -0400, Tom Lane wrote:
>> The number of deadlock failures is kind of annoying, so I'd rather remove
>> the tests from HEAD sooner than later.  What issues around that do you
>> think remain that these tests would be helpful for?

> I was wondering about
> https://postgr.es/m/20190430151735.wi52sxjvxsjvaxxt%40alap3.anarazel.de
> but perhaps it's too unlikely to break anything the tests would detect
> though.

Since we don't allow REINDEX CONCURRENTLY on system catalogs, I'm not
seeing any particular overlap there ...

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 12:14:43 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-05-07 12:07:37 -0400, Tom Lane wrote:
> >> The number of deadlock failures is kind of annoying, so I'd rather remove
> >> the tests from HEAD sooner than later.  What issues around that do you
> >> think remain that these tests would be helpful for?
> 
> > I was wondering about
> > https://postgr.es/m/20190430151735.wi52sxjvxsjvaxxt%40alap3.anarazel.de
> > but perhaps it's too unlikely to break anything the tests would detect
> > though.
> 
> Since we don't allow REINDEX CONCURRENTLY on system catalogs, I'm not
> seeing any particular overlap there ...

Well, it rejiggers the way table locks are acquired for all REINDEX
INDEX commands, not just in the CONCURRENTLY. But yea, it's probably
easy to catch issues there on user tables.

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 12:12:37 -0400, Tom Lane wrote:
> Why do you think there's no limit?  We ordinarily do
> RelationGetNumberOfBlocks at least once per query on a table

Well, for the main fork. Which already could have shrunk below the size
that led the FSM to be created. And we only do
RelationGetNumberOfBlocks() when planning, right? Not when using
prepared statements.

Greetings,

Andres Freund




Re: Typos and wording in jsonpath-exec.c

2019-05-07 Thread Magnus Hagander
On Tue, May 7, 2019 at 2:39 PM Daniel Gustafsson  wrote:

> Spotted two minor typos when skimming through code, and a sentence on
> return value which seemed a bit odd since executeJsonPath() can exit on
> ereport().  The attached diff fixes the typos and suggests a new wording.
>

Pushed. Thanks!

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: New EXPLAIN option: ALL

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 11:13:15AM +0300, Sergei Kornilov wrote:
> Hi
> 
> I liked this idea.
> 
> + timing_set = true;
> + es->timing = defGetBoolean(opt);
> + summary_set = true;
> + es->timing = defGetBoolean(opt);
> 
> second es->timing should be es->summary, right?

You are correct! Sorry about the copy-paste-o.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 424d82c9ff519c02ae266700ed1f4b40ac9c495f Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Tue, 7 May 2019 00:26:29 -0700
Subject: [PATCH v2] Add an ALL option to EXPLAIN
To: hackers


which sets all boolean options at once.

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 385d10411f..ade3793c5f 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -43,6 +43,7 @@ EXPLAIN [ ANALYZE ] [ VERBOSE ] statementboolean ]
 TIMING [ boolean ]
 SUMMARY [ boolean ]
+ALL [ boolean ]
 FORMAT { TEXT | XML | JSON | YAML }
 
  
@@ -224,6 +225,16 @@ ROLLBACK;
 

 
+   
+ALL
+
+ 
+  Include (or exclude) all of the above settings before examining the rest
+  of the options. It defaults to FALSE.
+ 
+
+   
+

 FORMAT
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a6c6de78f1..f58b9d7290 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -151,6 +151,26 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt, const char *queryString,
 	bool		summary_set = false;
 
 	/* Parse options list. */
+	/* First, see whether ALL is set */
+	foreach(lc, stmt->options)
+	{
+		DefElem	   *opt = (DefElem *) lfirst(lc);
+
+		if (strcmp(opt->defname, "all") == 0)
+		{
+			es->analyze = defGetBoolean(opt);
+			es->verbose = defGetBoolean(opt);
+			es->costs = defGetBoolean(opt);
+			es->buffers = defGetBoolean(opt);
+			es->settings = defGetBoolean(opt);
+			timing_set = true;
+			es->timing = defGetBoolean(opt);
+			summary_set = true;
+			es->summary = defGetBoolean(opt);
+		}
+	}
+
+	/* If you add another boolean option here, remember to add it above, too */
 	foreach(lc, stmt->options)
 	{
 		DefElem*opt = (DefElem *) lfirst(lc);
@@ -194,6 +214,8 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt, const char *queryString,
 opt->defname, p),
 		 parser_errposition(pstate, opt->location)));
 		}
+		else if (strcmp(opt->defname, "all") == 0)
+			; /* Do nothing */
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3dc0e8a4fb..4a69f9711c 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -10704,6 +10704,7 @@ explain_option_elem:
 explain_option_name:
 			NonReservedWord			{ $$ = $1; }
 			| analyze_keyword		{ $$ = "analyze"; }
+			| ALL	{ $$ = "all"; }
 		;
 
 explain_option_arg:
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index bcddc7601e..8177f9c6da 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -2862,9 +2862,9 @@ psql_completion(const char *text, int start, int end)
 		 * one word, so the above test is correct.
 		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
-			COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "BUFFERS",
+			COMPLETE_WITH("ALL", "ANALYZE", "VERBOSE", "COSTS", "BUFFERS",
 		  "TIMING", "SUMMARY", "FORMAT");
-		else if (TailMatches("ANALYZE|VERBOSE|COSTS|BUFFERS|TIMING|SUMMARY"))
+		else if (TailMatches("ALL|ANALYZE|VERBOSE|COSTS|BUFFERS|TIMING|SUMMARY"))
 			COMPLETE_WITH("ON", "OFF");
 		else if (TailMatches("FORMAT"))
 			COMPLETE_WITH("TEXT", "XML", "JSON", "YAML");
@@ -2874,7 +2874,8 @@ psql_completion(const char *text, int start, int end)
 	  "VERBOSE");
 	else if (Matches("EXPLAIN", "(*)") ||
 			 Matches("EXPLAIN", "VERBOSE") ||
-			 Matches("EXPLAIN", "ANALYZE", "VERBOSE"))
+			 Matches("EXPLAIN", "ANALYZE", "VERBOSE") ||
+			 Matches("EXPLAIN", "ALL"))
 		COMPLETE_WITH("SELECT", "INSERT", "DELETE", "UPDATE", "DECLARE");
 
 /* FETCH && MOVE */





Re: New EXPLAIN option: ALL

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 11:03:23AM +0200, Rafia Sabih wrote:
> On Tue, 7 May 2019 at 09:30, David Fetter  wrote:
> >
> > Folks,
> >
> > It can get a little tedious turning on (or off) all the boolean
> > options to EXPLAIN, so please find attached a shortcut.
> 
> I don't understand this, do you mind explaining a bit may be with an
> example on how you want it to work.

If you're tuning a query interactively, it's a lot simpler to prepend,
for example,

EXPLAIN (ALL, FORMAT JSON)

to it than to prepend something along the lines of

EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS, SETTINGS, TIMING, SUMMARY, 
PARTRIDGE_IN_A_PEAR_TREE, FORMAT JSON)

to it.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: New EXPLAIN option: ALL

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 08:41:47AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-07 09:30:47 +0200, David Fetter wrote:
> > It can get a little tedious turning on (or off) all the boolean
> > options to EXPLAIN, so please find attached a shortcut.
> 
> I'm not convinced this is a good idea - it seems likely that we'll
> add conflicting options at some point, and then this will just be a
> pain in the neck.

I already left out FORMAT for a similar reason, namely that it's not a
boolean, so it's not part of flipping on (or off) all the switches.

Are you seeing a point in the future where there'd be both mutually
exclusive boolean options and no principled reason to choose among
them?  If so, what might it look like?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: New EXPLAIN option: ALL

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 18:34:11 +0200, David Fetter wrote:
> On Tue, May 07, 2019 at 08:41:47AM -0700, Andres Freund wrote:
> > On 2019-05-07 09:30:47 +0200, David Fetter wrote:
> > > It can get a little tedious turning on (or off) all the boolean
> > > options to EXPLAIN, so please find attached a shortcut.
> > 
> > I'm not convinced this is a good idea - it seems likely that we'll
> > add conflicting options at some point, and then this will just be a
> > pain in the neck.
> 
> I already left out FORMAT for a similar reason, namely that it's not a
> boolean, so it's not part of flipping on (or off) all the switches.

Which is already somewhat hard to explain.

Imagine if we had CPU_PROFILE = on (which'd be *extremely*
useful). Would you want that to be switched on automatically?  How about
RECORD_IO_TRACE?  How about DISTINCT_BUFFERS which'd be like BUFFERS
except that we'd track how many different buffers are accessed using HLL
or such? Would also be extremely useful.


> Are you seeing a point in the future where there'd be both mutually
> exclusive boolean options and no principled reason to choose among
> them?  If so, what might it look like?

Yes. CPU_PROFILE_PERF, CPU_PROFILE_VTUNE. And lots more.


Greetings,

Andres Freund




Re: [PATCH v1] Add a way to supply stdin to TAP tests

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 09:39:57AM -0400, Andrew Dunstan wrote:
> 
> On 5/6/19 10:42 PM, David Fetter wrote:
> > On Tue, May 07, 2019 at 11:05:32AM +0900, Kyotaro HORIGUCHI wrote:
> >> Hi.
> >>
> >> At Sun, 28 Apr 2019 17:07:16 +0200, David Fetter  wrote 
> >> in <20190428150716.gp28...@fetter.org>
> >>> Our test coverage needs all the help it can get.
> >>>
> >>> This patch, extracted from another by Fabian Coelho, helps move things
> >>> in that direction.
> >>>
> >>> I'd like to argue that it's not a new feature, and that it should be
> >>> back-patched as far as possible.
> >> The comment for the parameter "in".
> >>
> >> +# - in: standard input
> >>
> >> Perhaps this is "string to be fed to standard input". This also
>> can be an I/O reference, but we don't care about that?
> > OK
> >
> >> +  $in = '' if not defined $in;
> >>
> >> run($cmd, '<', \undef) seems to work, maybe assuming "<
> >> /dev/null", which might be better?
> > Is /dev/null a thing on Windows?
> 
> However, I don't think we should be faking anything here. I think it
> would be better to  avoid setting $in if not supplied and then have this:
> 
> if (defined($in))
> 
> {
> 
>     IPC::Run::run($cmd, '<', \$in, '>', \$stdout, '2>', \$stderr);
> 
> }
> 
> else
> 
> {
> 
>     IPC::Run::run($cmd, '>', \$stdout, '2>', \$stderr);
> 
> }

Done that way.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From cfb0693c485a8b577c3fd96289ce249b3888777a Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 28 Apr 2019 08:01:01 -0700
Subject: [PATCH v2] Add a way to supply stdin to TAP tests
To: hackers


by Fabian Coelho

This will help increase test coverage for anything that might need it,
and a lot of things currently need it.

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index a164cdbd8c..6ad7f681ae 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -519,22 +519,24 @@ sub command_fails_like
 }
 
 # Run a command and check its status and outputs.
-# The 5 arguments are:
+# The 5 to 6 arguments are:
 # - cmd: ref to list for command, options and arguments to run
 # - ret: expected exit status
 # - out: ref to list of re to be checked against stdout (all must match)
 # - err: ref to list of re to be checked against stderr (all must match)
 # - test_name: name of test
+# - in: standard input
 sub command_checks_all
 {
 	local $Test::Builder::Level = $Test::Builder::Level + 1;
 
-	my ($cmd, $expected_ret, $out, $err, $test_name) = @_;
+	my ($cmd, $expected_ret, $out, $err, $test_name, $in) = @_;
+	$in = '' if not defined $in;
 
 	# run command
 	my ($stdout, $stderr);
 	print("# Running: " . join(" ", @{$cmd}) . "\n");
-	IPC::Run::run($cmd, '>', \$stdout, '2>', \$stderr);
+	IPC::Run::run($cmd, '<', \$in, '>', \$stdout, '2>', \$stderr);
 
 	# See http://perldoc.perl.org/perlvar.html#%24CHILD_ERROR
 	my $ret = $?;





Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-05-07 Thread Peter Geoghegan
On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas  wrote:
> I don't understand that reasoning. Yes, _bt_pagedel() will complain if
> it finds a half-dead internal page. But how does that mean that
> _bt_lock_branch_parent() can't encounter one?

I suppose that in theory it could, but only if you allow that any
possible state could be found -- it doesn't seem any more likely than
any other random illegal state.

Even when it happens, we'll get a "failed to re-find parent key" error
message when we go a bit further. Isn't that a logical outcome?

Actually, maybe we won't get that error, because we're talking about a
corrupt index, and all bets are off -- no reason to think that the
half-dead internal page would be consistent with other pages in any
way. But even then, you'll go on to report it in the usual way, since
VACUUM scans nbtree indexes in physical order.
-- 
Peter Geoghegan




Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-05-07 Thread Fujii Masao
On Tue, May 7, 2019 at 5:33 PM Masahiko Sawada  wrote:
>
> On Mon, Apr 8, 2019 at 7:29 PM Masahiko Sawada  wrote:
> >
> > On Mon, Apr 8, 2019 at 7:22 PM Fujii Masao  wrote:
> > >
> > > On Mon, Apr 8, 2019 at 5:30 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > > On Mon, Apr 8, 2019 at 5:15 PM Fujii Masao  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 8, 2019 at 3:58 PM Julien Rouhaud  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Apr 8, 2019 at 8:01 AM Fujii Masao  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Mon, Apr 8, 2019 at 14:22 Tsunakawa, Takayuki 
> > > > > > > :
> > > > > > >>
> > > > > > >> From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> > > > > > >> > "vacuum_truncate" gets my vote too.
> > > > > > >>
> > > > > > >> +1
> > > > > > >
> > > > > > >
> > > > > > > +1
> > > > > > > ISTM that we have small consensus to
> > > > > > > use "vacuum_truncate".
> > > > > >
> > > > > > I'm also fine with this name, and I also saw reports that this 
> > > > > > option
> > > > > > is already needed in some production workload, as Andres explained.
> > > > >
> > > > > OK, so I pushed the "vacuum_truncate" version of the patch.
> > > >
> > > > Thank you!
> > > >
> > > > "TRUNCATE" option for vacuum command should be added to the open items?
> > >
> > > Yes, I think.
> >
> > Added.
> >
> > > Attached is the patch which adds TRUNCATE option into VACUUM.
> >
> > Thank you for the patch! I will review it.
> >
>
> Sorry for the delay. I've reviewed the patch and it looks good to me.

Thanks for the review! I committed the patch.
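
With the reloption and the command option both in place, the behaviour can
now be controlled per table or per run, e.g. (table name hypothetical):

  -- disable end-of-relation truncation for future (auto)vacuums of this table
  ALTER TABLE busy_tab SET (vacuum_truncate = off);

  -- and override it for a single manual run
  VACUUM (TRUNCATE FALSE, VERBOSE) busy_tab;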

Regards,

-- 
Fujii Masao




Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-05-07 Thread Fujii Masao
On Mon, Apr 8, 2019 at 8:15 PM Julien Rouhaud  wrote:
>
> On Mon, Apr 8, 2019 at 12:22 PM Fujii Masao  wrote:
> >
> > On Mon, Apr 8, 2019 at 5:30 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Mon, Apr 8, 2019 at 5:15 PM Fujii Masao  wrote:
> > > >
> > > > On Mon, Apr 8, 2019 at 3:58 PM Julien Rouhaud  
> > > > wrote:
> > > > >
> > > > > On Mon, Apr 8, 2019 at 8:01 AM Fujii Masao  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Apr 8, 2019 at 14:22 Tsunakawa, Takayuki 
> > > > > > :
> > > > > >>
> > > > > >> From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> > > > > >> > "vacuum_truncate" gets my vote too.
> > > > > >>
> > > > > >> +1
> > > > > >
> > > > > >
> > > > > > +1
> > > > > > ISTM that we have small consensus to
> > > > > > use "vacuum_truncate".
> > > > >
> > > > > I'm also fine with this name, and I also saw reports that this option
> > > > > is already needed in some production workload, as Andres explained.
> > > >
> > > > OK, so I pushed the "vacuum_truncate" version of the patch.
> > >
> > > Thank you!
> > >
> > > "TRUNCATE" option for vacuum command should be added to the open items?
> >
> > Yes, I think.
> > Attached is the patch which adds TRUNCATE option into VACUUM.
>
> Thanks.
>
> I just reviewed the patch, it works as expected, no special comment on the 
> code.
>
> Minor nitpicking:
>
> -  lock on the table.
> +  lock on the table. The TRUNCATE parameter
> +  to , if specified, overrides the value
> +  of this option.
>
> maybe "parameter of" instead of "parameter to"?

Thanks for the review! I changed the doc that way.

Regards,

-- 
Fujii Masao




Re: [PATCH v1] Add a way to supply stdin to TAP tests

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 06:47:59PM +0200, David Fetter wrote:
> On Tue, May 07, 2019 at 09:39:57AM -0400, Andrew Dunstan wrote:
> > 
> > On 5/6/19 10:42 PM, David Fetter wrote:
> > > On Tue, May 07, 2019 at 11:05:32AM +0900, Kyotaro HORIGUCHI wrote:
> > >> Hi.
> > >>
> > >> At Sun, 28 Apr 2019 17:07:16 +0200, David Fetter  
> > >> wrote in <20190428150716.gp28...@fetter.org>
> > >>> Our test coverage needs all the help it can get.
> > >>>
> > >>> This patch, extracted from another by Fabian Coelho, helps move things
> > >>> in that direction.
> > >>>
> > >>> I'd like to argue that it's not a new feature, and that it should be
> > >>> back-patched as far as possible.
> > >> The comment for the parameter "in".
> > >>
> > >> +# - in: standard input
> > >>
> > >> Perhaps this is "string to be fed to standard input". This also
> >> can be an I/O reference, but we don't care about that?
> > > OK
> > >
> > >> +$in = '' if not defined $in;
> > >>
> > >> run($cmd, '<', \undef) seems to work, maybe assuming "<
> > >> /dev/null", which might be better?
> > > Is /dev/null a thing on Windows?
> > 
> > However, I don't think we should be faking anything here. I think it
> > would be better to  avoid setting $in if not supplied and then have this:
> > 
> > if (defined($in))
> > 
> > {
> > 
> >     IPC::Run::run($cmd, '<', \$in, '>', \$stdout, '2>', \$stderr);
> > 
> > }
> > 
> > else
> > 
> > {
> > 
> >     IPC::Run::run($cmd, '>', \$stdout, '2>', \$stderr);
> > 
> > }
> 
> Done that way.

It helps to commit the work before putting together the patch.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 3d56bdc7acb5f70e57dc4a19b99ad31e7a13127a Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 28 Apr 2019 08:01:01 -0700
Subject: [PATCH v3] Add a way to supply stdin to TAP tests
To: hackers


by Fabian Coelho

This will help increase test coverage for anything that might need it,
and a lot of things currently need it.

diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index a164cdbd8c..b9b18950ed 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -519,22 +519,30 @@ sub command_fails_like
 }
 
 # Run a command and check its status and outputs.
-# The 5 arguments are:
+# The 5 to 6 arguments are:
 # - cmd: ref to list for command, options and arguments to run
 # - ret: expected exit status
 # - out: ref to list of re to be checked against stdout (all must match)
 # - err: ref to list of re to be checked against stderr (all must match)
 # - test_name: name of test
+# - in: string to be fed to standard input (optional)
 sub command_checks_all
 {
 	local $Test::Builder::Level = $Test::Builder::Level + 1;
 
-	my ($cmd, $expected_ret, $out, $err, $test_name) = @_;
+	my ($cmd, $expected_ret, $out, $err, $test_name, $in) = @_;
 
 	# run command
 	my ($stdout, $stderr);
 	print("# Running: " . join(" ", @{$cmd}) . "\n");
-	IPC::Run::run($cmd, '>', \$stdout, '2>', \$stderr);
+	if (defined $in)
+	{
+		IPC::Run::run($cmd, '<', \$in, '>', \$stdout, '2>', \$stderr);
+	}
+	else
+	{
+		IPC::Run::run($cmd, '>', \$stdout, '2>', \$stderr);
+	}
 
 	# See http://perldoc.perl.org/perlvar.html#%24CHILD_ERROR
 	my $ret = $?;





vacuumdb and new VACUUM options

2019-05-07 Thread Fujii Masao
Hi,

The vacuumdb command supports options corresponding to all VACUUM
parameters except INDEX_CLEANUP and TRUNCATE, which were added
recently.  Should vacuumdb also support those new parameters, i.e.,
add --index-cleanup and --truncate options to the command?
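
If so, the new switches would presumably just translate into the
corresponding parameters of the VACUUM command that vacuumdb sends,
i.e. something like this (a sketch):

  VACUUM (INDEX_CLEANUP FALSE, TRUNCATE FALSE) pgbench_accounts;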

Regards,

-- 
Fujii Masao




We're leaking predicate locks in HEAD

2019-05-07 Thread Tom Lane
After running the core regression tests with installcheck-parallel,
the pg_locks view sometimes shows me apparently-orphaned SIReadLock
entries.  They accumulate over repeated test runs.  Right now,
for example, I see

regression=# select * from pg_locks;
  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction | pid  |      mode       | granted | fastpath
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+------+-----------------+---------+----------
 relation   |   130144 |    12137 |      |       |            |               |         |       |          | 3/7977             | 8924 | AccessShareLock | t       | t
 virtualxid |          |          |      |       | 3/7977     |               |         |       |          | 3/7977             | 8924 | ExclusiveLock   | t       | t
 relation   |   130144 |   136814 |      |       |            |               |         |       |          | 22/536             | 8076 | SIReadLock      | t       | f
 relation   |       95 |   118048 |      |       |            |               |         |       |          | 19/665             | 6738 | SIReadLock      | t       | f
 relation   |   130144 |   134850 |      |       |            |               |         |       |          | 12/3093            | 7984 | SIReadLock      | t       | f
(5 rows)

after having done a couple of installcheck iterations since starting the
postmaster.

The PIDs shown as holding those locks don't exist anymore, but digging
in the postmaster log shows that they were session backends during the
regression test runs.  Furthermore, it seems like they usually were the
ones running either the triggers or portals tests.

I don't see this behavior in v11 (though maybe I just didn't run it
long enough).  In HEAD, a run adds one or two new entries more often
than not.
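
(They are easy to pick out, since their pid no longer corresponds to a
live backend; a quick check along these lines:)

  SELECT l.locktype, l.database, l.relation, l.pid, l.mode
  FROM pg_locks l
  LEFT JOIN pg_stat_activity a ON a.pid = l.pid
  WHERE l.mode = 'SIReadLock' AND a.pid IS NULL;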

This is a pretty bad bug IMO --- quite aside from any ill effects
of the entries themselves, the leak seems fast enough that it'd run
a production installation out of locktable space before very long.

I'd have to say that my first suspicion falls on bb16aba50 ...

regards, tom lane




Re: We're leaking predicate locks in HEAD

2019-05-07 Thread Thomas Munro
On Wed, May 8, 2019 at 5:46 AM Tom Lane  wrote:
> After running the core regression tests with installcheck-parallel,
> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
> entries. [...]

Ugh.

> I'd have to say that my first suspicion falls on bb16aba50 ...

Investigating.

-- 
Thomas Munro
https://enterprisedb.com




Fuzzy thinking in is_publishable_class

2019-05-07 Thread Tom Lane
is_publishable_class has a test "relid >= FirstNormalObjectId",
which I think we should drop, for two reasons:

1. It makes the comment claiming that this function tests the same
things as check_publication_add_relation a lie.

2. The comment about it claims that the purpose is to reject
information_schema relations, but if that's so, it's ineffective.
We consider it supported to drop and recreate information_schema,
and have indeed recommended doing so for some minor-version
upgrades.  After that, the information_schema relations would no
longer have OIDs recognizable to this test.
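
(Easy to verify: in a cluster where information_schema has been
recreated, a query like the one below shows the relations with
perfectly ordinary OIDs, i.e. at or above FirstNormalObjectId (16384):)

  SELECT oid, relname
  FROM pg_class
  WHERE relnamespace = 'information_schema'::regnamespace
  ORDER BY oid;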

So what is the motivation for this test?  If there's an important
reason for it, we need to find a less fragile way to express it.

regards, tom lane




Re: PG12, PGXS and linking pgfeutils

2019-05-07 Thread Tomas Vondra

On Tue, May 07, 2019 at 09:46:07AM -0400, Tom Lane wrote:

Ian Barwick  writes:

Commit cc8d4151 [*] introduced a dependency between some functions in
libpgcommon and libpgfeutils,


This seems rather seriously broken.  I do not think the answer is to
create a global dependency on libpgfeutils.



Yeah. I've added it to the open items.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: We're leaking predicate locks in HEAD

2019-05-07 Thread Andrew Dunstan


On 5/7/19 1:46 PM, Tom Lane wrote:
> After running the core regression tests with installcheck-parallel,
> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
> entries.  They accumulate over repeated test runs.  


Should we have a test for that run at/near the end of the regression
tests? The buildfarm will actually do multiple runs like this if set up
to do parallel checks and test multiple locales.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: We're leaking predicate locks in HEAD

2019-05-07 Thread Tom Lane
Andrew Dunstan  writes:
> On 5/7/19 1:46 PM, Tom Lane wrote:
>> After running the core regression tests with installcheck-parallel,
>> the pg_locks view sometimes shows me apparently-orphaned SIReadLock
>> entries.  They accumulate over repeated test runs.  

> Should we have a test for that run at/near the end of the regression
> tests? The buildfarm will actually do multiple runs like this if set up
> to do parallel checks and test multiple locales.

No, I'm not excited about that idea; I think it'd have all the same
fragility as the late lamented "REINDEX pg_class" test.  A given test
script has no business assuming that other test scripts aren't
legitimately taking out predicate locks, nor assuming that prior test
scripts are fully cleaned up when it runs.

regards, tom lane




Re: [HACKERS] Detrimental performance impact of ringbuffers on performance

2019-05-07 Thread Andres Freund
Hi,

On 2016-04-06 12:57:16 +0200, Andres Freund wrote:
> While benchmarking on hydra
> (c.f. 
> http://archives.postgresql.org/message-id/20160406104352.5bn3ehkcsceja65c%40alap3.anarazel.de),
> which has quite slow IO, I was once more annoyed by how incredibly long
> the vacuum at the end of a pgbench -i takes.
> 
> The issue is that, even for an entirely shared_buffers resident scale,
> essentially no data is cached in shared buffers. The COPY to load data
> uses a 16MB ringbuffer. Then vacuum uses a 256KB ringbuffer. Which means
> that copy immediately writes and evicts all data. Then vacuum reads &
> writes the data in small chunks; again evicting nearly all buffers. Then
> the creation of the ringbuffer has to read that data *again*.
> 
> That's fairly idiotic.
> 
> While it's not easy to fix this in the general case, we introduced those
> ringbuffers for a reason after all, I think we at least should add a
> special case for loads where shared_buffers isn't fully used yet.  Why
> not skip using buffers from the ringbuffer if there's buffers on the
> freelist? If we add buffers gathered from there to the ringlist, we
> should have few cases that regress.
> 
> Additionally, maybe we ought to increase the ringbuffer sizes again one
> of these days? 256kb for VACUUM is pretty damn low.

Just to attach some numbers for this. On my laptop, with a pretty fast
disk (as in ~550MB/s read + write, limited by SATA, not the disk), I get
these results.

I initialized a cluster with pgbench -q -i -s 1000, and VACUUM FREEZEd
pgbench_accounts. I ensured that there are enough WAL files pre-allocated
that neither of the tests run into having to allocate WAL files.

I first benchmarked master, and then in a second run neutered
GetAccessStrategy(), by returning NULL in the BAS_BULKWRITE, BAS_VACUUM
cases.

master:

postgres[949][1]=# CREATE TABLE pgbench_accounts_copy AS SELECT * FROM 
pgbench_accounts ;
SELECT 1
Time: 199803.198 ms (03:19.803)
postgres[949][1]=# VACUUM VERBOSE pgbench_accounts_copy;
INFO:  0: vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:535
INFO:  0: "pgbench_accounts_copy": found 0 removable, 1 
nonremovable row versions in 1639345 out of 1639345 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888968
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 13.31 s, system: 12.82 s, elapsed: 57.86 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 57890.969 ms (00:57.891)
postgres[949][1]=# VACUUM FREEZE VERBOSE pgbench_accounts_copy;
INFO:  0: aggressively vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:530
INFO:  0: "pgbench_accounts_copy": found 0 removable, 1 
nonremovable row versions in 1639345 out of 1639345 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888968
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 25.21 s, system: 33.45 s, elapsed: 185.76 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
Time: 185786.829 ms (03:05.787)

So 199803.198 + 57890.969 + 185786.829 ms


no-copy/vacuum-ringbuffers:

postgres[5372][1]=# CREATE TABLE pgbench_accounts_copy AS SELECT * FROM 
pgbench_accounts ;
SELECT 1
Time: 143109.959 ms (02:23.110)
postgres[5372][1]=# VACUUM VERBOSE pgbench_accounts_copy;
INFO:  0: vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:535
INFO:  0: "pgbench_accounts_copy": found 0 removable, 1 
nonremovable row versions in 1639345 out of 1639345 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888971
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 8.43 s, system: 0.01 s, elapsed: 8.49 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 8504.410 ms (00:08.504)
postgres[5372][1]=# VACUUM FREEZE VERBOSE pgbench_accounts_copy;
INFO:  0: aggressively vacuuming "public.pgbench_accounts_copy"
LOCATION:  lazy_scan_heap, vacuumlazy.c:530
INFO:  0: "pgbench_accounts_copy": found 0 removable, 1 
nonremovable row versions in 1639345 out of 1639345 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 4888971
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 9.07 s, system: 0.78 s, elapsed: 14.22 s.
LOCATION:  lazy_scan_heap, vacuumlazy.c:1500
VACUUM
Time: 14235.619 ms (00:14.236)

So 143109.959 + 8504.410 + 14235.619 ms.


The relative improvements are:
CREATE TABLE AS: 199803.198 -> 143109.959: 39% improvement
VACUUM: 57890.969 -> 8504.410: 580% improvement
VACUUM FREEZE: 1205% improvement
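
(The cache-residency difference is easy to confirm right after the
CREATE TABLE AS with the pg_buffercache extension, e.g.:)

  -- how many shared buffers currently hold pages of the new table
  SELECT count(*) AS cached_buffers
  FROM pg_buffercache b
  WHERE b.reldatabase = (SELECT oid FROM pg_database
                         WHERE datname = current_database())
    AND b.relfilenode = pg_relation_filenode('pgbench_accounts_copy');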

And even if you were to argue - which I don't find entirely convincing -
that the checkpoint's time should be added afterwards, 

Re: make \d pg_toast.foo show its indices

2019-05-07 Thread Robert Haas
On Tue, May 7, 2019 at 11:30 AM Stephen Frost  wrote:
> > Not unless you want to break every regression test that uses \d.
> > Instability of the output is also a reason not to show the
> > toast table's name in the parent's \d[+].
>
> So we need a way to turn it off.  That doesn't seem like it'd be hard to
> implement and the information is certainly quite useful.

Ugh.  It's not really worth it if we have to go to such lengths.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Naming of pg_checksums

2019-05-07 Thread Robert Haas
On Mon, May 6, 2019 at 1:56 PM Bruce Momjian  wrote:
> Is there a reason pg_checksums is plural and not singular, i.e.,
> pg_checksum?  I know it is being renamed for PG 12.  It might have
> needed to be plural when it was pg_verify_checksums.

That is a good question, IMHO.  I am not sure whether pg_checksum is
better, but I'm pretty sure it's not worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: make \d pg_toast.foo show its indices

2019-05-07 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, May 7, 2019 at 11:30 AM Stephen Frost  wrote:
> > > Not unless you want to break every regression test that uses \d.
> > > Instability of the output is also a reason not to show the
> > > toast table's name in the parent's \d[+].
> >
> > So we need a way to turn it off.  That doesn't seem like it'd be hard to
> > implement and the information is certainly quite useful.
> 
> Ugh.  It's not really worth it if we have to go to such lengths.

I don't think I agree..  We've gone to pretty great lengths to have
things that can be turned on and off for explain because they're useful
to have but not something that's predictable in the regression tests.
This doesn't strike me as all that different (indeed, if anything it
seems like it should be less of an issue since it's entirely client
side...).

Having our test framework deny us useful features just strikes me as
bizarre.

Thanks,

Stephen




Re: Inconsistent error message wording for REINDEX CONCURRENTLY

2019-05-07 Thread Tom Lane
I wrote:
> After looking around a bit, I propose that we invent
> "IsCatalogRelationOid(Oid reloid)" (not wedded to that name), which
> is a wrapper around IsCatalogClass() that does the needful syscache
> lookup for you.  Aside from this use-case, it could be used in
> sepgsql/dml.c, which I see is also using
> IsSystemNamespace(get_rel_namespace(oid)) for the wrong thing.

> I'm also thinking that it'd be a good idea to rename IsSystemNamespace
> to IsCatalogNamespace.  The existing naming is totally confusing given
> that it doesn't square with the distinction between IsSystemRelation
> and IsCatalogRelation (ie that the former includes user toast tables).
> There are only five external callers of it, and per this discussion
> at least two of them are wrong anyway.

After studying the callers of these catalog.c functions for awhile,
I realized that IsCatalogRelation/Class are really fundamentally wrong,
and have been so for a long time.  The reason is that while they will
return FALSE for tables in the information_schema, they will return
TRUE for toast tables attached to the information_schema tables.
(They're toast tables, and they have OIDs below FirstNormalObjectId,
so there you have it.)  This is wrong on its face: if those tables don't
need to be protected as catalogs, why should their TOAST appendages
need it?  Moreover, if you drop and recreate information_schema, you'll
start getting different behavior for them, which is even sillier.
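
(The objects in question are easy to list.  In a fresh cluster their
OIDs fall in the initdb range: below FirstNormalObjectId, which is why
the existing test claims them, but not below FirstBootstrapObjectId:)

  -- TOAST tables attached to information_schema relations, with their OIDs
  SELECT t.oid AS toast_oid, t.relname AS toast_table, p.relname AS owner
  FROM pg_class p
  JOIN pg_class t ON t.oid = p.reltoastrelid
  WHERE p.relnamespace = 'information_schema'::regnamespace;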

I was driven to this realization by the following very confused (and
confusing) bit in ReindexMultipleTables:

/*
 * Skip system tables that index_create() would reject to index
 * concurrently.  XXX We need the additional check for
 * FirstNormalObjectId to skip information_schema tables, because
 * IsCatalogClass() here does not cover information_schema, but the
 * check in index_create() will error on the TOAST tables of
 * information_schema tables.
 */
if (concurrent &&
(IsCatalogClass(relid, classtuple) || relid < FirstNormalObjectId))
{

That's nothing but a hack, and the reason it's necessary is that
index_create will throw error if IsCatalogRelation is true, which
it will be for information_schema TOAST tables --- but not for their
parent tables that are being examined here.

After looking around, it seems to me that the correct definition for
IsCatalogRelation is just "is the OID less than FirstBootstrapObjectId?".
Currently we could actually restrict it to "less than
FirstGenbkiObjectId", because all the catalogs, indexes, and TOAST tables
have hand-assigned OIDs --- but perhaps someday we'll let genbki.pl
assign some of those OIDs, so I prefer the weaker constraint.  In any
case, this gives us a correct separation between objects that are
traceable to the bootstrap data and those that are created by plain SQL
later in initdb.

With this, the Form_pg_class argument to IsCatalogClass becomes
vestigial.  I'm tempted to get rid of that function altogether in
favor of direct calls to IsCatalogRelationOid, but haven't done so
in the attached.

Comments?

regards, tom lane

diff --git a/contrib/sepgsql/dml.c b/contrib/sepgsql/dml.c
index cc934b5..2892346 100644
--- a/contrib/sepgsql/dml.c
+++ b/contrib/sepgsql/dml.c
@@ -161,12 +161,10 @@ check_relation_privileges(Oid relOid,
 	 */
 	if (sepgsql_getenforce() > 0)
 	{
-		Oid			relnamespace = get_rel_namespace(relOid);
-
-		if (IsSystemNamespace(relnamespace) &&
-			(required & (SEPG_DB_TABLE__UPDATE |
+		if ((required & (SEPG_DB_TABLE__UPDATE |
 		 SEPG_DB_TABLE__INSERT |
-		 SEPG_DB_TABLE__DELETE)) != 0)
+		 SEPG_DB_TABLE__DELETE)) != 0 &&
+			IsCatalogRelationOid(relOid))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 	 errmsg("SELinux: hardwired security policy violation")));
diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 6d8c746..b5e554f 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -53,15 +53,18 @@
 
 /*
  * IsSystemRelation
- *		True iff the relation is either a system catalog or toast table.
- *		By a system catalog, we mean one that created in the pg_catalog schema
- *		during initdb.  User-created relations in pg_catalog don't count as
- *		system catalogs.
+ *		True iff the relation is either a system catalog or a toast table.
+ *		See IsCatalogRelation for the exact definition of a system catalog.
  *
- *		NB: TOAST relations are considered system relations by this test
- *		for compatibility with the old IsSystemRelationName function.
- *		This is appropriate in many places but not all.  Where it's not,
- *		also check IsToastRelation or use IsCatalogRelation().
+ *		We treat toast tables of user relations as "system relations" for
+ *		protection purposes, e.g. you can't change their schemas without
+ *		special permissions.  Therefore, most uses of this function are
+ *		checki

Re: New EXPLAIN option: ALL

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 09:44:30AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-07 18:34:11 +0200, David Fetter wrote:
> > On Tue, May 07, 2019 at 08:41:47AM -0700, Andres Freund wrote:
> > > On 2019-05-07 09:30:47 +0200, David Fetter wrote:
> > > > It can get a little tedious turning on (or off) all the boolean
> > > > options to EXPLAIN, so please find attached a shortcut.
> > > 
> > > I'm not convinced this is a good idea - it seems likely that we'll
> > > add conflicting options at some point, and then this will just be a
> > > pain in the neck.
> > 
> > I already left out FORMAT for a similar reason, namely that it's not a
> > boolean, so it's not part of flipping on (or off) all the switches.
> 
> Which is already somewhat hard to explain.
> 
> Imagine if we had CPU_PROFILE = on (which'd be *extremely*
> useful). Would you want that to be switched on automatically?  How about
> RECORD_IO_TRACE?  How about DISTINCT_BUFFERS which'd be like BUFFERS
> except that we'd track how many different buffers are accessed using HLL
> or such? Would also be extremely useful.
> 
> 
> > Are you seeing a point in the future where there'd be both mutually
> > exclusive boolean options and no principled reason to choose among
> > them?  If so, what might it look like?
> 
> Yes. CPU_PROFILE_PERF, CPU_PROFILE_VTUNE. And lots more.

Thanks for clarifying.

Would you agree that there's a problem here as I described as
motivation for people who operate databases?

If so, do you have one or more abbreviations in mind that aren't
called ALL? I realize that Naming Things™ is one of two hard problems
in computer science, but it's still one we have to tackle.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Fuzzy thinking in is_publishable_class

2019-05-07 Thread Tom Lane
I wrote:
> is_publishable_class has a test "relid >= FirstNormalObjectId",
> which I think we should drop, for two reasons:

> 1. It makes the comment claiming that this function tests the same
> things as check_publication_add_relation a lie.

> 2. The comment about it claims that the purpose is to reject
> information_schema relations, but if that's so, it's ineffective.
> We consider it supported to drop and recreate information_schema,
> and have indeed recommended doing so for some minor-version
> upgrades.  After that, the information_schema relations would no
> longer have OIDs recognizable to this test.

> So what is the motivation for this test?  If there's an important
> reason for it, we need to find a less fragile way to express it.

After further digging around, I wonder whether this test wasn't
somehow related to the issue described in

https://postgr.es/m/2321.1557263...@sss.pgh.pa.us

That doesn't completely make sense, since the restriction on
relkind should render it moot whether IsCatalogClass thinks
that a toast table is a catalog table, but maybe there's a link?

regards, tom lane




Re: New EXPLAIN option: ALL

2019-05-07 Thread Peter Geoghegan
On Tue, May 7, 2019 at 9:31 AM David Fetter  wrote:
> If you're tuning a query interactively, it's a lot simpler to prepend,
> for example,
>
> EXPLAIN (ALL, FORMAT JSON)
>
> to it than to prepend something along the lines of
>
> EXPLAIN(ANALYZE, VERBOSE, COSTS, BUFFERS, SETTINGS, TIMING, SUMMARY, 
> PARTRIDGE_IN_A_PEAR_TREE, FORMAT JSON)
>
> to it.

FWIW, I have the following in my psqlrc:

\set ea 'EXPLAIN (ANALYZE, SETTINGS, VERBOSE, BUFFERS) '

The idea behind that is that I can prepend ":ea" as needed, rather
than doing a lot of typing each time, as in:

:ea SELECT ...


--
Peter Geoghegan




Re: make \d pg_toast.foo show its indices

2019-05-07 Thread Tom Lane
Stephen Frost  writes:
> Having our test framework deny us useful features just strikes me as
> bizarre.

This is presuming that it's useful, which is debatable IMO.
I think most people will find it useless noise almost all of the time.

regards, tom lane




Re: New EXPLAIN option: ALL

2019-05-07 Thread Andres Freund
Hi,

On 2019-05-07 23:23:55 +0200, David Fetter wrote:
> Would you agree that there's a problem here as I described as
> motivation for people who operate databases?

Yea, but I don't think the solution is where you seek it.  I think the
problem is that our defaults for EXPLAIN, in particular EXPLAIN ANALYZE,
are dumb. And that your desire for ALL stems from that, rather than it
being desirable on its own.

We really e.g. should just enable BUFFERS by default. The reason we
can't is that right now we have checks like:
EXPLAIN (BUFFERS) SELECT 1;
ERROR:  22023: EXPLAIN option BUFFERS requires ANALYZE
LOCATION:  ExplainQuery, explain.c:206

but we ought to simply remove them. There's no benefit, and besides
preventing us from enabling BUFFERS by default it means that
enabling/disabling ANALYZE is more work than necessary.


> If so, do you have one or more abbreviations in mind that aren't
> called ALL? I realize that Naming Things™ is one of two hard problems
> in computer science, but it's still one we have to tackle.

As I said, I don't think ALL is a good idea under any name.  Like it
just makes no sense to have ANALYZE, SUMMARY, VERBOSE, BUFFERS,
SETTINGS, FORMAT controlled by one option, unless you call it DWIM. It's
several separate axes (query is executed or not (ANALYZE), verbosity
(SUMMARY, VERBOSE), collecting additional information (BUFFERS, TIMING),
output format).

Greetings,

Andres Freund




Re: make \d pg_toast.foo show its indices

2019-05-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > Having our test framework deny us useful features just strikes me as
> > bizarre.
> 
> This is presuming that it's useful, which is debatable IMO.
> I think most people will find it useless noise almost all of the time.

Alright, maybe I'm not the best representation of our user base, but I
sure type 'select oid,* from pg_class where relname = ...' with some
regularity, mostly to get the oid to then go do something else.  Having
the relfilenode would be nice too, now that I think about it, and
reltuples.  There are ways to get *nearly* everything that's in pg_class
and friends out of various \d incantations, but not quite everything,
which seems unfortunate.
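
That is, something along these lines (table name hypothetical):

  SELECT oid, relfilenode, reltuples
  FROM pg_class
  WHERE relname = 'comments';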

In any case, I can understand an argument that the code it requires is
too much to maintain for a relatively minor feature (though it hardly
seems like it would be...) or that it would be confusing or unhelpful to
users (aka "noise") much of the time, so I'll leave it to others to
comment on if they think any of these ideas be a useful addition or not.

I just don't think we should be voting down a feature because it'd take
a bit of extra effort to make our regression tests work with it, which
is all I was intending to get at here.

Thanks!

Stephen




Re: jsonpath

2019-05-07 Thread Alexander Korotkov
On Tue, May 7, 2019 at 5:35 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > Attached patchset contains revised commit messages.  I'm going to
> > commit this on no objections.
>
> Sorry for slow response --- I was tied up with release preparations.
>
> The -5 patches look pretty good.  A couple of nits:
>
> @@ -774,9 +749,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, 
> JsonPathItem *jsp,
>  {
>  RETURN_ERROR(ereport(ERROR,
>   (errcode(ERRCODE_JSON_ARRAY_NOT_FOUND),
> -  errmsg(ERRMSG_JSON_ARRAY_NOT_FOUND),
> -  errdetail("jsonpath array accessor can 
> "
> -"only be applied to an 
> array";
> +  errdetail("jsonpath array accessor can 
> only be applied to an array";
>  }
>  break;
>
> I think you forgot to s/errdetail/errmsg/ in this one.  Likewise at:
>
> +  errdetail("jsonpath wildcard member 
> accessor can only be applied to an object";
>
>
> Also, I saw two places where you missed removing a trailing period
> from an errmsg:
>
> +  errmsg("left operand of jsonpath operator %s 
> is not a single numeric value.",
> + jspOperationName(jsp->type);
>
> +  errmsg("right operand of jsonpath operator %s 
> is not a single numeric value.",
> + jspOperationName(jsp->type);
>
>
> I'd suggest making a quick pass for other instances of the same mistakes,
> just in case.  I'm good with the wording on everything now.

Thank you!  I've caught a couple of other cases with errdetail() instead
of errmsg().  Pushed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: New EXPLAIN option: ALL

2019-05-07 Thread Tom Lane
Andres Freund  writes:
> As I said, I don't think ALL is a good idea under any name.  Like it
> just makes no sense to have ANALYZE, SUMMARY, VERBOSE, BUFFERS,
> SETTINGS, FORMAT controlled by one option, unless you call it DWIM. It's
> several separate axes (query is executed or not (ANALYZE), verbosity
> (SUMMARY, VERBOSE), collecting additional information (BUFFERS, TIMING),
> output format).

FWIW, I find this line of argument fairly convincing.  There may well
be a case for rethinking just how EXPLAIN's options behave, but "ALL"
doesn't seem like a good conceptual model.

One idea that comes to mind is that VERBOSE could be redefined as some
sort of package of primitive options, including all of the "additional
information" options, with the ability to turn individual ones off again
if you wanted.  So for example (VERBOSE, BUFFERS OFF) would give you
everything except buffer stats.  We'd need a separate flag/flags to
control what VERBOSE originally did, but that doesn't bother me ---
it's an opportunity for more clarity of definition, anyway.
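
To sketch how that might look (hypothetical semantics under this proposal,
not today's behavior; "foo" is just a stand-in table):

EXPLAIN (VERBOSE) SELECT * FROM foo;               -- all the extra detail
EXPLAIN (VERBOSE, BUFFERS OFF) SELECT * FROM foo;  -- everything except buffer stats
EXPLAIN (ANALYZE, VERBOSE) SELECT * FROM foo;      -- execute, plus all the extra detail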

I do feel that it's a good idea to keep ANALYZE separate. "Execute
the query or not" is a mighty fundamental thing.  I've never liked
that name for the option though --- maybe we could deprecate it
in favor of EXECUTE?

regards, tom lane




Re: New EXPLAIN option: ALL

2019-05-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Andres Freund  writes:
> > As I said, I don't think ALL is a good idea under any name.  Like it
> > just makes no sense to have ANALYZE, SUMMARY, VERBOSE, BUFFERS,
> > SETTINGS, FORMAT controlled by one option, unless you call it DWIM. It's
> > several separate axes (query is executed or not (ANALYZE), verbosity
> > (SUMMARY, VERBOSE), collecting additional information (BUFFERS, TIMING),
> > output format).
> 
> FWIW, I find this line of argument fairly convincing.  There may well
> be a case for rethinking just how EXPLAIN's options behave, but "ALL"
> doesn't seem like a good conceptual model.
> 
> One idea that comes to mind is that VERBOSE could be redefined as some
> sort of package of primitive options, including all of the "additional
> information" options, with the ability to turn individual ones off again
> if you wanted.  So for example (VERBOSE, BUFFERS OFF) would give you
> everything except buffer stats.  We'd need a separate flag/flags to
> control what VERBOSE originally did, but that doesn't bother me ---
> it's an opportunity for more clarity of definition, anyway.

I'm generally in favor of doing something like what Tom is suggesting
with VERBOSE, but I also feel like it should be the default for formats
like JSON.  If you're asking for the output in JSON, then we really
should include everything that a flag like VERBOSE would contain because
you're pretty clearly planning to copy/paste that output into something
else to read it anyway.

> I do feel that it's a good idea to keep ANALYZE separate. "Execute
> the query or not" is a mighty fundamental thing.  I've never liked
> that name for the option though --- maybe we could deprecate it
> in favor of EXECUTE?

Let's not fool ourselves by saying we'd 'deprecate' it because that
implies, at least to me, that there's some intention of later on
removing it and people will potentially propose patches to do that,
which we will then almost certainly spend hours arguing about with the
result being that we don't actually remove it.

I'm all in favor of adding an alias for analyze called 'execute', as
that makes a lot more sense and then updating our documentation to use
it, with 'analyze is accepted as an alias' as a footnote.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: New EXPLAIN option: ALL

2019-05-07 Thread Tom Lane
Stephen Frost  writes:
> I'm generally in favor of doing something like what Tom is suggesting
> with VERBOSE, but I also feel like it should be the default for formats
> like JSON.  If you're asking for the output in JSON, then we really
> should include everything that a flag like VERBOSE would contain because
> you're pretty clearly planning to copy/paste that output into something
> else to read it anyway.

Meh --- I don't especially care for non-orthogonal behaviors like that.
If you wanted JSON but *not* all of the additional info, how would you
specify that?  (The implementation I had in mind would make VERBOSE OFF
more or less a no-op, so that wouldn't get you there.)

>> I do feel that it's a good idea to keep ANALYZE separate. "Execute
>> the query or not" is a mighty fundamental thing.  I've never liked
>> that name for the option though --- maybe we could deprecate it
>> in favor of EXECUTE?

> Let's not fool ourselves by saying we'd 'deprecate' it because that
> implies, at least to me, that there's some intention of later on
> removing it

True, the odds of ever actually removing it are small :-(.  I meant
mostly changing all of our docs to use the other spelling, except
for some footnote.  Maybe we could call ANALYZE a "legacy spelling"
of EXECUTE.

regards, tom lane




Re: Failure in contrib test _int on loach

2019-05-07 Thread Heikki Linnakangas

On 02/05/2019 10:37, Heikki Linnakangas wrote:

(resending, previous attempt didn't make it to pgsql-hackers)

On 29/04/2019 16:16, Anastasia Lubennikova wrote:

In previous emails, I have sent two patches with test and bugfix (see
attached).
After Heikki shared his concerns, I've rechecked the algorithm and
haven't found any potential error.
So, if other hackers are agreed with my reasoning, the suggested fix is
sufficient and can be committed.


I still believe there is a problem with grandparent splits with this.
I'll try to construct a test case later this week, unless you manage to
create one before that.


Here you go. If you apply the two patches from 
https://www.postgresql.org/message-id/5d48ce28-34cf-9b03-5d42-dbd5457926bf%40postgrespro.ru, 
and run the attached script, it will print out something like this:


postgres=# \i grandparent.sql
DROP TABLE
CREATE TABLE
INSERT 0 15
CREATE INDEX
psql:grandparent.sql:27: NOTICE:  working on 1
psql:grandparent.sql:27: NOTICE:  working on 2
psql:grandparent.sql:27: NOTICE:  working on 3
psql:grandparent.sql:27: NOTICE:  working on 4
psql:grandparent.sql:27: NOTICE:  working on 5
psql:grandparent.sql:27: NOTICE:  working on 6
psql:grandparent.sql:27: NOTICE:  working on 7
psql:grandparent.sql:27: NOTICE:  working on 8
psql:grandparent.sql:27: NOTICE:  working on 9
psql:grandparent.sql:27: NOTICE:  working on 10
psql:grandparent.sql:27: NOTICE:  working on 11
psql:grandparent.sql:27: NOTICE:  failed for 114034
psql:grandparent.sql:27: NOTICE:  working on 12
DO

That "failed for 114034" should not happen.

Fortunately, that's not too hard to fix. We just need to arrange things 
so that the "retry_from_parent" flag also gets set for the grandparent, 
when the grandparent is split. Like in the attached patch.


- Heikki
>From 7e0b7fa120b9c4c3e0f0f71dc7a5f35ba806ea7f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 8 May 2019 01:29:52 +0300
Subject: [PATCH 1/1] Detect internal GiST page splits correctly during index
 build.

As we descend the GiST tree during insertion, we modify any downlinks on
the way down to include the new tuple we're about to insert (if they don't
cover it already). Modifying an existing downlink might cause an internal
page to split, if the new downlink tuple is larger than the old one. If
that happens, we need to back up to the parent and re-choose a page to
insert to. We used to detect that situation, thanks to the NSN-LSN
interlock normally used to detect concurrent page splits, but that got
broken by commit 9155580fd5. With that commit, we now use a dummy constant
LSN value for every page during index build, so the LSN-NSN interlock no
longer works. I thought that was OK because there can't be any other
backends modifying the index during index build, but missed that the
insertion itself can modify the page we're inserting to. The consequence
was that we would sometimes insert the new tuple to an incorrect page, one
whose downlink doesn't cover the new tuple.

Thomas Munro noticed that the contrib/intarray regression test started
failing occasionally on the buildfarm after commit 9155580fd5. The failure
was intermittent, because the gistchoose() function is not deterministic,
and would only occasionally create the right circumstances for this bug to
cause the failure.

To fix, add a flag to the stack that keeps track of the state while
descending tree, to indicate that a page was split, and we need to retry
the descend from the parent.

Patch by Anastasia Lubennikova, with some changes by me to make it work
correctly also when the internal page split also causes the "grandparent"
to be split.

Discussion: https://www.postgresql.org/message-id/CA%2BhUKGJRzLo7tZExWfSbwM3XuK7aAK7FhdBV0FLkbUG%2BW0v0zg%40mail.gmail.com
---
 src/backend/access/gist/gist.c| 33 +++
 src/include/access/gist_private.h |  7 +++
 2 files changed, 40 insertions(+)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 769aca42e3..d70a138f54 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -639,6 +639,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
 	firststack.lsn = 0;
+	firststack.retry_from_parent = false;
 	firststack.parent = NULL;
 	firststack.downlinkoffnum = InvalidOffsetNumber;
 	state.stack = stack = &firststack;
@@ -651,6 +652,21 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	 */
 	for (;;)
 	{
+		/*
+		 * If we split an internal page while descending the tree, we have to
+		 * retry at the parent. (Normally, the LSN-NSN interlock below would
+		 * also catch this and cause us to retry. But LSNs are not updated
+		 * during index build.)
+		 */
+		while (stack->retry_from_parent)
+		{
+			if (xlocked)
+LockBuffer(stack->buffer, GIST_UNLOCK);
+			xlocked = false;
+			ReleaseBuffer(stack->buffer

Re: New EXPLAIN option: ALL

2019-05-07 Thread David Fetter
On Tue, May 07, 2019 at 06:12:56PM -0400, Stephen Frost wrote:
> Greetings,
> 
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > 
> > One idea that comes to mind is that VERBOSE could be redefined as
> > some sort of package of primitive options, including all of the
> > "additional information" options, with the ability to turn
> > individual ones off again if you wanted.  So for example (VERBOSE,
> > BUFFERS OFF) would give you everything except buffer stats.  We'd
> > need a separate flag/flags to control what VERBOSE originally did,
> > but that doesn't bother me --- it's an opportunity for more
> > clarity of definition, anyway.
> 
> I'm generally in favor of doing something like what Tom is
> suggesting with VERBOSE, but I also feel like it should be the
> default for formats like JSON.  If you're asking for the output in
> JSON, then we really should include everything that a flag like
> VERBOSE would contain because you're pretty clearly planning to
> copy/paste that output into something else to read it anyway.

So basically, every format but text gets the full treatment for
(essentially) the functionality I wrapped up in ALL?  That makes a lot
of sense.

> > I do feel that it's a good idea to keep ANALYZE separate. "Execute
> > the query or not" is a mighty fundamental thing.  I've never liked
> > that name for the option though --- maybe we could deprecate it
> > in favor of EXECUTE?
> 
> Let's not fool ourselves by saying we'd 'deprecate' it because that
> implies, at least to me, that there's some intention of later on
> removing it and people will potentially propose patches to do that,
> which we will then almost certainly spend hours arguing about with the
> result being that we don't actually remove it.

Excellent point.

> I'm all in favor of adding an alias for analyze called 'execute', as
> that makes a lot more sense and then updating our documentation to
> use it, with 'analyze is accepted as an alias' as a footnote.

How about making ANALYZE a backward-compatibility feature in the sense
of replacing examples, docs, etc., with EXECUTE? If most of our users
are in the future, this makes things better for those same users without
qualification, and helps some positive fraction of our current users.

On a slightly related topic, we haven't, to date, made any promises
about what EXPLAIN will put out, but as we make more machine-readable
versions, we should at least think about its schema and versioning
of same.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?

2019-05-07 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-07 12:25:43 +0900, Michael Paquier wrote:
>> Then session 2 deadlocks while session 3 finishes correctly.  I don't
>> know if this is a class of problems we'd want to address for v12, but
>> if we do then CIC (and DROP INDEX CONCURRENTLY?) could benefit from
>> it.

> This seems like a pre-existing issue to me. We probably should improve
> that, but I don't think it has to be tied to 12.

Yeah.  CREATE INDEX CONCURRENTLY has always had a deadlock hazard,
so it's hardly surprising that REINDEX CONCURRENTLY does too.
I don't think that fixing that is in-scope for v12, even if we had
an idea how to do it, which we don't.

We do need to fix the wrong-lock-level problem of course, but
that seems straightforward.

regards, tom lane




Re: New EXPLAIN option: ALL

2019-05-07 Thread Stephen Frost
Greetings,

* David Fetter (da...@fetter.org) wrote:
> On Tue, May 07, 2019 at 06:12:56PM -0400, Stephen Frost wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > One idea that comes to mind is that VERBOSE could be redefined as
> > > some sort of package of primitive options, including all of the
> > > "additional information" options, with the ability to turn
> > > individual ones off again if you wanted.  So for example (VERBOSE,
> > > BUFFERS OFF) would give you everything except buffer stats.  We'd
> > > need a separate flag/flags to control what VERBOSE originally did,
> > > but that doesn't bother me --- it's an opportunity for more
> > > clarity of definition, anyway.
> > 
> > I'm generally in favor of doing something like what Tom is
> > suggesting with VERBOSE, but I also feel like it should be the
> > default for formats like JSON.  If you're asking for the output in
> > JSON, then we really should include everything that a flag like
> > VERBOSE would contain because you're pretty clearly planning to
> > copy/paste that output into something else to read it anyway.
> 
> So basically, every format but text gets the full treatment for
> (essentially) the functionality I wrapped up in ALL?  That makes a lot
> of sense.

Something along those lines is what I was thinking, yes, and it's what
at least some other projects do (admittedly, that's at least partially
my fault because I'm thinking of the 'info' command for pgbackrest, but
David Steele seemed to think it made sense also, at least).

> > I'm all in favor of adding an alias for analyze called 'execute', as
> > that makes a lot more sense and then updating our documentation to
> > use it, with 'analyze is accepted as an alias' as a footnote.
> 
> How about making ANALYZE a backward-compatibility feature in the sense
> of replacing examples, docs, etc., with EXECUTE? If most of our users
> are in the future, this makes things better for those same users without
> qualification, and helps some positive fraction of our current users.

I'd rather not refer to it as a backwards-compatibility feature since
we, thankfully, don't typically do that and I generally think that's the
right way to go- but in some cases, like this one, having a 'legacy'
spelling or an alias seems to be darn near free without opening the box
of trying to provide backwards compatibility for everything.

> On a slightly related topic, we haven't, to date, made any promises
> about what EXPLAIN will put out, but as we make more machine-readable
> versions, we should at least think about its schema and versioning
> of same.

Not really sure that I agree on this point.  Do you see a reason to need
versioning or schema when the schema is, essentially, included in each
result since it's JSON or the other machine-readable formats?  I can
imagine that we might need a version if we decided to redefine some
existing field in a non-compatible or non-sensible way, but is that
possibility likely enough to warrant adding versioning and complicating
everything downstream?  I have a hard time seeing that.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: New EXPLAIN option: ALL

2019-05-07 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > I'm generally in favor of doing something like what Tom is suggesting
> > with VERBOSE, but I also feel like it should be the default for formats
> > like JSON.  If you're asking for the output in JSON, then we really
> > should include everything that a flag like VERBOSE would contain because
> > you're pretty clearly planning to copy/paste that output into something
> > else to read it anyway.
> 
> Meh --- I don't especially care for non-orthogonal behaviors like that.
> If you wanted JSON but *not* all of the additional info, how would you
> specify that?  (The implementation I had in mind would make VERBOSE OFF
> more or less a no-op, so that wouldn't get you there.)

You'd do it the same way you proposed for verbose -- e.g. BUFFERS OFF, et
al, but, really, the point here is that what you're doing with the JSON
result is fundamentally different- you're going to paste it into some
other tool and it should be that tool's job to manage the visualization
of it and what's included or not in what you see.  Passing the
information about what should be seen in the json-based EXPLAIN viewer
by way of omitting things from the JSON output strikes me as downright
odd, and doesn't give that other tool the ability to show that data if
the user ends up wanting it without rerunning the query.

> >> I do feel that it's a good idea to keep ANALYZE separate. "Execute
> >> the query or not" is a mighty fundamental thing.  I've never liked
> >> that name for the option though --- maybe we could deprecate it
> >> in favor of EXECUTE?
> 
> > Let's not fool ourselves by saying we'd 'deprecate' it because that
> > implies, at least to me, that there's some intention of later on
> > removing it
> 
> True, the odds of ever actually removing it are small :-(.  I meant
> mostly changing all of our docs to use the other spelling, except
> for some footnote.  Maybe we could call ANALYZE a "legacy spelling"
> of EXECUTE.

Sure, that'd be fine too.

Thanks,

Stephen


signature.asc
Description: PGP signature


Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-07 Thread Donald Dong
Hi,

I was expecting the plans generated by standard_join_search to have lower costs
than the plans from GEQO. But the results I have from a join order
benchmark show that GEQO produces plans with lower costs most of the time!

I wonder what is causing this observation? From my understanding,
standard_join_search is doing a complete search. So I'm not sure how the GEQO
managed to do better than that.

Thank you,
Donald Dong



Re: Heap lock levels for REINDEX INDEX CONCURRENTLY not quite right?

2019-05-07 Thread Michael Paquier
On Tue, May 07, 2019 at 06:45:36PM -0400, Tom Lane wrote:
> Yeah.  CREATE INDEX CONCURRENTLY has always had a deadlock hazard,
> so it's hardly surprising that REINDEX CONCURRENTLY does too.
> I don't think that fixing that is in-scope for v12, even if we had
> an idea how to do it, which we don't.

The most straightforward approach I can think of would be to
determine if non-transactional commands taking a lock on a table can
be safely skipped or not when checking for snapshots older than the
minimum where the index is marked as valid.  That's quite complex to
target for v12, so I agree to keep it out of the stability work.

> We do need to fix the wrong-lock-level problem of course, but
> that seems straightforward.

Sure.
--
Michael


signature.asc
Description: PGP signature


Re: Why could GEQO produce plans with lower costs than the standard_join_search?

2019-05-07 Thread Tom Lane
Donald Dong  writes:
> I was expecting the plans generated by standard_join_search to have lower 
> costs
> than the plans from GEQO. But the results I have from a join order
> benchmark show that GEQO produces plans with lower costs most of the time!

> I wonder what is causing this observation? From my understanding,
> standard_join_search is doing a complete search. So I'm not sure how the GEQO
> managed to do better than that.

standard_join_search is *not* exhaustive; there's a heuristic that causes
it not to consider clauseless joins unless it has to.

For the most part, GEQO uses the same heuristic (cf desirable_join()),
but given the right sort of query shape you can probably trick it into
situations where it will be forced to use a clauseless join when the
core code wouldn't.  It'd still be surprising for that to come out with
a lower cost estimate than a join order that obeys the heuristic,
though.  Clauseless joins are generally pretty awful.
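
As an illustration of the query shape in question (an invented example,
not one of your benchmark queries): if t1 and t3 are only connected
through t2, then any join order that combines t1 with t3 first has no
applicable join clause and degenerates into a cartesian product, which
is exactly what the heuristic tries to avoid unless it has no choice:

-- t1 and t3 are linked only through t2, so the join order
-- ((t1, t3), t2) would require a clauseless (cartesian) join.
SELECT *
FROM t1
JOIN t2 ON t1.a = t2.a
JOIN t3 ON t2.b = t3.b;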

I'm a tad suspicious about the representativeness of your benchmark
queries if you find this is happening "most of the time".

regards, tom lane




Re: accounting for memory used for BufFile during hash joins

2019-05-07 Thread Melanie Plageman
On Mon, May 6, 2019 at 8:15 PM Tomas Vondra 
wrote:

> Nope, that's not how it works. It's the array of batches that gets
> sliced, not the batches themselves.
>
> It does slightly increase the amount of data we need to shuffle between
> the temp files, because we can't write the data directly to batches in
> "future" slices. But that amplification is capped to ~2.2x (compared to
> the ~1.4x in master) - I've shared some measurements in [1].
>
> [1]
> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>
>
Cool, I misunderstood. I looked at the code again today, and at the email
thread where you measured "amplification".

In terms of how many times you write each tuple, is it accurate to say that
a
tuple can now be spilled three times (in the worst case) whereas, before, it
could be spilled only twice?

1 - when building the inner side hashtable, tuple is spilled to a "slice"
file
2 - (assuming the number of batches was increased) during execution, when a
tuple belonging to a later slice's spill file is found, it is re-spilled to
that
slice's spill file
3 - during execution, when reading from its slice file, it is re-spilled
(again)
to its batch's spill file

Is it correct that the max number of BufFile structs you will have is equal
to
the number of slices + number of batches in a slice
because that is the max number of open BufFiles you would have at a time?

By the way, applying v4 patch on master, in an assert build, I am tripping
some
asserts -- starting with
Assert(!file->readOnly);
in BufFileWrite

One thing I was a little confused by was the nbatch_inmemory member of the
hashtable.  The comment in ExecChooseHashTableSize says that it is
determining
the number of batches we can fit in memory.  I thought that the problem was
the
amount of space taken up by the BufFile data structure itself--which is
related
to the number of open BufFiles you need at a time. This comment in
ExecChooseHashTableSize makes it sound like you are talking about fitting
more
than one batch of tuples into memory at a time. I was under the impression
that
you could only fit one batch of tuples in memory at a time.

So, I was stepping through the code with work_mem set to the lower bound,
and in
ExecHashIncreaseNumBatches, I got confused.
hashtable->nbatch_inmemory was 2 for me, thus, nbatch_tmp was 2
so, I didn't meet this condition
if (nbatch_tmp > hashtable->nbatch_inmemory)
since I just set nbatch_tmp using hashtable->nbatch_inmemory
So, I didn't increase the number of slices, which is what I was expecting.
What happens when hashtable->nbatch_inmemory is equal to nbatch_tmp?

-- 
Melanie Plageman


Re: vacuumdb and new VACUUM options

2019-05-07 Thread Masahiko Sawada
On Wed, May 8, 2019 at 2:41 AM Fujii Masao  wrote:
>
> Hi,
>
> vacuumdb command supports the corresponding options to
> any VACUUM parameters except INDEX_CLEANUP and TRUNCATE
> that were added recently. Should vacuumdb also support those
> new parameters, i.e., add --index-cleanup and --truncate options
> to the command?

I think it's a good idea to add new options for these parameters to
vacuumdb. While making the INDEX_CLEANUP option patch I also attached a
patch for the INDEX_CLEANUP parameter before [1], although it adds a
--disable-index-cleanup option instead.

[1] 0002 patch on
https://www.postgresql.org/message-id/CAD21AoBtM%3DHGLkMKBgch37mf0-epa3_o%3DY1PU0_m9r5YmtS-NQ%40mail.gmail.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: accounting for memory used for BufFile during hash joins

2019-05-07 Thread Melanie Plageman
On Tue, May 7, 2019 at 6:59 AM Tomas Vondra 
wrote:

> On Tue, May 07, 2019 at 04:28:36PM +1200, Thomas Munro wrote:
> >On Tue, May 7, 2019 at 3:15 PM Tomas Vondra
> > wrote:
> >> On Tue, May 07, 2019 at 01:48:40PM +1200, Thomas Munro wrote:
> >> Switching to some other algorithm during execution moves the goal posts
> >> to the next galaxy, I'm afraid.
> >
> >The main problem I'm aware of with sort-merge join is: not all that is
> >hashable is sortable.  So BNL is actually the only solution I'm aware
> >of for problem B that doesn't involve changing a fundamental thing
> >about PostgreSQL's data type requirements.
> >
>
> Sure, each of those algorithms has limitations. But I think that's mostly
> irrelevant to the main issue - switching between algorithms mid-execution.
> At that point some of the tuples might already have been sent to the
> other nodes, and I have no idea how to "resume" the tuple stream short of
> buffering everything locally until the join completes. And that would be
> rather terrible, I guess.
>
>
What if you switched to NLJ on a batch-by-batch basis and did it before
starting execution of the join, but after building the inner side of the hash
table? That way, no tuples will have been sent to other nodes yet.

-- 
Melanie Plageman


Re: _bt_split(), and the risk of OOM before its critical section

2019-05-07 Thread Peter Geoghegan
On Mon, May 6, 2019 at 4:11 PM Peter Geoghegan  wrote:
> I am tempted to move the call to _bt_truncate() out of _bt_split()
> entirely on HEAD, possibly relocating it to
> nbtsplitloc.c/_bt_findsplitloc(). That way, there is a clearer
> separation between how split points are chosen, suffix truncation, and
> the mechanical process of executing a legal page split.

I decided against that -- better to make it clear how truncation deals
with space overhead within _bt_split(). Besides, the resource
management around sharing a maybe-palloc()'d high key across module
boundaries seems complicated, and best avoided.

Attached draft patch for HEAD fixes the bug by organizing _bt_split()
into clear phases. _bt_split() now works as follows, which is a little
different:

*  An initial phase that is entirely concerned with the left page temp
buffer itself -- initializes its special area.

* Suffix truncation to get left page's new high key, and then add it
to left page.

* A phase that is mostly concerned with initializing the right page
special area, but also finishes off one or two details about the left
page that needed to be delayed. This is also where the "shadow
critical section" begins. Note also that this is where
_bt_vacuum_cycleid() is called, because its contract actually
*requires* that caller has a buffer lock on both pages at once. This
should not be changed on the grounds that _bt_vacuum_cycleid() might
fail (nor for any other reason).

* Add new high key to right page if needed. (No change, other than the
fact that it happens later now.)

* Add other items to both leftpage and rightpage. Critical section
that copies leftpage into origpage buffer. (No changes here.)

I suppose I'm biased, but I prefer the new approach anyway. Adding the
left high key first, and then the right high key seems simpler and
more logical. It emphasizes the similarities and differences between
leftpage and rightpage. Furthermore, this approach fixes the
theoretical risk of leaving behind a minimally-initialized nbtree page
that has existed since 2010. We don't allocate *any* memory after the
point that a new rightpage buffer is acquired.

I suppose that this will need to be backpatched.

Thoughts?
--
Peter Geoghegan


v1-0001-Don-t-leave-behind-junk-nbtree-pages-during-split.patch
Description: Binary data


Re: We're leaking predicate locks in HEAD

2019-05-07 Thread Thomas Munro
On Wed, May 8, 2019 at 6:56 AM Thomas Munro  wrote:
> On Wed, May 8, 2019 at 5:46 AM Tom Lane  wrote:
> > I'd have to say that my first suspicion falls on bb16aba50 ...
>
> Investigating.

Reproduced here.  Once the system reaches a state where it's leaking
(which happens only occasionally for me during installcheck-parallel),
it keeps leaking for future SSI transactions.  The cause is
SxactGlobalXmin getting stuck.  The attached fixes it for me.  I can't
remember why on earth I made that change, but it is quite clearly
wrong: you have to check every transaction, or you might never advance
SxactGlobalXmin.

-- 
Thomas Munro
https://enterprisedb.com


0001-Fix-SxactGlobalXmin-tracking.patch
Description: Binary data


Re: postgres_fdw: another oddity in costing aggregate pushdown paths

2019-05-07 Thread Etsuro Fujita

(2019/02/25 19:59), Etsuro Fujita wrote:

(2019/02/22 23:10), Antonin Houska wrote:

Etsuro Fujita wrote:

As mentioned in the near thread, I think there is another oversight in
the cost estimation for aggregate pushdown paths in postgres_fdw, IIUC.
When costing an aggregate pushdown path using local statistics, we
re-use the estimated costs of implementing the underlying scan/join
relation, cached in the relation's PgFdwRelationInfo (ie,
rel_startup_cost and rel_total_cost). Since these costs wouldn't yet
contain the costs of evaluating the final scan/join target, as tlist
replacement by apply_scanjoin_target_to_paths() is performed afterwards.
So I think we need to adjust these costs so that the tlist eval costs
are included, but ISTM that estimate_path_cost_size() forgot to do so.
Attached is a patch for fixing this issue.


I think the following comment in apply_scanjoin_target_to_paths() should
mention that FDWs rely on the new value of reltarget.

/*
* Update the reltarget. This may not be strictly necessary in all cases,
* but it is at least necessary when create_append_path() gets called
* below directly or indirectly, since that function uses the reltarget as
* the pathtarget for the resulting path. It seems like a good idea to do
* it unconditionally.
*/
rel->reltarget = llast_node(PathTarget, scanjoin_targets);


Agreed. How about mentioning that like the attached? In addition, I
added another assertion to estimate_path_cost_size() in that patch.


This no longer applies cleanly after commit 1d33858406.  Here is a 
rebased version of the patch.  I also modified the comments a little 
bit.  If there are no objections from Antonin or anyone else, I'll 
commit the patch.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***
*** 2842,2853  estimate_path_cost_size(PlannerInfo *root,
--- 2842,2859 
  		}
  		else if (IS_UPPER_REL(foreignrel))
  		{
+ 			RelOptInfo *outerrel = fpinfo->outerrel;
  			PgFdwRelationInfo *ofpinfo;
  			AggClauseCosts aggcosts;
  			double		input_rows;
  			int			numGroupCols;
  			double		numGroups = 1;
  
+ 			/* The upper relation should have its outer relation set */
+ 			Assert(outerrel);
+ 			/* and that outer relation should have its reltarget set */
+ 			Assert(outerrel->reltarget);
+ 
  			/*
  			 * This cost model is mixture of costing done for sorted and
  			 * hashed aggregates in cost_agg().  We are not sure which
***
*** 2856,2862  estimate_path_cost_size(PlannerInfo *root,
  			 * and all finalization and run cost are added in total_cost.
  			 */
  
! 			ofpinfo = (PgFdwRelationInfo *) fpinfo->outerrel->fdw_private;
  
  			/* Get rows and width from input rel */
  			input_rows = ofpinfo->rows;
--- 2862,2868 
  			 * and all finalization and run cost are added in total_cost.
  			 */
  
! 			ofpinfo = (PgFdwRelationInfo *) outerrel->fdw_private;
  
  			/* Get rows and width from input rel */
  			input_rows = ofpinfo->rows;
***
*** 2909,2919  estimate_path_cost_size(PlannerInfo *root,
  
  			/*-
  			 * Startup cost includes:
! 			 *	  1. Startup cost for underneath input relation
  			 *	  2. Cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			startup_cost = ofpinfo->rel_startup_cost;
  			startup_cost += aggcosts.transCost.startup;
  			startup_cost += aggcosts.transCost.per_tuple * input_rows;
  			startup_cost += aggcosts.finalCost.startup;
--- 2915,2927 
  
  			/*-
  			 * Startup cost includes:
! 			 *	  1. Startup cost for underneath input relation, adjusted for
! 			 *	 tlist replacement by apply_scanjoin_target_to_paths()
  			 *	  2. Cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			startup_cost = ofpinfo->rel_startup_cost;
+ 			startup_cost += outerrel->reltarget->cost.startup;
  			startup_cost += aggcosts.transCost.startup;
  			startup_cost += aggcosts.transCost.per_tuple * input_rows;
  			startup_cost += aggcosts.finalCost.startup;
***
*** 2921,2931  estimate_path_cost_size(PlannerInfo *root,
  
  			/*-
  			 * Run time cost includes:
! 			 *	  1. Run time cost of underneath input relation
  			 *	  2. Run time cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			run_cost = ofpinfo->rel_total_cost - ofpinfo->rel_startup_cost;
  			run_cost += aggcosts.finalCost.per_tuple * numGroups;
  			run_cost += cpu_tuple_cost * numGroups;
  
--- 2929,2941 
  
  			/*-
  			 * Run time cost includes:
! 			 *	  1. Run time cost of underneath input relation, adjusted for
! 			 *	 tlist replacement by apply_scanjoin_target_to_paths()
  			 *	  2. Run time cost of performing aggregation, per cost_agg()
  			 *-
  			 */
  			run_cost = ofpinfo->rel_total_cost - ofpinfo->rel_startup_cost;
+ 			run_cost += outerrel->reltarget->cost.per_tuple * input_rows;
  			run_cost += aggcosts.finalCost.per_tuple * numGroups;
  

Re: We're leaking predicate locks in HEAD

2019-05-07 Thread Tom Lane
Thomas Munro  writes:
> Reproduced here.  Once the system reaches a state where it's leaking
> (which happens only occasionally for me during installcheck-parallel),
> it keeps leaking for future SSI transactions.  The cause is
> SxactGlobalXmin getting stuck.  The attached fixes it for me.  I can't
> remember why on earth I made that change, but it is quite clearly
> wrong: you have to check every transaction, or you might never advance
> SxactGlobalXmin.

Hm.  So I don't have any opinion about whether this is a correct fix for
the leak, but I am quite distressed that the system failed to notice that
it was leaking predicate locks.  Shouldn't there be the same sort of
leak-detection infrastructure that we have for most types of resources?

regards, tom lane




Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-05-07 Thread Kyotaro HORIGUCHI
At Tue, 07 May 2019 20:47:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190507.204728.233299873.horiguchi.kyot...@lab.ntt.co.jp>
> Hello.
> 
> At Tue, 7 May 2019 14:39:55 +0530, Rajkumar Raghuwanshi 
>  wrote in 
> 
> > Hi,
> > As this issue is reproducible without partition-wise aggregate also,
> > changing email subject from "Statistical aggregate functions are not
> > working with partitionwise aggregate " to "Statistical aggregate functions
> > are not working with PARTIAL aggregation".
> > 
> > original reported test case and discussion can be found at below link.
> > https://www.postgresql.org/message-id/flat/CAKcux6%3DuZEyWyLw0N7HtR9OBc-sWEFeByEZC7t-KDf15FKxVew%40mail.gmail.com
> 
> The immediate reason for the behavior seems that
> EEOP_AGG_STRICT_INPUT_CHECK_ARGS considers non existent second
> argument as null, which is out of arguments list in
> trans_fcinfo->args[].
> 
> The invalid deserialfn_oid case in ExecBuildAggTrans, it
> initializes args[1] using the second argument of the functoin
> (int8pl() in the case) so the correct numTransInputs here is 1,
> not 2.
> 
> I don't understand this fully but at least the attached patch
> makes the test case work correctly and this seems to be the only
> case of this issue.

This behavior was introduced by 69c3936a14 (in v11).  At that time
FunctionCallInfoData was palloc0'ed and had fixed-length members
arg[6] and argnull[7], so nulls[1] was always false even if nargs
= 1, and the issue had not been revealed.

After a9c35cf85c (in v12), the same check is done on a
FunctionCallInfoData that has a NullableDatum args[] array of only the
required number of elements.  In that case args[1] is outside the
palloc'ed memory, so this issue has been revealed.

On a second look, it seems to me that the right thing to do here
is to set numTransInputs to numInputs instead of numArguments in the
combining step.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From c5dcbc32646e8e4865307bb0525a143da573e240 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 8 May 2019 13:01:01 +0900
Subject: [PATCH] Give correct number of arguments to combine function.

Combine function's first parameter is used as transition value. The
correct number of transition input for the function is not the number
of arguments but of transition input.
---
 src/backend/executor/nodeAgg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index d01fc4f52e..e13a8cb304 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2912,7 +2912,8 @@ build_pertrans_for_aggref(AggStatePerTrans pertrans,
 	pertrans->aggtranstype = aggtranstype;
 
 	/* Detect how many arguments to pass to the transfn */
-	if (AGGKIND_IS_ORDERED_SET(aggref->aggkind))
+	if (AGGKIND_IS_ORDERED_SET(aggref->aggkind) ||
+		DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
 		pertrans->numTransInputs = numInputs;
 	else
 		pertrans->numTransInputs = numArguments;
-- 
2.16.3



Re: Statistical aggregate functions are not working with PARTIAL aggregation

2019-05-07 Thread Kyotaro HORIGUCHI
At Wed, 08 May 2019 13:06:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190508.130636.184826233.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 07 May 2019 20:47:28 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20190507.204728.233299873.horiguchi.kyot...@lab.ntt.co.jp>
> > Hello.
> > 
> > At Tue, 7 May 2019 14:39:55 +0530, Rajkumar Raghuwanshi 
> >  wrote in 
> > 
> > > Hi,
> > > As this issue is reproducible without partition-wise aggregate also,
> > > changing email subject from "Statistical aggregate functions are not
> > > working with partitionwise aggregate " to "Statistical aggregate functions
> > > are not working with PARTIAL aggregation".
> > > 
> > > original reported test case and discussion can be found at below link.
> > > https://www.postgresql.org/message-id/flat/CAKcux6%3DuZEyWyLw0N7HtR9OBc-sWEFeByEZC7t-KDf15FKxVew%40mail.gmail.com
> > 
> > The immediate reason for the behavior seems that
> > EEOP_AGG_STRICT_INPUT_CHECK_ARGS considers non existent second
> > argument as null, which is out of arguments list in
> > trans_fcinfo->args[].
> > 
> > The invalid deserialfn_oid case in ExecBuildAggTrans, it
> > initializes args[1] using the second argument of the functoin
> > (int8pl() in the case) so the correct numTransInputs here is 1,
> > not 2.
> > 
> > I don't understand this fully but at least the attached patch
> > makes the test case work correctly and this seems to be the only
> > case of this issue.
> 
> This behavior was introduced by 69c3936a14 (in v11).  At that time
> FunctionCallInfoData was palloc0'ed and had fixed-length members
> arg[6] and argnull[7], so nulls[1] was always false even if nargs
> = 1, and the issue had not been revealed.
> 
> After a9c35cf85c (in v12), the same check is done on a
> FunctionCallInfoData that has a NullableDatum args[] array of only the
> required number of elements.  In that case args[1] is outside the
> palloc'ed memory, so this issue has been revealed.
> 
> On a second look, it seems to me that the right thing to do here
> is to set numTransInputs to numInputs instead of numArguments in the
> combining step.

By the way, as mentioned above, this issue has existed since 11 but
only does harm in 12. Is this an open item, or an older bug?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: We're leaking predicate locks in HEAD

2019-05-07 Thread Thomas Munro
On Wed, May 8, 2019 at 3:53 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Reproduced here.  Once the system reaches a state where it's leaking
> > (which happens only occasionally for me during installcheck-parallel),
> > it keeps leaking for future SSI transactions.  The cause is
> > SxactGlobalXmin getting stuck.  The attached fixes it for me.  I can't
> > remember why on earth I made that change, but it is quite clearly
> > wrong: you have to check every transaction, or you might never advance
> > SxactGlobalXmin.
>
> Hm.  So I don't have any opinion about whether this is a correct fix for
> the leak, but I am quite distressed that the system failed to notice that
> it was leaking predicate locks.  Shouldn't there be the same sort of
> leak-detection infrastructure that we have for most types of resources?

Well, it is hooked up to the usual release machinery, because it's in
ReleasePredicateLocks(), which is wired into the
RESOURCE_RELEASE_LOCKS phase of resowner.c.  The thing is that lock
lifetime is linked to the last transaction with the oldest known xmin,
not the transaction that created them.

More analysis:  Lock clean-up is deferred until "... the last
serializable transaction with the oldest xmin among serializable
transactions completes", but I broke that by excluding read-only
transactions from the check so that SxactGlobalXminCount gets out of
sync.  There's a read-only SSI transaction in
src/test/regress/sql/transactions.sql, but I think the reason the
problem manifests only intermittently with installcheck-parallel is
because sometimes the read-only optimisation kicks in (effectively
dropping us to plain old SI because there's no concurrent serializable
activity) and it doesn't take any locks at all, and sometimes the
read-only transaction doesn't have the oldest known xmin among
serializable transactions.  However, if a read-write SSI transaction
had already taken a snapshot and has the oldest xmin and then the
read-only one starts with the same xmin, we get into trouble.  When
the read-only one releases, we fail to decrement SxactGlobalXminCount,
and then we'll never call ClearOldPredicateLocks().

-- 
Thomas Munro
https://enterprisedb.com




RE: Copy data to DSA area

2019-05-07 Thread Ideriha, Takeshi
Hi, 

>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>Sent: Friday, April 26, 2019 11:50 PM
>Well, after developing PoC, I realized that this PoC doesn't solve the local 
>process is
>crashed before the context becomes shared because local process keeps track of
>pointer to chunks.
>Maybe all of you have already noticed and pointed out this case :) So it needs 
>another
>work but this poc is a good step for me to advance more.

I think the key to preventing memory leaks is allocating memory and storing its
address into a structure at the same time. This structure should be trackable
from other processes.

So I'm going to change the design. I'll allocate a buffer of pointers to
DSA objects which are not yet permanent. This buffer is located in shared
memory. For simplicity, one buffer is allocated per process, so there are
as many buffers as MaxBackends.

The developer calls an API to make objects permanent. When objects become
permanent, the corresponding pointers are deleted from the buffer. If they
fail to become permanent, they are freed from shared memory by walking the
buffer, and the corresponding pointers are also deleted from the buffer.

There are two cases of failure: one is transaction abort, the other is a
process crash. If a transaction aborts, the process itself is responsible
for freeing the objects. In case of a process crash, another backend needs
to take care of it; I'm assuming an on_shmem_exit callback can be used so
that the postmaster frees the objects.

Regarding MemoryContext, one shared MemoryContext lives in shared memory,
and each chunk has a back pointer to it for pfree() and so on.
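
To make that concrete, here is a rough sketch of the per-backend tracking
structure I have in mind. All the names below (PendingDsaChunks,
MAX_PENDING_CHUNKS, track_pending, discard_pending) are placeholders for
illustration, not code from any existing patch:

#include "postgres.h"

#include "utils/dsa.h"

/* Illustrative sketch only -- one slot per backend in plain shared memory. */
#define MAX_PENDING_CHUNKS 32

typedef struct PendingDsaChunks
{
	int			nchunks;
	dsa_pointer chunks[MAX_PENDING_CHUNKS];	/* allocated but not yet permanent */
} PendingDsaChunks;

/*
 * Array of MaxBackends slots, set up during shared memory initialization
 * (e.g. with ShmemInitStruct).
 */
static PendingDsaChunks *pending_chunks;

/* Remember an allocation until the caller marks it permanent. */
static void
track_pending(int backend_slot, dsa_pointer dp)
{
	PendingDsaChunks *slot = &pending_chunks[backend_slot];

	Assert(slot->nchunks < MAX_PENDING_CHUNKS);
	slot->chunks[slot->nchunks++] = dp;
}

/*
 * Free everything still pending; called on transaction abort, or by
 * cleanup code when another backend's crash is being handled.
 */
static void
discard_pending(dsa_area *area, int backend_slot)
{
	PendingDsaChunks *slot = &pending_chunks[backend_slot];

	for (int i = 0; i < slot->nchunks; i++)
		dsa_free(area, slot->chunks[i]);
	slot->nchunks = 0;
}

Making an object permanent would simply remove its dsa_pointer from the
slot, so anything left in the slot at abort or crash time is known garbage.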

I'm going to make a new patch and send it later.

Regards,
Takeshi Ideriha






Re: Copy data to DSA area

2019-05-07 Thread Thomas Munro
On Wed, May 8, 2019 at 5:29 PM Ideriha, Takeshi
 wrote:
> >From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> >Sent: Friday, April 26, 2019 11:50 PM
> >Well, after developing PoC, I realized that this PoC doesn't solve the local 
> >process is
> >crashed before the context becomes shared because local process keeps track 
> >of
> >pointer to chunks.
> >Maybe all of you have already noticed and pointed out this case :) So it 
> >needs another
> >work but this poc is a good step for me to advance more.
>
> I think the key to preventing memory leaks is allocating memory and storing its
> address into a structure at the same time. This structure should be trackable
> from other processes.

I'm not sure that it's necessarily wrong to keep tracking information
in private memory.  If any backend crashes, the postmaster will
terminate all backends and reinitialise everything anyway, because
shared memory might be corrupted.


--
Thomas Munro
https://enterprisedb.com




Re: Pluggable Storage - Andres's take

2019-05-07 Thread Ashwin Agrawal
On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal  wrote:
>
> Also wish to point out, while working on Zedstore, we realized that
> TupleDesc from Relation object can be trusted at AM layer for
> scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> catalog is updated first and hence the relation object passed to AM
> layer reflects new TupleDesc. For heapam its fine as it doesn't use
> the TupleDesc today during scans in AM layer for scan_getnextslot().
> Only TupleDesc which can trusted and matches the on-disk layout of the
> tuple for scans hence is from TupleTableSlot. Which is little
> unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> and not in scan_begin(). Means if AM wishes to do some initialization
> based on TupleDesc for scans can't be done in scan_begin() and forced
> to delay till has access to TupleTableSlot. We should at least add
> comment for scan_begin() to strongly clarify not to trust Relation
> object TupleDesc. Or maybe other alternative would be have separate
> API for rewrite case.

Just to correct my typo: I meant to say that the TupleDesc from the Relation
object can't be trusted at the AM layer for the scan_begin() API.

Andres, any thoughts on the above? I see you had proposed to "change the
table_beginscan* API so it provides a slot" in [1], but it seems to have
received no response/comments at that time.

[1]
https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de


Adding SMGR discriminator to buffer tags

2019-05-07 Thread Thomas Munro
Hello hackers,

On another thread, lots of undo log-related patches have been traded.
Buried deep in the stack is one that I'd like to highlight and discuss
in a separate thread, because it relates to a parallel thread of
development and it'd be good to get feedback on it.

In commit 3eb77eba, Shawn Debnath and I extended the checkpointer
fsync machinery to support more kinds of files.  Next, we'd like to
teach the buffer pool to deal with more kinds of buffers.  The context
for this collaboration is that he's working on putting things like
CLOG into shared buffers, and my EDB colleagues and I are working on
putting undo logs into shared buffers.  We want a simple way to put
any block-structured stuff into shared buffers, not just plain
"relations".

The questions are: how should buffer tags distinguish different kinds
of buffers, and how should SMGR direct IO traffic to the right place
when it needs to schlepp pages in and out?

In earlier prototype code, I'd been using a special database number
for undo logs.  In a recent thread[1], Tom and others didn't like that
idea much, and Shawn mentioned his colleague's idea of stealing unused
bits from the fork number so that there is no net change in tag size,
but we have entirely separate namespaces for each kind of buffered
data.

Here's a patch that does that, and then makes changes in the main
places I have found so far that need to be aware of the new SMGR ID
field.
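
For anyone who hasn't opened the patch, here is a rough illustration of
the bit-stealing idea -- the names, values and bit layout below are
placeholders of mine, not what the attached patch actually uses:

#include "postgres.h"

#include "common/relpath.h"		/* ForkNumber */

/*
 * Illustrative only: fork numbers are tiny (currently 0..3), so the high
 * bits of the same field can carry a small SMGR identifier.
 */
typedef enum SmgrId
{
	SMGR_MD = 0,				/* ordinary relation files */
	SMGR_UNDO,					/* undo logs */
	SMGR_SLRU					/* clog-like data */
} SmgrId;

#define SMGR_ID_SHIFT	8

static inline uint32
pack_fork_and_smgrid(ForkNumber forknum, SmgrId smgrid)
{
	return ((uint32) smgrid << SMGR_ID_SHIFT) | (uint32) forknum;
}

static inline ForkNumber
unpack_forknum(uint32 packed)
{
	return (ForkNumber) (packed & ((1 << SMGR_ID_SHIFT) - 1));
}

static inline SmgrId
unpack_smgrid(uint32 packed)
{
	return (SmgrId) (packed >> SMGR_ID_SHIFT);
}

The net effect is that each (smgr, fork) pair gets its own namespace in
the buffer tag without making the tag any bigger.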

Thoughts?

[1] 
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BDE0mmiBZMtZyvwWtgv1sZCniSVhXYsXkvJ_Wo%2B83vvw%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-SmgrId-to-smgropen-and-BufferTag.patch
Description: Binary data


0002-Move-tablespace-dir-creation-from-smgr.c-to-md.c.patch
Description: Binary data


Missing FDW documentation about GetForeignUpperPaths

2019-05-07 Thread Etsuro Fujita
In commit d50d172e51, which adds support for FINAL relation pushdown
in postgres_fdw, I forgot to update the FDW documentation about
GetForeignUpperPaths to mention that the extra parameter of that
function points to a FinalPathExtraData structure introduced by that
commit in the case of FINAL relation pushdown.  Attached is a patch
for that.

Best regards,
Etsuro Fujita


FDW-documentation-GetForeignUpperPaths.patch
Description: Binary data


RE: Copy data to DSA area

2019-05-07 Thread Ideriha, Takeshi
Hi, Thomas 

>-Original Message-
>From: Thomas Munro [mailto:thomas.mu...@gmail.com]
>Subject: Re: Copy data to DSA area
>
>On Wed, May 8, 2019 at 5:29 PM Ideriha, Takeshi 
>
>wrote:
>> >From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>> >Sent: Friday, April 26, 2019 11:50 PM Well, after developing PoC, I
>> >realized that this PoC doesn't solve the local process is crashed
>> >before the context becomes shared because local process keeps track
>> >of pointer to chunks.
>> >Maybe all of you have already noticed and pointed out this case :) So
>> >it needs another work but this poc is a good step for me to advance more.
>>
>> I think the key to preventing memory leaks is allocating memory and
>> storing its address into a structure at the same time. This structure
>> should be trackable from other processes.
>
>I'm not sure that it's necessarily wrong to keep tracking information in 
>private memory.
>If any backend crashes, the postmaster will terminate all backends and 
>reinitialise
>everything anyway, because shared memory might be corrupted.

Thank you very much for the clarification. I haven't looked into the
reinitialization sequence much. I checked that CreateSharedMemoryAndSemaphores
is called again and shared memory gets initialized. So I'm going to keep the
tracking information in private memory and send a patch.

Regards,
Takeshi Ideriha


