Re: Checkpoint start logging is done inside critical section

2018-10-18 Thread Ants Aasma
On Thu, Oct 18, 2018 at 9:02 AM Amit Kapila  wrote:
>
> On Thu, Oct 18, 2018 at 10:27 AM Andres Freund  wrote:
> > (that's why we mark the ctx as being ok with that).
> >
>
> Yeah, the palloc for the log message would be done in ErrorContext,
> where it is safe to do the allocation, so ideally this shouldn't be a
> problem.  Ants, do you see any problem in any particular scenario, or
> was this based on theoretical analysis?

This was purely theoretical, as also evidenced by the lack of complaints
even though the code has been like that for a very long time. I was
actually mostly worried about extension code run by a logging hook
causing the panic.

Regards,
Ants Aasma



Re: DSM robustness failure (was Re: Peripatus/failures)

2018-10-18 Thread Thomas Munro
On Thu, Oct 18, 2018 at 6:02 PM Larry Rosenman  wrote:
> Let me know soon(ish) if any of you want to poke at this machine, as I'm
> likely to forget and reboot it.

Hi Larry,

Thanks for the offer but it looks like there is no way to get our
hands on that memory for forensics now.  I'll see if there is some way
to improve that situation for next time something like this happens.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: DSM robustness failure (was Re: Peripatus/failures)

2018-10-18 Thread Thomas Munro
On Thu, Oct 18, 2018 at 5:00 PM Amit Kapila  wrote:
> The below code seems to be problematic:
> dsm_cleanup_using_control_segment()
> {
> ..
> if (!dsm_control_segment_sane(old_control, mapped_size))
> {
> dsm_impl_op(DSM_OP_DETACH, old_control_handle, 0, &impl_private,
> &mapped_address, &mapped_size, LOG);
> ..
> }
>
> Here, don't we need to use dsm_control_* variables instead of local
> variable mapped_* variables?

I was a little fuzzy on when exactly
dsm_cleanup_using_control_segment() and dsm_postmaster_shutdown() run,
but after some more testing I think I have this straight now.  You can
test by setting dsm_control->magic to 42 in a debugger and trying
three cases:

1.  Happy shutdown: dsm_postmaster_shutdown() complains on shutdown.
2.  kill -9 a non-postmaster process: dsm_postmaster_shutdown()
complains during auto-restart.
3.  kill -9 the postmaster, manually start up again:
dsm_cleanup_using_control_segment() runs.  It ignores the old segment
quietly if it doesn't pass the sanity test.

So to answer your question: no, dsm_cleanup_using_control_segment() is
case 3.  This entirely new postmaster process has never had the
segment mapped in, so the dsm_control_* variables are not relevant
here.

Hmm but if you're running N other independent clusters on the same
machine that started up after this cluster crashed in case 3, I think
there is an N-in-four-billion chance that the segment with that ID now
belongs to another cluster and happens to be its DSM control segment,
and therefore passes the magic-number sanity test, and then we'll nuke
it and all the segments it references.  Am I missing something?  Could
we fix that by putting some kind of cluster ID in the control segment?

So on second thoughts, I think I should remove the
dsm_cleanup_using_control_segment() change from the patch, because it
widens the risk of case 3 nuking someone else's stuff to include even
segments that don't pass the sanity test.  That's not good.  Hunk
removed.

> ... Do we need to reset the dsm_control_*
> variables in dsm_cleanup_for_mmap?

I don't think so -- that unlinks the files in file-backed DSM, but we
also call the other code in that case.

> @@ -336,13 +333,14 @@ dsm_postmaster_shutdown(int code, Datum arg)
>   * stray shared memory segments, but there's not much we can do about that
>   * if the metadata is gone.
>   */
> - nitems = dsm_control->nitems;
>   if (!dsm_control_segment_sane(dsm_control, dsm_control_mapped_size))
>   {
>   ereport(LOG,
>   (errmsg("dynamic shared memory control segment is corrupt")));
> - return;
> + nitems = 0;
>
> The comment above this code says:
> /*
> * If some other backend exited uncleanly, it might have corrupted the
> * control segment while it was dying.  In that case, we warn and ignore
> * the contents of the control segment.  This may end up leaving behind
> * stray shared memory segments, but there's not much we can do about that
> * if the metadata is gone.
> */
>
> Don't we need to adjust this comment, if we want to go with what you
> have proposed?

The comment seems correct: we ignore the *contents* of the control
segment.  But to be clearer I'll add a sentence to say that we still
attempt to destroy the control segment itself.

>  BTW, it doesn't appear insane to me that if the
> control segment got corrupted, then we leave it as it is rather than
> trying to destroy it.  I am not sure, but if it is corrupted, then is
> it guaranteed that destroy will always succeed?

You're right that it might fail.  If the shm_unlink() fails it doesn't
matter, we just produce LOG, but if the unmap() fails we still
consider the segment to be mapped (because it is!), so ... hmm, we'd
better forget about it if we plan to continue running, by clearing
those variables explicitly.  That's actually another pre-existing way
for the assertion to fail in current master.  (Realistically, that
unmap() could only fail if our global variables are trashed so we try
to unmap() a junk address; AFAIK our model here is that the postmaster
is sane, even if everything else is insane, so that's sort of outside
the model.  But this new patch addresses it anyway.)

Thanks for the review!

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Fix-thinko-in-handling-of-corrupted-DSM-control-a-v2.patch
Description: Binary data


lowering pg_regress privileges on Windows

2018-10-18 Thread Andrew Dunstan


From the "scratch a long running itch" department.


The attached ridiculously tiny patch solves the problem whereby while we 
can run Postgres on Windows safely from an Administrator account, we 
can't run the regression tests from the same account, since it fails 
on the tablespace test, the tablespace directory having been set up 
without first having lowered privileges. The solution is to lower 
pg_regress' privileges in the same way that we do with other binaries. 
This is useful in setups like Appveyor where running under any other 
account is ... difficult. For the cfbot Thomas has had to make the 
script hack the schedule file to omit the tablespace test. This would 
make that redundant.



I propose to backpatch this. It's close enough to a bug and the risk is 
almost infinitely small.



cheers


andrew


--
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 6890678..3248603 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2081,6 +2081,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 	progname = get_progname(argv[0]);
 	set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("pg_regress"));
 
+	get_restricted_token(progname);
+
 	atexit(stop_postmaster);
 
 #ifndef HAVE_UNIX_SOCKETS


Re: Removing unneeded self joins

2018-10-18 Thread Alexander Kuzmenkov

Here is a version that compiles.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 3bb91c9..62d46a1 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -166,10 +166,13 @@ _equalVar(const Var *a, const Var *b)
 	COMPARE_SCALAR_FIELD(vartypmod);
 	COMPARE_SCALAR_FIELD(varcollid);
 	COMPARE_SCALAR_FIELD(varlevelsup);
-	COMPARE_SCALAR_FIELD(varnoold);
-	COMPARE_SCALAR_FIELD(varoattno);
 	COMPARE_LOCATION_FIELD(location);
 
+	/*
+	 * varnoold/varoattno are used only for debugging and may differ even
+	 * when the variables are logically the same.
+	 */
+
 	return true;
 }
 
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index f295558..fb07ca0 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -2964,7 +2964,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
  * relation_has_unique_index_for
  *	  Determine whether the relation provably has at most one row satisfying
  *	  a set of equality conditions, because the conditions constrain all
- *	  columns of some unique index.
+ *	  columns of some unique index. If index_info is not null, it is set to
+ *	  point to a new UniqueIndexInfo containing the index and conditions.
  *
  * The conditions can be represented in either or both of two ways:
  * 1. A list of RestrictInfo nodes, where the caller has already determined
@@ -2985,7 +2986,8 @@ ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
 bool
 relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 			  List *restrictlist,
-			  List *exprlist, List *oprlist)
+			  List *exprlist, List *oprlist,
+			  UniqueIndexInfo **index_info)
 {
 	ListCell   *ic;
 
@@ -3041,6 +3043,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexOptInfo *ind = (IndexOptInfo *) lfirst(ic);
 		int			c;
+		List *matched_restrictlist = NIL;
 
 		/*
 		 * If the index is not unique, or not immediately enforced, or if it's
@@ -3089,6 +3092,7 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 if (match_index_to_operand(rexpr, c, ind))
 {
 	matched = true; /* column is unique */
+	matched_restrictlist = lappend(matched_restrictlist, rinfo);
 	break;
 }
 			}
@@ -3131,7 +3135,22 @@ relation_has_unique_index_for(PlannerInfo *root, RelOptInfo *rel,
 
 		/* Matched all columns of this index? */
 		if (c == ind->ncolumns)
+		{
+			if (index_info != NULL)
+			{
+/* This may be called in GEQO memory context. */
+MemoryContext oldContext = MemoryContextSwitchTo(root->planner_cxt);
+*index_info = palloc(sizeof(UniqueIndexInfo));
+(*index_info)->index = ind;
+(*index_info)->clauses = list_copy(matched_restrictlist);
+MemoryContextSwitchTo(oldContext);
+			}
+			if (matched_restrictlist)
+list_free(matched_restrictlist);
 			return true;
+		}
+		if (matched_restrictlist)
+			list_free(matched_restrictlist);
 	}
 
 	return false;
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 642f951..2edcf22 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -176,7 +176,8 @@ add_paths_to_joinrel(PlannerInfo *root,
 	innerrel,
 	JOIN_INNER,
 	restrictlist,
-	false);
+	false,
+	NULL /*index_info*/);
 			break;
 		default:
 			extra.inner_unique = innerrel_is_unique(root,
@@ -185,7 +186,8 @@ add_paths_to_joinrel(PlannerInfo *root,
 	innerrel,
 	jointype,
 	restrictlist,
-	false);
+	false,
+	NULL /*index_info*/);
 			break;
 	}
 
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 0e73f9c..2bb3a10 100644
--- a/src/backend/optimizer/plan/analyzejoins.c
+++ b/src/backend/optimizer/plan/analyzejoins.c
@@ -22,12 +22,15 @@
  */
 #include "postgres.h"
 
+#include "catalog/pg_class.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/joininfo.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/planmain.h"
+#include "optimizer/predtest.h"
+#include "optimizer/restrictinfo.h"
 #include "optimizer/tlist.h"
 #include "optimizer/var.h"
 #include "utils/lsyscache.h"
@@ -39,14 +42,15 @@ static void remove_rel_from_query(PlannerInfo *root, int relid,
 static List *remove_rel_from_joinlist(List *joinlist, int relid, int *nremoved);
 static bool rel_supports_distinctness(PlannerInfo *root, RelOptInfo *rel);
 static bool rel_is_distinct_for(PlannerInfo *root, RelOptInfo *rel,
-	List *clause_list);
+	List *clause_list, UniqueIndexInfo **info);
 static Oid	distinct_col_search(int colno, List *colnos, List *opids);
 stat

Possibly redundant context switch in postgres_fdw

2018-10-18 Thread Ildar Musin
Hi hackers,

ISTM that the context switch in `create_cursor()`:

if (numParams > 0)
{
MemoryContext oldcontext;
oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
process_query_params(econtext,
fsstate->param_flinfo,
fsstate->param_exprs,
values);
MemoryContextSwitchTo(oldcontext);
}

is redundant, since we should already be in the `ecxt_per_tuple_memory`
context according to `ForeignNext()`. Am I missing some hidden purpose?
If not, here is a patch that removes it.

Regards,
Ildar Musin
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fd20aa96aa..65962ea657 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -3083,16 +3083,10 @@ create_cursor(ForeignScanState *node)
 	 */
 	if (numParams > 0)
 	{
-		MemoryContext oldcontext;
-
-		oldcontext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);
-
 		process_query_params(econtext,
 			 fsstate->param_flinfo,
 			 fsstate->param_exprs,
 			 values);
-
-		MemoryContextSwitchTo(oldcontext);
 	}
 
 	/* Construct the DECLARE CURSOR command */


Re: PG vs macOS Mojave

2018-10-18 Thread Tom Lane
I wrote:
> Jakob Egger  writes:
>> I would assume that clang sets -isysroot automatically, but I have no idea 
>> why that didn't work for you previously.

> [ experiments further ... ]  It looks like clang does default to assuming
> -isysroot with the correct sysroot for its Xcode version.  The missing bit
> of info is that -I /System/... is taken as an absolute path that's not
> affected by the sysroot.  You have to write -iwithsysroot /System/...
> to have something that is taken relative to the sysroot.

> So as far as pltcl is concerned, we don't really need the configure
> hack to insert PG_SYSROOT into TCL_INCLUDE_SPEC; we could leave Apple's
> value alone and it should work.  (Not that that gets pltcl out of
> dependency on the sysroot entirely, because AFAICS there's still no way
> to find tclConfig.sh without explicitly accounting for sysroot.  And
> I'm not exactly convinced that using explicit sysroot to do that and
> then leaving it out later is a hot idea.)

> However, on the Perl side, we'd have to change -I to -iwithsysroot if
> we wanted to avoid explicitly mentioning the sysroot path, and that
> seems way more invasive than it's worth.

> I'm inclined to leave things as they stand now; it makes for reasonably
> consistent build behavior between pltcl and plperl.

After sleeping on it, I'm not really happy with that answer.  I think
we'd be better advised to remove the hack on TCL_INCLUDE_SPEC, and
change what we've done for Perl to look more like what Tcl does.
That is, instead of writing "-I$(perl_includedir)/CORE" in Makefiles,
we should write "$(perl_includespec)" so that the switch spelling
is included in the variable.  Then we'd make configure set that
variable to either "-I$perl_archlibexp/CORE" or
"-iwithsysroot $perl_archlibexp/CORE" depending on what it finds
to be needed.  In this way, while we can't avoid knowing the sysroot
path explicitly during configure, we aren't wiring it into anything
that configure emits.

It would be a good idea to change this now, before the results of commit
5e2217131 spread too far in the wild, so that plperl-related extensions
only have to deal with one Makefile change not two.

(BTW, plpython is already doing it like this, meaning that it's plperl
that's the odd man out, not pltcl.  It appears that Apple have not yet
sysroot-ified their Python installation, but it is not hard to see where
that train is going.)

regards, tom lane



Re: Implementation of Flashback Query

2018-10-18 Thread Yang Jie

Hello,

I'll take a look at some of the features that zheap's undo provides; I'm
considering a flashback implemented on top of undo.  Flashback Drop is
more complex and, at the moment, this doesn't do anything about it.

Yang Jie
Highgo Software  http://www.highgo.com/

On 10/18/2018 08:16, Tsunakawa, Takayuki wrote:
> From: Yang Jie [mailto:yang...@highgo.com]
> > Delayed cleanup, resulting in performance degradation, what are the
> > solutions recommended?  What do you suggest for the flashback feature?
> > Although postgres has excellent backup and restore capabilities, have
> > you considered adding flashbacks?
>
> Great proposal, I appreciate it.
>
> Regarding the bloat, check the recent mail threads titled "undo xxx".  A
> new table storage format is being developed to use undo logs like Oracle
> and MySQL instead of generating dead tuples.
>
> How about the possibility of adding a feature like Flashback Drop to
> restore dropped and truncated tables?  I sometimes saw customers who
> mistakenly dropped or truncated tables but didn't have any backup.
>
> Regards
> Takayuki Tsunakawa




Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
On Wed, Oct 3, 2018 at 4:39 PM Peter Geoghegan  wrote:
> I did find a pretty clear regression, though only with writes to
> unique indexes. Attached is v6, which fixes the issue. More on that
> below.

I've been benchmarking my patch using oltpbench's TPC-C benchmark
these past few weeks, which has been very frustrating -- the picture
is very mixed. I'm testing a patch that has evolved from v6, but isn't
too different.

In one way, the patch does exactly what it's supposed to do when these
benchmarks are run: it leaves indexes *significantly* smaller than the
master branch will on the same (rate-limited) workload, without
affecting the size of tables in any noticeable way. The numbers that I
got from my much earlier synthetic single client benchmark mostly hold
up. For example, the stock table's primary key is about 35% smaller,
and the order line index is only about 20% smaller relative to master,
which isn't quite as good as in the synthetic case, but I'll take it
(this is all because of the
v6-0003-Add-split-at-new-tuple-page-split-optimization.patch stuff).
However, despite significant effort, and despite the fact that the
index shrinking is reliable, I cannot yet consistently show an
improvement in either transaction throughput or transaction latency.

I can show a nice improvement in latency on a slightly-rate-limited
TPC-C workload when backend_flush_after=0 (something like a 40%
reduction on average), but that doesn't hold up when oltpbench isn't
rate-limited and/or has backend_flush_after set. Usually, there is a
1% - 2% regression, despite the big improvements in index size, and
despite the big reduction in the amount of buffers that backends must
write out themselves.

The obvious explanation is that throughput is decreased due to our
doing extra work (truncation) while under an exclusive buffer lock.
However, I've worked hard on that, and, as I said, I can sometimes
observe a nice improvement in latency. This makes me doubt the obvious
explanation. My working theory is that this has something to do with
shared_buffers eviction. Maybe we're making worse decisions about
which buffer to evict, or maybe the scalability of eviction is hurt.
Perhaps both.

You can download results from a recent benchmark to get some sense of
this. It includes latency and throughput graphs, plus detailed
statistics collector stats:

https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I would welcome any theories as to what could be the problem here. I
think that this is fixable, since the picture for the patch is very
positive, provided you only focus on bgwriter/checkpoint activity and
on-disk sizes. It seems likely that there is a very specific gap in my
understanding of how the patch affects buffer cleaning.

--
Peter Geoghegan



removing unnecessary get_att*() lsyscache functions

2018-10-18 Thread Peter Eisentraut
I noticed that get_attidentity() isn't really necessary because the
information can be obtained from an existing tuple descriptor in each case.

Also, get_atttypmod() hasn't been used since 2004.

I propose the attached patches to remove these two functions.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 2003c5b4fdad1cd444ddb1fad2bdc9ab09db2fd2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 18 Oct 2018 19:27:18 +0200
Subject: [PATCH 1/2] Remove get_attidentity()

All existing uses can get this information more easily from the
relation descriptor, so the detour through the syscache is not
necessary.
---
 src/backend/commands/tablecmds.c|  4 ++--
 src/backend/parser/parse_utilcmd.c  |  3 ++-
 src/backend/utils/cache/lsyscache.c | 32 -
 src/include/utils/lsyscache.h   |  1 -
 4 files changed, 4 insertions(+), 36 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 3e112b4ef4..7ff25a9eff 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -5945,7 +5945,7 @@ ATExecDropNotNull(Relation rel, const char *colName, LOCKMODE lockmode)
 				 errmsg("cannot alter system column \"%s\"",
 						colName)));
 
-	if (get_attidentity(RelationGetRelid(rel), attnum))
+	if (TupleDescAttr(RelationGetDescr(rel), attnum - 1)->attidentity)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("column \"%s\" of relation \"%s\" is an identity column",
@@ -6148,7 +6148,7 @@ ATExecColumnDefault(Relation rel, const char *colName,
 				 errmsg("cannot alter system column \"%s\"",
 						colName)));
 
-	if (get_attidentity(RelationGetRelid(rel), attnum))
+	if (TupleDescAttr(RelationGetDescr(rel), attnum - 1)->attidentity)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("column \"%s\" of relation \"%s\" is an identity column",
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index d8387d4356..212eedfa87 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -3067,7 +3067,8 @@ transformAlterTableStmt(Oid relid, AlterTableStmt *stmt,
 					 * if attribute not found, something will error about it
 					 * later
 					 */
-					if (attnum != InvalidAttrNumber && get_attidentity(relid, attnum))
+					if (attnum != InvalidAttrNumber &&
+						TupleDescAttr(RelationGetDescr(rel), attnum - 1)->attidentity)
 					{
 						Oid			seq_relid = getOwnedSequence(relid, attnum);
 						Oid			typeOid = typenameTypeId(pstate, def->typeName);
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 12b2532d95..c5fcaa9542 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -821,38 +821,6 @@ get_attnum(Oid relid, const char *attname)
 	return InvalidAttrNumber;
 }
 
-/*
- * get_attidentity
- *
- *		Given the relation id and the attribute name,
- *		return the "attidentity" field from the attribute relation.
- *
- *		Returns '\0' if not found.
- *
- *		Since no identity is represented by '\0', this can also be used as a
- *		Boolean test.
- */
-char
-get_attidentity(Oid relid, AttrNumber attnum)
-{
-	HeapTuple	tp;
-
-	tp = SearchSysCache2(ATTNUM,
-						 ObjectIdGetDatum(relid),
-						 Int16GetDatum(attnum));
-	if (HeapTupleIsValid(tp))
-	{
-		Form_pg_attribute att_tup = (Form_pg_attribute) GETSTRUCT(tp);
-		char		result;
-
-		result = att_tup->attidentity;
-		ReleaseSysCache(tp);
-		return result;
-	}
-	else
-		return '\0';
-}
-
 /*
  * get_atttype
  *
diff --git a/src/include/utils/lsyscache.h b/src/include/utils/lsyscache.h
index 8c57de77c0..151081c693 100644
--- a/src/include/utils/lsyscache.h
+++ b/src/include/utils/lsyscache.h
@@ -85,7 +85,6 @@ extern Oid	get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype,
 							  int16 procnum);
 extern char *get_attname(Oid relid, AttrNumber attnum, bo

Re: Large writable variables

2018-10-18 Thread Peter Eisentraut
On 17/10/2018 23:51, Andres Freund wrote:
>> __builtin_types_compatible_p(const char *, char *) returns false (0) for me.
> 
> Right, that's why I added a const, inside the macro,  to the type
> specified in the unconstify argument. So typeof() yields a const char *,
> and the return type is specified as char *, and adding a const in the
> argument also yields a const char *.

Yeah, that works.  The C++-inspired version also allowed casting from
not-const to const, which we don't really need.

I'd perhaps change the signature

#define unconstify(underlying_type, var)

because the "var" doesn't actually have to be a variable.

Attached is my previous patch adapted to your macro.

I'm tempted to get rid of the stuff in dfmgr.c and just let the "older
platforms" get the warning.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 54693f7201aa4768509c5d6939961b50a9021bcf Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 18 Oct 2018 22:11:44 +0200
Subject: [PATCH v2] Apply unconstify() in more places

---
 src/backend/utils/adt/json.c| 4 ++--
 src/backend/utils/adt/misc.c| 2 +-
 src/backend/utils/adt/varlena.c | 4 ++--
 src/backend/utils/fmgr/dfmgr.c  | 4 ++--
 src/port/win32setlocale.c   | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/utils/adt/json.c b/src/backend/utils/adt/json.c
index 6f0fe94d63..f47a498228 100644
--- a/src/backend/utils/adt/json.c
+++ b/src/backend/utils/adt/json.c
@@ -207,12 +207,12 @@ IsValidJsonNumber(const char *str, int len)
 */
if (*str == '-')
{
-   dummy_lex.input = (char *) str + 1;
+   dummy_lex.input = unconstify(char *, str) + 1;
dummy_lex.input_length = len - 1;
}
else
{
-   dummy_lex.input = (char *) str;
+   dummy_lex.input = unconstify(char *, str);
dummy_lex.input_length = len;
}
 
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
index 6ea3679835..309eb2935c 100644
--- a/src/backend/utils/adt/misc.c
+++ b/src/backend/utils/adt/misc.c
@@ -423,7 +423,7 @@ pg_get_keywords(PG_FUNCTION_ARGS)
HeapTuple   tuple;
 
 		/* cast-away-const is ugly but alternatives aren't much better */
-		values[0] = (char *) ScanKeywords[funcctx->call_cntr].name;
+		values[0] = unconstify(char *, ScanKeywords[funcctx->call_cntr].name);
 
switch (ScanKeywords[funcctx->call_cntr].category)
{
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index a5e812d026..0fd3b15748 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -182,7 +182,7 @@ char *
 text_to_cstring(const text *t)
 {
/* must cast away the const, unfortunately */
-   text   *tunpacked = pg_detoast_datum_packed((struct varlena *) t);
+   text   *tunpacked = pg_detoast_datum_packed(unconstify(text *, t));
int len = VARSIZE_ANY_EXHDR(tunpacked);
char   *result;
 
@@ -213,7 +213,7 @@ void
 text_to_cstring_buffer(const text *src, char *dst, size_t dst_len)
 {
/* must cast away the const, unfortunately */
-	text	   *srcunpacked = pg_detoast_datum_packed((struct varlena *) src);
+	text	   *srcunpacked = pg_detoast_datum_packed(unconstify(text *, src));
size_t  src_len = VARSIZE_ANY_EXHDR(srcunpacked);
 
if (dst_len > 0)
diff --git a/src/backend/utils/fmgr/dfmgr.c b/src/backend/utils/fmgr/dfmgr.c
index 4a5cc7cfc7..d200b960d2 100644
--- a/src/backend/utils/fmgr/dfmgr.c
+++ b/src/backend/utils/fmgr/dfmgr.c
@@ -126,7 +126,7 @@ load_external_function(const char *filename, const char *funcname,
 * should declare its second argument as "const char *", but older
 * platforms might not, so for the time being we just cast away const.
 */
-   retval = (PGFunction) dlsym(lib_handle, (char *) funcname);
+   retval = (PGFunction) dlsym(lib_handle, unconstify(char *, funcname));
 
if (retval == NULL && signalNotFound)
ereport(ERROR,
@@ -175,7 +175,7 @@ PGFunction
 lookup_external_function(void *filehandle, const char *funcname)
 {
/* as above, cast away const for the time being */
-   return (PGFunction) dlsym(filehandle, (char *) funcname);
+   return (PGFunction) dlsym(filehandle, unconstify(char *, funcname));
 }
 
 
diff --git a/src/port/win32setlocale.c b/src/port/win32setlocale.c
index 0597c2afca..a8cf170dd1 100644
--- a/src/port/win32setlocale.c
+++ b/src/port/win32setlocale.c
@@ -183,7 +183,7 @@ pgwin32_setlocale(int category, const char *locale)
 * forbidden to modify, so casting away the "const" is innocuous.
 */
if (result)
-   result = (char *) map_locale(locale_map_result, result

Re: Large writable variables

2018-10-18 Thread Peter Eisentraut
On 18/10/2018 22:17, Peter Eisentraut wrote:
> On 17/10/2018 23:51, Andres Freund wrote:
>>> __builtin_types_compatible_p(const char *, char *) returns false (0) for me.
>>
>> Right, that's why I added a const, inside the macro,  to the type
>> specified in the unconstify argument. So typeof() yields a const char *,
>> and the return type is specified as char *, and adding a const in the
>> argument also yields a const char *.
> 
> Yeah, that works.  The C++-inspired version also allowed casting from
> not-const to const, which we don't really need.

> Attached is my previous patch adapted to your macro.

Oh, I forgot to mention, your version doesn't work for this code in
pqexpbuffer.c:

str->data = (char *) oom_buffer;

That's probably not a big deal though.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Large writable variables

2018-10-18 Thread Andres Freund
On 2018-10-18 22:20:29 +0200, Peter Eisentraut wrote:
> On 18/10/2018 22:17, Peter Eisentraut wrote:
> > On 17/10/2018 23:51, Andres Freund wrote:
> >>> __builtin_types_compatible_p(const char *, char *) returns false (0) for 
> >>> me.
> >>
> >> Right, that's why I added a const, inside the macro,  to the type
> >> specified in the unconstify argument. So typeof() yields a const char *,
> >> and the return type is specified as char *, and adding a const in the
> >> argument also yields a const char *.
> >
> > Yeah, that works.  The C++-inspired version also allowed casting from
> > not-const to const, which we don't really need.
>
> > Attached is my previous patch adapted to your macro.
>
> Oh, I forgot to mention, your version doesn't work for this code in
> pqexpbuffer.c:
>
> str->data = (char *) oom_buffer;

That kind of seems correct ;). Can easily be done via unconstify(char *,
&oom_buffer[0]), if necessary.

> That's probably not a big deal though.

Greetings,

Andres Freund



Re: Large writable variables

2018-10-18 Thread Andres Freund
On 2018-10-18 22:17:55 +0200, Peter Eisentraut wrote:
> I'd perhaps change the signature
> 
> #define unconstify(underlying_type, var)
> 
> because the "var" doesn't actually have to be a variable.

Hm, so expr, or what would you use?

> Attached is my previous patch adapted to your macro.
> 
> I'm tempted to get rid of the stuff in dfmgr.c and just let the "older
> platforms" get the warning.

Yea, that works for me.


Are you planning to apply this?

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Andres Freund
Hi,

On 2018-10-18 12:54:27 -0700, Peter Geoghegan wrote:
> I can show a nice improvement in latency on a slightly-rate-limited
> TPC-C workload when backend_flush_after=0 (something like a 40%
> reduction on average), but that doesn't hold up when oltpbench isn't
> rate-limited and/or has backend_flush_after set. Usually, there is a
> 1% - 2% regression, despite the big improvements in index size, and
> despite the big reduction in the amount of buffers that backends must
> write out themselves.

What kind of backend_flush_after values were you trying?
backend_flush_after=0 obviously is the default, so I'm not clear on
that.  How large is the database here, and how high is shared_buffers?


> The obvious explanation is that throughput is decreased due to our
> doing extra work (truncation) while under an exclusive buffer lock.
> However, I've worked hard on that, and, as I said, I can sometimes
> observe a nice improvement in latency. This makes me doubt the obvious
> explanation. My working theory is that this has something to do with
> shared_buffers eviction. Maybe we're making worse decisions about
> which buffer to evict, or maybe the scalability of eviction is hurt.
> Perhaps both.

Is it possible that there's new / prolonged cases where a buffer is read
from disk after the patch? Because that might require doing *write* IO
when evicting the previous contents of the victim buffer, and obviously
that can take longer if you're running with backend_flush_after > 0.

I wonder if it'd make sense to hack up a patch that logs when evicting a
buffer while already holding another lwlock. That shouldn't be too hard.


> You can download results from a recent benchmark to get some sense of
> this. It includes latency and throughput graphs, plus detailed
> statistics collector stats:
> 
> https://drive.google.com/file/d/1oIjJ3YpSPiyRV_KF6cAfAi4gSm7JdPK1/view?usp=sharing

I'm unclear which runs are what here? I assume "public" is your
patchset, and master is master? Do you reset the stats in between runs?

Greetings,

Andres Freund



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
Shared_buffers is 10gb iirc. The server has 32gb of memory. Yes, 'public'
is the patch case. Sorry for not mentioning it initially.

--
Peter Geoghegan
(Sent from my phone)


Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-10-18 Thread Peter Geoghegan
On Thu, Oct 18, 2018 at 1:44 PM Andres Freund  wrote:
> What kind of backend_flush_after values were you trying?
> backend_flush_after=0 obviously is the default, so I'm not clear on
> that.  How large is the database here, and how high is shared_buffers?

I *was* trying backend_flush_after=512kB, but it's
backend_flush_after=0 in the benchmark I posted. See the
"postgres*settings" files.

On the master branch, things looked like this after the last run:

pg@tpcc_oltpbench[15547]=# \dt+
  List of relations
 Schema │Name│ Type  │ Owner │   Size   │ Description
┼┼───┼───┼──┼─
 public │ customer   │ table │ pg│ 4757 MB  │
 public │ district   │ table │ pg│ 5240 kB  │
 public │ history│ table │ pg│ 1442 MB  │
 public │ item   │ table │ pg│ 10192 kB │
 public │ new_order  │ table │ pg│ 140 MB   │
 public │ oorder │ table │ pg│ 1185 MB  │
 public │ order_line │ table │ pg│ 19 GB│
 public │ stock  │ table │ pg│ 9008 MB  │
 public │ warehouse  │ table │ pg│ 4216 kB  │
(9 rows)

pg@tpcc_oltpbench[15547]=# \di+
 List of relations
 Schema │ Name │ Type  │ Owner │
Table│  Size   │ Description
┼──┼───┼───┼┼─┼─
 public │ customer_pkey│ index │ pg│
customer   │ 367 MB  │
 public │ district_pkey│ index │ pg│
district   │ 600 kB  │
 public │ idx_customer_name│ index │ pg│
customer   │ 564 MB  │
 public │ idx_order│ index │ pg│
oorder │ 715 MB  │
 public │ item_pkey│ index │ pg│ item
 │ 2208 kB │
 public │ new_order_pkey   │ index │ pg│
new_order  │ 188 MB  │
 public │ oorder_o_w_id_o_d_id_o_c_id_o_id_key │ index │ pg│
oorder │ 715 MB  │
 public │ oorder_pkey  │ index │ pg│
oorder │ 958 MB  │
 public │ order_line_pkey  │ index │ pg│
order_line │ 9624 MB │
 public │ stock_pkey   │ index │ pg│ stock
 │ 904 MB  │
 public │ warehouse_pkey   │ index │ pg│
warehouse  │ 56 kB   │
(11 rows)

> Is it possible that there's new / prolonged cases where a buffer is read
> from disk after the patch? Because that might require doing *write* IO
> when evicting the previous contents of the victim buffer, and obviously
> that can take longer if you're running with backend_flush_after > 0.

Yes, I suppose that that's possible, because the buffer
popularity/usage_count will be affected in ways that cannot easily be
predicted. However, I'm not running with "backend_flush_after > 0"
here -- that was before.

> I wonder if it'd make sense to hack up a patch that logs when evicting a
> buffer while already holding another lwlock. That shouldn't be too hard.

I'll look into this.

Thanks
-- 
Peter Geoghegan


Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-18 Thread Peter Eisentraut
On 11/10/2018 16:50, Peter Eisentraut wrote:
> On 10/10/2018 21:50, Bruce Momjian wrote:
>>> I see.  Peter is proposing to have a fourth mode, essentially
>>> --transfer-mode=clone-or-copy.
>>
>> Uh, if you use --link, and the two data directories are on different
>> file systems, it fails.  No one has ever asked for link-or-copy, so why
>> are we considering clone-or-copy?  Are we going to need
>> link-or-clone-or-copy too?  I do realize that clone and copy have
>> non-destructive behavior on the old cluster once started, so it does
>> make some sense to merge them, unlike link.
> 
> I'm OK to get rid of the clone-or-copy mode.  I can see the confusion.

New patch that removes all the various reflink modes and adds a new
option --clone that works similar to --link.  I think it's much cleaner now.
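For readers unfamiliar with the underlying primitive: on Linux the cloning is done with the FICLONE ioctl (and copyfile() on macOS). A rough standalone sketch of what a clone helper looks like — not the patch's actual code, just an illustration of the mechanism:

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifdef __linux__
#include <linux/fs.h>			/* defines FICLONE on kernel 4.5+ headers */
#endif

/*
 * Clone src into a newly created dst using the filesystem's reflink
 * support.  Returns 0 on success, -1 with errno set on failure, e.g.
 * EOPNOTSUPP on filesystems without reflink support.
 */
static int
clone_file(const char *src, const char *dst)
{
	int			src_fd;
	int			dst_fd;
	int			rc = -1;

	if ((src_fd = open(src, O_RDONLY)) < 0)
		return -1;
	if ((dst_fd = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
	{
		close(src_fd);
		return -1;
	}
#if defined(FICLONE)
	rc = ioctl(dst_fd, FICLONE, src_fd);
#else
	errno = ENOTSUP;			/* no cloning primitive on this platform */
#endif
	close(src_fd);
	close(dst_fd);
	return rc;
}
```

Unlike --link, a failed or interrupted clone leaves the source file untouched, which is what gives the "leaves the old cluster untouched" property described in the documentation hunk.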

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 8b7285e61e425fad4d07a011efe2ccd7fd280d54 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Thu, 18 Oct 2018 23:09:59 +0200
Subject: [PATCH v6] pg_upgrade: Allow use of file cloning

Add another transfer mode --clone to pg_upgrade (besides the existing
--link and the default copy), using special file cloning calls.  This
makes the file transfer faster and more space efficient, achieving
speed similar to --link mode without the associated drawbacks.

On Linux, file cloning is supported on Btrfs and XFS (if formatted with
reflink support).  On macOS, file cloning is supported on APFS.
---
 configure|  2 +-
 configure.in |  1 +
 doc/src/sgml/ref/pgupgrade.sgml  | 37 ++---
 src/bin/pg_upgrade/check.c   | 13 -
 src/bin/pg_upgrade/file.c| 90 
 src/bin/pg_upgrade/option.c  |  8 +++
 src/bin/pg_upgrade/pg_upgrade.h  |  6 ++-
 src/bin/pg_upgrade/relfilenode.c | 44 ++--
 src/include/pg_config.h.in   |  3 ++
 9 files changed, 179 insertions(+), 25 deletions(-)

diff --git a/configure b/configure
index 43ae8c869d..5a7aa02b27 100755
--- a/configure
+++ b/configure
@@ -15130,7 +15130,7 @@ fi
 LIBS_including_readline="$LIBS"
 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
 
-for ac_func in cbrt clock_gettime fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
+for ac_func in cbrt clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink setproctitle setproctitle_fast setsid shm_open strchrnul symlink sync_file_range utime utimes wcstombs_l
 do :
   as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
 ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
diff --git a/configure.in b/configure.in
index 519ecd5e1e..7f22bcee75 100644
--- a/configure.in
+++ b/configure.in
@@ -1602,6 +1602,7 @@ LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 
's/-lreadline//g'`
 AC_CHECK_FUNCS(m4_normalize([
cbrt
clock_gettime
+   copyfile
fdatasync
getifaddrs
getpeerucred
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index d51146d641..e616ea9dbc 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -182,6 +182,25 @@ Options
   display version information, then exit
  
 
+ 
+  --clone
+  
+   
+Use efficient file cloning (also known as reflinks)
+instead of copying files to the new cluster.  This can result in
+near-instantaneous copying of the data files, giving the speed
+advantages of -k/--link while
+leaving the old cluster untouched.
+   
+
+   
+At present, reflinks are supported on Linux (kernel 4.5 or later) with
+Btrfs and XFS (on file systems created with reflink support, which is
+not the default for XFS at this writing), and on macOS with APFS.
+   
+  
+ 
+
  
   -?
   --help
@@ -340,7 +359,7 @@ Run pg_upgrade
  Always run the pg_upgrade binary of the new 
server, not the old one.
  pg_upgrade requires the specification of the 
old and new cluster's
  data and executable (bin) directories. You can also 
specify
- user and port values, and whether you want the data files linked
+ user and port values, and whether you want the data files linked or cloned
  instead of the default copy behavior.
 
 
@@ -351,8 +370,12 @@ Run pg_upgrade
  once you start the new cluster after the upgrade.  Link mode also
  requires that the old and new cluster data directories be in the
  same file system.  (Tablespaces and pg_wal can be on
- different file systems.)  See pg_upgrade --help for a 
full
- list of options.
+ different file sy

Re: Postgres, fsync, and OSs (specifically linux)

2018-10-18 Thread Thomas Munro
Hello hackers,

Let's try to get this issue resolved.  Here is my position on the
course of action we should take in back-branches:

1.  I am -1 on back-patching the fd-transfer code.  It's a significant
change, and even when sufficiently debugged (I don't think it's there
yet), we have no idea what will happen on all the kernels we support
under extreme workloads.  IMHO there is no way we can spring this on
users in a point release.

2.  I am +1 on back-patching Craig's PANIC-on-failure logic.  Doing
nothing is not an option I like.  I have some feedback and changes to
propose though; see attached.

Responses to a review from Robert:

On Thu, Jul 19, 2018 at 7:23 AM Robert Haas  wrote:
> 2. I don't like promote_ioerr_to_panic() very much, partly because the
> same pattern gets repeated over and over, and partly because it would
> be awkwardly-named if we discovered that another 2 or 3 errors needed
> similar handling (or some other variant handling).  I suggest instead
> having a function like report_critical_fsync_failure(char *path) that
> does something like this:
>
> int elevel = ERROR;
> if (errno == EIO)
> elevel = PANIC;
> ereport(elevel,
> (errcode_for_file_access(),
>  errmsg("could not fsync file \"%s\": %m", path);
>
> And similarly I'd add report_critical_close_failure.  In some cases,
> this would remove wording variations (e.g. in twophase.c) but I think
> that's fine, and maybe an improvement, as discussed on another recent
> thread.

I changed it to look like data_sync_elevel(ERROR) and made it treat
all errnos the same.  ENOSPC, EIO, EWOK, EIEIO, it makes no difference
to the level of faith I have that my data still exists.
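As a sketch of that shape (the escape-hatch flag and the numeric level values here are stand-ins mirroring elog.h's ordering, not necessarily what the patch ends up with):

```c
#include <stdbool.h>

/* Stand-in error levels; the real ones live in PostgreSQL's elog.h. */
#define ERROR	20
#define PANIC	22

/* Hypothetical escape hatch: when set, keep the old retry behavior. */
static bool data_sync_retry = false;

/*
 * Promote the caller's elevel to PANIC for data-sync failures, so a
 * failed fsync() forces crash recovery from WAL instead of a doomed
 * retry, regardless of which errno the kernel reported.
 */
static int
data_sync_elevel(int elevel)
{
	return data_sync_retry ? elevel : PANIC;
}
```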

> 3. slru.c calls pg_fsync() but isn't changed by the patch.  That looks wrong.

Fixed.

> 4. The comment changes in snapbuild.c interact with the TODO that
> immediately follows.  I think more adjustment is needed here.

I don't understand this.

> 5. It seems odd that you adjusted the comment for
> pg_fsync_no_writethrough() but not pg_fsync_writethrough() or
> pg_fsync().  Either pg_fsync_writethrough() doesn't have the same
> problem, in which case, awesome, but let's add a comment, or it does,
> in which case it should refer to the other one.  And I think
> pg_fsync() itself needs a comment saying that every caller must be
> careful to use promote_ioerr_to_panic() or
> report_critical_fsync_failure() or whatever we end up calling it
> unless the fsync is not critical for data integrity.

I removed these comments and many others; I don't see the point in
scattering descriptions of this problem and references to specific
versions of Linux and -hackers archive links all over the place.  I
added a comment in one place, and also added some user documentation
of the problem.

> 6. In md.c, there's a stray blank line added.  But more importantly,
> the code just above that looks like this:
>
>  if (!FILE_POSSIBLY_DELETED(errno) ||
>  failures > 0)
> -ereport(ERROR,
> +ereport(promote_ioerr_to_panic(ERROR),
>  (errcode_for_file_access(),
>   errmsg("could not fsync file \"%s\": %m",
>  path)));
>  else
>  ereport(DEBUG1,
>  (errcode_for_file_access(),
>   errmsg("could not fsync file \"%s\"
> but retrying: %m",
>  path)));
>
> I might be all wet here, but it seems like if we enter the bottom
> branch, we still need the promote-to-panic behavior.

That case is only reached if FILE_POSSIBLY_DELETED() is true on the first
time through the loop, and it detects an errno value not actually from
fsync().  It's from FileSync(), when it tries to reopen a virtual fd
and gets ENOENT, before calling fsync().  Code further down then
absorbs incoming requests before checking if that was expected,
closing a race.  The comments could make that clearer, admittedly.

> 7. The comment adjustment for SyncDataDirectory mentions an
> "important" fact about fsync behavior, but then doesn't seem to change
> any logic on that basis.  I think in general a number of these
> comments need a little more thought, but in this particular case, I
> think we also need to consider what the behavior should be (and the
> comment should reflect our considered judgement on that point, and the
> implementation should match).

I updated the comment.  I don't think this is too relevant to the
fsync() failure case, because we'll be rewriting all changes from the
WAL again during recovery; I think this function is mostly useful for
switching from fsync = off to fsync = on and restarting, not coping
with previous fsync() failures by retrying (which we know to be
useless anyway).  Someone could argue that if you restarted after
ch

Re: file cloning in pg_upgrade and CREATE DATABASE

2018-10-18 Thread Michael Paquier
On Thu, Oct 18, 2018 at 11:59:00PM +0200, Peter Eisentraut wrote:
> New patch that removes all the various reflink modes and adds a new
> option --clone that works similar to --link.  I think it's much cleaner now.

Thanks Peter for the new version.

+
+   {"clone", no_argument, NULL, 1},
+
{NULL, 0, NULL, 0}

Why newlines here?

@@ -293,6 +300,7 @@ usage(void)
printf(_("  -U, --username=NAME   cluster superuser (default 
\"%s\")\n"), os_info.user);
printf(_("  -v, --verbose enable verbose internal 
logging\n"));
printf(_("  -V, --version display version information, 
then exit\n"));
+   printf(_("  --clone   clone instead of copying 
files to new cluster\n"));
printf(_("  -?, --helpshow this help, then 
exit\n"));
printf(_("\n"

An idea for a one-letter option could be -n.  The patch can live
without one, though.

+ pg_fatal("error while cloning relation \"%s.%s\": could not open file 
\"%s\": %s\n",
+  schemaName, relName, src, strerror(errno));

The tail of those error messages "could not open file" and "could not
create file" are already available as translatable error messages.
Would it be better to split those messages in two for translators?  One
is generated with PG_VERBOSE to mention that cloning checks are
done, and the second one is fatal with the actual errors.

Those are all minor points, the patch looks good to me.
--
Michael


signature.asc
Description: PGP signature


Re: Function to promote standby servers

2018-10-18 Thread Michael Paquier
On Tue, Oct 16, 2018 at 09:49:20AM +0200, Laurenz Albe wrote:
> Only for hot standby - how do you want to execute a function on a standby
> server that doesn't allow connections?

In most deployments hot standby will be enabled, and wal_level uses the
same value for archive and hot_standby for some time now, so that does
not sound like a huge issue to me. 
--
Michael




Re: Function to promote standby servers

2018-10-18 Thread Michael Paquier
On Mon, Oct 15, 2018 at 04:16:02PM +0200, Laurenz Albe wrote:
> Michael Paquier wrote:
>> Let's remove this restriction at code level, and instead use
>> function-level ACLs so as it is possible to allow non-superusers to
>> trigger a promotion as well, say a dedicated role.  We could add a
>> system role for this purpose, but I am not sure that it is worth the
>> addition yet.
> 
> Agreed, done.

>> As of now, this function returns true if promotion has been triggered,
>> and false if not.  However it seems to me that we should have something
>> more consistent with "pg_ctl promote", so there are actually three
>> possible states:
>> 1) Node is in recovery, with promotion triggered.  pg_promote() returns
>> true in case of success in creating the trigger file.
>> 2) Node is in recovery, with promotion triggered.  pg_promote() returns
>> false in case of failure in creating the trigger file.
>> 3) Node is not in recovery, where I think a call to pg_promote should
>> trigger an error.  This is not handled in the patch.
> 
> So far I had returned "false" in the last case, but I am fine with
> throwing an error in that case.  Done.

Thanks, that looks correct to me.  I think that consistency with pg_ctl
is quite important, as this is a feature present for many releases now.

>> At the end this patch needs a bit more work.
> 
> Yes, it did.  Thanks for the thorough review!

+   /* wait for up to a minute for promotion */
+   for (i = 0; i < WAITS_PER_SECOND * WAIT_SECONDS; ++i)
+   {
+   if (!RecoveryInProgress())
+   PG_RETURN_BOOL(true);
+
+   pg_usleep(1000000L / WAITS_PER_SECOND);
+   }
I would recommend to avoid pg_usleep and instead use a WaitLatch() or
similar to generate a wait event.  The wait can then also be seen in
pg_stat_activity, which is useful for monitoring purposes.  Using
RecoveryInProgress is indeed doable, and that's more simple than what I
thought first.
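For illustration only, here is the polling shape of the quoted loop as a self-contained simulation — RecoveryInProgress() and pg_usleep() are replaced by stand-ins so it can run anywhere:

```c
#include <stdbool.h>

#define WAITS_PER_SECOND 10

/*
 * Simulate the quoted wait loop: poll for up to wait_seconds for recovery
 * to end.  recovery_ticks stands in for how many polls would still see
 * RecoveryInProgress() return true; *slept_usec accumulates the simulated
 * sleep time that pg_usleep() would have spent.
 */
static bool
wait_for_promote(int wait_seconds, int recovery_ticks, long *slept_usec)
{
	int			i;

	*slept_usec = 0;
	for (i = 0; i < WAITS_PER_SECOND * wait_seconds; i++)
	{
		if (recovery_ticks-- <= 0)		/* RecoveryInProgress() went false */
			return true;
		*slept_usec += 1000000L / WAITS_PER_SECOND;		/* pg_usleep() here */
	}
	return false;
}
```

A WaitLatch()-based version would replace the sleep with a latch wait carrying a wait-event name, which is what makes the wait visible in pg_stat_activity.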

Something I missed to mention in the previous review: the timeout should
be manually enforceable, with a default at 60s.

Is the function marked as strict?  Per the code it should be; I am not
able to test right now, though.

You are missing REVOKE EXECUTE ON FUNCTION pg_promote() in
system_views.sql, or any users could trigger a promotion, no?
--
Michael




Re: lowering pg_regress privileges on Windows

2018-10-18 Thread Michael Paquier
On Thu, Oct 18, 2018 at 08:31:11AM -0400, Andrew Dunstan wrote:
> The attached ridiculously tiny patch solves the problem whereby while we can
> run Postgres on Windows safely from an Administrator account, we can't
> run the regression tests from the same account, since it fails on the
> tablespace test, the tablespace directory having been set up without first
> having lowered privileges. The solution is to lower pg_regress' privileges
> in the same way that we do with other binaries. This is useful in setups
> like Appveyor where running under any other account is ... difficult. For
> the cfbot Thomas has had to make the script hack the schedule file to omit
> the tablespace test. This would make that redundant.
> 
> I propose to backpatch this. It's close enough to a bug and the risk is
> almost infinitely small.

+1.  get_restricted_token() refactoring has been done down to
REL9_5_STABLE.  With 9.4 and older you would need to copy again this
full routine into pg_regress.c, which is in my opinion not worth
worrying about.
--
Michael




Re: lowering pg_regress privileges on Windows

2018-10-18 Thread Thomas Munro
On Fri, Oct 19, 2018 at 1:13 PM Michael Paquier  wrote:
> On Thu, Oct 18, 2018 at 08:31:11AM -0400, Andrew Dunstan wrote:
> > The attached ridiculously tiny patch solves the problem whereby while we can
> > run Postgres on Windows safely from an Administrator account, we can't
> > run the regression tests from the same account, since it fails on the
> > tablespace test, the tablespace directory having been set up without first
> > having lowered privileges. The solution is to lower pg_regress' privileges
> > in the same way that we do with other binaries. This is useful in setups
> > like Appveyor where running under any other account is ... difficult. For
> > the cfbot Thomas has had to make the script hack the schedule file to omit
> > the tablespace test. This would make that redundant.
> >
> > I propose to backpatch this. It's close enough to a bug and the risk is
> > almost infinitely small.
>
> +1.  get_restricted_token() refactoring has been done down to
> REL9_5_STABLE.  With 9.4 and older you would need to copy again this
> full routine into pg_regress.c, which is in my opinion not worth
> worrying about.

FWIW here is a successful Appveyor build including the full test
schedule (CI patch attached in case anyone is interested).  Woohoo!
Thanks for figuring that out Andrew.  I will be very happy to remove
that wart from my workflows.

https://ci.appveyor.com/project/macdice/postgres/builds/19626669

--
Thomas Munro
http://www.enterprisedb.com


0001-CI-configuration-for-Appveyor.patch
Description: Binary data


Re: DSM segment handle generation in background workers

2018-10-18 Thread Thomas Munro
On Sat, Oct 13, 2018 at 11:45 PM Thomas Munro
 wrote:
> On Thu, Oct 11, 2018 at 5:51 PM Tom Lane  wrote:
> > The comment adjacent to the change in InitStandaloneProcess bothers me.
> > In particular, it points out that what BackendRun() is currently doing
> > creates more entropy in the process's random seed than what you have
> > here: MyStartTime is only good to the nearest second, whereas the code
> > you removed from BackendRun() factors in fractional seconds to whatever
> > the precision of GetCurrentTimestamp is.  This does not seem like an
> > improvement.
>
> True.
>
> > I'm inclined to think we should refactor a bit more aggressively so
> > that, rather than dumbing backends down to the standard of other
> > processes, we make them all use the better method.  A reasonable way
> > to approach this would be to invent a global variable MyStartTimestamp
> > beside MyStartTime, and initialize both of them in the places that
> > currently initialize the latter, using code like that in
> > BackendInitialize:
> >
> > /* save process start time */
> > port->SessionStartTime = GetCurrentTimestamp();
> > MyStartTime = timestamptz_to_time_t(port->SessionStartTime);
> >
> > and then this new InitRandomSeeds function could adopt BackendRun's
> > seed initialization method instead of the stupider way.
>
> Ok, done.
>
> With MyStartTimestamp in the picture, port->SessionStartTime is
> probably redundant...  and in fact the comment in struct
> Port says it shouldn't even be there.  So... I removed it, and used
> the new MyStartTimestamp in the couple of places that wanted it.
> Thoughts?
>
> > Possibly it'd be sane to merge the MyStartTime* initializations
> > and the srandom resets into one function, not sure.
>
> +1
>
> The main difficulty was coming up with a name for the function that
> does those things.  I went with InitProcessGlobals().  Better
> suggestions welcome.

Pushed to master only.
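Purely as an illustration of that kind of seeding (a hypothetical mix, not the committed code): combining the pid with a microsecond-resolution start timestamp keeps seeds distinct even for processes started within the same second.

```c
#include <stdint.h>

/*
 * Hypothetical seed mixer: fold the process id together with a
 * microsecond start timestamp, so that backends started within the same
 * wall-clock second still get distinct random seeds.
 */
static unsigned int
mix_seed(int pid, int64_t start_usec)
{
	uint64_t	v;

	v = (uint64_t) pid;
	v ^= (uint64_t) start_usec << 12;	/* fold in sub-second bits */
	v ^= (uint64_t) start_usec >> 20;
	return (unsigned int) v;
}
```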

-- 
Thomas Munro
http://www.enterprisedb.com



Re: partition tree inspection functions

2018-10-18 Thread Amit Langote
On 2018/10/10 13:01, Michael Paquier wrote:
> On Wed, Oct 10, 2018 at 11:54:48AM +0900, Amit Langote wrote:
>> I was partly wrong in saying that we wouldn't need any changes to support
>> partitioned indexes here.  Actually, the core function
>> find_inheritance_children wouldn't scan pg_inherits for us if we pass an
>> (partitioned) index to it, even if the index may have children.
>>
>> It appears that we don't set relhassubclass for partitioned indexes (that
>> is, on parent indexes), so the above test is useless for indexes.
> 
> Aha, so that was it.
> 
>> Maybe, we need to start another thread to discuss whether we should be
>> manipulating relhassubclass for index partitioning (I'd brought this up in
>> the context of relispartition, btw [1]).  I'm not immediately sure if
>> setting relhassubclass correctly for indexes will have applications beyond
>> this one, but it might be something we should let others know so that we
>> can hear more opinions.
> 
> This reminds of https://postgr.es/m/20180226053613.gd6...@paquier.xyz,
> which has resulted in relhaspkey being removed from pg_class with
> f66e8bf8.  I would much prefer if we actually removed it...  Now
> relhassubclass is more tricky than relhaspkey as it gets used as a
> fast-exit path for a couple of things:
> 1) prepunion.c when planning for child relations.
> 2) plancat.c
> 3) analyze.c
> 4) rewriteHandler.c

I'm concerned about the 1st one.  Currently in expand_inherited_rtentry(),
checking relhassubclass allows us to short-circuit scanning pg_inherits to
find out if a table has children.  Note that expand_inherited_rtentry() is
called for *all* queries that contain a table so that we correctly
identify tables that would require inheritance planning.  So, removing
relhassubclass means a slight (?) degradation for *all* queries.

>> For time being, I've modified that code as follows:
>>
>> @@ -68,9 +70,11 @@ find_inheritance_children(Oid parentrelId, LOCKMODE
>> lockmode)
>>
>>  /*
>>   * Can skip the scan if pg_class shows the relation has never had a
>> - * subclass.
>> + * subclass.  Since partitioned indexes never have their
>> + * relhassubclass set, we cannot skip the scan based on that.
>>   */
>> -if (!has_subclass(parentrelId))
>> +if (get_rel_relkind(parentrelId) != RELKIND_PARTITIONED_INDEX &&
>> +!has_subclass(parentrelId))
>>  return NIL;
> 
> I don't like living with this hack.  What if we simply nuked this
> fast-path exit all together?  The only code path where this could
> represent a performance issue is RelationBuildPartitionKey(), but we
> expect a partitioned table to have children anyway, and there will be
> few cases where partitioned tables have no partitions.

As I said above, the price of removing relhassubclass might be a bit
steep.  So, the other alternative I mentioned before is to set
relhassubclass correctly even for indexes if only for pg_partition_tree to
be able to use find_inheritance_children unchanged.

Thanks,
Amit




relhassubclass and partitioned indexes

2018-10-18 Thread Amit Langote
Hi,

$subject came up in [1].

Should relhassubclass be set/reset for partitioned indexes?

The only use case being sought here is to use find_inheritance_children()
for getting an index's partitions, but it uses the relhassubclass test to
short-circuit scanning pg_inherits.  That means it cannot be used to get
an index's children, because relhassubclass is never set for indexes.

Michael suggested on the linked thread to get rid of relhassubclass
altogether, like we did for relhaspkey recently, but I'm not sure whether
it would be a good idea just yet.

Thoughts?

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/3acdcbf4-6a62-fb83-3bfd-5f275602ca7d%40lab.ntt.co.jp




Re: Postgres, fsync, and OSs (specifically linux)

2018-10-18 Thread Craig Ringer
On Fri, 19 Oct 2018 at 07:27, Thomas Munro 
wrote:

>
> 2.  I am +1 on back-patching Craig's PANIC-on-failure logic.  Doing
> nothing is not an option I like.  I have some feedback and changes to
> propose though; see attached.
>

Thanks very much for the work on reviewing and revising this.


> I don't see why sync_file_range(SYNC_FILE_RANGE_WRITE) should get a
> pass here.  Inspection of some version of the kernel might tell us it
> can't advance the error counter and report failure, but what do we
> gain by relying on that?  Changed.
>

I was sure it made sense at the time, but I can't explain that decision
now, and it looks like we should treat it as a failure.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: relhassubclass and partitioned indexes

2018-10-18 Thread Tom Lane
Amit Langote  writes:
> Should relhassubclass be set/reset for partitioned indexes?

Seems like a reasonable idea to me, at least the "set" end of it.
We don't ever clear relhassubclass for tables, so maybe that's
not necessary for indexes either.

> Michael suggested on the linked thread to get rid of relhassubclass
> altogether, like we did for relhaspkey recently, but I'm not sure whether
> it would be a good idea right yet.

We got rid of relhaspkey mostly because it was of no use to the backend.
That's far from true for relhassubclass.

regards, tom lane



Re: relhassubclass and partitioned indexes

2018-10-18 Thread Michael Paquier
On Fri, Oct 19, 2018 at 01:45:03AM -0400, Tom Lane wrote:
> Amit Langote  writes:
>> Should relhassubclass be set/reset for partitioned indexes?
> 
> Seems like a reasonable idea to me, at least the "set" end of it.
> We don't ever clear relhassubclass for tables, so maybe that's
> not necessary for indexes either.

No objections to the proposal.  Allowing find_inheritance_children to
find index trees for partitioned indexes could be actually useful for
extensions like pg_partman.

>> Michael suggested on the linked thread to get rid of relhassubclass
>> altogether, like we did for relhaspkey recently, but I'm not sure whether
>> it would be a good idea just yet.
> 
> We got rid of relhaspkey mostly because it was of no use to the backend.
> That's far from true for relhassubclass.

Partitioned tables are expected to have partitions, so the optimizations
related to relhassubclass don't seem worth worrying about much.  However,
relations not having inherited tables may take a performance hit.  If
this flag removal were done, we'd need to be careful about the
performance impact and the cost of extra lookups at pg_inherits.
--
Michael

