Re: [PATCH v1] Add \echo_stderr to psql
Hello Corey,

> > \warn ...
> > \warning ...
>
> These two seem about the best to me, drawing from the perl warn command.

Yep, I was thinking of perl & gmake. Maybe the 4-letter option is better because it's the same length as "echo".

> I suppose we could go the bash &2 route here, but I don't want to.

I agree on this one.

-- 
Fabien.
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Hello. At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in > Please see my replies inline. Thanks. > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote: > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote: > > > > > > create db with tablespace > > > drop database > > > drop tablespace. > > > > Essentially, that sequence of operations causes crash recovery to fail > > if the "drop tablespace" transaction was committed before crashing. > > This is a bug in crash recovery in general and should be reproducible > > without configuring a standby. Is that right? > > > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on > master. > That makes the file/directory update-to-date if I understand the related > code correctly. > For standby, checkpoint redo does not ensure that. That's right partly. As you must have seen, fast shutdown forces restartpoint for the last checkpoint and it prevents the problem from happening. Anyway it seems to be a problem. > > Your patch creates missing directories in the destination. Don't we > > need to create the tablespace symlink under pg_tblspc/? I would > > > > 'create db with tablespace' redo log does not include the tablespace real > directory information. > Yes, we could add in it into the xlog, but that seems to be an overdesign. But I don't think creating directory that is to be removed just after is a wanted solution. The directory most likely to be be removed just after. > > prefer extending the invalid page mechanism to deal with this, as > > suggested by Ashwin off-list. It will allow us to avoid creating > > directories and files only to remove them shortly afterwards when the > > drop database and drop tablespace records are replayed. > > > > > 'invalid page' mechanism seems to be more proper for missing pages of a > file. For > missing directories, we could, of course, hack to use that (e.g. reading > any page of > a relfile in that database) to make sure the tablespace create code > (without symlink) > safer (It assumes those directories will be deleted soon). > > More feedback about all of the previous discussed solutions is welcome. It doesn't seem to me that the invalid page mechanism is applicable in straightforward way, because it doesn't consider simple file copy. Drop failure is ignored any time. I suppose we can ignore the error to continue recovering as far as recovery have not reached consistency. The attached would work *at least* your case, but I haven't checked this covers all places where need the same treatment. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 10a663bae6..0bc63f48da 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, return buffer; } +/* + * XLogMakePGDirectory + * + * There is a possibility that WAL replay causes a creation of the same + * directory left by the previous crash. Issuing ERROR prevents the caller + * from continuing recovery. + * + * To prevent that case, this function issues WARNING instead of ERROR on + * error if consistency is not reached yet. + */ +int +XLogMakePGDirectory(const char *directoryName) +{ + int ret; + + ret = MakePGDirectory(directoryName); + + if (ret != 0) + { + int elevel = ERROR; + + /* + * We might get error trying to create existing directory that is to + * be removed just after. Don't issue ERROR in the case. 
Recovery + * will stop if we again failed after reaching consistency. + */ + if (InRecovery && !reachedConsistency) + elevel = WARNING; + + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", directoryName))); + return ret; + } + + return 0; +} + /* * Struct actually returned by XLogFakeRelcacheEntry, though the declared * return type is Relation. diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index 30f6200a86..0216270dd3 100644 --- a/src/backend/storage/file/copydir.c +++ b/src/backend/storage/file/copydir.c @@ -22,11 +22,11 @@ #include #include +#include "access/xlogutils.h" #include "storage/copydir.h" #include "storage/fd.h" #include "miscadmin.h" #include "pgstat.h" - /* * copydir: copy a directory * @@ -41,10 +41,12 @@ copydir(char *fromdir, char *todir, bool recurse) char fromfile[MAXPGPATH * 2]; char tofile[MAXPGPATH * 2]; - if (MakePGDirectory(todir) != 0) - ereport(ERROR, -(errcode_for_file_access(), - errmsg("could not create directory \"%s\": %m", todir))); + /* + * We might have to skip copydir to continue recovery. See the function + * for details. + */ + if (XLogMakePGDirectory(todir) != 0) + return; xldir = AllocateDir(fromdir); diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h index 0ab5ba62f5..46a7596315 1
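For reference, the failing sequence discussed in this thread boils down to roughly the following sketch (the tablespace path is just an example, and this is not the attached test):

    -- on the primary
    CREATE TABLESPACE ts LOCATION '/path/to/ts';
    -- ... the standby replays this and later takes a restartpoint ...
    CREATE DATABASE db TABLESPACE ts;
    DROP DATABASE db;
    DROP TABLESPACE ts;
    -- If the standby replays all of the above (removing the tablespace
    -- directory) and then crashes, recovery restarts from the restartpoint
    -- and must re-replay the CREATE DATABASE, whose directory copy now
    -- fails because the tablespace directory is gone, stopping recovery.
    -- A forced restartpoint (e.g. a fast shutdown) hides the problem, as
    -- noted above.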
Re: block-level incremental backup
On 22.04.2019 2:02, Robert Haas wrote: On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost wrote: Having been around for a while working on backup-related things, if I was to implement the protocol for pg_basebackup today, I'd definitely implement "give me a list" and "give me this file" rather than the tar-based approach, because I've learned that people want to be able to do parallel backups and that's a decent way to do that. I wouldn't set out and implement something new that's there's just no hope of making parallel. Maybe the first write of pg_basebackup would still be simple and serial since it's certainly more work to make a frontend tool like that work in parallel, but at least the protocol would be ready to support a parallel option being added alter without being rewritten. And that's really what I was trying to get at here- if we've got the choice now to decide what this is going to look like from a protocol level, it'd be great if we could make it able to support being used in a parallel fashion, even if pg_basebackup is still single-threaded. I think we're getting closer to a meeting of the minds here, but I don't think it's intrinsically necessary to rewrite the whole method of operation of pg_basebackup to implement incremental backup in a sensible way. One could instead just do a straightforward extension to the existing BASE_BACKUP command to enable incremental backup. Then, to enable parallel full backup and all sorts of out-of-core hacking, one could expand the command language to allow tools to access individual steps: START_BACKUP, SEND_FILE_LIST, SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes for an appealing project, but I do not think there is a technical reason why it has to be done first. Or for that matter why it has to be done second. As I keep saying, incremental backup and full backup are separate projects and I believe it's completely reasonable for whoever is doing the work to decide on the order in which they would like to do the work. Having said that, I'm curious what people other than Stephen (and other pgbackrest hackers) think about the relative value of parallel backup vs. incremental backup. Stephen appears quite convinced that parallel backup is full of win and incremental backup is a bit of a yawn by comparison, and while I certainly would not want to discount the value of his experience in this area, it sometimes happens on this mailing list that [ drum roll please ] not everybody agrees about everything. So, what do other people think? Based on the experience of pg_probackup users I can say that there is no 100% winer and depending on use case either parallel either incremental backups are preferable. - If size of database is not so larger and intensity of updates is high enough, then parallel backup within one data center is definitely more efficient solution. - If size of database is very large and data is rarely updated or database is mostly append-only, then incremental backup is preferable. - Some customers need to collect at central server backups of databases installed at many nodes with slow and unreliable connection (assume DBMS installed at locomotives). Definitely parallelism can not help here, unlike support of incremental backup. - Parallel backup more aggressively consumes resources of the system, interfering with normal work of application. So performing parallel backup may cause significant degradation of application speed. 
pg_probackup supports both features, parallel and incremental backup, and it is up to the user to combine them in whatever way is most efficient for a particular configuration.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
Oops! The comment in the previous patch is wrong. At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190422.161513.258021727.horiguchi.kyot...@lab.ntt.co.jp> > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in > > > Please see my replies inline. Thanks. > > > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote: > > > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote: > > > > > > > > create db with tablespace > > > > drop database > > > > drop tablespace. > > > > > > Essentially, that sequence of operations causes crash recovery to fail > > > if the "drop tablespace" transaction was committed before crashing. > > > This is a bug in crash recovery in general and should be reproducible > > > without configuring a standby. Is that right? > > > > > > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace on > > master. > > That makes the file/directory update-to-date if I understand the related > > code correctly. > > For standby, checkpoint redo does not ensure that. > > That's right partly. As you must have seen, fast shutdown forces > restartpoint for the last checkpoint and it prevents the problem > from happening. Anyway it seems to be a problem. > > > > Your patch creates missing directories in the destination. Don't we > > > need to create the tablespace symlink under pg_tblspc/? I would > > > > > > > 'create db with tablespace' redo log does not include the tablespace real > > directory information. > > Yes, we could add in it into the xlog, but that seems to be an overdesign. > > But I don't think creating directory that is to be removed just > after is a wanted solution. The directory most likely to be be > removed just after. > > > > prefer extending the invalid page mechanism to deal with this, as > > > suggested by Ashwin off-list. It will allow us to avoid creating > > > > directories and files only to remove them shortly afterwards when the > > > drop database and drop tablespace records are replayed. > > > > > > > > 'invalid page' mechanism seems to be more proper for missing pages of a > > file. For > > missing directories, we could, of course, hack to use that (e.g. reading > > any page of > > a relfile in that database) to make sure the tablespace create code > > (without symlink) > > safer (It assumes those directories will be deleted soon). > > > > More feedback about all of the previous discussed solutions is welcome. > > It doesn't seem to me that the invalid page mechanism is > applicable in straightforward way, because it doesn't consider > simple file copy. > > Drop failure is ignored any time. I suppose we can ignore the > error to continue recovering as far as recovery have not reached > consistency. The attached would work *at least* your case, but I > haven't checked this covers all places where need the same > treatment. The comment for the new function XLogMakePGDirectory is wrong: + * There is a possibility that WAL replay causes a creation of the same + * directory left by the previous crash. Issuing ERROR prevents the caller + * from continuing recovery. The correct one is: + * There is a possibility that WAL replay causes an error by creation of a + * directory under a directory removed before the previous crash. Issuing + * ERROR prevents the caller from continuing recovery. It is fixed in the attached. regards. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index 10a663bae6..01331f0da9 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -522,6 +522,44 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, return buffer; } +/* + * XLogMakePGDirectory + * + * There is a possibility that WAL replay causes an error by creation of a + * directory under a directory removed before the previous crash. Issuing + * ERROR prevents the caller from continuing recovery. + * + * To prevent that case, this function issues WARNING instead of ERROR on + * error if consistency is not reached yet. + */ +int +XLogMakePGDirectory(const char *directoryName) +{ + int ret; + + ret = MakePGDirectory(directoryName); + + if (ret != 0) + { + int elevel = ERROR; + + /* + * We might get error trying to create existing directory that is to + * be removed just after. Don't issue ERROR in the case. Recovery + * will stop if we again failed after reaching consistency. + */ + if (InRecovery && !reachedConsistency) + elevel = WARNING; + + ereport(elevel, +(errcode_for_file_access(), + errmsg("could not create directory \"%s\": %m", directoryName))); + return ret; + } + + return 0; +} + /* * Struct actually returned by XLogFakeRelcacheEntry, though the declared * return type is Relation. diff --git a/src/backend/storage/file/copydir.c b/src/backend/storage/file/copydir.c index 30f6200a86..0216270dd3 100644 --- a/src/backend/storag
display of variables in EXPLAIN VERBOSE
Hi,

ISTM show_plan_tlist()'s rule for whether to show the range table prefix with displayed variables contradicts the description of the VERBOSE option in the EXPLAIN documentation, which is as follows:

===
VERBOSE

Display additional information regarding the plan. Specifically, include the output column list for each node in the plan tree, schema-qualify table and function names, always label variables in expressions with their range table alias, and always print the name of each trigger for which statistics are displayed. This parameter defaults to FALSE.
===

Specifically, the current behavior contradicts the part of the sentence that says "always label variables in expressions with their range table alias". See this example:

create table foo (a int);
create table foo1 () inherits (foo);

-- "a" is not labeled here
explain verbose select * from only foo order by 1;
                           QUERY PLAN
-----------------------------------------------------------------
 Sort  (cost=0.01..0.02 rows=1 width=4)
   Output: a
   Sort Key: foo.a
   ->  Seq Scan on public.foo  (cost=0.00..0.00 rows=1 width=4)
         Output: a
(5 rows)

-- it's labeled in this case
explain verbose select * from foo order by 1;
                              QUERY PLAN
───────────────────────────────────────────────────────────────────────
 Sort  (cost=192.60..198.98 rows=2551 width=4)
   Output: foo.a
   Sort Key: foo.a
   ->  Append  (cost=0.00..48.26 rows=2551 width=4)
         ->  Seq Scan on public.foo  (cost=0.00..0.00 rows=1 width=4)
               Output: foo.a
         ->  Seq Scan on public.foo1  (cost=0.00..35.50 rows=2550 width=4)
               Output: foo1.a
(8 rows)

Seeing that "Sort Key" is always displayed with the range table alias, I checked explain.c to see why the discrepancy exists, and it seems that show_plan_tlist() (and show_tablesample()) use the following condition for whether or not to use the range table prefix:

    useprefix = list_length(es->rtable) > 1;

whereas other functions, including show_sort_group_keys() that prints the "Sort Key", use the following condition:

    useprefix = (list_length(es->rtable) > 1 || es->verbose);

I can think of two ways we could go:

1. Change show_plan_tlist() and show_tablesample() to use the same rule as the others

2. Change the other functions to use the same rule as show_plan_tlist(), also updating the documentation to note the exceptional case when column names are not prefixed

Thoughts?

Thanks,
Amit
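To make option 1 concrete, presumably it is just a matter of making show_plan_tlist() and show_tablesample() compute useprefix the same way as the other show_* functions quoted above (a sketch, not a tested patch):

    -   useprefix = list_length(es->rtable) > 1;
    +   useprefix = (list_length(es->rtable) > 1 || es->verbose);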
Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hey everyone, I am writing a plpgsql function that (to greatly simplify) raises an exception with a formatted* message. Ideally, I should be able to call it with raise_exception('The person %I has only %I bananas.', 'Fred', 8), which mimics the format(text, any[]) calling convention. Here is where I have encountered a limitation of PostgreSQL's design: https://www.postgresql.org/docs/11/datatype-pseudo.html mentions explicitly that, "At present most procedural languages forbid use of a pseudo-type as an argument type". My reasoning is that I should be able to accept a value of some type if all I do is passing it to a function that accepts exactly that type, such as format(text, any[]). Given the technical reality, I assume that I wouldn't be able to do anything else with that value, but that is fine, since I don't have to do anything with it regardless. BR Michał "phoe" Herda *I do not want to use the obvious solution of raise_exception(format(...)) because the argument to that function is the error ID that is then looked up in a table from which the error message and sqlstate are retrieved. My full code is in the attached SQL file. Once it is executed: SELECT gateway_error('user_does_not_exist', '2'); -- works but is unnatural, SELECT gateway_error('user_does_not_exist', 2); -- is natural but doesn't work. install-errors.sql Description: application/sql
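For illustration, the restriction being referred to can be seen by trying to declare the variadic "any" signature directly in plpgsql (function and argument names are just examples, and the exact error wording may differ):

    -- format(text, VARIADIC "any") is a C function, so it may use the
    -- pseudo-type; the equivalent plpgsql declaration is rejected:
    CREATE FUNCTION gateway_error(error_id text, VARIADIC args "any")
    RETURNS void
    LANGUAGE plpgsql
    AS $$
    BEGIN
      RAISE EXCEPTION 'unreachable';
    END;
    $$;
    -- ERROR:  PL/pgSQL functions cannot accept type "any"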
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hi po 22. 4. 2019 v 11:27 odesílatel Michał "phoe" Herda napsal: > Hey everyone, > > I am writing a plpgsql function that (to greatly simplify) raises an > exception with a formatted* message. Ideally, I should be able to call > it with raise_exception('The person %I has only %I bananas.', 'Fred', > 8), which mimics the format(text, any[]) calling convention. > > Here is where I have encountered a limitation of PostgreSQL's design: > https://www.postgresql.org/docs/11/datatype-pseudo.html mentions > explicitly that, "At present most procedural languages forbid use of a > pseudo-type as an argument type". > > My reasoning is that I should be able to accept a value of some type if > all I do is passing it to a function that accepts exactly that type, > such as format(text, any[]). Given the technical reality, I assume that > I wouldn't be able to do anything else with that value, but that is > fine, since I don't have to do anything with it regardless. > > BR > Michał "phoe" Herda > > *I do not want to use the obvious solution of > raise_exception(format(...)) because the argument to that function is > the error ID that is then looked up in a table from which the error > message and sqlstate are retrieved. My full code is in the attached SQL > file. Once it is executed: > > SELECT gateway_error('user_does_not_exist', '2'); -- works but is > unnatural, > SELECT gateway_error('user_does_not_exist', 2); -- is natural but > doesn't work. > It is known problem, and fix is not easy. Any expressions inside plpgsql are simple queries like SELECT expr, and they are executed same pipeline like queries. The plans of these queries are stored and reused. Originally these plans disallow any changes, now some changes are supported, but parameters should be same all time. This is ensured by disallowing "any" type. Other polymorphic types are very static, so there is not described risk. Probably some enhancement can be in this are. The plan can be re-planed after some change - but it can has lot of performance impacts. It is long open topic. Some changes in direction to dynamic languages can increase cost of some future optimization to higher performance :-(. Regards Pavel
Regression test PANICs with master-standby setup on same machine
Hello hackers,

With a master-standby setup configured on the same machine, I'm getting a panic in the tablespace test while running make installcheck.

1. CREATE TABLESPACE regress_tblspacewith LOCATION 'blah';
2. DROP TABLESPACE regress_tblspacewith;
3. CREATE TABLESPACE regress_tblspace LOCATION 'blah';
-- do some operations in this tablespace
4. DROP TABLESPACE regress_tblspace;

The master panics at the last statement when the standby has completed applying the WAL up to step 2 but hasn't started step 3.

PANIC: could not fsync file "pg_tblspc/16387/PG_12_201904072/16384/16446": No such file or directory

The reason is that both tablespaces point to the same location. When the master tries to delete the new tablespace (and requests a checkpoint), the corresponding folder has already been deleted by the standby while applying the WAL to delete the old tablespace. I'm able to reproduce the issue with the attached script.

sh standby-server-setup.sh
make installcheck

I accept that configuring master-standby on the same machine for this test is not okay. But, can we avoid the PANIC somehow? Or, is this intentional and I should not include testtablespace in this case?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

echo "Cleaning.."
pkill -9 postgres
rm -rf *_logfile
rm /tmp/failover.log
rm -rf /tmp/archive_dir
mkdir /tmp/archive_dir

export PGPORT=5432 #MASTER PORT

PGSQL_DIR=$(pwd)
PGSQL_BIN=$PGSQL_DIR/install/bin
PGSQL_MASTER=$PGSQL_BIN/master #DATA FOLDER FOR PRIMARY/MASTER SERVER
PGSQL_STANDBY=$PGSQL_BIN/standby #DATA FOLDER FOR BACKUP/STANDBY SERVER

#Cleanup the master and slave data directory and create a new one.
rm -rf $PGSQL_MASTER $PGSQL_STANDBY
mkdir $PGSQL_MASTER $PGSQL_STANDBY
chmod 700 $PGSQL_MASTER
chmod 700 $PGSQL_STANDBY

#Initialize MASTER
$PGSQL_BIN/initdb -D $PGSQL_MASTER
echo "wal_level = hot_standby" >> $PGSQL_MASTER/postgresql.conf
echo "max_wal_senders = 3" >> $PGSQL_MASTER/postgresql.conf
echo "wal_keep_segments = 50" >> $PGSQL_MASTER/postgresql.conf
echo "hot_standby = on" >> $PGSQL_MASTER/postgresql.conf
#echo "max_standby_streaming_delay= -1" >> $PGSQL_MASTER/postgresql.conf
#echo "wal_consistency_checking='all'" >> $PGSQL_MASTER/postgresql.conf
echo "max_wal_size= 10GB" >> $PGSQL_MASTER/postgresql.conf
#echo "log_min_messages= debug2" >> $PGSQL_MASTER/postgresql.conf

#Setup replication settings
#Start Master
export PGPORT=5432
echo "Starting Master.."
$PGSQL_BIN/pg_ctl -D $PGSQL_MASTER -c -w -l master_logfile start

#Perform Backup in the Standy Server
$PGSQL_BIN/pg_basebackup --pgdata=$PGSQL_STANDBY -P
echo "primary_conninfo= 'port=5432'" >> $PGSQL_STANDBY/postgresql.conf
touch $PGSQL_STANDBY/standby.signal
#echo "standby_mode = on" >> $PGSQL_STANDBY/postgresql.conf

#Start STANDBY
export PGPORT=5433
echo "Starting Slave.."
$PGSQL_BIN/pg_ctl -D $PGSQL_STANDBY -c -w -l slave_logfile start
Re: bug in update tuple routing with foreign partitions
(2019/04/19 14:39), Etsuro Fujita wrote: (2019/04/19 13:00), Amit Langote wrote: On 2019/04/18 22:10, Etsuro Fujita wrote: On 2019/04/18 14:06, Amit Langote wrote: On 2019/04/17 21:49, Etsuro Fujita wrote: So what I'm thinking is to throw an error for cases like this. (Though, I think we should keep to allow rows to be moved from local partitions to a foreign-table subplan targetrel that has been updated already.) Hmm, how would you distinguish (totally inside postgred_fdw I presume) the two cases? One thing I came up with to do that is this: @@ -1917,6 +1937,23 @@ postgresBeginForeignInsert(ModifyTableState *mtstate, initStringInfo(&sql); + /* + * If the foreign table is an UPDATE subplan resultrel that hasn't yet + * been updated, routing tuples to the table might yield incorrect + * results, because if routing tuples, routed tuples will be mistakenly + * read from the table and updated twice when updating the table --- it + * would be nice if we could handle this case; but for now, throw an error + * for safety. + */ + if (plan&& plan->operation == CMD_UPDATE&& + (resultRelInfo->ri_usesFdwDirectModify || + resultRelInfo->ri_FdwState)&& + resultRelInfo> mtstate->resultRelInfo + mtstate->mt_whichplan) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot route tuples into foreign table to be updated \"%s\"", + RelationGetRelationName(rel; Oh, I see. So IIUC, you're making postgresBeginForeignInsert() check two things: 1. Whether the foreign table it's about to begin inserting (moved) rows into is a subplan result rel, checked by (resultRelInfo->ri_usesFdwDirectModify || resultRelInfo->ri_FdwState) 2. Whether the foreign table it's about to begin inserting (moved) rows into will be updated later, checked by (resultRelInfo> mtstate->resultRelInfo + mtstate->mt_whichplan) This still allows a foreign table to receive rows moved from the local partitions if it has already finished the UPDATE operation. Seems reasonable to me. Forgot to say that since this is a separate issue from the original bug report, maybe we can first finish fixing the latter. What to do you think? Yeah, I think we can do that, but my favorite would be to fix the two issues in a single shot, because they seem to me rather close to each other. So I updated a new version in a single patch, which I'm attaching. I agree that we can move to fix the other issue right away as the fix you outlined above seems reasonable, but I wonder if it wouldn't be better to commit the two fixes separately? The two fixes, although small, are somewhat complicated and combining them in a single commit might be confusing. Also, a combined commit might make it harder for the release note author to list down the exact set of problems being fixed. OK, I'll separate the patch into two. After I tried to separate the patch a bit, I changed my mind: I think we should commit the two issues in a single patch, because the original issue that overriding fmstate for the UPDATE operation mistakenly by fmstate for the INSERT operation caused backend crash is fixed by what I proposed above. So I add the commit message to the previous single patch, as you suggested. 
Some mostly cosmetic comments on the code changes: * In the following comment: + /* + * If the foreign table is an UPDATE subplan resultrel that hasn't yet + * been updated, routing tuples to the table might yield incorrect + * results, because if routing tuples, routed tuples will be mistakenly + * read from the table and updated twice when updating the table --- it + * would be nice if we could handle this case; but for now, throw an error + * for safety. + */ the part that start with "because if routing tuples..." reads a bit unclear to me. How about writing this as: /* * If the foreign table we are about to insert routed rows into is * also an UPDATE result rel and the UPDATE hasn't been performed yet, * proceeding with the INSERT will result in the later UPDATE * incorrectly modifying those routed rows, so prevent the INSERT --- * it would be nice if we could handle this case, but for now, throw * an error for safety. */ I think that's better than mine; will use that wording. Done. I simplified your wording a little bit, though. I see that in all the hunks involving some manipulation of aux_fmstate, there's a comment explaining what it is, which seems a bit repetitive. I can see more or less the same explanation in postgresExecForeignInsert(), postgresBeginForeignInsert(), and postgresEndForeignInsert(). Maybe just keep the description in postgresBeginForeignInsert as follows: @@ -1983,7 +2020,19 @@ postgresBeginForeignInsert(ModifyTableState *mtstate, retrieved_attrs != NIL, retrieved_attrs); - resultRelInfo->ri_FdwState = fmstate; + /* + * If the given resultRelInfo already has PgFdwModifyState set, it means + * the foreign table is an UPDATE subplan resultrel; in which case, store + * the resulting state into the aux_fmstate of th
Re: standby recovery fails (tablespace related) (tentative patch and discussion)
At Mon, 22 Apr 2019 16:40:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI wrote in <20190422.164027.33866403.horiguchi.kyot...@lab.ntt.co.jp> > At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI > wrote in > <20190422.161513.258021727.horiguchi.kyot...@lab.ntt.co.jp> > > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo wrote in > > > > > Please see my replies inline. Thanks. > > > > > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P wrote: > > > > > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo wrote: > > > > > > > > > > create db with tablespace > > > > > drop database > > > > > drop tablespace. > > > > > > > > Essentially, that sequence of operations causes crash recovery to fail > > > > if the "drop tablespace" transaction was committed before crashing. > > > > This is a bug in crash recovery in general and should be reproducible > > > > without configuring a standby. Is that right? > > > > > > > > > > No. In general, checkpoint is done for drop_db/create_db/drop_tablespace > > > on > > > master. > > > That makes the file/directory update-to-date if I understand the related > > > code correctly. > > > For standby, checkpoint redo does not ensure that. The attached exercises this sequence, needing some changes in PostgresNode.pm and RecursiveCopy.pm to allow tablespaces. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From dbe6306a730f94a5bd8beaf0e534c28ebdd815d4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 22 Apr 2019 20:10:20 +0900 Subject: [PATCH 1/2] Allow TAP test to excecise tablespace. To perform tablespace related checks, this patch lets PostgresNode::backup have a new parameter "tablespace_mapping", and make init_from_backup handle capable to handle a backup created using tablespace_mapping. --- src/test/perl/PostgresNode.pm | 10 -- src/test/perl/RecursiveCopy.pm | 33 + 2 files changed, 37 insertions(+), 6 deletions(-) diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 76874141c5..59a939821d 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -540,13 +540,19 @@ target server since it isn't done by default. sub backup { - my ($self, $backup_name) = @_; + my ($self, $backup_name, %params) = @_; my $backup_path = $self->backup_dir . '/' . $backup_name; my $name= $self->name; + my @rest = (); + + if (defined $params{tablespace_mapping}) + { + push(@rest, "--tablespace-mapping=$params{tablespace_mapping}"); + } print "# Taking pg_basebackup $backup_name from node \"$name\"\n"; TestLib::system_or_bail('pg_basebackup', '-D', $backup_path, '-h', - $self->host, '-p', $self->port, '--no-sync'); + $self->host, '-p', $self->port, '--no-sync', @rest); print "# Backup finished\n"; return; } diff --git a/src/test/perl/RecursiveCopy.pm b/src/test/perl/RecursiveCopy.pm index baf5d0ac63..c912ce412d 100644 --- a/src/test/perl/RecursiveCopy.pm +++ b/src/test/perl/RecursiveCopy.pm @@ -22,6 +22,7 @@ use warnings; use Carp; use File::Basename; use File::Copy; +use TestLib; =pod @@ -97,14 +98,38 @@ sub _copypath_recurse # invoke the filter and skip all further operation if it returns false return 1 unless &$filterfn($curr_path); - # Check for symlink -- needed only on source dir - # (note: this will fall through quietly if file is already gone) - croak "Cannot operate on symlink \"$srcpath\"" if -l $srcpath; - # Abort if destination path already exists. Should we allow directories # to exist already? 
croak "Destination path \"$destpath\" already exists" if -e $destpath; + # Check for symlink -- needed only on source dir + # (note: this will fall through quietly if file is already gone) + if (-l $srcpath) + { + croak "Cannot operate on symlink \"$srcpath\"" + if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/); + + # We have mapped tablespaces. Copy them individually + my $linkname = $1; + my $tmpdir = TestLib::tempdir; + my $dstrealdir = TestLib::real_dir($tmpdir); + my $srcrealdir = readlink($srcpath); + + opendir(my $dh, $srcrealdir); + while (readdir $dh) + { + next if (/^\.\.?$/); + my $spath = "$srcrealdir/$_"; + my $dpath = "$dstrealdir/$_"; + + copypath($spath, $dpath); + } + closedir $dh; + + symlink $dstrealdir, $destpath; + return 1; + } + # If this source path is a file, simply copy it to destination with the # same name and we're done. if (-f $srcpath) -- 2.16.3 >From 382910fbe3738c9098c0568cdc992928f471c7c5 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 22 Apr 2019 20:10:25 +0900 Subject: [PATCH 2/2] Add check for recovery failure caused by tablespace. Removal of a tablespace on master can cause recovery failure on standby. This patch adds the check for the case. --- src/test/recovery/t/011_crash_recovery.pl | 52 ++- 1 file changed, 51 insertions(+), 1 deletion(-) diff --git a/src/test/recovery/t/011_crash_recovery.p
Re: Regression test PANICs with master-standby setup on same machine
Hello. At Mon, 22 Apr 2019 15:52:59 +0530, Kuntal Ghosh wrote in > Hello hackers, > > With a master-standby setup configured on the same machine, I'm > getting a panic in tablespace test while running make installcheck. > > 1. CREATE TABLESPACE regress_tblspacewith LOCATION 'blah'; > 2. DROP TABLESPACE regress_tblspacewith; > 3. CREATE TABLESPACE regress_tblspace LOCATION 'blah'; > -- do some operations in this tablespace > 4. DROP TABLESPACE regress_tblspace; > > The master panics at the last statement when standby has completed > applying the WAL up to step 2 but hasn't started step 3. > PANIC: could not fsync file > "pg_tblspc/16387/PG_12_201904072/16384/16446": No such file or > directory > > The reason is both the tablespace points to the same location. When > master tries to delete the new tablespace (and requests a checkpoint), > the corresponding folder is already deleted by the standby while > applying WAL to delete the old tablespace. I'm able to reproduce the > issue with the attached script. > > sh standby-server-setup.sh > make installcheck > > I accept that configuring master-standby on the same machine for this > test is not okay. But, can we avoid the PANIC somehow? Or, is this > intentional and I should not include testtablespace in this case? If you don't have a problem using TAP test suite, tablespace is allowed with a bit restricted steps using the first patch in my just posted patchset[1]. This will work for you if you are okay with creating a standby after creating a tablespace. See the second patch in the patchset. If you stick on shell script, the following steps allow tablespaces. 1. Create tablespace directories for both master and standby. 2. Create a master then start. 3. Create tablespaces on the master. 4. Create a standby using pg_basebackup --tablespace_mapping== 5. Start the standby. [1] https://www.postgresql.org/message-id/20190422.211933.156769089.horiguchi.kyot...@lab.ntt.co.jp regards. -- Kyotaro Horiguchi NTT Open Source Software Center
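Spelled out as commands, those steps might look roughly like this (directories and ports are examples; the pg_basebackup option is --tablespace-mapping):

    # 1. tablespace directories for both master and standby
    mkdir -p /tmp/ts_master /tmp/ts_standby

    # 2. create a master, then start it (port 5432)
    initdb -D master
    pg_ctl -D master -l master.log start

    # 3. create the tablespace on the master
    psql -p 5432 -c "CREATE TABLESPACE ts LOCATION '/tmp/ts_master'"

    # 4. create a standby, remapping the tablespace directory
    pg_basebackup -p 5432 -D standby -R \
        --tablespace-mapping=/tmp/ts_master=/tmp/ts_standby

    # 5. start the standby on another port
    echo "port = 5433" >> standby/postgresql.conf
    pg_ctl -D standby -l standby.log start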
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
(2019/04/20 20:53), Laurenz Albe wrote: On Fri, 2018-04-06 at 23:24 +, Robert Haas wrote: Allow insert and update tuple routing and COPY for foreign tables. Also enable this for postgres_fdw. Etsuro Fujita, based on an earlier patch by Amit Langote. The larger patch series of which this is a part has been reviewed by Amit Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost, and me. Minor documentation changes to the final version by me. Discussion: http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f...@lab.ntt.co.jp This commit makes life hard for foreign data wrappers that support data modifications. If a FDW implements ExecForeignInsert, this commit automatically assumes that it also supports COPY FROM. It will call ExecForeignInsert without calling PlanForeignModify and BeginForeignModify, and a FDW that does not expect that will probably fail. This is not 100% correct; the FDW documentation says: Tuples inserted into a partitioned table by INSERT or COPY FROM are routed to partitions. If an FDW supports routable foreign-table partitions, it should also provide the following callback functions. These functions are also called when COPY FROM is executed on a foreign table. maybe there are FDWs that support INSERT but don't want to support COPY for some reason. I agree on that point. I propose that PostgreSQL only allows COPY FROM on a foreign table if the FDW implements BeginForeignInsert. The attached patch implements that. I don't think that is a good idea, because there might be some FDWs that want to support COPY FROM on foreign tables without providing BeginForeignInsert. (As for INSERT into foreign tables, we actually allow FDWs to support it without providing PlanForeignModify, BeginForeignModify, or EndForeignModify.) It's permissible to throw an error in BeginForeignInsert, so what I was thinking for FDWs that don't want to support COPY FROM and INSERT/UPDATE/COPY FROM tuple routing was to provide BeginForeignInsert implementing something like this: static void fooBeginForeignInsert(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo) { Relationrel = resultRelInfo->ri_RelationDesc; if (mtstate->ps.plan == NULL) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot copy to foreign table \"%s\"", RelationGetRelationName(rel; else ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("cannot route tuples into foreign table \"%s\"", RelationGetRelationName(rel; } Best regards, Etsuro Fujita
Re: Regression test PANICs with master-standby setup on same machine
On Mon, Apr 22, 2019 at 6:07 PM Kyotaro HORIGUCHI wrote: > If you don't have a problem using TAP test suite, tablespace is > allowed with a bit restricted steps using the first patch in my > just posted patchset[1]. This will work for you if you are okay > with creating a standby after creating a tablespace. See the > second patch in the patchset. > > If you stick on shell script, the following steps allow tablespaces. > > 1. Create tablespace directories for both master and standby. > 2. Create a master then start. > 3. Create tablespaces on the master. > 4. Create a standby using pg_basebackup --tablespace_mapping== > 5. Start the standby. > > [1] > https://www.postgresql.org/message-id/20190422.211933.156769089.horiguchi.kyot...@lab.ntt.co.jp > Thank you for the info. I'll try the same. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: Runtime pruning problem
On Fri, 19 Apr 2019 at 05:25, Tom Lane wrote: > So what I'm thinking is that I made a bad decision in 1cc29fe7c, > which did this: > > ... In passing, simplify the EXPLAIN code by > having it deal primarily in the PlanState tree rather than separately > searching Plan and PlanState trees. This is noticeably cleaner for > subplans, and about a wash elsewhere. > > It was definitely silly to have the recursion in explain.c passing down > both Plan and PlanState nodes, when the former is always easily accessible > from the latter. So that was an OK change, but at the same time I changed > ruleutils.c to accept PlanState pointers not Plan pointers from explain.c, > and that is now looking like a bad idea. If we were to revert that > decision, then instead of assuming that an AppendState always has at least > one live child, we'd only have to assume that an Append has at least one > live child. Which is true. > > I don't recall that there was any really strong reason for switching > ruleutils' API like that, although maybe if we look harder we'll find one. > I think it was mainly just for consistency with the way that explain.c > now looks at the world; which is not a negligible consideration, but > it's certainly something we could overrule. I started working on this today and I've attached what I have so far. For a plan like the following, as shown by master's EXPLAIN, we get: postgres=# explain verbose select *,(select * from t1 where t1.a=listp.a) z from listp where a = three() order by z; QUERY PLAN -- Sort (cost=1386.90..1386.95 rows=22 width=12) Output: listp1.a, listp1.b, ((SubPlan 1)) Sort Key: ((SubPlan 1)) -> Append (cost=0.00..1386.40 rows=22 width=12) Subplans Removed: 1 -> Seq Scan on public.listp1 (cost=0.00..693.15 rows=11 width=12) Output: listp1.a, listp1.b, (SubPlan 1) Filter: (listp1.a = three()) SubPlan 1 -> Index Only Scan using t1_pkey on public.t1 (cost=0.15..8.17 rows=1 width=4) Output: t1.a Index Cond: (t1.a = listp1.a) (12 rows) With the attached we end up with: postgres=# explain verbose select *,(select * from t1 where t1.a=listp.a) z from listp where a = three() order by z; QUERY PLAN - Sort (cost=1386.90..1386.95 rows=22 width=12) Output: listp1.a, listp1.b, ((SubPlan 1)) Sort Key: ((SubPlan 1)) -> Append (cost=0.00..1386.40 rows=22 width=12) Subplans Removed: 2 (5 rows) notice the reference to SubPlan 1, but no definition of what Subplan 1 actually is. I don't think this is particularly good, but not all that sure what to do about it. The code turned a bit more complex than I'd have hoped. In order to still properly resolve the parameters in find_param_referent() I had to keep the ancestor list, but also had to add an ancestor_plan list so that we can properly keep track of the Plan node parents too. Remember that a Plan node may not have a corresponding PlanState node if the state was never initialized. For Append and MergeAppend I ended up always using the first of the initialized subnodes if at least one is present and only resorted to using the first planned subnode if there are no subnodes in the AppendState/MergeAppendState. Without this, the Vars shown in the MergeAppend sort keys lost their alias prefix if the first subplan happened to have been pruned, but magically would gain it again if it was some other node that was pruned. This was just a bit too weird, so I ended up making a special case for this. 
-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services wip_improve_runtime_pruning_explain_output.patch Description: Binary data
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
On Fri, Apr 19, 2019 at 2:46 PM Amit Kapila wrote: > > > > > I am thinking that we should at least give it a try to move the map to > > > rel cache level to see how easy or difficult it is and also let's wait > > > for a day or two to see if Andres/Tom has to say anything about this > > > or on the response by me above to improve the current patch. > > > > Since we have a definite timeline, I'm okay with that, although I'm > > afraid I'm not quite knowledgeable enough to help much with the > > relcache piece. > > > > Okay, I can try to help. I think you can start by looking at > RelationData members (for ex. see how we cache index's metapage in > rd_amcache) and study a bit about routines in relcache.h. > Attached is a hacky and work-in-progress patch to move fsm map to relcache. This will give you some idea. I think we need to see if there is a need to invalidate the relcache due to this patch. I think some other comments of Andres also need to be addressed, see if you can attempt to fix some of them. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com store_local_map_relcache_v1.patch Description: Binary data
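For reference, the rd_amcache pattern Amit points to (nbtree's metapage cache) looks roughly like the sketch below; a relation-level FSM map could presumably be cached the same way, though a plain table has no rd_indexcxt, so it would need another suitable memory context. This is a sketch of the general pattern, not code from the attached patch:

    /*
     * Cache data in the relcache entry: allocate it in a context tied to
     * the entry so that it goes away automatically when the entry is
     * rebuilt or dropped.  "metad" here is assumed to point at metapage
     * data just read from the buffer.
     */
    if (rel->rd_amcache == NULL)
    {
        rel->rd_amcache = MemoryContextAlloc(rel->rd_indexcxt,
                                             sizeof(BTMetaPageData));
        memcpy(rel->rd_amcache, metad, sizeof(BTMetaPageData));
    }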
Re: display of variables in EXPLAIN VERBOSE
On Mon, 22 Apr 2019 at 19:49, Amit Langote wrote: > Seeing that "Sort Key" is always displayed with the range table alias, I > checked explain.c to see why the discrepancy exists and it seems that > show_plan_tlist() (and show_tablesample()) use the following condition for > whether or not to use the range table prefix: > > useprefix = list_length(es->rtable) > 1; > > whereas other functions, including show_sort_group_keys() that prints the > "Sort Key", use the following condition: > > useprefix = (list_length(es->rtable) > 1 || es->verbose); > > I can think of two ways we could do: > > 1. Change show_plan_tlist() and show_tablesample() to use the same rule as > others > > 2. Change other functions to use the same rule as show_plan_tlist(), also > updating the documentation to note the exceptional case when column names > are not prefixed I'd vote to make the code match the documentation, but probably implement it by adding a new field to ExplainState and just calculate what to do once in ExplainQuery() instead of calculating what to do in various random places. I don't think we should backpatch this change, likely it would be better to keep the explain output as stable as possible in the back branches, so that might mean a documentation tweak should be done for them. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
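A sketch of that suggestion, with an invented field name, could be as small as this:

    /* in struct ExplainState (explain.h) */
    bool    useprefix;      /* label Vars with their range table alias? */

    /* computed once, somewhere after es->rtable has been set */
    es->useprefix = (list_length(es->rtable) > 1 || es->verbose);

and then the various show_* functions would simply test es->useprefix instead of recomputing the condition locally.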
Re: Do CustomScan as not projection capable node
On Fri, Apr 19, 2019 at 12:45 AM Tom Lane wrote: > I don't buy this for a minute. Where do you think projection is > going to happen? There isn't any existing node type that *couldn't* > support projection if we insisted that that be done across-the-board. > I think it's mostly just a legacy thing that some don't. I think there may actually be some good reasons for that. If something like an Append or Material node projects, it seems to me that this means that we built the wrong tlist for its input(s). That justification doesn't apply to custom scans, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [PATCH v1] Add \echo_stderr to psql
On Mon, Apr 22, 2019 at 09:04:08AM +0200, Fabien COELHO wrote: > > Hello Corey, > > > >\warn ... > > >\warning ... > > > > These two seem about the best to me, drawing from the perl warn command. > > Yep, I was thinking of perl & gmake. Maybe the 4 letter option is better > because its the same length as "echo". > > > I suppose we could go the bash &2 route here, but I don't want to. > > I agree on this one. Please find attached v2, name is now \warn. How might we test this portably? Best, David. -- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate >From db987e21d0e25ce306d5482fa78d71de8454f1aa Mon Sep 17 00:00:00 2001 From: David Fetter Date: Sun, 21 Apr 2019 11:08:40 -0700 Subject: [PATCH v2] Add \warn to psql To: hackers MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="2.20.1" This is a multi-part message in MIME format. --2.20.1 Content-Type: text/plain; charset=UTF-8; format=fixed Content-Transfer-Encoding: 8bit - Does what it says on the label - Do we have any way to test this? diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml index b86764003d..4edf8e67a1 100644 --- a/doc/src/sgml/ref/psql-ref.sgml +++ b/doc/src/sgml/ref/psql-ref.sgml @@ -1879,6 +1879,26 @@ Tue Oct 26 21:40:57 CEST 1999 + +\warn text [ ... ] + + +Prints the arguments to the standard error, separated by one +space and followed by a newline. This can be useful to +intersperse information in the output of scripts when not polluting +standard output is desired. For example: + + +=> \echo :variable +=> \warn `date` +Sun Apr 21 10:48:11 PDT 2019 + +If the first argument is an unquoted -n the trailing +newline is not written. + + + + \ef function_description line_number diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c index 8254d61099..9425f02005 100644 --- a/src/bin/psql/command.c +++ b/src/bin/psql/command.c @@ -319,7 +319,7 @@ exec_command(const char *cmd, status = exec_command_ef_ev(scan_state, active_branch, query_buf, true); else if (strcmp(cmd, "ev") == 0) status = exec_command_ef_ev(scan_state, active_branch, query_buf, false); - else if (strcmp(cmd, "echo") == 0 || strcmp(cmd, "qecho") == 0) + else if (strcmp(cmd, "echo") == 0 || strcmp(cmd, "warn") == 0 || strcmp(cmd, "qecho") == 0) status = exec_command_echo(scan_state, active_branch, cmd); else if (strcmp(cmd, "elif") == 0) status = exec_command_elif(scan_state, cstack, query_buf); @@ -1114,7 +1114,7 @@ exec_command_ef_ev(PsqlScanState scan_state, bool active_branch, } /* - * \echo and \qecho -- echo arguments to stdout or query output + * \echo, \warn and \qecho -- echo arguments to stdout or query output */ static backslashResult exec_command_echo(PsqlScanState scan_state, bool active_branch, const char *cmd) @@ -1129,6 +1129,8 @@ exec_command_echo(PsqlScanState scan_state, bool active_branch, const char *cmd) if (strcmp(cmd, "qecho") == 0) fout = pset.queryFout; + else if (strcmp(cmd, "warn") == 0) + fout = stderr; else fout = stdout; diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c index d6d41b51d5..2ecb8b73c1 100644 --- a/src/bin/psql/help.c +++ b/src/bin/psql/help.c @@ -207,6 +207,7 @@ slashUsage(unsigned short int pager) fprintf(output, _("Input/Output\n")); fprintf(output, _(" \\copy ... 
perform SQL COPY with data stream to the client host\n")); fprintf(output, _(" \\echo [STRING] write string to standard output\n")); + fprintf(output, _(" \\warn [STRING] write string to standard error\n")); fprintf(output, _(" \\i FILEexecute commands from file\n")); fprintf(output, _(" \\ir FILE as \\i, but relative to location of current script\n")); fprintf(output, _(" \\o [FILE] send all query results to file or |pipe\n")); diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c index bcddc7601e..383a17a7f4 100644 --- a/src/bin/psql/tab-complete.c +++ b/src/bin/psql/tab-complete.c @@ -1421,7 +1421,7 @@ psql_completion(const char *text, int start, int end) "\\t", "\\T", "\\timing", "\\unset", "\\x", - "\\w", "\\watch", + "\\w","\\warn", "\\watch", "\\z", "\\!", "\\?", NULL --2.20.1--
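With the patch applied, a quick manual check (not a portable regression test, but it shows the behavior) could be:

    # stdout is discarded, so only stderr output survives; \warn should
    # still print "hello", while the same command with \echo prints nothing
    psql -X -c '\warn hello' > /dev/null
    psql -X -c '\echo hello' > /dev/null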
Re: proposal: psql PSQL_TABULAR_PAGER variable
On Thu, Apr 18, 2019 at 11:46 AM Pavel Stehule wrote: > My idea is following - pseudocode > > if view is a table > { > if is_defined PSQL_TABULAR_PAGER > { > pager = PSQL_TABULAR_PAGER > } > else if is_defined PSQL_PAGER > { > pager = PSQL_PAGER > } > else > { > pager = PAGER > } > } > else /* for \h xxx */ > { > if is_defined PSQL_PAGER > { > pager = PSQL_PAGER > } > else > { > pager = PAGER > } > Seems like pspg could just hand off to the regular pager if it discovers that the input is not in a format it finds suitable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: proposal: psql PSQL_TABULAR_PAGER variable
Robert Haas writes: > Seems like pspg could just hand off to the regular pager if it > discovers that the input is not in a format it finds suitable. It might be slightly tricky to do that after having already consumed some of the input :-(. Still, I've got to say that I find this proposal pretty horrid. I already thought that PSQL_PAGER was a dubious idea: what other program do you know anywhere that isn't satisfied with PAGER? Inventing still more variables of the same ilk is making it even messier, and more obviously poorly designed, and more obviously likely to end up with forty-nine different variables for slightly different purposes. I think that the general problem here is "we need psql to be able to give some context info to pspg", and the obvious way to handle that is to make a provision for arguments on pspg's command line. That is, instead of just calling "pspg", call "pspg table" or "pspg help" etc etc, with the understanding that the set of context words could be extended over time. We could shoehorn this into what we already have by saying that PSQL_PAGER is interpreted as a format, and if it contains say "%c" then replace that with a context word (and again, there's room for more format codes over time). Probably best *not* to apply such an interpretation to PAGER, though. Whether the whole problem is really worth this much infrastructure is a fair question. But if we're going to do something, I'd rather go down a path like this than inventing a new environment variable every month. regards, tom lane
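To make that concrete, the setting under such a scheme might look like this (the %c code and the context words are hypothetical, per the proposal above):

    # psql would replace %c with a context word before invoking the pager,
    # so a table result runs "pspg table" and \h output runs "pspg help"
    export PSQL_PAGER='pspg %c'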
Re: proposal: psql PSQL_TABULAR_PAGER variable
po 22. 4. 2019 v 15:46 odesílatel Robert Haas napsal: > On Thu, Apr 18, 2019 at 11:46 AM Pavel Stehule > wrote: > > My idea is following - pseudocode > > > > if view is a table > > { > > if is_defined PSQL_TABULAR_PAGER > > { > > pager = PSQL_TABULAR_PAGER > > } > > else if is_defined PSQL_PAGER > > { > > pager = PSQL_PAGER > > } > > else > > { > > pager = PAGER > > } > > } > > else /* for \h xxx */ > > { > > if is_defined PSQL_PAGER > > { > > pager = PSQL_PAGER > > } > > else > > { > > pager = PAGER > > } > >
> Seems like pspg could just hand off to the regular pager if it
> discovers that the input is not in a format it finds suitable.

This is possible, and I wrote it. But it is a "little bit" strange to start another pager from a pager. I think task-oriented pagers can enhance the user experience of TUI applications - and there is a big space for enhancement. Currently pspg has to reparse the data, and there are some heuristics to detect the format. It would be nice if psql could send some additional info about the data. Maybe psql could send raw data, and the printing/formatting could be done on the parser side.

Pavel

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
Re: proposal: psql PSQL_TABULAR_PAGER variable
po 22. 4. 2019 v 16:21 odesílatel Tom Lane napsal: > Robert Haas writes: > > Seems like pspg could just hand off to the regular pager if it > > discovers that the input is not in a format it finds suitable. > > It might be slightly tricky to do that after having already consumed > some of the input :-(. > pspg supports both direction scrolling, so data are in buffer, and can be displayed again. > Still, I've got to say that I find this proposal pretty horrid. > I already thought that PSQL_PAGER was a dubious idea: what other > program do you know anywhere that isn't satisfied with PAGER? > Inventing still more variables of the same ilk is making it even > messier, and more obviously poorly designed, and more obviously > likely to end up with forty-nine different variables for slightly > different purposes. > The programs with some complex output usually doesn't use a pagers - or use pagers only for part of their output. Initially I would to teach "less" to support tabular data - but the after some initial research I found so I am not able to modify "less". > I think that the general problem here is "we need psql to be able to > give some context info to pspg", and the obvious way to handle that > is to make a provision for arguments on pspg's command line. That > is, instead of just calling "pspg", call "pspg table" or "pspg help" > etc etc, with the understanding that the set of context words could > be extended over time. We could shoehorn this into what we already > have by saying that PSQL_PAGER is interpreted as a format, and if > it contains say "%c" then replace that with a context word (and > again, there's room for more format codes over time). Probably > best *not* to apply such an interpretation to PAGER, though. > It can be a way. There are some issues unfixable on pager side - like dynamic column resizing when FETCH_COUNT > 0 and some others. I can imagine a situation, when psql send just raw data in some easy machine readable format (like CSV), and specialized pager can format these data, and can support some interactive work (hiding columns, columns switch, ..) Regards Pavel > > Whether the whole problem is really worth this much infrastructure > is a fair question. But if we're going to do something, I'd rather > go down a path like this than inventing a new environment variable > every month. > > regards, tom lane >
Re: pg_dump is broken for partition tablespaces
On Wed, Apr 17, 2019 at 6:06 PM Alvaro Herrera wrote: > > > 3. Partitioned relations can have the database tablespace in > > >pg_class.reltablespace, as opposed to storage-bearing relations which > > >cannot. This is useful to be able to put partitions in the database > > >tablespace even if the default_tablespace is set to something else. > > > > I still feel that this is a darn bad idea. It goes against the rule > > that's existed for pg_class.reltablespace since its beginning, and > > I think it's inevitable that that's going to break something. > > Yes, this deviates from current practice, and while I tested this in as > many ways as I could think of, I cannot deny that it might break > something unexpectedly. Like Tom, I think this has got to be broken. Suppose that you have a partitioned table which has reltablespace = dattablespace. It has a bunch of children with reltablespace = 0. So far so good: new children of the partitioned table go into the database tablespace regardless of default_tablespace. Now somebody changes the default tablespace using ALTER DATABASE .. SET TABLESPACE. All the existing children end up in the new default tablespace, but new children of the partitioned table end up going into the tablespace that used to be the default but is no longer. That's pretty odd, because the whole point of setting a tablespace on the children was to get all of the children into the same tablespace. The same thing happens if you clone the database using CREATE DATABASE .. TEMPLATE .. TABLESPACE, as Tom mentioned. PostgreSQL has historically and very deliberately *not made a distinction* between "this object is in the default tablespace" and "this object is in tablespace X which happens to be the default." I think that it's too late to invent such a distinction for reasons of backward compatibility -- and if we were going to do it, surely it would need to exist for both partitioned tables and the partitions themselves. Otherwise it just produces more strange inconsistencies. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
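A sketch of the sequence being described (all names invented; "db_ts" is the tablespace the database was created in, "new_ts" is the one it is moved to):

    -- partitioned parent explicitly pinned to the database tablespace;
    -- its partitions are created with reltablespace = 0
    CREATE TABLE parent (a int) PARTITION BY LIST (a) TABLESPACE db_ts;
    CREATE TABLE part1 PARTITION OF parent FOR VALUES IN (1);

    -- run from another database, with no connections to this one
    ALTER DATABASE mydb SET TABLESPACE new_ts;

    -- part1 has moved to new_ts along with the rest of the database, but a
    -- new partition still lands in db_ts, because the parent remembered the
    -- old default explicitly
    CREATE TABLE part2 PARTITION OF parent FOR VALUES IN (2);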
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Greetings, * Peter Geoghegan (p...@bowt.ie) wrote: > Andres has suggested that I work on teaching nbtree to accommodate > variable-width, logical table identifiers, such as those required for > indirect indexes, or clustered indexes, where secondary indexes must > use a logical primary key value instead of a heap TID. I'm not > currently committed to working on this as a project, but I definitely > don't want to make it any harder. This has caused me to think about > the problem as it relates to the new on-disk representation for v4 > nbtree indexes in Postgres 12. I do have a minor misgiving about one > particular aspect of what I came up with: The precise details of how > we represent heap TID in pivot tuples seems like it might make things > harder than they need to be for a future logical/varwidth table > identifier project. This probably isn't worth doing anything about > now, but it seems worth discussing now, just in case. This seems like it would be helpful for global indexes as well, wouldn't it? > This also results in an immediate though small benefit for v4 nbtree > indexes: _bt_truncate() produces smaller pivot tuples in a few cases. > For example, indexes with one or two boolean fields will have pivot > tuples that are 15 bytes and 16 bytes in length respectively, > occupying 16 bytes of tuple space on internal pages. The saving comes > because we can use the alignment padding hole, that was empty in the > original non-pivot index tuple that the new pivot tuple is to be > formed from. Currently, the size of these pivot tuples would be 24 > bytes, so we're occasionally saving a MAXALIGN() quantum in space this > way. It is unlikely that anyone would actually care very much about > these kinds of space savings, but at the same time it feels more > elegant to me. The heap TID may not have a pg_attribute entry, but > ISTM that the on-disk representation should not have padding "in the > wrong place", on general principle. > > Thoughts? I agree with trying to avoid having padding 'in the wrong place' and if it makes some indexes smaller, great, even if they're unlikely to be interesting in the vast majority of cases, they may still exist out there. Of course, this is provided that it doesn't overly complicate the code, but it sounds like it wouldn't be too bad in this case. Thanks! Stephen signature.asc Description: PGP signature
Re: finding changed blocks using WAL scanning
On Sun, Apr 21, 2019 at 06:24:50PM -0400, Robert Haas wrote: > On Sat, Apr 20, 2019 at 5:54 PM Bruce Momjian wrote: > > Good point. You mentioned: > > > > It seems better to me to give the files names like > > ${TLI}.${STARTLSN}.${ENDLSN}.modblock, e.g. > > 0001.00016858.0001687DBBB8.modblock, so that you can > > see exactly which *records* are covered by that segment. > > > > but it seems like it should be ${TLI}.${ENDLSN}... (END first) because > > you would not want to delete the modblock file until you are about to > > delete the final WAL, not the first WAL, but as you mentioned, it might > > be ENDLSN-1. > > Hmm. It seems to me that it is almost universally the convention to > put the starting point prior to the ending point. If you are taking a > biology class, the teacher will not tell you to study chapters six > through three.

My point is that most WAL archive tools will order and remove files based on their lexical ordering, so if you put the start first, the file will normally be removed when it should be kept, e.g., if you have WAL files like:

    00010001
    00010002
    00010003
    00010004
    00010005

putting the start first and archiving some WAL would lead to:

    00010001-00010004.modblock
    00010003
    00010004
    00010005

We removed 1 and 2, but kept the modblock file, which looks out of order. Having the end at the start would give:

    00010003
    00010004
    00010004-00010001.modblock
    00010005

My point is that you would normally only remove the modblock file when 4 is removed, because this modblock file is useful for incremental backups from base backups that happened between 1 and 4.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
make \d pg_toast.foo show its indices
It's deliberate that \dt doesn't show toast tables. \d shows them, but doesn't show their indices. It seems to me that their indices should be shown, without having to think and know to query pg_index.

postgres=# \d pg_toast.pg_toast_2600
TOAST table "pg_toast.pg_toast_2600"
   Column   |  Type
------------+---------
 chunk_id   | oid
 chunk_seq  | integer
 chunk_data | bytea
Indexes:
    "pg_toast_2600_index" PRIMARY KEY, btree (chunk_id, chunk_seq)

Justin

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index d7390d5..d26d986 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -2274,6 +2274,7 @@ describeOneTableDetails(const char *schemaname,
 	else if (tableinfo.relkind == RELKIND_RELATION ||
 			 tableinfo.relkind == RELKIND_MATVIEW ||
 			 tableinfo.relkind == RELKIND_FOREIGN_TABLE ||
+			 tableinfo.relkind == RELKIND_TOASTVALUE ||
 			 tableinfo.relkind == RELKIND_PARTITIONED_TABLE)
 	{
 		/* Footer information about a table */
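Until something like this is committed, the index can still be dug out of the catalogs by hand - which is exactly the "think and know to query pg_index" step the patch is meant to remove. A possible query, reusing the toast table from the example above:

  SELECT i.relname AS index_name,
         pg_get_indexdef(x.indexrelid) AS definition
  FROM pg_class c
  JOIN pg_index x ON x.indrelid = c.oid
  JOIN pg_class i ON i.oid = x.indexrelid
  WHERE c.relnamespace = 'pg_toast'::regnamespace
    AND c.relname = 'pg_toast_2600';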
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 11:20:43AM +0900, Michael Paquier wrote: > On Sat, Apr 20, 2019 at 12:21:36AM -0400, Robert Haas wrote: > > The segment size doesn't have much to do with it. If you make > > segments bigger, you'll have to scan fewer larger ones; if you make > > them smaller, you'll have more smaller ones. The only thing that > > really matters is the amount of I/O and CPU required, and that doesn't > > change very much as you vary the segment size. > > If you create the extra file when a segment is finished and we switch > to a new one, then the extra work would happen for a random backend, > and it is going to be more costly to scan a 1GB segment than a 16MB > segment as a one-time operation, and less backends would see a > slowdown at equal WAL data generated. From what I can see, you are > not planning to do such operations when a segment finishes being > written, which would be much better. I think your point is that the 16MB is more likely to be in memory, while the 1GB is less likely. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: display of variables in EXPLAIN VERBOSE
David Rowley writes: > I'd vote to make the code match the documentation, but probably > implement it by adding a new field to ExplainState and just calculate > what to do once in ExplainQuery() instead of calculating what to do in > various random places.

Yeah, this is none too consistent:

$ grep -n 'useprefix =' explain.c
2081: useprefix = list_length(es->rtable) > 1;
2151: useprefix = (IsA(planstate->plan, SubqueryScan) || es->verbose);
2165: useprefix = (list_length(es->rtable) > 1 || es->verbose);
2238: useprefix = (list_length(es->rtable) > 1 || es->verbose);
2377: useprefix = (list_length(es->rtable) > 1 || es->verbose);
2485: useprefix = list_length(es->rtable) > 1;

If we're going to mess with this, I'd also suggest that we not depend on list_length(es->rtable) per se, as that counts RTEs that may have nothing to do with the plan. For instance, I've never been very happy about this behavior:

regression=# create table tt (f1 int, f2 int);
CREATE TABLE
regression=# explain verbose select * from tt;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on public.tt  (cost=0.00..32.60 rows=2260 width=8)
   Output: f1, f2
(2 rows)

regression=# create view vv as select * from tt;
CREATE VIEW
regression=# explain verbose select * from vv;
                          QUERY PLAN
--------------------------------------------------------------
 Seq Scan on public.tt  (cost=0.00..32.60 rows=2260 width=8)
   Output: tt.f1, tt.f2
(2 rows)

The reason for the difference is the presence of the view's RTE in the plan, but why should that affect the printout? Maybe we could make it depend on the number of RTE names assigned by select_rtable_names_for_explain, instead.

BTW, now that I look at this, I think the reason why I didn't make tlist printouts pay attention to VERBOSE for this purpose is that you don't get them at all if not verbose:

regression=# explain select * from tt;
                      QUERY PLAN
--------------------------------------------------------
 Seq Scan on tt  (cost=0.00..32.60 rows=2260 width=8)
(1 row)

So if we were to be rigidly consistent with this point of the docs, there would be no way to see a tlist without variable qualification, which doesn't really seem that nice.

Alternatively, we could just leave this as-is. I do not think the quoted doc paragraph was ever meant as an exact specification of what EXPLAIN VERBOSE does, nor do I believe that making it so would be helpful.

regards, tom lane
Re: finding changed blocks using WAL scanning
On Sun, Apr 21, 2019 at 10:21 PM Michael Paquier wrote: > If you create the extra file when a segment is finished and we switch > to a new one, then the extra work would happen for a random backend, > and it is going to be more costly to scan a 1GB segment than a 16MB > segment as a one-time operation, and less backends would see a > slowdown at equal WAL data generated. From what I can see, you are > not planning to do such operations when a segment finishes being > written, which would be much better. Well, my plan was to do it all from a background worker, so I do not think a random backend would ever have to do any extra work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: clean up docs for v12
Hi, On 2019-04-22 14:48:26 +0900, Michael Paquier wrote: > On Fri, Apr 19, 2019 at 09:43:01AM -0500, Justin Pryzby wrote: > > Thanks for committing those portions. > > I have done an extra pass on your patch set to make sure that I am > missing nothing, and the last two remaining places which need some > tweaks are the comments from the JIT code you pointed out. Attached > is a patch with these adjustments. > -- > Michael > diff --git a/src/backend/jit/llvm/llvmjit_deform.c > b/src/backend/jit/llvm/llvmjit_deform.c > index 94b4635218..e7aa92e274 100644 > --- a/src/backend/jit/llvm/llvmjit_deform.c > +++ b/src/backend/jit/llvm/llvmjit_deform.c > @@ -298,9 +298,9 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc > desc, > } > > /* > - * Check if's guaranteed the all the desired attributes are available in > - * tuple. If so, we can start deforming. If not, need to make sure to > - * fetch the missing columns. > + * Check if all the desired attributes are available in the tuple. If > so, > + * we can start deforming. If not, we need to make sure to fetch the > + * missing columns. >*/ That's imo not an improvement. The guaranteed bit is actually relevant. What this block is doing is eliding the check against the tuple header for the number of attributes, if NOT NULL attributes for later columns guarantee that the desired columns are present in the NULL bitmap. But the rephrasing makes it sound like we're actually checking against the tuple. I think it'd be better just to fix s/the all/that all/. > if ((natts - 1) <= guaranteed_column_number) > { > @@ -383,7 +383,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc > desc, > > /* >* If this is the first attribute, slot->tts_nvalid was 0. > Therefore > - * reset offset to 0 to, it be from a previous execution. > + * reset offset to 0 too, as it may be from a previous > execution. >*/ > if (attnum == 0) > { That obviously makes sense. Greetings, Andres Freund
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian wrote: > My point is that most WAL archive tools will order and remove files > based on their lexical ordering, so if you put the start first, the file > will normally be removed when it should be kept, e.g., if you have WAL > files like: > > 00010001 > 00010002 > 00010003 > 00010004 > 00010005 > > putting the start first and archiving some wal would lead to: > > 00010001-00010004.modblock > 00010003 > 00010004 > 00010005 > > We removed 1 and 2, but kept the modblock file, which looks out of > order. Having the end at the start would have: > > 00010003 > 00010004 > 00010004-00010001.modblock > 00010005 > > My point is that you would normally only remove the modblock file when 4 > is removed because this modblock files is useful for incremental backups > from base backups that happened between 1 and 4. That's an interesting point. On the other hand, I think it would be typical to want the master to retain .modblock files for much longer than it retains WAL segments, and in my design, the WAL archive wouldn't see those files at all; they'd be stored on the master. I was actually thinking that they should possibly be stored in a separate directory to avoid confusion. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: clean up docs for v12
On 2019-Apr-22, Andres Freund wrote: > On 2019-04-22 14:48:26 +0900, Michael Paquier wrote: > > /* > > -* Check if's guaranteed the all the desired attributes are available in > > -* tuple. If so, we can start deforming. If not, need to make sure to > > -* fetch the missing columns. > > +* Check if all the desired attributes are available in the tuple. If > > so, > > +* we can start deforming. If not, we need to make sure to fetch the > > +* missing columns. > > */ > > That's imo not an improvement. The guaranteed bit is actually > relevant. What this block is doing is eliding the check against the > tuple header for the number of attributes, if NOT NULL attributes for > later columns guarantee that the desired columns are present in the NULL > bitmap. But the rephrasing makes it sound like we're actually checking > against the tuple. > > I think it'd be better just to fix s/the all/that all/. (and s/if's/if it's/) > > > if ((natts - 1) <= guaranteed_column_number) > > { > > @@ -383,7 +383,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc > > desc, > > > > /* > > * If this is the first attribute, slot->tts_nvalid was 0. > > Therefore > > -* reset offset to 0 to, it be from a previous execution. > > +* reset offset to 0 too, as it may be from a previous > > execution. > > */ > > if (attnum == 0) > > { > > That obviously makes sense. Hmm, I think "as it *is*", not "as it *may be*", right? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: clean up docs for v12
Alvaro Herrera writes: > On 2019-Apr-22, Andres Freund wrote: >> On 2019-04-22 14:48:26 +0900, Michael Paquier wrote: >>> /* >>> -* Check if's guaranteed the all the desired attributes are available in >>> -* tuple. If so, we can start deforming. If not, need to make sure to >>> -* fetch the missing columns. >>> +* Check if all the desired attributes are available in the tuple. If >>> so, >>> +* we can start deforming. If not, we need to make sure to fetch the >>> +* missing columns. >>> */ >> That's imo not an improvement. The guaranteed bit is actually >> relevant. What this block is doing is eliding the check against the >> tuple header for the number of attributes, if NOT NULL attributes for >> later columns guarantee that the desired columns are present in the NULL >> bitmap. But the rephrasing makes it sound like we're actually checking >> against the tuple. >> >> I think it'd be better just to fix s/the all/that all/. > (and s/if's/if it's/) ISTM that Michael's proposed wording change shows that the existing comment is easily misinterpreted. I don't think these minor grammatical fixes will avoid the misinterpretation problem, and so some more-extensive rewording is called for. But TBH, now that I look at the code, I think the entire optimization is a bad idea and should be removed. Am I right in thinking that the presence of a wrong attnotnull marker could cause the generated code to actually crash, thanks to not checking the tuple's natts field? I don't have enough faith in our enforcement of those constraints to want to see JIT taking that risk to save a nanosecond or two. (Possibly I'd not think this if I weren't fresh off a couple of days with my nose in the ALTER TABLE SET NOT NULL code. But right now, I think that believing that that code does not and never will have any bugs is just damfool.) regards, tom lane
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Hi, On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote: > Andres has suggested that I work on teaching nbtree to accommodate > variable-width, logical table identifiers, such as those required for > indirect indexes, or clustered indexes, where secondary indexes must > use a logical primary key value instead of a heap TID. I think it's two more cases: - table AMs that want to support tables that are bigger than 32TB. That used to be unrealistic, but it's not anymore. Especially when the need to VACUUM etc is largely removed / reduced. - global indexes (for cross-partition unique constraints and such), which need a partition identifier as part of the tid (or as part of the index key, but I think that actually makes interaction with indexam from other layers more complicated - the inside of the index maybe may want to represent it as a column, but to the outside that ought not to be visible) > Thoughts? Seems reasonable to me. I, more generally, wonder if there's not a case to squeeze out more padding than "just" what you describe (since we IIRC don't frequently keep pointers into such tuples anyway, and definitely don't for byval attrs). But that's very likely better done separately. Greetings, Andres Freund
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 12:15:32PM -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian wrote: > > My point is that you would normally only remove the modblock file when 4 > > is removed because this modblock files is useful for incremental backups > > from base backups that happened between 1 and 4. > > That's an interesting point. On the other hand, I think it would be > typical to want the master to retain .modblock files for much longer > than it retains WAL segments, and in my design, the WAL archive > wouldn't see those files at all; they'd be stored on the master. I > was actually thinking that they should possibly be stored in a > separate directory to avoid confusion. I assumed the modblock files would be stored in the WAL archive so some external tools could generate incremental backups using just the WAL files. I assumed they would also be sent to standby servers so incremental backups could be done on standby servers too. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: clean up docs for v12
Hi,

On 2019-04-22 12:33:24 -0400, Tom Lane wrote: > ISTM that Michael's proposed wording change shows that the existing > comment is easily misinterpreted. I don't think these minor grammatical > fixes will avoid the misinterpretation problem, and so some more-extensive > rewording is called for.

Fair enough.

> But TBH, now that I look at the code, I think the entire optimization > is a bad idea and should be removed. Am I right in thinking that the > presence of a wrong attnotnull marker could cause the generated code to > actually crash, thanks to not checking the tuple's natts field? I don't > have enough faith in our enforcement of those constraints to want to see > JIT taking that risk to save a nanosecond or two.

It's not a minor optimization, it's very measurable. Without the check there's no pipeline stall when the memory for the tuple header is not in the CPU cache (very common, especially for seqscans and such, due to the "backward" memory location ordering of tuples in seqscans, which CPUs don't predict). Server grade CPUs of the last ~5 years just march on and start the work to fetch the first attributes (especially if they're NOT NULL) - but can't do that if natts has to be checked. And starting to check the NULL bitmap for NOT NULL attributes would make that even worse - and would be required if we don't trust attnotnull.

> (Possibly I'd not think this if I weren't fresh off a couple of days > with my nose in the ALTER TABLE SET NOT NULL code. But right now, > I think that believing that that code does not and never will have > any bugs is just damfool.)

But there's plenty places where we rely on NOT NULL actually working? We'll return wrong query results, and even crash in non-JIT places because we thought there was guaranteed to be a datum?

Greetings, Andres Freund
Re: clean up docs for v12
Hi,

On 2019-04-22 09:43:56 -0700, Andres Freund wrote: > On 2019-04-22 12:33:24 -0400, Tom Lane wrote: > > ISTM that Michael's proposed wording change shows that the existing > > comment is easily misinterpreted. I don't think these minor grammatical > > fixes will avoid the misinterpretation problem, and so some more-extensive > > rewording is called for. > > Fair enough.

The computation of that variable above has:

		/*
		 * If the column is possibly missing, we can't rely on its (or
		 * subsequent) NOT NULL constraints to indicate minimum attributes in
		 * the tuple, so stop here.
		 */
		if (att->atthasmissing)
			break;

		/*
		 * Column is NOT NULL and there've been no preceding missing columns,
		 * it's guaranteed that all columns up to here exist at least in the
		 * NULL bitmap.
		 */
		if (att->attnotnull)
			guaranteed_column_number = attnum;

and only then the comment referenced in the discussion here follows:

	/*
	 * Check if's guaranteed the all the desired attributes are available in
	 * tuple. If so, we can start deforming. If not, need to make sure to
	 * fetch the missing columns.
	 */

I think just reformulating that to something like

	/*
	 * Check if it's guaranteed that all the desired attributes are available
	 * in the tuple (but still possibly NULL), by dint of either the last
	 * to-be-deformed column being NOT NULL, or subsequent ones not accessed
	 * here being NOT NULL. If that's not guaranteed, the tuple header's natts
	 * has to be checked, and missing attributes potentially have to be
	 * fetched (using slot_getmissingattrs()).
	 */

should make that clearer?

Greetings, Andres Freund
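The attnotnull/atthasmissing interaction that guaranteed_column_number depends on can also be seen from SQL, which may help when reading the comment; the table and column names below are made up purely for illustration. A column added with a non-null default is marked NOT NULL, yet tuples written before the ALTER physically lack it (their natts stays small), so only columns without atthasmissing can anchor the guarantee:

  CREATE TABLE t (a int NOT NULL);
  INSERT INTO t VALUES (1);                            -- this tuple keeps natts = 1 on disk
  ALTER TABLE t ADD COLUMN b int NOT NULL DEFAULT 42;  -- no table rewrite; atthasmissing becomes true
  SELECT attname, attnotnull, atthasmissing
  FROM pg_attribute
  WHERE attrelid = 't'::regclass AND attnum > 0;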
Optimizer items in the release notes
We had a discussion in October about adding more optimizer items to the release notes: https://www.postgresql.org/message-id/flat/20181010220601.GA7807%40momjian.us#11d805ea0b0fcd0552dfa99251417cc1 There was no agreement on a change, but if people want to propose a change, please post here and we can discuss it. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Trouble with FETCH_COUNT and combined queries in psql
Hi,

When FETCH_COUNT is set, queries combined in a single request don't work as expected:

  \set FETCH_COUNT 10
  select pg_sleep(2) \; select 1;

No result is displayed, the pg_sleep(2) is not run, and no error is shown. That's disconcerting. The sequence that is sent under the hood is:

  #1 BEGIN
  #2 DECLARE _psql_cursor NO SCROLL CURSOR FOR
     select pg_sleep(2) ; select 1;
  #3 CLOSE _psql_cursor
  #4 ROLLBACK

The root problem is in deciding that a statement can be run through a cursor if the query text starts with "select" or "values" (in is_select_command() in common.c), but not knowing about multiple queries in the buffer, which are not compatible with the cursor approach.

When sending #2, psql expects the PQexec("DECLARE...") to yield a PGRES_COMMAND_OK, but it gets a PGRES_TUPLES_OK instead. Given that, it abandons the cursor, rolls back the transaction (if it opened it), and clears out the results of the second select without displaying them. If there was already a transaction open, the problem is worse because it doesn't roll back and we're silently missing an SQL statement that was possibly meant to change the state of the data, as in:

  BEGIN;
  SELECT compute_something() \;
  select get_results();
  END;

Does anyone have thoughts about how to fix this? ATM I don't see a plausible fix that does not involve the parser storing the information that it's a multiple-query command and passing it down somehow to is_select_command(). Or a more modern approach could be to give up on the cursor-based method in favor of PQsetSingleRowMode(). That might be too big a change for a bug fix though.

Best regards,

-- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
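The mismatch can also be reproduced by hand, outside the FETCH_COUNT machinery, which shows where it comes from; this is just an illustrative session with FETCH_COUNT left unset:

  BEGIN;
  DECLARE _psql_cursor NO SCROLL CURSOR FOR SELECT pg_sleep(2) \; SELECT 1;
  -- The server parses two statements here: the DECLARE (wrapping only the
  -- pg_sleep query) and the bare SELECT 1, so the single PQexec() reports the
  -- status of the last one, PGRES_TUPLES_OK, rather than the PGRES_COMMAND_OK
  -- that the cursor-based code path expects.
  FETCH FORWARD 10 FROM _psql_cursor;  -- the cursor itself does exist, and reads from pg_sleep(2)
  ROLLBACK;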
Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Hi, On 2019-04-22 18:49:44 +0530, Amit Kapila wrote: > Attached is a hacky and work-in-progress patch to move fsm map to > relcache. This will give you some idea. I think we need to see if > there is a need to invalidate the relcache due to this patch. I think > some other comments of Andres also need to be addressed, see if you > can attempt to fix some of them. > /* > @@ -1132,9 +1110,6 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > BlockNumber blkno, > cached_target_block; > > - /* The local map must not be set already. */ > - Assert(!FSM_LOCAL_MAP_EXISTS); > - > /* >* Starting at the current last block in the relation and working >* backwards, mark alternating blocks as available. > @@ -1142,7 +1117,7 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > blkno = cur_nblocks - 1; > while (true) > { > - fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL; > + rel->fsm_local_map->map[blkno] = FSM_LOCAL_AVAIL; > if (blkno >= 2) > blkno -= 2; > else > @@ -1150,13 +1125,13 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > } > > /* Cache the number of blocks. */ > - fsm_local_map.nblocks = cur_nblocks; > + rel->fsm_local_map->nblocks = cur_nblocks; > > /* Set the status of the cached target block to 'unavailable'. */ > cached_target_block = RelationGetTargetBlock(rel); > if (cached_target_block != InvalidBlockNumber && > cached_target_block < cur_nblocks) > - fsm_local_map.map[cached_target_block] = FSM_LOCAL_NOT_AVAIL; > + rel->fsm_local_map->map[cached_target_block] = > FSM_LOCAL_NOT_AVAIL; > } I think there shouldn't be any need for this anymore. After this change we shouldn't invalidate the map until there's no space on it - thereby addressing the cost overhead, and greatly reducing the likelihood that the local FSM can lead to increased bloat. > /* > @@ -1168,18 +1143,18 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks) > * This function is used when there is no FSM. > */ > static BlockNumber > -fsm_local_search(void) > +fsm_local_search(Relation rel) > { > BlockNumber target_block; > > /* Local map must be set by now. */ > - Assert(FSM_LOCAL_MAP_EXISTS); > + Assert(FSM_LOCAL_MAP_EXISTS(rel)); > > - target_block = fsm_local_map.nblocks; > + target_block = rel->fsm_local_map->nblocks; > do > { > target_block--; > - if (fsm_local_map.map[target_block] == FSM_LOCAL_AVAIL) > + if (rel->fsm_local_map->map[target_block] == FSM_LOCAL_AVAIL) > return target_block; > } while (target_block > 0); > > @@ -1189,7 +1164,22 @@ fsm_local_search(void) >* first, which would otherwise lead to the same conclusion again and >* again. >*/ > - FSMClearLocalMap(); > + fsm_clear_local_map(rel); I'm not sure I like this. My inclination would be that we should be able to check the local fsm repeatedly even if there's no space in the in-memory representation - otherwise the use of the local FSM increases bloat. Greetings, Andres Freund
Re: block-level incremental backup
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost wrote: > > Having been around for a while working on backup-related things, if I > > was to implement the protocol for pg_basebackup today, I'd definitely > > implement "give me a list" and "give me this file" rather than the > > tar-based approach, because I've learned that people want to be > > able to do parallel backups and that's a decent way to do that. I > > wouldn't set out and implement something new that there's just no hope > > of making parallel. Maybe the first write of pg_basebackup would still > > be simple and serial since it's certainly more work to make a frontend > > tool like that work in parallel, but at least the protocol would be > > ready to support a parallel option being added later without being > > rewritten. > > > > And that's really what I was trying to get at here- if we've got the > > choice now to decide what this is going to look like from a protocol > > level, it'd be great if we could make it able to support being used in a > > parallel fashion, even if pg_basebackup is still single-threaded. > > I think we're getting closer to a meeting of the minds here, but I > don't think it's intrinsically necessary to rewrite the whole method > of operation of pg_basebackup to implement incremental backup in a > sensible way. It wasn't my intent to imply that the whole method of operation of pg_basebackup would have to change for this. > One could instead just do a straightforward extension > to the existing BASE_BACKUP command to enable incremental backup. Ok, how do you envision that? As I mentioned up-thread, I am concerned that we're talking too high-level here and it's making the discussion more difficult than it would be if we were to put together specific ideas and then discuss them. One way I can imagine to extend BASE_BACKUP is by adding LSN as an optional parameter and then having the database server scan the entire cluster and send a tarball which contains essentially a 'diff' file of some kind for each file where we can construct a diff based on the LSN, and then the complete contents of the file for everything else that needs to be in the backup. So, sure, that would work, but it wouldn't be able to be parallelized and I don't think it'd end up being very exciting for the external tools because of that, but it would be fine for pg_basebackup. On the other hand, if you added new commands for 'list of files changed since this LSN' and 'give me this file' and 'give me this file with the changes in it since this LSN', then pg_basebackup could work with that pretty easily in a single-threaded model (maybe with two connections to the backend, but still in a single process, or maybe just by slurping up the file list and then asking for each one) and the external tools could leverage those new capabilities too for their backups, both full backups and incremental ones. This also wouldn't have to change how pg_basebackup does full backups today one bit, so what we're really talking about here is the direction to take the new code that's being written, not about rewriting existing code. I agree that it'd be a bit more work...
> Then, to enable parallel full backup and all sorts of out-of-core > hacking, one could expand the command language to allow tools to > access individual steps: START_BACKUP, SEND_FILE_LIST, > SEND_FILE_CONTENTS, STOP_BACKUP, or whatever. The second thing makes > for an appealing project, but I do not think there is a technical > reason why it has to be done first. Or for that matter why it has to > be done second. As I keep saying, incremental backup and full backup > are separate projects and I believe it's completely reasonable for > whoever is doing the work to decide on the order in which they would > like to do the work. I didn't mean to imply that one had to be done before the other from a technical standpoint. I agree that they don't depend on each other. You're certainly welcome to do what you would like, I simply wanted to share my experiences and try to help move this in a direction that would involve less code rewrite in the future and to have a feature that would be more appealing to the external tools. > Having said that, I'm curious what people other than Stephen (and > other pgbackrest hackers) While David and I do talk, we haven't really discussed this proposal all that much, so please don't assume that he shares my thoughts here. I'd also like to hear what others think, particularly those who have been working in this area. > think about the relative value of parallel > backup vs. incremental backup. Stephen appears quite convinced that > parallel backup is full of win and incremental backup is a bit of a > yawn by comparison, and while I certainly would not want to di
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian wrote: > I assumed the modblock files would be stored in the WAL archive so some > external tools could generate incremental backups using just the WAL > files. I assumed they would also be sent to standby servers so > incremental backups could be done on standby servers too. Yeah, that's another possible approach. I am not sure what is best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian wrote: > > I assumed the modblock files would be stored in the WAL archive so some > > external tools could generate incremental backups using just the WAL > > files. I assumed they would also be sent to standby servers so > > incremental backups could be done on standby servers too. > > Yeah, that's another possible approach. I am not sure what is best. I am thinking you need to allow any of these, and putting the WAL files in pg_wal and having them streamed and archived gives that flexibility. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost wrote: > This seems like it would be helpful for global indexes as well, wouldn't > it?

Yes, though that should probably work by reusing what we already do with heap TID (use standard IndexTuple fields on the leaf level for heap TID), plus an additional identifier for the partition number that is located at the physical end of the tuple. IOW, I think that this might benefit from a design that is half way between what we already do with heap TIDs and what we would be required to do to make varwidth logical row identifiers in tables work -- the partition number is varwidth, though often only a single byte.

> I agree with trying to avoid having padding 'in the wrong place' and if > it makes some indexes smaller, great, even if they're unlikely to be > interesting in the vast majority of cases, they may still exist out > there. Of course, this is provided that it doesn't overly complicate > the code, but it sounds like it wouldn't be too bad in this case.

Here is what it took:

* Removed the "conservative" MAXALIGN() within index_form_tuple(), bringing it in line with heap_form_tuple(), which only MAXALIGN()s so that the first attribute in the tuple's data area can safely be accessed on alignment-picky platforms, but doesn't do the same with data_len.

* Removed most of the MAXALIGN()s from nbtinsert.c, except one that considers if a page split is required.

* Didn't change the nbtsplitloc.c code, because we need to assume MAXALIGN()'d space quantities there. We continue to not trust the reported tuple length to be MAXALIGN()'d, which is now essential rather than just defensive.

* Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the whole tuple size in the case where the new pivot tuple requires a heap TID representation. We access TIDs as three 2-byte integers, so this is necessary for alignment-picky platforms.

I will pursue this as a project for PostgreSQL 13. It doesn't affect on-disk compatibility, because BTreeTupleGetHeapTID() works just as well with either the existing scheme, or this new one. Having the "real" tuple length available will make it easier to implement "true" suffix truncation, where we truncate *within* a text attribute (i.e. generate a new, shorter value using new opclass infrastructure).

-- Peter Geoghegan
Re: clean up docs for v12
Andres Freund writes: > On 2019-04-22 12:33:24 -0400, Tom Lane wrote: >> But TBH, now that I look at the code, I think the entire optimization >> is a bad idea and should be removed. Am I right in thinking that the >> presence of a wrong attnotnull marker could cause the generated code to >> actually crash, thanks to not checking the tuple's natts field? I don't >> have enough faith in our enforcement of those constraints to want to see >> JIT taking that risk to save a nanosecond or two. > It's not a minor optimization, it's very measurable.

Doesn't matter, if it's unsafe.

>> (Possibly I'd not think this if I weren't fresh off a couple of days >> with my nose in the ALTER TABLE SET NOT NULL code. But right now, >> I think that believing that that code does not and never will have >> any bugs is just damfool.) > But there's plenty places where we rely on NOT NULL actually working?

I do not think there are any other places where we make this particular assumption. Given the number of ways in which we rely on there being natts checks to avoid rewriting tables, I'm very afraid of the idea that JIT is making more assumptions than the mainline code does.

In hopes of putting some fear into you too, I exhibit the following behavior, which is not a bug according to our current definitions:

regression=# create table pp(f1 int);
CREATE TABLE
regression=# create table cc() inherits (pp);
CREATE TABLE
regression=# insert into cc values(1);
INSERT 0 1
regression=# insert into cc values(2);
INSERT 0 1
regression=# insert into cc values(null);
INSERT 0 1
regression=# alter table pp add column f2 text;
ALTER TABLE
regression=# alter table pp add column f3 text;
ALTER TABLE
regression=# alter table only pp alter f3 set not null;
ALTER TABLE
regression=# select * from pp;
 f1 | f2 | f3
----+----+----
  1 |    |
  2 |    |
    |    |
(3 rows)

The tuples coming out of cc will still have natts = 1, I believe. If they were deformed according to pp's tupdesc, there'd be a problem. Now, we shouldn't do that, because this is not the only possible discrepancy between parent and child tupdescs --- but I think this example shows that attnotnull is a lot spongier than you are assuming, even without considering the possibility of outright bugs.

regards, tom lane
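A catalog query along these lines (shown only for illustration) makes the discrepancy explicit: after the ALTER TABLE ONLY, the parent's f3 is marked attnotnull while the child's is not, even though a scan of pp returns cc's tuples:

  SELECT attrelid::regclass AS rel, attname, attnotnull
  FROM pg_attribute
  WHERE attrelid IN ('pp'::regclass, 'cc'::regclass)
    AND attnum > 0
  ORDER BY attrelid, attnum;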
Re: Do CustomScan as not projection capable node
On 22/04/2019 18:40, Robert Haas wrote:
> On Fri, Apr 19, 2019 at 12:45 AM Tom Lane wrote:
> > I don't buy this for a minute. Where do you think projection is
> > going to happen? There isn't any existing node type that *couldn't*
> > support projection if we insisted that that be done across-the-board.
> > I think it's mostly just a legacy thing that some don't.
>
> I think there may actually be some good reasons for that. If something
> like an Append or Material node projects, it seems to me that this means
> that we built the wrong tlist for its input(s). That justification
> doesn't apply to custom scans, though.

The main reason for my question was the incomplete information about the custom_scan_tlist / fdw_scan_tlist parameter. While testing my custom node, I ran into an error in setrefs.c caused by optimization of the projection operation. To understand reliably how to use custom_scan_tlist properly, I had to study the mechanics of the FDW plan generation in detail, and now the problem is solved.

There are only three references to this parameter on the hackers mailing list, a brief mention on postgresql.org, and limited comments in two patches: 1a8a4e5 and e7cb7ee. It may be that custom_scan_tlist is more intricate than it needs to be, and in any case it could use some comments describing in more detail how to use it.

-- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 9:35 AM Andres Freund wrote: > I, more generally, wonder if there's not a case to squeeze out more > padding than "just" what you describe (since we IIRC don't frequently > keep pointers into such tuples anyway, and definitely don't for byval > attrs). But that's very likely better done separately. There is one way that that is kind of relevant here. The current requirement seems to be that *any* sort of tuple header be MAXALIGN()'d, because in the worst case the first attribute needs to be accessed at a MAXALIGN()'d boundary on an alignment-picky platform. That isn't so bad today, since we usually find a reasonably good way to put those 8 bytes (or 23/24 bytes in the case of heap tuples) to use. However, with varwidth table identifiers, the only use of those 8 bytes that I can think of is the offset to the identifier (or perhaps its length), plus the usual t_info stuff. We'd almost invariably waste 4 or 5 bytes, which seems like a problem to me. -- Peter Geoghegan
Re: clean up docs for v12
Andres Freund writes: > The computation of that variable above has: >* If the column is possibly missing, we can't rely on its (or >* subsequent) NOT NULL constraints to indicate minimum > attributes in >* the tuple, so stop here. >*/ > if (att->atthasmissing) > break; BTW, why do we have to stop? ISTM that a not-null column without atthasmissing is enough to prove this, regardless of the state of prior columns. (This is assuming that you trust attnotnull for this, which as I said I don't, but that's not relevant to this question.) I wonder also if it wouldn't be smart to explicitly check that the "guaranteeing" column is not attisdropped. > I think just reformulating that to something like > /* >* Check if it's guaranteed that all the desired attributes are > available >* in the tuple (but still possibly NULL), by dint of either the last >* to-be-deformed column being NOT NULL, or subsequent ones not accessed >* here being NOT NULL. If that's not guaranteed the tuple headers > natt's >* has to be checked, and missing attributes potentially have to be >* fetched (using slot_getmissingattrs(). > */ > should make that clearer? OK by me. regards, tom lane
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
Greetings, * Peter Geoghegan (p...@bowt.ie) wrote: > On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost wrote: > > This seems like it would be helpful for global indexes as well, wouldn't > > it? > > Yes, though that should probably work by reusing what we already do > with heap TID (use standard IndexTuple fields on the leaf level for > heap TID), plus an additional identifier for the partition number that > is located at the physical end of the tuple. IOW, I think that this > might benefit from a design that is half way between what we already > do with heap TIDs and what we would be required to do to make varwidth > logical row identifiers in tables work -- the partition number is > varwidth, though often only a single byte. Yes, global indexes for partitioned tables could potentially be simpler than the logical row identifiers, but maybe it'd be useful to just have one implementation based around logical row identifiers which ends up working for global indexes as well as the other types of indexes and table access methods. If we thought that the 'single-byte' partition number covered enough use-cases then we could possibly consider supporting them for partitions by just 'stealing' a byte from BlockIdData and having the per-partition size be limited to 4TB when a global index exists on the partitioned table. That's certainly not an ideal limitation but it might be appealing to some users who really would like global indexes and could possibly require less to implement, though there's a lot of other things that would have to be done to have global indexes. Anyhow, just some random thoughts that I figured I'd share in case there might be something there worth thinking about. > > I agree with trying to avoid having padding 'in the wrong place' and if > > it makes some indexes smaller, great, even if they're unlikely to be > > interesting in the vast majority of cases, they may still exist out > > there. Of course, this is provided that it doesn't overly complicate > > the code, but it sounds like it wouldn't be too bad in this case. > > Here is what it took: > > * Removed the "conservative" MAXALIGN() within index_form_tuple(), > bringing it in line with heap_form_tuple(), which only MAXALIGN()s so > that the first attribute in tuple's data area can safely be accessed > on alignment-picky platforms, but doesn't do the same with data_len. > > * Removed most of the MAXALIGN()s from nbtinsert.c, except one that > considers if a page split is required. > > * Didn't change the nbtsplitloc.c code, because we need to assume > MAXALIGN()'d space quantities there. We continue to not trust the > reported tuple length to be MAXALIGN()'d, which is now essentially > rather than just defensive. > > * Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the > whole tuple size in the case where new pivot tuple requires a heap TID > representation. We access TIDs as 3 2 byte integers, so this is > necessary for alignment-picky platforms. > > I will pursue this as a project for PostgreSQL 13. It doesn't affect > on-disk compatibility, because BTreeTupleGetHeapTID() works just as > well with either the existing scheme, or this new one. Having the > "real" tuple length available will make it easier to implement "true" > suffix truncation, where we truncate *within* a text attribute (i.e. > generate a new, shorter value using new opclass infrastructure). This sounds pretty good to me, though I'm not nearly as close to the code there as you are. Thanks! Stephen signature.asc Description: PGP signature
Re: block-level incremental backup
Hi, On 2019-04-19 20:04:41 -0400, Stephen Frost wrote: > I agree that we don't want another implementation and that there's a lot > that we want to do to improve replay performance. We've already got > frontend tools which work with multiple execution threads, so I'm not > sure I get the "not easily feasible" bit, and the argument about the > checkpointer seems largely related to that (as in- if we didn't have > multiple threads/processes then things would perform quite badly... but > we can and do have multiple threads/processes in frontend tools today, > even in pg_basebackup). You need not just multiple execution threads, but basically a new implementation of shared buffers, locking, process monitoring, with most of the related infrastructure. You're literally talking about reimplementing a very substantial portion of the backend. I'm not sure I can transport in written words - via a public medium - how bad an idea it would be to go there. > You certainly bring up some good concerns though and they make me think > of other bits that would seem like they'd possibly be larger issues for > a frontend tool- like having a large pool of memory for cacheing (aka > shared buffers) the changes. If what we're talking about here is *just* > replay though, without having the system available for reads, I wonder > if we might want a different solution there. No. > > Which I think is entirely reasonable. With the 'consistent' and LSN > > recovery targets one already can get most of what's needed from such a > > tool, anyway. I'd argue the biggest issue there is that there's no > > equivalent to starting postgres with a private socket directory on > > windows, and perhaps an option or two making it easier to start postgres > > in a "private" mode for things like this. > > This would mean building in a way to do parallel WAL replay into the > server binary though, as discussed above, and it seems like making that > work in a way that allows us to still be available as a read-only > standby would be quite a bit more difficult. We could possibly support > parallel WAL replay only when we aren't a replica but from the same > binary. I'm doubtful that we should try to implement parallel WAL apply that can't support HS - a substantial portion of the the logic to avoid issues around relfilenode reuse, consistency etc is going to be to be necessary for non-HS aware apply anyway. But if somebody had a concrete proposal for something that's fundamentally only doable without HS, I could be convinced. > The concerns mentioned about making it easier to start PG in a > private mode don't seem too bad but I am not entirely sure that the > tools which want to leverage that kind of capability would want to have > to exec out to the PG binary to use it. Tough luck. But even leaving infeasability aside, it seems like a quite bad idea to do this in-process inside a tool that manages backup & recovery. Creating threads / sub-processes with complicated needs (like any pared down version of pg to do just recovery would have) from within a library has substantial complications. So you'd not want to do this in-process anyway. > A lot of this part of the discussion feels like a tangent though, unless > I'm missing something. I'm replying to: On 2019-04-17 18:43:10 -0400, Stephen Frost wrote: > Wow. I have to admit that I feel completely opposite of that- I'd > *love* to have an independent tool (which ideally uses the same code > through the common library, or similar) that can be run to apply WAL. 
And I'm basically saying that anything that starts from this premise is fatally flawed (in the ex falso quodlibet kind of sense ;)). > The "WAL compression" tool contemplated > previously would be much simpler and not the full-blown WAL replay > capability, which would be left to the server, unless you're suggesting > that even that should be exclusively the purview of the backend? Though > that ship's already sailed, given that external projects have > implemented it. I'm extremely doubtful of such tools (but it's not what I was responding too, see above). I'd be extremely surprised if even one of them came close to being correct. The old FPI removal tool had data corrupting bugs left and right. > Having a library to provide that which external > projects could leverage would be nicer than having everyone write their > own version. No, I don't think that's necessarily true. Something complicated that's hard to get right doesn't have to be provided by core. Even if other projects decide that their risk/reward assesment is different than core postgres'. We don't have to take on all kind of work and complexity for external tools. Greetings, Andres Freund
Re: Trouble with FETCH_COUNT and combined queries in psql
Bonjour Daniel, When FETCH_COUNT is set, queries combined in a single request don't work as expected: \set FETCH_COUNT 10 select pg_sleep(2) \; select 1; No result is displayed, the pg_sleep(2) is not run, and no error is shown. That's disconcerting. Indeed. Does anyone have thoughts about how to fix this? ATM I don't see a plausible fix that does not involve the parser storing the information that it's a multiple-query command and passing it down somehow to is_select_command(). The lexer (not parser) is called by psql to know where the query stops (i.e. waiting for ";"), so it could indeed know whether there are several queries. I added some stuff to extract embedded "\;" for pgbench "\cset", which has been removed though, but it is easy to add back a detection of "\;", and also to detect the select. If the position of the last select is known, the cursor can be declared in the right place, which would also solve the problem. Or a more modern approach could be to give up on the cursor-based method in favor of PQsetSingleRowMode(). Hmmm. I'm not sure that the row count is available under this mode? ISTM that the FETCH_COUNT stuff should really batch the fetching of results by this amount. I'm not sure about the 1-by-1 row approach. -- Fabien.
Re: block-level incremental backup
On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost wrote: > > I think we're getting closer to a meeting of the minds here, but I > > don't think it's intrinsically necessary to rewrite the whole method > > of operation of pg_basebackup to implement incremental backup in a > > sensible way. > > It wasn't my intent to imply that the whole method of operation of > pg_basebackup would have to change for this. Cool. > > One could instead just do a straightforward extension > > to the existing BASE_BACKUP command to enable incremental backup. > > Ok, how do you envision that? As I mentioned up-thread, I am concerned > that we're talking too high-level here and it's making the discussion > more difficult than it would be if we were to put together specific > ideas and then discuss them. > > One way I can imagine to extend BASE_BACKUP is by adding LSN as an > optional parameter and then having the database server scan the entire > cluster and send a tarball which contains essentially a 'diff' file of > some kind for each file where we can construct a diff based on the LSN, > and then the complete contents of the file for everything else that > needs to be in the backup. /me scratches head. Isn't that pretty much what I described in my original post? I even described what that "'diff' file of some kind" would look like in some detail in the paragraph of that email numbered "2.", and I described the reasons for that choice at length in http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com I can't figure out how I'm managing to be so unclear about things about which I thought I'd been rather explicit. > So, sure, that would work, but it wouldn't be able to be parallelized > and I don't think it'd end up being very exciting for the external tools > because of that, but it would be fine for pg_basebackup. Stop being such a pessimist. Yes, if we only add the option to the BASE_BACKUP command, it won't directly be very exciting for external tools, but a lot of the work that is needed to do things that ARE exciting for external tools will have been done. For instance, if the work to figure out which blocks have been modified via WAL-scanning gets done, and initially that's only exposed via BASE_BACKUP, it won't be much work for somebody to write code that exposes that information directly through some new replication command. There's a difference between something that's going in the wrong direction and something that's going in the right direction but not as far or as fast as you'd like. And I'm 99% sure that everything I'm proposing here falls in the latter category rather than the former. > On the other hand, if you added new commands for 'list of files changed > since this LSN' and 'give me this file' and 'give me this file with the > changes in it since this LSN', then pg_basebackup could work with that > pretty easily in a single-threaded model (maybe with two connections to > the backend, but still in a single process, or maybe just by slurping up > the file list and then asking for each one) and the external tools could > leverage those new capabilities too for their backups, both full backups > and incremental ones. This also wouldn't have to change how > pg_basebackup does full backups today one bit, so what we're really > talking about here is the direction to take the new code that's being > written, not about rewriting existing code. I agree that it'd be a bit > more work...
> but hopefully not *that* much more, and it would mean we > could later add parallel backup to pg_basebackup more easily too, if we > wanted to. For purposes of implementing parallel pg_basebackup, it would probably be better if the server rather than the client decided which files to send via which connection. If the client decides, then every time the server finishes sending a file, the client has to request another file, and that introduces some latency: after the server finishes sending each file, it has to wait for the client to finish receiving the data, and it has to wait for the client to tell it what file to send next. If the server decides, then it can just send data at top speed without a break. So the ideal interface for pg_basebackup would really be something like: START_PARALLEL_BACKUP blah blah PARTICIPANTS 4; ...returning a cookie that can then be used by each participant as an argument to a new command: JOIN_PARALLEL_BACKUP 'cookie'; However, that is obviously extremely inconvenient for third-party tools. It's possible we need both an interface like this -- for use by parallel pg_basebackup -- and a START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type interface for use by external tools. On the other hand, maybe the additional overhead caused by managing the list of files to be fetched on the client side is negligible. It'd be interesting to see, though, how busy the server is when running an incremental backup managed by an external
Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation
On Mon, Apr 22, 2019 at 10:32 AM Stephen Frost wrote: > Yes, global indexes for partitioned tables could potentially be simpler > than the logical row identifiers, but maybe it'd be useful to just have > one implementation based around logical row identifiers which ends up > working for global indexes as well as the other types of indexes and > table access methods. Maybe so. I think that we'd have to actually try it out to know for sure. > If we thought that the 'single-byte' partition number covered enough > use-cases then we could possibly consider supporting them for partitions > by just 'stealing' a byte from BlockIdData and having the per-partition > size be limited to 4TB when a global index exists on the partitioned > table. I don't think that that would make it any easier to implement. > This sounds pretty good to me, though I'm not nearly as close to the > code there as you are. I'm slightly concerned that I may have broken an index_form_tuple() caller from some other access method, but they all seem to not trust index_form_tuple() to have MAXALIGN()'d on their behalf, just like nbtree (nbtree won't when I'm done, though, because it will actively try to preserve the "real" tuple size). It's convenient to me that no caller seems to rely on the index_form_tuple() MAXALIGN() that I want to remove. -- Peter Geoghegan
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hey! OK - thank you for the update and the explanation. My reasoning in this case is - if we allow the any[] type to only be passed to other functions that accept any[], and disallow any kind of other operations on this array (such as retrieving its elements or modifying it), I do not yet see any places where it might introduce a performance regression. These arguments will literally be pass-only, and since we are unable to interact with them in any other way, there will be no possibility of type mismatches and therefore for performance penalties. This approach puts all the heavy work on the plpgsql compiler - it will need to ensure that, if there is a any[] or VARIADIC any variable in a function arglist, it must NOT be accessed in any way, and can only be passed to other functions which accept any[] or VARIADIC any. BR ~phoe On 22.04.2019 12:09, Pavel Stehule wrote: > Hi > > po 22. 4. 2019 v 11:27 odesílatel Michał "phoe" Herda > mailto:p...@disroot.org>> napsal: > > Hey everyone, > > I am writing a plpgsql function that (to greatly simplify) raises an > exception with a formatted* message. Ideally, I should be able to call > it with raise_exception('The person %I has only %I bananas.', 'Fred', > 8), which mimics the format(text, any[]) calling convention. > > Here is where I have encountered a limitation of PostgreSQL's design: > https://www.postgresql.org/docs/11/datatype-pseudo.html mentions > explicitly that, "At present most procedural languages forbid use of a > pseudo-type as an argument type". > > My reasoning is that I should be able to accept a value of some > type if > all I do is passing it to a function that accepts exactly that type, > such as format(text, any[]). Given the technical reality, I assume > that > I wouldn't be able to do anything else with that value, but that is > fine, since I don't have to do anything with it regardless. > > BR > Michał "phoe" Herda > > *I do not want to use the obvious solution of > raise_exception(format(...)) because the argument to that function is > the error ID that is then looked up in a table from which the error > message and sqlstate are retrieved. My full code is in the > attached SQL > file. Once it is executed: > > SELECT gateway_error('user_does_not_exist', '2'); -- works but is > unnatural, > SELECT gateway_error('user_does_not_exist', 2); -- is natural but > doesn't work. > > > It is known problem, and fix is not easy. > > Any expressions inside plpgsql are simple queries like SELECT expr, > and they are executed same pipeline like queries. > > The plans of these queries are stored and reused. Originally these > plans disallow any changes, now some changes are supported, but > parameters should be same all time. This is ensured by disallowing > "any" type. > > Other polymorphic types are very static, so there is not described risk. > > Probably some enhancement can be in this are. The plan can be > re-planed after some change - but it can has lot of performance > impacts. It is long open topic. Some changes in direction to dynamic > languages can increase cost of some future optimization to higher > performance :-(. > > Regards > > Pavel > > > >
Re: [PATCH v1] Add \echo_stderr to psql
\warn ... \warning ... These two seem about the best to me, drawing from the perl warn command. Yep, I was thinking of perl & gmake. Maybe the 4-letter option is better because it's the same length as "echo". I suppose we could go the bash &2 route here, but I don't want to. I agree on this one. Please find attached v2; the name is now \warn. How might we test this portably? TAP testing? See pgbench, which has a TAP test that can check stdout & stderr by calling the utility command_checks_all; the same could be done with psql. -- Fabien.
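For readers following along, here is a minimal sketch of what such a TAP test would need to exercise, assuming the v2 patch's \warn behaves like \echo except that it writes to stderr (only the proposed meta-command and existing psql features are used):

    \echo this line should appear on stdout
    \warn this line should appear on stderr

Running the script as, say, psql -X -f script.sql 1>out.txt 2>err.txt and checking that the \warn text shows up only on stderr is exactly the kind of stdout/stderr assertion that command_checks_all already makes in the pgbench TAP tests.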
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Hi po 22. 4. 2019 v 19:20 odesílatel Michał "phoe" Herda napsal: > Hey! > > OK - thank you for the update and the explanation. > > My reasoning in this case is - if we allow the any[] type to only be > passed to other functions that accept any[], and disallow any kind of other > operations on this array (such as retrieving its elements or modifying it), > I do not yet see any places where it might introduce a performance > regression. These arguments will literally be pass-only, and since we are > unable to interact with them in any other way, there will be no possibility > of type mismatches and therefore for performance penalties. > > This approach puts all the heavy work on the plpgsql compiler - it will > need to ensure that, if there is a any[] or VARIADIC any variable in a > function arglist, it must NOT be accessed in any way, and can only be > passed to other functions which accept any[] or VARIADIC any. > PLpgSQL compiler knows nothing about a expressions - the compiler process only plpgsql statements. Expressions are processed at runtime only by SQL parser and executor. It is good to start with plpgsql codes - https://github.com/postgres/postgres/tree/master/src/pl/plpgsql/src you can see there, so plpgsql is very different from other compilers. It just glue of SQL expressions or queries, that are black box for PLpgSQL compiler and executor. Just any[] is not plpgsql way. For your case you should to use a overloading create or replace function fx(fmt text, par text) returns void as $$ begin raise notice '%', format(fmt, par); end; $$ language plpgsql; create or replace function fx(fmt text, par numeric) returns void as $$ begin raise notice '%', format(fmt, par); end; $$ language plpgsql; There is another limit, you cannot to declare function parameter type that enforce explicit casting can be nice (but it is strange idea) to have some other flags for arguments CREATE OR REPLACE FUNCTION gateway_error(fmt text, par text FORCE EXPLICIT CAST) ... Regards Pavel > BR > ~phoe > On 22.04.2019 12:09, Pavel Stehule wrote: > > Hi > > po 22. 4. 2019 v 11:27 odesílatel Michał "phoe" Herda > napsal: > >> Hey everyone, >> >> I am writing a plpgsql function that (to greatly simplify) raises an >> exception with a formatted* message. Ideally, I should be able to call >> it with raise_exception('The person %I has only %I bananas.', 'Fred', >> 8), which mimics the format(text, any[]) calling convention. >> >> Here is where I have encountered a limitation of PostgreSQL's design: >> https://www.postgresql.org/docs/11/datatype-pseudo.html mentions >> explicitly that, "At present most procedural languages forbid use of a >> pseudo-type as an argument type". >> >> My reasoning is that I should be able to accept a value of some type if >> all I do is passing it to a function that accepts exactly that type, >> such as format(text, any[]). Given the technical reality, I assume that >> I wouldn't be able to do anything else with that value, but that is >> fine, since I don't have to do anything with it regardless. >> >> BR >> Michał "phoe" Herda >> >> *I do not want to use the obvious solution of >> raise_exception(format(...)) because the argument to that function is >> the error ID that is then looked up in a table from which the error >> message and sqlstate are retrieved. My full code is in the attached SQL >> file. 
Once it is executed: >> >> SELECT gateway_error('user_does_not_exist', '2'); -- works but is >> unnatural, >> SELECT gateway_error('user_does_not_exist', 2); -- is natural but >> doesn't work. >> > > It is known problem, and fix is not easy. > > Any expressions inside plpgsql are simple queries like SELECT expr, and > they are executed same pipeline like queries. > > The plans of these queries are stored and reused. Originally these plans > disallow any changes, now some changes are supported, but parameters should > be same all time. This is ensured by disallowing "any" type. > > Other polymorphic types are very static, so there is not described risk. > > Probably some enhancement can be in this are. The plan can be re-planed > after some change - but it can has lot of performance impacts. It is long > open topic. Some changes in direction to dynamic languages can increase > cost of some future optimization to higher performance :-(. > > Regards > > Pavel > > > > > >
Re: clean up docs for v12
Hi, On 2019-04-22 13:27:17 -0400, Tom Lane wrote: > Andres Freund writes: > > The computation of that variable above has: > > > * If the column is possibly missing, we can't rely on its (or > > * subsequent) NOT NULL constraints to indicate minimum > > attributes in > > * the tuple, so stop here. > > */ > > if (att->atthasmissing) > > break; > > BTW, why do we have to stop? ISTM that a not-null column without > atthasmissing is enough to prove this, regardless of the state of prior > columns. (This is assuming that you trust attnotnull for this, which > as I said I don't, but that's not relevant to this question.) Are you wondering if we could also use this kind of logic to infer the length of the null bitmap if there's preceding columns with atthasmissing true as long as there's a later !hasmissing column that's NOT NULL? Right. The logic could be made more powerful - I implemented the above after Andrew's commit of fast-not-null broke JIT (not because of that logic, but because it simply didn't look up the missing columns). I assume it doesn't terribly matter to be fast once attributes after a previously missing one are accessed - it's likely not going to be the hotly accessed data? > I wonder > also if it wouldn't be smart to explicitly check that the "guaranteeing" > column is not attisdropped. Yea, that probably would be smart. I don't think there's an active problem, because we remove NOT NULL when deleting an attribute, but it seems good to be doubly sure / explain why that's safe: /* Remove any NOT NULL constraint the column may have */ attStruct->attnotnull = false; I'm a bit unsure whether to make it an assert, elog(ERROR) or just not assume column presence? Greetings, Andres Freund
Re: Patch: doc for pg_logical_emit_message()
On Fri, Apr 19, 2019 at 3:21 PM Matsumura, Ryo wrote: > > Hi, Hackers > > pg_logical_emit_message() can be used by any user, > but the following document says that it can be used by only superuser. > > > Table 9.88. Replication SQL Functions > > Use of these functions is restricted to superusers. > > I think that pg_logicl_emit_message() should be used by any user. > Therefore, I attach the document patch. Thanks for the patch! Use of not only pg_logical_emit_message() but also other replication functions in Table 9.88 is not restricted to superuser. For example, replication role can execute pg_create_physical_replication_slot(). So I think that the patch should fix also the description for those replication functions. Thought? Regards, -- Fujii Masao
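As a quick illustration of the behaviour being documented (a sketch only; the role and slot names are made up), neither function call below needs superuser:

    -- any user can emit a logical decoding message:
    SELECT pg_logical_emit_message(true, 'my_prefix', 'some payload');

    -- a role with REPLICATION, but not superuser, can create a slot:
    CREATE ROLE backup_user LOGIN REPLICATION;
    -- then, in a session connected as backup_user:
    SELECT pg_create_physical_replication_slot('backup_slot');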
Re: block-level incremental backup
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2019-04-19 20:04:41 -0400, Stephen Frost wrote: > > I agree that we don't want another implementation and that there's a lot > > that we want to do to improve replay performance. We've already got > > frontend tools which work with multiple execution threads, so I'm not > > sure I get the "not easily feasible" bit, and the argument about the > > checkpointer seems largely related to that (as in- if we didn't have > > multiple threads/processes then things would perform quite badly... but > > we can and do have multiple threads/processes in frontend tools today, > > even in pg_basebackup). > > You need not just multiple execution threads, but basically a new > implementation of shared buffers, locking, process monitoring, with most > of the related infrastructure. You're literally talking about > reimplementing a very substantial portion of the backend. I'm not sure > I can transport in written words - via a public medium - how bad an idea > it would be to go there. Yes, there'd be some need for locking and process monitoring, though if we aren't supporting ongoing read queries at the same time, there's a whole bunch of things that we don't need from the existing backend. > > > Which I think is entirely reasonable. With the 'consistent' and LSN > > > recovery targets one already can get most of what's needed from such a > > > tool, anyway. I'd argue the biggest issue there is that there's no > > > equivalent to starting postgres with a private socket directory on > > > windows, and perhaps an option or two making it easier to start postgres > > > in a "private" mode for things like this. > > > > This would mean building in a way to do parallel WAL replay into the > > server binary though, as discussed above, and it seems like making that > > work in a way that allows us to still be available as a read-only > > standby would be quite a bit more difficult. We could possibly support > > parallel WAL replay only when we aren't a replica but from the same > > binary. > > I'm doubtful that we should try to implement parallel WAL apply that > can't support HS - a substantial portion of the the logic to avoid > issues around relfilenode reuse, consistency etc is going to be to be > necessary for non-HS aware apply anyway. But if somebody had a concrete > proposal for something that's fundamentally only doable without HS, I > could be convinced. I'd certainly prefer that we support parallel WAL replay *with* HS, that just seems like a much larger problem, but I'd be quite happy to be told that it wouldn't be that much harder. > > A lot of this part of the discussion feels like a tangent though, unless > > I'm missing something. > > I'm replying to: > > On 2019-04-17 18:43:10 -0400, Stephen Frost wrote: > > Wow. I have to admit that I feel completely opposite of that- I'd > > *love* to have an independent tool (which ideally uses the same code > > through the common library, or similar) that can be run to apply WAL. > > And I'm basically saying that anything that starts from this premise is > fatally flawed (in the ex falso quodlibet kind of sense ;)). I'd just say that it'd be... difficult. :) > > The "WAL compression" tool contemplated > > previously would be much simpler and not the full-blown WAL replay > > capability, which would be left to the server, unless you're suggesting > > that even that should be exclusively the purview of the backend? Though > > that ship's already sailed, given that external projects have > > implemented it. 
> > I'm extremely doubtful of such tools (but it's not what I was responding > too, see above). I'd be extremely surprised if even one of them came > close to being correct. The old FPI removal tool had data corrupting > bugs left and right. I have concerns about it myself, which is why I'd actually really like to see something in core that does it, and does it the right way, that other projects could then leverage (ideally by just linking into the library without having to rewrite what's in core, though that might not be an option for things like WAL-G that are in Go and possibly don't want to link in some C library). > > Having a library to provide that which external > > projects could leverage would be nicer than having everyone write their > > own version. > > No, I don't think that's necessarily true. Something complicated that's > hard to get right doesn't have to be provided by core. Even if other > projects decide that their risk/reward assesment is different than core > postgres'. We don't have to take on all kind of work and complexity for > external tools. No, it doesn't have to be provided by core, but I sure would like it to be and I'd be much more comfortable if it was because then we'd also take care to not break whatever assumptions are made (or to do so in a way that can be detected and/or handled) as new code is written. As discussed above, as long as it isn't provided by core
Re: Do CustomScan as not projection capable node
Andrey Lepikhov writes: > It is possible that custom_scan_tlist is designed too nontrivially, and > it is possible that it needs some comments describing in more detail how > to use it. I totally buy the argument that the custom scan stuff is underdocumented :-(. FWIW, if we did have a use-case showing that somebody would like to make custom scans that can't project, the way to do it would be to add a flag bit showing whether a particular CustomPath/CustomScan could project or not. Not to assume that they all can't. This wouldn't be that much code really, but I'd still like to see a plausible use-case before adding it, because it'd be a small API break for existing CustomPath providers. regards, tom lane
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Robert Haas wrote: > PostgreSQL has historically and very deliberately *not made a > distinction* between "this object is in the default tablespace" and > "this object is in tablespace X which happens to be the default." I > think that it's too late to invent such a distinction for reasons of > backward compatibility -- and if we were going to do it, surely it > would need to exist for both partitioned tables and the partitions > themselves. Otherwise it just produces more strange inconsistencies. Yeah, this is probably right. (I don't think it's the same thing that Tom was saying, though, or at least I didn't understand his argument this way.) I think we can get out of this whole class of problems by forbidding the TABLESPACE clause for partitioned rels from mentioning the database tablespace -- that is, users either mention some *other* tablespace, or partitions follow default_tablespace like everybody else. AFAICS with that restriction this whole problem does not arise, and the patch may become simpler. I'll give it a spin. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
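To make that concrete, a sketch of how the proposed restriction would look from SQL (hypothetical behaviour under the proposal, not what any released version does; ts_other is a made-up tablespace and pg_default is assumed to be the database's default):

    CREATE TABLE p1 (a int) PARTITION BY LIST (a) TABLESPACE ts_other;   -- ok: some *other* tablespace
    CREATE TABLE p2 (a int) PARTITION BY LIST (a) TABLESPACE pg_default; -- would be rejected under the proposal
    CREATE TABLE p3 (a int) PARTITION BY LIST (a);                       -- partitions follow default_tablespace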
Re: clean up docs for v12
Andres Freund writes: > On 2019-04-22 13:27:17 -0400, Tom Lane wrote: >> I wonder >> also if it wouldn't be smart to explicitly check that the "guaranteeing" >> column is not attisdropped. > Yea, that probably would be smart. I don't think there's an active > problem, because we remove NOT NULL when deleting an attribute, but it > seems good to be doubly sure / explain why that's safe: > /* Remove any NOT NULL constraint the column may have */ > attStruct->attnotnull = false; > I'm a bit unsure whether to make it an assert, elog(ERROR) or just not > assume column presence? I'd just make the code look like

    /*
     * If it's NOT NULL then it must be present in every tuple,
     * unless there's a "missing" entry that could provide a non-null
     * value for it.  Out of paranoia, also check !attisdropped.
     */
    if (att->attnotnull &&
        !att->atthasmissing &&
        !att->attisdropped)
        guaranteed_column_number = attnum;

I don't think the extra check is so expensive as to be worth obsessing over. regards, tom lane
Re: pg_dump is broken for partition tablespaces
Hi, On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote: > On 2019-Apr-22, Robert Haas wrote: > > > PostgreSQL has historically and very deliberately *not made a > > distinction* between "this object is in the default tablespace" and > > "this object is in tablespace X which happens to be the default." I > > think that it's too late to invent such a distinction for reasons of > > backward compatibility -- and if we were going to do it, surely it > > would need to exist for both partitioned tables and the partitions > > themselves. Otherwise it just produces more strange inconsistencies. > > Yeah, this is probably right. (I don't think it's the same thing that > Tom was saying, though, or at least I didn't understand his argument > this way.) > > I think we can get out of this whole class of problems by forbidding the > TABLESPACE clause for partitioned rels from mentioning the database > tablespace -- that is, users either mention some *other* tablespace, or > partitions follow default_tablespace like everybody else. AFAICS with > that restriction this whole problem does not arise, and the patch may > become simpler. I'll give it a spin. Why is the obvious answer not to just remove the whole tablespace inheritance behaviour? It's obviously ambiguous and hard to get right. I still don't see any use case that even comes close to making the inheritance useful enough to justify the amount of work (code, tests, bugfixes) and docs that are required. Greetings, Andres Freund
Re: clean up docs for v12
Hi, On 2019-04-22 14:17:48 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2019-04-22 13:27:17 -0400, Tom Lane wrote: > >> I wonder > >> also if it wouldn't be smart to explicitly check that the "guaranteeing" > >> column is not attisdropped. > > > Yea, that probably would be smart. I don't think there's an active > > problem, because we remove NOT NULL when deleting an attribute, but it > > seems good to be doubly sure / explain why that's safe: > > /* Remove any NOT NULL constraint the column may have */ > > attStruct->attnotnull = false; > > I'm a bit unsure whether to make it an assert, elog(ERROR) or just not > > assume column presence? > > I'd just make the code look like > > /* > * If it's NOT NULL then it must be present in every tuple, > * unless there's a "missing" entry that could provide a non-null > * value for it. Out of paranoia, also check !attisdropped. > */ > if (att->attnotnull && > !att->atthasmissing && > !att->attisdropped) > guaranteed_column_number = attnum; > > I don't think the extra check is so expensive as to be worth obsessing > over. Oh, yea, the cost is irrelevant here - it's one-off work basically, and pales in comparison to the cost of JITing. I was more thinking about whether it's worth "escalating" the violation of assumptions. Greetings, Andres Freund
Re: clean up docs for v12
Hi, On 2019-04-22 13:18:18 -0400, Tom Lane wrote: > Andres Freund writes: > >> (Possibly I'd not think this if I weren't fresh off a couple of days > >> with my nose in the ALTER TABLE SET NOT NULL code. But right now, > >> I think that believing that that code does not and never will have > >> any bugs is just damfool.) > > > But there's plenty places where we rely on NOT NULL actually working? > > I do not think there are any other places where we make this particular > assumption. Sure, not exactly the assumption that JITed deforming benefits from, but as far as I can tell, plenty of things would be broken just as well if we allowed NOT NULL columns to not be present (whether "physically" present or present via atthasmissing) for tuples in a table. Fast defaults wouldn't work, Assert(!isnull) checks would fire, primary keys would be broken etc. > In hopes of putting some fear into you too, I exhibit the following > behavior, which is not a bug according to our current definitions:
>
> regression=# create table pp(f1 int);
> CREATE TABLE
> regression=# create table cc() inherits (pp);
> CREATE TABLE
> regression=# insert into cc values(1);
> INSERT 0 1
> regression=# insert into cc values(2);
> INSERT 0 1
> regression=# insert into cc values(null);
> INSERT 0 1
> regression=# alter table pp add column f2 text;
> ALTER TABLE
> regression=# alter table pp add column f3 text;
> ALTER TABLE
> regression=# alter table only pp alter f3 set not null;
> ALTER TABLE
> regression=# select * from pp;
>  f1 | f2 | f3
> ----+----+----
>   1 |    |
>   2 |    |
>     |    |
> (3 rows)
>
> The tuples coming out of cc will still have natts = 1, I believe. > If they were deformed according to pp's tupdesc, there'd be a > problem. Now, we shouldn't do that, because this is not the only > possible discrepancy between parent and child tupdescs --- but > I think this example shows that attnotnull is a lot spongier > than you are assuming, even without considering the possibility > of outright bugs. Unfortunately it doesn't really put the fear into me - given that attribute numbers don't even have to match between inheritance children, making inferences about the length of the NULL bitmap seems peanuts compared to the breakage of using the wrong tupdesc to deform. Greetings, Andres Freund
Re: block-level incremental backup
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost wrote: > > > One could instead just do a straightforward extension > > > to the existing BASE_BACKUP command to enable incremental backup. > > > > Ok, how do you envision that? As I mentioned up-thread, I am concerned > > that we're talking too high-level here and it's making the discussion > > more difficult than it would be if we were to put together specific > > ideas and then discuss them. > > > > One way I can imagine to extend BASE_BACKUP is by adding LSN as an > > optional parameter and then having the database server scan the entire > > cluster and send a tarball which contains essentially a 'diff' file of > > some kind for each file where we can construct a diff based on the LSN, > > and then the complete contents of the file for everything else that > > needs to be in the backup. > > /me scratches head. Isn't that pretty much what I described in my > original post? I even described what that "'diff' file of some kind" > would look like in some detail in the paragraph of that emailed > numbered "2.", and I described the reasons for that choice at length > in > http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com > > I can't figure out how I'm managing to be so unclear about things > about which I thought I'd been rather explicit. There was basically zero discussion about what things would look like at a protocol level (I went back and skimmed over the thread before sending my last email to specifically see if I was going to get this response back..). I get the idea behind the diff file, the contents of which I wasn't getting into above. > > So, sure, that would work, but it wouldn't be able to be parallelized > > and I don't think it'd end up being very exciting for the external tools > > because of that, but it would be fine for pg_basebackup. > > Stop being such a pessimist. Yes, if we only add the option to the > BASE_BACKUP command, it won't directly be very exciting for external > tools, but a lot of the work that is needed to do things that ARE > exciting for external tools will have been done. For instance, if the > work to figure out which blocks have been modified via WAL-scanning > gets done, and initially that's only exposed via BASE_BACKUP, it won't > be much work for somebody to write code for a new code that exposes > that information directly through some new replication command. > There's a difference between something that's going in the wrong > direction and something that's going in the right direction but not as > far or as fast as you'd like. And I'm 99% sure that everything I'm > proposing here falls in the latter category rather than the former. I didn't mean to imply that you're doing in the wrong direction here and I thought I said somewhere in my last email more-or-less exactly the same, that a great deal of the work needed for block-level incremental backup would be done, but specifically that this proposal wouldn't allow external tools to leverage that. It sounds like what you're suggesting now is that you're happy to implement the backend code, expose it in a way that works just for pg_basebackup, and that if someone else wants to add things to the protocol to make it easier for external tools to leverage, great. 
All I can say is that that's basically how we ended up in the situation we're in today where pg_basebackup doesn't support parallel backup but a bunch of external tools do and they don't go through the backend to get there, even though they'd probably prefer to. > > On the other hand, if you added new commands for 'list of files changed > > since this LSN' and 'give me this file' and 'give me this file with the > > changes in it since this LSN', then pg_basebackup could work with that > > pretty easily in a single-threaded model (maybe with two connections to > > the backend, but still in a single process, or maybe just by slurping up > > the file list and then asking for each one) and the external tools could > > leverage those new capabilities too for their backups, both full backups > > and incremental ones. This also wouldn't have to change how > > pg_basebackup does full backups today one bit, so what we're really > > talking about here is the direction to take the new code that's being > > written, not about rewriting existing code. I agree that it'd be a bit > > more work... but hopefully not *that* much more, and it would mean we > > could later add parallel backup to pg_basebackup more easily too, if we > > wanted to. > > For purposes of implementing parallel pg_basebackup, it would probably > be better if the server rather than the client decided which files to > send via which connection. If the client decides, then every time the > server finishes sending a file, the client has to request another > file, and that introduces some latency: after the server finishes > sending each file, it has
Re: pg_dump is broken for partition tablespaces
Andres Freund writes: > On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote: >> I think we can get out of this whole class of problems by forbidding the >> TABLESPACE clause for partitioned rels from mentioning the database >> tablespace -- that is, users either mention some *other* tablespace, or >> partitions follow default_tablespace like everybody else. AFAICS with >> that restriction this whole problem does not arise, and the patch may >> become simpler. I'll give it a spin. > Why is the obvious answer is to not just remove the whole tablespace > inheritance behaviour? It's obviously ambiguous and hard to get right. > I still don't see any usecase that even comes close to making the > inheritance useful enough to justify the amount of work (code, tests, > bugfixes) and docs that are required. Yeah, that's where I'm at as well. Alvaro's proposal could be made to work perhaps, but I think it would still end up with some odd corner-case behaviors. One example is that "TABLESPACE X" would be allowed if the database's default tablespace is Y, but if you try to dump and restore into a database whose default is X, it'd be rejected (?). The results after ALTER DATABASE ... SET TABLESPACE X are unclear too. regards, tom lane
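Spelled out under Alvaro's proposed restriction (hypothetical, with made-up tablespace names), the corner case Tom describes would go roughly like this:

    -- in a database whose default tablespace is ts_y:
    CREATE TABLE p (a int) PARTITION BY LIST (a) TABLESPACE ts_x;   -- accepted: ts_x is not the default
    -- pg_dump then emits "... TABLESPACE ts_x"; restoring that dump into a
    -- database whose default tablespace *is* ts_x would be rejected by the
    -- same rule, even though the source database accepted the definition.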
Re: block-level incremental backup
Hi, On 2019-04-22 14:26:40 -0400, Stephen Frost wrote: > I'm disappointed that the concerns about the trouble that end users are > likely to have with this didn't garner more discussion. My impression is that endusers are having a lot more trouble due to important backup/restore features not being in core/pg_basebackup, than due to external tools having a harder time to implement certain features. Focusing on external tools being able to provide all those features, because core hasn't yet, is imo entirely the wrong thing to concentrate upon. And it's not like things largely haven't been implemented in pg_basebackup for fundamental architectural reasons. It's because we've built like 5 different external tools with randomly differing featureset and licenses. Greetings, Andres Freund
Re: TM format can mix encodings in to_char()
On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan wrote: > How does one do that? Just set a Turkish locale? tr_TR is, in a sense, special among locales: http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html The Turkish dotless i has apparently been implicated in all kinds of bugs in quite a variety of contexts. -- Peter Geoghegan
Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()
Michał "phoe" Herda writes: > My reasoning in this case is - if we allow the any[] type to only be > passed to other functions that accept any[], and disallow any kind of > other operations on this array (such as retrieving its elements or > modifying it), I do not yet see any places where it might introduce a > performance regression.

Performance regressions are not the question here --- or at least, there are a lot of other questions to get past first.

* plpgsql doesn't have any mechanism for restricting the use of a parameter in the way you suggest. It's not clear if it'd be practical to add one, given the arms-length way in which plpgsql does expression evaluation, and it seems likely that any such thing would be messy and bug-prone.

* There's not actually any such type as any[]. There's anyarray, which is not what you're wishing for here because it'd constrain all the actual arguments to be the same type (or at least coercible to the same array element type). This is related to the next point...

* format() isn't declared as taking any[]. It's really

regression=# \df format
                              List of functions
   Schema   |  Name  | Result data type | Argument data types  | Type
------------+--------+------------------+----------------------+------
 pg_catalog | format | text             | text                 | func
 pg_catalog | format | text             | text, VARIADIC "any" | func
(2 rows)

"VARIADIC any" is a very special hack, because unlike other VARIADIC cases, it doesn't result in collapsing the actual arguments into an array. (Again, it can't because they might not have a common type.) The called function has to have special logic for examining its arguments to find out how many there are and what their types are. format() can do that because it's written in C, but a plpgsql function, not so much.

* We could imagine allowing a plpgsql function to be declared "VARIADIC any", and treating it as having N polymorphic arguments of independent types, but then what? plpgsql has no notation for accessing such arguments (they wouldn't have names, to start with), nor for finding out how many there are, and it certainly has no notation for passing the whole group of them on to some other function.

I think the closest you'll be able to get here is to declare the plpgsql function as taking "variadic text[]" and then passing the text array to format() with a VARIADIC marker. That should work mostly okay, although calls might need explicit casts to text in some cases. FWIW, it'd likely be easier to get around these problems in plperl or pltcl than in plpgsql, as those are both much less concerned with the exact data types of their arguments than plpgsql, and more accustomed to dealing with functions with variable argument lists. I don't know Python well enough to say whether the same is true of plpython. regards, tom lane
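To make that suggestion concrete, here is a minimal plpgsql sketch of that shape; the function name and the hard-coded format string are stand-ins for the original poster's table-driven error lookup, which is omitted:

    CREATE OR REPLACE FUNCTION gateway_error(err_id text, VARIADIC args text[])
    RETURNS void
    LANGUAGE plpgsql AS
    $$
    DECLARE
        fmt text;
    BEGIN
        -- in the real function, fmt (and the SQLSTATE) would be looked up
        -- from a table keyed by err_id; hard-coded here for the sketch
        fmt := 'The person %s has only %s bananas.';
        RAISE EXCEPTION '%', format(fmt, VARIADIC args);
    END;
    $$;

    SELECT gateway_error('err_bananas', 'Fred', '8');
    -- non-text arguments need an explicit cast, as noted above, e.g.:
    SELECT gateway_error('err_bananas', 'Fred', 8::text);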
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Andres Freund wrote: > Why is the obvious answer is to not just remove the whole tablespace > inheritance behaviour? Because it was requested by many, and there were plenty of people surprised that things didn't work that way. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: block-level incremental backup
Greetings, * Andres Freund (and...@anarazel.de) wrote: > On 2019-04-22 14:26:40 -0400, Stephen Frost wrote: > > I'm disappointed that the concerns about the trouble that end users are > > likely to have with this didn't garner more discussion. > > My impression is that endusers are having a lot more trouble due to > important backup/restore features not being in core/pg_basebackup, than > due to external tools having a harder time to implement certain > features. I had been referring specifically to the concern I raised about incremental block-level backups being added to pg_basebackup and how that'll make using pg_basebackup more complicated and therefore more difficult for end-users to get right, particularly if the end user is having to handle management of the association between the full backup and the incremental backups. I wasn't referring to anything regarding external tools. > Focusing on external tools being able to provide all those > features, because core hasn't yet, is imo entirely the wrong thing to > concentrate upon. And it's not like things largely haven't been > implemented in pg_basebackup for fundamental architectural reasons. > It's because we've built like 5 different external tools with randomly > differing featureset and licenses. There's a few challenges when it comes to adding backup features to core. One of the reasons is that core naturally moves slower when it comes to development than external projects do, as was discusssed earlier on this thread. Another is that, when it comes to backup, specifically, people want to back up their *existing* systems, which means that they need a backup tool that's going to work with whatever version of PG they've currently got deployed and that's often a few years old already. Certainly when I've thought about features that we'd like to see and considered if there's something that could be implemented in core vs. implemented outside of core, the answer often ends up being "well, if we do it ourselves then we can make it work for PG 9.2 and above, and have it working for existing users, but if we work it in as part of core, it won't be available until next year and only for version 12 and above, and users can only use it once they've upgraded.." Thanks, Stephen signature.asc Description: PGP signature
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Tom Lane wrote: > Yeah, that's where I'm at as well. Alvaro's proposal could be made > to work perhaps, but I think it would still end up with some odd > corner-case behaviors. One example is that "TABLESPACE X" would > be allowed if the database's default tablespace is Y, but if you > try to dump and restore into a database whose default is X, it'd be > rejected (?). Hmm, I don't think so, because dump uses default_tablespace on a plain table instead of TABLESPACE clauses, and the table is attached afterwards. > The results after ALTER DATABASE ... SET TABLESPACE X > are unclear too. Currently we disallow SET TABLESPACE X if you have any table in that tablespace, and we do that by searching for files. A partitioned table would not have a file that would cause it to fail, so this is something to study. (BTW I think these tablespace behaviors are not tested very much. The tests we have are intra-database operations only, and there's only a single non-default tablespace.) -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
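For reference, a sketch of the existing behaviour being described (database and tablespace names are made up):

    -- run from another database; fails today if any relation of mydb
    -- already has storage (i.e. files) in ts_x:
    ALTER DATABASE mydb SET TABLESPACE ts_x;

Since a partitioned table placed in ts_x exists only in the catalogs and has no file there, the current file-based check would not notice it, which is the open question above.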
Re: TM format can mix encodings in to_char()
Peter Geoghegan writes: > On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan > wrote: >> How does one do that? Just set a Turkish locale? > tr_TR is, in a sense, special among locales: > http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html > The Turkish dotless i has apparently been implicated in all kinds of > bugs in quite a variety of contexts. Yeah, we've had our share of those :-(. But the dotless i is not the problem here --- it happens to not trigger an encoding conversion issue, it seems. Amusingly, the existing test case for lc_time = tr_TR in collate.linux.utf8.sql is specifically coded to check what happens with dotted/dotless i, and yet it manages to not trip over this problem. (I suspect the reason is that what comes out of strftime is "Nis" which is ASCII, and the non-ASCII characters only arise from subsequent case conversion within PG.) regards, tom lane
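For anyone who wants to poke at this, a minimal sketch of the kind of query involved, assuming a tr_TR UTF-8 locale is installed and the database encoding is UTF8:

    SET lc_time = 'tr_TR.utf8';
    SELECT to_char(date '2019-04-01', 'TMMon');    -- 'Nis': pure ASCII, so no conversion is exercised
    SELECT to_char(date '2019-02-01', 'TMMonth');  -- 'Şubat': contains a non-ASCII character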
Re: pg_dump is broken for partition tablespaces
Alvaro Herrera writes: > On 2019-Apr-22, Andres Freund wrote: >> Why is the obvious answer is to not just remove the whole tablespace >> inheritance behaviour? > Because it was requested by many, and there were plenty of people > surprised that things didn't work that way. There are lots of things in SQL that people find surprising. In this particular case, "we can't do it because it conflicts with ancient decisions about how PG tablespaces work" seems like a defensible answer, even without getting into the question of whether "partitions inherit their tablespace from the parent" is really any less surprising than "partitions work exactly like normal tables as far as tablespace selection goes". regards, tom lane
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, 2019-04-22 at 21:45 +0900, Etsuro Fujita wrote: Thanks for looking into this! > (2019/04/20 20:53), Laurenz Albe wrote: > > On Fri, 2018-04-06 at 23:24 +, Robert Haas wrote: > > > Allow insert and update tuple routing and COPY for foreign tables. > > > > > > Also enable this for postgres_fdw. > > > > > > Etsuro Fujita, based on an earlier patch by Amit Langote. The larger > > > patch series of which this is a part has been reviewed by Amit > > > Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost, > > > and me. Minor documentation changes to the final version by me. > > > > > > Discussion: > > > http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f...@lab.ntt.co.jp > > > > This commit makes life hard for foreign data wrappers that support > > data modifications. > > > > If a FDW implements ExecForeignInsert, this commit automatically assumes > > that it also supports COPY FROM. It will call ExecForeignInsert without > > calling PlanForeignModify and BeginForeignModify, and a FDW that does not > > expect that will probably fail. > > This is not 100% correct; the FDW documentation says: > > > Tuples inserted into a partitioned table by > INSERT or > COPY FROM are routed to partitions. If an FDW > supports routable foreign-table partitions, it should also provide the > following callback functions. These functions are also called when > COPY FROM is executed on a foreign table. > I don't see the difference between the documentation and what I wrote above. Before v11, a FDW could expect that ExecForeignInsert is only called if BeginForeignModify was called earlier. That has silently changed with v11. > > maybe there are FDWs that support INSERT but don't want to support COPY > > for some reason. > > I agree on that point. > > > I propose that PostgreSQL only allows COPY FROM on a foreign table if the > > FDW > > implements BeginForeignInsert. The attached patch implements that. > > I don't think that is a good idea, because there might be some FDWs that > want to support COPY FROM on foreign tables without providing > BeginForeignInsert. (As for INSERT into foreign tables, we actually > allow FDWs to support it without providing PlanForeignModify, > BeginForeignModify, or EndForeignModify.) > > It's permissible to throw an error in BeginForeignInsert, so what I was > thinking for FDWs that don't want to support COPY FROM and > INSERT/UPDATE/COPY FROM tuple routing was to provide BeginForeignInsert > implementing something like this: > > static void > fooBeginForeignInsert(ModifyTableState *mtstate, >ResultRelInfo *resultRelInfo) > { > Relationrel = resultRelInfo->ri_RelationDesc; > > if (mtstate->ps.plan == NULL) > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot copy to foreign table \"%s\"", > RelationGetRelationName(rel; > else > ereport(ERROR, > (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > errmsg("cannot route tuples into foreign table \"%s\"", > RelationGetRelationName(rel; > } Sure, it is not hard to modify a FDW to continue working with v11. My point is that this should not be necessary. If a FDW worked well with v10, it should continue to work with v11 unless there is a necessity for a compatibility-breaking change. On the other hand, if a FDW wants to support COPY in v11 and has no need for BeginForeignInsert to support that, it is a simple exercise for it to provide an empty BeginForeignInsert just to signal that it wants to support COPY. I realized that my previous patch forgot to check for tuple routing, updated patch attached. 
Yours, Laurenz Albe From 545ac9d567ea6ca610ce08355451923cc787e13d Mon Sep 17 00:00:00 2001 From: Laurenz Albe Date: Mon, 22 Apr 2019 21:35:06 +0200 Subject: [PATCH] Foreign table COPY FROM and tuple routing requires BeginForeignInsert Commit 3d956d956a introduced support for foreign tables as partitions and COPY FROM on foreign tables. If a foreign data wrapper supports data modifications, but either has not adapted to this change yet or doesn't want to support it for other reasons, it probably got broken by the above commit, because COPY will just call ExecForeignInsert anyway, which might not work because neither PlanForeignModify nor BeginForeignModify have been called. To avoid breaking third-party foreign data wrappers in that way, allow COPY FROM and tuple routing for foreign tables only if the foreign data wrapper implements BeginForeignInsert. --- doc/src/sgml/fdwhandler.sgml | 2 ++ src/backend/commands/copy.c | 5 + src/backend/executor/execPartition.c | 17 - 3 files changed, 19 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml index 2c07a86637..84f
Re: block-level incremental backup
On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost wrote: > There was basically zero discussion about what things would look like at > a protocol level (I went back and skimmed over the thread before sending > my last email to specifically see if I was going to get this response > back..). I get the idea behind the diff file, the contents of which I > wasn't getting into above. Well, I wrote: "There should be a way to tell pg_basebackup to request from the server only those blocks where LSN >= threshold_value." I guess I assumed that people would interested in the details take that to mean "and therefore the protocol would grow an option for this type of request in whatever way is the most straightforward possible extension of the current functionality is," which is indeed how you eventually interpreted it when you said we could "extend BASE_BACKUP is by adding LSN as an optional parameter." I could have been more explicit, but sometimes people tell me that my emails are too long. > external tools to leverage that. It sounds like what you're suggesting > now is that you're happy to implement the backend code, expose it in a > way that works just for pg_basebackup, and that if someone else wants to > add things to the protocol to make it easier for external tools to > leverage, great. Yep, that's more or less it, although I am potentially willing to do some modest amount of that other work along the way. I just don't want to prioritize it higher than getting the actual thing I want to build built, which I think is a pretty fair position for me to take. > All I can say is that that's basically how we ended up > in the situation we're in today where pg_basebackup doesn't support > parallel backup but a bunch of external tools do and they don't go > through the backend to get there, even though they'd probably prefer to. I certainly agree that core should try to do things in a way that is useful to external tools when that can be done without undue effort, but only if it can actually be done without undo effort. Let's see whether that's the case here: - Anastasia wants a command added that dumps out whatever the server knows about what files have changed, which I already agreed was a reasonable extension of my initial proposal. - You said that for this to be useful to pgbackrest, it'd have to use a whole different mechanism that includes commands to request individual files and blocks within those files, which would be a significant rewrite of pg_basebackup that you agreed is more closely related to parallel backup than to the project under discussion on this thread. And that even then pgbackrest probably wouldn't use it because it also does server-side compression and encryption which are not included in this proposal. It seems to me that the first one falls into the category a reasonable additional effort and the second one falls into the category of lots of extra and unrelated work that wouldn't even get used. > Thanks for sharing your thoughts on that, certainly having the backend > able to be more intelligent about streaming files to avoid latency is > good and possibly the best approach. Another alternative to reducing > the latency would be to have a way for the client to request a set of > files, but I don't know that it'd be better. I don't know either. This is an area that needs more thought, I think, although as discussed, it's more related to parallel backup than $SUBJECT. 
> I'm not really sure why the above is extremely inconvenient for > third-party tools, beyond just that they've already been written to work > with an assumption that the server-side of things isn't as intelligent > as PG is. Well, one thing you might want to do is have a tool that connects to the server, enters backup mode, requests information on what blocks have changed, copies those blocks via direct filesystem access, and then exits backup mode. Such a tool would really benefit from a START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP command language, because it would just skip ever issuing the SEND_FILE_CONTENTS command in favor of doing that part of the work via other means. On the other hand, a START_PARALLEL_BACKUP LSN '1/234' command is useless to such a tool. Contrariwise, a tool that has its own magic - perhaps based on WAL-scanning or something like ptrack - to know which files currently exist and which blocks are modified could use SEND_FILE_CONTENTS but not SEND_FILE_LIST. And a filesystem-snapshot based technique might use START_BACKUP and STOP_BACKUP but nothing else. In short, providing granular commands like this lets the client be really intelligent even if the server isn't, and lets the client have fine-grained control of the process. This is very good if you're an out-of-core tool maintainer and your tool is trying to be smarter than - or even just differently-designed than - core. But if what you really want is just a maximally-efficient parallel backup, you don't nee
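To keep the alternatives straight, here is a rough sketch of the kind of exchange being discussed, using only the placeholder command names from this thread (none of these commands exist in any released server):

    START_BACKUP;                  -- enter backup mode, as a snapshot-based tool would
    SEND_FILE_LIST;                -- returns the files the backup must cover
    SEND_FILE_CONTENTS 'base/16384/16385' LSN '1/234';
                                   -- full file, or only blocks changed since the given LSN
    STOP_BACKUP;                   -- leave backup mode
    -- a tool with its own change tracking might skip SEND_FILE_LIST, and a
    -- filesystem-snapshot tool might use only START_BACKUP/STOP_BACKUP.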
translatability tweaks
Here's a small number of translatability tweaks to the error messages in the backend. I found these while updating the Spanish translation over the past weekend, a task I had neglected for two or three years, so they might involve some older messages. However, I won't backpatch this -- only apply to pg12 tomorrow or so. (I haven't yet gone over the full backend message catalog yet, so I might propose more fixes later on.) -- Álvaro Herrerahttp://www.twitter.com/alvherre >From 091d4043aca3fa4936f1020b635a78e8b6d9010d Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Mon, 22 Apr 2019 15:45:06 -0400 Subject: [PATCH] Unify error messages ... for translatability purposes. --- src/backend/storage/ipc/latch.c | 12 +--- src/backend/storage/ipc/signalfuncs.c | 4 +++- src/backend/utils/adt/formatting.c| 9 ++--- src/backend/utils/adt/genfile.c | 4 +++- src/backend/utils/adt/json.c | 4 +++- src/backend/utils/adt/jsonb.c | 4 +++- 6 files changed, 27 insertions(+), 10 deletions(-) diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 59fa917ae04..e0712f906a1 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -856,7 +856,9 @@ WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action) if (rc < 0) ereport(ERROR, (errcode_for_socket_access(), - errmsg("epoll_ctl() failed: %m"))); + /* translator: %s is a syscall name, such as "poll()" */ + errmsg("%s failed: %m", + "epoll_ctl()"))); } #endif @@ -1087,7 +1089,9 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout, waiting = false; ereport(ERROR, (errcode_for_socket_access(), - errmsg("epoll_wait() failed: %m"))); + /* translator: %s is a syscall name, such as "poll()" */ + errmsg("%s failed: %m", + "epoll_wait()"))); } return 0; } @@ -1211,7 +1215,9 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout, waiting = false; ereport(ERROR, (errcode_for_socket_access(), - errmsg("poll() failed: %m"))); + /* translator: %s is a syscall name, such as "poll()" */ + errmsg("%s failed: %m", + "poll()"))); } return 0; } diff --git a/src/backend/storage/ipc/signalfuncs.c b/src/backend/storage/ipc/signalfuncs.c index 4769b1b51eb..4bfbd57464c 100644 --- a/src/backend/storage/ipc/signalfuncs.c +++ b/src/backend/storage/ipc/signalfuncs.c @@ -181,7 +181,9 @@ pg_rotate_logfile(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to rotate log files with adminpack 1.0"), - errhint("Consider using pg_logfile_rotate(), which is part of core, instead."; + /* translator: %s is a SQL function name */ + errhint("Consider using %s, which is part of core, instead.", + "pg_logfile_rotate()"; if (!Logging_collector) { diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index df1db7bc9f1..69a691f18e7 100644 --- a/src/backend/utils/adt/formatting.c +++ b/src/backend/utils/adt/formatting.c @@ -1566,7 +1566,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid) */ ereport(ERROR, (errcode(ERRCODE_INDETERMINATE_COLLATION), - errmsg("could not determine which collation to use for lower() function"), + errmsg("could not determine which collation to use for %s function", +"lower()"), errhint("Use the COLLATE clause to set the collation explicitly."))); } mylocale = pg_newlocale_from_collation(collid); @@ -1688,7 +1689,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid) */ ereport(ERROR, (errcode(ERRCODE_INDETERMINATE_COLLATION), - errmsg("could not determine which collation to use for upper() function"), + 
errmsg("could not determine which collation to use for %s function", +"upper()"), errhint("Use the COLLATE clause to set the collation explicitly."))); } mylocale = pg_newlocale_from_collation(collid); @@ -1811,7 +1813,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid) */ ereport(ERROR, (errcode(ERRCODE_INDETERMINATE_COLLATION), - errmsg("could not determine which collation to use for initcap() function"), + errmsg("could not determine which collation to use for %s function", +"initcap()"), errhint("Use the COLLATE clause to set the collation explicitly."))); } mylocale = pg_newlocale_from_collation(collid); diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c index d6976609968..a3c6adaf640 100644 --- a/src/backend/utils/adt/genfile.c +++ b/src/backend/utils/adt/genfile.c @@ -219,7 +219,9 @@ pg_read_file(PG_FUNCTION_ARGS) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), (errmsg("must be superuser to read files with adminpack 1.0"), - errhint("
Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept
On Thu, Nov 29, 2018 at 3:44 PM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > > On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov > > wrote: > > > > Given I've no feedback on this idea yet, I'll try to implement a PoC > > patch for that. It doesn't look to be difficult. And we'll see how > > does it work. > > Unfortunately, current version of the patch doesn't pass the tests and fails on > initdb. But maybe you already have this PoC you were talking about, that will > also incorporate the feedback from this thread? For now I'll move it to the > next CF.

Finally, I managed to write a PoC. If we look at the list of problems I've enumerated in [1], this PoC is aimed at 1 and 3.

> 1) Data corruption on file truncation error (explained in [1]).
> 2) Expensive scanning of the whole shared buffers before file truncation.
> 3) Cancel of read-only queries on standby even if hot_standby_feedback > is on, caused by replication of AccessExclusiveLock.

2 is a pretty independent problem and could be addressed later. Basically, this patch does the following:

1. Introduces a new flag, BM_DIRTY_BARRIER, which prevents a dirty buffer from being written out.
2. Implements two-phase truncation of relation buffers. The first phase runs before the file truncation and marks dirty buffers past the truncation point as BM_DIRTY_BARRIER. The second phase runs after the file truncation and actually wipes out the buffers past the truncation point.
3. If an exception happens during file truncation, the BM_DIRTY_BARRIER flag is released from the buffers. Thus, no data corruption should happen here. If the file truncation was partially complete, the file might be re-extended by a write of a dirty buffer. I'm not sure how likely that is, but such an extension could lead to the errors again. Still, this shouldn't cause data corruption.
4. Having too many buffers marked as BM_DIRTY_BARRIER would paralyze the buffer manager. This is why we keep no more than NBuffers/2 buffers marked as BM_DIRTY_BARRIER. If the limit is exceeded, the remaining dirty buffers are simply written out in the first phase.
5. lazy_truncate_heap() now takes ExclusiveLock instead of AccessExclusiveLock. This part is not really complete. At least, we need to ensure that reads past the truncation point, caused by read-only queries concurrent with the truncation, don't lead to real errors.

Any thoughts?

1. https://www.postgresql.org/message-id/CAPpHfdtD3U2DpGZQJNe21s9s1s-Va7NRNcP1isvdCuJxzYypcg%40mail.gmail.com -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company 0001-POC-fix-relation-truncation-1.patch Description: Binary data
Re: pg_dump is broken for partition tablespaces
On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera wrote: > On 2019-Apr-22, Andres Freund wrote: > > Why is the obvious answer is to not just remove the whole tablespace > > inheritance behaviour? > > Because it was requested by many, and there were plenty of people > surprised that things didn't work that way. On the other hand, that behavior worked correctly on its own terms, and this behavior seems to be broken, and you've been through a whole series of possible designs trying to figure out how to fix it, and it's still not clear that you've got a working solution. I don't know whether that shows that it is impossible to make this idea work sensibly, but at the very least it proves that the whole area needed a lot more thought than it got before this code was committed (a complaint that I also made at the time, if you recall). "Surprising" is not great, but it is clearly superior to "broken." -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, Apr 22, 2019 at 3:37 PM Laurenz Albe wrote: > Sure, it is not hard to modify a FDW to continue working with v11. > > My point is that this should not be necessary. I'm not sure whether this proposal is a good idea or a bad idea, but I think that it's inevitable that FDWs are going to need some updating for new releases as the API evolves. That has happened before and will continue to happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
Hi, On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote: > Subject: [PATCH] Foreign table COPY FROM and tuple routing requires > BeginForeignInsert > > Commit 3d956d956a introduced support for foreign tables as partitions > and COPY FROM on foreign tables. > > If a foreign data wrapper supports data modifications, but either has > not adapted to this change yet or doesn't want to support it > for other reasons, it probably got broken by the above commit, > because COPY will just call ExecForeignInsert anyway, which might not > work because neither PlanForeignModify nor BeginForeignModify have > been called. > > To avoid breaking third-party foreign data wrappers in that way, allow > COPY FROM and tuple routing for foreign tables only if the foreign data > wrapper implements BeginForeignInsert. Isn't this worse though? Before this it's an API change between major versions. With this it's an API change in a *minor* version. Sure, it's one that doesn't crash, but it's still a pretty substantial function regression, no? Greetings, Andres Freund
Re: pg_dump is broken for partition tablespaces
On 2019-Apr-22, Robert Haas wrote: > On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera > wrote: > > On 2019-Apr-22, Andres Freund wrote: > > > Why is the obvious answer is to not just remove the whole tablespace > > > inheritance behaviour? > > > > Because it was requested by many, and there were plenty of people > > surprised that things didn't work that way. > > On the other hand, that behavior worked correctly on its own terms, > and this behavior seems to be broken, and you've been through a whole > series of possible designs trying to figure out how to fix it, and > it's still not clear that you've got a working solution. Well, frequently when people discuss ideas on this list, others discuss and provide further ideas to try to help find a working solution, rather than throw every roadblock they can think of (though roadblocks are indeed thrown now and then). If I've taken a long time to find a working solution, maybe it's because I have no shoulders of giants to stand on, and I'm a pretty short guy, so I need to build me a ladder. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, 2019-04-22 at 16:24 -0400, Robert Haas wrote: > On Mon, Apr 22, 2019 at 3:37 PM Laurenz Albe wrote: > > Sure, it is not hard to modify a FDW to continue working with v11. > > > > My point is that this should not be necessary. > > I'm not sure whether this proposal is a good idea or a bad idea, but I > think that it's inevitable that FDWs are going to need some updating > for new releases as the API evolves. That has happened before and > will continue to happen. Absolutely. I am just unhappy that this change caused unnecessary breakage. If you developed a read-only FDW for 9.2, it didn't break with the write support introduced in 9.3, because that used different API functions. That's how it should be IMHO. If you developed a FDW for 9.1, it got broken in 9.2, because the API had to change to allow returning multiple paths. That was unfortunate but necessary, so it is ok. Nothing in this patch required an incompatible change. Yours, Laurenz Albe
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
On Mon, 2019-04-22 at 13:27 -0700, Andres Freund wrote: > On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote: > > Commit 3d956d956a introduced support for foreign tables as partitions > > and COPY FROM on foreign tables. > > > > If a foreign data wrapper supports data modifications, but either has > > not adapted to this change yet or doesn't want to support it > > for other reasons, it probably got broken by the above commit, > > because COPY will just call ExecForeignInsert anyway, which might not > > work because neither PlanForeignModify nor BeginForeignModify have > > been called. > > > > To avoid breaking third-party foreign data wrappers in that way, allow > > COPY FROM and tuple routing for foreign tables only if the foreign data > > wrapper implements BeginForeignInsert. > > Isn't this worse though? Before this it's an API change between major > versions. With this it's an API change in a *minor* version. Sure, it's > one that doesn't crash, but it's still a pretty substantial function > regression, no? Hm, that's a good point. You could say that this patch is too late, because an FDW might already be relying on COPY FROM to work without BeginForeignInsert in v11. How about just applying the patch from v12 on? Then it is a behavior change in a major release, which is acceptable. It requires the imaginary FDW above to add an empty BeginForeignInsert callback function, but will unbreak FDWs that slept through the change that added COPY support. Yours, Laurenz Albe
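[Editor's sketch] For a third-party FDW author, the fix Laurenz describes amounts to registering a no-op callback. A rough sketch under assumed names (my_fdw_handler, my_begin_foreign_insert); the rest of the FdwRoutine, which an existing FDW would already fill in, is elided:

#include "postgres.h"
#include "fmgr.h"
#include "foreign/fdwapi.h"
#include "nodes/execnodes.h"

PG_MODULE_MAGIC;

/* No-op: this FDW's ExecForeignInsert needs no per-COPY/tuple-routing setup. */
static void
my_begin_foreign_insert(ModifyTableState *mtstate, ResultRelInfo *resultRelInfo)
{
}

PG_FUNCTION_INFO_V1(my_fdw_handler);

Datum
my_fdw_handler(PG_FUNCTION_ARGS)
{
    FdwRoutine *routine = makeNode(FdwRoutine);

    /* ... the scan and modify callbacks the FDW already provides ... */
    routine->BeginForeignInsert = my_begin_foreign_insert;

    PG_RETURN_POINTER(routine);
}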
Re: pgsql: Allow insert and update tuple routing and COPY for foreign table
Hi, On 2019-04-22 22:56:20 +0200, Laurenz Albe wrote: > On Mon, 2019-04-22 at 13:27 -0700, Andres Freund wrote: > > On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote: > > > Commit 3d956d956a introduced support for foreign tables as partitions > > > and COPY FROM on foreign tables. > > > > > > If a foreign data wrapper supports data modifications, but either has > > > not adapted to this change yet or doesn't want to support it > > > for other reasons, it probably got broken by the above commit, > > > because COPY will just call ExecForeignInsert anyway, which might not > > > work because neither PlanForeignModify nor BeginForeignModify have > > > been called. > > > > > > To avoid breaking third-party foreign data wrappers in that way, allow > > > COPY FROM and tuple routing for foreign tables only if the foreign data > > > wrapper implements BeginForeignInsert. > > > > Isn't this worse though? Before this it's an API change between major > > versions. With this it's an API change in a *minor* version. Sure, it's > > one that doesn't crash, but it's still a pretty substantial function > > regression, no? > > Hm, that's a good point. > You could say that this patch is too late, because a FDW might already be > relying on COPY FROM to work without BeginForeignInsert in v11. I think that's the case. > How about just applying the patch from v12 on? > Then it is a behavior change in a major release, which is acceptable. > It requires the imaginary FDW above to add an empty BeginForeignInsert > callback function, but will unbreak FDW that slept through the change > that added COPY support. I fail to see the advantage. It'll still require FDWs to be fixed to work correctly for v11, but additionally adds another set of API differences that needs to be fixed by another set of FDWs. I think this ship simply has sailed. Greetings, Andres Freund
Re: TM format can mix encodings in to_char()
Actually, I tried to show my findings with the tr_TR regression test, but you can reproduce the same issue with other locales and non-ASCII characters, as Tom has pointed out. For example: de_DE ISO-8859-1: March, es_ES ISO-8859-1: Wednesday, fr_FR ISO-8859-1: February. Regards, Juan José Santamaría Flecha
Re: speeding up planning with partitions
Amit Langote writes: > On 2019/04/02 2:34, Tom Lane wrote: >> I'm not at all clear on what we think the interaction between >> enable_partition_pruning and constraint_exclusion ought to be, >> so I'm not sure what the appropriate resolution is here. Thoughts? > Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and > DELETE queries is realized by applying constraint exclusion to the > partition constraint of the target partition. The conclusion of the > discussion when adding the enable_partition_pruning GUC [1] was that > whether or not constraint exclusion is carried out (to facilitate > partition pruning) should be governed by the new GUC, not > constraint_exclusion, if only to avoid confusing users. I got back to thinking about how this ought to work. It appears to me that we've got half a dozen different behaviors that depend on one or both of these settings: 1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning, that is relation_excluded_by_constraints for baserels. This is enabled by constraint_exclusion = on. 2. Use of partition constraints in baserel pruning (applicable only when a partition is accessed directly). This is currently partly broken, and it's what your patch wants to change. 3. Use of ordinary table constraints in appendrel pruning, that is relation_excluded_by_constraints for appendrel members. This is enabled by constraint_exclusion >= partition. 4. Use of partition constraints in appendrel pruning. This is enabled by the combination of enable_partition_pruning AND constraint_exclusion >= partition. However, it looks to me like this is now nearly if not completely useless because of #5. 5. Use of partition constraints in expand_partitioned_rtentry. Enabled by enable_partition_pruning (see prune_append_rel_partitions). 6. Use of partition constraints in run-time partition pruning. This is also enabled by enable_partition_pruning, cf create_append_plan, create_merge_append_plan. Now in addition to what I mention above, there are assorted random differences in behavior depending on whether we are in an inherited UPDATE/DELETE or not. I consider these differences to be so bogus that I'm not even going to include them in this taxonomy; they should not exist. The UPDATE/DELETE target ought to act the same as a baserel. I think this is ridiculously overcomplicated even without said random differences. I propose that we do the following: * Get rid of point 4 by not considering partition constraints for appendrel members in relation_excluded_by_constraints. It's just useless cycles in view of point 5, or nearly so. (Possibly there are corner cases where we could prove contradictions between a relation's partition constraints and regular constraints ... but is it really worth spending planner cycles to look for that?) * Make point 2 like point 1 by treating partition constraints for baserels like ordinary table constraints, ie, they are considered only when constraint_exclusion = on (independently of whether enable_partition_pruning is on). * Treat an inherited UPDATE/DELETE target table as if it were an appendrel member for the purposes of relation_excluded_by_constraints, thus removing the behavioral differences between SELECT and UPDATE/DELETE. With this, constraint_exclusion would act pretty much as it traditionally has, and in most cases would not have any special impact on partitions compared to old-style inheritance. 
The behaviors that enable_partition_pruning would control are expand_partitioned_rtentry pruning and run-time pruning, neither of which has any applicability to old-style inheritance. Thoughts? regards, tom lane
Calling PrepareTempTablespaces in BufFileCreateTemp
Hi, PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I was wondering if there is a reason not to call it inside BufFileCreateTemp. As a developer using BufFileCreateTemp to write code that creates spill files, I found it easy to forget the extra step of checking the temp_tablespaces GUC to ensure the spill files are created there when it is set. Thanks, Melanie Plageman
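[Editor's sketch] A minimal sketch of the calling pattern in question, assuming a hypothetical caller build_spill_file(); PrepareTempTablespaces() and BufFileCreateTemp() are the real functions being discussed:

#include "postgres.h"
#include "commands/tablespace.h"    /* PrepareTempTablespaces() */
#include "storage/buffile.h"        /* BufFileCreateTemp() */

static BufFile *
build_spill_file(void)
{
    /*
     * The easy-to-forget step: without this, the spill file silently ends
     * up in the default tablespace even when temp_tablespaces is set.
     */
    PrepareTempTablespaces();

    return BufFileCreateTemp(false);    /* false: file does not outlive the transaction */
}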
Re: pg_dump is broken for partition tablespaces
On Mon, Apr 22, 2019 at 4:43 PM Alvaro Herrera wrote: > Well, frequently when people discuss ideas on this list, others discuss > and provide further ideas to try help to find a working solution, rather > than throw every roadblock they can think of (though roadblocks are > indeed thrown now and then). If I've taken a long time to find a > working solution, maybe it's because I have no shoulders of giants to > stand on, and I'm a pretty short guy, so I need to build me a ladder. What exactly do you mean by throwing up roadblocks? I don't have a basket full of great ideas for how to solve this problem that I'm failing to suggest out of some sort of perverse desire to see you fail. I'm not quite as convinced as Tom and Andres that this whole idea is fatally flawed and can't ever be made to work correctly, but I think it's quite possible that they are right, both because their objections sound to me like they are on target and because they are pretty smart people. But that's also because I haven't spent a lot of time on this issue, which I think is pretty fair, because it seems like it would be unfair to complain that I am not spending enough time helping fix code that I advised against committing in the first place. How much time should I spend giving you advice if my previous advice was ignored? But FWIW, it seems to me that a good place to start solving this problem would be to think hard about what Andres said here: http://postgr.es/m/20190306161744.22jdkg37fyi2z...@alap3.anarazel.de Specifically, this complaint: "I still think the feature as is doesn't seem to have very well defined behaviour." If we know what the feature is supposed to do, then it should be possible to look at each relevant piece of code and decide whether it implements the specification correctly or not. But if we don't have a specification -- that is, we don't know precisely what the feature is supposed to do -- then we'll just end up whacking the behavior around trying to get some combination that makes sense, and the whole effort is probably doomed. I think that the large quote block from David Rowley in the middle of the above-linked email gets at the definitional problem pretty clearly: the documentation seems to be intending to say -- although it is not 100% clear -- that if TABLESPACE is specified it has effect on all future children, and if not specified then those children get the tablespace they would have gotten anyway. But that behavior is impossible to implement correctly unless there is a way to distinguish between a partitioned table for which TABLESPACE was not specified and one where it was specified to be the same as the default tablespace for the database. And we know, per previous conversation, that the catalog representation admits of no way to make that distinction. Using reltablespace = dattablespace is clearly the wrong answer, because that breaks stuff. Tom earlier suggested, I believe, that some fixed OID could be used, e.g. reltablespace = 1 means "tablespace not specified" and reltablespace = 0 means dattablespace. That should be safe enough, since I don't think OID 1 can ever be the OID of a real tablespace. I think this is probably the only way forward if this definition captures the desired behavior. The other option is to decide that we want some other behavior. In that case, the way forward depends on what we want the behavior to be. 
Your proposal upthread is to disallow the case where the user provides an explicit TABLESPACE setting whose value matches the default tablespace for the database. But that definition seems to have an obvious problem: just because that's not true when the partitioned table is defined doesn't mean it won't become true later, because the default tablespace for the database can be changed, or the database can be copied and the copy be assigned a different default tablespace. Even if there were no issue, that definition doesn't sound very clean, because it means that reltablespace = 0 for a regular relation means dattablespace and for a partitioned relation it means none. Now that's not the only way we could go either, I suppose. There must be a variety of other possible behaviors. But the one Tom and Andres are proposing -- go back to the way it worked in v10 -- is definitely not crazy. You are right that a lot of people didn't like that, but the definition was absolutely clear, because it was the exact same definition we use for normal tables, with the straightforward extension that for relations with no storage it meant nothing. Going back to the proposal of making OID = 0 mean TABLESPACE dattablespace, OID = 1 mean no TABLESPACE clause specified for this partitioned table, and OID = whatever mean that particular non-default tablespace, I do see a couple of problems with that idea too: - I don't have any idea how to salvage things in v11, where the catalog representation is irredeemably ambiguous. We'd probably have to be satisfied with fairly goofy
Re: Calling PrepareTempTablespaces in BufFileCreateTemp
On Mon, Apr 22, 2019 at 3:12 PM Melanie Plageman wrote: > PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I > was > wondering if there is a reason not to call it inside BufFileCreateTemp. The best answer I can think of is that a BufFileCreateTemp() caller might not want to do catalog access. Perhaps the contortions within assign_temp_tablespaces() are something that callers ought to opt in to explicitly. That doesn't seem like a particularly good or complete answer, though. Perhaps it should simply be called within BufFileCreateTemp(). The BufFile/fd.c layering is confusing in a number of ways IMV. -- Peter Geoghegan
Re: pg_dump is broken for partition tablespaces
On 2019-Mar-06, Tom Lane wrote: > David Rowley writes: > > As far as I can see, the biggest fundamental difference with doing > > things this way will be that the column order of partitions will be > > preserved, where before it would inherit the order of the partitioned > > table. I'm a little unsure if doing this column reordering was an > > intended side-effect or not. > > Well, if the normal behavior results in changing the column order, > it'd be necessary to do things differently in --binary-upgrade mode > anyway, because there we *must* preserve column order. I don't know > if what you're describing represents a separate bug for pg_upgrade runs, > but it might. Is there any test case for the situation left behind by > the core regression tests? Now that I re-read this complaint once again, I wonder if a mismatching column order in partitions isn't a thing we ought to preserve anyhow. Robert, Amit -- is it by design that pg_dump loses the original column order for partitions, when not in binary-upgrade mode? To me, it sounds unintuitive to accept partitions that don't exactly match the order of the parent table; but it's been supported all along. In the status quo, if users dump and restore such a database, the restored partition ends up with the column order of the parent instead of its own column order (by virtue of being created as CREATE TABLE PARTITION OF). Isn't that wrong? It'll cause an INSERT/COPY directly to the partition that worked prior to the restore to fail after the restore, if the column list isn't specified. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: finding changed blocks using WAL scanning
On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote: On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian wrote: How would you choose the STARTLSN/ENDLSN? If you could do it per checkpoint, rather than per-WAL, I think that would be great. I thought of that too. It seems appealing, because you probably only really care whether a particular block was modified between one checkpoint and the next, not exactly when during that interval it was modified. That's probably true for incremental backups, but there are other use cases that could leverage this information. Some time ago there was a discussion about prefetching blocks during recovery on a standby, and that's a great example of a use case that benefits from this - look at which blocks were modified in the next chunk of WAL and prefetch them. But that requires fairly detailed information about which blocks were modified in the next few megabytes of WAL. So just doing it once per checkpoint (or even anything above a single WAL segment) and removing all the detailed LSN locations makes it useless for this use case. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
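[Editor's sketch] To illustrate the prefetching use case, here is a very rough sketch that assumes an LSN-ordered stream of block references is available; WalBlockRef and next_block_ref() are hypothetical, while smgropen() and smgrprefetch() are existing APIs. Nothing here reflects an agreed-upon format.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "storage/backendid.h"
#include "storage/smgr.h"

/* Hypothetical record read from a raw, LSN-ordered block-reference file. */
typedef struct WalBlockRef
{
    XLogRecPtr  lsn;
    RelFileNode rnode;
    ForkNumber  forknum;
    BlockNumber blkno;
} WalBlockRef;

/* Hypothetical reader; returns false when no more references are available. */
extern bool next_block_ref(WalBlockRef *ref);

/* Ask the kernel to prefetch the blocks that WAL up to target_lsn will touch. */
static void
prefetch_until(XLogRecPtr target_lsn)
{
    WalBlockRef ref;

    while (next_block_ref(&ref) && ref.lsn < target_lsn)
    {
        SMgrRelation reln = smgropen(ref.rnode, InvalidBackendId);

        smgrprefetch(reln, ref.forknum, ref.blkno);
    }
}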
Re: finding changed blocks using WAL scanning
On Mon, Apr 22, 2019 at 01:15:49PM -0400, Bruce Momjian wrote: On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote: On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian wrote: > I assumed the modblock files would be stored in the WAL archive so some > external tools could generate incremental backups using just the WAL > files. I assumed they would also be sent to standby servers so > incremental backups could be done on standby servers too. Yeah, that's another possible approach. I am not sure what is best. I am thinking you need to allow any of these, and putting the WAL files in pg_wal and having them streamed and archived gives that flexibility. I agree - this would be quite useful for the prefetching use case I've already mentioned in my previous message. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Calling PrepareTempTablespaces in BufFileCreateTemp
Peter Geoghegan writes: > On Mon, Apr 22, 2019 at 3:12 PM Melanie Plageman > wrote: >> PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I >> was >> wondering if there is a reason not to call it inside BufFileCreateTemp. > The best answer I can think of is that a BufFileCreateTemp() caller > might not want to do catalog access. Perhaps the contortions within > assign_temp_tablespaces() are something that callers ought to opt in > to explicitly. It's kind of hard to see a reason to call it outside a transaction, and even if we did, there are provisions for it not to go boom. > That doesn't seem like a particularly good or complete answer, though. > Perhaps it should simply be called within BufFileCreateTemp(). The > BufFile/fd.c layering is confusing in a number of ways IMV. I don't actually see why BufFileCreateTemp should do it; if we're to add a call, seems like OpenTemporaryFile is the place, as that's what is really concerned with the temp tablespace(s). I'm in favor of doing this, I think, as it sure looks to me like gistInitBuildBuffers() is calling BufFileCreateTemp without any closely preceding PrepareTempTablespaces. So we already have an instance of Melanie's bug in core. It'd be difficult to notice because of the silent-fallback-to-default-tablespace behavior. regards, tom lane
Re: pg_dump is broken for partition tablespaces
Alvaro Herrera writes: > Now that I re-read this complaint once again, I wonder if a mismatching > column order in partitions isn't a thing we ought to preserve anyhow. > Robert, Amit -- is it by design that pg_dump loses the original column > order for partitions, when not in binary-upgrade mode? I haven't looked at the partitioning code, but I am quite sure that that's always happened for old-style inheritance children, and I imagine pg_dump is just duplicating that old behavior. Wasn't there already a patch on the table to change this, though, by removing the code path that uses inheritance rather than the binary-upgrade-like solution? regards, tom lane
Re: finding changed blocks using WAL scanning
On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost wrote: > Oh. Well, I already explained my algorithm for doing that upthread, > which I believe would be quite cheap. > > 1. When you generate the .modblock files, stick all the block > references into a buffer. qsort(). Dedup. Write out in sorted > order. Having all of the block references in a sorted order does seem like it would help, but would also make those potentially quite a bit larger than necessary (I had some thoughts about making them smaller elsewhere in this discussion). That might be worth it though. I suppose it might also be possible to line up the bitmaps suggested elsewhere to do essentially a BitmapOr of them to identify the blocks changed (while effectively de-duping at the same time). I don't see why this would make them bigger than necessary. If you sort by relfilenode/fork/blocknumber and dedup, then references to nearby blocks will be adjacent in the file. You can then decide what format will represent that most efficiently on output. Whether or not a bitmap is better idea than a list of block numbers or something else depends on what percentage of blocks are modified and how clustered they are. Not sure I understand correctly - do you suggest deduplicating and sorting the data before writing them into the .modblock files? Because then the sorting would make this information mostly useless for the recovery prefetching use case I mentioned elsewhere. For that to work we need information about both the LSN and the block, in LSN order. So if we want to allow that use case to leverage this infrastructure, we need to write the .modblock files kinda "raw" and do this processing in some later step. Now, maybe the incremental backup use case is so much more important that the right thing to do is to ignore this other use case, and I'm OK with that - as long as it's a conscious choice. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: finding changed blocks using WAL scanning
On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote: > On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote: > > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost wrote: > > > > Oh. Well, I already explained my algorithm for doing that upthread, > > > > which I believe would be quite cheap. > > > > > > > > 1. When you generate the .modblock files, stick all the block > > > > references into a buffer. qsort(). Dedup. Write out in sorted > > > > order. > > > > > > Having all of the block references in a sorted order does seem like it > > > would help, but would also make those potentially quite a bit larger > > > than necessary (I had some thoughts about making them smaller elsewhere > > > in this discussion). That might be worth it though. I suppose it might > > > also be possible to line up the bitmaps suggested elsewhere to do > > > essentially a BitmapOr of them to identify the blocks changed (while > > > effectively de-duping at the same time). > > > > I don't see why this would make them bigger than necessary. If you > > sort by relfilenode/fork/blocknumber and dedup, then references to > > nearby blocks will be adjacent in the file. You can then decide what > > format will represent that most efficiently on output. Whether or not > > a bitmap is better idea than a list of block numbers or something else > > depends on what percentage of blocks are modified and how clustered > > they are. > > > > Not sure I understand correctly - do you suggest to deduplicate and sort > the data before writing them into the .modblock files? Because that the > the sorting would make this information mostly useless for the recovery > prefetching use case I mentioned elsewhere. For that to work we need > information about both the LSN and block, in the LSN order. > > So if we want to allow that use case to leverage this infrastructure, we > need to write the .modfiles kinda "raw" and do this processing in some > later step. > > Now, maybe the incremental backup use case is so much more important the > right thing to do is ignore this other use case, and I'm OK with that - > as long as it's a conscious choice. I think the concern is that the more granular the modblock files are (with less de-duping), the larger they will be. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
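[Editor's sketch] For concreteness, a minimal sketch of the sort-and-dedup step quoted above, using an assumed in-memory BlockRef layout; the actual .modblock record format is still an open question in this thread.

#include <stdlib.h>

typedef struct BlockRef
{
    unsigned int relfilenode;
    int          forknum;
    unsigned int blocknum;
} BlockRef;

/* Order by relfilenode, then fork, then block number. */
static int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = (const BlockRef *) a;
    const BlockRef *y = (const BlockRef *) b;

    if (x->relfilenode != y->relfilenode)
        return (x->relfilenode < y->relfilenode) ? -1 : 1;
    if (x->forknum != y->forknum)
        return (x->forknum < y->forknum) ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return (x->blocknum < y->blocknum) ? -1 : 1;
    return 0;
}

/* Sort block references and squeeze out duplicates in place; returns the deduplicated count. */
static size_t
sort_and_dedup(BlockRef *refs, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;

    qsort(refs, n, sizeof(BlockRef), blockref_cmp);

    for (size_t i = 1; i < n; i++)
    {
        if (blockref_cmp(&refs[i], &refs[out]) != 0)
            refs[++out] = refs[i];
    }
    return out + 1;
}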
memory leak checking
Hello PostgreSQL Hackers, What is the standard memory leak checking policy for the PostgreSQL codebase? I know there is some support for valgrind -- is the test suite being run continuously with valgrind on the build farm? Is there any plan to support clang's AddressSanitizer? I've seen a thread that memory leaks are allowed in initdb, because it is a short-lived process. Obviously they are not allowed in the database server. Are memory leaks allowed in the psql tool? Regards, Mikhail
Re: memory leak checking
Hi, On 2019-04-22 16:50:25 -0700, Mikhail Bautin wrote: > What is the standard memory leak checking policy for the PostgreSQL > codebase? I know there is some support for valgrind -- is the test suite > being run continuously with valgrind on the build farm? There's continuous use of valgrind on the buildfarm - but those animals have leak checking disabled. Postgres uses memory contexts for nearly all server allocations, which makes it possible to free memory in bulk. There's also plenty of memory that's intentionally allocated until the end of the backend lifetime (e.g. the various caches over the system catalogs). Because of that, checks like valgrind's leak detection are not particularly meaningful. > Is there any plan to support clang's AddressSanitizer? Not for the leak portion. I use asan against the backend, after disabling the leak check, and that's useful. Should probably set up a proper buildfarm animal for that. > I've seen a thread that memory leaks are allowed in initdb, because it is a > short-lived process. Obviously they are not allowed in the database server. > Are memory leaks allowed in the psql tool? Leaks are allowed if they are once-per-backend type things. There's no point in e.g. freeing information for timezone metadata, given that it'll be used for the whole server lifetime. And there are such things in psql too, IIRC. Greetings, Andres Freund
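[Editor's sketch] As a concrete illustration of the memory-context pattern Andres mentions: allocations are made in a dedicated context and released in bulk, so individual pfree() calls (and leak-checker noise about them) are unnecessary. do_one_batch() is a hypothetical caller; the context APIs shown are the real ones.

#include "postgres.h"
#include "utils/memutils.h"

static void
do_one_batch(void)
{
    MemoryContext batchcxt;
    MemoryContext oldcxt;

    /* All allocations for this unit of work live in one child context. */
    batchcxt = AllocSetContextCreate(CurrentMemoryContext,
                                     "batch work context",
                                     ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(batchcxt);

    /* ... palloc() freely here; nothing needs an individual pfree() ... */
    (void) palloc(1024);

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(batchcxt);      /* bulk-free everything at once */
}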