Re: Data-only pg_rewind, take 2
On Mon, Mar 18, 2019 at 8:46 PM Chris Travers wrote:
> On Mon, Mar 18, 2019 at 4:09 AM Michael Paquier wrote:
>> On Sun, Mar 17, 2019 at 09:00:57PM +0800, Chris Travers wrote:
>> > I also added test cases and some docs. I don't know if the docs are
>> > sufficient. Feedback is appreciated.
>>
>> To be honest, I don't think that this approach is a good idea per the
>> same reasons as mentioned the last time, as this can cause pg_rewind
>> to break if any newly-added folder in the data directory has
>> non-replaceable data which is needed at the beginning of recovery and
>> cannot be automatically rebuilt. So that's one extra maintenance
>> burden to worry about.
>
> Actually I think this is safe. Let me go through the cases not handled in
> the current behavior at all:

Hi Chris,

Could you please post a rebase? This has fairly thoroughly bitrotted. The
Commitfest is here, so now would be an excellent time for people to be able
to apply and test the patch.

Thanks,
--
Thomas Munro
https://enterprisedb.com
Re: FETCH FIRST clause PERCENT option
Hi Thomas, Thank you for informing me Hi Surafel, > > There's a call to adjust_limit_rows_costs() hiding under > contrib/postgres_fdw, so this fails check-world. > Fixed . I also include the review given by Ryan in attached patch regards Surafel diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1759b9e1b6..fb9c2f5319 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -3035,7 +3035,8 @@ estimate_path_cost_size(PlannerInfo *root, if (fpextra && fpextra->has_limit) { adjust_limit_rows_costs(&rows, &startup_cost, &total_cost, - fpextra->offset_est, fpextra->count_est); + fpextra->offset_est, fpextra->count_est, + EXACT_NUMBER); retrieved_rows = rows; } } diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 06d611b64c..1c240a3dab 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ] [ LIMIT { count | ALL } ] [ OFFSET start [ ROW | ROWS ] ] -[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ] +[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY ] [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ] where from_item can be one of: @@ -1430,7 +1430,7 @@ OFFSET start which PostgreSQL also supports. It is: OFFSET start { ROW | ROWS } -FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY +FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY In this syntax, the start or count value is required by @@ -1440,7 +1440,8 @@ FETCH { FIRST | NEXT } [ count ] { ambiguity. If count is omitted in a FETCH clause, it defaults to 1. -ROW +With PERCENT count specifies the maximum number of rows to return +in percentage round up to the nearest integer.ROW and ROWS as well as FIRST and NEXT are noise words that don't influence the effects of these clauses. diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c index baa669abe8..c503512588 100644 --- a/src/backend/executor/nodeLimit.c +++ b/src/backend/executor/nodeLimit.c @@ -21,6 +21,8 @@ #include "postgres.h" +#include + #include "executor/executor.h" #include "executor/nodeLimit.h" #include "miscadmin.h" @@ -52,6 +54,7 @@ ExecLimit(PlanState *pstate) */ direction = node->ps.state->es_direction; outerPlan = outerPlanState(node); + slot = node->subSlot; /* * The main logic is a simple state machine. @@ -81,7 +84,15 @@ ExecLimit(PlanState *pstate) /* * Check for empty window; if so, treat like empty subplan. */ - if (node->count <= 0 && !node->noCount) + if (node->limitOption == PERCENTAGE) + { +if (node->percent == 0.0) +{ + node->lstate = LIMIT_EMPTY; + return NULL; +} + } + else if (node->count <= 0 && !node->noCount) { node->lstate = LIMIT_EMPTY; return NULL; @@ -102,11 +113,32 @@ ExecLimit(PlanState *pstate) node->lstate = LIMIT_EMPTY; return NULL; } + +/* + * Count the number of rows to return in exact number so far + */ +if (node->limitOption == PERCENTAGE) +{ + node->perExactCount = ceil(node->percent * node->position / 100.0); + + if (node->perExactCount == node->perExactCount + 1) + node->perExactCount++; +} node->subSlot = slot; if (++node->position > node->offset) break; } + /* + * We may needed this tuple in backward scan so put it into + * tuplestore. 
+ */ + if (node->limitOption == PERCENTAGE) + { +tuplestore_puttupleslot(node->tupleStore, slot); +tuplestore_advance(node->tupleStore, true); + } + /* * Okay, we have the first tuple of the window. */ @@ -124,6 +156,106 @@ ExecLimit(PlanState *pstate) case LIMIT_INWINDOW: if (ScanDirectionIsForward(direction)) { +/* + * In case of coming back from backward scan the tuple is + * already in tuple store. + */ +if (node->limitOption == PERCENTAGE && node->backwardPosition > 0) +{ + if (tuplestore_gettupleslot_heaptuple(node->tupleStore, true, true, slot)) + { + node->subSlot = slot; + node->position++; + node->backwardPosition--; + return slot; + } + else + { + node->lstate = LIMIT_SUBPLANEOF; + return NULL; + } +} + +/* + * In PERCENTAGE case no need of executing outerPlan multiple + * times. + */ +if (node->limitOption == PERCENTAGE && node->reachEnd) +{ + node->lstate = LIMIT_WINDOWEND; + + /* + * If we know we won't need to back up, we can release + * resources at this point. + */ + if (!(node->ps.state->es_top_eflags & EXEC_FLAG_BACKWARD)) + (void) ExecShutdownNode(out
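To make the proposed behaviour concrete, here is a minimal usage sketch of the syntax this patch adds to the grammar shown in the select.sgml hunk above. The employee table and the 10 percent figure are hypothetical, and released servers do not accept PERCENT here; per the documentation change, the count is treated as a percentage of the result set, rounded up to the nearest integer.

    -- return the top 10 percent of rows, per the proposed
    -- FETCH FIRST ... PERCENT behaviour
    SELECT *
      FROM employee
     ORDER BY salary DESC
     FETCH FIRST 10 PERCENT ROWS ONLY;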
Re: (select query)/relation as first class citizen
Hi, what do you think about this idea in general? If you don't have to think about implementation for now? From my point of view writing Sql queries is very close to how functional language work if you treat "select" queries as functions without side-effects, and having query being first-class-citizen could move this even further. Regards, Roman On Sun, 7 Jul 2019 at 16:22, Roman Pekar wrote: > Hi, > > Yes, I'm thinking about 'query like a view', 'query like a cursor' is > probably possible even now in ms sql server (not sure about postgresql), > but it requires this paradygm shift from set-based thinking to row-by-row > thinking which I'd not want to do. > > I completely agree with your points of plan caching and static checks. > With static checks, though it might be possible to do if the query would be > defined as typed, so all the types of the columns is known in advance. > In certain cases having possibility of much better decomposition is might > be more important than having cached plan. Not sure how often these cases > appear in general, but personally for me it'd be awesome to have this > possibility. > > Regards, > Roman Pekar > > On Sun, 7 Jul 2019 at 15:39, Pavel Stehule > wrote: > >> Hi >> >> ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar >> napsal: >> >>> Hello, >>> >>> Just a bit of background - I currently work as a full-time db developer, >>> mostly with Ms Sql server but I like Postgres a lot, especially because I >>> really program in sql all the time and type system / plpgsql language of >>> Postgres seems to me more suitable for actual programming then t-sql. >>> >>> Here's the problem - current structure of the language doesn't allow to >>> decompose the code well and split calculations and data into different >>> modules. >>> >>> For example. Suppose I have a table employee and I have a function like >>> this (I'll skip definition of return types for the sake of simplicity): >>> >>> create function departments_salary () >>> returns table (...) >>> as >>> return $$ >>> select department, sum(salary) as salary from employee group by >>> department; >>> $$; >>> >>> so that's fine, but what if I want to run this function on filtered >>> employee? I can adjust the function of course, but it implies I can predict >>> all possible filters I'm going to need in the future. >>> And logically, function itself doesn't have to be run on employee table, >>> anything with department and salary columns will fit. >>> So it'd be nice to be able to define the function like this: >>> >>> create function departments_salary(_employee query) >>> returns table (...) >>> as >>> return $$ >>> select department, sum(salary) as salary from _employee group by >>> department; >>> $$; >>> >>> and then call it like this: >>> >>> declare _employee query; >>> ... >>> _poor_employee = (select salary, department from employee where salary < >>> 1000); >>> select * from departments_salary( _poor_employee); >>> >>> And just to be clear, the query is not really invoked until the last >>> line, so re-assigning _employee variable is more like building query >>> expression. >>> >>> As far as I understand the closest way to do this is to put the data >>> into temporary table and use this temporary table inside of the function. >>> It's not exactly the same of course, cause in case of temporary tables data >>> should be transferred to temporary table, while it will might be filtered >>> later. So it's something like array vs generator in python, or List vs >>> IQueryable in C#. 
>>> >>> Adding this functionality will allow much better decomposition of the >>> program's logic. >>> What do you think about the idea itself? If you think the idea is >>> worthy, is it even possible to implement it? >>> >> >> If we talk about plpgsql, then I afraid so this idea can disallow plan >> caching - or significantly increase the cost of plan cache. >> >> There are two possibilities of implementation - a) query like cursor - >> unfortunately it effectively disables any optimization and it carry ORM >> performance to procedures. This usage is known performance antipattern, b) >> query like view - it should not to have a performance problems with late >> optimization, but I am not sure about possibility to reuse execution plans. >> >> Currently PLpgSQL is compromise between performance and dynamic (PLpgSQL >> is really static language). Your proposal increase much more dynamic >> behave, but performance can be much more worse. >> >> More - with this behave, there is not possible to do static check - so >> you have to find bugs only at runtime. I afraid about performance of this >> solution. >> >> Regards >> >> Pavel >> >> >> >>> Regards, >>> Roman Pekar >>> >>> >>>
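As a concrete illustration of the temporary-table workaround Roman mentions above (reusing his hypothetical employee table and departments_salary function), the materialized variant would look roughly like this, with the function body reading the temp table by name instead of taking a query parameter:

    -- sketch of the workaround: materialize the filtered rows first
    CREATE TEMP TABLE poor_employee AS
        SELECT salary, department
          FROM employee
         WHERE salary < 1000;

    -- departments_salary() would have to be written to read poor_employee
    SELECT * FROM departments_salary();

This forces the filtered rows to be materialized up front, which is exactly the eager-versus-lazy difference (array vs. generator, List vs. IQueryable) that Roman is pointing at.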
Re: Ltree syntax improvement
On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky wrote: > I've applied your patch. > From my point of view, there is no major difference between case and chain if > here. > Neither case nor ifs allow extracting the common code to separate function - > just because there seem to be no identical pieces of code. Hi Dmitry, The documentation doesn't build[1], due to invalid XML. Since I'm here, here is some proof-reading of the English in the documentation: - A label is a sequence of alphanumeric characters - and underscores (for example, in C locale the characters - A-Za-z0-9_ are allowed). Labels must be less than 256 bytes - long. + A label is a sequence of characters. Labels must be + less than 256 symbols long. Label may contain any character supported by Postgres "fewer than 256 characters in length", and "PostgreSQL" + except \0. If label contains spaces, dots, lquery modifiers, "spaces, dots or lquery modifiers," + they may be escaped. Escaping can be done either by preceeding + backslash (\\) symbol, or by wrapping the label in whole in double + quotes ("). Initial and final unescaped whitespace is stripped. "Escaping can be done with either a preceding backslash [...] or by wrapping the whole label in double quotes [...]." +During converting text into internal representations, wrapping double quotes "During conversion to internal representation, " +and escaping backslashes are removed. During converting internal +representations into text, if the label does not contain any special "During conversion from internal representation to text, " +symbols, it is printed as is. Otherwise, it is wrapped in quotes and, if +there are internal quotes, they are escaped with backslash. The list of special "escaped with backslashes." + +Examples: 42, "\\42", +\\4\\2, 42 and "42" + will have the similar internal representation and, being "will all have the same internal representation and," +converted from internal representation, will become 42. +Literal abc def will turn into "abc +def". [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/71856 -- Thomas Munro https://enterprisedb.com
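To make the escaping rules under discussion concrete, here is a small sketch of what the patched ltree extension would accept (this is the proposed behaviour, not what the current extension allows; the extension must be installed):

    -- a label containing a space, written with double quotes
    SELECT '"abc def".child'::ltree;
    -- the same label written with a backslash escape before the space
    SELECT 'abc\ def.child'::ltree;
    -- per the patch, both store the label abc def internally and, because it
    -- contains a special character, print it back as "abc def"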
Re: Replication & recovery_min_apply_delay
On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera wrote: > On 2019-Jan-30, Konstantin Knizhnik wrote: > > I wonder if it can be considered as acceptable solution of the problem or > > there can be some better approach? > > I didn't find one. It sounds like you are in agreement that there is a problem and this is the best solution. I didn't look at these patches, I'm just asking with my Commitfest manager hat on: did I understand correctly, does this need a TAP test, possibly the one Alvaro posted, and if so, could we please have a fresh patch that includes the test, so we can see it passing the test in CI? -- Thomas Munro https://enterprisedb.com
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 05, 2019 at 07:25:41PM +0200, Julien Rouhaud wrote: > On Fri, Jul 5, 2019 at 6:16 PM Peter Eisentraut > wrote: >> Isn't that also the case for your proposal? We are not going to release >> a new reindexdb before a new REINDEX. > > Sure, but my point was that once the new reindexdb is released (or if > you're so desperate, using a nightly build or compiling your own), it > can be used against any previous major version. There is probably a > large fraction of users who don't perform a postgres upgrade when they > upgrade their OS, so that's IMHO also something to consider. I think that we need to think long-term here and be confident in the fact we will still see breakages with collations and glibc, using a solution that we think is the right API. Peter's idea to make the backend-aware command of the filtering is cool. On top of that, there is no need to add any conflict logic in reindexdb and we can live with restricting --jobs support for non-index objects. -- Michael signature.asc Description: PGP signature
Re: Implementing Incremental View Maintenance
> As for how to make internal columns invisible to SELECT *, previously
> there have been discussions about doing that using a new flag in
> pg_attribute:
>
> https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com

Now that I realize there are several use cases for invisible columns, I
think this is the way we should go.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: Replication & recovery_min_apply_delay
On Mon, Jul 08, 2019 at 07:56:25PM +1200, Thomas Munro wrote: > On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera > wrote: > > On 2019-Jan-30, Konstantin Knizhnik wrote: > > > I wonder if it can be considered as acceptable solution of the problem or > > > there can be some better approach? > > > > I didn't find one. > > It sounds like you are in agreement that there is a problem and this > is the best solution. I didn't look at these patches, I'm just asking > with my Commitfest manager hat on: did I understand correctly, does > this need a TAP test, possibly the one Alvaro posted, and if so, could > we please have a fresh patch that includes the test, so we can see it > passing the test in CI? Please note that I have not looked at that stuff in details, but I find the patch proposed kind of ugly with the scan of the last segment using a WAL reader to find out what is the last LSN and react on that.. This does not feel right. -- Michael signature.asc Description: PGP signature
Re: [PATCH] kNN for btree
On Tue, Jul 2, 2019 at 5:47 AM Nikita Glukhov wrote:
> Attached 12th version of the patches rebased onto the current master.

Hi Nikita,

make check-world fails for me, and in tmp_install/log/install.log I see:

btree_int2.c:97:9: error: implicit declaration of function 'int2dist' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        return int2dist(fcinfo);
               ^
btree_int2.c:97:9: note: did you mean 'int2_dist'?
btree_int2.c:95:1: note: 'int2_dist' declared here
int2_dist(PG_FUNCTION_ARGS)
^
1 error generated.

--
Thomas Munro
https://enterprisedb.com
Re: Add test case for sslinfo
On Mon, Jul 08, 2019 at 02:11:34PM +0800, Hao Wu wrote:
> Thank you for your quick response! I work on greenplum, and I didn't see
> this folder (src/test/ssl/ssl) before.
> I will add more certificates to test and resend again.

Not having duplicates would be nice.

> Do you have any suggestion about the missing PGDATA? Since the test needs
> to configure postgresql.conf, maybe there are other ways to determine this
> environment.

+REGRESS = sslinfo
+REGRESS_OPT = --temp-config=$(top_srcdir)/contrib/sslinfo/sslinfo.conf

When it comes to custom configuration files in the regression tests, you
should always have NO_INSTALLCHECK = 1 in the Makefile, because there is no
guarantee that the running server will have the configuration you want when
running an installcheck.

+echo "preparing CRTs and KEYs"
+cp -f data/root.crt $PGDATA/
+cp -f data/server.crt $PGDATA/
+cp -f data/server.key $PGDATA/
+chmod 400 $PGDATA/server.key
+chmod 644 $PGDATA/server.crt
+chmod 644 $PGDATA/root.crt

Using a TAP test here would be more appropriate. Another idea would be to
add that directly into src/test/ssl/ and enforce its installation with
EXTRA_INSTALL when running the tests.

+-- start_ignore
+\! bash config.bash clean
+\! pg_ctl restart 2>&1 >/dev/null
+-- end_ignore

Please, no...
--
Michael

signature.asc
Description: PGP signature
Re: dropdb --force
po 8. 7. 2019 v 0:07 odesílatel Thomas Munro napsal: > On Thu, Jun 27, 2019 at 7:15 AM Pavel Stehule > wrote: > > fixed > > Hi Pavel, > > FYI t/050_dropdb.pl fails consistently with this patch applied: > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/555234838 with attached patch should be ok Regards Pavel > > > > -- > Thomas Munro > https://enterprisedb.com > diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml index 3ac06c984a..84df485e11 100644 --- a/doc/src/sgml/ref/drop_database.sgml +++ b/doc/src/sgml/ref/drop_database.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -DROP DATABASE [ IF EXISTS ] name +DROP DATABASE [ IF EXISTS ] name [ FORCE ] @@ -32,9 +32,11 @@ DROP DATABASE [ IF EXISTS ] name DROP DATABASE drops a database. It removes the catalog entries for the database and deletes the directory containing the data. It can only be executed by the database owner. - Also, it cannot be executed while you or anyone else are connected - to the target database. (Connect to postgres or any - other database to issue this command.) + Also, it cannot be executed while you are connected to the target database. + (Connect to postgres or any other database to issue this + command.) + If anyone else is connected to the target database, this command will fail + - unless you use the FORCE option described below. @@ -64,6 +66,24 @@ DROP DATABASE [ IF EXISTS ] name + + +FORCE + + + Attempt to terminate all existing connections to the target database. + + + This will fail, if current user has no permissions to terminate other + connections. Required permissions are the same as with + pg_terminate_backend, described + in . + + This will also fail, if the connections do not terminate in 60 seconds. + + + + diff --git a/doc/src/sgml/ref/dropdb.sgml b/doc/src/sgml/ref/dropdb.sgml index 3fbdb33716..5d10ad1c92 100644 --- a/doc/src/sgml/ref/dropdb.sgml +++ b/doc/src/sgml/ref/dropdb.sgml @@ -86,6 +86,20 @@ PostgreSQL documentation + + -f + --force + + +Force termination of connected backends before removing the database. + + +This will add the FORCE option to the DROP +DATABASE command sent to the server. + + + + -V --version diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 863f89f19d..86330bad12 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -499,7 +499,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) * potential waiting; we may as well throw an error first if we're gonna * throw one. */ - if (CountOtherDBBackends(src_dboid, ¬herbackends, &npreparedxacts)) + if (CountOtherDBBackends(src_dboid, ¬herbackends, &npreparedxacts, false)) ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), errmsg("source database \"%s\" is being accessed by other users", @@ -789,7 +789,7 @@ createdb_failure_callback(int code, Datum arg) * DROP DATABASE */ void -dropdb(const char *dbname, bool missing_ok) +dropdb(const char *dbname, bool missing_ok, bool force) { Oid db_id; bool db_istemplate; @@ -797,6 +797,7 @@ dropdb(const char *dbname, bool missing_ok) HeapTuple tup; int notherbackends; int npreparedxacts; + int loops = 0; int nslots, nslots_active; int nsubscriptions; @@ -880,12 +881,33 @@ dropdb(const char *dbname, bool missing_ok) * * As in CREATE DATABASE, check this after other error conditions. 
*/ - if (CountOtherDBBackends(db_id, ¬herbackends, &npreparedxacts)) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_IN_USE), - errmsg("database \"%s\" is being accessed by other users", - dbname), - errdetail_busy_db(notherbackends, npreparedxacts))); + for (;;) + { + /* + * CountOtherDBBackends check usage of database by other backends and try + * to wait 5 sec. We try to raise warning after 1 minute and and raise + * error after 5 minutes. + */ + if (!CountOtherDBBackends(db_id, ¬herbackends, &npreparedxacts, force)) + break; + + if (force && loops++ % 12 == 0) + ereport(WARNING, + (errcode(ERRCODE_OBJECT_IN_USE), +errmsg("source database \"%s\" is being accessed by other users", + dbname), + errdetail_busy_db(notherbackends, npreparedxacts))); + + /* without "force" flag raise error immediately, or after 5 minutes */ + if (!force || loops % 60 == 0) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), +errmsg("source database \"%s\" is being accessed by other users", + dbname), + errdetail_busy_db(notherbackends, npreparedxacts))); + + CHECK_FOR_INTERRUPTS(); + } /* * Check if there are subscriptions defined in the target database. @@ -1053,7 +1075,7 @@ RenameDatabase(const char *oldname,
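For reference, with this version of the patch applied the new option would be used as follows (mydb is a hypothetical database name, and the final syntax may still change before commit):

    -- from a session connected to some other database, e.g. postgres:
    DROP DATABASE mydb FORCE;    -- terminates remaining connections to mydb first

    -- or, via the client utility option added by the patch:
    --   dropdb --force mydb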
RE: [PATCH] memory leak in ecpglib
Dear Meskes, Zhang,

I think this modification is not enough, and I have another idea.

>> If your patch is committed, in your example, any operation for cur1 will not
>> be accepted.
> Although the return value after calling ecpg_get_con_name_by_cursor_name(cur1)
> is NULL, in ecpg_get_connection(), actual_connection will be returned.
> so, operation for cur1 will be accepted,

Did you mean this code? (Copied from ECPGfetch)

```
real_connection_name = ecpg_get_con_name_by_cursor_name(cursor_name);
if (real_connection_name == NULL)
{
    /*
     * If can't get the connection name by cursor name then using
     * connection name coming from the parameter connection_name
     */
    real_connection_name = connection_name;
}
```

If so, I think this approach is wrong. This connection_name corresponds to the
following con1.

```
EXEC SQL AT con1 FETCH cur1 ...
```

Therefore the following FETCH statement will fail because the application
forgets the connection of cur_1.

```
EXEC SQL AT con1 DECLARE stmt_1 STATEMENT;
EXEC SQL PREPARE stmt_1 FROM :selectString;
EXEC SQL DECLARE cur_1 CURSOR FOR stmt_1;
EXEC SQL OPEN cur_1;
EXEC SQL DECLARE cur_2 CURSOR FOR stmt_1;
EXEC SQL OPEN cur_2;
EXEC SQL FETCH cur_1;
```

I think the g_declared_list is not needed for managing connections. I was
wrong. We should treat DECLARE STATEMENT as declarative, like #include or
#define in C macros.

Please send me your reply.

Best Regards,
Hayato Kuroda
Fujitsu LIMITED
Re: Excessive memory usage in multi-statement queries w/ partitioning
Hi Julien, Thanks for taking a look at this. On Thu, Jul 4, 2019 at 6:52 PM Julien Rouhaud wrote: > On Tue, May 28, 2019 at 6:57 AM Amit Langote wrote: > > >> Maybe like the attached? I'm not sure if we need to likewise be > > >> concerned > > >> about exec_sql_string() being handed multi-query strings. > > the whole extension sql script is passed to execute_sql_string(), so I > think that it's a good thing to have similar workaround there. That makes sense, although it is perhaps much less likely for memory usage explosion to occur in execute_sql_strings(), because the scripts passed to execute_sql_strings() mostly contain utility statements and rarely anything whose planning will explode in memory usage. Anyway, I've added similar handling in execute_sql_strings() for consistency. Now I wonder if we'll need to consider another path which calls pg_plan_queries() on a possibly multi-statement query -- BuildCachedPlan()... > About the patch: > > -* Switch to appropriate context for constructing querytrees (again, > -* these must outlive the execution context). > +* Switch to appropriate context for constructing querytrees. > +* Memory allocated during this construction is released before > +* the generated plan is executed. > > The comment should mention query and plan trees, everything else seems ok to > me. Okay, fixed. Attached updated patch. Thanks again. Regards, Amit parse-plan-memcxt_v2.patch Description: Binary data
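To make the failure mode being fixed here concrete, a hypothetical reproducer in the spirit of this thread (table name, statement type, and partition count are all made up) would look something like this:

    -- a table with a large number of partitions
    CREATE TABLE t (a int) PARTITION BY LIST (a);
    -- ... create several thousand partitions of t ...

    -- then send many statements as a single multi-statement query string,
    -- e.g. in one simple-protocol request such as a single psql -c call:
    --   UPDATE t SET a = a WHERE a = 1; UPDATE t SET a = a WHERE a = 2; ...
    -- without the patch, the parse/plan trees of every statement stay in a
    -- context that lives until the whole string has been processed, so
    -- planner memory accumulates across statements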
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Sun, Jul 7, 2019 at 1:05 AM Bruce Momjian wrote: > > On Fri, Jul 5, 2019 at 10:24:39PM +0200, Tomas Vondra wrote: > > I agree this is a pretty crucial challenge, and those requirements seem > > in direct conflict. Users use encryption to protect privacy of the data, > > but we need access to some of the data to implement some of the > > important tasks of a RDBMS. > > > > And it's not just about things like recovery or replication. How do you > > do ANALYZE on encrypted data? Sure, if a user runs it in a session that > > has the right key, that's fine. But what about autovacuum/autoanalyze? > > There might be a way to defer ANALYZE and autovacuum/autoanalyze, but > what about VACUUM FREEZE? We can defer that too, but not the clog > truncation that is eventually the product of the freeze. > > What about referential integrity constraints that need to check primary > keys in the encrypted tables? I also don't see a way of delaying that, > and if you can't do referential integrity into the encrypted tables, it > reduces the value of having encrypted data in the same database rather > than in another database or cluster? > I just thought that PostgreSQL's auxiliary processes such as autovacuum, startup, checkpointer, bgwriter should always be able to access all keys because there are already in inside database. Even today these processes don't check any privileges when accessing to data. What security threats we can protect data from by requiring privileges even for auxiliary processes? If this is a security problem isn't it also true for cluster-wide encryption? I guess that processes who have an access privilege on the table can always get the corresponding encryption key. And any processes cannot access an encryption key directly without accessing to a database object. > I still feel we have not clearly described what the options are: > > 1. Encrypt everything > > 2. Encrypt only some tables (for performance reasons), and use only one > key, or use multiple keys to allow for key rotation. All keys are > always unlocked. > > 3. Encrypt only some tables with different keys, and the keys are not > always unlocked. > > As Tomas already stated, using tablespaces to distinguish encrypted from > non-encrypted tables doesn't make sense since the file system used for > the storage is immaterial to the encryptions status. An easier way would > be to just add a bit to WAL that would indicate if the rest of the WAL > record is encrypted, though I doubt the performance boost is worth the > complexity. Okay, instead of using tablespaces we can create groups grouping tables being encrypted with the same key. I think the one of the most important point here is to provide a granular encryption feature and have less the number of keys in database cluster, not to provide per tablespace encryption feature. I'm not going to insist it should be per tablespace encryption. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: (select query)/relation as first class citizen
Hi po 8. 7. 2019 v 9:33 odesílatel Roman Pekar napsal: > Hi, > > what do you think about this idea in general? If you don't have to think > about implementation for now? From my point of view writing Sql queries is > very close to how functional language work if you treat "select" queries as > functions without side-effects, and having query being first-class-citizen > could move this even further. > first - please, don't send top posts. second - my opinion is not clear. I can imagine benefits - on second hand, the usage is relative too close to one antipattern - only one query wrapped by functions. I see your proposal as little bit more dynamic (with little bit different syntax) views. With my experience I really afraid about it - it can be very effective (from developer perspective) and very slow (from customer perspective). This is example of tool that looks nice on paper, but can be very badly used. Maybe I am not the best man for this topic - I like some functional programming concepts, but I use it locally - your proposal moves SQL to some unexplored areas - and I think so it can be interesting as real research topic, but not today Postgres's theme. The basic question is why extend SQL and don't use some native functional language. Postgres should to implement ANSI SQL - and there is not a space for big experiments. I am sceptic about it - relational databases are static, SQL is static language, so it is hard to implement some dynamic system over it - SQL language is language over relation algebra - it is not functional language, I afraid so introduction another concept to this do more bad than good. Regards Pavel > Regards, > Roman > > On Sun, 7 Jul 2019 at 16:22, Roman Pekar wrote: > >> Hi, >> >> Yes, I'm thinking about 'query like a view', 'query like a cursor' is >> probably possible even now in ms sql server (not sure about postgresql), >> but it requires this paradygm shift from set-based thinking to row-by-row >> thinking which I'd not want to do. >> >> I completely agree with your points of plan caching and static checks. >> With static checks, though it might be possible to do if the query would be >> defined as typed, so all the types of the columns is known in advance. >> In certain cases having possibility of much better decomposition is might >> be more important than having cached plan. Not sure how often these cases >> appear in general, but personally for me it'd be awesome to have this >> possibility. >> >> Regards, >> Roman Pekar >> >> On Sun, 7 Jul 2019 at 15:39, Pavel Stehule >> wrote: >> >>> Hi >>> >>> ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar >>> napsal: >>> Hello, Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql. Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules. For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity): create function departments_salary () returns table (...) as return $$ select department, sum(salary) as salary from employee group by department; $$; so that's fine, but what if I want to run this function on filtered employee? 
I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future. And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit. So it'd be nice to be able to define the function like this: create function departments_salary(_employee query) returns table (...) as return $$ select department, sum(salary) as salary from _employee group by department; $$; and then call it like this: declare _employee query; ... _poor_employee = (select salary, department from employee where salary < 1000); select * from departments_salary( _poor_employee); And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression. As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#. Adding this functionality will
Re: progress report for ANALYZE
Hi Alvaro,

I'll review your patch this week. :)
I tested your patch on 6b854896. Here is the result. See below:

-
[Session #1]
create table hoge as select * from generate_series(1, 100) a;
analyze verbose hoge;

[Session #2]
\a \t
select * from pg_stat_progress_analyze;
\watch 0.001

17520|13599|postgres|16387|f|16387|scanning table|4425|14
17520|13599|postgres|16387|f|16387|scanning table|4425|64
17520|13599|postgres|16387|f|16387|scanning table|4425|111
...
17520|13599|postgres|16387|f|16387|scanning table|4425|4425
17520|13599|postgres|16387|f|16387|scanning table|4425|4425
17520|13599|postgres|16387|f|16387|scanning table|4425|4425
17520|13599|postgres|16387|f|16387|analyzing sample|0|0
17520|13599|postgres|16387|f|16387||0|0   <-- Is it Okay??
-

I have a question about the last line of the result. I'm not sure whether it
is possible, but I guess it would be better to keep the phase name (analyzing
sample) in the view until the end of the command. :)

Regards,
Tatsuro Yamada
Re: Synchronizing slots from primary to standby
On Mon, Dec 31, 2018 at 10:23 AM Petr Jelinek wrote: > As Andres has mentioned over at minimal decoding on standby thread [1], > that functionality can be used to add simple worker which periodically > synchronizes the slot state from the primary to a standby. > > Attached patch is rough implementation of such worker. It's nowhere near > committable in the current state, it servers primarily two purposes - to > have something over what we can agree on the approach (and if we do, > serve as base for that) and to demonstrate that the patch in [1] can > indeed be used for this functionality. All this means that this patch > depends on the [1] to work. Hi Petr, Do I understand correctly that this depends on the "logical decoding on standby" patch, but that isn't in the Commitfest? Seems like an oversight, since that thread has a recently posted v11 patch that applies OK, and there was recent review. You patches no longer apply on top though. Would it make sense to post a patch set here including logical-decoding-on-standby_v11.patch + your two patches (rebased), since this is currently marked as "Needs review"? -- Thomas Munro https://enterprisedb.com
Re: Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk
On Thu, Mar 7, 2019 at 8:19 PM David Steele wrote: > On 2/16/19 10:38 PM, Dave Cramer wrote: > > Thanks for looking at this. FYI, I did not originally write this, rather > > the original author has not replied to requests. > > JDBC could use this, I assume others could as well. > > > > That said I'm certainly open to suggestions on how to do this. > > > > Craig, do you have any other ideas? > This patch appears to be stalled. Are there any ideas on how to proceed? Hi Dave (C), It sounds like this should be withdrawn for now, and potentially revived in some future CF if someone has time to work on it? -- Thomas Munro https://enterprisedb.com
Re: Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk
On Mon, 8 Jul 2019 at 06:40, Thomas Munro wrote: > On Thu, Mar 7, 2019 at 8:19 PM David Steele wrote: > > On 2/16/19 10:38 PM, Dave Cramer wrote: > > > Thanks for looking at this. FYI, I did not originally write this, > rather > > > the original author has not replied to requests. > > > JDBC could use this, I assume others could as well. > > > > > > That said I'm certainly open to suggestions on how to do this. > > > > > > Craig, do you have any other ideas? > > > This patch appears to be stalled. Are there any ideas on how to proceed? > > Hi Dave (C), > > It sounds like this should be withdrawn for now, and potentially > revived in some future CF if someone has time to work on it? > > Hi Thomas, I'm fine with that decision. Thanks, Dave Cramer > > >
Re: How to estimate the shared memory size required for parallel scan?
> On 1 Jul 2019, at 13:03, Thomas Munro wrote: > > On Fri, May 3, 2019 at 6:06 AM Thomas Munro wrote: >> Added to commitfest. > > Rebased. Below is a review of this patch. It does what it says on the tin, applies clean without introducing compiler warnings and it seems like a good addition (for both utility and as an example implementation). Testing with various parallel worker settings shows that it properly observes user configuration for parallelism. The tests pass, but there are also no new tests that could fail. I tried to construct a test case for this, but it’s fairly hard to make one that can be added to the repo. Have you given this any thought? Regarding documentation, it seems reasonable to add a sentence on the file_fdw page when the user can expect parallel scan. A few smaller comments on the patch: In the commit message, I assume you mean “writing *is* probablt some way off”: "In theory this could be used for other kinds of parallel copying in the future (but infrastructure for parallel writing it probably some way off)." Single-line comments don’t end with a period IIRC (there a few others but only including the first one here): +/* The size of the chunks used for parallel scans, in bytes. */ Is there a reason to create partial_path before compute_parallel_worker, since the latter can make us end up not adding the path at all? Can’t we reverse the condition on parallel_workers to if (parallel_workers <= 0) and if so skip the partial path?: + partial_path = create_foreignscan_path(root, baserel, NULL, + baserel->rows, + startup_cost, + total_cost, + NIL, NULL, NULL, coptions); + + parallel_workers = compute_parallel_worker(baserel, + fdw_private->pages, -1, + max_parallel_workers_per_gather); ... + if (parallel_workers > 0) + add_partial_path(baserel, (Path *) partial_path); Since this is quite general code which can be used by other extensions as well, maybe we should reword the comment to say so?: + /* For use by file_fdw's parallel scans. */ cheers ./daniel
Re: POC: Cleaning up orphaned files using undo logs
On Sat, Jul 6, 2019 at 1:47 AM Robert Haas wrote: > > On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro wrote: > > [ new patches ] > > I took a look at 0012 today, Amit's patch for extending the binary > heap machinery, and 0013, Amit's patch for "Infrastructure to register > and fetch undo action requests." > Thanks for looking into the patches. > I don't think that binaryheap_allocate_shm() is a good design. It > presupposes that we want to store the binary heap as its own chunk of > shared memory allocated via ShmemInitStruct(), but we might want to do > something else, like embed in another structure, store it in a DSM or > DSA, etc., and this function can't do any of that. I think we should > have something more like: > > extern Size binaryheap_size(int capacity); > extern void binaryheap_initialize(binaryheap *, int capacity, > binaryheap_comparator compare, void *arg); > > Then the caller can do something like: > > sz = binaryheap_size(capacity); > bh = ShmemInitStruct(name, sz, &found); > if (!found) > binaryheap_initialize(bh, capacity, comparator, whatever); > > If it wants to get the memory in some other way, it just needs to > initialize bh differently; the rest is the same. Note that there is > no need, in this design, for binaryheap_size/initialize to make use of > "shared" memory. They could equally well be used on backend-local > memory. They do not need to care. You just provide the memory, and > they do their thing. > I didn't have other use cases in mind and I think to some extent this argument holds true for existing binaryheap_allocate allocate. If we want to make it more generic, then shouldn't we try to even change the existing binaryheap_allocate to use this new model, that way the binary heap allocation API will be more generic? .. .. > > Now, one objection to the above line of attack is the different queues > actually contain different types of elements. Apparently, the XID > queue contains elements of type UndoXidQueue and the size queue > contains elements of type UndoSizeQueue. It is worth noting here that > these are bad type names, because they sound like they are describing > a type of queue, but it seems that they are actually describing an > element in the queue. However, there are two larger problems: > > 1. I don't think we should have three different kinds of objects for > each of the three different queues. It seems like it would be much > simpler and easier to just have one kind of object that stores all the > information we need (full_xid, start_urec_ptr, dbid, request_size, > next_retry_at, error_ocurred_at) and use that everywhere. You could > object that this would increase the storage space requirement, Yes, this was the reason to keep them separate, but I see your point. > but it > wouldn't be enough to make any real difference and it probably would > be well worth it for the avoidance of complexity. > Okay, will give it a try and see if it can avoid some code complexity. Along with this, I will investigate your other suggestions related to code improvements as well. > > On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork() > and it passes the depth argument down to GetRollbackHashKeyFromQueue, > which then does binaryheap_nth() on the relevant queue. 
Note that > this function is another places that is ends up duplicating code > because of the questionable decision to have separate types of queue > entries for each different queue; otherwise, it could probably just > take the binary heap into which it's peeking as an argument instead of > having three different cases. But that's not the main point here. The > main point is that it calls a function for whichever type of queue > we've got and gets some kind of queue entry using binaryheap_nth(). > But binaryheap_nth(whatever, 2) does not give you the third-smallest > element in the binary heap. It gives you the third entry in the > array, which may or may not have the heap property, but even if it > does, the third element could be huge. Consider this binary heap: > > 0 1 10 2 3 11 12 4 5 6 7 13 14 15 16 > > This satisfies the binary heap property, because the element at > position n is always smaller than the elements at positions 2n+1 and > 2n+2 (assuming 0-based indexing). But if you want to look at the > smallest three elements in the heap, you can't just look at indexes > 0..2. The second-smallest element must be at index 1 or 2, but it > could be either place. The third-smallest element could be the other > of 1 and 2, or it could be either child of the smaller one, so there > are three places it might be. In general, a binary heap is not a good > data structure for finding the smallest N elements of a collection > unless N is 1, and what's going to happen with what you've got here is > that we'll sometimes prioritize an item that would not have been > pulled from the queue for a long time over one that would have > otherwise been processe
Re: Switching PL/Python to Python 3 by default in PostgreSQL 12
On 2019-07-07 21:26, Steven Pousty wrote: > The point of the links I sent from the Python community is that they > wanted Python extinct in the wild as of Jan 1 next year. They are never > fixing it, even for a security vulnerability. The operating systems that most of our users are going to run PostgreSQL 12 on will keep maintaining Python 2 for the foreseeable future. I'm not trying to dismiss the importance of managing the Python transition. But this issue has been known for many years, and the current setup is more or less in line with the wider world. For example, the Debian release that came out over the weekend still ships with /usr/bin/python being Python 2. So it is neither timely nor urgent to try to make some significant change about this in PostgreSQL 12 right now. I would welcome patches for this for PostgreSQL 13. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
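For anyone affected in the meantime, nothing prevents a user from opting into Python 3 explicitly today, independent of what the default is (the function name below is made up, and the server must have been built with Python 3 support):

    -- install the Python 3 variant of PL/Python and use it explicitly
    CREATE EXTENSION plpython3u;

    CREATE FUNCTION py3_version() RETURNS text
    LANGUAGE plpython3u
    AS $$
    import sys
    return sys.version
    $$;

    SELECT py3_version();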
Re: Commitfest 2019-07, the first of five* for PostgreSQL 13
On Tue, Jul 2, 2019 at 12:20 AM Thomas Munro wrote:
> It's now July everywhere on Earth, so I marked CF 2019-07 as
> in-progress, and 2019-09 as open for bumping patches into. I pinged
> most of the "Needs Review" threads that don't apply and will do a few
> more tomorrow, and then I'll try to chase patches that fail on CI, and
> then see what I can do to highlight some entries that really need
> review/discussion. I'll do end-of-week status reports.

Hello hackers,

Here's a quick status report after the first week (I think only about 10
commits happened during the week, the rest were pre-CF activity):

         status         | count
------------------------+-------
 Committed              |    32
 Moved to next CF       |     5
 Needs review           |   146
 Ready for Committer    |     7
 Rejected               |     2
 Returned with feedback |     2
 Waiting on Author      |    29
 Withdrawn              |     8

I wondered about reporting on the number of entries that didn't yet have
reviewers signed up, but then I noticed that there isn't a very good
correlation between signed-up reviewers and reviews. So instead, here is a
list of twenty "Needs review" entries that have gone a long time without
communication on the thread. In some cases there has been plenty of review,
so it's time to make decisions. In others, there has been none at all. If
you're having trouble choosing something to review, please pick one of these
and help us figure out how to proceed.

 2026 | Spurious "apparent wraparound" via Simpl | {"Noah Misch"}
 2003 | Fix Deadlock Issue in Single User Mode W | {"Chengchao Yu"}
 1796 | documenting signal handling with readme  | {"Chris Travers"}
 2053 | NOTIFY options + COLLAPSE (make deduplic | {"Filip Rembiałkowski"}
 1974 | pg_stat_statements should notice FOR UPD | {"Andrew Gierth"}
 2061 | [WIP] Show a human-readable n_distinct i | {"Maxence Ahlouche"}
 2077 | fix pgbench -R hanging on pg11           | {"Fabien Coelho"}
 2062 | Unaccent extension python script Issue i | {"Hugh Ranalli","Ramanarayana M"}
 1769 | libpq host/hostaddr consistency          | {"Fabien Coelho"}
 2060 | suppress errors thrown by to_reg*()      | {"takuma hoshiai"}
 2078 | Compile from source using latest Microso | {"Peifeng Qiu"}
 2081 | parse time support function              | {"Pavel Stehule"}
 1800 | amcheck verification for GiST            | {"Andrey Borodin"}
 2018 | pg_basebackup to adjust existing data di | {"Haribabu Kommi"}
 2095 | pg_upgrade version and path checking     | {"Daniel Gustafsson"}
 2044 | propagating replica identity to partitio | {"Álvaro Herrera"}
 2090 | pgbench - implement strict TPC-B benchma | {"Fabien Coelho"}
 2088 | Contribution to Perldoc for TestLib modu | {"Ramanarayana M"}
 2087 | Problem during Windows service start     | {"Ramanarayana M"}
 2093 | Trigger autovacuum on tuple insertion    | {"Darafei Praliaskouski"}

If you have submitted a patch and it's in "Waiting for author" state, please
aim to get it to "Needs review" state soon if you can, as that's where people
are most likely to be looking for things to review.

I have pinged most threads that are in "Needs review" state and don't apply,
compile warning-free, or pass check-world. I'll do some more of that sort of
thing, and I'll highlight a different set of patches next week.

--
Thomas Munro
https://enterprisedb.com
Re: [PATCH] Speedup truncates of relation forks
On Fri, Jul 5, 2019 at 3:03 PM Jamison, Kirk wrote:
> I updated the patch which is similar to V3 of the patch,
> but addressing my problem in (5) in the previous email regarding
> FreeSpaceMapVacuumRange.
> It seems to pass the regression test now. Kindly check for validation.

Hi Kirk,

FYI there are a couple of compiler errors reported:

Windows compiler:

contrib/pg_visibility/pg_visibility.c(400): error C2143: syntax error : missing ')' before '{' [C:\projects\postgresql\pg_visibility.vcxproj]

GCC:

storage.c: In function ‘RelationTruncate’:
storage.c:238:14: error: variable ‘newnblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber newnblocks = InvalidBlockNumber;
              ^
storage.c:237:14: error: variable ‘new_nfsmblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber new_nfsmblocks = InvalidBlockNumber;
              ^
storage.c: In function ‘smgr_redo’:
storage.c:634:15: error: variable ‘newnblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber newnblocks = InvalidBlockNumber;
               ^
storage.c:633:15: error: variable ‘new_nfsmblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber new_nfsmblocks = InvalidBlockNumber;
               ^

--
Thomas Munro
https://enterprisedb.com
Re: Fix doc bug in logical replication.
On 2019-06-27 18:50, Tomas Vondra wrote: >> Whether we *want* to document that it works, documenting that it >> doesn't work when it does can't be the right answer. If you want to >> couch the language to leave the door open that we may not support this >> the same way in the future I wouldn't be opposed to that, but at this >> point we will have three releases with the current behavior in >> production, so if we decide to change the behavior, it is likely going >> to break certain use cases. That may be ok, but I'd expect a >> documentation update to accompany a change that would cause such a >> breaking change. >> > I agree with that. We have this behavior for quite a bit of time, and > while technically we could change the behavior in the future (using the > "not supported" statement), IMO that'd be pretty annoying move. I always > despised systems that "fix" bugs by documenting that it does not work, and > this is a bit similar. committed back to PG10 -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Excessive memory usage in multi-statement queries w/ partitioning
Hi Amit, On Mon, Jul 8, 2019 at 10:52 AM Amit Langote wrote: > > On Thu, Jul 4, 2019 at 6:52 PM Julien Rouhaud wrote: > > On Tue, May 28, 2019 at 6:57 AM Amit Langote wrote: > > > >> Maybe like the attached? I'm not sure if we need to likewise be > > > >> concerned > > > >> about exec_sql_string() being handed multi-query strings. > > > > the whole extension sql script is passed to execute_sql_string(), so I > > think that it's a good thing to have similar workaround there. > > That makes sense, although it is perhaps much less likely for memory > usage explosion to occur in execute_sql_strings(), because the scripts > passed to execute_sql_strings() mostly contain utility statements and > rarely anything whose planning will explode in memory usage. > > Anyway, I've added similar handling in execute_sql_strings() for consistency. Thanks! > Now I wonder if we'll need to consider another path which calls > pg_plan_queries() on a possibly multi-statement query -- > BuildCachedPlan()... I also thought about this when reviewing the patch, but AFAICS you can't provide a multi-statement query to BuildCachedPlan() using prepared statements and I'm not sure that SPI is worth the trouble. I'll mark this patch as ready for committer. > > > About the patch: > > > > -* Switch to appropriate context for constructing querytrees > > (again, > > -* these must outlive the execution context). > > +* Switch to appropriate context for constructing querytrees. > > +* Memory allocated during this construction is released before > > +* the generated plan is executed. > > > > The comment should mention query and plan trees, everything else seems ok > > to me. > > Okay, fixed. > > Attached updated patch. Thanks again. > > Regards, > Amit
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra wrote: > We're running query like this: > > SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 > ORDER BY 1, 2, 3 > > but we're trying to add the incremental sort *before* the aggregation, > because the optimizer also considers group aggregate with a sorted > input. And (a) is a prefix of (a,sum(b),count(b)) so we think we > actually can do this, but clearly that's nonsense, because we can't > possibly know the aggregates yet. Hence the error. > > If this is the actual issue, we need to ensure we actually can evaluate > all the pathkeys. I don't know how to do that yet. I thought that maybe > we should modify pathkeys_common_contained_in() to set presorted_keys to > 0 in this case. > > But then I started wondering why we don't see this issue even for > regular (non-incremental-sort) paths built in create_ordered_paths(). > How come we don't see these failures there? I've modified costing to > make all incremental sort paths very cheap, and still nothing. I assume you mean you modified costing to make regular sort paths very cheap? > So presumably there's a check elsewhere (either implicit or explicit), > because create_ordered_paths() uses pathkeys_common_contained_in() and > does not have the same issue. Given this comment in create_ordered_paths(): generate_gather_paths() will have already generated a simple Gather path for the best parallel path, if any, and the loop above will have considered sorting it. Similarly, generate_gather_paths() will also have generated order-preserving Gather Merge plans which can be used without sorting if they happen to match the sort_pathkeys, and the loop above will have handled those as well. However, there's one more possibility: it may make sense to sort the cheapest partial path according to the required output order and then use Gather Merge. my understanding is that generate_gather_paths() only considers paths that already happen to be sorted (not explicit sorts), so I'm wondering if it would make more sense for the incremental sort path creation for this case to live alongside the explicit ordered path creation in create_ordered_paths() rather than in generate_gather_paths(). If I'm understanding what you're saying properly, I think you'd expected create_ordered_paths() to be roughly similar in what it considers as partial paths and so have the same problem, and I haven't yet read enough of the code to understand if my proposed change actually has any impact on the issue we're discussing, but it seems to me that it at least fits more with what the comments imply. I'll try to look at it a bit more later also, but at the moment other work calls. James Coleman
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra wrote: We're running query like this: SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3 but we're trying to add the incremental sort *before* the aggregation, because the optimizer also considers group aggregate with a sorted input. And (a) is a prefix of (a,sum(b),count(b)) so we think we actually can do this, but clearly that's nonsense, because we can't possibly know the aggregates yet. Hence the error. If this is the actual issue, we need to ensure we actually can evaluate all the pathkeys. I don't know how to do that yet. I thought that maybe we should modify pathkeys_common_contained_in() to set presorted_keys to 0 in this case. But then I started wondering why we don't see this issue even for regular (non-incremental-sort) paths built in create_ordered_paths(). How come we don't see these failures there? I've modified costing to make all incremental sort paths very cheap, and still nothing. I assume you mean you modified costing to make regular sort paths very cheap? No, I mean costing of incremental sort paths, so that they end up being the cheapest ones. If some other path is cheaper, we won't see the error because it only happens when building plan from the cheapest path. So presumably there's a check elsewhere (either implicit or explicit), because create_ordered_paths() uses pathkeys_common_contained_in() and does not have the same issue. Given this comment in create_ordered_paths(): generate_gather_paths() will have already generated a simple Gather path for the best parallel path, if any, and the loop above will have considered sorting it. Similarly, generate_gather_paths() will also have generated order-preserving Gather Merge plans which can be used without sorting if they happen to match the sort_pathkeys, and the loop above will have handled those as well. However, there's one more possibility: it may make sense to sort the cheapest partial path according to the required output order and then use Gather Merge. my understanding is that generate_gather_paths() only considers paths that already happen to be sorted (not explicit sorts), so I'm wondering if it would make more sense for the incremental sort path creation for this case to live alongside the explicit ordered path creation in create_ordered_paths() rather than in generate_gather_paths(). How would that solve the issue? Also, we're building a gather path, so I think generate_gather_paths() is the right place where to do it. And we're not changing the semantics of generate_gather_paths() - the result path should be sorted "correctly" with respect to sort_pathkeys. If I'm understanding what you're saying properly, I think you'd expected create_ordered_paths() to be roughly similar in what it considers as partial paths and so have the same problem, and I haven't yet read enough of the code to understand if my proposed change actually has any impact on the issue we're discussing, but it seems to me that it at least fits more with what the comments imply. Roughly. AFAICS the problem is that we're trying to use pathkeys that are only valid for the (aggregated) upper relation, not before it. I have to admit I've never quite understood this pathkeys business. I mean, I know what pathkeys are for, but clearly it's not valid to look at root->sort_pathkeys at this place. 
Maybe there's some other field we should be looking at instead, or maybe it's ensured by grouping_planner in some implicit way ... I.e. the question is - when you do a query like SELECT a, count(*) FROM t GROUP BY a ORDER BY a, count(*); and cost the incremental sort extremely cheap, how come we don't end up with the same issue? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 06:04:28PM +0900, Masahiko Sawada wrote: > On Sun, Jul 7, 2019 at 1:05 AM Bruce Momjian wrote: > > What about referential integrity constraints that need to check primary > > keys in the encrypted tables? I also don't see a way of delaying that, > > and if you can't do referential integrity into the encrypted tables, it > > reduces the value of having encrypted data in the same database rather > > than in another database or cluster? > > I just thought that PostgreSQL's auxiliary processes such as > autovacuum, startup, checkpointer, bgwriter should always be able to > access all keys because there are already in inside database. Even > today these processes don't check any privileges when accessing to > data. What security threats we can protect data from by requiring > privileges even for auxiliary processes? If this is a security problem > isn't it also true for cluster-wide encryption? I guess that processes > who have an access privilege on the table can always get the > corresponding encryption key. And any processes cannot access an > encryption key directly without accessing to a database object. Well, see my list of three things that users want in an earlier email: https://www.postgresql.org/message-id/20190706160514.b67q4f7abcxfd...@momjian.us When people are asking for multiple keys (not just for key rotation), they are asking to have multiple keys that can be supplied by users only when they need to access the data. Yes, the keys are always in the datbase, but the feature request is that they are only unlocked when the user needs to access the data. Obviously, that will not work for autovacuum when the encryption is at the block level. If the key is always unlocked, there is questionable security value of having multiple keys, beyond key rotation. > > I still feel we have not clearly described what the options are: > > > > 1. Encrypt everything > > > > 2. Encrypt only some tables (for performance reasons), and use only one > > key, or use multiple keys to allow for key rotation. All keys are > > always unlocked. > > > > 3. Encrypt only some tables with different keys, and the keys are not > > always unlocked. > > > > As Tomas already stated, using tablespaces to distinguish encrypted from > > non-encrypted tables doesn't make sense since the file system used for > > the storage is immaterial to the encryptions status. An easier way would > > be to just add a bit to WAL that would indicate if the rest of the WAL > > record is encrypted, though I doubt the performance boost is worth the > > complexity. > > Okay, instead of using tablespaces we can create groups grouping > tables being encrypted with the same key. I think the one of the most > important point here is to provide a granular encryption feature and Why is this important? What are you trying to accomplish? > have less the number of keys in database cluster, not to provide per > tablespace encryption feature. I'm not going to insist it should be > per tablespace encryption. It is unclear which item you are looking for. Which number are you suggesting from the three listed above in the email URL? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: allow_system_table_mods stuff
On Sun, Jul 7, 2019 at 11:45:49PM -0400, Bruce Momjian wrote: > On Mon, Jun 24, 2019 at 11:20:51AM -0400, Tom Lane wrote: > > I do see value in two switches not one, but it's what I said above, > > to not need to give people *more* chance-to-break-things than they > > had before when doing manual catalog fixes. That is, we need a > > setting that corresponds more or less to current default behavior. > > > > There's an aesthetic argument to be had about whether to have two > > bools or one three-way switch, but I prefer the former; there's > > no backward-compatibility issue here since allow_system_table_mods > > couldn't be set by applications anyway. > > I like a single three-way switch since if you are allowing DDL, you > probably don't care if you restrict DML. log_statement already has a > similar distinction with values of none, ddl, mod, all. I assume > allow_system_table_mods could have value of false, dml, true. Or, to match log_statement, use: none, dml, all. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
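For illustration, the sort of usage being discussed would look roughly like this; the value names follow the log_statement-style suggestion above, and nothing here is a committed interface:

-- hypothetical three-way setting (names and settability still under discussion)
SET allow_system_table_mods = 'none';  -- default: reject catalog modifications
SET allow_system_table_mods = 'dml';   -- allow DML on catalogs, still reject DDL
SET allow_system_table_mods = 'all';   -- allow both DML and DDL on catalogs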
Re: Broken defenses against dropping a partitioning column
On 2019-Jul-07, Tom Lane wrote: > Ideally, perhaps, a DROP CASCADE like this would not cascade to > the whole table but only to the table's partitioned-ness property, > leaving you with a non-partitioned table with most of its data > intact. It would take a lot of work to make that happen though, > and it certainly wouldn't be back-patchable, and I'm not really > sure it's worth it. Maybe we can add dependencies to rows of the pg_partitioned_table relation, with the semantics of "depends on the partitioned-ness of the table"? That said, I'm not sure I see the use case for an ALTER TABLE .. DROP COLUMN command that turns a partitioned table (with existing partitions containing data) into one non-partitioned table with all data minus the partitioning column(s). This seems vaguely related to the issue of dropping foreign keys; see https://postgr.es/m/20190329152239.GA29258@alvherre.pgsql wherein I settled with a non-ideal solution to the problem of being unable to depend on something that did not cause the entire table to be dropped in certain cases. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra wrote: > > On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: > >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra > > wrote: > >> We're running query like this: > >> > >> SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 > >> ORDER BY 1, 2, 3 > >> > >> but we're trying to add the incremental sort *before* the aggregation, > >> because the optimizer also considers group aggregate with a sorted > >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we > >> actually can do this, but clearly that's nonsense, because we can't > >> possibly know the aggregates yet. Hence the error. > >> > >> If this is the actual issue, we need to ensure we actually can evaluate > >> all the pathkeys. I don't know how to do that yet. I thought that maybe > >> we should modify pathkeys_common_contained_in() to set presorted_keys to > >> 0 in this case. > >> > >> But then I started wondering why we don't see this issue even for > >> regular (non-incremental-sort) paths built in create_ordered_paths(). > >> How come we don't see these failures there? I've modified costing to > >> make all incremental sort paths very cheap, and still nothing. > > > >I assume you mean you modified costing to make regular sort paths very cheap? > > > > No, I mean costing of incremental sort paths, so that they end up being > the cheapest ones. If some other path is cheaper, we won't see the error > because it only happens when building plan from the cheapest path. Ah, I misunderstood as you trying to figure out a way to try to cause the same problem with a regular sort. > >> So presumably there's a check elsewhere (either implicit or explicit), > >> because create_ordered_paths() uses pathkeys_common_contained_in() and > >> does not have the same issue. > > > >Given this comment in create_ordered_paths(): > > > > generate_gather_paths() will have already generated a simple Gather > > path for the best parallel path, if any, and the loop above will have > > considered sorting it. Similarly, generate_gather_paths() will also > > have generated order-preserving Gather Merge plans which can be used > > without sorting if they happen to match the sort_pathkeys, and the loop > > above will have handled those as well. However, there's one more > > possibility: it may make sense to sort the cheapest partial path > > according to the required output order and then use Gather Merge. > > > >my understanding is that generate_gather_paths() only considers paths > >that already happen to be sorted (not explicit sorts), so I'm > >wondering if it would make more sense for the incremental sort path > >creation for this case to live alongside the explicit ordered path > >creation in create_ordered_paths() rather than in > >generate_gather_paths(). > > > > How would that solve the issue? Also, we're building a gather path, so > I think generate_gather_paths() is the right place where to do it. And > we're not changing the semantics of generate_gather_paths() - the result > path should be sorted "correctly" with respect to sort_pathkeys. Does that imply what the explicit sort in create_ordered_paths() is in the wrong spot? Or, to put it another way, do you think that both kinds of sorts should be added in the same place? It seems confusing to me that they'd be split between the two methods (unless I'm completely misunderstanding how the two work). I'm not saying it would solve the issue here; just noting that the division of labor seemed odd to me at first read through. 
James Coleman
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Jul 8, 2019 at 6:57 AM Amit Kapila wrote: > I didn't have other use cases in mind and I think to some extent this > argument holds true for existing binaryheap_allocate allocate. If we > want to make it more generic, then shouldn't we try to even change the > existing binaryheap_allocate to use this new model, that way the > binary heap allocation API will be more generic? No. binaryheap_allocate is fine for simple cases and there's no reason that I can see to change it. > You are right that it won't be nth smallest element from the queue and > we don't even care for that here. The peeking logic is not to find > the next prioritized element but to check if we can find some element > for the same database in the next few entries to avoid frequent undo > worker restart. You *should* care for that here. The whole purpose of a binary heap is to help us choose which task we should do first and which ones should be done later. I don't see why it's OK to decide that we only care about doing the tasks in priority order sometimes, and other times it's OK to just pick semi-randomly. > This is a good test scenario, but I think it has not been considered > that there are multiple queues and we peek into each one. I think that makes very little difference, so I don't see why it should be considered. It's true that this will sometimes mask the problem, but so what? An algorithm that works 90% of the time is not much better than one that works 80% of the time, and neither is the caliber of work we are expecting to see in PostgreSQL. > I think we should once try with the actual program which can test such > a scenario on the undo patches before reaching any conclusion. I or > one of my colleague will work on this and report back the results. There is certainly a place for empirical testing of a patch like this (perhaps even before posting it). It does not substitute for a good theoretical explanation of why the algorithm is correct, and I don't think it is. > > You can construct way worse examples than > > this one, too: imagine that there are two databases, each with a > > worker, and one has 99% of the requests and the other one has 1% of > > the requests. It's really unlikely that there's going to be an entry > > for the second database within the lookahead window. > > I am not sure if that is the case because as soon as the request from > other database get prioritized (say because its XID becomes older) and > came as the first request in one of the queues, the undo worker will > exit (provided it has worked for some threshold time (10s) in that > database) and allow the request from another database to be processed. I don't see how this responds to what I wrote. Neither worker needs to exit in this scenario, but the worker from the less-popular database is likely to exit anyway, which seems like it's probably not the right thing. > > And note that > > increasing the window doesn't really help either: you just need more > > databases than the size of the lookahead window, or even almost as > > many as the lookahead window, and things are going to stop working > > properly. > > > > On the other hand, suppose that you have 10 databases and one undo > > worker. One database is pretty active and generates a continuous > > stream of undo requests at exactly the same speed we can process them. > > The others all have 1 pending undo request. Now, what's going to > > happen is that you'll always find the undo request for the current > > database within the lookahead window. So, you'll never exit. 
> > Following the logic given above, I think here also worker will exit as > soon as the request from other database get prioritised. OK. > > But > > that means the undo requests in the other 9 databases will just sit > > there for all eternity, because there's no other worker to process > > them. On the other hand, if you had 11 databases, there's a good > > chance it would work fine, because the new request for the active > > database would likely be outside the lookahead window, and so you'd > > find no work to do and exit, allowing a worker to be started up in > > some other database. > > As explained above, I think it will work the same way both for 10 or > 11 databases. Note, that we don't always try to look ahead. We look > ahead when we have not worked on the current database for some > threshold amount of time. That's interesting, and it means that some of the scenarios that I mentioned are not problems. However, I don't believe it means that your code is actually correct. It's just means that it's wrong in different ways. The point is that, with the way you've implemented this, whenever you do lookahead, you will, basically randomly, sometimes find the next entry for the current database within the lookahead window, and sometimes you won't. And sometimes it will be the next-highest-priority request, and sometimes it won't. That just cannot possibly be the right thing to do. Would you
Re: Duplicated LSN in ReorderBuffer
On Mon, Jul 8, 2019 at 9:00 AM Thomas Munro wrote: > > On Wed, Jun 26, 2019 at 2:46 AM Ildar Musin wrote: > > Attached is a simple patch that uses subxid instead of top-level xid > > in ReorderBufferAddNewTupleCids() call. It seems to fix the bug, but > > i'm not sure that this is a valid change. Can someone please verify it > > or maybe suggest a better solution for the issue? > I've reproduced this issue with script Ildar provided. I don't find out the root cause yet and I'm not sure the patch takes a correct way to fix this. In my environment, I got the following pg_waldump output and the logical decoding failed at 0/019FA058 when processing NEW_CID. 90489 rmgr: Transaction len (rec/tot): 38/38, tx: 1999, lsn: 0/019F9E80, prev 0/019F9E38, desc: ASSIGNMENT xtop 1998: subxacts: 1999 90490 rmgr: Standby len (rec/tot):405/ 405, tx: 0, lsn: 0/019F9EA8, prev 0/019F9E80, desc: RUNNING_XACTS nextXid 2000 latestCompletedXid 1949 oldestRunningXid 1836; 48 xacts: 1990 1954 1978 1850 1944 1972 1940 1924 1906 1970 1985 1998 1966 1987 1975 1858 1914 1982 1958 1840 1920 1926 1992 1962 1 90490 910 1950 1874 1928 1974 1968 1946 1912 1918 1996 1922 1930 1964 1952 1994 1934 1980 1836 1984 1960 1956 1916 1908 1938 90491 rmgr: Heap2 len (rec/tot): 60/60, tx: 1999, lsn: 0/019FA058, prev 0/019F9EA8, desc: NEW_CID rel 1663/12678/2615; tid 11/59; cmin: 0, cmax: 4294967295, combo: 4294967295 90492 rmgr: Heaplen (rec/tot):127/ 127, tx: 1999, lsn: 0/019FA098, prev 0/019FA058, desc: INSERT off 59 flags 0x00, blkref #0: rel 1663/12678/2615 blk 11 I thought that the logical decoding doesn't create ReorderBufferTXN of xid=1999 when processing NEW_CID since it decodes ASSIGNMENT of xid=1999 beforehand. But what actually happen is that it skips NEW_CID since the state of snapshot builder is SNAPBUILD_BUILDING_SNAPSHOT yet and then the state becomes SNAPBUILD_FULL_SNAPSHOT when processing RUNNING_XACTS , and therefore it creates two ReorderBufferTXN entries for xid = 1999 and xid = 1998 as top-level transactions when processing NEW_CID (ReorderBufferXidSetCatalogChanges creates xid=1999 and ReorderBufferAddNewTupleCids creates xid = 1998). And therefore it got the assertion failure when adding ReorderBufferTXN of xid = 1998. I'll look into this more deeply tomorrow. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote: On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra wrote: On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra > wrote: >> We're running query like this: >> >> SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3 >> >> but we're trying to add the incremental sort *before* the aggregation, >> because the optimizer also considers group aggregate with a sorted >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we >> actually can do this, but clearly that's nonsense, because we can't >> possibly know the aggregates yet. Hence the error. >> >> If this is the actual issue, we need to ensure we actually can evaluate >> all the pathkeys. I don't know how to do that yet. I thought that maybe >> we should modify pathkeys_common_contained_in() to set presorted_keys to >> 0 in this case. >> >> But then I started wondering why we don't see this issue even for >> regular (non-incremental-sort) paths built in create_ordered_paths(). >> How come we don't see these failures there? I've modified costing to >> make all incremental sort paths very cheap, and still nothing. > >I assume you mean you modified costing to make regular sort paths very cheap? > No, I mean costing of incremental sort paths, so that they end up being the cheapest ones. If some other path is cheaper, we won't see the error because it only happens when building plan from the cheapest path. Ah, I misunderstood as you trying to figure out a way to try to cause the same problem with a regular sort. >> So presumably there's a check elsewhere (either implicit or explicit), >> because create_ordered_paths() uses pathkeys_common_contained_in() and >> does not have the same issue. > >Given this comment in create_ordered_paths(): > > generate_gather_paths() will have already generated a simple Gather > path for the best parallel path, if any, and the loop above will have > considered sorting it. Similarly, generate_gather_paths() will also > have generated order-preserving Gather Merge plans which can be used > without sorting if they happen to match the sort_pathkeys, and the loop > above will have handled those as well. However, there's one more > possibility: it may make sense to sort the cheapest partial path > according to the required output order and then use Gather Merge. > >my understanding is that generate_gather_paths() only considers paths >that already happen to be sorted (not explicit sorts), so I'm >wondering if it would make more sense for the incremental sort path >creation for this case to live alongside the explicit ordered path >creation in create_ordered_paths() rather than in >generate_gather_paths(). > How would that solve the issue? Also, we're building a gather path, so I think generate_gather_paths() is the right place where to do it. And we're not changing the semantics of generate_gather_paths() - the result path should be sorted "correctly" with respect to sort_pathkeys. Does that imply what the explicit sort in create_ordered_paths() is in the wrong spot? I think those are essentially the right places where to do this sort of stuff. Maybe there's a better place, but I don't think those places are somehow wrong. Or, to put it another way, do you think that both kinds of sorts should be added in the same place? It seems confusing to me that they'd be split between the two methods (unless I'm completely misunderstanding how the two work). 
The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath; it never even looked at ordering expected by paths above it (or the query as a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper parts of the plan - either by using a properly sorted path, or adding an explicit sort.

We want to extend (1) to also look at ordering expected by the upper parts of the plan, and consider incremental sort if applicable. (2) already does that, and it already has the correct pathkeys to enforce. But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-( The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys and group_pathkeys too, so maybe we should be looking at those too? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [RFC] [PATCH] Flexible "partition pruning" hook
On Sun, Jul 7, 2019 at 9:46 PM Thomas Munro wrote: > On Sat, Apr 6, 2019 at 3:06 PM Andres Freund wrote: > > I've moved this to the next CF, and marked it as targeting v13. > > Hi Mike, > > Commifest 1 for PostgreSQL 13 is here. I was just going through the > automated build results for the 'fest and noticed that your patch > causes a segfault in the regression tests (possibly due to other > changes that have been made in master since February). You can see > the complete backtrace on the second link below, but it looks like > this is happening every time, so hopefully not hard to track down > locally. > > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.46412 > https://travis-ci.org/postgresql-cfbot/postgresql/builds/555380617 Thanks, Thomas, I'll look into this today. -- Mike Palmiotto Software Engineer Crunchy Data Solutions https://crunchydata.com
Re: Broken defenses against dropping a partitioning column
On Mon, Jul 8, 2019 at 10:32 AM Alvaro Herrera wrote: > That said, I'm not sure I see the use case for an ALTER TABLE .. DROP > COLUMN command that turns a partitioned table (with existing partitions > containing data) into one non-partitioned table with all data minus the > partitioning column(s). I think it would be useful to have "ALTER TABLE blah NOT PARTITIONED" but I -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Broken defenses against dropping a partitioning column
On Mon, Jul 8, 2019 at 11:02 AM Robert Haas wrote: > On Mon, Jul 8, 2019 at 10:32 AM Alvaro Herrera > wrote: > > That said, I'm not sure I see the use case for an ALTER TABLE .. DROP > > COLUMN command that turns a partitioned table (with existing partitions > > containing data) into one non-partitioned table with all data minus the > > partitioning column(s). > > I think it would be useful to have "ALTER TABLE blah NOT PARTITIONED" but I ...hit send too soon, and also, I don't think anyone will be very happy if they get that behavior as a side effect of a DROP statement, mostly because it could take an extremely long time to execute. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 7/8/19 10:19 AM, Bruce Momjian wrote: > When people are asking for multiple keys (not just for key rotation), > they are asking to have multiple keys that can be supplied by users only > when they need to access the data. Yes, the keys are always in the > datbase, but the feature request is that they are only unlocked when the > user needs to access the data. Obviously, that will not work for > autovacuum when the encryption is at the block level. > If the key is always unlocked, there is questionable security value of > having multiple keys, beyond key rotation. That is not true. Having multiple keys also allows you to reduce the amount of data encrypted with a single key, which is desirable because: 1. It makes cryptanalysis more difficult 2. Puts less data at risk if someone gets "lucky" in doing brute force Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
Re: Broken defenses against dropping a partitioning column
Robert Haas writes: > On Mon, Jul 8, 2019 at 11:02 AM Robert Haas wrote: >> On Mon, Jul 8, 2019 at 10:32 AM Alvaro Herrera >> wrote: >>> That said, I'm not sure I see the use case for an ALTER TABLE .. DROP >>> COLUMN command that turns a partitioned table (with existing partitions >>> containing data) into one non-partitioned table with all data minus the >>> partitioning column(s). >> I think it would be useful to have "ALTER TABLE blah NOT PARTITIONED" but I > ...hit send too soon, and also, I don't think anyone will be very > happy if they get that behavior as a side effect of a DROP statement, > mostly because it could take an extremely long time to execute. FWIW, I was imagining the action as being (1) detach all the child partitions, (2) make parent into a non-partitioned table, (3) drop the target column in each of these now-independent tables. No data movement. Other than the need to acquire locks on all the tables, it shouldn't be particularly slow. But I'm still not volunteering to write it, because I'm not sure anyone would want such a behavior. regards, tom lane
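In SQL terms, steps (1) and (3) already have manual equivalents today; it is step (2), stripping the partitioned-ness from the parent, that has no command. A rough sketch with hypothetical object names:

-- hypothetical names: partitioned table "parent" with partition key column
-- "part_col" and partitions "child1" and "child2"
ALTER TABLE parent DETACH PARTITION child1;
ALTER TABLE parent DETACH PARTITION child2;
ALTER TABLE child1 DROP COLUMN part_col;  -- allowed once detached
ALTER TABLE child2 DROP COLUMN part_col;
-- there is no command to turn "parent" itself back into a plain table,
-- which is the piece the DROP-CASCADE behavior imagined above would need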
Re: Broken defenses against dropping a partitioning column
Alvaro Herrera writes: > That said, I'm not sure I see the use case for an ALTER TABLE .. DROP > COLUMN command that turns a partitioned table (with existing partitions > containing data) into one non-partitioned table with all data minus the > partitioning column(s). Yeah, it'd be a lot of work for a dubious goal. > This seems vaguely related to the issue of dropping foreign keys; see > https://postgr.es/m/20190329152239.GA29258@alvherre.pgsql wherein I > settled with a non-ideal solution to the problem of being unable to > depend on something that did not cause the entire table to be dropped > in certain cases. That's an interesting analogy. Re-reading that thread, what I said in <29497.1554217...@sss.pgh.pa.us> seems pretty apropos to the current problem: >> FWIW, I think that the dependency mechanism is designed around the idea >> that whether it's okay to drop a *database object* depends only on what >> other *database objects* rely on it, and that you can always make a DROP >> valid by dropping all the dependent objects. That's not an unreasonable >> design assumption, considering that the SQL standard embeds the same >> assumption in its RESTRICT/CASCADE syntax. I think that is probably a fatal objection to my idea of putting an error check into RemoveAttributeById(). As an example, consider the possibility that somebody makes a temporary type and then makes a permanent table with a partitioning column of that type. What shall we do at session exit? Failing to remove the temp type is not an acceptable choice, because that leaves us with a permanently broken temp schema (compare bug #15631 [1]). Also, I don't believe we can make that work without order-of-operations problems in cases comparable to the original bug in this thread [2]. One or the other order of the object OIDs is going to lead to the column being visited for deletion before the whole table is, and again rejecting the column deletion is not going to be an acceptable behavior. So I think we're probably stuck with the approach of adding new internal dependencies. If we go that route, then our options for the released branches are (1) do nothing, or (2) back-patch the code that adds such dependencies, but without a catversion bump. That would mean that only tables created after the next minor releases would have protection against this problem. That's not ideal but maybe it's okay, considering that we haven't seen actual field reports of trouble of this kind. regards, tom lane [1] https://www.postgresql.org/message-id/flat/15631-188663b383e1e697%40postgresql.org [2] https://www.postgresql.org/message-id/flat/CA%2Bu7OA4JKCPFrdrAbOs7XBiCyD61XJxeNav4LefkSmBLQ-Vobg%40mail.gmail.com
Re: Add missing operator <->(box, point)
On Mon, Mar 11, 2019 at 2:49 AM Nikita Glukhov wrote: > 2. Add <-> to GiST box_ops. >Extracted gist_box_distance_helper() common for gist_box_distance() and >gist_bbox_distance(). For me it doesn't look worth having two distinct functions gist_box_distance_helper() and gist_bbox_distance(). What about having just one and leave responsibility for recheck flag to the caller? > 3. Add <-> to SP-GiST. >Changed only catalog and tests. Box case is already checked in >spg_box_quad_leaf_consistent(): > out->recheckDistances = distfnoid == F_DIST_POLYP; So, it seems to be fix of oversight in 2a6368343ff4. But assuming fixing this requires catalog changes, we shouldn't backpatch this. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
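For anyone following along, the asymmetry is visible from plain SQL: the point-to-box form of the distance operator exists today, while the box-to-point form is what the patch adds. A small illustration (the second query is expected to fail on an unpatched server with an "operator does not exist" error):

SELECT point '(3,3)' <-> box '((0,0),(1,1))';   -- works today
SELECT box '((0,0),(1,1))' <-> point '(3,3)';   -- the commutator the patch adds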
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > When people are asking for multiple keys (not just for key rotation), > > they are asking to have multiple keys that can be supplied by users only > > when they need to access the data. Yes, the keys are always in the > > datbase, but the feature request is that they are only unlocked when the > > user needs to access the data. Obviously, that will not work for > > autovacuum when the encryption is at the block level. > > > If the key is always unlocked, there is questionable security value of > > having multiple keys, beyond key rotation. > > That is not true. Having multiple keys also allows you to reduce the > amount of data encrypted with a single key, which is desirable because: > > 1. It makes cryptanalysis more difficult > 2. Puts less data at risk if someone gets "lucky" in doing brute force What systems use multiple keys like that? I know of no website that does that. Your arguments seem hypothetical. What is your goal here? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: errbacktrace
> On Sat, Jun 29, 2019 at 7:41 AM Jaime Casanova wrote:
>
> This is certainly a very useful thing. Sadly, it doesn't seem to compile when
> trying to use libunwind.

Yeah, the same for me. To make it work I've restricted libunwind to local unwinding only:

#ifdef USE_LIBUNWIND
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#endif

And the result looks pretty nice:

2019-07-08 17:24:08.406 CEST [31828] ERROR: invalid input syntax for type integer: "foobar" at character 12
2019-07-08 17:24:08.406 CEST [31828] BACKTRACE:
#0 pg_strtoint32+0x1d1 [0x55fa389bcbbe]
#1 int4in+0xd [0x55fa38976d7b]
#2 InputFunctionCall+0x6f [0x55fa38a488e9]
#3 OidInputFunctionCall+0x44 [0x55fa38a48b0d]
#4 stringTypeDatum+0x33 [0x55fa386e222e]
#5 coerce_type+0x26d [0x55fa386ca14d]
#6 coerce_to_target_type+0x79 [0x55fa386c9494]
#7 transformTypeCast+0xaa [0x55fa386d0042]
#8 transformExprRecurse+0x22f [0x55fa386cf650]
#9 transformExpr+0x1a [0x55fa386cf30a]
#10 transformTargetEntry+0x79 [0x55fa386e1131]
#11 transformTargetList+0x86 [0x55fa386e11ce]
#12 transformSelectStmt+0xa1 [0x55fa386a29c9]
#13 transformStmt+0x9d [0x55fa386a345a]
#14 transformOptionalSelectInto+0x94 [0x55fa386a3f49]
#15 transformTopLevelStmt+0x15 [0x55fa386a3f88]
#16 parse_analyze+0x4e [0x55fa386a3fef]
#17 pg_analyze_and_rewrite+0x3e [0x55fa3890cfa5]
#18 exec_simple_query+0x35b [0x55fa3890d5b5]
#19 PostgresMain+0x91f [0x55fa3890f7a8]
#20 BackendRun+0x1ac [0x55fa3887ed17]
#21 BackendStartup+0x15c [0x55fa38881ea1]
#22 ServerLoop+0x1e6 [0x55fa388821bb]
#23 PostmasterMain+0x1101 [0x55fa388835a1]
#24 main+0x21a [0x55fa387db1a9]
#25 __libc_start_main+0xe7 [0x7f3d1a607fa7]
#26 _start+0x2a [0x55fa3858e4ea]
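For reference, a session producing output like the above would presumably look roughly as follows; the setting name is taken from the patch snippet Álvaro quotes below and may not match whatever finally gets committed:

-- hypothetical: backtrace_function comes from the in-progress patch, not a
-- released GUC, and its exact name/semantics may still change
SET backtrace_function = 'pg_strtoint32';
SELECT 'foobar'::integer;  -- raises the error above, now with a BACKTRACE: section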
Re: tableam vs. TOAST
On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu wrote:
> I have tested the TOAST patches(v3) with different storage options like(MAIN,
> EXTERNAL, EXTENDED, etc.), and combinations of compression and out-of-line storage options.
> I have used a few dummy tables with various tuple count say 10k, 20k, 40k, etc. with different column lengths.
> Used manual CHECKPOINT option with (checkpoint_timeout = 1d, max_wal_size = 10GB) before the test to avoid
> performance fluctuations, and calculated the results as a median value of a few consecutive test executions.

Thanks for testing.

> All the observation looks good to me,
> except for the "Test1" for SCC UPDATE with tuple count(10K/20K), for SCC INSERT with tuple count(40K)
> there was a slightly increse in time taken incase of "with patch" result. For a better observation,
> I also have ran the same "Test 1" for higher tuple count(i.e. 80K), and it also looks fine.

Did you run each test just once? How stable are the results?

> While testing few concurrent transactions I have below query:
> -- Concurrent transactions acquire a lock for TOAST option(ALTER TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED/ etc)
>
> -- Session 1:
> CREATE TABLE a (a_id text PRIMARY KEY);
> CREATE TABLE b (b_id text);
> INSERT INTO a VALUES ('a'), ('b');
> INSERT INTO b VALUES ('a'), ('b'), ('b');
>
> BEGIN;
> ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id);
> -- Not Acquiring any lock

For me, this acquires AccessShareLock and ShareRowExclusiveLock on the target table.

rhaas=# select locktype, database, relation, pid, mode, granted from pg_locks where relation = 'b'::regclass;
 locktype | database | relation |  pid  |         mode          | granted
----------+----------+----------+-------+-----------------------+---------
 relation |    16384 |    16872 | 93197 | AccessShareLock       | t
 relation |    16384 |    16872 | 93197 | ShareRowExclusiveLock | t
(2 rows)

I don't see what that has to do with the topic at hand, though.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > When people are asking for multiple keys (not just for key rotation), > > > they are asking to have multiple keys that can be supplied by users only > > > when they need to access the data. Yes, the keys are always in the > > > datbase, but the feature request is that they are only unlocked when the > > > user needs to access the data. Obviously, that will not work for > > > autovacuum when the encryption is at the block level. > > > > > If the key is always unlocked, there is questionable security value of > > > having multiple keys, beyond key rotation. > > > > That is not true. Having multiple keys also allows you to reduce the > > amount of data encrypted with a single key, which is desirable because: > > > > 1. It makes cryptanalysis more difficult > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > What systems use multiple keys like that? I know of no website that > does that. Your arguments seem hypothetical. What is your goal here? Not sure what the reference to 'website' is here, but one doesn't get certificates for TLS/SSL usage that aren't time-bounded, and when it comes to the actual on-the-wire encryption that's used, that's a symmetric key that's generated on-the-fly for every connection. Wouldn't the fact that they generate a different key for every connection be a pretty clear indication that it's a good idea to use multiple keys and not use the same key over and over..? Of course, we can discuss if what websites do with over-the-wire encryption is sensible to compare to what we want to do in PG for data-at-rest, but then we shouldn't be talking about what websites do, it'd make more sense to look at other data-at-rest encryption systems and consider what they're doing. Thanks, Stephen signature.asc Description: PGP signature
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 2019-07-08 17:47, Stephen Frost wrote: > Of course, we can discuss if what websites do with over-the-wire > encryption is sensible to compare to what we want to do in PG for > data-at-rest, but then we shouldn't be talking about what websites do, > it'd make more sense to look at other data-at-rest encryption systems > and consider what they're doing. So, how do encrypted file systems do it? Are there any encrypted file systems in general use that allow encrypting only some files or encrypting different parts of the file system with different keys, or any of those other granular approaches being discussed? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra wrote: > > On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote: > >On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra > > wrote: > >> > >> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: > >> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra > >> > wrote: > >> >> We're running query like this: > >> >> > >> >> SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) > >> >> < 3 ORDER BY 1, 2, 3 > >> >> > >> >> but we're trying to add the incremental sort *before* the aggregation, > >> >> because the optimizer also considers group aggregate with a sorted > >> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we > >> >> actually can do this, but clearly that's nonsense, because we can't > >> >> possibly know the aggregates yet. Hence the error. > >> >> > >> >> If this is the actual issue, we need to ensure we actually can evaluate > >> >> all the pathkeys. I don't know how to do that yet. I thought that maybe > >> >> we should modify pathkeys_common_contained_in() to set presorted_keys to > >> >> 0 in this case. > >> >> > >> >> But then I started wondering why we don't see this issue even for > >> >> regular (non-incremental-sort) paths built in create_ordered_paths(). > >> >> How come we don't see these failures there? I've modified costing to > >> >> make all incremental sort paths very cheap, and still nothing. > >> > > >> >I assume you mean you modified costing to make regular sort paths very > >> >cheap? > >> > > >> > >> No, I mean costing of incremental sort paths, so that they end up being > >> the cheapest ones. If some other path is cheaper, we won't see the error > >> because it only happens when building plan from the cheapest path. > > > >Ah, I misunderstood as you trying to figure out a way to try to cause > >the same problem with a regular sort. > > > >> >> So presumably there's a check elsewhere (either implicit or explicit), > >> >> because create_ordered_paths() uses pathkeys_common_contained_in() and > >> >> does not have the same issue. > >> > > >> >Given this comment in create_ordered_paths(): > >> > > >> > generate_gather_paths() will have already generated a simple Gather > >> > path for the best parallel path, if any, and the loop above will have > >> > considered sorting it. Similarly, generate_gather_paths() will also > >> > have generated order-preserving Gather Merge plans which can be used > >> > without sorting if they happen to match the sort_pathkeys, and the loop > >> > above will have handled those as well. However, there's one more > >> > possibility: it may make sense to sort the cheapest partial path > >> > according to the required output order and then use Gather Merge. > >> > > >> >my understanding is that generate_gather_paths() only considers paths > >> >that already happen to be sorted (not explicit sorts), so I'm > >> >wondering if it would make more sense for the incremental sort path > >> >creation for this case to live alongside the explicit ordered path > >> >creation in create_ordered_paths() rather than in > >> >generate_gather_paths(). > >> > > >> > >> How would that solve the issue? Also, we're building a gather path, so > >> I think generate_gather_paths() is the right place where to do it. And > >> we're not changing the semantics of generate_gather_paths() - the result > >> path should be sorted "correctly" with respect to sort_pathkeys. > > > >Does that imply what the explicit sort in create_ordered_paths() is in > >the wrong spot? 
> > > > I think those are essentially the right places where to do this sort of > stuff. Maybe there's a better place, but I don't think those places are > somehow wrong. > > >Or, to put it another way, do you think that both kinds of sorts > >should be added in the same place? It seems confusing to me that > >they'd be split between the two methods (unless I'm completely > >misunderstanding how the two work). > > > > The paths built in those two places were very different in one aspect: > > 1) generate_gather_paths only ever looked at pathkeys for the subpath, it > never even looked at ordering expected by paths above it (or the query as > a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims > to maintain ordering of the different subpaths. > > 2) create_oredered_paths is meant to enforce ordering needed by upper > parts of the plan - either by using a properly sorted path, or adding an > explicit sort. > > > We want to extend (1) to also look at ordering expected by the upper parts > of the plan, and consider incremental sort if applicable. (2) already does > that, and it already has the correct pathkeys to enforce. I guess I'm still not following. If (2) is responsible (currently) for adding an explicit sort, why wouldn't adding an incremental sort be an example of that responsibility? The subpath that either a Sort or IncrementalSort is being added on top of (to then feed into the GatherMerge) is the same
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 7/8/19 11:56 AM, Peter Eisentraut wrote:
> On 2019-07-08 17:47, Stephen Frost wrote:
>> Of course, we can discuss if what websites do with over-the-wire
>> encryption is sensible to compare to what we want to do in PG for
>> data-at-rest, but then we shouldn't be talking about what websites do,
>> it'd make more sense to look at other data-at-rest encryption systems
>> and consider what they're doing.
>
> So, how do encrypted file systems do it? Are there any encrypted file
> systems in general use that allow encrypting only some files or
> encrypting different parts of the file system with different keys, or
> any of those other granular approaches being discussed?

Well it is fairly common, for good reason IMHO, to encrypt some mount points and not others on a system. In my mind, and in practice to a large extent, a postgres tablespace == a unique mount point.

There is a description here: https://wiki.archlinux.org/index.php/Disk_encryption

A pertinent quote:

  After it has been derived, the master key is securely stored in memory
  (e.g. in a kernel keyring), for as long as the encrypted block device or
  folder is mounted. It is usually not used for de/encrypting the disk data
  directly, though. For example, in the case of stacked filesystem
  encryption, each file can be automatically assigned its own encryption
  key. Whenever the file is to be read/modified, this file key first needs
  to be decrypted using the main key, before it can itself be used to
  de/encrypt the file contents:

  [The wiki's diagram: the master key decrypts the encrypted file key stored
  on disk; that file key then de/encrypts the encrypted file contents into
  readable file contents.]

  In a similar manner, a separate key (e.g. one per folder) may be used for
  the encryption of file names in the case of stacked filesystem encryption.

Joe

-- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
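The same two-tier arrangement described above (a master key that only wraps per-object data keys, which in turn encrypt the payload) can be sketched with pgcrypto just to make the data flow concrete. This is purely illustrative and assumes the pgcrypto extension; it is not how a TDE implementation would actually store or manage keys:

-- illustrative only: a random data key is wrapped by a master passphrase,
-- and the payload is encrypted with the data key, never with the master key
CREATE EXTENSION IF NOT EXISTS pgcrypto;

WITH k AS (
    SELECT encode(gen_random_bytes(32), 'hex') AS data_key
)
SELECT pgp_sym_encrypt(data_key, 'master-passphrase')  AS wrapped_data_key,
       pgp_sym_encrypt('payload to protect', data_key) AS encrypted_payload
FROM k;

-- rotating the master key then means re-wrapping only wrapped_data_key;
-- the bulk data encrypted with the data key does not have to be rewritten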
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 11:47:33AM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (br...@momjian.us) wrote: > > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > > When people are asking for multiple keys (not just for key rotation), > > > > they are asking to have multiple keys that can be supplied by users only > > > > when they need to access the data. Yes, the keys are always in the > > > > datbase, but the feature request is that they are only unlocked when the > > > > user needs to access the data. Obviously, that will not work for > > > > autovacuum when the encryption is at the block level. > > > > > > > If the key is always unlocked, there is questionable security value of > > > > having multiple keys, beyond key rotation. > > > > > > That is not true. Having multiple keys also allows you to reduce the > > > amount of data encrypted with a single key, which is desirable because: > > > > > > 1. It makes cryptanalysis more difficult > > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > > > What systems use multiple keys like that? I know of no website that > > does that. Your arguments seem hypothetical. What is your goal here? > > Not sure what the reference to 'website' is here, but one doesn't get > certificates for TLS/SSL usage that aren't time-bounded, and when it > comes to the actual on-the-wire encryption that's used, that's a > symmetric key that's generated on-the-fly for every connection. > > Wouldn't the fact that they generate a different key for every > connection be a pretty clear indication that it's a good idea to use > multiple keys and not use the same key over and over..? > > Of course, we can discuss if what websites do with over-the-wire > encryption is sensible to compare to what we want to do in PG for > data-at-rest, but then we shouldn't be talking about what websites do, > it'd make more sense to look at other data-at-rest encryption systems > and consider what they're doing. (I talked to Joe on chat for clarity.) In modern TLS, the certificate is used only for authentication, and Diffie–Hellman is used for key exchange: https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange So, the question is whether you can pass so much data in TLS that using the same key for the entire session is a security issue. TLS originally had key renegotiation, but that was removed in TLS 1.3: https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ To mitigate these types of attacks, TLS 1.3 disallows renegotiation. Of course, a database is going to process even more data so if the amount of data encrypted is a problem, we might have a problem too in using a single key. This is not related to whether we use one key for the entire cluster or multiple keys per tablespace --- the problem is the same. I guess we could create 1024 keys and use the bottom bits of the block number to decide what key to use. However, that still only pushes the goalposts farther. Anyway, I will to research the reasonable data size that can be secured with a single key via AES. I will look at how PGP encrypts large files too. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Switching PL/Python to Python 3 by default in PostgreSQL 12
Peter Eisentraut writes: > I'm not trying to dismiss the importance of managing the Python > transition. But this issue has been known for many years, and the > current setup is more or less in line with the wider world. For > example, the Debian release that came out over the weekend still ships > with /usr/bin/python being Python 2. So it is neither timely nor urgent > to try to make some significant change about this in PostgreSQL 12 right > now. I would welcome patches for this for PostgreSQL 13. I don't think it's been mentioned in this thread yet, but we *did* recently install a configure-time preference for python3 over python2: Author: Peter Eisentraut Branch: master Release: REL_12_BR [7291733ac] 2019-01-13 10:23:48 +0100 Branch: REL_11_STABLE Release: REL_11_2 [3d498c65a] 2019-01-13 10:24:21 +0100 Branch: REL_10_STABLE Release: REL_10_7 [cd1873160] 2019-01-13 10:25:23 +0100 configure: Update python search order Some systems don't ship with "python" by default anymore, only "python3" or "python2" or some combination, so include those in the configure search. Discussion: https://www.postgresql.org/message-id/flat/1457.1543184081%40sss.pgh.pa.us#c9cc1199338fd6a257589c6dcea6cf8d configure's search order is now $PYTHON, python, python3, python2. I think it will be a very long time, if ever, before there would be a reason to consider changing that. Both of the first two options represent following a clear user preference. So the only thing that's really at stake is when/whether we can make "plpythonu" a synonym for "plpython3u" rather than "plpython2u". As I said already, I think that's got to be a long way off, since the whole problem here is that python3 isn't a drop-in replacement for python2. We're much more likely to break existing functions than do anything useful by forcibly switching the synonym. But I could support having a way for individual installations to change what the synonym means locally. Perhaps we could think about how to do that in conjunction with the project of getting rid of pg_pltemplate that's been kicked around before [1][2][3]. regards, tom lane [1] https://www.postgresql.org/message-id/flat/763f2fe4-743f-d530-8831-20811edd3d6a%402ndquadrant.com [2] https://www.postgresql.org/message-id/flat/7495.1524861244%40sss.pgh.pa.us [3] https://www.postgresql.org/message-id/flat/5351890.TdMePpdHBD%40nb.usersys.redhat.com
Re: errbacktrace
On 2019-Jul-08, Dmitry Dolgov wrote:

> > On Sat, Jun 29, 2019 at 7:41 AM Jaime Casanova wrote:
> >
> > This is certainly a very useful thing. Sadly, it doesn't seem to compile when
> > trying to use libunwind.
>
> Yeah, the same for me. To make it work I've restricted libunwind to local
> unwinding only:
>
> #ifdef USE_LIBUNWIND
> #define UNW_LOCAL_ONLY
> #include <libunwind.h>
> #endif

Ah, yes. unwind's manpage says:

  Normally, libunwind supports both local and remote unwinding (the latter
  will be explained in the next section). However, if you tell libunwind
  that your program only needs local unwinding, then a special
  implementation can be selected which may run much faster than the generic
  implementation which supports both kinds of unwinding. To select this
  optimized version, simply define the macro UNW_LOCAL_ONLY before including
  the headerfile <libunwind.h>.

so I agree with unconditionally defining that symbol.

Nitpicking dept: I think in these tests:

+ if (!edata->backtrace &&
+     edata->funcname &&
+     backtrace_function[0] &&
+     strcmp(backtrace_function, edata->funcname) == 0)
+     set_backtrace(edata, 2);

we should test for backtrace_function[0] before edata->funcname, since it seems more likely to be unset.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Switching PL/Python to Python 3 by default in PostgreSQL 12
I wrote: > But I could support having a way for individual installations to change > what the synonym means locally. Perhaps we could think about how to do > that in conjunction with the project of getting rid of pg_pltemplate > that's been kicked around before [1][2][3]. ... actually, if we had that (i.e., languages fully defined by extensions with no help from pg_pltemplate), wouldn't this be nearly trivial? I'm imagining two extensions, one that defines plpythonu to call the python2 code and one that defines it to call the python3 code, and you install whichever you want. They're separate from the extensions that define plpython2u and plpython3u, so mix and match as you wish. regards, tom lane
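In the meantime, one way for individual installations to sidestep the ambiguity in new code is simply to name the language version explicitly rather than relying on what plpythonu resolves to, e.g.:

-- explicit choice of the Python 3 variant; assumes plpython3u is installed
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE FUNCTION py_version() RETURNS text AS $$
import sys
return sys.version
$$ LANGUAGE plpython3u;

SELECT py_version();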
Re: range_agg
On Sat, Jul 6, 2019 at 12:13 PM Jeff Davis wrote:
>
> On Fri, 2019-07-05 at 09:58 -0700, Paul A Jungwirth wrote:
> > user-defined range types. So how about I start on it and see how it
> > goes? I expect I can follow the existing code for range types pretty
> > closely, so maybe it won't be too hard.
>
> That would be great to at least take a look. If it starts to look like
> a bad idea, then we can re-evaluate and see if it's better to just
> return arrays.

I made some progress over the weekend. I don't have a patch yet but I thought I'd ask for opinions on the approach I'm taking:

- A multirange type is an extra thing you get when you define a range (just like how you get a tstzrange[]). Therefore:
- I don't need separate commands to add/drop multirange types. You get one when you define a range type, and if you drop a range type it gets dropped automatically.
- I'm adding a new typtype for multiranges ('m' in pg_type).
- I'm just adding a mltrngtypeid column to pg_range. I don't think I need a new pg_multirange table.
- You can have a multirange[].
- Multirange in/out work just like arrays, e.g. '{"[1,3)", "[5,6)"}'
- I'll add an anymultirange pseudotype. When it's the return type of a function that has an "anyrange" parameter, it will use the same base element type. (So basically anymultirange : anyrange :: anyarray :: anyelement.)
- You can cast from a multirange to an array. The individual ranges are always sorted in the result array.
- You can cast from an array to a multirange but it will error if there are overlaps (or not?). The array's ranges don't have to be sorted but they will be after a "round trip".
- Interesting functions:
  - multirange_length
  - range_agg (range_union_agg if you like)
  - range_intersection_agg
- You can subscript a multirange like you do an array (? This could be a function instead.)
- operators:
  - union (++) and intersection (*):
    - We already have + for range union but it raises an error if there is a gap, so ++ is the same but with no errors.
    - r ++ r = mr (commutative, associative)
    - mr ++ r = mr
    - r ++ mr = mr
    - r * r = r (unchanged)
    - mr * r = r
    - r * mr = r
    - mr - r = mr
    - r - mr = mr
    - mr - mr = mr
  - comparison:
    - mr = mr
    - mr @> x
    - mr @> r
    - mr @> mr
    - x <@ mr
    - r <@ mr
    - mr <@ mr
    - mr << mr (strictly left of)
    - mr >> mr (strictly right of)
    - mr &< mr (does not extend to the right of)
    - mr &> mr (does not extend to the left of)
  - inverse operator?:
    - the inverse of {"[1,2)"} would be {"[null, 1)", "[2, null)"}.
    - not sure we want this or what the symbol should be. I don't like -mr as an inverse because then mr - mr != mr ++ -mr.

Anything in there you think should be different? Thanks, Paul
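To make the proposed pieces concrete, usage might look something like the following. This is hypothetical, since neither range_agg nor the multirange type exists yet; the expected output simply follows the '{"[1,3)", "[5,6)"}' in/out convention sketched above:

-- hypothetical usage under the proposal (not valid on any released version)
SELECT range_agg(r)
FROM (VALUES (int4range(1, 3)),
             (int4range(2, 5)),
             (int4range(8, 9))) AS t(r);
-- expected result under the proposal: {"[1,5)", "[8,9)"}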
Re: Comment typo in tableam.h
On Sat, Jul 6, 2019 at 12:05 AM Amit Kapila wrote: > On Tue, Jul 2, 2019 at 1:00 AM Ashwin Agrawal wrote: > > Please find attached v2 of patch 1 without objectionable comment change. > v1 of patch 2 attaching here just for convenience, no modifications made to > it. > > > > 0001* > * See table_index_fetch_tuple's comment about what the difference between > - * these functions is. This function is the correct to use outside of > - * index entry->table tuple lookups. > + * these functions is. This function is correct to use outside of index > + * entry->table tuple lookups. > > How about if we write the last line of comment as "It is correct to > use this function outside of index entry->table tuple lookups."? I am > not an expert on this matter, but I find the way I am suggesting > easier to read. > I am fine with the way you have suggested.
Re: Broken defenses against dropping a partitioning column
On Mon, Jul 8, 2019 at 11:08 AM Tom Lane wrote: > FWIW, I was imagining the action as being (1) detach all the child > partitions, (2) make parent into a non-partitioned table, (3) > drop the target column in each of these now-independent tables. > No data movement. Other than the need to acquire locks on all > the tables, it shouldn't be particularly slow. I see. I think that would be reasonable, but like you say, it's not clear that it's really what users would prefer. You can think of a partitioned table as a first-class object and the partitions as subordinate implementation details; or you can think of the partitions as the first-class objects and the partitioned table as the second-rate glue that holds them together. It seems like users prefer the former view. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
Michael Paquier writes: > I have begun playing with regressplans.sh which enforces various > combinations of "-f s|i|n|m|h" when running the regression tests, and > I have noticed that -fh can cause the server to become stuck in the > test join_hash.sql with this query (not sure which portion of the SET > LOCAL parameters are involved) : > select count(*) from simple r join extremely_skewed s using (id); > This does not happen with REL_10_STABLE where the test executes > immediately, so we has visibly an issue caused by v11 here. Yeah, these test cases were added by fa330f9ad in v11. What it looks like to me is that some of these test cases force "enable_mergejoin = off", so if you also have enable_hashjoin off then you are going to get a nestloop plan, and it's hardly surprising that that takes an unreasonable amount of time on the rather large test tables used in these tests. Given the purposes of this test, I think it'd be reasonable to force both enable_hashjoin = on and enable_mergejoin = off at the very top of the join_hash script, or the corresponding place in join.sql in v11. Thomas, was there a specific reason for forcing enable_mergejoin = off for only some of these tests? regards, tom lane
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
On Mon, Jul 08, 2019 at 06:49:44PM +1200, Thomas Munro wrote: > "simple" has 20k rows and "extremely_skewed" has 20k rows but the > planner thinks it only has 2. So this is going to take O(n^2) time and n > is 20k. Not sure what to do about that. Maybe "join_hash" should be > skipped for the -h (= no hash joins please) case? Ah, thanks. Yes that's going to take a while :) Well, another thing I'd like to think about is whether there is any point in keeping regressplans.sh and the relevant options in postgres at this stage. From the top of the file one can read that: # This script runs the Postgres regression tests with all useful combinations # of the backend options that disable various query plan types. If the # results are not all the same, it may indicate a bug in a particular # plan type, or perhaps just a regression test whose results aren't fully # determinate (eg, due to lack of an ORDER BY keyword). However, if you run make check with any of these options, there are failing tests in every run. We can improve the situation for some of them by adding ORDER BY clauses after looking at the query outputs, but some EXPLAIN queries are sensitive to that, and the history of regressplans.sh does not play in its favor (does anyone really use it?). If you look at the latest commits for it, it has not really been touched in 19 years. So I would rather be in favor of nuking it. -- Michael signature.asc Description: PGP signature
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
On Mon, Jul 8, 2019 at 5:53 PM Michael Paquier wrote: > I have begun playing with regressplans.sh which enforces various > combinations of "-f s|i|n|m|h" when running the regression tests, and > I have noticed that -fh can cause the server to become stuck in the > test join_hash.sql with this query (not sure which portion of the SET > LOCAL parameters are involved) : > select count(*) from simple r join extremely_skewed s using (id); > > This does not happen with REL_10_STABLE where the test executes > immediately, so we has visibly an issue caused by v11 here. If you don't allow hash joins it makes this plan: Aggregate -> Nested Loop Join Filter: (r.id = s.id) -> Seq Scan on simple r -> Materialize -> Seq Scan on extremely_skewed s "simple" has 20k rows and "extremely_skewed" has 20k rows but the planner thinks it only has 2. So this is going to take O(n^2) time and n is 20k. Not sure what to do about that. Maybe "join_hash" should be skipped for the -h (= no hash joins please) case? -- Thomas Munro https://enterprisedb.com
Re: [PATCH] kNN for btree
On Mon, Jul 1, 2019 at 8:47 PM Nikita Glukhov wrote: > On 01.07.2019 13:41, Thomas Munro wrote: > > On Tue, Mar 26, 2019 at 4:30 AM Nikita Glukhov > > wrote: > Fixed two bugs in patches 3 and 10 (see below). > Patch 3 was extracted from the main patch 5 (patch 4 in v9). > >>> This patch no longer applies so marking Waiting on Author. > >> Attached 11th version of the patches rebased onto current master. > > Hi Nikita, > > > > Since a new Commitfest is here and this doesn't apply, could we please > > have a fresh rebase? > > Attached 12th version of the patches rebased onto the current master. I have more thoughts about the planner/executor infrastructure. It appears that supporting "ORDER BY col1, col2 <-> val" is too complex for the initial version of the patch: handling of the "ORDER BY col" and "ORDER BY col <-> val" cases uses different code paths in the optimizer. So, I'd like to limit the first version of this patch to supporting just the simplest "ORDER BY col <-> val" case. We would probably be able to enhance it even in the v13 release cycle, but let's start with the simple case. I also think we can avoid replacing the amcanorderbyop flag with a method, and instead introduce just a new boolean flag indicating that kNN supports only the first column. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
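For clarity, here is a sketch of the distinction being drawn (the table, index, and constant are made up, and it assumes a <-> distance operator is available for the indexed type, e.g. as provided by this patch set):

CREATE TABLE points_1d (x float8);
CREATE INDEX points_1d_x_idx ON points_1d USING btree (x);

-- The case the first version of the patch would support: a single
-- "ORDER BY col <-> const" on the leading (and only) key column.
SELECT x FROM points_1d ORDER BY x <-> 42.0 LIMIT 10;

-- The case deferred to a later release: a regular sort column followed
-- by a distance ordering.
-- SELECT * FROM tbl ORDER BY col1, col2 <-> 42.0;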
Re: progress report for ANALYZE
On Mon, Jul 8, 2019 at 5:29 AM Tatsuro Yamada wrote: > 17520|13599|postgres|16387|f|16387|scanning table|4425|4425 > 17520|13599|postgres|16387|f|16387|analyzing sample|0|0 > 17520|13599|postgres|16387|f|16387||0|0 <-- Is it Okay?? Why do we zero out the block numbers when we switch phases? The CREATE INDEX progress reporting patch does that kind of thing too, and it seems like poor behavior to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Multivariate MCV list vs. statistics target
On Sun, Jul 07, 2019 at 12:02:38AM +0200, Tomas Vondra wrote: Hi, apparently v1 of the ALTER STATISTICS patch was a bit confused about length of the VacAttrStats array passed to statext_compute_stattarget, causing segfaults. Attached v2 patch fixes that, and it also makes sure we print warnings about ignored statistics only once. v3 of the patch, adding pg_dump support - it works just like when you tweak statistics target for a column, for example. When the value stored in the catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting it (for the already created statistics object). I've considered making it part of CREATE STATISTICS itself, but it seems a bit cumbersome and we don't do it for columns either. I'm not against adding it in the future, but at this point I don't see a need. At this point I'm not aware of any missing or broken pieces, so I'd welcome feedback. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/ref/alter_statistics.sgml b/doc/src/sgml/ref/alter_statistics.sgml index 58c7ed020d..987f9dbc6f 100644 --- a/doc/src/sgml/ref/alter_statistics.sgml +++ b/doc/src/sgml/ref/alter_statistics.sgml @@ -26,6 +26,7 @@ PostgreSQL documentation ALTER STATISTICS name OWNER TO { new_owner | CURRENT_USER | SESSION_USER } ALTER STATISTICS name RENAME TO new_name ALTER STATISTICS name SET SCHEMA new_schema +ALTER STATISTICS name SET STATISTICS new_target @@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA + + new_target + + +The statistic-gathering target for this statistic object for subsequent + operations. +The target can be set in the range 0 to 1; alternatively, set it +to -1 to revert to using the system default statistics +target (). +For more information on the use of statistics by the +PostgreSQL query planner, refer to +. + + + + diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index d7004e5313..caa4fca99d 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -490,6 +490,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params, } } + /* +* Look at extended statistic objects too, because those may define their +* own statistic target. So we need to sample enough rows and then build +* the statistics with enough detail. 
+*/ + { + int nrows = ComputeExtStatisticsTarget(onerel, + attr_cnt, vacattrstats); + + if (targrows < nrows) + targrows = nrows; + } + /* * Acquire the sample rows */ diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index cf406f6f96..356b6096ac 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -14,6 +14,7 @@ */ #include "postgres.h" +#include "access/heapam.h" #include "access/relation.h" #include "access/relscan.h" #include "access/table.h" @@ -21,6 +22,7 @@ #include "catalog/dependency.h" #include "catalog/indexing.h" #include "catalog/namespace.h" +#include "catalog/objectaccess.h" #include "catalog/pg_namespace.h" #include "catalog/pg_statistic_ext.h" #include "catalog/pg_statistic_ext_data.h" @@ -29,6 +31,7 @@ #include "miscadmin.h" #include "statistics/statistics.h" #include "utils/builtins.h" +#include "utils/fmgroids.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -336,6 +339,7 @@ CreateStatistics(CreateStatsStmt *stmt) values[Anum_pg_statistic_ext_stxrelid - 1] = ObjectIdGetDatum(relid); values[Anum_pg_statistic_ext_stxname - 1] = NameGetDatum(&stxname); values[Anum_pg_statistic_ext_stxnamespace - 1] = ObjectIdGetDatum(namespaceId); + values[Anum_pg_statistic_ext_stxstattarget - 1] = Int32GetDatum(-1); values[Anum_pg_statistic_ext_stxowner - 1] = ObjectIdGetDatum(stxowner); values[Anum_pg_statistic_ext_stxkeys - 1] = PointerGetDatum(stxkeys); values[Anum_pg_statistic_ext_stxkind - 1] = PointerGetDatum(stxkind); @@ -414,6 +418,85 @@ CreateStatistics(CreateStatsStmt *stmt) return myself; } + +/* + * ALTER STATISTICS + */ +ObjectAddress +AlterStatistics(AlterStatsStmt *stmt) +{ + Relationrel; + Oid stxoid; + HeapTuple oldtup; + HeapTuple newtup; + Datum repl_val[Natts_pg_statistic_ext]; + boolrepl_null[Natts_pg_statistic_ext]; + boolrepl_repl[Natts_pg_statistic_ext]; + ObjectAddress address; + int newtarget = stmt->stxstattarget; + + /* Lim
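For reference, a minimal usage sketch of the syntax the patch adds (the table, columns, and statistics object are invented for illustration; only the ALTER STATISTICS ... SET STATISTICS form comes from the patch):

CREATE TABLE orders (customer_id int, region_id int);
CREATE STATISTICS orders_cust_region (mcv) ON customer_id, region_id FROM orders;

-- Raise the per-object statistics target for subsequent ANALYZE runs;
-- -1 reverts to the system default (default_statistics_target).
ALTER STATISTICS orders_cust_region SET STATISTICS 1000;
ANALYZE orders;
ALTER STATISTICS orders_cust_region SET STATISTICS -1;

A non-default value stored in the catalog would then also be dumped by pg_dump as a separate ALTER STATISTICS command, as described above.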
Re: progress report for ANALYZE
On 2019-Jul-08, Robert Haas wrote: > On Mon, Jul 8, 2019 at 5:29 AM Tatsuro Yamada > wrote: > > 17520|13599|postgres|16387|f|16387|scanning table|4425|4425 > > 17520|13599|postgres|16387|f|16387|analyzing sample|0|0 > > 17520|13599|postgres|16387|f|16387||0|0 <-- Is it Okay?? > > Why do we zero out the block numbers when we switch phases? The > CREATE INDEX progress reporting patch does that kind of thing too, and > it seems like poor behavior to me. Yeah, I got the impression that that was determined to be the desirable behavior, so I made it do that, but I'm not really happy about it either. We're not too late to change the CREATE INDEX behavior, but let's discuss what is it that we want. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 11:25:10AM -0400, Bruce Momjian wrote: On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: On 7/8/19 10:19 AM, Bruce Momjian wrote: > When people are asking for multiple keys (not just for key rotation), > they are asking to have multiple keys that can be supplied by users only > when they need to access the data. Yes, the keys are always in the > datbase, but the feature request is that they are only unlocked when the > user needs to access the data. Obviously, that will not work for > autovacuum when the encryption is at the block level. > If the key is always unlocked, there is questionable security value of > having multiple keys, beyond key rotation. That is not true. Having multiple keys also allows you to reduce the amount of data encrypted with a single key, which is desirable because: 1. It makes cryptanalysis more difficult 2. Puts less data at risk if someone gets "lucky" in doing brute force What systems use multiple keys like that? I know of no website that does that. Your arguments seem hypothetical. What is your goal here? I might ask the same question about databases - which databases use an encryption scheme where the database does not have access to the keys? Actually, I've already asked this question before ... The databases I'm familiar with do store all the keys in a vault that's unlocked during startup, and then users may get keys from it (including maintenance processes). We could still control access to those keys in various ways (ACL or whatever), of course. BTW how do you know this is what users want? Maybe they do, but then again - maybe they just see it as magic and don't realize the extra complexity (not just at the database level). In my experience users generally want more abstract things, like "Ensure data privacy in case media theft," or "protection against evil DBA". regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 11:47:33AM -0400, Stephen Frost wrote: > > * Bruce Momjian (br...@momjian.us) wrote: > > > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > > > When people are asking for multiple keys (not just for key rotation), > > > > > they are asking to have multiple keys that can be supplied by users > > > > > only > > > > > when they need to access the data. Yes, the keys are always in the > > > > > datbase, but the feature request is that they are only unlocked when > > > > > the > > > > > user needs to access the data. Obviously, that will not work for > > > > > autovacuum when the encryption is at the block level. > > > > > > > > > If the key is always unlocked, there is questionable security value of > > > > > having multiple keys, beyond key rotation. > > > > > > > > That is not true. Having multiple keys also allows you to reduce the > > > > amount of data encrypted with a single key, which is desirable because: > > > > > > > > 1. It makes cryptanalysis more difficult > > > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > > > > > What systems use multiple keys like that? I know of no website that > > > does that. Your arguments seem hypothetical. What is your goal here? > > > > Not sure what the reference to 'website' is here, but one doesn't get > > certificates for TLS/SSL usage that aren't time-bounded, and when it > > comes to the actual on-the-wire encryption that's used, that's a > > symmetric key that's generated on-the-fly for every connection. > > > > Wouldn't the fact that they generate a different key for every > > connection be a pretty clear indication that it's a good idea to use > > multiple keys and not use the same key over and over..? > > > > Of course, we can discuss if what websites do with over-the-wire > > encryption is sensible to compare to what we want to do in PG for > > data-at-rest, but then we shouldn't be talking about what websites do, > > it'd make more sense to look at other data-at-rest encryption systems > > and consider what they're doing. > > (I talked to Joe on chat for clarity.) In modern TLS, the certificate is > used only for authentication, and Diffie–Hellman is used for key > exchange: > > https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange Right, and the key that's figured out for each connection is at least specific to the server AND client keys/certificates, thus meaning that they're changed at least as frequently as those change (and clients end up creating ones on the fly randomly if they don't have one, iirc). > So, the question is whether you can pass so much data in TLS that using > the same key for the entire session is a security issue. TLS originally > had key renegotiation, but that was removed in TLS 1.3: > > > https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ > To mitigate these types of attacks, TLS 1.3 disallows renegotiation. It was removed due to attacks targeting the renegotiation, not because doing re-keying by itself was a bad idea, or because using multiple keys was seen as a bad idea. > Of course, a database is going to process even more data so if the > amount of data encrypted is a problem, we might have a problem too in > using a single key. This is not related to whether we use one key for > the entire cluster or multiple keys per tablespace --- the problem is > the same. 
I guess we could create 1024 keys and use the bottom bits of > the block number to decide what key to use. However, that still only > pushes the goalposts farther. All of this is about pushing the goalposts farther away, as I see it. There's going to be trade-offs here and there isn't going to be any "one right answer" when it comes to this space. That's why I'm inclined to argue that we should try to come up with a relatively *good* solution that doesn't create a huge amount of work for us, and then build on that. To that end, leveraging metadata that we already have outside of the catalogs (databases, tablespaces, potentially other information that we store, essentially, in the filesystem metadata already) to decide on what key to use, and how many we can support, strikes me as a good initial target. > Anyway, I will to research the reasonable data size that can be secured > with a single key via AES. I will look at how PGP encrypts large files > too. This seems unlikely to lead to a definitive result, but it would be interesting to hear if there have been studies around that and what their conclusions were. When it comes to concerns about autovacuum or other system processes, those don't have any direct user connections or interactions, so having them be more privileged and having access to more is reasonable. Ideally, all of this would leverage a vaulting system or other mechanism w
Re: progress report for ANALYZE
On Mon, Jul 8, 2019 at 2:18 PM Alvaro Herrera wrote: > Yeah, I got the impression that that was determined to be the desirable > behavior, so I made it do that, but I'm not really happy about it > either. We're not too late to change the CREATE INDEX behavior, but > let's discuss what is it that we want. I don't think I intended to make any such determination -- which commit do you think established this as the canonical behavior? I propose that once a field is set, we should leave it set until the end. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: progress report for ANALYZE
On Mon, Jul 8, 2019 at 8:44 PM Robert Haas wrote: > > On Mon, Jul 8, 2019 at 2:18 PM Alvaro Herrera > wrote: > > Yeah, I got the impression that that was determined to be the desirable > > behavior, so I made it do that, but I'm not really happy about it > > either. We're not too late to change the CREATE INDEX behavior, but > > let's discuss what is it that we want. > > I don't think I intended to make any such determination -- which > commit do you think established this as the canonical behavior? > > I propose that once a field is set, we should leave it set until the end. +1 Note that this patch is already behaving like that if the table only contains dead rows.
Re: Declared but no defined functions
On Sat, Jul 6, 2019 at 4:32 PM Masahiko Sawada wrote: > Indeed. I've tried to search again with the following script and got > more such functions. > > for func in `git ls-files | egrep "\w+\.h$" | xargs cat | egrep -v > "(^typedef)|(DECLARE)|(BKI)" | egrep "^(extern )*[\_0-9A-Za-z]+ > [\_\*0-9a-zA-Z]+ ?\(.+\);$" | sed -e "s/\(^extern \)*[\_0-9A-Za-z]\+ > \([\_0-9A-Za-z\*]\+\) \{0,1\}(.*);$/\2(/g" | sed -e "s/\*//g"` > do > if [ "`git grep "$func" -- "*.c" | wc -l`" -lt 1 ];then > echo $func > fi > done > > Do we wish to make this a tool and have it in src/tools, either as part of the find_static tool (after renaming it to something more generic) or as an independent script?
Re: Add parallelism and glibc dependent only options to reindexdb
On Mon, Jul 8, 2019 at 9:57 AM Michael Paquier wrote: > > On Fri, Jul 05, 2019 at 07:25:41PM +0200, Julien Rouhaud wrote: > > On Fri, Jul 5, 2019 at 6:16 PM Peter Eisentraut > > wrote: > >> Isn't that also the case for your proposal? We are not going to release > >> a new reindexdb before a new REINDEX. > > > > Sure, but my point was that once the new reindexdb is released (or if > > you're so desperate, using a nightly build or compiling your own), it > > can be used against any previous major version. There is probably a > > large fraction of users who don't perform a postgres upgrade when they > > upgrade their OS, so that's IMHO also something to consider. > > I think that we need to think long-term here and be confident in the > fact we will still see breakages with collations and glibc, using a > solution that we think is the right API. Peter's idea to make the > backend-aware command of the filtering is cool. On top of that, there > is no need to add any conflict logic in reindexdb and we can live with > restricting --jobs support for non-index objects. Don't get me wrong, I do agree that implementing filtering in the backend is a better design. What's bothering me is that I also agree that there will be more glibc breakage, and if that happens within a few years, a lot of people will still be using a pg12- version, and they still won't have an efficient way to rebuild their indexes. Now, it'd be easy to publish an external tool that does a simple parallel-and-glibc-filtering reindex and will serve that purpose for the few years it'll be needed, so everyone can be happy. For now, I'll resubmit the parallel patch using a per-table-only approach, and will submit the filtering in the backend using a new REINDEX option in a different thread.
Re: Ltree syntax improvement
Dear Thomas, Thank you for your proofreading! Please find the updated patch attached. It also contains the missing escaping. On Mon, Jul 8, 2019 at 10:39 AM Thomas Munro wrote: > On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky > wrote: > > I've applied your patch. > > From my point of view, there is no major difference between case and > chain if here. > > Neither case nor ifs allow extracting the common code to separate > function - just because there seem to be no identical pieces of code. > > Hi Dmitry, > > The documentation doesn't build[1], due to invalid XML. Since I'm > here, here is some proof-reading of the English in the documentation: > > > - A label is a sequence of alphanumeric characters > - and underscores (for example, in C locale the characters > - A-Za-z0-9_ are allowed). Labels must be less > than 256 bytes > - long. > + A label is a sequence of characters. Labels > must be > + less than 256 symbols long. Label may contain any character > supported by Postgres > > "fewer than 256 characters in length", and > "PostgreSQL" > > + except \0. If label contains spaces, dots, > lquery modifiers, > > "spaces, dots or lquery modifiers," > > + they may be escaped. Escaping can be done > either by preceeding > + backslash (\\) symbol, or by wrapping the label > in whole in double > + quotes ("). Initial and final unescaped > whitespace is stripped. > > "Escaping can be done with either a preceding backslash [...] or by > wrapping the whole label in double quotes [...]." > > > > +During converting text into internal representations, wrapping > double quotes > > "During conversion to internal representation, " > > +and escaping backslashes are removed. During converting internal > +representations into text, if the label does not contain any special > > "During conversion from internal representation to text, " > > +symbols, it is printed as is. Otherwise, it is wrapped in quotes and, > if > +there are internal quotes, they are escaped with backslash. The > list of special > > "escaped with backslashes." > > + > +Examples: 42, "\\42", > +\\4\\2, 42 and "42" > + will have the similar internal representation and, being > > "will all have the same internal representation and," > > +converted from internal representation, will become > 42. > +Literal abc def will turn into "abc > +def". > > > [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/71856 > > -- > Thomas Munro > https://enterprisedb.com > -- SY, Dmitry Belyavsky diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out index 8226930905..5f45726229 100644 --- a/contrib/ltree/expected/ltree.out +++ b/contrib/ltree/expected/ltree.out @@ -1,4 +1,5 @@ CREATE EXTENSION ltree; +SET standard_conforming_strings=on; -- Check whether any of our opclasses fail amvalidate SELECT amname, opcname FROM pg_opclass opc LEFT JOIN pg_am am ON am.oid = opcmethod @@ -7679,3 +7680,1587 @@ SELECT count(*) FROM _ltreetest WHERE t ? '{23.*.1,23.*.2}' ; 15 (1 row) +-- Extended syntax, escaping, quoting etc +-- success +SELECT E'\\.'::ltree; + ltree +--- + "." +(1 row) + +SELECT E'\\ '::ltree; + ltree +--- + " " +(1 row) + +SELECT E''::ltree; + ltree +--- + "\" +(1 row) + +SELECT E'\\a'::ltree; + ltree +--- + a +(1 row) + +SELECT E'\\n'::ltree; + ltree +--- + n +(1 row) + +SELECT E'x'::ltree; + ltree +--- + "x\" +(1 row) + +SELECT E'x\\ '::ltree; + ltree +--- + "x " +(1 row) + +SELECT E'x\\.'::ltree; + ltree +--- + "x." 
+(1 row) + +SELECT E'x\\a'::ltree; + ltree +--- + xa +(1 row) + +SELECT E'x\\n'::ltree; + ltree +--- + xn +(1 row) + +SELECT 'a b.с d'::ltree; +ltree +- + "a b"."с d" +(1 row) + +SELECT ' e . f '::ltree; + ltree +--- + e.f +(1 row) + +SELECT ' '::ltree; + ltree +--- + +(1 row) + +SELECT E'\\ g . h\\ '::ltree; + ltree +--- + " g"."h " +(1 row) + +SELECT E'\\ g'::ltree; + ltree +--- + " g" +(1 row) + +SELECT E' h\\ '::ltree; + ltree +--- + "h " +(1 row) + +SELECT '" g "." h "'::ltree; +ltree +-- + " g "." h " +(1 row) + +SELECT '" g " '::ltree; + ltree + + " g " +(1 row) + +SELECT '" g " ." h " '::ltree; +ltree +-- + " g "." h " +(1 row) + +SELECT nlevel(E'Bottom\\.Test'::ltree); + nlevel + + 1 +(1 row) + +SELECT subpath(E'Bottom\\.'::ltree, 0, 1); + subpath +--- + "Bottom." +(1 row) + +SELECT subpath(E'a\\.b', 0, 1); + subpath +- + "a.b" +(1 row) + +SELECT subpath(E'a\\..b', 1, 1); + subpath +- + b +(1 row) + +SELECT subpath(E'a\\..\\b', 1, 1); + subpath +- + b +(1 row) + +SELECT subpath(E'a b.с d'::ltree, 1, 1); + subpath +- + "с d" +(1 row) + +SELECT( +'01234567890123456789012345678901234567890123456789' || +'01234567890123456789012345678901234567890123456789' || +'01234567890123456789
Re: tableam: abstracting relation sizing code
On Wed, Jun 12, 2019 at 9:14 AM Robert Haas wrote: > On Tue, Jun 11, 2019 at 7:17 PM Andres Freund wrote: > > Just to understand: What version are you targeting? It seems pretty > > clearly v13 material to me? > > My current plan is to commit this to v13 as soon as the tree opens. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
Michael Paquier writes: > Well, another thing I'd like to think about is if there is any point > to keep regressplans.sh and the relevant options in postgres at this > stage. From the top of the file one can read that: The point of regressplans.sh is to see if anything goes seriously wrong when forcing non-default plan choices --- seriously wrong being defined as crashes or semantically wrong answers. It's not expected that the regression tests will automatically pass when you do that, because of their dependencies on output row ordering, not to mention all the EXPLAINs. I'm not for removing it --- the fact that its results require manual evaluation doesn't make it useless. Having said that, join_hash.sql in particular seems to have zero value if it's not testing hash joins, so I think it'd be reasonable for it to override a global enable_hashjoin = off setting. None of the other regression test scripts seem to take nearly as much of a performance hit from globally forcing poor plans. regards, tom lane
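Concretely, the override could be as simple as a couple of lines at the very top of join_hash.sql (or the corresponding place in join.sql on v11); this is only a sketch of the suggestion above, not a committed change:

-- Ensure this script still exercises hash joins even when
-- regressplans.sh globally disables them.
set enable_hashjoin = on;
set enable_mergejoin = off;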
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 02:39:44PM -0400, Stephen Frost wrote: Greetings, * Bruce Momjian (br...@momjian.us) wrote: On Mon, Jul 8, 2019 at 11:47:33AM -0400, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: > > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > > When people are asking for multiple keys (not just for key rotation), > > > > they are asking to have multiple keys that can be supplied by users only > > > > when they need to access the data. Yes, the keys are always in the > > > > datbase, but the feature request is that they are only unlocked when the > > > > user needs to access the data. Obviously, that will not work for > > > > autovacuum when the encryption is at the block level. > > > > > > > If the key is always unlocked, there is questionable security value of > > > > having multiple keys, beyond key rotation. > > > > > > That is not true. Having multiple keys also allows you to reduce the > > > amount of data encrypted with a single key, which is desirable because: > > > > > > 1. It makes cryptanalysis more difficult > > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > > > What systems use multiple keys like that? I know of no website that > > does that. Your arguments seem hypothetical. What is your goal here? > > Not sure what the reference to 'website' is here, but one doesn't get > certificates for TLS/SSL usage that aren't time-bounded, and when it > comes to the actual on-the-wire encryption that's used, that's a > symmetric key that's generated on-the-fly for every connection. > > Wouldn't the fact that they generate a different key for every > connection be a pretty clear indication that it's a good idea to use > multiple keys and not use the same key over and over..? > > Of course, we can discuss if what websites do with over-the-wire > encryption is sensible to compare to what we want to do in PG for > data-at-rest, but then we shouldn't be talking about what websites do, > it'd make more sense to look at other data-at-rest encryption systems > and consider what they're doing. (I talked to Joe on chat for clarity.) In modern TLS, the certificate is used only for authentication, and Diffie–Hellman is used for key exchange: https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange Right, and the key that's figured out for each connection is at least specific to the server AND client keys/certificates, thus meaning that they're changed at least as frequently as those change (and clients end up creating ones on the fly randomly if they don't have one, iirc). So, the question is whether you can pass so much data in TLS that using the same key for the entire session is a security issue. TLS originally had key renegotiation, but that was removed in TLS 1.3: https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ To mitigate these types of attacks, TLS 1.3 disallows renegotiation. It was removed due to attacks targeting the renegotiation, not because doing re-keying by itself was a bad idea, or because using multiple keys was seen as a bad idea. Of course, a database is going to process even more data so if the amount of data encrypted is a problem, we might have a problem too in using a single key. This is not related to whether we use one key for the entire cluster or multiple keys per tablespace --- the problem is the same. 
I guess we could create 1024 keys and use the bottom bits of the block number to decide what key to use. However, that still only pushes the goalposts farther. All of this is about pushing the goalposts farther away, as I see it. There's going to be trade-offs here and there isn't going to be any "one right answer" when it comes to this space. That's why I'm inclined to argue that we should try to come up with a relatively *good* solution that doesn't create a huge amount of work for us, and then build on that. To that end, leveraging metadata that we already have outside of the catalogs (databases, tablespaces, potentially other information that we store, essentially, in the filesystem metadata already) to decide on what key to use, and how many we can support, strikes me as a good initial target. Anyway, I will to research the reasonable data size that can be secured with a single key via AES. I will look at how PGP encrypts large files too. This seems unlikely to lead to a definitive result, but it would be interesting to hear if there have been studies around that and what their conclusions were. When it comes to concerns about autovacuum or other system processes, those don't have any direct user connections or interactions, so having them be more privileged and having access to more is reasonable. I think Bruce's proposal was to minimize the time the key is "unlocked" in memory by only unlocking them when the user connects and supplies s
Re: [HACKERS] Cached plans and statement generalization
On 08.07.2019 2:23, Thomas Munro wrote: On Tue, Jul 2, 2019 at 3:29 AM Konstantin Knizhnik wrote: Attached please find rebased version of the patch. Also this version can be found in autoprepare branch of this repository https://github.com/postgrespro/postgresql.builtin_pool.git on github. Thanks. I haven't looked at the code but this seems like interesting work and I hope it will get some review. I guess this is bound to use a lot of memory. I guess we'd eventually want to figure out how to share the autoprepared plan cache between sessions, which is obviously a whole can of worms. A couple of trivial comments with my CF manager hat on: 1. Can you please fix the documentation? It doesn't build. Obviously reviewing the goals, design and implementation are more important than the documentation at this point, but if that is fixed then the CF bot will be able to run check-world every day and we might learn something about the code. 2. Accidental editor junk included: src/include/catalog/pg_proc.dat.~1~ Sorry, are you tests autoprepare-16.patch I have sent in the last e-mail? I can not reproduce the problem with building documentation: knizhnik@xps:~/postgresql/doc$ make make -C ../src/backend generated-headers make[1]: Entering directory '/home/knizhnik/postgresql/src/backend' make -C catalog distprep generated-header-symlinks make[2]: Entering directory '/home/knizhnik/postgresql/src/backend/catalog' make[2]: Nothing to be done for 'distprep'. make[2]: Nothing to be done for 'generated-header-symlinks'. make[2]: Leaving directory '/home/knizhnik/postgresql/src/backend/catalog' make -C utils distprep generated-header-symlinks make[2]: Entering directory '/home/knizhnik/postgresql/src/backend/utils' make[2]: Nothing to be done for 'distprep'. make[2]: Nothing to be done for 'generated-header-symlinks'. make[2]: Leaving directory '/home/knizhnik/postgresql/src/backend/utils' make[1]: Leaving directory '/home/knizhnik/postgresql/src/backend' make -C src all make[1]: Entering directory '/home/knizhnik/postgresql/doc/src' make -C sgml all make[2]: Entering directory '/home/knizhnik/postgresql/doc/src/sgml' /usr/bin/xmllint --path . --noout --valid postgres.sgml /usr/bin/xsltproc --path . --stringparam pg.version '12devel' stylesheet.xsl postgres.sgml cp ./stylesheet.css html/ touch html-stamp /usr/bin/xmllint --path . --noout --valid postgres.sgml /usr/bin/xsltproc --path . --stringparam pg.version '12devel' stylesheet-man.xsl postgres.sgml touch man-stamp make[2]: Leaving directory '/home/knizhnik/postgresql/doc/src/sgml' make[1]: Leaving directory '/home/knizhnik/postgresql/doc/src' Also autoporepare-16.patch doesn't include any junk src/include/catalog/pg_proc.dat.~1~
Re: Declared but no defined functions
Ashwin Agrawal writes: > Do we wish to make this a tool and have it in src/tools, either as part of > find_static tool after renaming that one to more generic name or > independent script. Well, the scripts described so far are little more than jury-rigged hacks, with lots of room for false positives *and* false negatives. I wouldn't want to institutionalize any of them as the right way to check for such problems. If somebody made the effort to create a tool that was actually trustworthy, perhaps that'd be a different story. (Personally I was wondering whether pgindent could be hacked up to emit things it thought were declarations of function names. I'm not sure that I'd trust that 100% either, but at least it would have a better shot than the grep hacks we've discussed so far. Note in particular that pgindent would see things inside #ifdef blocks, whether or not your local build ever sees those declarations.) regards, tom lane
Assertion for logically decoding multi inserts into the catalog
My patch for using heap_multi_insert in the catalog [1] failed the logical decoding part of test/recovery [2]. The assertion it failed on seems to not take multi inserts into the catalog into consideration, while the main logic does. This assertion hasn't tripped since there are no multi inserts into the catalog yet, but if we introduce them it will, so I’m raising it in a separate thread as it is somewhat unrelated to the patch in question. The attached patch fixes my test failure and makes sense to me, but this code is far from my neck of the tree, so I’m really not sure this is the best way to express the assertion. cheers ./daniel [1] https://commitfest.postgresql.org/23/2125/ [2] https://postgr.es/m/CA+hUKGLg1vFiXnkxjp_bea5+VP8D=vhrwsdvj7rbikr_u4x...@mail.gmail.com logdec_assert.patch Description: Binary data
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 12:16:04PM -0400, Bruce Momjian wrote: ... Anyway, I will to research the reasonable data size that can be secured with a single key via AES. I will look at how PGP encrypts large files too. IMO there are various recommendations about this, for example from NIST. But it varies on the exact encryption mode (say, GCM, XTS, ...) and the recommendations are not "per key" but "per key + nonce" etc. IANAC but my understanding is if we use e.g. "OID + blocknum" as nonce, then we should be pretty safe. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 12:09:58PM -0400, Joe Conway wrote: On 7/8/19 11:56 AM, Peter Eisentraut wrote: On 2019-07-08 17:47, Stephen Frost wrote: Of course, we can discuss if what websites do with over-the-wire encryption is sensible to compare to what we want to do in PG for data-at-rest, but then we shouldn't be talking about what websites do, it'd make more sense to look at other data-at-rest encryption systems and consider what they're doing. So, how do encrypted file systems do it? Are there any encrypted file systems in general use that allow encrypting only some files or encrypting different parts of the file system with different keys, or any of those other granular approaches being discussed? Well it is fairly common, for good reason IMHO, to encrypt some mount points and not others on a system. In my mind, and in practice to a large extent, a postgres tablespace == a unique mount point. There is a description here: https://wiki.archlinux.org/index.php/Disk_encryption That link is a bit overwhelming, as it explains how various encrypted filesystems do things. There's now official support for this in the Linux kernel (encryption at the filesystem level, not block device) in the form of fscrypt, see https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html It's a bit different because that's not a stacked encryption, it's integrated directly into filesystems (like ext4, at the moment) and it leverages other kernel facilities (like keyring). The link also discusses the threat model, which is interesting particularly interesting for this discussion, IMO. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Ltree syntax improvement
On 2019-Jul-08, Dmitry Belyavsky wrote: > Dear Thomas, > > Thank you for your proofreading! > > Please find the updated patch attached. It also contains the missing > escaping. I think all these functions you're adding should have a short sentence explaining what it does. I'm not really convinced that we need this much testing. It seems a bit excessive. Many of these very focused test SQL lines could be removed with no loss of coverage, and many of the others could be grouped into one. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Ltree syntax improvement
Dear Alvaro, On Mon, Jul 8, 2019 at 11:16 PM Alvaro Herrera wrote: > On 2019-Jul-08, Dmitry Belyavsky wrote: > > > Dear Thomas, > > > > Thank you for your proofreading! > > > > Please find the updated patch attached. It also contains the missing > > escaping. > > I think all these functions you're adding should have a short sentence > explaining what it does. > > I'm not really convinced that we need this much testing. It seems a bit > excessive. Many of these very focused test SQL lines could be removed > with no loss of coverage, and many of the others could be grouped into > one. > I did not introduce any functions. I've just changed the parser. I'm not sure that it makes sense to remove any tests as most of them were written to catch really happened bugs during the implementation. -- SY, Dmitry Belyavsky
Re: Built-in connection pooler
On 08.07.2019 3:37, Thomas Munro wrote: On Tue, Jul 2, 2019 at 3:11 AM Konstantin Knizhnik wrote: On 01.07.2019 12:57, Thomas Munro wrote: Interesting work. No longer applies -- please rebase. Rebased version of the patch is attached. Also all this version of built-ni proxy can be found in conn_proxy branch of https://github.com/postgrespro/postgresql.builtin_pool.git Thanks Konstantin. I haven't looked at the code, but I can't help noticing that this CF entry and the autoprepare one are both features that come up again an again on feature request lists I've seen. That's very cool. They also both need architectural-level review. With my Commitfest manager hat on: reviewing other stuff would help with that; if you're looking for something of similar complexity and also the same level of everyone-knows-we-need-to-fix-this-!@#$-we-just-don't-know-exactly-how-yet factor, I hope you get time to provide some more feedback on Takeshi Ideriha's work on shared caches, which doesn't seem a million miles from some of the things you're working on. Thank you, I will look at Takeshi Ideriha's patch. Could you please fix these compiler warnings so we can see this running check-world on CI? https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.46324 https://travis-ci.org/postgresql-cfbot/postgresql/builds/555180678 Sorry, I do not have access to Windows host, so can not check Win32 build myself. I have fixed all the reported warnings but can not verify that Win32 build is now ok. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 84341a30e5..9398e561e8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -719,6 +719,123 @@ include_dir 'conf.d' + + max_sessions (integer) + + max_sessions configuration parameter + + + + + The maximum number of client sessions that can be handled by + one connection proxy when session pooling is switched on. + This parameter does not add any memory or CPU overhead, so + specifying a large max_sessions value + does not affect performance. + If the max_sessions limit is reached new connection are not accepted. + + + The default value is 1000. This parameter can only be set at server start. + + + + + + session_pool_size (integer) + + session_pool_size configuration parameter + + + + + Enables session pooling and defines the maximum number of + backends that can be used by client sessions for each database/user combination. + Launched non-tainted backends are never terminated even if there are no active sessions. + Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements. + Tainted backend can server only one client. + + + The default value is 10, so up to 10 backends will server each database, + + + + + + proxy_port (integer) + + proxy_port configuration parameter + + + + + Sets the TCP port for the connection pooler. + Clients connected to main "port" will be assigned dedicated backends, + while client connected to proxy port will be connected to backends through proxy which + performs transaction level scheduling. + + + The default value is 6543. + + + + + + connection_proxies (integer) + + connection_proxies configuration parameter + + + + + Sets number of connection proxies. + Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing)." + "Each proxy launches its own subset of backends. 
So maximal number of non-tainted backends is " + "session_pool_size*connection_proxies*databases*roles. + + + The default value is 0, so session pooling is disabled. + + + + + + session_schedule (enum) + + session_schedule configuration parameter + + + + + Specifies scheduling policy for assigning session to proxies in case of + connection pooling. Default policy is round-robin. + + + With round-robin policy postmaster cyclicly scatter sessions between proxies. + + + With random policy postmaster randomly choose proxy for new session. + + + With load-balancing policy postmaster choose proxy with lowest load average. + Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections. + + + + + + restar
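As a rough usage sketch, based solely on the GUC descriptions in the patch above (it assumes the patch is installed, all of these settings require a server restart, and the values are arbitrary examples):

ALTER SYSTEM SET connection_proxies = 2;   -- enable pooling with two proxy workers
ALTER SYSTEM SET session_pool_size = 10;   -- backends per database/user combination
ALTER SYSTEM SET proxy_port = 6543;        -- pooled clients connect to this port
-- After restarting, a pooled client would connect through the proxy port,
-- e.g. psql "host=localhost port=6543 dbname=postgres", while connections
-- to the regular port keep getting dedicated backends.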
Re: Ltree syntax improvement
On 2019-Jul-08, Dmitry Belyavsky wrote: > I did not introduce any functions. I've just changed the parser. I mean the C-level functions -- count_parts_ors() and so on. > I'm not sure that it makes sense to remove any tests as most of them were > written to catch really happened bugs during the implementation. Well, I don't mean to decrease the coverage, only to condense a lot of little tests in a small powerful test. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add missing operator <->(box, point)
Attached 3rd version of the patches. On 02.07.2019 21:55, Alexander Korotkov wrote: On Tue, Jul 2, 2019 at 9:19 PM Nikita Glukhov wrote: We could use commuted "const <-> var" operators for kNN searches, but the current implementation requires the existence of "var <-> const" operators, and order-by-op clauses are rebuilt using them (see match_clause_to_ordering_op() at /src/backend/optimizer/path/indxpath.c). But probably it's still worth to just add commutator for every <-> operator and close this question. Otherwise, it may arise again once we want to add some more kNN support to opclasses or something. On the other hand, are we already going to limit oid consumption? All missing distance operators were added to the first patch. On 08.07.2019 18:22, Alexander Korotkov wrote: On Mon, Mar 11, 2019 at 2:49 AM Nikita Glukhov wrote: 2. Add <-> to GiST box_ops. Extracted gist_box_distance_helper() common for gist_box_distance() and gist_bbox_distance(). For me it doesn't look worth having two distinct functions gist_box_distance_helper() and gist_bbox_distance(). What about having just one and leave responsibility for recheck flag to the caller? gist_bbox_distance() was removed. But maybe it would be better to replace two identical functions gist_circle_distance() and gist_poly_distance() with the single gist_bbox_distance()? 3. Add <-> to SP-GiST. Changed only catalog and tests. Box case is already checked in spg_box_quad_leaf_consistent(): out->recheckDistances = distfnoid == F_DIST_POLYP; So, it seems to be fix of oversight in 2a6368343ff4. But assuming fixing this requires catalog changes, we shouldn't backpatch this. -- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From 14a0e22e1b68b084dd75a7f660955b52b6aaae79 Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Mon, 8 Jul 2019 22:55:00 +0300 Subject: [PATCH 1/3] Add missing distance operators --- src/backend/utils/adt/geo_ops.c| 136 - src/include/catalog/pg_operator.dat| 42 +- src/include/catalog/pg_proc.dat| 25 + src/test/regress/expected/geometry.out | 986 + src/test/regress/sql/geometry.sql | 15 +- 5 files changed, 687 insertions(+), 517 deletions(-) diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c index 373784f..6558ea3 100644 --- a/src/backend/utils/adt/geo_ops.c +++ b/src/backend/utils/adt/geo_ops.c @@ -2348,6 +2348,17 @@ dist_pl(PG_FUNCTION_ARGS) PG_RETURN_FLOAT8(line_closept_point(NULL, line, pt)); } +/* + * Distance from a line to a point + */ +Datum +dist_lp(PG_FUNCTION_ARGS) +{ + LINE *line = PG_GETARG_LINE_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(line_closept_point(NULL, line, pt)); +} /* * Distance from a point to a lseg @@ -2362,13 +2373,20 @@ dist_ps(PG_FUNCTION_ARGS) } /* - * Distance from a point to a path + * Distance from a lseg to a point */ Datum -dist_ppath(PG_FUNCTION_ARGS) +dist_sp(PG_FUNCTION_ARGS) +{ + LSEG *lseg = PG_GETARG_LSEG_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(lseg_closept_point(NULL, lseg, pt)); +} + +static float8 +dist_ppath_internal(Point *pt, PATH *path) { - Point *pt = PG_GETARG_POINT_P(0); - PATH *path = PG_GETARG_PATH_P(1); float8 result = 0.0; /* keep compiler quiet */ bool have_min = false; float8 tmp; @@ -2403,7 +2421,31 @@ dist_ppath(PG_FUNCTION_ARGS) } } - PG_RETURN_FLOAT8(result); + return result; +} + +/* + * Distance from a point to a path + */ +Datum +dist_ppath(PG_FUNCTION_ARGS) +{ + Point *pt = PG_GETARG_POINT_P(0); + PATH *path = PG_GETARG_PATH_P(1); + + 
PG_RETURN_FLOAT8(dist_ppath_internal(pt, path)); +} + +/* + * Distance from a path to a point + */ +Datum +dist_pathp(PG_FUNCTION_ARGS) +{ + PATH *path = PG_GETARG_PATH_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(dist_ppath_internal(pt, path)); } /* @@ -2419,6 +2461,18 @@ dist_pb(PG_FUNCTION_ARGS) } /* + * Distance from a box to a point + */ +Datum +dist_bp(PG_FUNCTION_ARGS) +{ + BOX *box = PG_GETARG_BOX_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(box_closept_point(NULL, box, pt)); +} + +/* * Distance from a lseg to a line */ Datum @@ -2431,6 +2485,18 @@ dist_sl(PG_FUNCTION_ARGS) } /* + * Distance from a line to a lseg + */ +Datum +dist_ls(PG_FUNCTION_ARGS) +{ + LINE *line = PG_GETARG_LINE_P(0); + LSEG *lseg = PG_GETARG_LSEG_P(1); + + PG_RETURN_FLOAT8(lseg_closept_line(NULL, lseg, line)); +} + +/* * Distance from a lseg to a box */ Datum @@ -2443,6 +2509,18 @@ dist_sb(PG_FUNCTION_ARGS) } /* + * Distance from a box to a lseg + */ +Datum +dist_bs(PG_FUNCTION_ARGS) +{ + BOX *box = PG_GETARG_BOX_P(0); + LSEG *lseg = PG_GETARG_LSEG_P(1); + + PG_RETURN_FLOAT8(box_closept_lseg(NULL, box, lseg)); +} + +/* * Distance from a line to a box */ Datum @@ -2462,13 +2540
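As a quick illustration of the asymmetry these patches remove (results omitted; the point is only which side of the operator the box may appear on):

SELECT point '(3,3)' <-> box '((0,0),(1,1))';  -- already supported in core
SELECT box '((0,0),(1,1))' <-> point '(3,3)';  -- commutator added by the first patch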
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 12:09:58PM -0400, Joe Conway wrote: > On 7/8/19 11:56 AM, Peter Eisentraut wrote: > > On 2019-07-08 17:47, Stephen Frost wrote: > >> Of course, we can discuss if what websites do with over-the-wire > >> encryption is sensible to compare to what we want to do in PG for > >> data-at-rest, but then we shouldn't be talking about what websites do, > >> it'd make more sense to look at other data-at-rest encryption systems > >> and consider what they're doing. > > > > So, how do encrypted file systems do it? Are there any encrypted file > > systems in general use that allow encrypting only some files or > > encrypting different parts of the file system with different keys, or > > any of those other granular approaches being discussed? > > Well it is fairly common, for good reason IMHO, to encrypt some mount > points and not others on a system. In my mind, and in practice to a > large extent, a postgres tablespace == a unique mount point. Yes, that is a normal partition point for key use because one file system is independent of others. You could use different keys for different directories in the same file system, but internally it all uses the same storage, and storage theft would potentially happen at the file system level. For Postgres, tablespaces are not independent of the database system, though media theft would still match. Of course, in the case of a tablespace media theft, Postgres would be quite confused, though you could still start the database system: SELECT * FROM test; ERROR: could not open file "pg_tblspc/16385/PG_13_201907054/16384/16386": No such file or directory but the data would be gone. What you can't do with Postgres is to have the tablespace be inaccessible and then later reappear. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Add parallelism and glibc dependent only options to reindexdb
On Mon, Jul 8, 2019 at 9:08 PM Julien Rouhaud wrote: > > I'll resubmit the parallel patch using per-table only > approach Attached. 0001-Export-vacuumdb-s-parallel-infrastructure.patch Description: Binary data 0002-Add-parallel-processing-to-reindexdb.patch Description: Binary data
Re: Add missing operator <->(box, point)
On Mon, Jul 8, 2019 at 11:39 PM Nikita Glukhov wrote: > On 08.07.2019 18:22, Alexander Korotkov wrote: > For me it doesn't look worth having two distinct functions > gist_box_distance_helper() and gist_bbox_distance(). What about > having just one and leave responsibility for recheck flag to the > caller? > > gist_bbox_distance() was removed. OK. > But maybe it would be better to replace two identical functions > gist_circle_distance() and gist_poly_distance() with the single > gist_bbox_distance()? Sounds reasonable to me. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 02:39:44PM -0400, Stephen Frost wrote: > > > Of course, we can discuss if what websites do with over-the-wire > > > encryption is sensible to compare to what we want to do in PG for > > > data-at-rest, but then we shouldn't be talking about what websites do, > > > it'd make more sense to look at other data-at-rest encryption systems > > > and consider what they're doing. > > > > (I talked to Joe on chat for clarity.) In modern TLS, the certificate is > > used only for authentication, and Diffie–Hellman is used for key > > exchange: > > > > https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange > > Right, and the key that's figured out for each connection is at least > specific to the server AND client keys/certificates, thus meaning that > they're changed at least as frequently as those change (and clients end > up creating ones on the fly randomly if they don't have one, iirc). > > > So, the question is whether you can pass so much data in TLS that using > > the same key for the entire session is a security issue. TLS originally > > had key renegotiation, but that was removed in TLS 1.3: > > > > > > https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ > > To mitigate these types of attacks, TLS 1.3 disallows renegotiation. > > It was removed due to attacks targeting the renegotiation, not because > doing re-keying by itself was a bad idea, or because using multiple keys > was seen as a bad idea. Well, if it was a necessary features, I assume TLS 1.3 would have found a way to make it secure, no? Certainly they are not shipping TLS 1.3 with a known weakness. > > Of course, a database is going to process even more data so if the > > amount of data encrypted is a problem, we might have a problem too in > > using a single key. This is not related to whether we use one key for > > the entire cluster or multiple keys per tablespace --- the problem is > > the same. I guess we could create 1024 keys and use the bottom bits of > > the block number to decide what key to use. However, that still only > > pushes the goalposts farther. > > All of this is about pushing the goalposts farther away, as I see it. > There's going to be trade-offs here and there isn't going to be any "one > right answer" when it comes to this space. That's why I'm inclined to > argue that we should try to come up with a relatively *good* solution > that doesn't create a huge amount of work for us, and then build on > that. To that end, leveraging metadata that we already have outside of > the catalogs (databases, tablespaces, potentially other information that > we store, essentially, in the filesystem metadata already) to decide on > what key to use, and how many we can support, strikes me as a good > initial target. Yes, we will need that for a usable nonce that we don't need to store in the blocks and WAL files. > > Anyway, I will to research the reasonable data size that can be secured > > with a single key via AES. I will look at how PGP encrypts large files > > too. > > This seems unlikely to lead to a definitive result, but it would be > interesting to hear if there have been studies around that and what > their conclusions were. 
I found this: https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 The numbers listed are: Maximum Encrypted Plaintext Size: 68GB Maximum Processed Additional Authenticated Data: 2 x 10^18 The 68GB value is "the maximum bits that can be processed with a single key/IV(nonce) pair." We would have 8k of data for each 8k page. If we assume a unique nonce per page, that is 10^32 bytes. For the WAL we would probably use a different nonce for each 16MB page, so we would be OK there too, since that gives us 10^36 bytes before the segment number causes the nonce to repeat. > When it comes to concerns about autovacuum or other system processes, > those don't have any direct user connections or interactions, so having > them be more privileged and having access to more is reasonable. Well, I am trying to understand the value of having some keys accessible by some parts of the system, and some not. I am unclear what security value that has. > Ideally, all of this would leverage a vaulting system or other mechanism > which manages access to the keys and allows their usage to be limited. > That's been generally accepted as a good way to bridge the gap between > having to ask users every time for a key and having keys stored > long-term in memory. Having *only* the keys for the data which the > currently connected user is allowed to access would certainly be a great > initial capability, even if system processes (including potentially WAL > replay) have to have access to
Re: Excessive memory usage in multi-statement queries w/ partitioning
Amit Langote writes: > [ parse-plan-memcxt_v2.patch ] I got around to looking at this finally. I'm not at all happy with the fact that it's added a plantree copy step to the only execution path through exec_simple_query. That's a very significant overhead, in many use-cases, to solve something that nobody had complained about for a couple of decades before now. I don't see the need for any added copy step anyway. The only reason you're doing it AFAICS is so you can release the per-statement context a bit earlier, which is a completely unnecessary optimization. Just wait to release it till the bottom of the loop. Also, creating/deleting the sub-context is in itself an added overhead that accomplishes exactly nothing in the typical case where there's not multiple statements. I thought the idea was to do that only if there was more than one raw parsetree (or, maybe better, do it for all but the last parsetree). To show that this isn't an empty concern, I did a quick pgbench test. Using a single-client select-only test ("pgbench -S -T 60" in an -s 10 database), I got these numbers in three trials with HEAD: tps = 9593.818478 (excluding connections establishing) tps = 9570.189163 (excluding connections establishing) tps = 9596.579038 (excluding connections establishing) and these numbers after applying the patch: tps = 9411.918165 (excluding connections establishing) tps = 9389.279079 (excluding connections establishing) tps = 9409.350175 (excluding connections establishing) That's about a 2% dropoff. Now it's possible that that can be explained away as random variations from a slightly different layout of critical loops vs cacheline boundaries ... but I don't believe it. regards, tom lane
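For illustration, a minimal C sketch of the restructuring suggested above: create a per-statement child of MessageContext only when the query string yields more than one raw parsetree, and release it at the bottom of the loop instead of copying the plan tree. The variable and context names here are illustrative only and not taken from any posted patch.

	/*
	 * Hypothetical sketch, not the actual patch: a child context is created
	 * only for multi-statement query strings, and is dropped only after the
	 * statement has finished executing, so no plan-tree copy is needed.
	 */
	bool		use_per_stmt_cxt = (list_length(parsetree_list) > 1);
	ListCell   *parsetree_item;

	foreach(parsetree_item, parsetree_list)
	{
		RawStmt    *parsetree = lfirst_node(RawStmt, parsetree_item);
		MemoryContext per_stmt_cxt = NULL;
		MemoryContext oldcontext = NULL;

		if (use_per_stmt_cxt)
		{
			per_stmt_cxt = AllocSetContextCreate(MessageContext,
												 "per-statement context",
												 ALLOCSET_DEFAULT_SIZES);
			oldcontext = MemoryContextSwitchTo(per_stmt_cxt);
		}

		/* ... parse analysis, rewrite, planning, portal execution as today ... */

		if (use_per_stmt_cxt)
		{
			MemoryContextSwitchTo(oldcontext);
			MemoryContextDelete(per_stmt_cxt);	/* freed at loop bottom */
		}
	}

With a single-statement string the branch is never taken, so the common pgbench-style path allocates and frees nothing extra.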
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 09:30:03PM +0200, Tomas Vondra wrote: > I think Bruce's proposal was to minimize the time the key is "unlocked" > in memory by only unlocking them when the user connects and supplies > some sort of secret (passphrase), and remove them from memory when the > user disconnects. So there's no way for the auxiliary processes to gain > access to those keys, because only the user knows the secret. I mentioned that because I thought that was the security value that people wanted. While I can see the value, I don't see how it can be cleanly accomplished. Keeping the keys unlocked at all times seems to be possible, but of much smaller value. Part of my goal in this discussion is to reverse the rush to implement and pick apart exactly what is possible, and desirable. > FWIW I have doubts this scheme actually measurably improves privacy in > practice, because most busy applications will end up having the keys in > the memory all the time anyway. Yep. > It also assumes memory is unsafe, i.e. bad actors can read it, and > that's probably a valid concern (root access, vulnerabilities etc.). But > in that case we already have plenty of issues with data in flight > anyway, and I doubt TDE is an answer to that. Agreed. > > Ideally, all of this would leverage a vaulting system or other mechanism > > which manages access to the keys and allows their usage to be limited. > > That's been generally accepted as a good way to bridge the gap between > > having to ask users every time for a key and having keys stored > > long-term in memory. > > Right. I agree with this. > > > Having *only* the keys for the data which the > > currently connected user is allowed to access would certainly be a great > > initial capability, even if system processes (including potentially WAL > > replay) have to have access to all of the keys. And yes, shared buffers > > being unencrypted and accessible by every backend continues to be an > > issue- it'd be great to improve on that situation too. I don't think > > having everything encrypted in shared buffers is likely the solution, > > rather, segregating it up might make more sense, again, along similar > > lines to keys and using metadata that's outside of the catalogs, which > > has been discussed previously, though I don't think anyone's actively > > working on it. > > > > I very much doubt TDE is a solution to this. Essentially, TDE is a good > data-at-rest solution, but this seems more like protecting data during > execution. And in that case I think we may need an entirely different > encryption scheme. I thought client-level encryption or pgcrypto-style encryption fits that need better. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 02:39:44PM -0400, Stephen Frost wrote: > > > > Of course, we can discuss if what websites do with over-the-wire > > > > encryption is sensible to compare to what we want to do in PG for > > > > data-at-rest, but then we shouldn't be talking about what websites do, > > > > it'd make more sense to look at other data-at-rest encryption systems > > > > and consider what they're doing. > > > > > > (I talked to Joe on chat for clarity.) In modern TLS, the certificate is > > > used only for authentication, and Diffie–Hellman is used for key > > > exchange: > > > > > > https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange > > > > Right, and the key that's figured out for each connection is at least > > specific to the server AND client keys/certificates, thus meaning that > > they're changed at least as frequently as those change (and clients end > > up creating ones on the fly randomly if they don't have one, iirc). > > > > > So, the question is whether you can pass so much data in TLS that using > > > the same key for the entire session is a security issue. TLS originally > > > had key renegotiation, but that was removed in TLS 1.3: > > > > > > > > > https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ > > > To mitigate these types of attacks, TLS 1.3 disallows renegotiation. > > > > It was removed due to attacks targeting the renegotiation, not because > > doing re-keying by itself was a bad idea, or because using multiple keys > > was seen as a bad idea. > > Well, if it was a necessary features, I assume TLS 1.3 would have found > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > with a known weakness. As discussed below- this is about moving goalposts, and that's, in part at least, why re-keying isn't a *necessary* feature of TLS. As the amount of data you transmit over a given TLS connection increases though, the risk increases and it would be better to re-key. How much better? That depends a great deal on if someone is trying to mount an attack or not. > > > Of course, a database is going to process even more data so if the > > > amount of data encrypted is a problem, we might have a problem too in > > > using a single key. This is not related to whether we use one key for > > > the entire cluster or multiple keys per tablespace --- the problem is > > > the same. I guess we could create 1024 keys and use the bottom bits of > > > the block number to decide what key to use. However, that still only > > > pushes the goalposts farther. > > > > All of this is about pushing the goalposts farther away, as I see it. > > There's going to be trade-offs here and there isn't going to be any "one > > right answer" when it comes to this space. That's why I'm inclined to > > argue that we should try to come up with a relatively *good* solution > > that doesn't create a huge amount of work for us, and then build on > > that. To that end, leveraging metadata that we already have outside of > > the catalogs (databases, tablespaces, potentially other information that > > we store, essentially, in the filesystem metadata already) to decide on > > what key to use, and how many we can support, strikes me as a good > > initial target. > > Yes, we will need that for a usable nonce that we don't need to store in > the blocks and WAL files. I'm not a fan of the idea of using something which is predictable as a nonce. 
Using the username as the salt for our md5 password mechanism was, all around, a bad idea. This seems like it's repeating that mistake. > > > Anyway, I will to research the reasonable data size that can be secured > > > with a single key via AES. I will look at how PGP encrypts large files > > > too. > > > > This seems unlikely to lead to a definitive result, but it would be > > interesting to hear if there have been studies around that and what > > their conclusions were. > > I found this: > > > https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes > > https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 > > The numbers listed are: > > Maximum Encrypted Plaintext Size: 68GB > Maximum Processed Additional Authenticated Data: 2 x 10^18 These are specific to AES, from a quick reading of those pages, right? > The 68GB value is "the maximum bits that can be processed with a single > key/IV(nonce) pair." We would 8k of data for each 8k page. If we > assume a unique nonce per page that is 10^32 bytes. A unique nonce per page strikes me as excessive... but then, I think we should have an actually random nonce rather than something calculated from the metadata. > For the WAL we would probably use a different nonce for each 16MB page, > so we would be OK there too, since t
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: > > Well, if it was a necessary features, I assume TLS 1.3 would have found > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > with a known weakness. > > As discussed below- this is about moving goalposts, and that's, in part > at least, why re-keying isn't a *necessary* feature of TLS. As the I agree we have to allow rekeying and allow multiple unlocked keys in the server at the same time. The open question is whether encrypting different data with different keys and different unlock controls is possible or useful. > amount of data you transmit over a given TLS connection increases > though, the risk increases and it would be better to re-key. How much > better? That depends a great deal on if someone is trying to mount an > attack or not. Yep, we need to allow rekey. > > > > Of course, a database is going to process even more data so if the > > > > amount of data encrypted is a problem, we might have a problem too in > > > > using a single key. This is not related to whether we use one key for > > > > the entire cluster or multiple keys per tablespace --- the problem is > > > > the same. I guess we could create 1024 keys and use the bottom bits of > > > > the block number to decide what key to use. However, that still only > > > > pushes the goalposts farther. > > > > > > All of this is about pushing the goalposts farther away, as I see it. > > > There's going to be trade-offs here and there isn't going to be any "one > > > right answer" when it comes to this space. That's why I'm inclined to > > > argue that we should try to come up with a relatively *good* solution > > > that doesn't create a huge amount of work for us, and then build on > > > that. To that end, leveraging metadata that we already have outside of > > > the catalogs (databases, tablespaces, potentially other information that > > > we store, essentially, in the filesystem metadata already) to decide on > > > what key to use, and how many we can support, strikes me as a good > > > initial target. > > > > Yes, we will need that for a usable nonce that we don't need to store in > > the blocks and WAL files. > > I'm not a fan of the idea of using something which is predictable as a > nonce. Using the username as the salt for our md5 password mechanism > was, all around, a bad idea. This seems like it's repeating that > mistake. Uh, well, renaming the user was a big problem, but that is the only case I can think of. I don't see that as an issue for block or WAL sequence numbers. If we want to use a different nonce, we have to find a way to store it or look it up efficiently. Considering the nonce size, I don't see how that is possible. > > > > Anyway, I will to research the reasonable data size that can be secured > > > > with a single key via AES. I will look at how PGP encrypts large files > > > > too. > > > > > > This seems unlikely to lead to a definitive result, but it would be > > > interesting to hear if there have been studies around that and what > > > their conclusions were. 
> > > > I found this: > > > > > > https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes > > > > https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 > > > > The numbers listed are: > > > > Maximum Encrypted Plaintext Size: 68GB > > Maximum Processed Additional Authenticated Data: 2 x 10^18 > > These are specific to AES, from a quick reading of those pages, right? Yes, AES with GCM, which has authentication parts we would not use, so we would use CBC and CTR, which I think has the same or larger spaces. > > > The 68GB value is "the maximum bits that can be processed with a single > > key/IV(nonce) pair." We would 8k of data for each 8k page. If we > > assume a unique nonce per page that is 10^32 bytes. > > A unique nonce per page strikes me as excessive... but then, I think we > should have an actually random nonce rather than something calculated > from the metadata. Uh, well, you are much less likely to get duplicate nonce values by using block number or WAL sequence number. If you look at the implementations, few compute random nonce values. > > For the WAL we would probably use a different nonce for each 16MB page, > > so we would be OK there too, since that is 10 ^ 36 bytes. > > > > gives us 10^36 bytes before the segment number causes the nonce to > > repeat. > > This presumes that the segment number is involved in the nonce > selection, which again strikes me as a bad idea. Maybe it could be > involved in some way, but we should have a properly random nonce. And you base the random goal on what? Nonce is number used only once, and randomness is not a requirement. You can say you prefer it, but why, because most implementations don't u
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > > * Bruce Momjian (br...@momjian.us) wrote: > > > Well, if it was a necessary features, I assume TLS 1.3 would have found > > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > > with a known weakness. > > > > As discussed below- this is about moving goalposts, and that's, in part > > at least, why re-keying isn't a *necessary* feature of TLS. As the > > I agree we have to allow rekeying and allow multiple unlocked keys in > the server at the same time. The open question is whether encrypting > different data with different keys and different unlock controls is > possible or useful. I'm not sure if there's really a question about if it's *possible*? As for if it's useful, I agree there's some debate. > > amount of data you transmit over a given TLS connection increases > > though, the risk increases and it would be better to re-key. How much > > better? That depends a great deal on if someone is trying to mount an > > attack or not. > > Yep, we need to allow rekey. Supporting a way to rekey is definitely a good idea. > > > > > Of course, a database is going to process even more data so if the > > > > > amount of data encrypted is a problem, we might have a problem too in > > > > > using a single key. This is not related to whether we use one key for > > > > > the entire cluster or multiple keys per tablespace --- the problem is > > > > > the same. I guess we could create 1024 keys and use the bottom bits > > > > > of > > > > > the block number to decide what key to use. However, that still only > > > > > pushes the goalposts farther. > > > > > > > > All of this is about pushing the goalposts farther away, as I see it. > > > > There's going to be trade-offs here and there isn't going to be any "one > > > > right answer" when it comes to this space. That's why I'm inclined to > > > > argue that we should try to come up with a relatively *good* solution > > > > that doesn't create a huge amount of work for us, and then build on > > > > that. To that end, leveraging metadata that we already have outside of > > > > the catalogs (databases, tablespaces, potentially other information that > > > > we store, essentially, in the filesystem metadata already) to decide on > > > > what key to use, and how many we can support, strikes me as a good > > > > initial target. > > > > > > Yes, we will need that for a usable nonce that we don't need to store in > > > the blocks and WAL files. > > > > I'm not a fan of the idea of using something which is predictable as a > > nonce. Using the username as the salt for our md5 password mechanism > > was, all around, a bad idea. This seems like it's repeating that > > mistake. > > Uh, well, renaming the user was a big problem, but that is the only case > I can think of. I don't see that as an issue for block or WAL sequence > numbers. If we want to use a different nonce, we have to find a way to > store it or look it up efficiently. Considering the nonce size, I don't > see how that is possible. No, this also meant that, as an attacker, I *knew* the salt ahead of time and therefore could build rainbow tables specifically for that salt. I could also use those *same* tables for any system where that user had an account, even if they used different passwords on different systems... 
I appreciate that *some* of this might not be completely relevant for the way a nonce is used in cryptography, but I'd be very surprised to have a cryptographer tell me that a deterministic nonce didn't have similar issues or didn't reduce the value of the nonce significantly. > > > > > Anyway, I will to research the reasonable data size that can be > > > > > secured > > > > > with a single key via AES. I will look at how PGP encrypts large > > > > > files > > > > > too. > > > > > > > > This seems unlikely to lead to a definitive result, but it would be > > > > interesting to hear if there have been studies around that and what > > > > their conclusions were. > > > > > > I found this: > > > > > > > > > https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes > > > > > > https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 > > > > > > The numbers listed are: > > > > > > Maximum Encrypted Plaintext Size: 68GB > > > Maximum Processed Additional Authenticated Data: 2 x 10^18 > > > > These are specific to AES, from a quick reading of those pages, right? > > Yes, AES with GCM, which has authentication parts we would not use, so > we would use CBC and CTR, which I think has the same or larger spaces. > > > > > The 68GB value is "the maximum bits that can be processed with a single > > > key/IV(nonce) pair." We would 8k of data for each 8k page. If we > > > assume a unique nonce per page that i
Detailed questions about pg_xact_commit_timestamp
I have some specific questions about pg_xact_commit_timestamp, and am hoping that this is the right place to ask. I read a lot of the commentary about the original patch, and the contributors seem to be here. If I'm asking in the wrong place, just let me know. I'm working on a design for a concurrency-safe incremental aggregate rollup system, and pg_xact_commit_timestamp sounds perfect. But I've found very little commentary on it generally, and couldn't figure out how it works in detail from the source code. Hopefully, someone knows the answers to a few questions: * Is it possible for pg_xact_commit_timestamp to produce times out of order? What I'm after is a way to identify records that have been changed since a specific time so that I can get any later changes for processing. I don't need them in commit order, so overlapping timestamps aren't a problem. * How many bytes are added to each row in the final implementation? The discussions I saw seemed to be ranging from 12-24 bytes. There was discussion of adding in extra bytes for "just in case." This is pre-9.5, so a world ago. * Are the timestamps indexed internally? With a B-tree? I ask for capacity-planning reasons. * I've seen on StackOverflow and the design discussions that the timestamps are not kept indefinitely, but can't find the details on exactly how long they are stored. * Any rules of thumb on the performance impact of enabling pg_xact_commit_timestamp? I don't need the data on all tables but, where I do, it sounds like it might work perfectly. Many thanks for any assistance!
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 06:04:46PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (br...@momjian.us) wrote: > > On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > > > * Bruce Momjian (br...@momjian.us) wrote: > > > > Well, if it was a necessary features, I assume TLS 1.3 would have found > > > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > > > with a known weakness. > > > > > > As discussed below- this is about moving goalposts, and that's, in part > > > at least, why re-keying isn't a *necessary* feature of TLS. As the > > > > I agree we have to allow rekeying and allow multiple unlocked keys in > > the server at the same time. The open question is whether encrypting > > different data with different keys and different unlock controls is > > possible or useful. > > I'm not sure if there's really a question about if it's *possible*? As > for if it's useful, I agree there's some debate. Right, it is easily possible to keep all keys unlocked, but the value is minimal, and the complexity will have a cost, which is my point. > > > amount of data you transmit over a given TLS connection increases > > > though, the risk increases and it would be better to re-key. How much > > > better? That depends a great deal on if someone is trying to mount an > > > attack or not. > > > > Yep, we need to allow rekey. > > Supporting a way to rekey is definitely a good idea. It is a requirement, I think. We might have problem tracking exactly what key _version_ each table (or 8k block), or WAL file are. :-( Ideally we would allow only two active keys, and somehow mark each page as using the odd or even key at a given time, or something strange. (Yeah, hand waving here.) > > Uh, well, renaming the user was a big problem, but that is the only case > > I can think of. I don't see that as an issue for block or WAL sequence > > numbers. If we want to use a different nonce, we have to find a way to > > store it or look it up efficiently. Considering the nonce size, I don't > > see how that is possible. > > No, this also meant that, as an attacker, I *knew* the salt ahead of > time and therefore could build rainbow tables specifically for that > salt. I could also use those *same* tables for any system where that > user had an account, even if they used different passwords on different > systems... Yes, 'postgres' can be used to create a nice md5 rainbow table that works on many servers --- good point. Are rainbow tables possible with something like AES? > I appreciate that *some* of this might not be completely relevant for > the way a nonce is used in cryptography, but I'd be very surprised to > have a cryptographer tell me that a deterministic nonce didn't have > similar issues or didn't reduce the value of the nonce significantly. This post: https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm says: GCM is a variation on Counter Mode (CTR). As you say, with any variant of Counter Mode, it is essential that the Nonce is not repeated with the same key. Hence CTR mode Nonces often include either a counter or a timer element: something that is guaranteed not to repeat over the lifetime of the key. CTR is what we use for WAL. 8k pages, we would use CBC, which says we need a random nonce. I need to dig deeper into ECB mode attack. > > Uh, well, you are much less likely to get duplicate nonce values by > > using block number or WAL sequence number. If you look at the > > implementations, few compute random nonce values. 
> > Which implementations..? Where do their nonce values come from? I can > see how a nonce might have to be naturally and deterministically random, > if the source for it is sufficiently varied across the key space, but > starting at '1' and going up with the same key seems like it's just > giving a potential attacker more information about what the encrypted > data contains... Well, in many modes the nonce is just a counter, but as stated above, not all modes. I need to pull out my security books to remember for which ones it is safe. (Frankly, it is a lot easier to use a random nonce for WAL than 8k pages.) > > And you base the random goal on what? Nonce is number used only once, > > and randomness is not a requirement. You can say you prefer it, but > > why, because most implementations don't use random nonce. > > The encryption schemes I've worked with in the past have used a random > nonce, so I'm wondering where the disconnect is between us on that. OK. > > > > > When it comes to concerns about autovacuum or other system processes, > > > > > those don't have any direct user connections or interactions, so > > > > > having > > > > > them be more privileged and having access to more is reasonable. > > > > > > > > Well, I am trying to understand the value of having some keys accessible > > > > by some parts of the
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 7/8/19 6:04 PM, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: >> Uh, well, renaming the user was a big problem, but that is the only case >> I can think of. I don't see that as an issue for block or WAL sequence >> numbers. If we want to use a different nonce, we have to find a way to >> store it or look it up efficiently. Considering the nonce size, I don't >> see how that is possible. > > No, this also meant that, as an attacker, I *knew* the salt ahead of > time and therefore could build rainbow tables specifically for that > salt. I could also use those *same* tables for any system where that > user had an account, even if they used different passwords on different > systems... > > I appreciate that *some* of this might not be completely relevant for > the way a nonce is used in cryptography, but I'd be very surprised to > have a cryptographer tell me that a deterministic nonce didn't have > similar issues or didn't reduce the value of the nonce significantly. I have worked side by side on projects with bona fide cryptographers and I can assure you that they recommended random nonces. Granted, that was in the early 2000s, but I don't think "modern cryptography" has changed that any more than "web scale" has made Postgres irrelevant in the intervening years. Related links: https://defuse.ca/cbcmodeiv.htm https://www.cryptofails.com/post/70059609995/crypto-noobs-1-initialization-vectors Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
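To make the recommendation concrete, here is a minimal sketch, in terms of OpenSSL's EVP interface, of what a random IV per 8kB page with AES-256-CBC would look like. The function name and buffer layout are hypothetical and not from any posted patch; the point is that the IV comes from a CSPRNG and therefore has to be stored alongside the page so it can be decrypted later, which is exactly the storage question raised elsewhere in this thread.

	#include <openssl/evp.h>
	#include <openssl/rand.h>

	#define PAGE_SIZE 8192
	#define IV_LEN    16

	/*
	 * Hypothetical illustration of the "random IV per page" scheme: encrypt
	 * one 8kB page with AES-256-CBC under a freshly generated random IV.
	 * Returns the ciphertext length (== PAGE_SIZE with padding disabled),
	 * or -1 on failure.
	 */
	static int
	encrypt_page_aes_cbc(const unsigned char *key,	/* 32-byte AES-256 key */
						 const unsigned char *page,	/* 8192-byte plaintext */
						 unsigned char *iv_out,		/* receives 16-byte IV */
						 unsigned char *ct_out)		/* 8192-byte ciphertext */
	{
		EVP_CIPHER_CTX *ctx;
		int			len,
					ctlen;

		if (RAND_bytes(iv_out, IV_LEN) != 1)	/* CSPRNG-generated IV */
			return -1;

		if ((ctx = EVP_CIPHER_CTX_new()) == NULL)
			return -1;

		if (EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv_out) != 1)
			goto fail;
		EVP_CIPHER_CTX_set_padding(ctx, 0);	/* page is a multiple of 16 bytes */

		if (EVP_EncryptUpdate(ctx, ct_out, &len, page, PAGE_SIZE) != 1)
			goto fail;
		ctlen = len;
		if (EVP_EncryptFinal_ex(ctx, ct_out + len, &len) != 1)
			goto fail;
		ctlen += len;

		EVP_CIPHER_CTX_free(ctx);
		return ctlen;

	fail:
		EVP_CIPHER_CTX_free(ctx);
		return -1;
	}

Padding is disabled because an 8kB page is already a multiple of the AES block size, so the ciphertext stays page-sized; only the 16-byte IV is extra and needs a home.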
Re: FETCH FIRST clause PERCENT option
Hi Surafel, The v5 patch [1] applies cleanly and passes make installcheck-world. My concern with this is its performance. As a user I would expect using this feature to enable queries to run faster than they would simply pulling the full table. I tested on tables ranging from 10k rows up to 10 million with the same basic result that using FETCH FIRST N PERCENT is slower than using the full table. At best it ran slightly slower than querying the full table, at worst it increased execution times by 1400% when using a large high percentage (95%). Testing was on a DigitalOcean 2 CPU, 2GB RAM droplet. All queries were executed multiple times (6+) with EXPLAIN (ANALYZE, COSTS) and \timing enabled; query plans provided are from the faster end of each query. Create the test table: DROP TABLE IF EXISTS r10mwide; CREATE TABLE r10mwide AS SELECT generate_series(1, 10000000)::INT AS id, random() AS v1, random() AS v2, random() AS v3, random() AS v4, random() AS v5, random() AS v6, random() AS v7, random() AS v8, random() AS v9, random() AS v10 ; VACUUM ANALYZE; Start with a baseline query from the full table; the limiting will be done within the CTE, as I would typically do in a production query. The outer query does basic aggregates. Below I provide each full query for easy copy/paste but the only real change is how the inner results are limited. This runs in 2.2s off the full table. EXPLAIN (ANALYZE, COSTS) WITH t AS ( SELECT id, v1, v2 FROM r10mwide ) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t ; QUERY PLAN - Finalize Aggregate (cost=227192.07..227192.08 rows=1 width=24) (actual time=2230.419..2230.419 rows=1 loops=1) -> Gather (cost=227191.84..227192.05 rows=2 width=72) (actual time=2228.321..2231.225 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=226191.84..226191.85 rows=1 width=72) (actual time=2218.754..2218.754 rows=1 loops=3) -> Parallel Seq Scan on r10mwide (cost=0.00..184524.92 rows=4166692 width=16) (actual time=0.032..934.652 rows=333 loops=3) Planning Time: 0.116 ms Execution Time: 2231.935 ms (8 rows) Time: 2232.459 ms (00:02.232) It did use parallel; since the FETCH FIRST queries apparently won't, I turned that off and ran it again. SET max_parallel_workers_per_gather = 0; New query plan for the full table, no parallel, just over 4 seconds. QUERY PLAN --- Aggregate (cost=342859.21..342859.22 rows=1 width=24) (actual time=4253.861..4253.862 rows=1 loops=1) -> Seq Scan on r10mwide (cost=0.00..242858.60 rows=1060 width=16) (actual time=0.062..1728.346 rows=1000 loops=1) Planning Time: 0.082 ms Execution Time: 4253.918 ms (4 rows) The following uses the explicit row count to pull 100k rows (1%) and gives a massive improvement in speed; this query ranged from 55-120ms. This is the type of speedup I would expect, and it beats the full-table query hands down, regardless of parallel query. EXPLAIN (ANALYZE, COSTS) WITH t AS ( SELECT id, v1, v2 FROM r10mwide FETCH FIRST 100000 ROWS ONLY ) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t ; QUERY PLAN - Aggregate (cost=4428.58..4428.59 rows=1 width=24) (actual time=55.963..55.963 rows=1 loops=1) -> Limit (cost=0.00..2428.57 rows=10 width=20) (actual time=0.041..35.725 rows=10 loops=1) -> Seq Scan on r10mwide (cost=0.00..242858.60 rows=1060 width=20) (actual time=0.039..24.796 rows=10 loops=1) Planning Time: 0.134 ms Execution Time: 56.032 ms (5 rows) Time: 57.262 ms Using FETCH FIRST 1 PERCENT it takes over 7 seconds, 71% slower than querying the full table.
EXPLAIN (ANALYZE, COSTS) WITH t AS ( SELECT id, v1, v2 FROM r10mwide FETCH FIRST 1 PERCENT ROWS ONLY ) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t ; QUERY PLAN - Aggregate (cost=6857.22..6857.23 rows=1 width=24) (actual time=7112.205..7112.206 rows=1 loops=1) -> Limit (cost=2428.60..4857.19 rows=11 width=20) (actual time=0.036..7047.394 rows=10 loops=1) -> Seq Scan on r10mwide (cost=0.00..242858.60 rows=1060 width=20) (actual time=0.033..3072.881 rows=1000 loops=1) Planning Time: 0.
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jun 3, 2019 at 12:18 AM Tomas Vondra wrote: > For a moment I thought we could/should look at the histogram, becase that > could tell us if there are groups "before" the first MCV one, but I don't > think we should do that, for two reasons. Firstly, rare values may not get > to the histogram anyway, which makes this rather unreliable and might > introduce sudden plan changes, because the cost would vary wildly > depending on whether we happened to sample the rare row or not. And > secondly, the rare row may be easily filtered out by a WHERE condition or > something, at which point we'll have to deal with the large group anyway. If the first MCV is in the middle of the first histogram bin, then it's reasonable to think that it would fall into the first group. But if the first MCV is in the middle of the histogram, such an assumption would be ridiculous. Also, I'd like to note that with our default_statistics_target == 100, non-MCV values are not so rare. So, I'm +1 for taking the histogram into account. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Thu, Jul 4, 2019 at 4:25 PM James Coleman wrote: > Process questions: > - Do I need to explicitly move the patch somehow to the next CF? We didn't manage to register it in the current (July) commitfest, so please register it in the next (September) one. > - Since I've basically taken over patch ownership, should I move my > name from reviewer to author in the CF app? And can there be two > authors listed there? Sure, you're a co-author of this patch. Two or more authors can be listed in the CF app; you can find a lot of examples on the list. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 06:04:46PM -0400, Stephen Frost wrote: > > * Bruce Momjian (br...@momjian.us) wrote: > > > On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > > > > * Bruce Momjian (br...@momjian.us) wrote: > > > > > Well, if it was a necessary features, I assume TLS 1.3 would have > > > > > found > > > > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > > > > with a known weakness. > > > > > > > > As discussed below- this is about moving goalposts, and that's, in part > > > > at least, why re-keying isn't a *necessary* feature of TLS. As the > > > > > > I agree we have to allow rekeying and allow multiple unlocked keys in > > > the server at the same time. The open question is whether encrypting > > > different data with different keys and different unlock controls is > > > possible or useful. > > > > I'm not sure if there's really a question about if it's *possible*? As > > for if it's useful, I agree there's some debate. > > Right, it is easily possible to keep all keys unlocked, but the value is > minimal, and the complexity will have a cost, which is my point. Having them all unlocked but only accessible to certain privileged processes is very different from having them unlocked and available to every backend process. > > > > amount of data you transmit over a given TLS connection increases > > > > though, the risk increases and it would be better to re-key. How much > > > > better? That depends a great deal on if someone is trying to mount an > > > > attack or not. > > > > > > Yep, we need to allow rekey. > > > > Supporting a way to rekey is definitely a good idea. > > It is a requirement, I think. We might have problem tracking exactly > what key _version_ each table (or 8k block), or WAL file are. :-( > Ideally we would allow only two active keys, and somehow mark each page > as using the odd or even key at a given time, or something strange. > (Yeah, hand waving here.) Well, that wouldn't be the ideal since it would limit us to some small number of GBs of data written, based on the earlier discussion, right? I'm not sure that I can see through to a system where we are rewriting tables that are out on disk every time we hit 60GB of data written. Or maybe I'm misunderstanding what you're suggesting here..? > > > Uh, well, renaming the user was a big problem, but that is the only case > > > I can think of. I don't see that as an issue for block or WAL sequence > > > numbers. If we want to use a different nonce, we have to find a way to > > > store it or look it up efficiently. Considering the nonce size, I don't > > > see how that is possible. > > > > No, this also meant that, as an attacker, I *knew* the salt ahead of > > time and therefore could build rainbow tables specifically for that > > salt. I could also use those *same* tables for any system where that > > user had an account, even if they used different passwords on different > > systems... > > Yes, 'postgres' can be used to create a nice md5 rainbow table that > works on many servers --- good point. Are rainbow tables possible with > something like AES? I'm not a cryptographer, just to be clear... but it sure seems like if you know what the nonce is, and have a strong idea about at least what some of the contents are, then you could work to pre-calculate a portion of the encrypted data and be able to determine the key based on that.
> > I appreciate that *some* of this might not be completely relevant for > > the way a nonce is used in cryptography, but I'd be very surprised to > > have a cryptographer tell me that a deterministic nonce didn't have > > similar issues or didn't reduce the value of the nonce significantly. > > This post: > > > https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm > > says: > > GCM is a variation on Counter Mode (CTR). As you say, with any variant > of Counter Mode, it is essential that the Nonce is not repeated with > the same key. Hence CTR mode Nonces often include either a counter or > a timer element: something that is guaranteed not to repeat over the > lifetime of the key. > > CTR is what we use for WAL. 8k pages, we would use CBC, which says we > need a random nonce. I need to dig deeper into ECB mode attack. That page also says: Using a random IV / nonce for GCM has been specified as an official recommendation by - for instance - NIST. If anybody suggests differently then that's up to them. and a recommendation by NIST certainly holds a lot of water, at least for me. They also have a recommendation regarding the amount of data to encrypt with the same key, and that number is much lower than the 96-bit randomness of the nonce, with a recommendation to use a cryptographically sound random, meaning that the chances of a duplicate are extremely low. > > > Uh, well, you are
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 06:23:13PM -0400, Bruce Momjian wrote: > Yes, 'postgres' can be used to create a nice md5 rainbow table that > works on many servers --- good point. Are rainbow tables possible with > something like AES? > > > I appreciate that *some* of this might not be completely relevant for > > the way a nonce is used in cryptography, but I'd be very surprised to > > have a cryptographer tell me that a deterministic nonce didn't have > > similar issues or didn't reduce the value of the nonce significantly. > > This post: > > > https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm > > says: > > GCM is a variation on Counter Mode (CTR). As you say, with any variant > of Counter Mode, it is essential that the Nonce is not repeated with > the same key. Hence CTR mode Nonces often include either a counter or > a timer element: something that is guaranteed not to repeat over the > lifetime of the key. > > CTR is what we use for WAL. 8k pages, we would use CBC, which says we > need a random nonce. I need to dig deeper into ECB mode attack. Looking here: https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm I think the issue is that we can't use a _counter_ for the nonce since each page-0 of each table would use the same nonce, and each page-1, etc. I assume we would use the table oid and page number as the nonce. We can't use the database oid since we copy the files from one database to another via file system copy and not through the shared buffer cache where they would be re-encrypted. Using relfilenode seems dangerous. For WAL I think it would be the WAL segment number. It would be nice to mix that with the "Database system identifier:", but are these the same on primary and replicas? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
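A minimal C sketch of the deterministic nonce layout described above: table OID plus block number packed into a 16-byte IV, with the low-order bytes left zero (for CTR they would act as the within-page block counter). The function and constant names are hypothetical; nothing like this exists in the tree, and the file-system-copy caveat above applies — the same (OID, block) pair must never be encrypted twice under the same key, which also makes re-encrypting a modified page under an unchanged IV a concern for CTR.

	#include <stdint.h>
	#include <string.h>

	#define TDE_IV_LEN 16

	/*
	 * Hypothetical packing of the deterministic per-page nonce: table OID in
	 * bytes 0-3, block number in bytes 4-7, remaining bytes zeroed.  This
	 * only illustrates the layout being discussed, not any agreed design.
	 */
	static void
	tde_make_page_iv(uint32_t table_oid, uint32_t block_no,
					 unsigned char iv[TDE_IV_LEN])
	{
		memset(iv, 0, TDE_IV_LEN);
		memcpy(iv, &table_oid, sizeof(uint32_t));
		memcpy(iv + 4, &block_no, sizeof(uint32_t));
	}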