Re: Data-only pg_rewind, take 2
On Mon, Mar 18, 2019 at 8:46 PM Chris Travers wrote:
> On Mon, Mar 18, 2019 at 4:09 AM Michael Paquier wrote:
>> On Sun, Mar 17, 2019 at 09:00:57PM +0800, Chris Travers wrote:
>> > I also added test cases and some docs. I don't know if the docs are
>> > sufficient. Feedback is appreciated.
>>
>> To be honest, I don't think that this approach is a good idea per the
>> same reasons as mentioned the last time, as this can cause pg_rewind
>> to break if any newly-added folder in the data directory has
>> non-replaceable data which is needed at the beginning of recovery and
>> cannot be automatically rebuilt. So that's one extra maintenance
>> burden to worry about.
>
> Actually I think this is safe. Let me go through the cases not handled in
> the current behavior at all:

Hi Chris,

Could you please post a rebase? This has fairly thoroughly bitrotted. The
Commitfest is here, so now would be an excellent time for people to be able
to apply and test the patch.

Thanks,
--
Thomas Munro
https://enterprisedb.com
Re: FETCH FIRST clause PERCENT option
Hi Thomas, Thank you for informing me Hi Surafel, > > There's a call to adjust_limit_rows_costs() hiding under > contrib/postgres_fdw, so this fails check-world. > Fixed . I also include the review given by Ryan in attached patch regards Surafel diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 1759b9e1b6..fb9c2f5319 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -3035,7 +3035,8 @@ estimate_path_cost_size(PlannerInfo *root, if (fpextra && fpextra->has_limit) { adjust_limit_rows_costs(&rows, &startup_cost, &total_cost, - fpextra->offset_est, fpextra->count_est); + fpextra->offset_est, fpextra->count_est, + EXACT_NUMBER); retrieved_rows = rows; } } diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 06d611b64c..1c240a3dab 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ] [ LIMIT { count | ALL } ] [ OFFSET start [ ROW | ROWS ] ] -[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ] +[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY ] [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ] where from_item can be one of: @@ -1430,7 +1430,7 @@ OFFSET start which PostgreSQL also supports. It is: OFFSET start { ROW | ROWS } -FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY +FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY In this syntax, the start or count value is required by @@ -1440,7 +1440,8 @@ FETCH { FIRST | NEXT } [ count ] { ambiguity. If count is omitted in a FETCH clause, it defaults to 1. -ROW +With PERCENT count specifies the maximum number of rows to return +in percentage round up to the nearest integer.ROW and ROWS as well as FIRST and NEXT are noise words that don't influence the effects of these clauses. diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c index baa669abe8..c503512588 100644 --- a/src/backend/executor/nodeLimit.c +++ b/src/backend/executor/nodeLimit.c @@ -21,6 +21,8 @@ #include "postgres.h" +#include + #include "executor/executor.h" #include "executor/nodeLimit.h" #include "miscadmin.h" @@ -52,6 +54,7 @@ ExecLimit(PlanState *pstate) */ direction = node->ps.state->es_direction; outerPlan = outerPlanState(node); + slot = node->subSlot; /* * The main logic is a simple state machine. @@ -81,7 +84,15 @@ ExecLimit(PlanState *pstate) /* * Check for empty window; if so, treat like empty subplan. */ - if (node->count <= 0 && !node->noCount) + if (node->limitOption == PERCENTAGE) + { +if (node->percent == 0.0) +{ + node->lstate = LIMIT_EMPTY; + return NULL; +} + } + else if (node->count <= 0 && !node->noCount) { node->lstate = LIMIT_EMPTY; return NULL; @@ -102,11 +113,32 @@ ExecLimit(PlanState *pstate) node->lstate = LIMIT_EMPTY; return NULL; } + +/* + * Count the number of rows to return in exact number so far + */ +if (node->limitOption == PERCENTAGE) +{ + node->perExactCount = ceil(node->percent * node->position / 100.0); + + if (node->perExactCount == node->perExactCount + 1) + node->perExactCount++; +} node->subSlot = slot; if (++node->position > node->offset) break; } + /* + * We may needed this tuple in backward scan so put it into + * tuplestore. 
+ */ + if (node->limitOption == PERCENTAGE) + { +tuplestore_puttupleslot(node->tupleStore, slot); +tuplestore_advance(node->tupleStore, true); + } + /* * Okay, we have the first tuple of the window. */ @@ -124,6 +156,106 @@ ExecLimit(PlanState *pstate) case LIMIT_INWINDOW: if (ScanDirectionIsForward(direction)) { +/* + * In case of coming back from backward scan the tuple is + * already in tuple store. + */ +if (node->limitOption == PERCENTAGE && node->backwardPosition > 0) +{ + if (tuplestore_gettupleslot_heaptuple(node->tupleStore, true, true, slot)) + { + node->subSlot = slot; + node->position++; + node->backwardPosition--; + return slot; + } + else + { + node->lstate = LIMIT_SUBPLANEOF; + return NULL; + } +} + +/* + * In PERCENTAGE case no need of executing outerPlan multiple + * times. + */ +if (node->limitOption == PERCENTAGE && node->reachEnd) +{ + node->lstate = LIMIT_WINDOWEND; + + /* + * If we know we won't need to back up, we can release + * resources at this point. + */ + if (!(node->ps.state->es_top_eflags & EXEC_FLAG_BACKWARD)) + (void) ExecShutdownNode(out
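To make the proposed behaviour concrete, here is a minimal usage sketch of the syntax this patch adds to the grammar shown in the select.sgml hunk above. The employee table and the 10 percent figure are hypothetical, and released servers do not accept PERCENT here; per the documentation change, the count is treated as a percentage of the result set, rounded up to the nearest integer.

    -- return the top 10 percent of rows, per the proposed
    -- FETCH FIRST ... PERCENT behaviour
    SELECT *
      FROM employee
     ORDER BY salary DESC
     FETCH FIRST 10 PERCENT ROWS ONLY;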
Re: (select query)/relation as first class citizen
Hi, what do you think about this idea in general? If you don't have to think about implementation for now? From my point of view writing Sql queries is very close to how functional language work if you treat "select" queries as functions without side-effects, and having query being first-class-citizen could move this even further. Regards, Roman On Sun, 7 Jul 2019 at 16:22, Roman Pekar wrote: > Hi, > > Yes, I'm thinking about 'query like a view', 'query like a cursor' is > probably possible even now in ms sql server (not sure about postgresql), > but it requires this paradygm shift from set-based thinking to row-by-row > thinking which I'd not want to do. > > I completely agree with your points of plan caching and static checks. > With static checks, though it might be possible to do if the query would be > defined as typed, so all the types of the columns is known in advance. > In certain cases having possibility of much better decomposition is might > be more important than having cached plan. Not sure how often these cases > appear in general, but personally for me it'd be awesome to have this > possibility. > > Regards, > Roman Pekar > > On Sun, 7 Jul 2019 at 15:39, Pavel Stehule > wrote: > >> Hi >> >> ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar >> napsal: >> >>> Hello, >>> >>> Just a bit of background - I currently work as a full-time db developer, >>> mostly with Ms Sql server but I like Postgres a lot, especially because I >>> really program in sql all the time and type system / plpgsql language of >>> Postgres seems to me more suitable for actual programming then t-sql. >>> >>> Here's the problem - current structure of the language doesn't allow to >>> decompose the code well and split calculations and data into different >>> modules. >>> >>> For example. Suppose I have a table employee and I have a function like >>> this (I'll skip definition of return types for the sake of simplicity): >>> >>> create function departments_salary () >>> returns table (...) >>> as >>> return $$ >>> select department, sum(salary) as salary from employee group by >>> department; >>> $$; >>> >>> so that's fine, but what if I want to run this function on filtered >>> employee? I can adjust the function of course, but it implies I can predict >>> all possible filters I'm going to need in the future. >>> And logically, function itself doesn't have to be run on employee table, >>> anything with department and salary columns will fit. >>> So it'd be nice to be able to define the function like this: >>> >>> create function departments_salary(_employee query) >>> returns table (...) >>> as >>> return $$ >>> select department, sum(salary) as salary from _employee group by >>> department; >>> $$; >>> >>> and then call it like this: >>> >>> declare _employee query; >>> ... >>> _poor_employee = (select salary, department from employee where salary < >>> 1000); >>> select * from departments_salary( _poor_employee); >>> >>> And just to be clear, the query is not really invoked until the last >>> line, so re-assigning _employee variable is more like building query >>> expression. >>> >>> As far as I understand the closest way to do this is to put the data >>> into temporary table and use this temporary table inside of the function. >>> It's not exactly the same of course, cause in case of temporary tables data >>> should be transferred to temporary table, while it will might be filtered >>> later. So it's something like array vs generator in python, or List vs >>> IQueryable in C#. 
>>> >>> Adding this functionality will allow much better decomposition of the >>> program's logic. >>> What do you think about the idea itself? If you think the idea is >>> worthy, is it even possible to implement it? >>> >> >> If we talk about plpgsql, then I afraid so this idea can disallow plan >> caching - or significantly increase the cost of plan cache. >> >> There are two possibilities of implementation - a) query like cursor - >> unfortunately it effectively disables any optimization and it carry ORM >> performance to procedures. This usage is known performance antipattern, b) >> query like view - it should not to have a performance problems with late >> optimization, but I am not sure about possibility to reuse execution plans. >> >> Currently PLpgSQL is compromise between performance and dynamic (PLpgSQL >> is really static language). Your proposal increase much more dynamic >> behave, but performance can be much more worse. >> >> More - with this behave, there is not possible to do static check - so >> you have to find bugs only at runtime. I afraid about performance of this >> solution. >> >> Regards >> >> Pavel >> >> >> >>> Regards, >>> Roman Pekar >>> >>> >>>
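As a concrete illustration of the temporary-table workaround Roman mentions above (reusing his hypothetical employee table and departments_salary function), the materialized variant would look roughly like this, with the function body reading the temp table by name instead of taking a query parameter:

    -- sketch of the workaround: materialize the filtered rows first
    CREATE TEMP TABLE poor_employee AS
        SELECT salary, department
          FROM employee
         WHERE salary < 1000;

    -- departments_salary() would have to be written to read poor_employee
    SELECT * FROM departments_salary();

This forces the filtered rows to be materialized up front, which is exactly the eager-versus-lazy difference (array vs. generator, List vs. IQueryable) that Roman is pointing at.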
Re: Ltree syntax improvement
On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky wrote: > I've applied your patch. > From my point of view, there is no major difference between case and chain if > here. > Neither case nor ifs allow extracting the common code to separate function - > just because there seem to be no identical pieces of code. Hi Dmitry, The documentation doesn't build[1], due to invalid XML. Since I'm here, here is some proof-reading of the English in the documentation: - A label is a sequence of alphanumeric characters - and underscores (for example, in C locale the characters - A-Za-z0-9_ are allowed). Labels must be less than 256 bytes - long. + A label is a sequence of characters. Labels must be + less than 256 symbols long. Label may contain any character supported by Postgres "fewer than 256 characters in length", and "PostgreSQL" + except \0. If label contains spaces, dots, lquery modifiers, "spaces, dots or lquery modifiers," + they may be escaped. Escaping can be done either by preceeding + backslash (\\) symbol, or by wrapping the label in whole in double + quotes ("). Initial and final unescaped whitespace is stripped. "Escaping can be done with either a preceding backslash [...] or by wrapping the whole label in double quotes [...]." +During converting text into internal representations, wrapping double quotes "During conversion to internal representation, " +and escaping backslashes are removed. During converting internal +representations into text, if the label does not contain any special "During conversion from internal representation to text, " +symbols, it is printed as is. Otherwise, it is wrapped in quotes and, if +there are internal quotes, they are escaped with backslash. The list of special "escaped with backslashes." + +Examples: 42, "\\42", +\\4\\2, 42 and "42" + will have the similar internal representation and, being "will all have the same internal representation and," +converted from internal representation, will become 42. +Literal abc def will turn into "abc +def". [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/71856 -- Thomas Munro https://enterprisedb.com
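To make the escaping rules under discussion concrete, here is a small sketch of what the patched ltree extension would accept (this is the proposed behaviour, not what the current extension allows; the extension must be installed):

    -- a label containing a space, written with double quotes
    SELECT '"abc def".child'::ltree;
    -- the same label written with a backslash escape before the space
    SELECT 'abc\ def.child'::ltree;
    -- per the patch, both store the label abc def internally and, because it
    -- contains a special character, print it back as "abc def"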
Re: Replication & recovery_min_apply_delay
On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera wrote: > On 2019-Jan-30, Konstantin Knizhnik wrote: > > I wonder if it can be considered as acceptable solution of the problem or > > there can be some better approach? > > I didn't find one. It sounds like you are in agreement that there is a problem and this is the best solution. I didn't look at these patches, I'm just asking with my Commitfest manager hat on: did I understand correctly, does this need a TAP test, possibly the one Alvaro posted, and if so, could we please have a fresh patch that includes the test, so we can see it passing the test in CI? -- Thomas Munro https://enterprisedb.com
Re: Add parallelism and glibc dependent only options to reindexdb
On Fri, Jul 05, 2019 at 07:25:41PM +0200, Julien Rouhaud wrote: > On Fri, Jul 5, 2019 at 6:16 PM Peter Eisentraut > wrote: >> Isn't that also the case for your proposal? We are not going to release >> a new reindexdb before a new REINDEX. > > Sure, but my point was that once the new reindexdb is released (or if > you're so desperate, using a nightly build or compiling your own), it > can be used against any previous major version. There is probably a > large fraction of users who don't perform a postgres upgrade when they > upgrade their OS, so that's IMHO also something to consider. I think that we need to think long-term here and be confident in the fact we will still see breakages with collations and glibc, using a solution that we think is the right API. Peter's idea to make the backend-aware command of the filtering is cool. On top of that, there is no need to add any conflict logic in reindexdb and we can live with restricting --jobs support for non-index objects. -- Michael signature.asc Description: PGP signature
Re: Implementing Incremental View Maintenance
> As for how to make internal columns invisible to SELECT *, previously
> there have been discussions about doing that using a new flag in
> pg_attribute:
>
> https://www.postgresql.org/message-id/flat/CAEepm%3D3ZHh%3Dp0nEEnVbs1Dig_UShPzHUcMNAqvDQUgYgcDo-pA%40mail.gmail.com

Now that I realize there are several use cases for invisible columns, I
think this is the way we should go.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Re: Replication & recovery_min_apply_delay
On Mon, Jul 08, 2019 at 07:56:25PM +1200, Thomas Munro wrote: > On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera > wrote: > > On 2019-Jan-30, Konstantin Knizhnik wrote: > > > I wonder if it can be considered as acceptable solution of the problem or > > > there can be some better approach? > > > > I didn't find one. > > It sounds like you are in agreement that there is a problem and this > is the best solution. I didn't look at these patches, I'm just asking > with my Commitfest manager hat on: did I understand correctly, does > this need a TAP test, possibly the one Alvaro posted, and if so, could > we please have a fresh patch that includes the test, so we can see it > passing the test in CI? Please note that I have not looked at that stuff in details, but I find the patch proposed kind of ugly with the scan of the last segment using a WAL reader to find out what is the last LSN and react on that.. This does not feel right. -- Michael signature.asc Description: PGP signature
Re: [PATCH] kNN for btree
On Tue, Jul 2, 2019 at 5:47 AM Nikita Glukhov wrote:
> Attached 12th version of the patches rebased onto the current master.

Hi Nikita,

make check-world fails for me, and in tmp_install/log/install.log I see:

btree_int2.c:97:9: error: implicit declaration of function 'int2dist' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        return int2dist(fcinfo);
               ^
btree_int2.c:97:9: note: did you mean 'int2_dist'?
btree_int2.c:95:1: note: 'int2_dist' declared here
int2_dist(PG_FUNCTION_ARGS)
^
1 error generated.

--
Thomas Munro
https://enterprisedb.com
Re: Add test case for sslinfo
On Mon, Jul 08, 2019 at 02:11:34PM +0800, Hao Wu wrote:
> Thank you for your quick response! I work on greenplum, and I didn't see
> this folder (src/test/ssl/ssl) before.
> I will add more certificates to test and resend again.

Not having duplicates would be nice.

> Do you have any suggestion about the missing PGDATA? Since the test needs
> to configure postgresql.conf, maybe there are other ways to determine this
> environment.

+REGRESS = sslinfo
+REGRESS_OPT = --temp-config=$(top_srcdir)/contrib/sslinfo/sslinfo.conf

When it comes to custom configuration files in the regression tests, you
should always have NO_INSTALLCHECK = 1 in the Makefile, because there is no
guarantee that the running server will have the configuration you want when
running an installcheck.

+echo "preparing CRTs and KEYs"
+cp -f data/root.crt $PGDATA/
+cp -f data/server.crt $PGDATA/
+cp -f data/server.key $PGDATA/
+chmod 400 $PGDATA/server.key
+chmod 644 $PGDATA/server.crt
+chmod 644 $PGDATA/root.crt

Using a TAP test here would be more appropriate. Another idea would be to
add that directly into src/test/ssl/ and enforce its installation with
EXTRA_INSTALL when running the tests.

+-- start_ignore
+\! bash config.bash clean
+\! pg_ctl restart 2>&1 >/dev/null
+-- end_ignore

Please, no...
--
Michael

signature.asc
Description: PGP signature
Re: dropdb --force
po 8. 7. 2019 v 0:07 odesílatel Thomas Munro napsal: > On Thu, Jun 27, 2019 at 7:15 AM Pavel Stehule > wrote: > > fixed > > Hi Pavel, > > FYI t/050_dropdb.pl fails consistently with this patch applied: > > https://travis-ci.org/postgresql-cfbot/postgresql/builds/555234838 with attached patch should be ok Regards Pavel > > > > -- > Thomas Munro > https://enterprisedb.com > diff --git a/doc/src/sgml/ref/drop_database.sgml b/doc/src/sgml/ref/drop_database.sgml index 3ac06c984a..84df485e11 100644 --- a/doc/src/sgml/ref/drop_database.sgml +++ b/doc/src/sgml/ref/drop_database.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -DROP DATABASE [ IF EXISTS ] name +DROP DATABASE [ IF EXISTS ] name [ FORCE ] @@ -32,9 +32,11 @@ DROP DATABASE [ IF EXISTS ] name DROP DATABASE drops a database. It removes the catalog entries for the database and deletes the directory containing the data. It can only be executed by the database owner. - Also, it cannot be executed while you or anyone else are connected - to the target database. (Connect to postgres or any - other database to issue this command.) + Also, it cannot be executed while you are connected to the target database. + (Connect to postgres or any other database to issue this + command.) + If anyone else is connected to the target database, this command will fail + - unless you use the FORCE option described below. @@ -64,6 +66,24 @@ DROP DATABASE [ IF EXISTS ] name + + +FORCE + + + Attempt to terminate all existing connections to the target database. + + + This will fail, if current user has no permissions to terminate other + connections. Required permissions are the same as with + pg_terminate_backend, described + in . + + This will also fail, if the connections do not terminate in 60 seconds. + + + + diff --git a/doc/src/sgml/ref/dropdb.sgml b/doc/src/sgml/ref/dropdb.sgml index 3fbdb33716..5d10ad1c92 100644 --- a/doc/src/sgml/ref/dropdb.sgml +++ b/doc/src/sgml/ref/dropdb.sgml @@ -86,6 +86,20 @@ PostgreSQL documentation + + -f + --force + + +Force termination of connected backends before removing the database. + + +This will add the FORCE option to the DROP +DATABASE command sent to the server. + + + + -V --version diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index 863f89f19d..86330bad12 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -499,7 +499,7 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) * potential waiting; we may as well throw an error first if we're gonna * throw one. */ - if (CountOtherDBBackends(src_dboid, ¬herbackends, &npreparedxacts)) + if (CountOtherDBBackends(src_dboid, ¬herbackends, &npreparedxacts, false)) ereport(ERROR, (errcode(ERRCODE_OBJECT_IN_USE), errmsg("source database \"%s\" is being accessed by other users", @@ -789,7 +789,7 @@ createdb_failure_callback(int code, Datum arg) * DROP DATABASE */ void -dropdb(const char *dbname, bool missing_ok) +dropdb(const char *dbname, bool missing_ok, bool force) { Oid db_id; bool db_istemplate; @@ -797,6 +797,7 @@ dropdb(const char *dbname, bool missing_ok) HeapTuple tup; int notherbackends; int npreparedxacts; + int loops = 0; int nslots, nslots_active; int nsubscriptions; @@ -880,12 +881,33 @@ dropdb(const char *dbname, bool missing_ok) * * As in CREATE DATABASE, check this after other error conditions. 
*/ - if (CountOtherDBBackends(db_id, ¬herbackends, &npreparedxacts)) - ereport(ERROR, -(errcode(ERRCODE_OBJECT_IN_USE), - errmsg("database \"%s\" is being accessed by other users", - dbname), - errdetail_busy_db(notherbackends, npreparedxacts))); + for (;;) + { + /* + * CountOtherDBBackends check usage of database by other backends and try + * to wait 5 sec. We try to raise warning after 1 minute and and raise + * error after 5 minutes. + */ + if (!CountOtherDBBackends(db_id, ¬herbackends, &npreparedxacts, force)) + break; + + if (force && loops++ % 12 == 0) + ereport(WARNING, + (errcode(ERRCODE_OBJECT_IN_USE), +errmsg("source database \"%s\" is being accessed by other users", + dbname), + errdetail_busy_db(notherbackends, npreparedxacts))); + + /* without "force" flag raise error immediately, or after 5 minutes */ + if (!force || loops % 60 == 0) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_IN_USE), +errmsg("source database \"%s\" is being accessed by other users", + dbname), + errdetail_busy_db(notherbackends, npreparedxacts))); + + CHECK_FOR_INTERRUPTS(); + } /* * Check if there are subscriptions defined in the target database. @@ -1053,7 +1075,7 @@ RenameDatabase(const char *oldname,
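For reference, with this version of the patch applied the new option would be used as follows (mydb is a hypothetical database name, and the final syntax may still change before commit):

    -- from a session connected to some other database, e.g. postgres:
    DROP DATABASE mydb FORCE;    -- terminates remaining connections to mydb first

    -- or, via the client utility option added by the patch:
    --   dropdb --force mydb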
RE: [PATCH] memory leak in ecpglib
Dear Meskes, Zhang,

I think this modification is not enough, and I have another idea.

>> If your patch is committed, in your example, any operation for cur1 will not
>> be accepted.
> Although the return value after calling ecpg_get_con_name_by_cursor_name(cur1)
> is NULL, in ecpg_get_connection(), actual_connection will be returned.
> so, operation for cur1 will be accepted,

Did you mean this code? (Copied from ECPGfetch)

```
real_connection_name = ecpg_get_con_name_by_cursor_name(cursor_name);
if (real_connection_name == NULL)
{
    /*
     * If can't get the connection name by cursor name then using
     * connection name coming from the parameter connection_name
     */
    real_connection_name = connection_name;
}
```

If so, I think this approach is wrong. This connection_name corresponds to the
following con1.

```
EXEC SQL AT con1 FETCH cur1 ...
```

Therefore the following FETCH statement will fail because the application
forgets the connection of cur_1.

```
EXEC SQL AT con1 DECLARE stmt_1 STATEMENT;
EXEC SQL PREPARE stmt_1 FROM :selectString;
EXEC SQL DECLARE cur_1 CURSOR FOR stmt_1;
EXEC SQL OPEN cur_1;
EXEC SQL DECLARE cur_2 CURSOR FOR stmt_1;
EXEC SQL OPEN cur_2;
EXEC SQL FETCH cur_1;
```

I think the g_declared_list is not needed for managing connections. I was
wrong. We should treat DECLARE STATEMENT as declarative, like #include or
#define in C macros.

Please send me your reply.

Best Regards,
Hayato Kuroda
Fujitsu LIMITED
Re: Excessive memory usage in multi-statement queries w/ partitioning
Hi Julien, Thanks for taking a look at this. On Thu, Jul 4, 2019 at 6:52 PM Julien Rouhaud wrote: > On Tue, May 28, 2019 at 6:57 AM Amit Langote wrote: > > >> Maybe like the attached? I'm not sure if we need to likewise be > > >> concerned > > >> about exec_sql_string() being handed multi-query strings. > > the whole extension sql script is passed to execute_sql_string(), so I > think that it's a good thing to have similar workaround there. That makes sense, although it is perhaps much less likely for memory usage explosion to occur in execute_sql_strings(), because the scripts passed to execute_sql_strings() mostly contain utility statements and rarely anything whose planning will explode in memory usage. Anyway, I've added similar handling in execute_sql_strings() for consistency. Now I wonder if we'll need to consider another path which calls pg_plan_queries() on a possibly multi-statement query -- BuildCachedPlan()... > About the patch: > > -* Switch to appropriate context for constructing querytrees (again, > -* these must outlive the execution context). > +* Switch to appropriate context for constructing querytrees. > +* Memory allocated during this construction is released before > +* the generated plan is executed. > > The comment should mention query and plan trees, everything else seems ok to > me. Okay, fixed. Attached updated patch. Thanks again. Regards, Amit parse-plan-memcxt_v2.patch Description: Binary data
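To make the failure mode being fixed here concrete, a hypothetical reproducer in the spirit of this thread (table name, statement type, and partition count are all made up) would look something like this:

    -- a table with a large number of partitions
    CREATE TABLE t (a int) PARTITION BY LIST (a);
    -- ... create several thousand partitions of t ...

    -- then send many statements as a single multi-statement query string,
    -- e.g. in one simple-protocol request such as a single psql -c call:
    --   UPDATE t SET a = a WHERE a = 1; UPDATE t SET a = a WHERE a = 2; ...
    -- without the patch, the parse/plan trees of every statement stay in a
    -- context that lives until the whole string has been processed, so
    -- planner memory accumulates across statements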
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Sun, Jul 7, 2019 at 1:05 AM Bruce Momjian wrote: > > On Fri, Jul 5, 2019 at 10:24:39PM +0200, Tomas Vondra wrote: > > I agree this is a pretty crucial challenge, and those requirements seem > > in direct conflict. Users use encryption to protect privacy of the data, > > but we need access to some of the data to implement some of the > > important tasks of a RDBMS. > > > > And it's not just about things like recovery or replication. How do you > > do ANALYZE on encrypted data? Sure, if a user runs it in a session that > > has the right key, that's fine. But what about autovacuum/autoanalyze? > > There might be a way to defer ANALYZE and autovacuum/autoanalyze, but > what about VACUUM FREEZE? We can defer that too, but not the clog > truncation that is eventually the product of the freeze. > > What about referential integrity constraints that need to check primary > keys in the encrypted tables? I also don't see a way of delaying that, > and if you can't do referential integrity into the encrypted tables, it > reduces the value of having encrypted data in the same database rather > than in another database or cluster? > I just thought that PostgreSQL's auxiliary processes such as autovacuum, startup, checkpointer, bgwriter should always be able to access all keys because there are already in inside database. Even today these processes don't check any privileges when accessing to data. What security threats we can protect data from by requiring privileges even for auxiliary processes? If this is a security problem isn't it also true for cluster-wide encryption? I guess that processes who have an access privilege on the table can always get the corresponding encryption key. And any processes cannot access an encryption key directly without accessing to a database object. > I still feel we have not clearly described what the options are: > > 1. Encrypt everything > > 2. Encrypt only some tables (for performance reasons), and use only one > key, or use multiple keys to allow for key rotation. All keys are > always unlocked. > > 3. Encrypt only some tables with different keys, and the keys are not > always unlocked. > > As Tomas already stated, using tablespaces to distinguish encrypted from > non-encrypted tables doesn't make sense since the file system used for > the storage is immaterial to the encryptions status. An easier way would > be to just add a bit to WAL that would indicate if the rest of the WAL > record is encrypted, though I doubt the performance boost is worth the > complexity. Okay, instead of using tablespaces we can create groups grouping tables being encrypted with the same key. I think the one of the most important point here is to provide a granular encryption feature and have less the number of keys in database cluster, not to provide per tablespace encryption feature. I'm not going to insist it should be per tablespace encryption. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: (select query)/relation as first class citizen
Hi po 8. 7. 2019 v 9:33 odesílatel Roman Pekar napsal: > Hi, > > what do you think about this idea in general? If you don't have to think > about implementation for now? From my point of view writing Sql queries is > very close to how functional language work if you treat "select" queries as > functions without side-effects, and having query being first-class-citizen > could move this even further. > first - please, don't send top posts. second - my opinion is not clear. I can imagine benefits - on second hand, the usage is relative too close to one antipattern - only one query wrapped by functions. I see your proposal as little bit more dynamic (with little bit different syntax) views. With my experience I really afraid about it - it can be very effective (from developer perspective) and very slow (from customer perspective). This is example of tool that looks nice on paper, but can be very badly used. Maybe I am not the best man for this topic - I like some functional programming concepts, but I use it locally - your proposal moves SQL to some unexplored areas - and I think so it can be interesting as real research topic, but not today Postgres's theme. The basic question is why extend SQL and don't use some native functional language. Postgres should to implement ANSI SQL - and there is not a space for big experiments. I am sceptic about it - relational databases are static, SQL is static language, so it is hard to implement some dynamic system over it - SQL language is language over relation algebra - it is not functional language, I afraid so introduction another concept to this do more bad than good. Regards Pavel > Regards, > Roman > > On Sun, 7 Jul 2019 at 16:22, Roman Pekar wrote: > >> Hi, >> >> Yes, I'm thinking about 'query like a view', 'query like a cursor' is >> probably possible even now in ms sql server (not sure about postgresql), >> but it requires this paradygm shift from set-based thinking to row-by-row >> thinking which I'd not want to do. >> >> I completely agree with your points of plan caching and static checks. >> With static checks, though it might be possible to do if the query would be >> defined as typed, so all the types of the columns is known in advance. >> In certain cases having possibility of much better decomposition is might >> be more important than having cached plan. Not sure how often these cases >> appear in general, but personally for me it'd be awesome to have this >> possibility. >> >> Regards, >> Roman Pekar >> >> On Sun, 7 Jul 2019 at 15:39, Pavel Stehule >> wrote: >> >>> Hi >>> >>> ne 7. 7. 2019 v 14:54 odesílatel Roman Pekar >>> napsal: >>> Hello, Just a bit of background - I currently work as a full-time db developer, mostly with Ms Sql server but I like Postgres a lot, especially because I really program in sql all the time and type system / plpgsql language of Postgres seems to me more suitable for actual programming then t-sql. Here's the problem - current structure of the language doesn't allow to decompose the code well and split calculations and data into different modules. For example. Suppose I have a table employee and I have a function like this (I'll skip definition of return types for the sake of simplicity): create function departments_salary () returns table (...) as return $$ select department, sum(salary) as salary from employee group by department; $$; so that's fine, but what if I want to run this function on filtered employee? 
I can adjust the function of course, but it implies I can predict all possible filters I'm going to need in the future. And logically, function itself doesn't have to be run on employee table, anything with department and salary columns will fit. So it'd be nice to be able to define the function like this: create function departments_salary(_employee query) returns table (...) as return $$ select department, sum(salary) as salary from _employee group by department; $$; and then call it like this: declare _employee query; ... _poor_employee = (select salary, department from employee where salary < 1000); select * from departments_salary( _poor_employee); And just to be clear, the query is not really invoked until the last line, so re-assigning _employee variable is more like building query expression. As far as I understand the closest way to do this is to put the data into temporary table and use this temporary table inside of the function. It's not exactly the same of course, cause in case of temporary tables data should be transferred to temporary table, while it will might be filtered later. So it's something like array vs generator in python, or List vs IQueryable in C#. Adding this functionality will
Re: progress report for ANALYZE
Hi Alvaro,

I'll review your patch this week. :)
I tested your patch on 6b854896. Here is the result. See below:

-
[Session #1]
create table hoge as select * from generate_series(1, 100) a;
analyze verbose hoge;

[Session #2]
\a \t
select * from pg_stat_progress_analyze;
\watch 0.001

17520|13599|postgres|16387|f|16387|scanning table|4425|14
17520|13599|postgres|16387|f|16387|scanning table|4425|64
17520|13599|postgres|16387|f|16387|scanning table|4425|111
...
17520|13599|postgres|16387|f|16387|scanning table|4425|4425
17520|13599|postgres|16387|f|16387|scanning table|4425|4425
17520|13599|postgres|16387|f|16387|scanning table|4425|4425
17520|13599|postgres|16387|f|16387|analyzing sample|0|0
17520|13599|postgres|16387|f|16387||0|0   <-- Is it Okay??
-

I have a question about the last line of the result. I'm not sure whether it
is possible, but I guess it would be better to keep the phase name (analyzing
sample) in the view until the end of the command. :)

Regards,
Tatsuro Yamada
Re: Synchronizing slots from primary to standby
On Mon, Dec 31, 2018 at 10:23 AM Petr Jelinek wrote: > As Andres has mentioned over at minimal decoding on standby thread [1], > that functionality can be used to add simple worker which periodically > synchronizes the slot state from the primary to a standby. > > Attached patch is rough implementation of such worker. It's nowhere near > committable in the current state, it servers primarily two purposes - to > have something over what we can agree on the approach (and if we do, > serve as base for that) and to demonstrate that the patch in [1] can > indeed be used for this functionality. All this means that this patch > depends on the [1] to work. Hi Petr, Do I understand correctly that this depends on the "logical decoding on standby" patch, but that isn't in the Commitfest? Seems like an oversight, since that thread has a recently posted v11 patch that applies OK, and there was recent review. You patches no longer apply on top though. Would it make sense to post a patch set here including logical-decoding-on-standby_v11.patch + your two patches (rebased), since this is currently marked as "Needs review"? -- Thomas Munro https://enterprisedb.com
Re: Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk
On Thu, Mar 7, 2019 at 8:19 PM David Steele wrote: > On 2/16/19 10:38 PM, Dave Cramer wrote: > > Thanks for looking at this. FYI, I did not originally write this, rather > > the original author has not replied to requests. > > JDBC could use this, I assume others could as well. > > > > That said I'm certainly open to suggestions on how to do this. > > > > Craig, do you have any other ideas? > This patch appears to be stalled. Are there any ideas on how to proceed? Hi Dave (C), It sounds like this should be withdrawn for now, and potentially revived in some future CF if someone has time to work on it? -- Thomas Munro https://enterprisedb.com
Re: Re: Reviving the "Stopping logical replication protocol" patch from Vladimir Gordichuk
On Mon, 8 Jul 2019 at 06:40, Thomas Munro wrote: > On Thu, Mar 7, 2019 at 8:19 PM David Steele wrote: > > On 2/16/19 10:38 PM, Dave Cramer wrote: > > > Thanks for looking at this. FYI, I did not originally write this, > rather > > > the original author has not replied to requests. > > > JDBC could use this, I assume others could as well. > > > > > > That said I'm certainly open to suggestions on how to do this. > > > > > > Craig, do you have any other ideas? > > > This patch appears to be stalled. Are there any ideas on how to proceed? > > Hi Dave (C), > > It sounds like this should be withdrawn for now, and potentially > revived in some future CF if someone has time to work on it? > > Hi Thomas, I'm fine with that decision. Thanks, Dave Cramer > > >
Re: How to estimate the shared memory size required for parallel scan?
> On 1 Jul 2019, at 13:03, Thomas Munro wrote: > > On Fri, May 3, 2019 at 6:06 AM Thomas Munro wrote: >> Added to commitfest. > > Rebased. Below is a review of this patch. It does what it says on the tin, applies clean without introducing compiler warnings and it seems like a good addition (for both utility and as an example implementation). Testing with various parallel worker settings shows that it properly observes user configuration for parallelism. The tests pass, but there are also no new tests that could fail. I tried to construct a test case for this, but it’s fairly hard to make one that can be added to the repo. Have you given this any thought? Regarding documentation, it seems reasonable to add a sentence on the file_fdw page when the user can expect parallel scan. A few smaller comments on the patch: In the commit message, I assume you mean “writing *is* probablt some way off”: "In theory this could be used for other kinds of parallel copying in the future (but infrastructure for parallel writing it probably some way off)." Single-line comments don’t end with a period IIRC (there a few others but only including the first one here): +/* The size of the chunks used for parallel scans, in bytes. */ Is there a reason to create partial_path before compute_parallel_worker, since the latter can make us end up not adding the path at all? Can’t we reverse the condition on parallel_workers to if (parallel_workers <= 0) and if so skip the partial path?: + partial_path = create_foreignscan_path(root, baserel, NULL, + baserel->rows, + startup_cost, + total_cost, + NIL, NULL, NULL, coptions); + + parallel_workers = compute_parallel_worker(baserel, + fdw_private->pages, -1, + max_parallel_workers_per_gather); ... + if (parallel_workers > 0) + add_partial_path(baserel, (Path *) partial_path); Since this is quite general code which can be used by other extensions as well, maybe we should reword the comment to say so?: + /* For use by file_fdw's parallel scans. */ cheers ./daniel
Re: POC: Cleaning up orphaned files using undo logs
On Sat, Jul 6, 2019 at 1:47 AM Robert Haas wrote: > > On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro wrote: > > [ new patches ] > > I took a look at 0012 today, Amit's patch for extending the binary > heap machinery, and 0013, Amit's patch for "Infrastructure to register > and fetch undo action requests." > Thanks for looking into the patches. > I don't think that binaryheap_allocate_shm() is a good design. It > presupposes that we want to store the binary heap as its own chunk of > shared memory allocated via ShmemInitStruct(), but we might want to do > something else, like embed in another structure, store it in a DSM or > DSA, etc., and this function can't do any of that. I think we should > have something more like: > > extern Size binaryheap_size(int capacity); > extern void binaryheap_initialize(binaryheap *, int capacity, > binaryheap_comparator compare, void *arg); > > Then the caller can do something like: > > sz = binaryheap_size(capacity); > bh = ShmemInitStruct(name, sz, &found); > if (!found) > binaryheap_initialize(bh, capacity, comparator, whatever); > > If it wants to get the memory in some other way, it just needs to > initialize bh differently; the rest is the same. Note that there is > no need, in this design, for binaryheap_size/initialize to make use of > "shared" memory. They could equally well be used on backend-local > memory. They do not need to care. You just provide the memory, and > they do their thing. > I didn't have other use cases in mind and I think to some extent this argument holds true for existing binaryheap_allocate allocate. If we want to make it more generic, then shouldn't we try to even change the existing binaryheap_allocate to use this new model, that way the binary heap allocation API will be more generic? .. .. > > Now, one objection to the above line of attack is the different queues > actually contain different types of elements. Apparently, the XID > queue contains elements of type UndoXidQueue and the size queue > contains elements of type UndoSizeQueue. It is worth noting here that > these are bad type names, because they sound like they are describing > a type of queue, but it seems that they are actually describing an > element in the queue. However, there are two larger problems: > > 1. I don't think we should have three different kinds of objects for > each of the three different queues. It seems like it would be much > simpler and easier to just have one kind of object that stores all the > information we need (full_xid, start_urec_ptr, dbid, request_size, > next_retry_at, error_ocurred_at) and use that everywhere. You could > object that this would increase the storage space requirement, Yes, this was the reason to keep them separate, but I see your point. > but it > wouldn't be enough to make any real difference and it probably would > be well worth it for the avoidance of complexity. > Okay, will give it a try and see if it can avoid some code complexity. Along with this, I will investigate your other suggestions related to code improvements as well. > > On another note, UNDO_PEEK_DEPTH is bogus. It's used in UndoGetWork() > and it passes the depth argument down to GetRollbackHashKeyFromQueue, > which then does binaryheap_nth() on the relevant queue. 
Note that > this function is another places that is ends up duplicating code > because of the questionable decision to have separate types of queue > entries for each different queue; otherwise, it could probably just > take the binary heap into which it's peeking as an argument instead of > having three different cases. But that's not the main point here. The > main point is that it calls a function for whichever type of queue > we've got and gets some kind of queue entry using binaryheap_nth(). > But binaryheap_nth(whatever, 2) does not give you the third-smallest > element in the binary heap. It gives you the third entry in the > array, which may or may not have the heap property, but even if it > does, the third element could be huge. Consider this binary heap: > > 0 1 10 2 3 11 12 4 5 6 7 13 14 15 16 > > This satisfies the binary heap property, because the element at > position n is always smaller than the elements at positions 2n+1 and > 2n+2 (assuming 0-based indexing). But if you want to look at the > smallest three elements in the heap, you can't just look at indexes > 0..2. The second-smallest element must be at index 1 or 2, but it > could be either place. The third-smallest element could be the other > of 1 and 2, or it could be either child of the smaller one, so there > are three places it might be. In general, a binary heap is not a good > data structure for finding the smallest N elements of a collection > unless N is 1, and what's going to happen with what you've got here is > that we'll sometimes prioritize an item that would not have been > pulled from the queue for a long time over one that would have > otherwise been processe
Re: Switching PL/Python to Python 3 by default in PostgreSQL 12
On 2019-07-07 21:26, Steven Pousty wrote: > The point of the links I sent from the Python community is that they > wanted Python extinct in the wild as of Jan 1 next year. They are never > fixing it, even for a security vulnerability. The operating systems that most of our users are going to run PostgreSQL 12 on will keep maintaining Python 2 for the foreseeable future. I'm not trying to dismiss the importance of managing the Python transition. But this issue has been known for many years, and the current setup is more or less in line with the wider world. For example, the Debian release that came out over the weekend still ships with /usr/bin/python being Python 2. So it is neither timely nor urgent to try to make some significant change about this in PostgreSQL 12 right now. I would welcome patches for this for PostgreSQL 13. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
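For anyone affected in the meantime, nothing prevents a user from opting into Python 3 explicitly today, independent of what the default is (the function name below is made up, and the server must have been built with Python 3 support):

    -- install the Python 3 variant of PL/Python and use it explicitly
    CREATE EXTENSION plpython3u;

    CREATE FUNCTION py3_version() RETURNS text
    LANGUAGE plpython3u
    AS $$
    import sys
    return sys.version
    $$;

    SELECT py3_version();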
Re: Commitfest 2019-07, the first of five* for PostgreSQL 13
On Tue, Jul 2, 2019 at 12:20 AM Thomas Munro wrote:
> It's now July everywhere on Earth, so I marked CF 2019-07 as
> in-progress, and 2019-09 as open for bumping patches into. I pinged
> most of the "Needs Review" threads that don't apply and will do a few
> more tomorrow, and then I'll try to chase patches that fail on CI, and
> then see what I can do to highlight some entries that really need
> review/discussion. I'll do end-of-week status reports.

Hello hackers,

Here's a quick status report after the first week (I think only about 10
commits happened during the week, the rest were pre-CF activity):

         status         | count
------------------------+-------
 Committed              |    32
 Moved to next CF       |     5
 Needs review           |   146
 Ready for Committer    |     7
 Rejected               |     2
 Returned with feedback |     2
 Waiting on Author      |    29
 Withdrawn              |     8

I wondered about reporting on the number of entries that didn't yet have
reviewers signed up, but then I noticed that there isn't a very good
correlation between signed-up reviewers and reviews. So instead, here is a
list of twenty "Needs review" entries that have gone a long time without
communication on the thread. In some cases there has been plenty of review,
so it's time to make decisions. In others, there has been none at all. If
you're having trouble choosing something to review, please pick one of these
and help us figure out how to proceed.

 2026 | Spurious "apparent wraparound" via Simpl | {"Noah Misch"}
 2003 | Fix Deadlock Issue in Single User Mode W | {"Chengchao Yu"}
 1796 | documenting signal handling with readme  | {"Chris Travers"}
 2053 | NOTIFY options + COLLAPSE (make deduplic | {"Filip Rembiałkowski"}
 1974 | pg_stat_statements should notice FOR UPD | {"Andrew Gierth"}
 2061 | [WIP] Show a human-readable n_distinct i | {"Maxence Ahlouche"}
 2077 | fix pgbench -R hanging on pg11           | {"Fabien Coelho"}
 2062 | Unaccent extension python script Issue i | {"Hugh Ranalli","Ramanarayana M"}
 1769 | libpq host/hostaddr consistency          | {"Fabien Coelho"}
 2060 | suppress errors thrown by to_reg*()      | {"takuma hoshiai"}
 2078 | Compile from source using latest Microso | {"Peifeng Qiu"}
 2081 | parse time support function              | {"Pavel Stehule"}
 1800 | amcheck verification for GiST            | {"Andrey Borodin"}
 2018 | pg_basebackup to adjust existing data di | {"Haribabu Kommi"}
 2095 | pg_upgrade version and path checking     | {"Daniel Gustafsson"}
 2044 | propagating replica identity to partitio | {"Álvaro Herrera"}
 2090 | pgbench - implement strict TPC-B benchma | {"Fabien Coelho"}
 2088 | Contribution to Perldoc for TestLib modu | {"Ramanarayana M"}
 2087 | Problem during Windows service start     | {"Ramanarayana M"}
 2093 | Trigger autovacuum on tuple insertion    | {"Darafei Praliaskouski"}

If you have submitted a patch and it's in "Waiting for author" state, please
aim to get it to "Needs review" state soon if you can, as that's where people
are most likely to be looking for things to review.

I have pinged most threads that are in "Needs review" state and don't apply,
compile warning-free, or pass check-world. I'll do some more of that sort of
thing, and I'll highlight a different set of patches next week.

--
Thomas Munro
https://enterprisedb.com
Re: [PATCH] Speedup truncates of relation forks
On Fri, Jul 5, 2019 at 3:03 PM Jamison, Kirk wrote:
> I updated the patch which is similar to V3 of the patch,
> but addressing my problem in (5) in the previous email regarding
> FreeSpaceMapVacuumRange.
> It seems to pass the regression test now. Kindly check for validation.

Hi Kirk,

FYI there are a couple of compiler errors reported:

Windows compiler:

contrib/pg_visibility/pg_visibility.c(400): error C2143: syntax error : missing ')' before '{' [C:\projects\postgresql\pg_visibility.vcxproj]

GCC:

storage.c: In function ‘RelationTruncate’:
storage.c:238:14: error: variable ‘newnblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber newnblocks = InvalidBlockNumber;
              ^
storage.c:237:14: error: variable ‘new_nfsmblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber new_nfsmblocks = InvalidBlockNumber;
              ^
storage.c: In function ‘smgr_redo’:
storage.c:634:15: error: variable ‘newnblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber newnblocks = InvalidBlockNumber;
               ^
storage.c:633:15: error: variable ‘new_nfsmblocks’ set but not used [-Werror=unused-but-set-variable]
  BlockNumber new_nfsmblocks = InvalidBlockNumber;
               ^

--
Thomas Munro
https://enterprisedb.com
Re: Fix doc bug in logical replication.
On 2019-06-27 18:50, Tomas Vondra wrote: >> Whether we *want* to document that it works, documenting that it >> doesn't work when it does can't be the right answer. If you want to >> couch the language to leave the door open that we may not support this >> the same way in the future I wouldn't be opposed to that, but at this >> point we will have three releases with the current behavior in >> production, so if we decide to change the behavior, it is likely going >> to break certain use cases. That may be ok, but I'd expect a >> documentation update to accompany a change that would cause such a >> breaking change. >> > I agree with that. We have this behavior for quite a bit of time, and > while technically we could change the behavior in the future (using the > "not supported" statement), IMO that'd be pretty annoying move. I always > despised systems that "fix" bugs by documenting that it does not work, and > this is a bit similar. committed back to PG10 -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Excessive memory usage in multi-statement queries w/ partitioning
Hi Amit, On Mon, Jul 8, 2019 at 10:52 AM Amit Langote wrote: > > On Thu, Jul 4, 2019 at 6:52 PM Julien Rouhaud wrote: > > On Tue, May 28, 2019 at 6:57 AM Amit Langote wrote: > > > >> Maybe like the attached? I'm not sure if we need to likewise be > > > >> concerned > > > >> about exec_sql_string() being handed multi-query strings. > > > > the whole extension sql script is passed to execute_sql_string(), so I > > think that it's a good thing to have similar workaround there. > > That makes sense, although it is perhaps much less likely for memory > usage explosion to occur in execute_sql_strings(), because the scripts > passed to execute_sql_strings() mostly contain utility statements and > rarely anything whose planning will explode in memory usage. > > Anyway, I've added similar handling in execute_sql_strings() for consistency. Thanks! > Now I wonder if we'll need to consider another path which calls > pg_plan_queries() on a possibly multi-statement query -- > BuildCachedPlan()... I also thought about this when reviewing the patch, but AFAICS you can't provide a multi-statement query to BuildCachedPlan() using prepared statements and I'm not sure that SPI is worth the trouble. I'll mark this patch as ready for committer. > > > About the patch: > > > > -* Switch to appropriate context for constructing querytrees > > (again, > > -* these must outlive the execution context). > > +* Switch to appropriate context for constructing querytrees. > > +* Memory allocated during this construction is released before > > +* the generated plan is executed. > > > > The comment should mention query and plan trees, everything else seems ok > > to me. > > Okay, fixed. > > Attached updated patch. Thanks again. > > Regards, > Amit
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra wrote: > We're running query like this: > > SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 > ORDER BY 1, 2, 3 > > but we're trying to add the incremental sort *before* the aggregation, > because the optimizer also considers group aggregate with a sorted > input. And (a) is a prefix of (a,sum(b),count(b)) so we think we > actually can do this, but clearly that's nonsense, because we can't > possibly know the aggregates yet. Hence the error. > > If this is the actual issue, we need to ensure we actually can evaluate > all the pathkeys. I don't know how to do that yet. I thought that maybe > we should modify pathkeys_common_contained_in() to set presorted_keys to > 0 in this case. > > But then I started wondering why we don't see this issue even for > regular (non-incremental-sort) paths built in create_ordered_paths(). > How come we don't see these failures there? I've modified costing to > make all incremental sort paths very cheap, and still nothing. I assume you mean you modified costing to make regular sort paths very cheap? > So presumably there's a check elsewhere (either implicit or explicit), > because create_ordered_paths() uses pathkeys_common_contained_in() and > does not have the same issue. Given this comment in create_ordered_paths(): generate_gather_paths() will have already generated a simple Gather path for the best parallel path, if any, and the loop above will have considered sorting it. Similarly, generate_gather_paths() will also have generated order-preserving Gather Merge plans which can be used without sorting if they happen to match the sort_pathkeys, and the loop above will have handled those as well. However, there's one more possibility: it may make sense to sort the cheapest partial path according to the required output order and then use Gather Merge. my understanding is that generate_gather_paths() only considers paths that already happen to be sorted (not explicit sorts), so I'm wondering if it would make more sense for the incremental sort path creation for this case to live alongside the explicit ordered path creation in create_ordered_paths() rather than in generate_gather_paths(). If I'm understanding what you're saying properly, I think you'd expected create_ordered_paths() to be roughly similar in what it considers as partial paths and so have the same problem, and I haven't yet read enough of the code to understand if my proposed change actually has any impact on the issue we're discussing, but it seems to me that it at least fits more with what the comments imply. I'll try to look at it a bit more later also, but at the moment other work calls. James Coleman
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra wrote: We're running query like this: SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3 but we're trying to add the incremental sort *before* the aggregation, because the optimizer also considers group aggregate with a sorted input. And (a) is a prefix of (a,sum(b),count(b)) so we think we actually can do this, but clearly that's nonsense, because we can't possibly know the aggregates yet. Hence the error. If this is the actual issue, we need to ensure we actually can evaluate all the pathkeys. I don't know how to do that yet. I thought that maybe we should modify pathkeys_common_contained_in() to set presorted_keys to 0 in this case. But then I started wondering why we don't see this issue even for regular (non-incremental-sort) paths built in create_ordered_paths(). How come we don't see these failures there? I've modified costing to make all incremental sort paths very cheap, and still nothing. I assume you mean you modified costing to make regular sort paths very cheap? No, I mean costing of incremental sort paths, so that they end up being the cheapest ones. If some other path is cheaper, we won't see the error because it only happens when building plan from the cheapest path. So presumably there's a check elsewhere (either implicit or explicit), because create_ordered_paths() uses pathkeys_common_contained_in() and does not have the same issue. Given this comment in create_ordered_paths(): generate_gather_paths() will have already generated a simple Gather path for the best parallel path, if any, and the loop above will have considered sorting it. Similarly, generate_gather_paths() will also have generated order-preserving Gather Merge plans which can be used without sorting if they happen to match the sort_pathkeys, and the loop above will have handled those as well. However, there's one more possibility: it may make sense to sort the cheapest partial path according to the required output order and then use Gather Merge. my understanding is that generate_gather_paths() only considers paths that already happen to be sorted (not explicit sorts), so I'm wondering if it would make more sense for the incremental sort path creation for this case to live alongside the explicit ordered path creation in create_ordered_paths() rather than in generate_gather_paths(). How would that solve the issue? Also, we're building a gather path, so I think generate_gather_paths() is the right place where to do it. And we're not changing the semantics of generate_gather_paths() - the result path should be sorted "correctly" with respect to sort_pathkeys. If I'm understanding what you're saying properly, I think you'd expected create_ordered_paths() to be roughly similar in what it considers as partial paths and so have the same problem, and I haven't yet read enough of the code to understand if my proposed change actually has any impact on the issue we're discussing, but it seems to me that it at least fits more with what the comments imply. Roughly. AFAICS the problem is that we're trying to use pathkeys that are only valid for the (aggregated) upper relation, not before it. I have to admit I've never quite understood this pathkeys business. I mean, I know what pathkeys are for, but clearly it's not valid to look at root->sort_pathkeys at this place. 
Maybe there's some other field we should be looking at instead, or maybe it's ensured by grouping_planner in some implicit way ... I.e. the question is - when you do a query like SELECT a, count(*) FROM t GROUP BY a ORDER BY a, count(*); and cost the incremental sort extremely cheap, how come we don't end up with the same issue? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 06:04:28PM +0900, Masahiko Sawada wrote: > On Sun, Jul 7, 2019 at 1:05 AM Bruce Momjian wrote: > > What about referential integrity constraints that need to check primary > > keys in the encrypted tables? I also don't see a way of delaying that, > > and if you can't do referential integrity into the encrypted tables, it > > reduces the value of having encrypted data in the same database rather > > than in another database or cluster? > > I just thought that PostgreSQL's auxiliary processes such as > autovacuum, startup, checkpointer, bgwriter should always be able to > access all keys because there are already in inside database. Even > today these processes don't check any privileges when accessing to > data. What security threats we can protect data from by requiring > privileges even for auxiliary processes? If this is a security problem > isn't it also true for cluster-wide encryption? I guess that processes > who have an access privilege on the table can always get the > corresponding encryption key. And any processes cannot access an > encryption key directly without accessing to a database object. Well, see my list of three things that users want in an earlier email: https://www.postgresql.org/message-id/20190706160514.b67q4f7abcxfd...@momjian.us When people are asking for multiple keys (not just for key rotation), they are asking to have multiple keys that can be supplied by users only when they need to access the data. Yes, the keys are always in the datbase, but the feature request is that they are only unlocked when the user needs to access the data. Obviously, that will not work for autovacuum when the encryption is at the block level. If the key is always unlocked, there is questionable security value of having multiple keys, beyond key rotation. > > I still feel we have not clearly described what the options are: > > > > 1. Encrypt everything > > > > 2. Encrypt only some tables (for performance reasons), and use only one > > key, or use multiple keys to allow for key rotation. All keys are > > always unlocked. > > > > 3. Encrypt only some tables with different keys, and the keys are not > > always unlocked. > > > > As Tomas already stated, using tablespaces to distinguish encrypted from > > non-encrypted tables doesn't make sense since the file system used for > > the storage is immaterial to the encryptions status. An easier way would > > be to just add a bit to WAL that would indicate if the rest of the WAL > > record is encrypted, though I doubt the performance boost is worth the > > complexity. > > Okay, instead of using tablespaces we can create groups grouping > tables being encrypted with the same key. I think the one of the most > important point here is to provide a granular encryption feature and Why is this important? What are you trying to accomplish? > have less the number of keys in database cluster, not to provide per > tablespace encryption feature. I'm not going to insist it should be > per tablespace encryption. It is unclear which item you are looking for. Which number are you suggesting from the three listed above in the email URL? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: allow_system_table_mods stuff
On Sun, Jul 7, 2019 at 11:45:49PM -0400, Bruce Momjian wrote: > On Mon, Jun 24, 2019 at 11:20:51AM -0400, Tom Lane wrote: > > I do see value in two switches not one, but it's what I said above, > > to not need to give people *more* chance-to-break-things than they > > had before when doing manual catalog fixes. That is, we need a > > setting that corresponds more or less to current default behavior. > > > > There's an aesthetic argument to be had about whether to have two > > bools or one three-way switch, but I prefer the former; there's > > no backward-compatibility issue here since allow_system_table_mods > > couldn't be set by applications anyway. > > I like a single three-way switch since if you are allowing DDL, you > probably don't care if you restrict DML. log_statement already has a > similar distinction with values of none, ddl, mod, all. I assume > allow_system_table_mods could have value of false, dml, true. Or, to match log_statement, use: none, dml, all. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
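For illustration, the sort of usage being discussed would look roughly like this; the value names follow the log_statement-style suggestion above, and nothing here is a committed interface:

-- hypothetical three-way setting (names and settability still under discussion)
SET allow_system_table_mods = 'none';  -- default: reject catalog modifications
SET allow_system_table_mods = 'dml';   -- allow DML on catalogs, still reject DDL
SET allow_system_table_mods = 'all';   -- allow both DML and DDL on catalogs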
Re: Broken defenses against dropping a partitioning column
On 2019-Jul-07, Tom Lane wrote: > Ideally, perhaps, a DROP CASCADE like this would not cascade to > the whole table but only to the table's partitioned-ness property, > leaving you with a non-partitioned table with most of its data > intact. It would take a lot of work to make that happen though, > and it certainly wouldn't be back-patchable, and I'm not really > sure it's worth it. Maybe we can add dependencies to rows of the pg_partitioned_table relation, with the semantics of "depends on the partitioned-ness of the table"? That said, I'm not sure I see the use case for an ALTER TABLE .. DROP COLUMN command that turns a partitioned table (with existing partitions containing data) into one non-partitioned table with all data minus the partitioning column(s). This seems vaguely related to the issue of dropping foreign keys; see https://postgr.es/m/20190329152239.GA29258@alvherre.pgsql wherein I settled with a non-ideal solution to the problem of being unable to depend on something that did not cause the entire table to be dropped in certain cases. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra wrote: > > On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: > >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra > > wrote: > >> We're running query like this: > >> > >> SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 > >> ORDER BY 1, 2, 3 > >> > >> but we're trying to add the incremental sort *before* the aggregation, > >> because the optimizer also considers group aggregate with a sorted > >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we > >> actually can do this, but clearly that's nonsense, because we can't > >> possibly know the aggregates yet. Hence the error. > >> > >> If this is the actual issue, we need to ensure we actually can evaluate > >> all the pathkeys. I don't know how to do that yet. I thought that maybe > >> we should modify pathkeys_common_contained_in() to set presorted_keys to > >> 0 in this case. > >> > >> But then I started wondering why we don't see this issue even for > >> regular (non-incremental-sort) paths built in create_ordered_paths(). > >> How come we don't see these failures there? I've modified costing to > >> make all incremental sort paths very cheap, and still nothing. > > > >I assume you mean you modified costing to make regular sort paths very cheap? > > > > No, I mean costing of incremental sort paths, so that they end up being > the cheapest ones. If some other path is cheaper, we won't see the error > because it only happens when building plan from the cheapest path. Ah, I misunderstood as you trying to figure out a way to try to cause the same problem with a regular sort. > >> So presumably there's a check elsewhere (either implicit or explicit), > >> because create_ordered_paths() uses pathkeys_common_contained_in() and > >> does not have the same issue. > > > >Given this comment in create_ordered_paths(): > > > > generate_gather_paths() will have already generated a simple Gather > > path for the best parallel path, if any, and the loop above will have > > considered sorting it. Similarly, generate_gather_paths() will also > > have generated order-preserving Gather Merge plans which can be used > > without sorting if they happen to match the sort_pathkeys, and the loop > > above will have handled those as well. However, there's one more > > possibility: it may make sense to sort the cheapest partial path > > according to the required output order and then use Gather Merge. > > > >my understanding is that generate_gather_paths() only considers paths > >that already happen to be sorted (not explicit sorts), so I'm > >wondering if it would make more sense for the incremental sort path > >creation for this case to live alongside the explicit ordered path > >creation in create_ordered_paths() rather than in > >generate_gather_paths(). > > > > How would that solve the issue? Also, we're building a gather path, so > I think generate_gather_paths() is the right place where to do it. And > we're not changing the semantics of generate_gather_paths() - the result > path should be sorted "correctly" with respect to sort_pathkeys. Does that imply what the explicit sort in create_ordered_paths() is in the wrong spot? Or, to put it another way, do you think that both kinds of sorts should be added in the same place? It seems confusing to me that they'd be split between the two methods (unless I'm completely misunderstanding how the two work). I'm not saying it would solve the issue here; just noting that the division of labor seemed odd to me at first read through. 
James Coleman
Re: POC: Cleaning up orphaned files using undo logs
On Mon, Jul 8, 2019 at 6:57 AM Amit Kapila wrote: > I didn't have other use cases in mind and I think to some extent this > argument holds true for existing binaryheap_allocate allocate. If we > want to make it more generic, then shouldn't we try to even change the > existing binaryheap_allocate to use this new model, that way the > binary heap allocation API will be more generic? No. binaryheap_allocate is fine for simple cases and there's no reason that I can see to change it. > You are right that it won't be nth smallest element from the queue and > we don't even care for that here. The peeking logic is not to find > the next prioritized element but to check if we can find some element > for the same database in the next few entries to avoid frequent undo > worker restart. You *should* care for that here. The whole purpose of a binary heap is to help us choose which task we should do first and which ones should be done later. I don't see why it's OK to decide that we only care about doing the tasks in priority order sometimes, and other times it's OK to just pick semi-randomly. > This is a good test scenario, but I think it has not been considered > that there are multiple queues and we peek into each one. I think that makes very little difference, so I don't see why it should be considered. It's true that this will sometimes mask the problem, but so what? An algorithm that works 90% of the time is not much better than one that works 80% of the time, and neither is the caliber of work we are expecting to see in PostgreSQL. > I think we should once try with the actual program which can test such > a scenario on the undo patches before reaching any conclusion. I or > one of my colleague will work on this and report back the results. There is certainly a place for empirical testing of a patch like this (perhaps even before posting it). It does not substitute for a good theoretical explanation of why the algorithm is correct, and I don't think it is. > > You can construct way worse examples than > > this one, too: imagine that there are two databases, each with a > > worker, and one has 99% of the requests and the other one has 1% of > > the requests. It's really unlikely that there's going to be an entry > > for the second database within the lookahead window. > > I am not sure if that is the case because as soon as the request from > other database get prioritized (say because its XID becomes older) and > came as the first request in one of the queues, the undo worker will > exit (provided it has worked for some threshold time (10s) in that > database) and allow the request from another database to be processed. I don't see how this responds to what I wrote. Neither worker needs to exit in this scenario, but the worker from the less-popular database is likely to exit anyway, which seems like it's probably not the right thing. > > And note that > > increasing the window doesn't really help either: you just need more > > databases than the size of the lookahead window, or even almost as > > many as the lookahead window, and things are going to stop working > > properly. > > > > On the other hand, suppose that you have 10 databases and one undo > > worker. One database is pretty active and generates a continuous > > stream of undo requests at exactly the same speed we can process them. > > The others all have 1 pending undo request. Now, what's going to > > happen is that you'll always find the undo request for the current > > database within the lookahead window. So, you'll never exit. 
> > Following the logic given above, I think here also worker will exit as > soon as the request from other database get prioritised. OK. > > But > > that means the undo requests in the other 9 databases will just sit > > there for all eternity, because there's no other worker to process > > them. On the other hand, if you had 11 databases, there's a good > > chance it would work fine, because the new request for the active > > database would likely be outside the lookahead window, and so you'd > > find no work to do and exit, allowing a worker to be started up in > > some other database. > > As explained above, I think it will work the same way both for 10 or > 11 databases. Note, that we don't always try to look ahead. We look > ahead when we have not worked on the current database for some > threshold amount of time. That's interesting, and it means that some of the scenarios that I mentioned are not problems. However, I don't believe it means that your code is actually correct. It's just means that it's wrong in different ways. The point is that, with the way you've implemented this, whenever you do lookahead, you will, basically randomly, sometimes find the next entry for the current database within the lookahead window, and sometimes you won't. And sometimes it will be the next-highest-priority request, and sometimes it won't. That just cannot possibly be the right thing to do. Would you
Re: Duplicated LSN in ReorderBuffer
On Mon, Jul 8, 2019 at 9:00 AM Thomas Munro wrote: > > On Wed, Jun 26, 2019 at 2:46 AM Ildar Musin wrote: > > Attached is a simple patch that uses subxid instead of top-level xid > > in ReorderBufferAddNewTupleCids() call. It seems to fix the bug, but > > i'm not sure that this is a valid change. Can someone please verify it > > or maybe suggest a better solution for the issue? > I've reproduced this issue with script Ildar provided. I don't find out the root cause yet and I'm not sure the patch takes a correct way to fix this. In my environment, I got the following pg_waldump output and the logical decoding failed at 0/019FA058 when processing NEW_CID. 90489 rmgr: Transaction len (rec/tot): 38/38, tx: 1999, lsn: 0/019F9E80, prev 0/019F9E38, desc: ASSIGNMENT xtop 1998: subxacts: 1999 90490 rmgr: Standby len (rec/tot):405/ 405, tx: 0, lsn: 0/019F9EA8, prev 0/019F9E80, desc: RUNNING_XACTS nextXid 2000 latestCompletedXid 1949 oldestRunningXid 1836; 48 xacts: 1990 1954 1978 1850 1944 1972 1940 1924 1906 1970 1985 1998 1966 1987 1975 1858 1914 1982 1958 1840 1920 1926 1992 1962 1 90490 910 1950 1874 1928 1974 1968 1946 1912 1918 1996 1922 1930 1964 1952 1994 1934 1980 1836 1984 1960 1956 1916 1908 1938 90491 rmgr: Heap2 len (rec/tot): 60/60, tx: 1999, lsn: 0/019FA058, prev 0/019F9EA8, desc: NEW_CID rel 1663/12678/2615; tid 11/59; cmin: 0, cmax: 4294967295, combo: 4294967295 90492 rmgr: Heaplen (rec/tot):127/ 127, tx: 1999, lsn: 0/019FA098, prev 0/019FA058, desc: INSERT off 59 flags 0x00, blkref #0: rel 1663/12678/2615 blk 11 I thought that the logical decoding doesn't create ReorderBufferTXN of xid=1999 when processing NEW_CID since it decodes ASSIGNMENT of xid=1999 beforehand. But what actually happen is that it skips NEW_CID since the state of snapshot builder is SNAPBUILD_BUILDING_SNAPSHOT yet and then the state becomes SNAPBUILD_FULL_SNAPSHOT when processing RUNNING_XACTS , and therefore it creates two ReorderBufferTXN entries for xid = 1999 and xid = 1998 as top-level transactions when processing NEW_CID (ReorderBufferXidSetCatalogChanges creates xid=1999 and ReorderBufferAddNewTupleCids creates xid = 1998). And therefore it got the assertion failure when adding ReorderBufferTXN of xid = 1998. I'll look into this more deeply tomorrow. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote: On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra wrote: On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra > wrote: >> We're running query like this: >> >> SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) < 3 ORDER BY 1, 2, 3 >> >> but we're trying to add the incremental sort *before* the aggregation, >> because the optimizer also considers group aggregate with a sorted >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we >> actually can do this, but clearly that's nonsense, because we can't >> possibly know the aggregates yet. Hence the error. >> >> If this is the actual issue, we need to ensure we actually can evaluate >> all the pathkeys. I don't know how to do that yet. I thought that maybe >> we should modify pathkeys_common_contained_in() to set presorted_keys to >> 0 in this case. >> >> But then I started wondering why we don't see this issue even for >> regular (non-incremental-sort) paths built in create_ordered_paths(). >> How come we don't see these failures there? I've modified costing to >> make all incremental sort paths very cheap, and still nothing. > >I assume you mean you modified costing to make regular sort paths very cheap? > No, I mean costing of incremental sort paths, so that they end up being the cheapest ones. If some other path is cheaper, we won't see the error because it only happens when building plan from the cheapest path. Ah, I misunderstood as you trying to figure out a way to try to cause the same problem with a regular sort. >> So presumably there's a check elsewhere (either implicit or explicit), >> because create_ordered_paths() uses pathkeys_common_contained_in() and >> does not have the same issue. > >Given this comment in create_ordered_paths(): > > generate_gather_paths() will have already generated a simple Gather > path for the best parallel path, if any, and the loop above will have > considered sorting it. Similarly, generate_gather_paths() will also > have generated order-preserving Gather Merge plans which can be used > without sorting if they happen to match the sort_pathkeys, and the loop > above will have handled those as well. However, there's one more > possibility: it may make sense to sort the cheapest partial path > according to the required output order and then use Gather Merge. > >my understanding is that generate_gather_paths() only considers paths >that already happen to be sorted (not explicit sorts), so I'm >wondering if it would make more sense for the incremental sort path >creation for this case to live alongside the explicit ordered path >creation in create_ordered_paths() rather than in >generate_gather_paths(). > How would that solve the issue? Also, we're building a gather path, so I think generate_gather_paths() is the right place where to do it. And we're not changing the semantics of generate_gather_paths() - the result path should be sorted "correctly" with respect to sort_pathkeys. Does that imply what the explicit sort in create_ordered_paths() is in the wrong spot? I think those are essentially the right places where to do this sort of stuff. Maybe there's a better place, but I don't think those places are somehow wrong. Or, to put it another way, do you think that both kinds of sorts should be added in the same place? It seems confusing to me that they'd be split between the two methods (unless I'm completely misunderstanding how the two work). 
The paths built in those two places were very different in one aspect:

1) generate_gather_paths only ever looked at pathkeys for the subpath; it never even looked at ordering expected by paths above it (or the query as a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims to maintain ordering of the different subpaths.

2) create_ordered_paths is meant to enforce ordering needed by upper parts of the plan - either by using a properly sorted path, or adding an explicit sort.

We want to extend (1) to also look at ordering expected by the upper parts of the plan, and consider incremental sort if applicable. (2) already does that, and it already has the correct pathkeys to enforce. But looking at root->sort_pathkeys in (1) seems to be the wrong thing :-( The thing is, we don't have just sort_pathkeys, there's distinct_pathkeys and group_pathkeys too, so maybe we should be looking at those too? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [RFC] [PATCH] Flexible "partition pruning" hook
On Sun, Jul 7, 2019 at 9:46 PM Thomas Munro wrote: > On Sat, Apr 6, 2019 at 3:06 PM Andres Freund wrote: > > I've moved this to the next CF, and marked it as targeting v13. > > Hi Mike, > > Commifest 1 for PostgreSQL 13 is here. I was just going through the > automated build results for the 'fest and noticed that your patch > causes a segfault in the regression tests (possibly due to other > changes that have been made in master since February). You can see > the complete backtrace on the second link below, but it looks like > this is happening every time, so hopefully not hard to track down > locally. > > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.46412 > https://travis-ci.org/postgresql-cfbot/postgresql/builds/555380617 Thanks, Thomas, I'll look into this today. -- Mike Palmiotto Software Engineer Crunchy Data Solutions https://crunchydata.com
Re: Broken defenses against dropping a partitioning column
On Mon, Jul 8, 2019 at 10:32 AM Alvaro Herrera wrote: > That said, I'm not sure I see the use case for an ALTER TABLE .. DROP > COLUMN command that turns a partitioned table (with existing partitions > containing data) into one non-partitioned table with all data minus the > partitioning column(s). I think it would be useful to have "ALTER TABLE blah NOT PARTITIONED" but I -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Broken defenses against dropping a partitioning column
On Mon, Jul 8, 2019 at 11:02 AM Robert Haas wrote: > On Mon, Jul 8, 2019 at 10:32 AM Alvaro Herrera > wrote: > > That said, I'm not sure I see the use case for an ALTER TABLE .. DROP > > COLUMN command that turns a partitioned table (with existing partitions > > containing data) into one non-partitioned table with all data minus the > > partitioning column(s). > > I think it would be useful to have "ALTER TABLE blah NOT PARTITIONED" but I ...hit send too soon, and also, I don't think anyone will be very happy if they get that behavior as a side effect of a DROP statement, mostly because it could take an extremely long time to execute. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 7/8/19 10:19 AM, Bruce Momjian wrote: > When people are asking for multiple keys (not just for key rotation), > they are asking to have multiple keys that can be supplied by users only > when they need to access the data. Yes, the keys are always in the > datbase, but the feature request is that they are only unlocked when the > user needs to access the data. Obviously, that will not work for > autovacuum when the encryption is at the block level. > If the key is always unlocked, there is questionable security value of > having multiple keys, beyond key rotation. That is not true. Having multiple keys also allows you to reduce the amount of data encrypted with a single key, which is desirable because: 1. It makes cryptanalysis more difficult 2. Puts less data at risk if someone gets "lucky" in doing brute force Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
Re: Broken defenses against dropping a partitioning column
Robert Haas writes: > On Mon, Jul 8, 2019 at 11:02 AM Robert Haas wrote: >> On Mon, Jul 8, 2019 at 10:32 AM Alvaro Herrera >> wrote: >>> That said, I'm not sure I see the use case for an ALTER TABLE .. DROP >>> COLUMN command that turns a partitioned table (with existing partitions >>> containing data) into one non-partitioned table with all data minus the >>> partitioning column(s). >> I think it would be useful to have "ALTER TABLE blah NOT PARTITIONED" but I > ...hit send too soon, and also, I don't think anyone will be very > happy if they get that behavior as a side effect of a DROP statement, > mostly because it could take an extremely long time to execute. FWIW, I was imagining the action as being (1) detach all the child partitions, (2) make parent into a non-partitioned table, (3) drop the target column in each of these now-independent tables. No data movement. Other than the need to acquire locks on all the tables, it shouldn't be particularly slow. But I'm still not volunteering to write it, because I'm not sure anyone would want such a behavior. regards, tom lane
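In SQL terms, steps (1) and (3) already have manual equivalents today; it is step (2), stripping the partitioned-ness from the parent, that has no command. A rough sketch with hypothetical object names:

-- hypothetical names: partitioned table "parent" with partition key column
-- "part_col" and partitions "child1" and "child2"
ALTER TABLE parent DETACH PARTITION child1;
ALTER TABLE parent DETACH PARTITION child2;
ALTER TABLE child1 DROP COLUMN part_col;  -- allowed once detached
ALTER TABLE child2 DROP COLUMN part_col;
-- there is no command to turn "parent" itself back into a plain table,
-- which is the piece the DROP-CASCADE behavior imagined above would need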
Re: Broken defenses against dropping a partitioning column
Alvaro Herrera writes: > That said, I'm not sure I see the use case for an ALTER TABLE .. DROP > COLUMN command that turns a partitioned table (with existing partitions > containing data) into one non-partitioned table with all data minus the > partitioning column(s). Yeah, it'd be a lot of work for a dubious goal. > This seems vaguely related to the issue of dropping foreign keys; see > https://postgr.es/m/20190329152239.GA29258@alvherre.pgsql wherein I > settled with a non-ideal solution to the problem of being unable to > depend on something that did not cause the entire table to be dropped > in certain cases. That's an interesting analogy. Re-reading that thread, what I said in <29497.1554217...@sss.pgh.pa.us> seems pretty apropos to the current problem: >> FWIW, I think that the dependency mechanism is designed around the idea >> that whether it's okay to drop a *database object* depends only on what >> other *database objects* rely on it, and that you can always make a DROP >> valid by dropping all the dependent objects. That's not an unreasonable >> design assumption, considering that the SQL standard embeds the same >> assumption in its RESTRICT/CASCADE syntax. I think that is probably a fatal objection to my idea of putting an error check into RemoveAttributeById(). As an example, consider the possibility that somebody makes a temporary type and then makes a permanent table with a partitioning column of that type. What shall we do at session exit? Failing to remove the temp type is not an acceptable choice, because that leaves us with a permanently broken temp schema (compare bug #15631 [1]). Also, I don't believe we can make that work without order-of-operations problems in cases comparable to the original bug in this thread [2]. One or the other order of the object OIDs is going to lead to the column being visited for deletion before the whole table is, and again rejecting the column deletion is not going to be an acceptable behavior. So I think we're probably stuck with the approach of adding new internal dependencies. If we go that route, then our options for the released branches are (1) do nothing, or (2) back-patch the code that adds such dependencies, but without a catversion bump. That would mean that only tables created after the next minor releases would have protection against this problem. That's not ideal but maybe it's okay, considering that we haven't seen actual field reports of trouble of this kind. regards, tom lane [1] https://www.postgresql.org/message-id/flat/15631-188663b383e1e697%40postgresql.org [2] https://www.postgresql.org/message-id/flat/CA%2Bu7OA4JKCPFrdrAbOs7XBiCyD61XJxeNav4LefkSmBLQ-Vobg%40mail.gmail.com
Re: Add missing operator <->(box, point)
On Mon, Mar 11, 2019 at 2:49 AM Nikita Glukhov wrote: > 2. Add <-> to GiST box_ops. >Extracted gist_box_distance_helper() common for gist_box_distance() and >gist_bbox_distance(). For me it doesn't look worth having two distinct functions gist_box_distance_helper() and gist_bbox_distance(). What about having just one and leave responsibility for recheck flag to the caller? > 3. Add <-> to SP-GiST. >Changed only catalog and tests. Box case is already checked in >spg_box_quad_leaf_consistent(): > out->recheckDistances = distfnoid == F_DIST_POLYP; So, it seems to be fix of oversight in 2a6368343ff4. But assuming fixing this requires catalog changes, we shouldn't backpatch this. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
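For anyone following along, the asymmetry is visible from plain SQL: the point-to-box form of the distance operator exists today, while the box-to-point form is what the patch adds. A small illustration (the second query is expected to fail on an unpatched server with an "operator does not exist" error):

SELECT point '(3,3)' <-> box '((0,0),(1,1))';   -- works today
SELECT box '((0,0),(1,1))' <-> point '(3,3)';   -- the commutator the patch adds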
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > When people are asking for multiple keys (not just for key rotation), > > they are asking to have multiple keys that can be supplied by users only > > when they need to access the data. Yes, the keys are always in the > > datbase, but the feature request is that they are only unlocked when the > > user needs to access the data. Obviously, that will not work for > > autovacuum when the encryption is at the block level. > > > If the key is always unlocked, there is questionable security value of > > having multiple keys, beyond key rotation. > > That is not true. Having multiple keys also allows you to reduce the > amount of data encrypted with a single key, which is desirable because: > > 1. It makes cryptanalysis more difficult > 2. Puts less data at risk if someone gets "lucky" in doing brute force What systems use multiple keys like that? I know of no website that does that. Your arguments seem hypothetical. What is your goal here? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: errbacktrace
> On Sat, Jun 29, 2019 at 7:41 AM Jaime Casanova wrote:
>
> This is certainly a very useful thing. Sadly, it doesn't seem to compile when
> trying to use libunwind.

Yeah, the same for me. To make it work I've restricted libunwind to local unwinding only:

#ifdef USE_LIBUNWIND
#define UNW_LOCAL_ONLY
#include <libunwind.h>
#endif

And the result looks pretty nice:

2019-07-08 17:24:08.406 CEST [31828] ERROR: invalid input syntax for type integer: "foobar" at character 12
2019-07-08 17:24:08.406 CEST [31828] BACKTRACE:
#0 pg_strtoint32+0x1d1 [0x55fa389bcbbe]
#1 int4in+0xd [0x55fa38976d7b]
#2 InputFunctionCall+0x6f [0x55fa38a488e9]
#3 OidInputFunctionCall+0x44 [0x55fa38a48b0d]
#4 stringTypeDatum+0x33 [0x55fa386e222e]
#5 coerce_type+0x26d [0x55fa386ca14d]
#6 coerce_to_target_type+0x79 [0x55fa386c9494]
#7 transformTypeCast+0xaa [0x55fa386d0042]
#8 transformExprRecurse+0x22f [0x55fa386cf650]
#9 transformExpr+0x1a [0x55fa386cf30a]
#10 transformTargetEntry+0x79 [0x55fa386e1131]
#11 transformTargetList+0x86 [0x55fa386e11ce]
#12 transformSelectStmt+0xa1 [0x55fa386a29c9]
#13 transformStmt+0x9d [0x55fa386a345a]
#14 transformOptionalSelectInto+0x94 [0x55fa386a3f49]
#15 transformTopLevelStmt+0x15 [0x55fa386a3f88]
#16 parse_analyze+0x4e [0x55fa386a3fef]
#17 pg_analyze_and_rewrite+0x3e [0x55fa3890cfa5]
#18 exec_simple_query+0x35b [0x55fa3890d5b5]
#19 PostgresMain+0x91f [0x55fa3890f7a8]
#20 BackendRun+0x1ac [0x55fa3887ed17]
#21 BackendStartup+0x15c [0x55fa38881ea1]
#22 ServerLoop+0x1e6 [0x55fa388821bb]
#23 PostmasterMain+0x1101 [0x55fa388835a1]
#24 main+0x21a [0x55fa387db1a9]
#25 __libc_start_main+0xe7 [0x7f3d1a607fa7]
#26 _start+0x2a [0x55fa3858e4ea]
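For reference, a session producing output like the above would presumably look roughly as follows; the setting name is taken from the patch snippet Álvaro quotes below and may not match whatever finally gets committed:

-- hypothetical: backtrace_function comes from the in-progress patch, not a
-- released GUC, and its exact name/semantics may still change
SET backtrace_function = 'pg_strtoint32';
SELECT 'foobar'::integer;  -- raises the error above, now with a BACKTRACE: section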
Re: tableam vs. TOAST
On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu wrote:
> I have tested the TOAST patches(v3) with different storage options like(MAIN,
> EXTERNAL, EXTENDED, etc.), and combinations of compression and out-of-line storage options.
> I have used a few dummy tables with various tuple count say 10k, 20k, 40k, etc. with different column lengths.
> Used manual CHECKPOINT option with (checkpoint_timeout = 1d, max_wal_size = 10GB) before the test to avoid
> performance fluctuations, and calculated the results as a median value of a few consecutive test executions.

Thanks for testing.

> All the observation looks good to me,
> except for the "Test1" for SCC UPDATE with tuple count(10K/20K), for SCC INSERT with tuple count(40K)
> there was a slightly increse in time taken incase of "with patch" result. For a better observation,
> I also have ran the same "Test 1" for higher tuple count(i.e. 80K), and it also looks fine.

Did you run each test just once? How stable are the results?

> While testing few concurrent transactions I have below query:
> -- Concurrent transactions acquire a lock for TOAST option(ALTER TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED/ etc)
>
> -- Session 1:
> CREATE TABLE a (a_id text PRIMARY KEY);
> CREATE TABLE b (b_id text);
> INSERT INTO a VALUES ('a'), ('b');
> INSERT INTO b VALUES ('a'), ('b'), ('b');
>
> BEGIN;
> ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id);
> -- Not Acquiring any lock

For me, this acquires AccessShareLock and ShareRowExclusiveLock on the target table.

rhaas=# select locktype, database, relation, pid, mode, granted from pg_locks where relation = 'b'::regclass;
 locktype | database | relation |  pid  |         mode          | granted
----------+----------+----------+-------+-----------------------+---------
 relation |    16384 |    16872 | 93197 | AccessShareLock       | t
 relation |    16384 |    16872 | 93197 | ShareRowExclusiveLock | t
(2 rows)

I don't see what that has to do with the topic at hand, though.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > When people are asking for multiple keys (not just for key rotation), > > > they are asking to have multiple keys that can be supplied by users only > > > when they need to access the data. Yes, the keys are always in the > > > datbase, but the feature request is that they are only unlocked when the > > > user needs to access the data. Obviously, that will not work for > > > autovacuum when the encryption is at the block level. > > > > > If the key is always unlocked, there is questionable security value of > > > having multiple keys, beyond key rotation. > > > > That is not true. Having multiple keys also allows you to reduce the > > amount of data encrypted with a single key, which is desirable because: > > > > 1. It makes cryptanalysis more difficult > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > What systems use multiple keys like that? I know of no website that > does that. Your arguments seem hypothetical. What is your goal here? Not sure what the reference to 'website' is here, but one doesn't get certificates for TLS/SSL usage that aren't time-bounded, and when it comes to the actual on-the-wire encryption that's used, that's a symmetric key that's generated on-the-fly for every connection. Wouldn't the fact that they generate a different key for every connection be a pretty clear indication that it's a good idea to use multiple keys and not use the same key over and over..? Of course, we can discuss if what websites do with over-the-wire encryption is sensible to compare to what we want to do in PG for data-at-rest, but then we shouldn't be talking about what websites do, it'd make more sense to look at other data-at-rest encryption systems and consider what they're doing. Thanks, Stephen signature.asc Description: PGP signature
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 2019-07-08 17:47, Stephen Frost wrote: > Of course, we can discuss if what websites do with over-the-wire > encryption is sensible to compare to what we want to do in PG for > data-at-rest, but then we shouldn't be talking about what websites do, > it'd make more sense to look at other data-at-rest encryption systems > and consider what they're doing. So, how do encrypted file systems do it? Are there any encrypted file systems in general use that allow encrypting only some files or encrypting different parts of the file system with different keys, or any of those other granular approaches being discussed? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jul 8, 2019 at 10:58 AM Tomas Vondra wrote: > > On Mon, Jul 08, 2019 at 10:32:18AM -0400, James Coleman wrote: > >On Mon, Jul 8, 2019 at 9:59 AM Tomas Vondra > > wrote: > >> > >> On Mon, Jul 08, 2019 at 09:22:39AM -0400, James Coleman wrote: > >> >On Sun, Jul 7, 2019 at 5:02 PM Tomas Vondra > >> > wrote: > >> >> We're running query like this: > >> >> > >> >> SELECT a, sum(b), count(*) FROM pagg_tab_ml GROUP BY a HAVING avg(b) > >> >> < 3 ORDER BY 1, 2, 3 > >> >> > >> >> but we're trying to add the incremental sort *before* the aggregation, > >> >> because the optimizer also considers group aggregate with a sorted > >> >> input. And (a) is a prefix of (a,sum(b),count(b)) so we think we > >> >> actually can do this, but clearly that's nonsense, because we can't > >> >> possibly know the aggregates yet. Hence the error. > >> >> > >> >> If this is the actual issue, we need to ensure we actually can evaluate > >> >> all the pathkeys. I don't know how to do that yet. I thought that maybe > >> >> we should modify pathkeys_common_contained_in() to set presorted_keys to > >> >> 0 in this case. > >> >> > >> >> But then I started wondering why we don't see this issue even for > >> >> regular (non-incremental-sort) paths built in create_ordered_paths(). > >> >> How come we don't see these failures there? I've modified costing to > >> >> make all incremental sort paths very cheap, and still nothing. > >> > > >> >I assume you mean you modified costing to make regular sort paths very > >> >cheap? > >> > > >> > >> No, I mean costing of incremental sort paths, so that they end up being > >> the cheapest ones. If some other path is cheaper, we won't see the error > >> because it only happens when building plan from the cheapest path. > > > >Ah, I misunderstood as you trying to figure out a way to try to cause > >the same problem with a regular sort. > > > >> >> So presumably there's a check elsewhere (either implicit or explicit), > >> >> because create_ordered_paths() uses pathkeys_common_contained_in() and > >> >> does not have the same issue. > >> > > >> >Given this comment in create_ordered_paths(): > >> > > >> > generate_gather_paths() will have already generated a simple Gather > >> > path for the best parallel path, if any, and the loop above will have > >> > considered sorting it. Similarly, generate_gather_paths() will also > >> > have generated order-preserving Gather Merge plans which can be used > >> > without sorting if they happen to match the sort_pathkeys, and the loop > >> > above will have handled those as well. However, there's one more > >> > possibility: it may make sense to sort the cheapest partial path > >> > according to the required output order and then use Gather Merge. > >> > > >> >my understanding is that generate_gather_paths() only considers paths > >> >that already happen to be sorted (not explicit sorts), so I'm > >> >wondering if it would make more sense for the incremental sort path > >> >creation for this case to live alongside the explicit ordered path > >> >creation in create_ordered_paths() rather than in > >> >generate_gather_paths(). > >> > > >> > >> How would that solve the issue? Also, we're building a gather path, so > >> I think generate_gather_paths() is the right place where to do it. And > >> we're not changing the semantics of generate_gather_paths() - the result > >> path should be sorted "correctly" with respect to sort_pathkeys. > > > >Does that imply what the explicit sort in create_ordered_paths() is in > >the wrong spot? 
> > > > I think those are essentially the right places where to do this sort of > stuff. Maybe there's a better place, but I don't think those places are > somehow wrong. > > >Or, to put it another way, do you think that both kinds of sorts > >should be added in the same place? It seems confusing to me that > >they'd be split between the two methods (unless I'm completely > >misunderstanding how the two work). > > > > The paths built in those two places were very different in one aspect: > > 1) generate_gather_paths only ever looked at pathkeys for the subpath, it > never even looked at ordering expected by paths above it (or the query as > a whole). Plain Gather ignores pathkeys entirely, Gather Merge only aims > to maintain ordering of the different subpaths. > > 2) create_oredered_paths is meant to enforce ordering needed by upper > parts of the plan - either by using a properly sorted path, or adding an > explicit sort. > > > We want to extend (1) to also look at ordering expected by the upper parts > of the plan, and consider incremental sort if applicable. (2) already does > that, and it already has the correct pathkeys to enforce. I guess I'm still not following. If (2) is responsible (currently) for adding an explicit sort, why wouldn't adding an incremental sort be an example of that responsibility? The subpath that either a Sort or IncrementalSort is being added on top of (to then feed into the GatherMerge) is the same
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 7/8/19 11:56 AM, Peter Eisentraut wrote:
> On 2019-07-08 17:47, Stephen Frost wrote:
>> Of course, we can discuss if what websites do with over-the-wire
>> encryption is sensible to compare to what we want to do in PG for
>> data-at-rest, but then we shouldn't be talking about what websites do,
>> it'd make more sense to look at other data-at-rest encryption systems
>> and consider what they're doing.
>
> So, how do encrypted file systems do it? Are there any encrypted file
> systems in general use that allow encrypting only some files or
> encrypting different parts of the file system with different keys, or
> any of those other granular approaches being discussed?

Well it is fairly common, for good reason IMHO, to encrypt some mount points and not others on a system. In my mind, and in practice to a large extent, a postgres tablespace == a unique mount point.

There is a description here: https://wiki.archlinux.org/index.php/Disk_encryption

A pertinent quote:

  After it has been derived, the master key is securely stored in memory
  (e.g. in a kernel keyring), for as long as the encrypted block device or
  folder is mounted. It is usually not used for de/encrypting the disk data
  directly, though. For example, in the case of stacked filesystem
  encryption, each file can be automatically assigned its own encryption
  key. Whenever the file is to be read/modified, this file key first needs
  to be decrypted using the main key, before it can itself be used to
  de/encrypt the file contents:

  [The wiki's diagram: the master key decrypts the encrypted file key stored
  on disk; that file key then de/encrypts the encrypted file contents into
  readable file contents.]

  In a similar manner, a separate key (e.g. one per folder) may be used for
  the encryption of file names in the case of stacked filesystem encryption.

Joe

-- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
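The same two-tier arrangement described above (a master key that only wraps per-object data keys, which in turn encrypt the payload) can be sketched with pgcrypto just to make the data flow concrete. This is purely illustrative and assumes the pgcrypto extension; it is not how a TDE implementation would actually store or manage keys:

-- illustrative only: a random data key is wrapped by a master passphrase,
-- and the payload is encrypted with the data key, never with the master key
CREATE EXTENSION IF NOT EXISTS pgcrypto;

WITH k AS (
    SELECT encode(gen_random_bytes(32), 'hex') AS data_key
)
SELECT pgp_sym_encrypt(data_key, 'master-passphrase')  AS wrapped_data_key,
       pgp_sym_encrypt('payload to protect', data_key) AS encrypted_payload
FROM k;

-- rotating the master key then means re-wrapping only wrapped_data_key;
-- the bulk data encrypted with the data key does not have to be rewritten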
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 11:47:33AM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (br...@momjian.us) wrote: > > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > > When people are asking for multiple keys (not just for key rotation), > > > > they are asking to have multiple keys that can be supplied by users only > > > > when they need to access the data. Yes, the keys are always in the > > > > datbase, but the feature request is that they are only unlocked when the > > > > user needs to access the data. Obviously, that will not work for > > > > autovacuum when the encryption is at the block level. > > > > > > > If the key is always unlocked, there is questionable security value of > > > > having multiple keys, beyond key rotation. > > > > > > That is not true. Having multiple keys also allows you to reduce the > > > amount of data encrypted with a single key, which is desirable because: > > > > > > 1. It makes cryptanalysis more difficult > > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > > > What systems use multiple keys like that? I know of no website that > > does that. Your arguments seem hypothetical. What is your goal here? > > Not sure what the reference to 'website' is here, but one doesn't get > certificates for TLS/SSL usage that aren't time-bounded, and when it > comes to the actual on-the-wire encryption that's used, that's a > symmetric key that's generated on-the-fly for every connection. > > Wouldn't the fact that they generate a different key for every > connection be a pretty clear indication that it's a good idea to use > multiple keys and not use the same key over and over..? > > Of course, we can discuss if what websites do with over-the-wire > encryption is sensible to compare to what we want to do in PG for > data-at-rest, but then we shouldn't be talking about what websites do, > it'd make more sense to look at other data-at-rest encryption systems > and consider what they're doing. (I talked to Joe on chat for clarity.) In modern TLS, the certificate is used only for authentication, and Diffie–Hellman is used for key exchange: https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange So, the question is whether you can pass so much data in TLS that using the same key for the entire session is a security issue. TLS originally had key renegotiation, but that was removed in TLS 1.3: https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ To mitigate these types of attacks, TLS 1.3 disallows renegotiation. Of course, a database is going to process even more data so if the amount of data encrypted is a problem, we might have a problem too in using a single key. This is not related to whether we use one key for the entire cluster or multiple keys per tablespace --- the problem is the same. I guess we could create 1024 keys and use the bottom bits of the block number to decide what key to use. However, that still only pushes the goalposts farther. Anyway, I will to research the reasonable data size that can be secured with a single key via AES. I will look at how PGP encrypts large files too. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Switching PL/Python to Python 3 by default in PostgreSQL 12
Peter Eisentraut writes: > I'm not trying to dismiss the importance of managing the Python > transition. But this issue has been known for many years, and the > current setup is more or less in line with the wider world. For > example, the Debian release that came out over the weekend still ships > with /usr/bin/python being Python 2. So it is neither timely nor urgent > to try to make some significant change about this in PostgreSQL 12 right > now. I would welcome patches for this for PostgreSQL 13. I don't think it's been mentioned in this thread yet, but we *did* recently install a configure-time preference for python3 over python2: Author: Peter Eisentraut Branch: master Release: REL_12_BR [7291733ac] 2019-01-13 10:23:48 +0100 Branch: REL_11_STABLE Release: REL_11_2 [3d498c65a] 2019-01-13 10:24:21 +0100 Branch: REL_10_STABLE Release: REL_10_7 [cd1873160] 2019-01-13 10:25:23 +0100 configure: Update python search order Some systems don't ship with "python" by default anymore, only "python3" or "python2" or some combination, so include those in the configure search. Discussion: https://www.postgresql.org/message-id/flat/1457.1543184081%40sss.pgh.pa.us#c9cc1199338fd6a257589c6dcea6cf8d configure's search order is now $PYTHON, python, python3, python2. I think it will be a very long time, if ever, before there would be a reason to consider changing that. Both of the first two options represent following a clear user preference. So the only thing that's really at stake is when/whether we can make "plpythonu" a synonym for "plpython3u" rather than "plpython2u". As I said already, I think that's got to be a long way off, since the whole problem here is that python3 isn't a drop-in replacement for python2. We're much more likely to break existing functions than do anything useful by forcibly switching the synonym. But I could support having a way for individual installations to change what the synonym means locally. Perhaps we could think about how to do that in conjunction with the project of getting rid of pg_pltemplate that's been kicked around before [1][2][3]. regards, tom lane [1] https://www.postgresql.org/message-id/flat/763f2fe4-743f-d530-8831-20811edd3d6a%402ndquadrant.com [2] https://www.postgresql.org/message-id/flat/7495.1524861244%40sss.pgh.pa.us [3] https://www.postgresql.org/message-id/flat/5351890.TdMePpdHBD%40nb.usersys.redhat.com
Re: errbacktrace
On 2019-Jul-08, Dmitry Dolgov wrote:

> > On Sat, Jun 29, 2019 at 7:41 AM Jaime Casanova wrote:
> >
> > This is certainly a very useful thing. Sadly, it doesn't seem to compile when
> > trying to use libunwind.
>
> Yeah, the same for me. To make it work I've restricted libunwind to local
> unwinding only:
>
> #ifdef USE_LIBUNWIND
> #define UNW_LOCAL_ONLY
> #include <libunwind.h>
> #endif

Ah, yes. unwind's manpage says:

  Normally, libunwind supports both local and remote unwinding (the latter
  will be explained in the next section). However, if you tell libunwind
  that your program only needs local unwinding, then a special
  implementation can be selected which may run much faster than the generic
  implementation which supports both kinds of unwinding. To select this
  optimized version, simply define the macro UNW_LOCAL_ONLY before including
  the headerfile <libunwind.h>.

so I agree with unconditionally defining that symbol.

Nitpicking dept: I think in these tests:

+ if (!edata->backtrace &&
+     edata->funcname &&
+     backtrace_function[0] &&
+     strcmp(backtrace_function, edata->funcname) == 0)
+     set_backtrace(edata, 2);

we should test for backtrace_function[0] before edata->funcname, since it seems more likely to be unset.

-- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Switching PL/Python to Python 3 by default in PostgreSQL 12
I wrote: > But I could support having a way for individual installations to change > what the synonym means locally. Perhaps we could think about how to do > that in conjunction with the project of getting rid of pg_pltemplate > that's been kicked around before [1][2][3]. ... actually, if we had that (i.e., languages fully defined by extensions with no help from pg_pltemplate), wouldn't this be nearly trivial? I'm imagining two extensions, one that defines plpythonu to call the python2 code and one that defines it to call the python3 code, and you install whichever you want. They're separate from the extensions that define plpython2u and plpython3u, so mix and match as you wish. regards, tom lane
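In the meantime, one way for individual installations to sidestep the ambiguity in new code is simply to name the language version explicitly rather than relying on what plpythonu resolves to, e.g.:

-- explicit choice of the Python 3 variant; assumes plpython3u is installed
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE FUNCTION py_version() RETURNS text AS $$
import sys
return sys.version
$$ LANGUAGE plpython3u;

SELECT py_version();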
Re: range_agg
On Sat, Jul 6, 2019 at 12:13 PM Jeff Davis wrote:
>
> On Fri, 2019-07-05 at 09:58 -0700, Paul A Jungwirth wrote:
> > user-defined range types. So how about I start on it and see how it
> > goes? I expect I can follow the existing code for range types pretty
> > closely, so maybe it won't be too hard.
>
> That would be great to at least take a look. If it starts to look like
> a bad idea, then we can re-evaluate and see if it's better to just
> return arrays.

I made some progress over the weekend. I don't have a patch yet but I thought I'd ask for opinions on the approach I'm taking:

- A multirange type is an extra thing you get when you define a range (just like how you get a tstzrange[]). Therefore:
- I don't need separate commands to add/drop multirange types. You get one when you define a range type, and if you drop a range type it gets dropped automatically.
- I'm adding a new typtype for multiranges ('m' in pg_type).
- I'm just adding a mltrngtypeid column to pg_range. I don't think I need a new pg_multirange table.
- You can have a multirange[].
- Multirange in/out work just like arrays, e.g. '{"[1,3)", "[5,6)"}'
- I'll add an anymultirange pseudotype. When it's the return type of a function that has an "anyrange" parameter, it will use the same base element type. (So basically anymultirange : anyrange :: anyarray :: anyelement.)
- You can cast from a multirange to an array. The individual ranges are always sorted in the result array.
- You can cast from an array to a multirange but it will error if there are overlaps (or not?). The array's ranges don't have to be sorted but they will be after a "round trip".
- Interesting functions:
  - multirange_length
  - range_agg (range_union_agg if you like)
  - range_intersection_agg
- You can subscript a multirange like you do an array (? This could be a function instead.)
- operators:
  - union (++) and intersection (*):
    - We already have + for range union but it raises an error if there is a gap, so ++ is the same but with no errors.
    - r ++ r = mr (commutative, associative)
    - mr ++ r = mr
    - r ++ mr = mr
    - r * r = r (unchanged)
    - mr * r = r
    - r * mr = r
    - mr - r = mr
    - r - mr = mr
    - mr - mr = mr
  - comparison:
    - mr = mr
    - mr @> x
    - mr @> r
    - mr @> mr
    - x <@ mr
    - r <@ mr
    - mr <@ mr
    - mr << mr (strictly left of)
    - mr >> mr (strictly right of)
    - mr &< mr (does not extend to the right of)
    - mr &> mr (does not extend to the left of)
  - inverse operator?:
    - the inverse of {"[1,2)"} would be {"[null, 1)", "[2, null)"}.
    - not sure we want this or what the symbol should be. I don't like -mr as an inverse because then mr - mr != mr ++ -mr.

Anything in there you think should be different? Thanks, Paul
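To make the proposed pieces concrete, usage might look something like the following. This is hypothetical, since neither range_agg nor the multirange type exists yet; the expected output simply follows the '{"[1,3)", "[5,6)"}' in/out convention sketched above:

-- hypothetical usage under the proposal (not valid on any released version)
SELECT range_agg(r)
FROM (VALUES (int4range(1, 3)),
             (int4range(2, 5)),
             (int4range(8, 9))) AS t(r);
-- expected result under the proposal: {"[1,5)", "[8,9)"}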
Re: Comment typo in tableam.h
On Sat, Jul 6, 2019 at 12:05 AM Amit Kapila wrote: > On Tue, Jul 2, 2019 at 1:00 AM Ashwin Agrawal wrote: > > Please find attached v2 of patch 1 without objectionable comment change. > v1 of patch 2 attaching here just for convenience, no modifications made to > it. > > > > 0001* > * See table_index_fetch_tuple's comment about what the difference between > - * these functions is. This function is the correct to use outside of > - * index entry->table tuple lookups. > + * these functions is. This function is correct to use outside of index > + * entry->table tuple lookups. > > How about if we write the last line of comment as "It is correct to > use this function outside of index entry->table tuple lookups."? I am > not an expert on this matter, but I find the way I am suggesting > easier to read. > I am fine with the way you have suggested.
Re: Broken defenses against dropping a partitioning column
On Mon, Jul 8, 2019 at 11:08 AM Tom Lane wrote: > FWIW, I was imagining the action as being (1) detach all the child > partitions, (2) make parent into a non-partitioned table, (3) > drop the target column in each of these now-independent tables. > No data movement. Other than the need to acquire locks on all > the tables, it shouldn't be particularly slow. I see. I think that would be reasonable, but like you say, it's not clear that it's really what users would prefer. You can think of a partitioned table as a first-class object and the partitions as subordinate implementation details; or you can think of the partitions as the first-class objects and the partitioned table as the second-rate glue that holds them together. It seems like users prefer the former view. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
Michael Paquier writes: > I have begun playing with regressplans.sh which enforces various > combinations of "-f s|i|n|m|h" when running the regression tests, and > I have noticed that -fh can cause the server to become stuck in the > test join_hash.sql with this query (not sure which portion of the SET > LOCAL parameters are involved) : > select count(*) from simple r join extremely_skewed s using (id); > This does not happen with REL_10_STABLE where the test executes > immediately, so we has visibly an issue caused by v11 here. Yeah, these test cases were added by fa330f9ad in v11. What it looks like to me is that some of these test cases force "enable_mergejoin = off", so if you also have enable_hashjoin off then you are going to get a nestloop plan, and it's hardly surprising that that takes an unreasonable amount of time on the rather large test tables used in these tests. Given the purposes of this test, I think it'd be reasonable to force both enable_hashjoin = on and enable_mergejoin = off at the very top of the join_hash script, or the corresponding place in join.sql in v11. Thomas, was there a specific reason for forcing enable_mergejoin = off for only some of these tests? regards, tom lane
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
On Mon, Jul 08, 2019 at 06:49:44PM +1200, Thomas Munro wrote: > "simple" has 20k rows and "extremely_skewed" has 20k rows but the > planner thinks it only has 2. So this is going to take O(n^2) time and n > is 20k. Not sure what to do about that. Maybe "join_hash" should be > skipped for the -h (= no hash joins please) case? Ah, thanks. Yes that's going to take a while :) Well, another thing I'd like to think about is whether there is any point in keeping regressplans.sh and the relevant options in postgres at this stage. From the top of the file one can read that: # This script runs the Postgres regression tests with all useful combinations # of the backend options that disable various query plan types. If the # results are not all the same, it may indicate a bug in a particular # plan type, or perhaps just a regression test whose results aren't fully # determinate (eg, due to lack of an ORDER BY keyword). However, if you run make check with any of these options, there are failing tests in every run. We can improve the situation for some of them by adding ORDER BY clauses after looking at the query outputs, but some EXPLAIN queries are sensitive to that, and the history of regressplans.sh does not play in its favor (does anyone really use it?). If you look at the latest commits for it, it has not really been touched in 19 years. So I would rather be in favor of nuking it. -- Michael signature.asc Description: PGP signature
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
On Mon, Jul 8, 2019 at 5:53 PM Michael Paquier wrote: > I have begun playing with regressplans.sh which enforces various > combinations of "-f s|i|n|m|h" when running the regression tests, and > I have noticed that -fh can cause the server to become stuck in the > test join_hash.sql with this query (not sure which portion of the SET > LOCAL parameters are involved) : > select count(*) from simple r join extremely_skewed s using (id); > > This does not happen with REL_10_STABLE where the test executes > immediately, so we has visibly an issue caused by v11 here. If you don't allow hash joins it makes this plan: Aggregate -> Nested Loop Join Filter: (r.id = s.id) -> Seq Scan on simple r -> Materialize -> Seq Scan on extremely_skewed s "simple" has 20k rows and "extremely_skewed" has 20k rows but the planner thinks it only has 2. So this is going to take O(n^2) time and n is 20k. Not sure what to do about that. Maybe "join_hash" should be skipped for the -h (= no hash joins please) case? -- Thomas Munro https://enterprisedb.com
Re: [PATCH] kNN for btree
On Mon, Jul 1, 2019 at 8:47 PM Nikita Glukhov wrote: > On 01.07.2019 13:41, Thomas Munro wrote: > > On Tue, Mar 26, 2019 at 4:30 AM Nikita Glukhov > > wrote: > Fixed two bugs in patches 3 and 10 (see below). > Patch 3 was extracted from the main patch 5 (patch 4 in v9). > >>> This patch no longer applies so marking Waiting on Author. > >> Attached 11th version of the patches rebased onto current master. > > Hi Nikita, > > > > Since a new Commitfest is here and this doesn't apply, could we please > > have a fresh rebase? > > Attached 12th version of the patches rebased onto the current master. I have more thoughts about the planner/executor infrastructure. It appears that supporting "ORDER BY col1, col2 <-> val" is too complex for the initial version of the patch: handling of the "ORDER BY col" and "ORDER BY col <-> val" cases uses different code paths in the optimizer. So, I'd like to limit the first version of this patch to supporting just the simplest "ORDER BY col <-> val" case. We would probably be able to enhance it even in the v13 release cycle, but let's start with the simple case. I also think we can avoid replacing the amcanorderbyop flag with a method, and instead introduce just a new boolean flag indicating that kNN supports only the first column. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
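For clarity, here is a sketch of the distinction being drawn (the table, index, and constant are made up, and it assumes a <-> distance operator is available for the indexed type, e.g. as provided by this patch set):

CREATE TABLE points_1d (x float8);
CREATE INDEX points_1d_x_idx ON points_1d USING btree (x);

-- The case the first version of the patch would support: a single
-- "ORDER BY col <-> const" on the leading (and only) key column.
SELECT x FROM points_1d ORDER BY x <-> 42.0 LIMIT 10;

-- The case deferred to a later release: a regular sort column followed
-- by a distance ordering.
-- SELECT * FROM tbl ORDER BY col1, col2 <-> 42.0;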
Re: progress report for ANALYZE
On Mon, Jul 8, 2019 at 5:29 AM Tatsuro Yamada wrote: > 17520|13599|postgres|16387|f|16387|scanning table|4425|4425 > 17520|13599|postgres|16387|f|16387|analyzing sample|0|0 > 17520|13599|postgres|16387|f|16387||0|0 <-- Is it Okay?? Why do we zero out the block numbers when we switch phases? The CREATE INDEX progress reporting patch does that kind of thing too, and it seems like poor behavior to me. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Multivariate MCV list vs. statistics target
On Sun, Jul 07, 2019 at 12:02:38AM +0200, Tomas Vondra wrote: Hi, apparently v1 of the ALTER STATISTICS patch was a bit confused about length of the VacAttrStats array passed to statext_compute_stattarget, causing segfaults. Attached v2 patch fixes that, and it also makes sure we print warnings about ignored statistics only once. v3 of the patch, adding pg_dump support - it works just like when you tweak statistics target for a column, for example. When the value stored in the catalog is not -1, pg_dump emits a separate ALTER STATISTICS command setting it (for the already created statistics object). I've considered making it part of CREATE STATISTICS itself, but it seems a bit cumbersome and we don't do it for columns either. I'm not against adding it in the future, but at this point I don't see a need. At this point I'm not aware of any missing or broken pieces, so I'd welcome feedback. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/ref/alter_statistics.sgml b/doc/src/sgml/ref/alter_statistics.sgml index 58c7ed020d..987f9dbc6f 100644 --- a/doc/src/sgml/ref/alter_statistics.sgml +++ b/doc/src/sgml/ref/alter_statistics.sgml @@ -26,6 +26,7 @@ PostgreSQL documentation ALTER STATISTICS name OWNER TO { new_owner | CURRENT_USER | SESSION_USER } ALTER STATISTICS name RENAME TO new_name ALTER STATISTICS name SET SCHEMA new_schema +ALTER STATISTICS name SET STATISTICS new_target @@ -93,6 +94,22 @@ ALTER STATISTICS name SET SCHEMA + + new_target + + +The statistic-gathering target for this statistic object for subsequent + operations. +The target can be set in the range 0 to 1; alternatively, set it +to -1 to revert to using the system default statistics +target (). +For more information on the use of statistics by the +PostgreSQL query planner, refer to +. + + + + diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index d7004e5313..caa4fca99d 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -490,6 +490,19 @@ do_analyze_rel(Relation onerel, VacuumParams *params, } } + /* +* Look at extended statistic objects too, because those may define their +* own statistic target. So we need to sample enough rows and then build +* the statistics with enough detail. 
+*/ + { + int nrows = ComputeExtStatisticsTarget(onerel, + attr_cnt, vacattrstats); + + if (targrows < nrows) + targrows = nrows; + } + /* * Acquire the sample rows */ diff --git a/src/backend/commands/statscmds.c b/src/backend/commands/statscmds.c index cf406f6f96..356b6096ac 100644 --- a/src/backend/commands/statscmds.c +++ b/src/backend/commands/statscmds.c @@ -14,6 +14,7 @@ */ #include "postgres.h" +#include "access/heapam.h" #include "access/relation.h" #include "access/relscan.h" #include "access/table.h" @@ -21,6 +22,7 @@ #include "catalog/dependency.h" #include "catalog/indexing.h" #include "catalog/namespace.h" +#include "catalog/objectaccess.h" #include "catalog/pg_namespace.h" #include "catalog/pg_statistic_ext.h" #include "catalog/pg_statistic_ext_data.h" @@ -29,6 +31,7 @@ #include "miscadmin.h" #include "statistics/statistics.h" #include "utils/builtins.h" +#include "utils/fmgroids.h" #include "utils/inval.h" #include "utils/memutils.h" #include "utils/rel.h" @@ -336,6 +339,7 @@ CreateStatistics(CreateStatsStmt *stmt) values[Anum_pg_statistic_ext_stxrelid - 1] = ObjectIdGetDatum(relid); values[Anum_pg_statistic_ext_stxname - 1] = NameGetDatum(&stxname); values[Anum_pg_statistic_ext_stxnamespace - 1] = ObjectIdGetDatum(namespaceId); + values[Anum_pg_statistic_ext_stxstattarget - 1] = Int32GetDatum(-1); values[Anum_pg_statistic_ext_stxowner - 1] = ObjectIdGetDatum(stxowner); values[Anum_pg_statistic_ext_stxkeys - 1] = PointerGetDatum(stxkeys); values[Anum_pg_statistic_ext_stxkind - 1] = PointerGetDatum(stxkind); @@ -414,6 +418,85 @@ CreateStatistics(CreateStatsStmt *stmt) return myself; } + +/* + * ALTER STATISTICS + */ +ObjectAddress +AlterStatistics(AlterStatsStmt *stmt) +{ + Relationrel; + Oid stxoid; + HeapTuple oldtup; + HeapTuple newtup; + Datum repl_val[Natts_pg_statistic_ext]; + boolrepl_null[Natts_pg_statistic_ext]; + boolrepl_repl[Natts_pg_statistic_ext]; + ObjectAddress address; + int newtarget = stmt->stxstattarget; + + /* Lim
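For reference, a minimal usage sketch of the syntax the patch adds (the table, columns, and statistics object are invented for illustration; only the ALTER STATISTICS ... SET STATISTICS form comes from the patch):

CREATE TABLE orders (customer_id int, region_id int);
CREATE STATISTICS orders_cust_region (mcv) ON customer_id, region_id FROM orders;

-- Raise the per-object statistics target for subsequent ANALYZE runs;
-- -1 reverts to the system default (default_statistics_target).
ALTER STATISTICS orders_cust_region SET STATISTICS 1000;
ANALYZE orders;
ALTER STATISTICS orders_cust_region SET STATISTICS -1;

A non-default value stored in the catalog would then also be dumped by pg_dump as a separate ALTER STATISTICS command, as described above.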
Re: progress report for ANALYZE
On 2019-Jul-08, Robert Haas wrote: > On Mon, Jul 8, 2019 at 5:29 AM Tatsuro Yamada > wrote: > > 17520|13599|postgres|16387|f|16387|scanning table|4425|4425 > > 17520|13599|postgres|16387|f|16387|analyzing sample|0|0 > > 17520|13599|postgres|16387|f|16387||0|0 <-- Is it Okay?? > > Why do we zero out the block numbers when we switch phases? The > CREATE INDEX progress reporting patch does that kind of thing too, and > it seems like poor behavior to me. Yeah, I got the impression that that was determined to be the desirable behavior, so I made it do that, but I'm not really happy about it either. We're not too late to change the CREATE INDEX behavior, but let's discuss what is it that we want. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 11:25:10AM -0400, Bruce Momjian wrote: On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: On 7/8/19 10:19 AM, Bruce Momjian wrote: > When people are asking for multiple keys (not just for key rotation), > they are asking to have multiple keys that can be supplied by users only > when they need to access the data. Yes, the keys are always in the > datbase, but the feature request is that they are only unlocked when the > user needs to access the data. Obviously, that will not work for > autovacuum when the encryption is at the block level. > If the key is always unlocked, there is questionable security value of > having multiple keys, beyond key rotation. That is not true. Having multiple keys also allows you to reduce the amount of data encrypted with a single key, which is desirable because: 1. It makes cryptanalysis more difficult 2. Puts less data at risk if someone gets "lucky" in doing brute force What systems use multiple keys like that? I know of no website that does that. Your arguments seem hypothetical. What is your goal here? I might ask the same question about databases - which databases use an encryption scheme where the database does not have access to the keys? Actually, I've already asked this question before ... The databases I'm familiar with do store all the keys in a vault that's unlocked during startup, and then users may get keys from it (including maintenance processes). We could still control access to those keys in various ways (ACL or whatever), of course. BTW how do you know this is what users want? Maybe they do, but then again - maybe they just see it as magic and don't realize the extra complexity (not just at the database level). In my experience users generally want more abstract things, like "Ensure data privacy in case media theft," or "protection against evil DBA". regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 11:47:33AM -0400, Stephen Frost wrote: > > * Bruce Momjian (br...@momjian.us) wrote: > > > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > > > When people are asking for multiple keys (not just for key rotation), > > > > > they are asking to have multiple keys that can be supplied by users > > > > > only > > > > > when they need to access the data. Yes, the keys are always in the > > > > > datbase, but the feature request is that they are only unlocked when > > > > > the > > > > > user needs to access the data. Obviously, that will not work for > > > > > autovacuum when the encryption is at the block level. > > > > > > > > > If the key is always unlocked, there is questionable security value of > > > > > having multiple keys, beyond key rotation. > > > > > > > > That is not true. Having multiple keys also allows you to reduce the > > > > amount of data encrypted with a single key, which is desirable because: > > > > > > > > 1. It makes cryptanalysis more difficult > > > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > > > > > What systems use multiple keys like that? I know of no website that > > > does that. Your arguments seem hypothetical. What is your goal here? > > > > Not sure what the reference to 'website' is here, but one doesn't get > > certificates for TLS/SSL usage that aren't time-bounded, and when it > > comes to the actual on-the-wire encryption that's used, that's a > > symmetric key that's generated on-the-fly for every connection. > > > > Wouldn't the fact that they generate a different key for every > > connection be a pretty clear indication that it's a good idea to use > > multiple keys and not use the same key over and over..? > > > > Of course, we can discuss if what websites do with over-the-wire > > encryption is sensible to compare to what we want to do in PG for > > data-at-rest, but then we shouldn't be talking about what websites do, > > it'd make more sense to look at other data-at-rest encryption systems > > and consider what they're doing. > > (I talked to Joe on chat for clarity.) In modern TLS, the certificate is > used only for authentication, and Diffie–Hellman is used for key > exchange: > > https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange Right, and the key that's figured out for each connection is at least specific to the server AND client keys/certificates, thus meaning that they're changed at least as frequently as those change (and clients end up creating ones on the fly randomly if they don't have one, iirc). > So, the question is whether you can pass so much data in TLS that using > the same key for the entire session is a security issue. TLS originally > had key renegotiation, but that was removed in TLS 1.3: > > > https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ > To mitigate these types of attacks, TLS 1.3 disallows renegotiation. It was removed due to attacks targeting the renegotiation, not because doing re-keying by itself was a bad idea, or because using multiple keys was seen as a bad idea. > Of course, a database is going to process even more data so if the > amount of data encrypted is a problem, we might have a problem too in > using a single key. This is not related to whether we use one key for > the entire cluster or multiple keys per tablespace --- the problem is > the same. 
I guess we could create 1024 keys and use the bottom bits of > the block number to decide what key to use. However, that still only > pushes the goalposts farther. All of this is about pushing the goalposts farther away, as I see it. There's going to be trade-offs here and there isn't going to be any "one right answer" when it comes to this space. That's why I'm inclined to argue that we should try to come up with a relatively *good* solution that doesn't create a huge amount of work for us, and then build on that. To that end, leveraging metadata that we already have outside of the catalogs (databases, tablespaces, potentially other information that we store, essentially, in the filesystem metadata already) to decide on what key to use, and how many we can support, strikes me as a good initial target. > Anyway, I will to research the reasonable data size that can be secured > with a single key via AES. I will look at how PGP encrypts large files > too. This seems unlikely to lead to a definitive result, but it would be interesting to hear if there have been studies around that and what their conclusions were. When it comes to concerns about autovacuum or other system processes, those don't have any direct user connections or interactions, so having them be more privileged and having access to more is reasonable. Ideally, all of this would leverage a vaulting system or other mechanism w
Re: progress report for ANALYZE
On Mon, Jul 8, 2019 at 2:18 PM Alvaro Herrera wrote: > Yeah, I got the impression that that was determined to be the desirable > behavior, so I made it do that, but I'm not really happy about it > either. We're not too late to change the CREATE INDEX behavior, but > let's discuss what is it that we want. I don't think I intended to make any such determination -- which commit do you think established this as the canonical behavior? I propose that once a field is set, we should leave it set until the end. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: progress report for ANALYZE
On Mon, Jul 8, 2019 at 8:44 PM Robert Haas wrote: > > On Mon, Jul 8, 2019 at 2:18 PM Alvaro Herrera > wrote: > > Yeah, I got the impression that that was determined to be the desirable > > behavior, so I made it do that, but I'm not really happy about it > > either. We're not too late to change the CREATE INDEX behavior, but > > let's discuss what is it that we want. > > I don't think I intended to make any such determination -- which > commit do you think established this as the canonical behavior? > > I propose that once a field is set, we should leave it set until the end. +1 Note that this patch is already behaving like that if the table only contains dead rows.
Re: Declared but no defined functions
On Sat, Jul 6, 2019 at 4:32 PM Masahiko Sawada wrote: > Indeed. I've tried to search again with the following script and got > more such functions. > > for func in `git ls-files | egrep "\w+\.h$" | xargs cat | egrep -v > "(^typedef)|(DECLARE)|(BKI)" | egrep "^(extern )*[\_0-9A-Za-z]+ > [\_\*0-9a-zA-Z]+ ?\(.+\);$" | sed -e "s/\(^extern \)*[\_0-9A-Za-z]\+ > \([\_0-9A-Za-z\*]\+\) \{0,1\}(.*);$/\2(/g" | sed -e "s/\*//g"` > do > if [ "`git grep "$func" -- "*.c" | wc -l`" -lt 1 ];then > echo $func > fi > done > > Do we wish to make this a tool and have it in src/tools, either as part of the find_static tool (after renaming it to something more generic) or as an independent script?
Re: Add parallelism and glibc dependent only options to reindexdb
On Mon, Jul 8, 2019 at 9:57 AM Michael Paquier wrote: > > On Fri, Jul 05, 2019 at 07:25:41PM +0200, Julien Rouhaud wrote: > > On Fri, Jul 5, 2019 at 6:16 PM Peter Eisentraut > > wrote: > >> Isn't that also the case for your proposal? We are not going to release > >> a new reindexdb before a new REINDEX. > > > > Sure, but my point was that once the new reindexdb is released (or if > > you're so desperate, using a nightly build or compiling your own), it > > can be used against any previous major version. There is probably a > > large fraction of users who don't perform a postgres upgrade when they > > upgrade their OS, so that's IMHO also something to consider. > > I think that we need to think long-term here and be confident in the > fact we will still see breakages with collations and glibc, using a > solution that we think is the right API. Peter's idea to make the > backend-aware command of the filtering is cool. On top of that, there > is no need to add any conflict logic in reindexdb and we can live with > restricting --jobs support for non-index objects. Don't get me wrong, I do agree that implementing filtering in the backend is a better design. What's bothering me is that I also agree that there will be more glibc breakage, and if that happens within a few years, a lot of people will still be using a pg12- version, and they still won't have an efficient way to rebuild their indexes. Now, it'd be easy to publish an external tool that does a simple parallel-and-glibc-filtering reindex and will serve that purpose for the few years it'll be needed, so everyone can be happy. For now, I'll resubmit the parallel patch using a per-table-only approach, and will submit the filtering in the backend using a new REINDEX option in a different thread.
Re: Ltree syntax improvement
Dear Thomas, Thank you for your proofreading! Please find the updated patch attached. It also contains the missing escaping. On Mon, Jul 8, 2019 at 10:39 AM Thomas Munro wrote: > On Wed, Apr 17, 2019 at 5:29 AM Dmitry Belyavsky > wrote: > > I've applied your patch. > > From my point of view, there is no major difference between case and > chain if here. > > Neither case nor ifs allow extracting the common code to separate > function - just because there seem to be no identical pieces of code. > > Hi Dmitry, > > The documentation doesn't build[1], due to invalid XML. Since I'm > here, here is some proof-reading of the English in the documentation: > > > - A label is a sequence of alphanumeric characters > - and underscores (for example, in C locale the characters > - A-Za-z0-9_ are allowed). Labels must be less > than 256 bytes > - long. > + A label is a sequence of characters. Labels > must be > + less than 256 symbols long. Label may contain any character > supported by Postgres > > "fewer than 256 characters in length", and > "PostgreSQL" > > + except \0. If label contains spaces, dots, > lquery modifiers, > > "spaces, dots or lquery modifiers," > > + they may be escaped. Escaping can be done > either by preceeding > + backslash (\\) symbol, or by wrapping the label > in whole in double > + quotes ("). Initial and final unescaped > whitespace is stripped. > > "Escaping can be done with either a preceding backslash [...] or by > wrapping the whole label in double quotes [...]." > > > > +During converting text into internal representations, wrapping > double quotes > > "During conversion to internal representation, " > > +and escaping backslashes are removed. During converting internal > +representations into text, if the label does not contain any special > > "During conversion from internal representation to text, " > > +symbols, it is printed as is. Otherwise, it is wrapped in quotes and, > if > +there are internal quotes, they are escaped with backslash. The > list of special > > "escaped with backslashes." > > + > +Examples: 42, "\\42", > +\\4\\2, 42 and "42" > + will have the similar internal representation and, being > > "will all have the same internal representation and," > > +converted from internal representation, will become > 42. > +Literal abc def will turn into "abc > +def". > > > [1] https://travis-ci.org/postgresql-cfbot/postgresql/builds/71856 > > -- > Thomas Munro > https://enterprisedb.com > -- SY, Dmitry Belyavsky diff --git a/contrib/ltree/expected/ltree.out b/contrib/ltree/expected/ltree.out index 8226930905..5f45726229 100644 --- a/contrib/ltree/expected/ltree.out +++ b/contrib/ltree/expected/ltree.out @@ -1,4 +1,5 @@ CREATE EXTENSION ltree; +SET standard_conforming_strings=on; -- Check whether any of our opclasses fail amvalidate SELECT amname, opcname FROM pg_opclass opc LEFT JOIN pg_am am ON am.oid = opcmethod @@ -7679,3 +7680,1587 @@ SELECT count(*) FROM _ltreetest WHERE t ? '{23.*.1,23.*.2}' ; 15 (1 row) +-- Extended syntax, escaping, quoting etc +-- success +SELECT E'\\.'::ltree; + ltree +--- + "." +(1 row) + +SELECT E'\\ '::ltree; + ltree +--- + " " +(1 row) + +SELECT E''::ltree; + ltree +--- + "\" +(1 row) + +SELECT E'\\a'::ltree; + ltree +--- + a +(1 row) + +SELECT E'\\n'::ltree; + ltree +--- + n +(1 row) + +SELECT E'x'::ltree; + ltree +--- + "x\" +(1 row) + +SELECT E'x\\ '::ltree; + ltree +--- + "x " +(1 row) + +SELECT E'x\\.'::ltree; + ltree +--- + "x." 
+(1 row) + +SELECT E'x\\a'::ltree; + ltree +--- + xa +(1 row) + +SELECT E'x\\n'::ltree; + ltree +--- + xn +(1 row) + +SELECT 'a b.с d'::ltree; +ltree +- + "a b"."с d" +(1 row) + +SELECT ' e . f '::ltree; + ltree +--- + e.f +(1 row) + +SELECT ' '::ltree; + ltree +--- + +(1 row) + +SELECT E'\\ g . h\\ '::ltree; + ltree +--- + " g"."h " +(1 row) + +SELECT E'\\ g'::ltree; + ltree +--- + " g" +(1 row) + +SELECT E' h\\ '::ltree; + ltree +--- + "h " +(1 row) + +SELECT '" g "." h "'::ltree; +ltree +-- + " g "." h " +(1 row) + +SELECT '" g " '::ltree; + ltree + + " g " +(1 row) + +SELECT '" g " ." h " '::ltree; +ltree +-- + " g "." h " +(1 row) + +SELECT nlevel(E'Bottom\\.Test'::ltree); + nlevel + + 1 +(1 row) + +SELECT subpath(E'Bottom\\.'::ltree, 0, 1); + subpath +--- + "Bottom." +(1 row) + +SELECT subpath(E'a\\.b', 0, 1); + subpath +- + "a.b" +(1 row) + +SELECT subpath(E'a\\..b', 1, 1); + subpath +- + b +(1 row) + +SELECT subpath(E'a\\..\\b', 1, 1); + subpath +- + b +(1 row) + +SELECT subpath(E'a b.с d'::ltree, 1, 1); + subpath +- + "с d" +(1 row) + +SELECT( +'01234567890123456789012345678901234567890123456789' || +'01234567890123456789012345678901234567890123456789' || +'01234567890123456789
Re: tableam: abstracting relation sizing code
On Wed, Jun 12, 2019 at 9:14 AM Robert Haas wrote: > On Tue, Jun 11, 2019 at 7:17 PM Andres Freund wrote: > > Just to understand: What version are you targeting? It seems pretty > > clearly v13 material to me? > > My current plan is to commit this to v13 as soon as the tree opens. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: PGOPTIONS="-fh" make check gets stuck since Postgres 11
Michael Paquier writes: > Well, another thing I'd like to think about is if there is any point > to keep regressplans.sh and the relevant options in postgres at this > stage. From the top of the file one can read that: The point of regressplans.sh is to see if anything goes seriously wrong when forcing non-default plan choices --- seriously wrong being defined as crashes or semantically wrong answers. It's not expected that the regression tests will automatically pass when you do that, because of their dependencies on output row ordering, not to mention all the EXPLAINs. I'm not for removing it --- the fact that its results require manual evaluation doesn't make it useless. Having said that, join_hash.sql in particular seems to have zero value if it's not testing hash joins, so I think it'd be reasonable for it to override a global enable_hashjoin = off setting. None of the other regression test scripts seem to take nearly as much of a performance hit from globally forcing poor plans. regards, tom lane
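Concretely, the override could be as simple as a couple of lines at the very top of join_hash.sql (or the corresponding place in join.sql on v11); this is only a sketch of the suggestion above, not a committed change:

-- Ensure this script still exercises hash joins even when
-- regressplans.sh globally disables them.
set enable_hashjoin = on;
set enable_mergejoin = off;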
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 02:39:44PM -0400, Stephen Frost wrote: Greetings, * Bruce Momjian (br...@momjian.us) wrote: On Mon, Jul 8, 2019 at 11:47:33AM -0400, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: > > On Mon, Jul 8, 2019 at 11:18:01AM -0400, Joe Conway wrote: > > > On 7/8/19 10:19 AM, Bruce Momjian wrote: > > > > When people are asking for multiple keys (not just for key rotation), > > > > they are asking to have multiple keys that can be supplied by users only > > > > when they need to access the data. Yes, the keys are always in the > > > > datbase, but the feature request is that they are only unlocked when the > > > > user needs to access the data. Obviously, that will not work for > > > > autovacuum when the encryption is at the block level. > > > > > > > If the key is always unlocked, there is questionable security value of > > > > having multiple keys, beyond key rotation. > > > > > > That is not true. Having multiple keys also allows you to reduce the > > > amount of data encrypted with a single key, which is desirable because: > > > > > > 1. It makes cryptanalysis more difficult > > > 2. Puts less data at risk if someone gets "lucky" in doing brute force > > > > What systems use multiple keys like that? I know of no website that > > does that. Your arguments seem hypothetical. What is your goal here? > > Not sure what the reference to 'website' is here, but one doesn't get > certificates for TLS/SSL usage that aren't time-bounded, and when it > comes to the actual on-the-wire encryption that's used, that's a > symmetric key that's generated on-the-fly for every connection. > > Wouldn't the fact that they generate a different key for every > connection be a pretty clear indication that it's a good idea to use > multiple keys and not use the same key over and over..? > > Of course, we can discuss if what websites do with over-the-wire > encryption is sensible to compare to what we want to do in PG for > data-at-rest, but then we shouldn't be talking about what websites do, > it'd make more sense to look at other data-at-rest encryption systems > and consider what they're doing. (I talked to Joe on chat for clarity.) In modern TLS, the certificate is used only for authentication, and Diffie–Hellman is used for key exchange: https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange Right, and the key that's figured out for each connection is at least specific to the server AND client keys/certificates, thus meaning that they're changed at least as frequently as those change (and clients end up creating ones on the fly randomly if they don't have one, iirc). So, the question is whether you can pass so much data in TLS that using the same key for the entire session is a security issue. TLS originally had key renegotiation, but that was removed in TLS 1.3: https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ To mitigate these types of attacks, TLS 1.3 disallows renegotiation. It was removed due to attacks targeting the renegotiation, not because doing re-keying by itself was a bad idea, or because using multiple keys was seen as a bad idea. Of course, a database is going to process even more data so if the amount of data encrypted is a problem, we might have a problem too in using a single key. This is not related to whether we use one key for the entire cluster or multiple keys per tablespace --- the problem is the same. 
I guess we could create 1024 keys and use the bottom bits of the block number to decide what key to use. However, that still only pushes the goalposts farther. All of this is about pushing the goalposts farther away, as I see it. There's going to be trade-offs here and there isn't going to be any "one right answer" when it comes to this space. That's why I'm inclined to argue that we should try to come up with a relatively *good* solution that doesn't create a huge amount of work for us, and then build on that. To that end, leveraging metadata that we already have outside of the catalogs (databases, tablespaces, potentially other information that we store, essentially, in the filesystem metadata already) to decide on what key to use, and how many we can support, strikes me as a good initial target. Anyway, I will to research the reasonable data size that can be secured with a single key via AES. I will look at how PGP encrypts large files too. This seems unlikely to lead to a definitive result, but it would be interesting to hear if there have been studies around that and what their conclusions were. When it comes to concerns about autovacuum or other system processes, those don't have any direct user connections or interactions, so having them be more privileged and having access to more is reasonable. I think Bruce's proposal was to minimize the time the key is "unlocked" in memory by only unlocking them when the user connects and supplies s
Re: [HACKERS] Cached plans and statement generalization
On 08.07.2019 2:23, Thomas Munro wrote: On Tue, Jul 2, 2019 at 3:29 AM Konstantin Knizhnik wrote: Attached please find rebased version of the patch. Also this version can be found in autoprepare branch of this repository https://github.com/postgrespro/postgresql.builtin_pool.git on github. Thanks. I haven't looked at the code but this seems like interesting work and I hope it will get some review. I guess this is bound to use a lot of memory. I guess we'd eventually want to figure out how to share the autoprepared plan cache between sessions, which is obviously a whole can of worms. A couple of trivial comments with my CF manager hat on: 1. Can you please fix the documentation? It doesn't build. Obviously reviewing the goals, design and implementation are more important than the documentation at this point, but if that is fixed then the CF bot will be able to run check-world every day and we might learn something about the code. 2. Accidental editor junk included: src/include/catalog/pg_proc.dat.~1~ Sorry, are you tests autoprepare-16.patch I have sent in the last e-mail? I can not reproduce the problem with building documentation: knizhnik@xps:~/postgresql/doc$ make make -C ../src/backend generated-headers make[1]: Entering directory '/home/knizhnik/postgresql/src/backend' make -C catalog distprep generated-header-symlinks make[2]: Entering directory '/home/knizhnik/postgresql/src/backend/catalog' make[2]: Nothing to be done for 'distprep'. make[2]: Nothing to be done for 'generated-header-symlinks'. make[2]: Leaving directory '/home/knizhnik/postgresql/src/backend/catalog' make -C utils distprep generated-header-symlinks make[2]: Entering directory '/home/knizhnik/postgresql/src/backend/utils' make[2]: Nothing to be done for 'distprep'. make[2]: Nothing to be done for 'generated-header-symlinks'. make[2]: Leaving directory '/home/knizhnik/postgresql/src/backend/utils' make[1]: Leaving directory '/home/knizhnik/postgresql/src/backend' make -C src all make[1]: Entering directory '/home/knizhnik/postgresql/doc/src' make -C sgml all make[2]: Entering directory '/home/knizhnik/postgresql/doc/src/sgml' /usr/bin/xmllint --path . --noout --valid postgres.sgml /usr/bin/xsltproc --path . --stringparam pg.version '12devel' stylesheet.xsl postgres.sgml cp ./stylesheet.css html/ touch html-stamp /usr/bin/xmllint --path . --noout --valid postgres.sgml /usr/bin/xsltproc --path . --stringparam pg.version '12devel' stylesheet-man.xsl postgres.sgml touch man-stamp make[2]: Leaving directory '/home/knizhnik/postgresql/doc/src/sgml' make[1]: Leaving directory '/home/knizhnik/postgresql/doc/src' Also autoporepare-16.patch doesn't include any junk src/include/catalog/pg_proc.dat.~1~
Re: Declared but no defined functions
Ashwin Agrawal writes: > Do we wish to make this a tool and have it in src/tools, either as part of > find_static tool after renaming that one to more generic name or > independent script. Well, the scripts described so far are little more than jury-rigged hacks, with lots of room for false positives *and* false negatives. I wouldn't want to institutionalize any of them as the right way to check for such problems. If somebody made the effort to create a tool that was actually trustworthy, perhaps that'd be a different story. (Personally I was wondering whether pgindent could be hacked up to emit things it thought were declarations of function names. I'm not sure that I'd trust that 100% either, but at least it would have a better shot than the grep hacks we've discussed so far. Note in particular that pgindent would see things inside #ifdef blocks, whether or not your local build ever sees those declarations.) regards, tom lane
Assertion for logically decoding multi inserts into the catalog
My patch for using heap_multi_insert in the catalog [1] failed the logical decoding part of test/recovery [2]. The assertion it failed on seems to not take multi inserts into the catalog into consideration, while the main logic does. This assertion hasn't tripped since there are no multi inserts into the catalog yet, but if we introduce them it will, so I’m raising it in a separate thread as it is somewhat unrelated to the patch in question. The attached patch fixes my test failure and makes sense to me, but this code is far from my neck of the tree, so I’m really not sure this is the best way to express the assertion. cheers ./daniel [1] https://commitfest.postgresql.org/23/2125/ [2] https://postgr.es/m/CA+hUKGLg1vFiXnkxjp_bea5+VP8D=vhrwsdvj7rbikr_u4x...@mail.gmail.com logdec_assert.patch Description: Binary data
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 12:16:04PM -0400, Bruce Momjian wrote: ... Anyway, I will to research the reasonable data size that can be secured with a single key via AES. I will look at how PGP encrypts large files too. IMO there are various recommendations about this, for example from NIST. But it varies on the exact encryption mode (say, GCM, XTS, ...) and the recommendations are not "per key" but "per key + nonce" etc. IANAC but my understanding is if we use e.g. "OID + blocknum" as nonce, then we should be pretty safe. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 08, 2019 at 12:09:58PM -0400, Joe Conway wrote: On 7/8/19 11:56 AM, Peter Eisentraut wrote: On 2019-07-08 17:47, Stephen Frost wrote: Of course, we can discuss if what websites do with over-the-wire encryption is sensible to compare to what we want to do in PG for data-at-rest, but then we shouldn't be talking about what websites do, it'd make more sense to look at other data-at-rest encryption systems and consider what they're doing. So, how do encrypted file systems do it? Are there any encrypted file systems in general use that allow encrypting only some files or encrypting different parts of the file system with different keys, or any of those other granular approaches being discussed? Well it is fairly common, for good reason IMHO, to encrypt some mount points and not others on a system. In my mind, and in practice to a large extent, a postgres tablespace == a unique mount point. There is a description here: https://wiki.archlinux.org/index.php/Disk_encryption That link is a bit overwhelming, as it explains how various encrypted filesystems do things. There's now official support for this in the Linux kernel (encryption at the filesystem level, not block device) in the form of fscrypt, see https://www.kernel.org/doc/html/latest/filesystems/fscrypt.html It's a bit different because that's not a stacked encryption, it's integrated directly into filesystems (like ext4, at the moment) and it leverages other kernel facilities (like keyring). The link also discusses the threat model, which is interesting particularly interesting for this discussion, IMO. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Ltree syntax improvement
On 2019-Jul-08, Dmitry Belyavsky wrote: > Dear Thomas, > > Thank you for your proofreading! > > Please find the updated patch attached. It also contains the missing > escaping. I think all these functions you're adding should have a short sentence explaining what it does. I'm not really convinced that we need this much testing. It seems a bit excessive. Many of these very focused test SQL lines could be removed with no loss of coverage, and many of the others could be grouped into one. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Ltree syntax improvement
Dear Alvaro, On Mon, Jul 8, 2019 at 11:16 PM Alvaro Herrera wrote: > On 2019-Jul-08, Dmitry Belyavsky wrote: > > > Dear Thomas, > > > > Thank you for your proofreading! > > > > Please find the updated patch attached. It also contains the missing > > escaping. > > I think all these functions you're adding should have a short sentence > explaining what it does. > > I'm not really convinced that we need this much testing. It seems a bit > excessive. Many of these very focused test SQL lines could be removed > with no loss of coverage, and many of the others could be grouped into > one. > I did not introduce any functions. I've just changed the parser. I'm not sure that it makes sense to remove any tests as most of them were written to catch really happened bugs during the implementation. -- SY, Dmitry Belyavsky
Re: Built-in connection pooler
On 08.07.2019 3:37, Thomas Munro wrote: On Tue, Jul 2, 2019 at 3:11 AM Konstantin Knizhnik wrote: On 01.07.2019 12:57, Thomas Munro wrote: Interesting work. No longer applies -- please rebase. Rebased version of the patch is attached. Also all this version of built-ni proxy can be found in conn_proxy branch of https://github.com/postgrespro/postgresql.builtin_pool.git Thanks Konstantin. I haven't looked at the code, but I can't help noticing that this CF entry and the autoprepare one are both features that come up again an again on feature request lists I've seen. That's very cool. They also both need architectural-level review. With my Commitfest manager hat on: reviewing other stuff would help with that; if you're looking for something of similar complexity and also the same level of everyone-knows-we-need-to-fix-this-!@#$-we-just-don't-know-exactly-how-yet factor, I hope you get time to provide some more feedback on Takeshi Ideriha's work on shared caches, which doesn't seem a million miles from some of the things you're working on. Thank you, I will look at Takeshi Ideriha's patch. Could you please fix these compiler warnings so we can see this running check-world on CI? https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.46324 https://travis-ci.org/postgresql-cfbot/postgresql/builds/555180678 Sorry, I do not have access to Windows host, so can not check Win32 build myself. I have fixed all the reported warnings but can not verify that Win32 build is now ok. diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 84341a30e5..9398e561e8 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -719,6 +719,123 @@ include_dir 'conf.d' + + max_sessions (integer) + + max_sessions configuration parameter + + + + + The maximum number of client sessions that can be handled by + one connection proxy when session pooling is switched on. + This parameter does not add any memory or CPU overhead, so + specifying a large max_sessions value + does not affect performance. + If the max_sessions limit is reached new connection are not accepted. + + + The default value is 1000. This parameter can only be set at server start. + + + + + + session_pool_size (integer) + + session_pool_size configuration parameter + + + + + Enables session pooling and defines the maximum number of + backends that can be used by client sessions for each database/user combination. + Launched non-tainted backends are never terminated even if there are no active sessions. + Backend is considered as tainted if client updates GUCs, creates temporary table or prepared statements. + Tainted backend can server only one client. + + + The default value is 10, so up to 10 backends will server each database, + + + + + + proxy_port (integer) + + proxy_port configuration parameter + + + + + Sets the TCP port for the connection pooler. + Clients connected to main "port" will be assigned dedicated backends, + while client connected to proxy port will be connected to backends through proxy which + performs transaction level scheduling. + + + The default value is 6543. + + + + + + connection_proxies (integer) + + connection_proxies configuration parameter + + + + + Sets number of connection proxies. + Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing)." + "Each proxy launches its own subset of backends. 
So maximal number of non-tainted backends is " + "session_pool_size*connection_proxies*databases*roles. + + + The default value is 0, so session pooling is disabled. + + + + + + session_schedule (enum) + + session_schedule configuration parameter + + + + + Specifies scheduling policy for assigning session to proxies in case of + connection pooling. Default policy is round-robin. + + + With round-robin policy postmaster cyclicly scatter sessions between proxies. + + + With random policy postmaster randomly choose proxy for new session. + + + With load-balancing policy postmaster choose proxy with lowest load average. + Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections. + + + + + + restar
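As a rough usage sketch, based solely on the GUC descriptions in the patch above (it assumes the patch is installed, all of these settings require a server restart, and the values are arbitrary examples):

ALTER SYSTEM SET connection_proxies = 2;   -- enable pooling with two proxy workers
ALTER SYSTEM SET session_pool_size = 10;   -- backends per database/user combination
ALTER SYSTEM SET proxy_port = 6543;        -- pooled clients connect to this port
-- After restarting, a pooled client would connect through the proxy port,
-- e.g. psql "host=localhost port=6543 dbname=postgres", while connections
-- to the regular port keep getting dedicated backends.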
Re: Ltree syntax improvement
On 2019-Jul-08, Dmitry Belyavsky wrote: > I did not introduce any functions. I've just changed the parser. I mean the C-level functions -- count_parts_ors() and so on. > I'm not sure that it makes sense to remove any tests as most of them were > written to catch really happened bugs during the implementation. Well, I don't mean to decrease the coverage, only to condense a lot of little tests in a small powerful test. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Add missing operator <->(box, point)
Attached 3rd version of the patches. On 02.07.2019 21:55, Alexander Korotkov wrote: On Tue, Jul 2, 2019 at 9:19 PM Nikita Glukhov wrote: We could use commuted "const <-> var" operators for kNN searches, but the current implementation requires the existence of "var <-> const" operators, and order-by-op clauses are rebuilt using them (see match_clause_to_ordering_op() at /src/backend/optimizer/path/indxpath.c). But probably it's still worth to just add commutator for every <-> operator and close this question. Otherwise, it may arise again once we want to add some more kNN support to opclasses or something. On the other hand, are we already going to limit oid consumption? All missing distance operators were added to the first patch. On 08.07.2019 18:22, Alexander Korotkov wrote: On Mon, Mar 11, 2019 at 2:49 AM Nikita Glukhov wrote: 2. Add <-> to GiST box_ops. Extracted gist_box_distance_helper() common for gist_box_distance() and gist_bbox_distance(). For me it doesn't look worth having two distinct functions gist_box_distance_helper() and gist_bbox_distance(). What about having just one and leave responsibility for recheck flag to the caller? gist_bbox_distance() was removed. But maybe it would be better to replace two identical functions gist_circle_distance() and gist_poly_distance() with the single gist_bbox_distance()? 3. Add <-> to SP-GiST. Changed only catalog and tests. Box case is already checked in spg_box_quad_leaf_consistent(): out->recheckDistances = distfnoid == F_DIST_POLYP; So, it seems to be fix of oversight in 2a6368343ff4. But assuming fixing this requires catalog changes, we shouldn't backpatch this. -- Nikita Glukhov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company >From 14a0e22e1b68b084dd75a7f660955b52b6aaae79 Mon Sep 17 00:00:00 2001 From: Nikita Glukhov Date: Mon, 8 Jul 2019 22:55:00 +0300 Subject: [PATCH 1/3] Add missing distance operators --- src/backend/utils/adt/geo_ops.c| 136 - src/include/catalog/pg_operator.dat| 42 +- src/include/catalog/pg_proc.dat| 25 + src/test/regress/expected/geometry.out | 986 + src/test/regress/sql/geometry.sql | 15 +- 5 files changed, 687 insertions(+), 517 deletions(-) diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c index 373784f..6558ea3 100644 --- a/src/backend/utils/adt/geo_ops.c +++ b/src/backend/utils/adt/geo_ops.c @@ -2348,6 +2348,17 @@ dist_pl(PG_FUNCTION_ARGS) PG_RETURN_FLOAT8(line_closept_point(NULL, line, pt)); } +/* + * Distance from a line to a point + */ +Datum +dist_lp(PG_FUNCTION_ARGS) +{ + LINE *line = PG_GETARG_LINE_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(line_closept_point(NULL, line, pt)); +} /* * Distance from a point to a lseg @@ -2362,13 +2373,20 @@ dist_ps(PG_FUNCTION_ARGS) } /* - * Distance from a point to a path + * Distance from a lseg to a point */ Datum -dist_ppath(PG_FUNCTION_ARGS) +dist_sp(PG_FUNCTION_ARGS) +{ + LSEG *lseg = PG_GETARG_LSEG_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(lseg_closept_point(NULL, lseg, pt)); +} + +static float8 +dist_ppath_internal(Point *pt, PATH *path) { - Point *pt = PG_GETARG_POINT_P(0); - PATH *path = PG_GETARG_PATH_P(1); float8 result = 0.0; /* keep compiler quiet */ bool have_min = false; float8 tmp; @@ -2403,7 +2421,31 @@ dist_ppath(PG_FUNCTION_ARGS) } } - PG_RETURN_FLOAT8(result); + return result; +} + +/* + * Distance from a point to a path + */ +Datum +dist_ppath(PG_FUNCTION_ARGS) +{ + Point *pt = PG_GETARG_POINT_P(0); + PATH *path = PG_GETARG_PATH_P(1); + + 
PG_RETURN_FLOAT8(dist_ppath_internal(pt, path)); +} + +/* + * Distance from a path to a point + */ +Datum +dist_pathp(PG_FUNCTION_ARGS) +{ + PATH *path = PG_GETARG_PATH_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(dist_ppath_internal(pt, path)); } /* @@ -2419,6 +2461,18 @@ dist_pb(PG_FUNCTION_ARGS) } /* + * Distance from a box to a point + */ +Datum +dist_bp(PG_FUNCTION_ARGS) +{ + BOX *box = PG_GETARG_BOX_P(0); + Point *pt = PG_GETARG_POINT_P(1); + + PG_RETURN_FLOAT8(box_closept_point(NULL, box, pt)); +} + +/* * Distance from a lseg to a line */ Datum @@ -2431,6 +2485,18 @@ dist_sl(PG_FUNCTION_ARGS) } /* + * Distance from a line to a lseg + */ +Datum +dist_ls(PG_FUNCTION_ARGS) +{ + LINE *line = PG_GETARG_LINE_P(0); + LSEG *lseg = PG_GETARG_LSEG_P(1); + + PG_RETURN_FLOAT8(lseg_closept_line(NULL, lseg, line)); +} + +/* * Distance from a lseg to a box */ Datum @@ -2443,6 +2509,18 @@ dist_sb(PG_FUNCTION_ARGS) } /* + * Distance from a box to a lseg + */ +Datum +dist_bs(PG_FUNCTION_ARGS) +{ + BOX *box = PG_GETARG_BOX_P(0); + LSEG *lseg = PG_GETARG_LSEG_P(1); + + PG_RETURN_FLOAT8(box_closept_lseg(NULL, box, lseg)); +} + +/* * Distance from a line to a box */ Datum @@ -2462,13 +2540
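As a quick illustration of the asymmetry these patches remove (results omitted; the point is only which side of the operator the box may appear on):

SELECT point '(3,3)' <-> box '((0,0),(1,1))';  -- already supported in core
SELECT box '((0,0),(1,1))' <-> point '(3,3)';  -- commutator added by the first patch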
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 12:09:58PM -0400, Joe Conway wrote: > On 7/8/19 11:56 AM, Peter Eisentraut wrote: > > On 2019-07-08 17:47, Stephen Frost wrote: > >> Of course, we can discuss if what websites do with over-the-wire > >> encryption is sensible to compare to what we want to do in PG for > >> data-at-rest, but then we shouldn't be talking about what websites do, > >> it'd make more sense to look at other data-at-rest encryption systems > >> and consider what they're doing. > > > > So, how do encrypted file systems do it? Are there any encrypted file > > systems in general use that allow encrypting only some files or > > encrypting different parts of the file system with different keys, or > > any of those other granular approaches being discussed? > > Well it is fairly common, for good reason IMHO, to encrypt some mount > points and not others on a system. In my mind, and in practice to a > large extent, a postgres tablespace == a unique mount point. Yes, that is a normal partition point for key use because one file system is independent of others. You could use different keys for different directories in the same file system, but internally it all uses the same storage, and storage theft would potentially happen at the file system level. For Postgres, tablespaces are not independent of the database system, though media theft would still match. Of course, in the case of a tablespace media theft, Postgres would be quite confused, though you could still start the database system: SELECT * FROM test; ERROR: could not open file "pg_tblspc/16385/PG_13_201907054/16384/16386": No such file or directory but the data would be gone. What you can't do with Postgres is to have the tablespace be inaccessible and then later reappear. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: Add parallelism and glibc dependent only options to reindexdb
On Mon, Jul 8, 2019 at 9:08 PM Julien Rouhaud wrote: > > I'll resubmit the parallel patch using per-table only > approach Attached. 0001-Export-vacuumdb-s-parallel-infrastructure.patch Description: Binary data 0002-Add-parallel-processing-to-reindexdb.patch Description: Binary data
Re: Add missing operator <->(box, point)
On Mon, Jul 8, 2019 at 11:39 PM Nikita Glukhov wrote: > On 08.07.2019 18:22, Alexander Korotkov wrote: > For me it doesn't look worth having two distinct functions > gist_box_distance_helper() and gist_bbox_distance(). What about > having just one and leave responsibility for recheck flag to the > caller? > > gist_bbox_distance() was removed. OK. > But maybe it would be better to replace two identical functions > gist_circle_distance() and gist_poly_distance() with the single > gist_bbox_distance()? Sounds reasonable to me. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 02:39:44PM -0400, Stephen Frost wrote: > > > Of course, we can discuss if what websites do with over-the-wire > > > encryption is sensible to compare to what we want to do in PG for > > > data-at-rest, but then we shouldn't be talking about what websites do, > > > it'd make more sense to look at other data-at-rest encryption systems > > > and consider what they're doing. > > > > (I talked to Joe on chat for clarity.) In modern TLS, the certificate is > > used only for authentication, and Diffie–Hellman is used for key > > exchange: > > > > https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange > > Right, and the key that's figured out for each connection is at least > specific to the server AND client keys/certificates, thus meaning that > they're changed at least as frequently as those change (and clients end > up creating ones on the fly randomly if they don't have one, iirc). > > > So, the question is whether you can pass so much data in TLS that using > > the same key for the entire session is a security issue. TLS originally > > had key renegotiation, but that was removed in TLS 1.3: > > > > > > https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ > > To mitigate these types of attacks, TLS 1.3 disallows renegotiation. > > It was removed due to attacks targeting the renegotiation, not because > doing re-keying by itself was a bad idea, or because using multiple keys > was seen as a bad idea. Well, if it was a necessary features, I assume TLS 1.3 would have found a way to make it secure, no? Certainly they are not shipping TLS 1.3 with a known weakness. > > Of course, a database is going to process even more data so if the > > amount of data encrypted is a problem, we might have a problem too in > > using a single key. This is not related to whether we use one key for > > the entire cluster or multiple keys per tablespace --- the problem is > > the same. I guess we could create 1024 keys and use the bottom bits of > > the block number to decide what key to use. However, that still only > > pushes the goalposts farther. > > All of this is about pushing the goalposts farther away, as I see it. > There's going to be trade-offs here and there isn't going to be any "one > right answer" when it comes to this space. That's why I'm inclined to > argue that we should try to come up with a relatively *good* solution > that doesn't create a huge amount of work for us, and then build on > that. To that end, leveraging metadata that we already have outside of > the catalogs (databases, tablespaces, potentially other information that > we store, essentially, in the filesystem metadata already) to decide on > what key to use, and how many we can support, strikes me as a good > initial target. Yes, we will need that for a usable nonce that we don't need to store in the blocks and WAL files. > > Anyway, I will to research the reasonable data size that can be secured > > with a single key via AES. I will look at how PGP encrypts large files > > too. > > This seems unlikely to lead to a definitive result, but it would be > interesting to hear if there have been studies around that and what > their conclusions were. 
I found this: https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 The numbers listed are: Maximum Encrypted Plaintext Size: 68GB Maximum Processed Additional Authenticated Data: 2 x 10^18 The 68GB value is "the maximum bits that can be processed with a single key/IV(nonce) pair." We would have 8k of data for each 8k page. If we assume a unique nonce per page, that is 10^32 bytes. For the WAL we would probably use a different nonce for each 16MB page, so we would be OK there too, since that gives us 10^36 bytes before the segment number causes the nonce to repeat. > When it comes to concerns about autovacuum or other system processes, > those don't have any direct user connections or interactions, so having > them be more privileged and having access to more is reasonable. Well, I am trying to understand the value of having some keys accessible by some parts of the system, and some not. I am unclear what security value that has. > Ideally, all of this would leverage a vaulting system or other mechanism > which manages access to the keys and allows their usage to be limited. > That's been generally accepted as a good way to bridge the gap between > having to ask users every time for a key and having keys stored > long-term in memory. Having *only* the keys for the data which the > currently connected user is allowed to access would certainly be a great > initial capability, even if system processes (including potentially WAL > replay) have to have access to
Re: Excessive memory usage in multi-statement queries w/ partitioning
Amit Langote writes: > [ parse-plan-memcxt_v2.patch ] I got around to looking at this finally. I'm not at all happy with the fact that it's added a plantree copy step to the only execution path through exec_simple_query. That's a very significant overhead, in many use-cases, to solve something that nobody had complained about for a couple of decades before now. I don't see the need for any added copy step anyway. The only reason you're doing it AFAICS is so you can release the per-statement context a bit earlier, which is a completely unnecessary optimization. Just wait to release it till the bottom of the loop. Also, creating/deleting the sub-context is in itself an added overhead that accomplishes exactly nothing in the typical case where there's not multiple statements. I thought the idea was to do that only if there was more than one raw parsetree (or, maybe better, do it for all but the last parsetree). To show that this isn't an empty concern, I did a quick pgbench test. Using a single-client select-only test ("pgbench -S -T 60" in an -s 10 database), I got these numbers in three trials with HEAD: tps = 9593.818478 (excluding connections establishing) tps = 9570.189163 (excluding connections establishing) tps = 9596.579038 (excluding connections establishing) and these numbers after applying the patch: tps = 9411.918165 (excluding connections establishing) tps = 9389.279079 (excluding connections establishing) tps = 9409.350175 (excluding connections establishing) That's about a 2% dropoff. Now it's possible that that can be explained away as random variations from a slightly different layout of critical loops vs cacheline boundaries ... but I don't believe it. regards, tom lane
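For illustration, a minimal C sketch of the restructuring suggested above: create a per-statement child of MessageContext only when the query string yields more than one raw parsetree, and release it at the bottom of the loop instead of copying the plan tree. The variable and context names here are illustrative only and not taken from any posted patch.

	/*
	 * Hypothetical sketch, not the actual patch: a child context is created
	 * only for multi-statement query strings, and is dropped only after the
	 * statement has finished executing, so no plan-tree copy is needed.
	 */
	bool		use_per_stmt_cxt = (list_length(parsetree_list) > 1);
	ListCell   *parsetree_item;

	foreach(parsetree_item, parsetree_list)
	{
		RawStmt    *parsetree = lfirst_node(RawStmt, parsetree_item);
		MemoryContext per_stmt_cxt = NULL;
		MemoryContext oldcontext = NULL;

		if (use_per_stmt_cxt)
		{
			per_stmt_cxt = AllocSetContextCreate(MessageContext,
												 "per-statement context",
												 ALLOCSET_DEFAULT_SIZES);
			oldcontext = MemoryContextSwitchTo(per_stmt_cxt);
		}

		/* ... parse analysis, rewrite, planning, portal execution as today ... */

		if (use_per_stmt_cxt)
		{
			MemoryContextSwitchTo(oldcontext);
			MemoryContextDelete(per_stmt_cxt);	/* freed at loop bottom */
		}
	}

With a single-statement string the branch is never taken, so the common pgbench-style path allocates and frees nothing extra.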
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 09:30:03PM +0200, Tomas Vondra wrote: > I think Bruce's proposal was to minimize the time the key is "unlocked" > in memory by only unlocking them when the user connects and supplies > some sort of secret (passphrase), and remove them from memory when the > user disconnects. So there's no way for the auxiliary processes to gain > access to those keys, because only the user knows the secret. I mentioned that because I thought that was the security value that people wanted. While I can see the value, I don't see how it can be cleanly accomplished. Keeping the keys unlocked at all times seems to be possible, but of much smaller value. Part of my goal in this discussion is to reverse the rush to implement and pick apart exactly what is possible, and desirable. > FWIW I have doubts this scheme actually measurably improves privacy in > practice, because most busy applications will end up having the keys in > the memory all the time anyway. Yep. > It also assumes memory is unsafe, i.e. bad actors can read it, and > that's probably a valid concern (root access, vulnerabilities etc.). But > in that case we already have plenty of issues with data in flight > anyway, and I doubt TDE is an answer to that. Agreed. > > Ideally, all of this would leverage a vaulting system or other mechanism > > which manages access to the keys and allows their usage to be limited. > > That's been generally accepted as a good way to bridge the gap between > > having to ask users every time for a key and having keys stored > > long-term in memory. > > Right. I agree with this. > > > Having *only* the keys for the data which the > > currently connected user is allowed to access would certainly be a great > > initial capability, even if system processes (including potentially WAL > > replay) have to have access to all of the keys. And yes, shared buffers > > being unencrypted and accessible by every backend continues to be an > > issue- it'd be great to improve on that situation too. I don't think > > having everything encrypted in shared buffers is likely the solution, > > rather, segregating it up might make more sense, again, along similar > > lines to keys and using metadata that's outside of the catalogs, which > > has been discussed previously, though I don't think anyone's actively > > working on it. > > > > I very much doubt TDE is a solution to this. Essentially, TDE is a good > data-at-rest solution, but this seems more like protecting data during > execution. And in that case I think we may need an entirely different > encryption scheme. I thought client-level encryption or pgcrypto-style encryption fits that need better. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 02:39:44PM -0400, Stephen Frost wrote: > > > > Of course, we can discuss if what websites do with over-the-wire > > > > encryption is sensible to compare to what we want to do in PG for > > > > data-at-rest, but then we shouldn't be talking about what websites do, > > > > it'd make more sense to look at other data-at-rest encryption systems > > > > and consider what they're doing. > > > > > > (I talked to Joe on chat for clarity.) In modern TLS, the certificate is > > > used only for authentication, and Diffie–Hellman is used for key > > > exchange: > > > > > > https://en.wikipedia.org/wiki/Diffie%E2%80%93Hellman_key_exchange > > > > Right, and the key that's figured out for each connection is at least > > specific to the server AND client keys/certificates, thus meaning that > > they're changed at least as frequently as those change (and clients end > > up creating ones on the fly randomly if they don't have one, iirc). > > > > > So, the question is whether you can pass so much data in TLS that using > > > the same key for the entire session is a security issue. TLS originally > > > had key renegotiation, but that was removed in TLS 1.3: > > > > > > > > > https://www.cloudinsidr.com/content/known-attack-vectors-against-tls-implementation-vulnerabilities/ > > > To mitigate these types of attacks, TLS 1.3 disallows renegotiation. > > > > It was removed due to attacks targeting the renegotiation, not because > > doing re-keying by itself was a bad idea, or because using multiple keys > > was seen as a bad idea. > > Well, if it was a necessary features, I assume TLS 1.3 would have found > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > with a known weakness. As discussed below- this is about moving goalposts, and that's, in part at least, why re-keying isn't a *necessary* feature of TLS. As the amount of data you transmit over a given TLS connection increases though, the risk increases and it would be better to re-key. How much better? That depends a great deal on if someone is trying to mount an attack or not. > > > Of course, a database is going to process even more data so if the > > > amount of data encrypted is a problem, we might have a problem too in > > > using a single key. This is not related to whether we use one key for > > > the entire cluster or multiple keys per tablespace --- the problem is > > > the same. I guess we could create 1024 keys and use the bottom bits of > > > the block number to decide what key to use. However, that still only > > > pushes the goalposts farther. > > > > All of this is about pushing the goalposts farther away, as I see it. > > There's going to be trade-offs here and there isn't going to be any "one > > right answer" when it comes to this space. That's why I'm inclined to > > argue that we should try to come up with a relatively *good* solution > > that doesn't create a huge amount of work for us, and then build on > > that. To that end, leveraging metadata that we already have outside of > > the catalogs (databases, tablespaces, potentially other information that > > we store, essentially, in the filesystem metadata already) to decide on > > what key to use, and how many we can support, strikes me as a good > > initial target. > > Yes, we will need that for a usable nonce that we don't need to store in > the blocks and WAL files. I'm not a fan of the idea of using something which is predictable as a nonce. 
Using the username as the salt for our md5 password mechanism was, all around, a bad idea. This seems like it's repeating that mistake. > > > Anyway, I will to research the reasonable data size that can be secured > > > with a single key via AES. I will look at how PGP encrypts large files > > > too. > > > > This seems unlikely to lead to a definitive result, but it would be > > interesting to hear if there have been studies around that and what > > their conclusions were. > > I found this: > > > https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes > > https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 > > The numbers listed are: > > Maximum Encrypted Plaintext Size: 68GB > Maximum Processed Additional Authenticated Data: 2 x 10^18 These are specific to AES, from a quick reading of those pages, right? > The 68GB value is "the maximum bits that can be processed with a single > key/IV(nonce) pair." We would 8k of data for each 8k page. If we > assume a unique nonce per page that is 10^32 bytes. A unique nonce per page strikes me as excessive... but then, I think we should have an actually random nonce rather than something calculated from the metadata. > For the WAL we would probably use a different nonce for each 16MB page, > so we would be OK there too, since t
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: > > Well, if it was a necessary features, I assume TLS 1.3 would have found > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > with a known weakness. > > As discussed below- this is about moving goalposts, and that's, in part > at least, why re-keying isn't a *necessary* feature of TLS. As the I agree we have to allow rekeying and allow multiple unlocked keys in the server at the same time. The open question is whether encrypting different data with different keys and different unlock controls is possible or useful. > amount of data you transmit over a given TLS connection increases > though, the risk increases and it would be better to re-key. How much > better? That depends a great deal on if someone is trying to mount an > attack or not. Yep, we need to allow rekey. > > > > Of course, a database is going to process even more data so if the > > > > amount of data encrypted is a problem, we might have a problem too in > > > > using a single key. This is not related to whether we use one key for > > > > the entire cluster or multiple keys per tablespace --- the problem is > > > > the same. I guess we could create 1024 keys and use the bottom bits of > > > > the block number to decide what key to use. However, that still only > > > > pushes the goalposts farther. > > > > > > All of this is about pushing the goalposts farther away, as I see it. > > > There's going to be trade-offs here and there isn't going to be any "one > > > right answer" when it comes to this space. That's why I'm inclined to > > > argue that we should try to come up with a relatively *good* solution > > > that doesn't create a huge amount of work for us, and then build on > > > that. To that end, leveraging metadata that we already have outside of > > > the catalogs (databases, tablespaces, potentially other information that > > > we store, essentially, in the filesystem metadata already) to decide on > > > what key to use, and how many we can support, strikes me as a good > > > initial target. > > > > Yes, we will need that for a usable nonce that we don't need to store in > > the blocks and WAL files. > > I'm not a fan of the idea of using something which is predictable as a > nonce. Using the username as the salt for our md5 password mechanism > was, all around, a bad idea. This seems like it's repeating that > mistake. Uh, well, renaming the user was a big problem, but that is the only case I can think of. I don't see that as an issue for block or WAL sequence numbers. If we want to use a different nonce, we have to find a way to store it or look it up efficiently. Considering the nonce size, I don't see how that is possible. > > > > Anyway, I will to research the reasonable data size that can be secured > > > > with a single key via AES. I will look at how PGP encrypts large files > > > > too. > > > > > > This seems unlikely to lead to a definitive result, but it would be > > > interesting to hear if there have been studies around that and what > > > their conclusions were. 
> > > > I found this: > > > > > > https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes > > > > https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 > > > > The numbers listed are: > > > > Maximum Encrypted Plaintext Size: 68GB > > Maximum Processed Additional Authenticated Data: 2 x 10^18 > > These are specific to AES, from a quick reading of those pages, right? Yes, AES with GCM, which has authentication parts we would not use, so we would use CBC and CTR, which I think has the same or larger spaces. > > > The 68GB value is "the maximum bits that can be processed with a single > > key/IV(nonce) pair." We would 8k of data for each 8k page. If we > > assume a unique nonce per page that is 10^32 bytes. > > A unique nonce per page strikes me as excessive... but then, I think we > should have an actually random nonce rather than something calculated > from the metadata. Uh, well, you are much less likely to get duplicate nonce values by using block number or WAL sequence number. If you look at the implementations, few compute random nonce values. > > For the WAL we would probably use a different nonce for each 16MB page, > > so we would be OK there too, since that is 10 ^ 36 bytes. > > > > gives us 10^36 bytes before the segment number causes the nonce to > > repeat. > > This presumes that the segment number is involved in the nonce > selection, which again strikes me as a bad idea. Maybe it could be > involved in some way, but we should have a properly random nonce. And you base the random goal on what? Nonce is number used only once, and randomness is not a requirement. You can say you prefer it, but why, because most implementations don't u
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > > * Bruce Momjian (br...@momjian.us) wrote: > > > Well, if it was a necessary features, I assume TLS 1.3 would have found > > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > > with a known weakness. > > > > As discussed below- this is about moving goalposts, and that's, in part > > at least, why re-keying isn't a *necessary* feature of TLS. As the > > I agree we have to allow rekeying and allow multiple unlocked keys in > the server at the same time. The open question is whether encrypting > different data with different keys and different unlock controls is > possible or useful. I'm not sure if there's really a question about if it's *possible*? As for if it's useful, I agree there's some debate. > > amount of data you transmit over a given TLS connection increases > > though, the risk increases and it would be better to re-key. How much > > better? That depends a great deal on if someone is trying to mount an > > attack or not. > > Yep, we need to allow rekey. Supporting a way to rekey is definitely a good idea. > > > > > Of course, a database is going to process even more data so if the > > > > > amount of data encrypted is a problem, we might have a problem too in > > > > > using a single key. This is not related to whether we use one key for > > > > > the entire cluster or multiple keys per tablespace --- the problem is > > > > > the same. I guess we could create 1024 keys and use the bottom bits > > > > > of > > > > > the block number to decide what key to use. However, that still only > > > > > pushes the goalposts farther. > > > > > > > > All of this is about pushing the goalposts farther away, as I see it. > > > > There's going to be trade-offs here and there isn't going to be any "one > > > > right answer" when it comes to this space. That's why I'm inclined to > > > > argue that we should try to come up with a relatively *good* solution > > > > that doesn't create a huge amount of work for us, and then build on > > > > that. To that end, leveraging metadata that we already have outside of > > > > the catalogs (databases, tablespaces, potentially other information that > > > > we store, essentially, in the filesystem metadata already) to decide on > > > > what key to use, and how many we can support, strikes me as a good > > > > initial target. > > > > > > Yes, we will need that for a usable nonce that we don't need to store in > > > the blocks and WAL files. > > > > I'm not a fan of the idea of using something which is predictable as a > > nonce. Using the username as the salt for our md5 password mechanism > > was, all around, a bad idea. This seems like it's repeating that > > mistake. > > Uh, well, renaming the user was a big problem, but that is the only case > I can think of. I don't see that as an issue for block or WAL sequence > numbers. If we want to use a different nonce, we have to find a way to > store it or look it up efficiently. Considering the nonce size, I don't > see how that is possible. No, this also meant that, as an attacker, I *knew* the salt ahead of time and therefore could build rainbow tables specifically for that salt. I could also use those *same* tables for any system where that user had an account, even if they used different passwords on different systems... 
I appreciate that *some* of this might not be completely relevant for the way a nonce is used in cryptography, but I'd be very surprised to have a cryptographer tell me that a deterministic nonce didn't have similar issues or didn't reduce the value of the nonce significantly. > > > > > Anyway, I will to research the reasonable data size that can be > > > > > secured > > > > > with a single key via AES. I will look at how PGP encrypts large > > > > > files > > > > > too. > > > > > > > > This seems unlikely to lead to a definitive result, but it would be > > > > interesting to hear if there have been studies around that and what > > > > their conclusions were. > > > > > > I found this: > > > > > > > > > https://crypto.stackexchange.com/questions/44113/what-is-a-safe-maximum-message-size-limit-when-encrypting-files-to-disk-with-aes > > > > > > https://crypto.stackexchange.com/questions/20333/encryption-of-big-files-in-java-with-aes-gcm/20340#20340 > > > > > > The numbers listed are: > > > > > > Maximum Encrypted Plaintext Size: 68GB > > > Maximum Processed Additional Authenticated Data: 2 x 10^18 > > > > These are specific to AES, from a quick reading of those pages, right? > > Yes, AES with GCM, which has authentication parts we would not use, so > we would use CBC and CTR, which I think has the same or larger spaces. > > > > > The 68GB value is "the maximum bits that can be processed with a single > > > key/IV(nonce) pair." We would 8k of data for each 8k page. If we > > > assume a unique nonce per page that i
Detailed questions about pg_xact_commit_timestamp
I have some specific questions about pg_xact_commit_timestamp, and am hoping that this is the right place to ask. I read a lot of the commentary about the original patch, and the contributors seem to be here. If I'm asking in the wrong place, just let me know. I'm working on a design for a concurrency-safe incremental aggregate rollup system, and pg_xact_commit_timestamp sounds perfect. But I've found very little commentary on it generally, and couldn't figure out how it works in detail from the source code. Hopefully, someone knows the answers to a few questions: * Is it possible for pg_xact_commit_timestamp to produce times out of order? What I'm after is a way to identify records that have been changed since a specific time so that I can get any later changes for processing. I don't need them in commit order, so overlapping timestamps aren't a problem. * How many bytes are added to each row in the final implementation? The discussions I saw seemed to be ranging from 12-24 bytes. There was discussion of adding in extra bytes for "just in case." This is pre-9.5, so a world ago. * Are the timestamps indexed internally? With a B-tree? I ask for capacity-planning reasons. * I've seen on StackOverflow and the design discussions that the timestamps are not kept indefinitely, but can't find the details on exactly how long they are stored. * Any rules of thumb on the performance impact of enabling pg_xact_commit_timestamp? I don't need the data on all tables but, where I do, it sounds like it might work perfectly. Many thanks for any assistance!
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 06:04:46PM -0400, Stephen Frost wrote: > Greetings, > > * Bruce Momjian (br...@momjian.us) wrote: > > On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > > > * Bruce Momjian (br...@momjian.us) wrote: > > > > Well, if it was a necessary features, I assume TLS 1.3 would have found > > > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > > > with a known weakness. > > > > > > As discussed below- this is about moving goalposts, and that's, in part > > > at least, why re-keying isn't a *necessary* feature of TLS. As the > > > > I agree we have to allow rekeying and allow multiple unlocked keys in > > the server at the same time. The open question is whether encrypting > > different data with different keys and different unlock controls is > > possible or useful. > > I'm not sure if there's really a question about if it's *possible*? As > for if it's useful, I agree there's some debate. Right, it is easily possible to keep all keys unlocked, but the value is minimal, and the complexity will have a cost, which is my point. > > > amount of data you transmit over a given TLS connection increases > > > though, the risk increases and it would be better to re-key. How much > > > better? That depends a great deal on if someone is trying to mount an > > > attack or not. > > > > Yep, we need to allow rekey. > > Supporting a way to rekey is definitely a good idea. It is a requirement, I think. We might have problem tracking exactly what key _version_ each table (or 8k block), or WAL file are. :-( Ideally we would allow only two active keys, and somehow mark each page as using the odd or even key at a given time, or something strange. (Yeah, hand waving here.) > > Uh, well, renaming the user was a big problem, but that is the only case > > I can think of. I don't see that as an issue for block or WAL sequence > > numbers. If we want to use a different nonce, we have to find a way to > > store it or look it up efficiently. Considering the nonce size, I don't > > see how that is possible. > > No, this also meant that, as an attacker, I *knew* the salt ahead of > time and therefore could build rainbow tables specifically for that > salt. I could also use those *same* tables for any system where that > user had an account, even if they used different passwords on different > systems... Yes, 'postgres' can be used to create a nice md5 rainbow table that works on many servers --- good point. Are rainbow tables possible with something like AES? > I appreciate that *some* of this might not be completely relevant for > the way a nonce is used in cryptography, but I'd be very surprised to > have a cryptographer tell me that a deterministic nonce didn't have > similar issues or didn't reduce the value of the nonce significantly. This post: https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm says: GCM is a variation on Counter Mode (CTR). As you say, with any variant of Counter Mode, it is essential that the Nonce is not repeated with the same key. Hence CTR mode Nonces often include either a counter or a timer element: something that is guaranteed not to repeat over the lifetime of the key. CTR is what we use for WAL. 8k pages, we would use CBC, which says we need a random nonce. I need to dig deeper into ECB mode attack. > > Uh, well, you are much less likely to get duplicate nonce values by > > using block number or WAL sequence number. If you look at the > > implementations, few compute random nonce values. 
> > Which implementations..? Where do their nonce values come from? I can > see how a nonce might have to be naturally and deterministically random, > if the source for it is sufficiently varied across the key space, but > starting at '1' and going up with the same key seems like it's just > giving a potential attacker more information about what the encrypted > data contains... Well, in many modes the nonce is just a counter, but as stated above, not all modes. I need to pull out my security books to remember for which ones it is safe. (Frankly, it is a lot easier to use a random nonce for WAL than 8k pages.) > > And you base the random goal on what? Nonce is number used only once, > > and randomness is not a requirement. You can say you prefer it, but > > why, because most implementations don't use random nonce. > > The encryption schemes I've worked with in the past have used a random > nonce, so I'm wondering where the disconnect is between us on that. OK. > > > > > When it comes to concerns about autovacuum or other system processes, > > > > > those don't have any direct user connections or interactions, so > > > > > having > > > > > them be more privileged and having access to more is reasonable. > > > > > > > > Well, I am trying to understand the value of having some keys accessible > > > > by some parts of the
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On 7/8/19 6:04 PM, Stephen Frost wrote: > * Bruce Momjian (br...@momjian.us) wrote: >> Uh, well, renaming the user was a big problem, but that is the only case >> I can think of. I don't see that as an issue for block or WAL sequence >> numbers. If we want to use a different nonce, we have to find a way to >> store it or look it up efficiently. Considering the nonce size, I don't >> see how that is possible. > > No, this also meant that, as an attacker, I *knew* the salt ahead of > time and therefore could build rainbow tables specifically for that > salt. I could also use those *same* tables for any system where that > user had an account, even if they used different passwords on different > systems... > > I appreciate that *some* of this might not be completely relevant for > the way a nonce is used in cryptography, but I'd be very surprised to > have a cryptographer tell me that a deterministic nonce didn't have > similar issues or didn't reduce the value of the nonce significantly. I have worked side by side on projects with bona fide cryptographers and I can assure you that they recommended random nonces. Granted, that was in the early 2000s, but I don't think "modern cryptography" has changed that any more than "web scale" has made Postgres irrelevant in the intervening years. Related links: https://defuse.ca/cbcmodeiv.htm https://www.cryptofails.com/post/70059609995/crypto-noobs-1-initialization-vectors Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development signature.asc Description: OpenPGP digital signature
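To make the recommendation concrete, here is a minimal sketch, in terms of OpenSSL's EVP interface, of what a random IV per 8kB page with AES-256-CBC would look like. The function name and buffer layout are hypothetical and not from any posted patch; the point is that the IV comes from a CSPRNG and therefore has to be stored alongside the page so it can be decrypted later, which is exactly the storage question raised elsewhere in this thread.

	#include <openssl/evp.h>
	#include <openssl/rand.h>

	#define PAGE_SIZE 8192
	#define IV_LEN    16

	/*
	 * Hypothetical illustration of the "random IV per page" scheme: encrypt
	 * one 8kB page with AES-256-CBC under a freshly generated random IV.
	 * Returns the ciphertext length (== PAGE_SIZE with padding disabled),
	 * or -1 on failure.
	 */
	static int
	encrypt_page_aes_cbc(const unsigned char *key,	/* 32-byte AES-256 key */
						 const unsigned char *page,	/* 8192-byte plaintext */
						 unsigned char *iv_out,		/* receives 16-byte IV */
						 unsigned char *ct_out)		/* 8192-byte ciphertext */
	{
		EVP_CIPHER_CTX *ctx;
		int			len,
					ctlen;

		if (RAND_bytes(iv_out, IV_LEN) != 1)	/* CSPRNG-generated IV */
			return -1;

		if ((ctx = EVP_CIPHER_CTX_new()) == NULL)
			return -1;

		if (EVP_EncryptInit_ex(ctx, EVP_aes_256_cbc(), NULL, key, iv_out) != 1)
			goto fail;
		EVP_CIPHER_CTX_set_padding(ctx, 0);	/* page is a multiple of 16 bytes */

		if (EVP_EncryptUpdate(ctx, ct_out, &len, page, PAGE_SIZE) != 1)
			goto fail;
		ctlen = len;
		if (EVP_EncryptFinal_ex(ctx, ct_out + len, &len) != 1)
			goto fail;
		ctlen += len;

		EVP_CIPHER_CTX_free(ctx);
		return ctlen;

	fail:
		EVP_CIPHER_CTX_free(ctx);
		return -1;
	}

Padding is disabled because an 8kB page is already a multiple of the AES block size, so the ciphertext stays page-sized; only the 16-byte IV is extra and needs a home.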
Re: FETCH FIRST clause PERCENT option
Hi Surafel, The v5 patch [1] applies cleanly and passes make installcheck-world. My concern with this is its performance. As a user I would expect using this feature to enable queries to run faster than they would simply pulling the full table. I tested on tables ranging from 10k rows up to 10 million with the same basic result that using FETCH FIRST N PERCENT is slower than using the full table. At best it ran slightly slower than querying the full table, at worst it increased execution times by 1400% when using a large high percentage (95%). Testing was on a DigitalOcean 2 CPU, 2GB RAM droplet. All queries were executed multiple times (6+) with EXPLAIN (ANALYZE, COSTS) and \timing enabled; query plans provided are from the faster end of each query. Create the test table: DROP TABLE IF EXISTS r10mwide; CREATE TABLE r10mwide AS SELECT generate_series(1, 10000000)::INT AS id, random() AS v1, random() AS v2, random() AS v3, random() AS v4, random() AS v5, random() AS v6, random() AS v7, random() AS v8, random() AS v9, random() AS v10 ; VACUUM ANALYZE; Start with a baseline query from the full table; the limiting will be done within the CTE, as I would typically do in a production query. The outer query does basic aggregates. Below I provide each full query for easy copy/paste but the only real change is how the inner results are limited. This runs in 2.2s off the full table. EXPLAIN (ANALYZE, COSTS) WITH t AS ( SELECT id, v1, v2 FROM r10mwide ) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t ; QUERY PLAN - Finalize Aggregate (cost=227192.07..227192.08 rows=1 width=24) (actual time=2230.419..2230.419 rows=1 loops=1) -> Gather (cost=227191.84..227192.05 rows=2 width=72) (actual time=2228.321..2231.225 rows=3 loops=1) Workers Planned: 2 Workers Launched: 2 -> Partial Aggregate (cost=226191.84..226191.85 rows=1 width=72) (actual time=2218.754..2218.754 rows=1 loops=3) -> Parallel Seq Scan on r10mwide (cost=0.00..184524.92 rows=4166692 width=16) (actual time=0.032..934.652 rows=333 loops=3) Planning Time: 0.116 ms Execution Time: 2231.935 ms (8 rows) Time: 2232.459 ms (00:02.232) It did use parallel; since the FETCH FIRST queries apparently won't, I turned that off and ran it again. SET max_parallel_workers_per_gather = 0; New query plan for the full table, no parallel, just over 4 seconds. QUERY PLAN --- Aggregate (cost=342859.21..342859.22 rows=1 width=24) (actual time=4253.861..4253.862 rows=1 loops=1) -> Seq Scan on r10mwide (cost=0.00..242858.60 rows=1060 width=16) (actual time=0.062..1728.346 rows=1000 loops=1) Planning Time: 0.082 ms Execution Time: 4253.918 ms (4 rows) The following uses the explicit row count to pull 100k rows (1%) and gives a massive improvement in speed; this query ranged from 55-120ms. This is the type of speedup I would expect, and it beats the full-table query hands down, regardless of parallel query. EXPLAIN (ANALYZE, COSTS) WITH t AS ( SELECT id, v1, v2 FROM r10mwide FETCH FIRST 100000 ROWS ONLY ) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t ; QUERY PLAN - Aggregate (cost=4428.58..4428.59 rows=1 width=24) (actual time=55.963..55.963 rows=1 loops=1) -> Limit (cost=0.00..2428.57 rows=10 width=20) (actual time=0.041..35.725 rows=10 loops=1) -> Seq Scan on r10mwide (cost=0.00..242858.60 rows=1060 width=20) (actual time=0.039..24.796 rows=10 loops=1) Planning Time: 0.134 ms Execution Time: 56.032 ms (5 rows) Time: 57.262 ms Using FETCH FIRST 1 PERCENT it takes over 7 seconds, 71% slower than querying the full table.
EXPLAIN (ANALYZE, COSTS) WITH t AS ( SELECT id, v1, v2 FROM r10mwide FETCH FIRST 1 PERCENT ROWS ONLY ) SELECT AVG(v1), MIN(v1), AVG(v1 + v2) FROM t ; QUERY PLAN - Aggregate (cost=6857.22..6857.23 rows=1 width=24) (actual time=7112.205..7112.206 rows=1 loops=1) -> Limit (cost=2428.60..4857.19 rows=11 width=20) (actual time=0.036..7047.394 rows=10 loops=1) -> Seq Scan on r10mwide (cost=0.00..242858.60 rows=1060 width=20) (actual time=0.033..3072.881 rows=1000 loops=1) Planning Time: 0.
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Mon, Jun 3, 2019 at 12:18 AM Tomas Vondra wrote: > For a moment I thought we could/should look at the histogram, becase that > could tell us if there are groups "before" the first MCV one, but I don't > think we should do that, for two reasons. Firstly, rare values may not get > to the histogram anyway, which makes this rather unreliable and might > introduce sudden plan changes, because the cost would vary wildly > depending on whether we happened to sample the rare row or not. And > secondly, the rare row may be easily filtered out by a WHERE condition or > something, at which point we'll have to deal with the large group anyway. If the first MCV is in the middle of the first histogram bin, then it's reasonable to think that it would fall into the first group. But if the first MCV is in the middle of the histogram, such an assumption would be ridiculous. Also, I'd like to note that with our default_statistics_target == 100, non-MCV values are not so rare. So, I'm +1 for taking the histogram into account. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [PATCH] Incremental sort (was: PoC: Partial sort)
On Thu, Jul 4, 2019 at 4:25 PM James Coleman wrote: > Process questions: > - Do I need to explicitly move the patch somehow to the next CF? We didn't manage to register it in the current (July) commitfest, so please register it in the next (September) one. > - Since I've basically taken over patch ownership, should I move my > name from reviewer to author in the CF app? And can there be two > authors listed there? Sure, you're a co-author of this patch. Two or more authors can be listed in the CF app; you can find a lot of examples on the list. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
Greetings, * Bruce Momjian (br...@momjian.us) wrote: > On Mon, Jul 8, 2019 at 06:04:46PM -0400, Stephen Frost wrote: > > * Bruce Momjian (br...@momjian.us) wrote: > > > On Mon, Jul 8, 2019 at 05:41:51PM -0400, Stephen Frost wrote: > > > > * Bruce Momjian (br...@momjian.us) wrote: > > > > > Well, if it was a necessary features, I assume TLS 1.3 would have > > > > > found > > > > > a way to make it secure, no? Certainly they are not shipping TLS 1.3 > > > > > with a known weakness. > > > > > > > > As discussed below- this is about moving goalposts, and that's, in part > > > > at least, why re-keying isn't a *necessary* feature of TLS. As the > > > > > > I agree we have to allow rekeying and allow multiple unlocked keys in > > > the server at the same time. The open question is whether encrypting > > > different data with different keys and different unlock controls is > > > possible or useful. > > > > I'm not sure if there's really a question about if it's *possible*? As > > for if it's useful, I agree there's some debate. > > Right, it is easily possible to keep all keys unlocked, but the value is > minimal, and the complexity will have a cost, which is my point. Having them all unlocked but only accessible to certain privileged processes is very different from having them unlocked and available to every backend process. > > > > amount of data you transmit over a given TLS connection increases > > > > though, the risk increases and it would be better to re-key. How much > > > > better? That depends a great deal on if someone is trying to mount an > > > > attack or not. > > > > > > Yep, we need to allow rekey. > > > > Supporting a way to rekey is definitely a good idea. > > It is a requirement, I think. We might have problem tracking exactly > what key _version_ each table (or 8k block), or WAL file are. :-( > Ideally we would allow only two active keys, and somehow mark each page > as using the odd or even key at a given time, or something strange. > (Yeah, hand waving here.) Well, that wouldn't be the ideal since it would limit us to some small number of GBs of data written, based on the earlier discussion, right? I'm not sure that I can see through to a system where we are rewriting tables that are out on disk every time we hit 60GB of data written. Or maybe I'm misunderstanding what you're suggesting here..? > > > Uh, well, renaming the user was a big problem, but that is the only case > > > I can think of. I don't see that as an issue for block or WAL sequence > > > numbers. If we want to use a different nonce, we have to find a way to > > > store it or look it up efficiently. Considering the nonce size, I don't > > > see how that is possible. > > > > No, this also meant that, as an attacker, I *knew* the salt ahead of > > time and therefore could build rainbow tables specifically for that > > salt. I could also use those *same* tables for any system where that > > user had an account, even if they used different passwords on different > > systems... > > Yes, 'postgres' can be used to create a nice md5 rainbow table that > works on many servers --- good point. Are rainbow tables possible with > something like AES? I'm not a cryptographer, just to be clear... but it sure seems like if you know what the nonce is, and have a strong idea about at least what some of the contents are, then you could work to pre-calculate a portion of the encrypted data and be able to determine the key based on that.
> > I appreciate that *some* of this might not be completely relevant for > > the way a nonce is used in cryptography, but I'd be very surprised to > > have a cryptographer tell me that a deterministic nonce didn't have > > similar issues or didn't reduce the value of the nonce significantly. > > This post: > > > https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm > > says: > > GCM is a variation on Counter Mode (CTR). As you say, with any variant > of Counter Mode, it is essential that the Nonce is not repeated with > the same key. Hence CTR mode Nonces often include either a counter or > a timer element: something that is guaranteed not to repeat over the > lifetime of the key. > > CTR is what we use for WAL. 8k pages, we would use CBC, which says we > need a random nonce. I need to dig deeper into ECB mode attack. That page also says: Using a random IV / nonce for GCM has been specified as an official recommendation by - for instance - NIST. If anybody suggests differently then that's up to them. and a recommendation by NIST certainly holds a lot of water, at least for me. They also have a recommendation regarding the amount of data to encrypt with the same key, and that number is much lower than the 96-bit randomness of the nonce, with a recommendation to use a cryptographically sound random, meaning that the chances of a duplicate are extremely low. > > > Uh, well, you are
Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)
On Mon, Jul 8, 2019 at 06:23:13PM -0400, Bruce Momjian wrote: > Yes, 'postgres' can be used to create a nice md5 rainbow table that > works on many servers --- good point. Are rainbow tables possible with > something like AES? > > > I appreciate that *some* of this might not be completely relevant for > > the way a nonce is used in cryptography, but I'd be very surprised to > > have a cryptographer tell me that a deterministic nonce didn't have > > similar issues or didn't reduce the value of the nonce significantly. > > This post: > > > https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm > > says: > > GCM is a variation on Counter Mode (CTR). As you say, with any variant > of Counter Mode, it is essential that the Nonce is not repeated with > the same key. Hence CTR mode Nonces often include either a counter or > a timer element: something that is guaranteed not to repeat over the > lifetime of the key. > > CTR is what we use for WAL. 8k pages, we would use CBC, which says we > need a random nonce. I need to dig deeper into ECB mode attack. Looking here: https://stackoverflow.com/questions/36760973/why-is-random-iv-fine-for-aes-cbc-but-not-for-aes-gcm I think the issue is that we can't use a _counter_ for the nonce since each page-0 of each table would use the same nonce, and each page-1, etc. I assume we would use the table oid and page number as the nonce. We can't use the database oid since we copy the files from one database to another via file system copy and not through the shared buffer cache where they would be re-encrypted. Using relfilenode seems dangerous. For WAL I think it would be the WAL segment number. It would be nice to mix that with the "Database system identifier:", but are these the same on primary and replicas? -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
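A minimal C sketch of the deterministic nonce layout described above: table OID plus block number packed into a 16-byte IV, with the low-order bytes left zero (for CTR they would act as the within-page block counter). The function and constant names are hypothetical; nothing like this exists in the tree, and the file-system-copy caveat above applies — the same (OID, block) pair must never be encrypted twice under the same key, which also makes re-encrypting a modified page under an unchanged IV a concern for CTR.

	#include <stdint.h>
	#include <string.h>

	#define TDE_IV_LEN 16

	/*
	 * Hypothetical packing of the deterministic per-page nonce: table OID in
	 * bytes 0-3, block number in bytes 4-7, remaining bytes zeroed.  This
	 * only illustrates the layout being discussed, not any agreed design.
	 */
	static void
	tde_make_page_iv(uint32_t table_oid, uint32_t block_no,
					 unsigned char iv[TDE_IV_LEN])
	{
		memset(iv, 0, TDE_IV_LEN);
		memcpy(iv, &table_oid, sizeof(uint32_t));
		memcpy(iv + 4, &block_no, sizeof(uint32_t));
	}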