Re: New GUC to sample log queries
On 11/18/18 12:47 AM, Dmitry Dolgov wrote:
> This patch went through the last few commitfests without any activity, but
> cfbot says it still applies and doesn't break any tests. From what I see
> there was some agreement about naming, so the patch indeed needs more
> review. Could anyone from the reviewers (Vik?) confirm that it's in a good
> shape?

Thanks Dmitry for bringing this up.

For me, the question is how to name this GUC? Is "log_sample_rate" enough? I am not sure, because it is only related to queries logged by log_min_duration_statement.

Also, I wonder if we should use the same logic for other parameters, such as:

log_statement_stats
log_parser_stats
log_planner_stats
log_executor_stats

It was mentioned in this thread: https://www.postgresql.org/message-id/20180710183828.GB3890%40telsasoft.com
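For illustration, assuming the proposed name is kept, the pair of settings might be used like this (a sketch; log_sample_rate is the patch's proposed GUC, not an existing one):

-- log only ~25% of the statements that exceed the duration threshold
SET log_min_duration_statement = '100ms';
SET log_sample_rate = 0.25;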
Re: zheap: a new storage format for PostgreSQL
On Sat, Nov 17, 2018 at 8:51 AM Adam Brusselback wrote:
> > I don't know how much what I write on this thread is read by others or
> > how useful this is for others who are following this work
>
> I've been following this thread and many others like it, silently soaking
> it up, because I don't feel like I'd have anything useful to add in most
> cases. It is very interesting seeing the development take place though, so
> just know it's appreciated at least from my perspective.

I'm also following the development and have hopes about it going forward. There are not many low-level details I can comment on though :)

In PostGIS workloads,

UPDATE table SET geom = ST_CostlyFunction(geom, magicnumber);

is one of the biggest time-eaters that happen upon initial load and cleanup of your data. It is commonly followed by

CLUSTER table USING table_geom_idx;

to make sure you're back at full speed and no VACUUM is needed, and your table (usually static after that) is more-or-less spatially ordered.

I see that zheap can remove the need for VACUUM, which is a big win already. If you can do something that allows the reordering of tuples according to an index to happen during an UPDATE that rewrites most of the table, that would be a game changer :)

Another story is the Visibility Map and Index-Only Scans. Right now there is a huge gap between the insert of rows and the moment they are available for index-only scans, as VACUUM is required. Do I understand correctly that for zheap this can all be inverted, and UNDO can become an "invisibility map" that may be quite small and discarded quickly?

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa
Re: [HACKERS] pgbench - allow to store select results into variables
Hello Alvaro, I think this patch's Command->lines would benefit from using PQExpBuffer (or maybe StringInfo?) for the command string instead of open-coding string manipulation and allocation. [...] Ok. I'm not sure that Command->first_line is really all that useful. It seems we go to a lot of trouble to keep it up to date. Isn't it easier to chop Command->lines at the first newline when it is needed? Hmmm, it is needed quite often (about 12 times) to report errors, that would mean having to handle the truncation in many places, so I felt it was worth the trouble. Ok, as long as we don't repeat the work during script execution. Sure, the point of first_line is that it is computed once at parse time. Attached a v23 with PQExpBuffer for managing lines. I've also added a function to compute the summary first line, which handles carriage-return. -- Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index b5e3a62a33..246944ea79 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -954,6 +954,87 @@ pgbench options d + + + \cset [prefix] + + + + + This command may be used to end SQL queries, replacing an embedded + semicolon (\;) within a compound SQL command. + + + + When this command is used, the preceding SQL query is expected to + return one row, the columns of which are stored into variables named after + column names, and prefixed with prefix if provided. + + + + The following example sends three queries as one compound SQL command, + inducing one message sent at the protocol level. + The result of the second query is stored into variable two, + whereas the results of the other queries are discarded. + +-- compound of 3 queries +SELECT 1 AS one \; SELECT 2 AS two \cset +SELECT 2; + + + + + +\cset does not work when empty SQL queries appear +within a compound SQL command. + + + + + + + + \gset [prefix] + + + + + This commands may be used to end SQL queries, replacing a final semicolon + (;). + + + + When this command is used, the preceding SQL query is expected to + return one row, the columns of which are stored into variables named after + column names, and prefixed with prefix if provided. + + + + The following example puts the final account balance from the first query + into variable abalance, and fills variables + p_two and p_three + with integers from the last query. + The result of the second query is discarded. + +UPDATE pgbench_accounts + SET abalance = abalance + :delta + WHERE aid = :aid + RETURNING abalance \gset +-- compound of two queries +SELECT 1 \; +SELECT 2 AS two, 3 AS three \gset p_ + + + + + +\gset does not work when empty SQL queries appear +within a compound SQL command. 
+ + + + + \if expression \elif expression diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c index 73d3de0677..60d4c95be7 100644 --- a/src/bin/pgbench/pgbench.c +++ b/src/bin/pgbench/pgbench.c @@ -490,12 +490,15 @@ static const char *QUERYMODE[] = {"simple", "extended", "prepared"}; typedef struct { - char *line; /* text of command line */ + char *first_line; /* first line for short display */ + PQExpBufferData lines; /* full multi-line text of command */ int command_num; /* unique index of this Command struct */ int type; /* command type (SQL_COMMAND or META_COMMAND) */ MetaCommand meta; /* meta command identifier, or META_NONE */ int argc; /* number of command words */ char *argv[MAX_ARGS]; /* command word list */ + int compound; /* last compound command (number of \;) */ + char **gset; /* per-compound command prefix */ PgBenchExpr *expr; /* parsed expression, if needed */ SimpleStats stats; /* time spent in this command */ } Command; @@ -1718,6 +1721,107 @@ valueTruth(PgBenchValue *pval) } } +/* read all responses from backend, storing into variable or discarding */ +static bool +read_response(CState *st, char **gset) +{ + PGresult *res; + int compound = 0; + + while ((res = PQgetResult(st->con)) != NULL) + { + switch (PQresultStatus(res)) + { + case PGRES_COMMAND_OK: /* non-SELECT commands */ + case PGRES_EMPTY_QUERY: /* may be used for testing no-op overhead */ +if (gset[compound] != NULL) +{ + fprintf(stderr, + "client %d file %d command %d compound %d: " + "\\gset/cset expects a row\n", + st->id, st->use_file, st->command, compound); + st->ecnt++; + return false; +} +break; /* OK */ + + case PGRES_TUPLES_OK: +if (gset[compound] != NULL) +{ + /* store result into variables if required */ + int ntuples = PQntuples(res), + nfields = PQnfields(res), + f; +
Re: New GUC to sample log queries
> On Sun, 18 Nov 2018 at 10:52, Adrien Nayrat wrote:
>
> For me, the question is how to name this GUC? Is "log_sample_rate" enough?
> I am not sure, because it is only related to queries logged by
> log_min_duration_statement.

Since it's hard to come up with a concise name that mentions the sampling rate in the context of log_min_duration_statement, I think it's fine to name this configuration "log_sample_rate", as long as its dependency on log_min_duration_statement is clearly explained in the documentation.
[PATCH] Allow UNLISTEN during recovery
Hi all,

Here is a tiny patch removing PreventCommandDuringRecovery() for UNLISTEN. See previous discussion in https://www.postgresql.org/message-id/CADT4RqBweu7QKRYAYzeRW77b%2BMhJdUikNe45m%2BfL4GJSq_u2Fg%40mail.gmail.com .

In a nutshell, this prevents an error being raised when UNLISTEN is issued during recovery. The operation is a no-op (since LISTEN is still disallowed). The logic here is that some clients (namely Npgsql) issue UNLISTEN * to clear connection state (in the connection pool), but this needlessly breaks when the backend is in recovery.

On a related note, there currently doesn't seem to be a good way for clients to know whether the backend is in recovery. As a backend can come out of recovery at any point, perhaps an asynchronous ParameterStatus announcing this state change could be useful.

Hopefully this also qualifies for backporting to earlier version branches.

Shay

diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 970c94ee80..3efd262cb8 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -629,7 +629,7 @@ standard_ProcessUtility(PlannedStmt *pstmt,
 		{
 			UnlistenStmt *stmt = (UnlistenStmt *) parsetree;
 
-			PreventCommandDuringRecovery("UNLISTEN");
+			/* allow UNLISTEN during recovery, which is a noop */
 			CheckRestrictedOperation("UNLISTEN");
 			if (stmt->conditionname)
 				Async_Unlisten(stmt->conditionname);
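For illustration, the client-side pattern at issue is simply this (a sketch):

-- issued by the connection pool when a pooled connection is reset;
-- with the patch this is a harmless no-op on a server in recovery
-- instead of raising an error
UNLISTEN *;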
Re: _isnan() on Windows
On 11/17/18 2:46 PM, Andres Freund wrote:
> Hi,
>
> On 2018-07-12 11:28:46 -0400, Andrew Dunstan wrote:
> > On 07/12/2018 10:38 AM, Tom Lane wrote:
> > > Andrew Dunstan writes:
> > > > On 07/12/2018 10:20 AM, Tom Lane wrote:
> > > > > bowerbird and hamerkop have some gripes like this:
> > > > > bowerbird | c:\perl64\lib\core\win32.h(218): warning C4005: 'isnan' : macro redefinition (src/pl/plperl/SPI.c) [G:\prog\bf\root\HEAD\pgsql.build\plperl.vcxproj]
> > > > We actually undef a bunch of things in plperl.h to keep the compiler
> > > > quiet. Maybe we need to add this to the list?
> > > Perhaps. But how do we tell the platforms where we should do that from
> > > the ones where we shouldn't?
> > In the _MSCVER section:
> >
> > #ifdef isnan
> > #undef isnan
> > #endif
> >
> > By inspection the perl header is just defining it to _isnan, for every
> > MSC version.
> Let's try undefining it before plperl.h then? It'd be nice to get rid
> of those warnings...

done.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: pg_dumpall --exclude-database option
On 11/17/18 9:55 AM, Alvaro Herrera wrote:
> The comment in expand_dbname_patterns is ungrammatical and mentions
> "OID" rather than "name", so I suggest
>
> /*
>  * The loop below might sometimes result in duplicate entries in the
>  * output name list, but we don't care.
>  */

Will fix.

> I'm not sure this is grammatical either:
>
>   exclude databases whose name matches PATTERN
>
> I would have written it like this:
>
>   exclude databases whose names match PATTERN
>
> but I'm not sure (each database has only one name, of course, but
> aren't you talking about multiple databases there?)

I think the original is grammatical.

> Other than that, the patch seems fine to me -- I tested and it works as
> intended.
>
> Personally I would say "See also expand_table_name_patterns" instead of
> "This is similar to code in pg_dump.c for handling matching table
> names.", as well as mention this function in expand_table_name_patterns'
> comment. (No need to mention expand_schema_name_patterns, since these
> are adjacent.) But this is mostly stylistic and left to your own
> judgement.
>
> In the long run, I think we should add an option to processSQLNamePattern
> to use OR instead of AND, which would fix both this problem as well as
> pg_dump's.

I don't think that's important enough to stall this patch.

Thanks for the review.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: doc - improve description of default privileges
Hello Tom, Thanks for this precise feedback. Progress on this patch seems to be blocked on the question of whether we want to keep enlarging the amount of psql-specific information in the GRANT reference page, or move that all somewhere else. Yep. FWIW, I think I agree with Peter's position that moving it somewhere else is the better option. Section 5.6 "Privileges" seems like a reasonable choice. Ok. * Perhaps we could fix Peter's complaint about the "Owner" column by relabeling it "All Privileges". Ok. I'd be inclined to label the last column "Default PUBLIC Privileges", too, if we can fit that in. Ok. * The phrase "relation-like objects" seems way too vague, especially since one has to read it as excluding sequences, which surely are relations for most purposes. Is there a good reason not to just leave that entry as "TABLE", full stop? Or maybe it could be "TABLE, VIEW, etc" or some such. Ok. * I don't think the use of "hardcoded" adds anything. Hmmm. As "default privileges" can be altered, the point is to describe the "default default privileges", but this looks absurd, hence the look for something to add the idea that there is another one. ISTM that removing "hardcoded" without replacing it makes the thing slightly ambiguous. No big deal. * Is it worth adding another table matching privilege names ("INSERT") with their aclitem letters ("a"), rather than having the semi-formal format currently appearing in grant.sgml? Indeed I thought about that, because the description is not easy to read. There's also some related material in 9.25 with the aclitem functions; it'd be worth unifying that too maybe. I've put a reference to it at least. Attached v4: - moves the table to the privileges section - updates the table column headers - adds a privilege/aclitem letter mapping table - adds some appropriate links towards psql & aclitem -- Fabien.diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml index c8268222af..5fbaacc214 100644 --- a/doc/src/sgml/ddl.sgml +++ b/doc/src/sgml/ddl.sgml @@ -1509,6 +1509,192 @@ REVOKE ALL ON accounts FROM PUBLIC; privilege. For details see the and reference pages. + + +shows how privileges are displayed + by aclitem functions described + in . +summarizes the + default privileges granted to all object's types with + their associated backslash commands. + + + + Privileges' one-letter display + + + + PRIVILEGE + Letter + Objects + + + + + SELECT + r (Read) + + LARGE OBJECT, + TABLE, VIEW... + + + + INSERT + a (Append) + TABLE, VIEW... + + + UPDATE + w (Write) + + LARGE OBJECT, + TABLE, VIEW... + + + + DELETE + d + TABLE, VIEW... + + + TRUNCATE + D (Delete) + TABLE, VIEW... + + + REFERENCES + x + TABLE + + + TRIGGER + t + TABLE, VIEW... + + + CREATE + C + + DATABASE, SCHEMA, + TABLESPACE + + + + CONNECT + c + DATABASE + + + TEMPORARY + T + DATABASE + + + EXECUTE + X (eXecute) + FUNCTION, PROCEDURE + + + USAGE + U + + DOMAIN, FOREIGN ..., + LANGUAGE, SCHEMA, + SEQUENCE, TYPE + + + + + + + + Default access privileges per object's type, as shown by psql + + + + Object's type + psql \-command + All Privileges + Default PUBLIC Privileges + + + + + DATABASE + \l + CTc + Tc + + + DOMAIN + \dD+ + U + U + + + FUNCTION or PROCEDURE + \df+ + X + X + + + FOREIGN DATA WRAPPER + \dew+ + U + + + + FOREIGN SERVER + \des+ + U + + + + LANGUAGE + \dL+ + U + U + + + LARGE OBJECT + + rw + + + + SCHEMA + \dn+ + UC + + + + SEQUENCE + \dp + rwU + + + + TABLE, VIEW... 
+ \dp + arwdDxt + + + + TABLESPACE + \db+ + C + + + + TYPE + \dT+ + U + U + + + + + diff --git a/doc/src/sgml/ref/grant.sgml b/doc/src/sgml/ref/grant.sgml index ff64c7a3ba..bf643bfe92 100644 --- a/doc/src/sgml/ref/grant.sgml +++ b/doc/src/sgml/ref/grant.sgml @@ -533,7 +533,8 @@ GRANT role_name [, ...] TO -The entries shown by \dp are interpreted thus: +The entries shown by psql backslash-commands, +like \dp, are interpreted thus: rolename= -- privileges granted to a role = -- privileges granted to PUBLIC
Re: _isnan() on Windows
Hi,

On 2018-11-18 12:46:14 -0500, Andrew Dunstan wrote:
> On 11/17/18 2:46 PM, Andres Freund wrote:
> > On 2018-07-12 11:28:46 -0400, Andrew Dunstan wrote:
> > > By inspection the perl header is just defining it to _isnan, for every
> > > MSC version.
> > Let's try undefining it before plperl.h then? It'd be nice to get rid
> > of those warnings...
> done.

Thanks!

Greetings,

Andres Freund
docs should mention that max_wal_size default depends on WAL segment size
Hi,

while investigating something on a cluster with a non-default WAL segment size (say 256MB), I've noticed a somewhat surprising behavior of the max_wal_size default. While the docs claim the default is 1GB, the actual default depends on the WAL segment size.

For example, with 256MB WAL segments you end up with this:

test=# show max_wal_size ;
 max_wal_size
--------------
 16GB
(1 row)

This behavior is not entirely new - I've noticed it on 10, before the WAL segment size was moved to initdb (which made it more likely to be used). It's even more surprising there, because it leaves

#max_wal_size = 1GB

in the sample config, while fc49e24f at least emits the actual value.

But I'd say not mentioning this behavior in the docs is a bug.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
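The arithmetic is easy to check from the output above: 16GB is 64 segments of 256MB, and with the standard 16MB segments 64 segments is exactly the documented 1GB. In other words, the effective default is 64 WAL segments, not a fixed size. A quick way to confirm on an affected cluster (assuming it was initialized with something like "initdb --wal-segsize=256"):

SHOW wal_segment_size;  -- 256MB
SHOW max_wal_size;      -- 16GB (64 segments), not the documented 1GB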
Re: Fixing AC_CHECK_DECLS to do the right thing with clang
Hi,

On 2018-11-17 23:32:47 -0500, Tom Lane wrote:
> We've seen repeated complaints about bogus build warnings when using
> "clang": it complains that strlcpy and some related library functions
> haven't been declared. Several of the buildfarm animals exhibit such
> warnings, for instance. That's because Autoconf's AC_CHECK_DECLS macro
> fails to cope with the fact that clang only generates a warning, not
> an error, for the test case that that macro uses. Noah fixed this
> in upstream autoconf several years ago:
>
> http://git.savannah.gnu.org/cgit/autoconf.git/commit/?id=82ef7805faffa151e724aa76c245ec590d174580
>
> However, I'm beginning to despair of the Autoconf crowd ever putting
> out an official new release. Hence, I propose to apply and back-patch
> the attached, which essentially just imports Noah's fix into our
> configure script. I've verified that this does the right thing with
> Fedora 28's version of clang (clang version 6.0.1).

Seems like a good plan. The problem doesn't reproduce for me on debian (using any version of clang), so all I can report is that a patched build still works as it should.

- Andres
Re: docs should mention that max_wal_size default depends on WAL segment size
Hi,

On 2018-11-18 22:16:12 +0100, Tomas Vondra wrote:
> while investigating something on a cluster with a non-default WAL
> segment (say 256MB), I've noticed a somewhat surprising behavior of
> max_wal_size default. While the docs claim the default is 1GB, the
> actual default depends on the WAL segment size.
>
> For example with the 256MB WAL segments, you end up with this:
>
> test=# show max_wal_size ;
>  max_wal_size
> --------------
>  16GB
> (1 row)
>
> This behavior is not entirely new - I've noticed it on 10, before the
> WAL segment size was moved to initdb (which made it more likely to be
> used). It's even more surprising there, because it leaves
>
> #max_wal_size = 1GB
>
> in the sample config, while fc49e24f at least emits the actual value.
>
> But I'd say not mentioning this behavior in the docs is a bug.

Hm, you're not wrong there. Wonder if it'd be better to make it so that the default actually has the effect of being 1GB - I think that ought to be doable?

Greetings,

Andres Freund
Re: [HACKERS] WIP: Aggregation push-down
Antonin Houska writes: > [ agg_pushdown_v8.tgz ] I spent a few hours looking at this today. It seems to me that at no point has there been a clear explanation of what the patch is trying to accomplish, so let me see whether I've got it straight or not. (I suggest that something like this ought to be included in optimizer/README; the patch's lack of internal documentation is a serious deficiency.) -- Conceptually, we'd like to be able to push aggregation down below joins when that yields a win, which it could do by reducing the number of rows that have to be processed at the join stage. Thus consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z; We can't simply aggregate during the scan of "a", because some values of "x" might not appear at all in the input of the naively-computed aggregate (if their rows have no join partner in "b"), or might appear multiple times (if their rows have multiple join partners). So at first glance, aggregating before the join is impossible. The key insight of the patch is that we can make some progress by considering only grouped aggregation: if we can group "a" into sets of rows that must all have the same join partners, then we can do preliminary aggregation within each such group, and take care of the number-of-repetitions problem when we join. If the groups are large then this reduces the number of rows processed by the join, at the cost that we might spend time computing the aggregate for some row sets that will just be discarded by the join. So now we consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?; and ask what properties the grouping column(s) "?" must have to make this work. I believe we can say that it's OK to push down through a join if its join clauses are all of the form "a.y = b.z", where either a.y or b.z is listed as a GROUP BY column *and* the join operator is equality for the btree opfamily specified by the SortGroupClause. (Note: actually, SortGroupClause per se mentions an equality operator, although I think the planner quickly reduces that to an opfamily specification.) The concerns Robert had about equality semantics are certainly vital, but I believe that the logic goes through correctly as long as the grouping equality operator and the join equality operator belong to the same opfamily. Case 1: the GROUP BY column is a.y. Two rows of "a" whose y values are equal per the grouping operator must join to exactly the same set of "b" rows, else transitivity is failing. Case 2: the GROUP BY column is b.z. It still works, because the set of "a" rows that are equal to a given z value is well-defined, and those rows must join to exactly the "b" rows whose z entries are equal to the given value, else transitivity is failing. Robert postulated cases like "citext = text", but that doesn't apply here because no cross-type citext = text operator could be part of a well-behaved opfamily. What we'd actually be looking at is either "citextvar::text texteq textvar" or "citextvar citexteq textvar::citext", and the casts prevent these expressions from matching GROUP BY entries that have the wrong semantics. However: what we have proven here is only that we can aggregate across a set of rows that must share the same join partners. We still have to be able to handle the case that the rowset has more than one join partner, which AFAICS means that we must use partial aggregation and then apply an "aggmultifn" (or else apply the aggcombinefn N times). 
We can avoid that and use plain aggregation when we can prove the "b" side of the join is unique, so that no sets of rows will have to be merged post-join; but ISTM that that reduces the set of use cases to be too small to be worth such a complex patch. So I'm really doubtful that we should proceed forward with only that case available. Also, Tomas complained in the earlier thread that he didn't think grouping on the join column was a very common use-case in the first place. I share that concern, but I think we could extend the logic to the case that Tomas posited as being useful: SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z; where the join column b.id is unique. If we group on a.y (using semantics compatible with the join operator and the uniqueness constraint), then all "a" rows in a given group will join to exactly one "b" row that necessarily has exactly one grouping value, so this group can safely be aggregated together. We might need to combine it post-join with other "b" rows that have equal "z" values, but we can do that as long as we're okay with partial aggregation. I think this example shows why the idea is far more powerful with partial aggregation than without. -- In short, then, I don't have much use for the patch as presented in this thread, without "aggmultifn". That might be OK as the first phase in a multi-step patch, but not as a production feature. I also have no use for the stuff depending on bitwise equality rather than the user-visible operators; tha
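To make the transform concrete, here is a hand-written SQL equivalent for the simple sum() case (a sketch over hypothetical tables a(x, y) and b(z); sum is special in that applying its combine step N times is itself a sum, so no aggmultifn is needed when grouping by the join column):

-- hand-written push-down of
--   SELECT b.z, sum(a.x) FROM a JOIN b ON a.y = b.z GROUP BY b.z;
-- valid only when the grouping operator and the join operator belong
-- to the same btree opfamily, per the argument above
SELECT b.z, sum(pa.s) AS total
  FROM (SELECT y, sum(x) AS s        -- partial aggregation below the join
          FROM a
         GROUP BY y) pa
  JOIN b ON pa.y = b.z               -- each group meets all its join partners
 GROUP BY b.z;                       -- combine partial results above the join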
logical decoding vs. VACUUM FULL / CLUSTER on table with TOAST-ed data
Hi, It seems we have pretty annoying problem with logical decoding when performing VACUUM FULL / CLUSTER on a table with toast-ed data. The trouble is that the rewritten heap is WAL-logged using XLOG/FPI records, the TOAST data is logged as regular INSERT records. XLOG/FPI is ignored in logical decoding, and so reorderbuffer never gets those records. But we do decode the TOAST data, and reorderbuffer stashes them in toast_hash hash table. Which gets reset only when handling a row from the main heap, and that never arrives. So we end up stashing all the TOAST data in memory :-( So do VACUUM FULL (or CLUSTER) on a sufficiently large table, and you're likely to break any logical replication connection. And it does not matter if you replicate this particular table. Luckily enough, this can leverage some of the pieces introduced by e9edc1ba which was meant to deal with rewrites of system tables, and in raw_heap_insert it added this: /* * The new relfilenode's relcache entrye doesn't have the necessary * information to determine whether a relation should emit data for * logical decoding. Force it to off if necessary. */ if (!RelationIsLogicallyLogged(state->rs_old_rel)) options |= HEAP_INSERT_NO_LOGICAL; As raw_heap_insert is used only for heap rewrites, we can simply remove the if condition and use the HEAP_INSERT_NO_LOGICAL flag for all TOAST data logged from here. This does fix the issue, because we still decode the TOAST changes but there are no data and so if (change->data.tp.newtuple != NULL) { dlist_delete(&change->node); ReorderBufferToastAppendChunk(rb, txn, relation, change); } ends up not stashing the change in the hash table. It's imperfect, because we still decode the changes (and stash them to disk), but ISTM that can be fixed by tweaking DecodeInsert a bit to just ignore those changes entirely. Attached is a PoC patch with these two fixes. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c index c5db75afa1..ce6f9ed117 100644 --- a/src/backend/access/heap/rewriteheap.c +++ b/src/backend/access/heap/rewriteheap.c @@ -658,13 +658,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup) if (!state->rs_use_wal) options |= HEAP_INSERT_SKIP_WAL; - /* - * The new relfilenode's relcache entrye doesn't have the necessary - * information to determine whether a relation should emit data for - * logical decoding. Force it to off if necessary. 
- */ - if (!RelationIsLogicallyLogged(state->rs_old_rel)) - options |= HEAP_INSERT_NO_LOGICAL; + /* do not decode TOAST data for heap rewrites */ + options |= HEAP_INSERT_NO_LOGICAL; heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL, options); diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c index afb497227e..f23cb120e8 100644 --- a/src/backend/replication/logical/decode.c +++ b/src/backend/replication/logical/decode.c @@ -665,6 +665,9 @@ DecodeAbort(LogicalDecodingContext *ctx, XLogRecordBuffer *buf, static void DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) { + Size datalen; + char *tupledata; + Size tuplelen; XLogReaderState *r = buf->record; xl_heap_insert *xlrec; ReorderBufferChange *change; @@ -672,6 +675,10 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) xlrec = (xl_heap_insert *) XLogRecGetData(r); + /* ignore insert records without new tuples */ + if (!(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE)) + return; + /* only interested in our database */ XLogRecGetBlockTag(r, 0, &target_node, NULL, NULL); if (target_node.dbNode != ctx->slot->data.database) @@ -690,17 +697,13 @@ DecodeInsert(LogicalDecodingContext *ctx, XLogRecordBuffer *buf) memcpy(&change->data.tp.relnode, &target_node, sizeof(RelFileNode)); - if (xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE) - { - Size datalen; - char *tupledata = XLogRecGetBlockData(r, 0, &datalen); - Size tuplelen = datalen - SizeOfHeapHeader; + tupledata = XLogRecGetBlockData(r, 0, &datalen); + tuplelen = datalen - SizeOfHeapHeader; - change->data.tp.newtuple = - ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); + change->data.tp.newtuple = + ReorderBufferGetTupleBuf(ctx->reorder, tuplelen); - DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple); - } + DecodeXLogTuple(tupledata, datalen, change->data.tp.newtuple); change->data.tp.clear_toast_afterwards = true;
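For reference, a minimal reproducer for the memory blow-up might look like this (a sketch; sizes are arbitrary, and SET STORAGE EXTERNAL merely forces the payload out of line so that TOAST records are actually generated):

CREATE TABLE t (id int, payload text);
ALTER TABLE t ALTER COLUMN payload SET STORAGE EXTERNAL;
INSERT INTO t SELECT i, repeat(md5(i::text), 100)
  FROM generate_series(1, 100000) i;
-- with a logical replication slot active, the rewrite emits the TOAST
-- data as regular INSERT records but the main heap only as XLOG/FPI,
-- so the reorderbuffer stashes all the TOAST chunks with nothing to
-- flush them
VACUUM FULL t;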
Re: Fixing AC_CHECK_DECLS to do the right thing with clang
Andres Freund writes:
> Seems like a good plan. The problem doesn't reproduce for me on debian
> (using any version of clang), so all I can report is that a patched
> build still works as it should.

Interesting. It's hardly surprising that the problem would occur only on some platforms, since if the system header declares the function then the problem isn't visible. But I'm surprised that some Debian boxes would show it and some not. Still, a closer look at the buildfarm shows both clang-on-Debian members with the warning (eg gull) and clang-on-Debian members without (eg yours).

regards, tom lane
Re: docs should mention that max_wal_size default depends on WAL segment size
On 11/18/18 10:22 PM, Andres Freund wrote:
> Hi,
>
> On 2018-11-18 22:16:12 +0100, Tomas Vondra wrote:
>> while investigating something on a cluster with a non-default WAL
>> segment (say 256MB), I've noticed a somewhat surprising behavior of
>> max_wal_size default. While the docs claim the default is 1GB, the
>> actual default depends on the WAL segment size.
>>
>> For example with the 256MB WAL segments, you end up with this:
>>
>> test=# show max_wal_size ;
>>  max_wal_size
>> --------------
>>  16GB
>> (1 row)
>>
>> This behavior is not entirely new - I've noticed it on 10, before the
>> WAL segment size was moved to initdb (which made it more likely to be
>> used). It's even more surprising there, because it leaves
>>
>> #max_wal_size = 1GB
>>
>> in the sample config, while fc49e24f at least emits the actual value.
>>
>> But I'd say not mentioning this behavior in the docs is a bug.
>
> Hm, you're not wrong there. Wonder if it'd be better to make it so that
> the default actually has the effect of being 1GB - I think that ought to
> be doable?

I've actually thought about that, initially, but I'm not sure that's a good idea, for a couple of reasons:

(1) WAL segments can go up to 1GB now, and we need at least two of those, so there are at least some cases where we can't stick to the 1GB max_wal_size default.

(2) If you're messing with the WAL segment size, chances are you either have a very large/busy or a very small system, and so the changes to max_wal_size probably make sense.

(3) If we decide to enforce the 1GB default, we should probably backpatch it, otherwise you'll get surprisingly different behavior on different versions. Not great. But backpatching may influence existing systems quite a bit.

Of course, there are probably few systems tweaking the WAL segment size, and I'd expect them to set max_wal_size explicitly (rendering the default irrelevant). So, not sure.

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: has_table_privilege for a table in unprivileged schema causes an error
"David G. Johnston" writes: > On Sat, Aug 25, 2018 at 8:29 PM, Tom Lane wrote: >> There's a backwards-compatibility argument for not changing this behavior, >> sure, but I don't buy the other arguments you made here. And I don't >> think there's all that much to the backwards-compatibility argument, >> considering that the current behavior is to fail. > +1; any code using these functions can reasonably be expected to handle > true and false responses. Converting a present error into a false under > that presumption results in similar, and arguably improved, semantics. I took a closer look at what's going on here, and realized that the existing code is a bit confused, or confusing, about whose privileges it's testing. Consider this exchange (with HEAD, not the patch): regression=# CREATE SCHEMA testns; CREATE SCHEMA regression=# CREATE TABLE testns.t1 (f1 int); CREATE TABLE regression=# CREATE USER regress_priv_user1; CREATE ROLE regression=# SELECT has_table_privilege('regress_priv_user1', 'testns.t1', 'SELECT'); has_table_privilege - f (1 row) regression=# \c - regress_priv_user1 You are now connected to database "regression" as user "regress_priv_user1". regression=> SELECT has_table_privilege('regress_priv_user1', 'testns.t1', 'SELECT'); ERROR: permission denied for schema testns That is, whether you get an error or not depends on your *own* privileges for the schema, not those of the user you're supposedly testing. This seems rather inconsistent. If we apply the proposed patch, then (I'd expect, but haven't tested) you should always get the same results from has_table_privilege('u', 's.t', 'p') as from doing has_table_privilege('s.t', 'p') as user u. However ... if we do that, then Robert's previous concern about information leakage comes to life, because an unprivileged user could run has_table_privilege('postgres', 'testns.t1', 'SELECT') and thereby find out whether t1 exists regardless of whether he has any privilege for testns. Mind you, I think that argument is mostly bunk, because that same user can trivially find out whether t1 exists, and what its privilege grants are, just by looking into pg_catalog. So there's no real security improvement from disallowing this. Anyway, it seems to me that the principle of least astonishment suggests that the results of has_table_privilege should depend only on the privileges of the user allegedly being tested, not those of the calling user. Or if we decide differently, somebody has to write some doc text explaining that it's not so. Getting all the way to that might be a bit difficult though. For example, in SELECT has_function_privilege('someuser', 'func(schema.type)', 'usage'); the lookup and permission check for "schema" are buried a long way down from the has_function_privilege code. It'd take a lot of refactoring to keep it from throwing an error. I guess maybe you could argue that privileges on the type are a different question from privileges on the function, but it still seems ugly. A related thought is that it's not clear whether there should be any behavioral difference between SELECT has_function_privilege('someuser', 'func(schema.type)'::text, 'usage'); SELECT has_function_privilege('someuser', 'func(schema.type)'::regprocedure, 'usage'); In the latter case, it's entirely unsurprising that the schema lookup is done as the current user. 
However, if we define the first case as being equivalent to the second, then the error that Yugo-san is complaining of is something we can't/shouldn't fix, because certainly "SELECT 'testns.t1'::regclass" is going to throw an error if you lack usage privilege for testns. So at this point I'm not sure what to think, other than that things are inconsistent (and underdocumented). regards, tom lane
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Sun, Nov 18, 2018 at 1:11 AM Alvaro Herrera wrote:
> On 2018-Nov-17, Amit Kapila wrote:
> > On Sat, Nov 17, 2018 at 4:46 PM Alvaro Herrera wrote:
> > > Uh, ouch! Seems to me that this is a high-caliber foot-gun, and will
> > > cause nasty surprises when production stats data disappear inadvertently!
> >
> > What is the alternative in your mind?
>
> Well, as I see this interface, it is as if
>   DELETE FROM ... WHERE queryid = <value>
> where passing a NULL value meant to delete *all* rows regardless of
> queryid, rather than only those where queryid is NULL.
>
> > AFAICS, we have below alternatives:
> >
> > a) Return an error for the NULL value of any argument?
> > b) Silently ignore if there is any NULL value argument and don't
> > remove any stats.
> > c) Just ignore the NULL value parameter and remove the stats based on
> > other parameters.
> >
> > Currently, patch implements option (c), but we can do (a) or (b) as
> > well, however, that can make the API usage bit awkward.
>
> I think option d) is to have more functions (seven functions total), and
> have them all be strict (i.e. don't do anything if one argument is
> NULL). One function takes three arguments, and removes all entries
> matching the three. Other functions take two arguments, and remove
> everything matching those, ignoring the third (so behaving like the
> current NULL). Others take one argument.
>
> Maybe it doesn't make sense to provide all combinations, so it's less
> than seven. But having an interface that's inherently easy to misuse
> makes me a bit nervous.

Following are the combinations that are possible, with function names like pg_stat_statements_reset_by_***:

userid
dbid
queryid
userid, dbid
userid, queryid
dbid, queryid
userid, dbid, queryid -- the existing function name with arguments can cover this

So 6 new functions need to be added to cover all the above cases. IMO, we may need functions for all combinations, because I feel some user may have a requirement for a left-out combination if we choose only some of them.

Regards,
Haribabu Kommi
Fujitsu Australia
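For illustration, hypothetical calls under option d) might look like this (the _by_ names are placeholders following the convention above, not settled API; all functions are strict and reset only the entries matching every argument):

-- hypothetical calls; argument values are examples only
SELECT pg_stat_statements_reset_by_userid(10);
SELECT pg_stat_statements_reset_by_dbid_queryid(16384, 3972942224);
SELECT pg_stat_statements_reset(10, 16384, 3972942224);  -- existing name, all three keys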
Re: spurious(?) warnings in archive recovery
On 13/11/2018 16:34, Andrew Gierth wrote: > So while investigating a case of this warning (in > UpdateMinRecoveryPoint): > > "xlog min recovery request %X/%X is past current point %X/%X" > > I noticed that it is issued even in cases where we know that > minRecoveryPoint is not yet valid, for example because we're waiting to > see XLOG_BACKUP_END before declaring consistency. > > But, you'd think, you shouldn't get this error because any page we > modify during recovery should have been restored from an FPI with a > suitably early LSN? For data pages that is correct, but not for VM or > (iff wal_log_hints or checksums are enabled) FSM pages. > > When we replay an operation that, for example, clears a bit in the VM, > the redo code will read in that VM page from disk, and because we're not > yet consistent and because _clearing_ a VM bit is not in itself > wal-logged and doesn't result in any FPI being generated for the VM > page, it could well read a VM page that has a far-future LSN from the > point of view of replay, and dirty it, causing a later eviction to try > and do UpdateMinRecoveryPoint with that future LSN. > > (I haven't investigated this aspect, but there also appears to be no > protection against torn pages in the VM when checksums are enabled? am I > missing something somewhere?) > > I'm less clear on the exact mechanisms, but when wal_log_hints (or > checksums) is on, FSM pages also get LSNs, sometimes, thanks to > MarkBufferDirtyHint, and at least some code paths can also do > MarkBufferDirty on FSM pages during recovery, which would cause their > eviction with possible future LSNs as with VM pages. > > This means that if you simply do an old-style base backup using > pg_start_backup/rsync/pg_stop_backup (on a sufficiently active system > and taking long enough) and then recover from it, you're likely to get a > log spammed with these errors for no very good reason. > > So it seems to me that issuing this error is a bug if the conditions > described are actually harmless, while if they're not harmless, then > obviously that is a bug. So _something_ needs fixing here, but I'm not > yet sufficiently confident of my analysis to say what. > > Opinions? > > (as a further point, it seems to me that backupEndRequired is a rather > misleadingly named variable, since what _actually_ determines whether an > XLOG_BACKUP_END record is expected is whether backupStartPoint is set. > backupEndRequired seems to change one error message and, questionably, > one decision about whether to do crash recovery before entering archive > recovery, but nothing else.) Bump. I was the originator of this report. I work with Postgres every single day and I was spooked by these warnings. People with much less involvement would probably be terrified. -- Vik Fearing +33 6 46 75 15 36 http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On Fri, Nov 16, 2018 at 03:23:55PM +0900, Michael Paquier wrote:
> I was looking at this patch, and shouldn't we worry about compatibility
> with plugins or utilities which look directly at what's in readRecordBuf
> for the record contents? Let's not forget that the contents of
> XLogReaderState are public.

And a couple of days later, committed. I did not notice at first that the comments of xlogreader.h mention that a couple of areas are private, and readRecordBuf is part of that, so objection withdrawn.

There is a comment in xlog.c (on top of ReadRecord) referring to readRecordBuf which was not removed as part of 7fcbf6a4 when WAL reading was moved to its own facility. The original comment was from 0ffe11a. So I have removed that comment at the same time.

--
Michael
Re: [HACKERS] PoC: full merge join on comparison clause
Alexander Kuzmenkov writes: > [ Inequality-merge-join-v10.patch ] Just thinking about this patch a bit ... I wonder why you were so quick to reject the UNION approach at the outset. This patch is pretty messy, and it complicates a lot of stuff that is quite fundamental to the planner, and you still end up that the only functionality gain is now we can handle full joins whose conditions include a single btree inequality clause. Nor are we doing that remarkably efficiently ... it's pretty much impossible to do it efficiently, in fact, since if the inputs have M and N rows respectively then the output will have something like (M*N)/2 rows. So it seems to me that if we're going to put sweat into this area at all, our ambition ought to be "we'll successfully perform a FULL JOIN with any join clause whatsoever, though it might take O(M*N) time". Now as far as I can tell, the UNION substitution you proposed is completely valid, although it'd be better to phrase the second step as an antijoin. That is, I believe select * from t1 full join t2 on (anything) is exactly equal to select t1.*, t2.* from t1 left join t2 on (anything) union all select t1.*, t2.* from t2 anti join t1 on (anything) There is one fly in the ointment, which is that we will have to run the join clause twice, so it can't contain volatile functions --- but the merge join approach wouldn't handle that case either. Having to read the inputs twice is not good, but we could put them into CTEs, which fixes any problems with volatility below the join and at least alleviates the performance problem. Since we can't currently do any meaningful qual pushdown through full joins, the optimization-fence aspect of a CTE doesn't seem like an issue either. In short, proceeding like the above when we can't find another plan type for a full join seems like it fixes a far wider variety of cases. The possibility that maybe we could do some of those cases a bit faster isn't sufficiently attractive to me to justify also putting in a mechanism like this patch proposes. We only rarely see complaints at all about can't-do-a-full-join problems, and I do not think this patch would fix enough of those complaints to be worthwhile. regards, tom lane
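To make the proposed rewrite concrete, here is a hand-written SQL rendering (a sketch; it assumes hypothetical tables t1(a int) and t2(b int) with an inequality join condition, uses CTEs to stand in for reading each input once, and NOT EXISTS to play the anti join):

WITH m1 AS (SELECT * FROM t1),
     m2 AS (SELECT * FROM t2)
SELECT m1.a, m2.b
  FROM m1 LEFT JOIN m2 ON m1.a < m2.b
UNION ALL
SELECT NULL::int, m2.b                 -- t1 columns are NULL on this side
  FROM m2
 WHERE NOT EXISTS (SELECT 1 FROM m1 WHERE m1.a < m2.b);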
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Mon, Nov 19, 2018 at 10:41:22AM +1100, Haribabu Kommi wrote:
> So 6 new functions need to be added to cover all the above cases.
> IMO, we may need functions for all combinations, because I feel some
> user may have a requirement for a left-out combination if we choose
> only some of them.

That's bloating the interface in my opinion. Another choice #e would be to keep the function as not strict, but defaulting to the current database if NULL is used for the database ID, and defaulting to the current user if NULL is used for the user ID. For the query ID, if NULL is given in input, process all queries matching (database, user), removing nothing if there is no match. If the caller has set up a query ID, the rest of the stats should not be touched.

--
Michael
Re: New GUC to sample log queries
On Sun, Nov 18, 2018 at 12:18:33PM +0100, Dmitry Dolgov wrote:
> Since it's hard to come up with a concise name that mentions the sampling
> rate in the context of log_min_duration_statement, I think it's fine to
> name this configuration "log_sample_rate", as long as its dependency on
> log_min_duration_statement is clearly explained in the documentation.

log_sample_rate looks fine to me as a name.

--
Michael
Re: WIP: Avoid creation of the free space map for small tables
On 11/16/18, Amit Kapila wrote:
> +/* Status codes for the local map. */
> +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */
> +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */
> +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */
>
> Instead of maintaining three states, can't we do with two states
> (Available and Not Available), basically combine 0 and 2 in your case.
> I think it will save some cycles in fsm_local_set, where each time you
> need to initialize all the entries in the map. I think we can argue
> that it is not much overhead, but I think it is better code-wise also
> if we can make it happen with fewer states.

That'd work too, but let's consider this scenario: We have a 2-block table that has no free space. After trying each block, the local cache looks like

0123
TT00

Let's say we have to wait to acquire a relation extension lock, because another backend had already started extending the heap by 1 block. We call GetPageWithFreeSpace() and now the local map looks like

0123
TTA0

By using bitwise OR to set availability, the already-tried blocks remain as they are. With only 2 states, the map would look like this instead:

0123
AAAN

If we assume that an insert into the newly-created block 2 will almost always succeed, we don't have to worry about wasting time re-checking the first 2 full blocks. Does that sound right to you?

> Some assorted comments:
> 1.
>
> -Each heap and index relation, except for hash indexes, has a Free Space Map
> +Each heap relation, unless it is very small, and each index relation,
> +except for hash indexes, has a Free Space Map
> (FSM) to keep track of available space in the relation. It's stored
>
> It appears that line has ended abruptly.

Not sure what you're referring to here.

> 2.
> 	page = BufferGetPage(buffer);
> +	targetBlock = BufferGetBlockNumber(buffer);
>
> 	if (!PageIsNew(page))
> 		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
> -			 BufferGetBlockNumber(buffer),
> +			 targetBlock,
> 			 RelationGetRelationName(relation));
>
> 	PageInit(page, BufferGetPageSize(buffer), 0);
> @@ -623,7 +641,18 @@ loop:
> 	 * current backend to make more insertions or not, which is probably a
> 	 * good bet most of the time. So for now, don't add it to FSM yet.
> 	 */
> -	RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
> +	RelationSetTargetBlock(relation, targetBlock);
>
> Is this related to this patch? If not, I suggest let's do it
> separately if required.

I will separate this out.

> 3.
> static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> -			   uint8 newValue, uint8 minValue);
> +			   uint8 newValue, uint8 minValue);
>
> This appears to be a spurious change.

It was intentional, but I will include it separately as above.

> 4.
> @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size len,
> 	 * target.
> 	 */
> 	targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace);
> +
> +	/*
> +	 * In case we used an in-memory map of available blocks, reset
> +	 * it for next use.
> +	 */
> +	if (targetBlock < HEAP_FSM_CREATION_THRESHOLD)
> +		ClearLocalMap();
>
> How will you clear the local map during error? I think you need to
> clear it in abort path and you can name the function as
> FSMClearLocalMap or something like that.

That sounds right, and I will rename the function that way. For the abort path, were you referring to this or somewhere else?
if (!PageIsNew(page))
    elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
         targetBlock,
         RelationGetRelationName(relation));

> 5.
> +/*#define TRACE_TARGETBLOCK */
>
> Debugging leftover, do you want to retain this and related stuff
> during the development of patch?

I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's useful for development, but I don't particularly care whether it's in the final version.

Also, I found an off-by-one error that caused an unnecessary smgrexists() call in tables with threshold + 1 pages. This will be fixed in the next version.

-John Naylor
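To make the threshold behavior concrete, here is a quick check one could run with the patch applied (a sketch; the exact cutoff comes from HEAP_FSM_CREATION_THRESHOLD, and the expected result is an assumption about the patched behavior):

CREATE TABLE tiny (id int);
INSERT INTO tiny SELECT generate_series(1, 100);
VACUUM tiny;                              -- would normally create the FSM fork
SELECT pg_relation_size('tiny', 'fsm');   -- expected to stay 0 with the patch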
Re: ToDo: show size of partitioned table
On 2018-Jul-23, Pavel Stehule wrote:
> p.s. Another patch can be replacement of relation type from "table" to
> "partitioned table"
>
> postgres=# \dt+
>                              List of relations
> +--------+------------+-------------------+-------+---------+-------------+
> | Schema |    Name    |       Type        | Owner |  Size   | Description |
> +--------+------------+-------------------+-------+---------+-------------+
> | public | data       | partitioned table | pavel | 0 bytes |             |
> | public | data_2016  | table             | pavel | 15 MB   |             |
> | public | data_2017  | table             | pavel | 15 MB   |             |
> | public | data_other | table             | pavel | 11 MB   |             |
> +--------+------------+-------------------+-------+---------+-------------+
> (4 rows)

I think this is a clear improvement. The term "table" was introduced for this case by f0e44751d7 ("Implement table partitioning.") and now the author of that commit supports this change. I used the term "index" for partitioned indexes originally because I was copying the existing term, but now I too think they should say "partitioned indexes" instead, because they are different enough objects from plain indexes.

To be certain I'm not going against some old decision, I dug up Amit's old patches. Turns out he submitted psql's describe.c using the term "partitioned table" on August 10th [1] and then, based on a discussion where Robert suggested calling these new objects "partition roots" instead to avoid confusion, it was changed to "table" in the next submission on August 26th [2]. It seems the right call to have used the term "table" in many places (rather than "partition roots"), but at least in psql's \dt it seems extremely useful to show the type as "partitioned table" instead, because it is one place where the distinction is clearly useful.

In this thread there have been no contrary votes, so I'm pushing this part soon.

[1] https://postgr.es/m/ad16e2f5-fc7c-cc2d-333a-88d4aa446...@lab.ntt.co.jp
[2] https://postgr.es/m/169708f6-6e5a-18d1-707b-1b323e4a6...@lab.ntt.co.jp

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Psql patch to show access methods info
On Sat, Nov 17, 2018 at 11:20:50PM -0300, Alvaro Herrera wrote:
> Here's a rebased version, fixing the rejects, pgindenting, and fixing
> some "git show --check" whitespace issues. Haven't reviewed any further
> than that.

Schema qualifications are missing in many places, and they are added sometimes. The character limit in documentation paragraphs could be more respected as well.

+opereator families associated with whose name matches the pattern are shown.

s/opereator/operator/.

+List procedures () accociated with access method operator families.

s/accociated/associated/.

--
Michael
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On 2018-Nov-19, Michael Paquier wrote:
> On Mon, Nov 19, 2018 at 10:41:22AM +1100, Haribabu Kommi wrote:
> > So 6 new functions need to be added to cover all the above cases.
> > IMO, we may need functions for all combinations, because I feel some
> > user may have a requirement for a left-out combination if we choose
> > only some of them.
>
> That's bloating the interface in my opinion.

I understand. Let's call for a vote from a larger audience. It's important to get this interface right, ISTM.

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: ToDo: show size of partitioned table
On 2018/11/19 11:17, Alvaro Herrera wrote: > On 2018-Jul-23, Pavel Stehule wrote: > >> p.s. Another patch can be replacement of relation type from "table" to >> "partitioned table" >> >> postgres=# \dt+ >> List of relations >> +++---+---+-+-+ >> | Schema |Name| Type| Owner | Size | Description | >> +++---+---+-+-+ >> | public | data | partitioned table | pavel | 0 bytes | | >> | public | data_2016 | table | pavel | 15 MB | | >> | public | data_2017 | table | pavel | 15 MB | | >> | public | data_other | table | pavel | 11 MB | | >> +++---+---+-+-+ >> (4 rows) > > I think this is a clear improvement. The term "table" was introduced > for this case by f0e44751d7 ("Implement table partitioning.") and now > the author of that commit supports this change. I used the term "index" > for partitioned indexes originally because I was copying the existing > term, but now I too think they should say "partitioned indexes" instead, > because they are different enough objects from plain indexes. > > To be certain I'm not going against some old decision, I digged up > Amit's old patches. Turns out he submitted psql's describe.c using the > term "partitioned table" on August 10th [1] and then based on a > discussion where Robert suggested calling these new objects "partition > roots" instead to avoid confusion, it was changed to "table" in the next > submission on August 26th [2]. It seems the right call to have used the > term "table" in many places (rather than "partition roots"), but at > least in psql's \dt it seems extremely useful to show the type as > "partitioned table" instead, because it is one place where the > distinction is clearly useful. > > In this thread there have been no contrary votes, so I'm pushing this > part soon. > > [1] https://postgr.es/m/ad16e2f5-fc7c-cc2d-333a-88d4aa446...@lab.ntt.co.jp > [2] https://postgr.es/m/169708f6-6e5a-18d1-707b-1b323e4a6...@lab.ntt.co.jp Yeah, I agree that showing "partitioned table" for partitioned tables in this case is helpful. Earlier on this thread [1], I had expressed a slight concern about the consistency of mentioning "partitioned" in various outputs, because many error messages say "table" even if the table is partitioned. But now I think that it's orthogonal. We should show "partitioned" where it is helpful. Thanks, Amit [1] https://www.postgresql.org/message-id/5474c8b6-04e7-1afc-97b6-adb7471c2c71%40lab.ntt.co.jp
Re: Psql patch to show access methods info
On 2018-Nov-19, Michael Paquier wrote:
> On Sat, Nov 17, 2018 at 11:20:50PM -0300, Alvaro Herrera wrote:
> > Here's a rebased version, fixing the rejects, pgindenting, and fixing
> > some "git show --check" whitespace issues. Haven't reviewed any further
> > than that.
>
> Schema qualifications are missing in many places, and they are added
> sometimes. The character limit in documentation paragraphs could be more
> respected as well.

Sergey, are you available to fix these issues? Nikita?

Thanks

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: ToDo: show size of partitioned table
On Sun, Nov 18, 2018 at 11:17:37PM -0300, Alvaro Herrera wrote:
> To be certain I'm not going against some old decision, I dug up
> Amit's old patches. Turns out he submitted psql's describe.c using the
> term "partitioned table" on August 10th [1] and then, based on a
> discussion where Robert suggested calling these new objects "partition
> roots" instead to avoid confusion, it was changed to "table" in the next
> submission on August 26th [2]. It seems the right call to have used the
> term "table" in many places (rather than "partition roots"), but at
> least in psql's \dt it seems extremely useful to show the type as
> "partitioned table" instead, because it is one place where the
> distinction is clearly useful.

+1.

> In this thread there have been no contrary votes, so I'm pushing this
> part soon.
>
> [1] https://postgr.es/m/ad16e2f5-fc7c-cc2d-333a-88d4aa446...@lab.ntt.co.jp
> [2] https://postgr.es/m/169708f6-6e5a-18d1-707b-1b323e4a6...@lab.ntt.co.jp

Sorry for digressing, but could you also update \di at the same time so that it shows "partitioned index"? listTables() should be switched to use partitioned tables and partitioned indexes, and permissionsList() has a reference to partitioned tables. While on it, this gives the attached..

--
Michael

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 4ca0db1d0c..8e06097442 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -955,7 +955,7 @@ permissionsList(const char *pattern)
 					  gettext_noop("materialized view"),
 					  gettext_noop("sequence"),
 					  gettext_noop("foreign table"),
-					  gettext_noop("table"),	/* partitioned table */
+					  gettext_noop("partitioned table"),
 					  gettext_noop("Type"));
 	printACLColumn(&buf, "c.relacl");
@@ -3524,8 +3524,8 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 					  gettext_noop("sequence"),
 					  gettext_noop("special"),
 					  gettext_noop("foreign table"),
-					  gettext_noop("table"),	/* partitioned table */
-					  gettext_noop("index"),	/* partitioned index */
+					  gettext_noop("partitioned table"),
+					  gettext_noop("partitioned index"),
 					  gettext_noop("Type"),
 					  gettext_noop("Owner"));
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On 2018/11/17 9:06, Michael Paquier wrote: > On Fri, Nov 16, 2018 at 09:38:40AM -0500, Robert Haas wrote: >> OK, but it seems to me that your version of my patch rearranges the >> code more than necessary. >> >> How about the attached? > > What you are proposing here looks good to me. Thanks! Me too, now that I look at the patch closely. The errors I'd seen in the regression tests were due to an uninitialized oids variable which is fixed in the later patches, not due to "confused memory context switching" as I'd put it [1] and made that the reason for additional changes. Thanks, Amit [1] https://www.postgresql.org/message-id/1be8055c-137b-5639-9bcf-8a2d5fef6e5a%40lab.ntt.co.jp
Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT
At Sat, 17 Nov 2018 11:14:49 -0500, Tom Lane wrote in <23489.1542471...@sss.pgh.pa.us> > I wrote: > > After still further thought, it seems like "if (bytesread == 0)" > > is the right way to proceed. That protects us against incorrectly > > setting reached_eof if the pipe is being run in unbuffered or > > line-buffered mode. (Which copy.c doesn't do, at the moment, > > but I'd just as soon this code didn't need to assume that. > > Also, I'm not 100% convinced that Windows or other platforms > > might not act that way.) In the case where we terminate early > > because we saw an in-band EOF marker, we would also not set reached_eof, > > and again that seems like a good outcome: if we see SIGPIPE it > > must mean that the program kept sending data after the EOF marker, > > and that seems like a good thing to complain about. (It's only > > guaranteed to fail if the program sends more than the current pipe > > buffer-ful, but I don't feel a need to add extra code to check for > > nonempty buffer contents as we exit.) > > Oh, wait, that last bit is backwards: if we see an in-band EOF mark, > and as a consequence exit without having set reached_eof, then the > exit code will think that SIGPIPE is ignorable. So transmitting > more data after an EOF mark will not be complained of, whether > it's within the same bufferload or not. > > Still, I can't get very excited about that. Potentially we could > improve it by having the places that recognize EOF marks set > reached_eof, but I'm unconvinced that it's worth the trouble. > I'm thinking that it's better to err in the direction of reporting > SIGPIPE less often not more often, considering that up to now > we've never reported it at all and there've been so few complaints. My opinion here is that when we execute an external program on the other end of a pipe, we should behave as loosely (tolerantly) as ordinary un*x programs are expected to. If we're connecting to another PostgreSQL server, we should be as stringent as the current behavior. In other words, we don't need to change the behavior of anything other than the COPY_FILE case, but ClosePipeToProgram shouldn't complain about any kind of error, not only SIGPIPE. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
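[Editor's note] To make the rule under discussion concrete, here is a minimal sketch of the EOF-tracking idea; the variable and function names are illustrative stand-ins, not the actual copy.c code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Only a read returning zero bytes proves that the external program
 * really reached EOF.  A short-but-nonzero read may simply mean the
 * pipe is unbuffered or line-buffered, so it must not set the flag.
 */
static bool reached_eof = false;

static size_t
read_from_program(FILE *pipe_fp, char *databuf, size_t maxread)
{
	size_t		bytesread = fread(databuf, 1, maxread, pipe_fp);

	if (bytesread == 0 && !ferror(pipe_fp))
		reached_eof = true;

	return bytesread;
}

On this reading, the pipe-closing code would treat a SIGPIPE death of the called program as ignorable only while reached_eof is still false, i.e. when we ourselves stopped reading early.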
Re: Pull up sublink of type 'NOT NOT (expr)'
Hi Tom, Thanks for your input. On Fri, Nov 16, 2018 at 6:06 AM, Tom Lane wrote: > Richard Guo writes: > > On Tue, Nov 13, 2018 at 10:05 AM, Tom Lane wrote: > >> What is the argument that this occurs often enough to be worth expending > >> extra cycles and code space on? > > > I am not using an ORM, but just considering maybe it would be better if > > PostgreSQL can do such pull-up. > > Tom, what's your suggestion? Is it worthwhile expending several lines of > > code to do this pull-up? > > Well, really, if we're just doing this on speculation and don't even have > any concrete indication that anybody writes code like that, I can't get > excited about expending even a few lines of code on it. > > The reason why we perform optimizations similar to this in places like > eval_const_expressions is (IMO anyway) that transformations such as > function inlining and subquery pullup can create parse trees that look > like this, even when the original query was perfectly sane and without > obvious redundancy. However, because pull_up_sublinks runs before any > of that, it would only ever see NOT NOT if someone had actually written > such a thing. So to justify expending any code space or cycles on > optimizing this, you have to make the case that that actually happens > in the wild, and does so often enough to justify wasting some (admittedly > small) number of cycles for everybody else. I'd kind of like to see some > actual field occurrence of NOT NOT over an optimizable IN/EXISTS before > deciding that it's worth doing. > > It's sort of annoying that we have to run pull_up_sublinks before we do > scalar expression simplification. If we could do that in the other order, > NOT NOT elimination would fall out automatically, and we'd also be able > to recognize some other cases where it initially seems that an IN or > EXISTS is not at top level, but it becomes so after we const-fold, apply > DeMorgan's laws, etc. However, according to the point I made above, > it makes more sense to apply expression simplification after we've > completed pullup-like operations. So we can't really get there from > here unless we wanted to do (parts of?) expression simplification twice, > which is unlikely to win often enough to justify the cost. > Agreed! If we do expression simplification in the other order, we would have to do it twice, which is not a good idea. > > So I'm inclined to reject this patch as probably being a net loss in > almost all cases. If you can show any real-world cases where it wins, > we could reconsider. > I am convinced. I do not see this happening often enough in the real world either. Feel free to reject this patch. Thanks Richard
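[Editor's note] For context, the kind of query at issue would look like the following; the table names here are hypothetical, not taken from the thread:

-- The sublink is wrapped in a double negation.  Because pull_up_sublinks
-- runs before expression simplification, it sees the NOT NOT verbatim
-- and cannot convert the EXISTS into a semijoin:
SELECT *
FROM orders o
WHERE NOT NOT EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);

-- Without the redundant negation, the sublink is pulled up into a
-- semijoin as usual:
SELECT *
FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c WHERE c.id = o.customer_id);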
Re: [PATCH] XLogReadRecord returns pointer to currently read page
Thank you for taking this. At Mon, 19 Nov 2018 10:27:29 +0900, Michael Paquier wrote in <20181119012728.ga1...@paquier.xyz> > On Fri, Nov 16, 2018 at 03:23:55PM +0900, Michael Paquier wrote: > > I was looking at this patch, and shouldn't we worry about compatibility > > with plugins or utilities which look directly at what's in readRecordBuf > > for the record contents? Let's not forget that the contents of > > XLogReaderState are public. > > And a couple of days later, committed. I did not notice first that the > comments of xlogreader.h mention that a couple of areas are private, and > readRecordBuf is part of that, so objection withdrawn. The way the function replaces the content of the buffer hasn't changed. (Regardless of whether it is private or not, I believe no one tries to write directly there outside the function.) Also, the memory pointed to by the returned pointer was regarded as read-only in the previous code as well. The only point to notice was the lifetime of the returned pointer, I suppose. > There is a comment in xlog.c (on top of ReadRecord) referring to > readRecordBuf which has not been removed as part of 7fcbf6a4 when WAL > reading has been moved to its own facility. The original comment was > from 0ffe11a. So I have removed the comment at the same time. - * The record is copied into readRecordBuf, so that on successful return, - * the returned record pointer always points there. Yeah, it is incorrect in some sense, but the comment was suggesting the lifetime of the pointed-to record. So I'd like to have a comment like this instead. + * The record is copied into xlogreader, so that on successful return, + * the returned record pointer always points there. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
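[Editor's note] As a reminder of the usage pattern this implies for callers, here is an illustrative fragment only; it assumes an already-allocated XLogReaderState *xlogreader and a valid start LSN, and handle_error is a hypothetical stand-in:

/*
 * The record returned by XLogReadRecord points into the reader's
 * internal state and stays valid only until the next call on the
 * same reader, so anything needed longer must be copied out.
 */
XLogRecord *record;
char	   *errormsg;

record = XLogReadRecord(xlogreader, startptr, &errormsg);
if (record == NULL)
	handle_error(errormsg);		/* hypothetical error handler */

/* use "record" here, read-only, before calling XLogReadRecord again */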
Re: Early WIP/PoC for inlining CTEs
On Sat, 17 Nov 2018 at 10:12, Stephen Frost wrote: > Greetings, > > * Andrew Gierth (and...@tao11.riddles.org.uk) wrote: > > > "Tom" == Tom Lane writes: > > > > >> [ inlining-ctes-v5.patch ] > > > > Tom> I took a little bit of a look through this. Some thoughts: > > > > Tom> * I think it'd be a good idea if we made OFFSET/LIMIT in a CTE be > > Tom> an alternate way of keeping it from being inlined. As the patch > > Tom> stands, if that's the behavior you want, you have no way to > > Tom> express it in a query that will also work in older servers. (I > > Tom> will manfully resist suggesting that then we don't need the > > Tom> nonstandard syntax at all ... oops, too late.) > > > > I think this is the wrong approach, because you may want the > > optimization-barrier effects of OFFSET/LIMIT _without_ the actual > > materialization - there is no need to force a query like > > > > with d as (select stuff from bigtable offset 1) select * from d; > > > > to push all the data through an (on-disk) tuplestore. > > Agreed, there's going to be cases where you want the CTE to be inlined > even with OFFSET/LIMIT. Let's please not cater to the crowd who > happened to know that they could hack around with OFFSET/LIMIT to make > something not be inlined when it comes to the question of if the CTE > should be inlined or not. That's the same issue we were arguing about > when discussing if we should allow parallel array_agg, imv. > > Particularly since, with CTEs anyway, we never inlined them, so the > whole OFFSET/LIMIT thing doesn't really make any sense - today, if you > wrote a CTE, you wouldn't bother with OFFSET/LIMIT because you knew it > wasn't going to be inlined, that entire line of thinking is for > subqueries, not CTEs. If you're going to force people to change their > CTEs to require that they not be inlined, let's not pick a method which > makes it ambiguous and makes us have to ask "do they really want this > limit/offset, or did they just want to make the CTE not be inlined...?" > > To satisfy Tom's understandable desire to let people write queries that behave the same on old and new versions, can we get away with back-patching the MATERIALIZED parser enhancement as a no-op in point releases? -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
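[Editor's note] To make the two behaviours concrete, here is a sketch in SQL reusing the bigtable example quoted above; MATERIALIZED is the explicit syntax proposed in this thread, not yet available in released servers at the time of writing:

-- Historical idiom: an OFFSET clause acts as an optimization fence, but
-- as discussed above it also currently forces the on-disk tuplestore:
WITH d AS (SELECT stuff FROM bigtable OFFSET 0)
SELECT * FROM d;

-- Proposed explicit syntax: keep the CTE from being inlined without
-- overloading OFFSET/LIMIT semantics:
WITH d AS MATERIALIZED (SELECT stuff FROM bigtable)
SELECT * FROM d;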
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Sat, Nov 17, 2018 at 7:41 PM Alvaro Herrera wrote: > > On 2018-Nov-17, Amit Kapila wrote: > > > On Sat, Nov 17, 2018 at 4:46 PM Alvaro Herrera > > wrote: > > > > Uh, ouch! Seems to me that this is a high-caliber foot-gun, and will > > > cause nasty surprises when production stats data disappear inadvertently! > > > > What is the alternative in your mind? > > Well, as I see this interface, it is as if > DELETE FROM ... WHERE queryid = <queryid> > where passing a NULL value meant to delete *all* rows regardless of > queryid, rather than only those where queryid is NULL. > If one wants to understand it in query form, it will be like: DELETE FROM ... WHERE userid = <userid> AND dbid = <dbid> AND queryid = <queryid>; Now, if any of the values is NULL, we ignore it, so the case in question would be: DELETE FROM ... WHERE userid = <userid> AND dbid = <dbid>; Now, I think it is quite possible that one can interpret it as you have written above and then this can lead to some unintended behavior. I think one thing we can do to avoid such cases is to make this function strict and make the default values InvalidOid or -1. So, on NULL input, we don't do anything, but for -1 or an invalid value for the parameter, we ignore only that parameter as we are doing now. Do you think such an interface will address your concern? > > AFAICS, we have below alternatives: > > > > a) Return an error for the NULL value of any argument? > > b) Silently ignore if there is any NULL value argument and don't > > remove any stats. > > c) Just ignore the NULL value parameter and remove the stats based on > > other parameters. > > > Currently, patch implements option (c), but we can do (a) or (b) as > > well, however, that can make the API usage bit awkward. > > I think option d) is to have more functions (seven functions total), and > have them all be strict (i.e. don't do anything if one argument is > NULL). One function takes three arguments, and removes all entries > matching the three. Other functions take two arguments, and remove > everything matching those, ignoring the third (so behaving like the > current NULL). Others take one argument. > > Maybe it doesn't make sense to provide all combinations, so it's less > than seven. But having an interface that's inherently easy to misuse > makes me a bit nervous. > Yeah, we can go in that direction if the one-interface idea doesn't work. IIRC, we have discussed having multiple interfaces for this API and the majority of people seem to be in favor of a single interface. Let us first see if there is some reasonable way to provide this functionality with a single interface; if not, then surely we can explore the multiple-interface idea. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
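[Editor's note] For illustration, usage under option (c) as the patch currently implements it would look roughly like the following; the exact signature was still under discussion, and the OIDs and queryid shown are made up:

-- Reset everything.
SELECT pg_stat_statements_reset();

-- Reset one user/database/query combination.
SELECT pg_stat_statements_reset(10, 16384, 1998603694);

-- With queryid left out (NULL), reset all statements of that
-- user/database pair; this is the behaviour Alvaro finds surprising.
SELECT pg_stat_statements_reset(10, 16384);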
Re: zheap: a new storage format for PostgreSQL
On Sun, Nov 18, 2018 at 3:42 PM Darafei "Komяpa" Praliaskouski wrote: > > On Sat, Nov 17, 2018 at 8:51 AM Adam Brusselback > wrote: >> >> > I don't know how much what I write on this thread is read by others or >> how useful this is for others who are following this work >> >> I've been following this thread and many others like it, silently soaking it >> up, because I don't feel like i'd have anything useful to add in most cases. >> It is very interesting seeing the development take place though, so just >> know it's appreciated at least from my perspective. > > I'm also following the development and have hopes about it going forward. Not > much low-level details I can comment on though :) > > In PostGIS workloads, UPDATE table SET geom = ST_CostyFunction(geom, > magicnumber); is one of biggest time-eaters that happen upon initial load and > clean up of your data. It is commonly followed by CLUSTER table using > table_geom_idx; to make sure you're back at full speed and no VACUUM is > needed, and your table (usually static after that) is more-or-less spatially > ordered. I see that zheap can remove the need for VACUUM, which is a big win > already. If you can do something that will allow reorder of tuples according > to index happen during an UPDATE that rewrites most of table, that would be a > game changer :) > If the tuples are already in the order of the index, then we would retain the order; otherwise, we might not want to do anything special for ordering w.r.t. the index. I think this is important as we are not sure of the user's intention, and I guess it won't be easy to do such a rearrangement during an UPDATE statement. > Another story is Visibility Map and Index-Only Scans. Right now there is a > huge gap between the insert of rows and the moment they are available for > index only scan, as VACUUM is required. Do I understand correctly that for > zheap this all can be inverted, and UNDO can become "invisibility map" that > may be quite small and discarded quickly? > Yeah, eventually that is our goal with the help of delete-marking in indexes; however, for the first version, we still need to rely on visibility maps for index-only scans. Thank you for showing interest in this work. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT
(2018/11/17 8:10), Tom Lane wrote: I wrote: I'm not quite sure whether the reached_eof test should be "if (bytesread == 0)" or "if (bytesread < maxread)". The former seems more certain to indicate real EOF; are there other cases where the fread might return a short read? On the other hand, if we support in-band EOF indicators (\. for instance) then we might stop without having made an extra call to CopyGetData that would see bytesread == 0. On the third hand, if we do do that it's not quite clear to me whether SIGPIPE is ignorable or not. After still further thought, it seems like "if (bytesread == 0)" is the right way to proceed. That protects us against incorrectly setting reached_eof if the pipe is being run in unbuffered or line-buffered mode. (Which copy.c doesn't do, at the moment, but I'd just as soon this code didn't need to assume that. Also, I'm not 100% convinced that Windows or other platforms might not act that way.) So I think this version is probably good, although maybe it could use an additional comment explaining the above reasoning. I agree that it's better to keep the BeginCopyFrom API as-is. Also, I think your version would handle SIGPIPE in COPY FROM PROGRAM more properly than what I proposed. So, +1 from me. Thanks! Best regards, Etsuro Fujita
Re: Function to promote standby servers
On Fri, Oct 26, 2018 at 01:51:24PM +0900, Michael Paquier wrote: > Or we could use "the function returns true immediately after initiating > the promotion by sending the promotion signal to the postmaster"? As a > native speaker, which one feels more natural to you? Sorry for the time it took. After pondering on it, I have committed the improvements from your version. -- Michael
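[Editor's note] For readers who haven't followed the thread, the function whose documentation is being polished here is pg_promote(), and the wording concerns its non-waiting mode. Usage looks like the following sketch, per the signature committed around this time (wait defaults to true, wait_seconds to 60):

-- Promote the standby and wait for promotion to complete:
SELECT pg_promote();

-- Return true immediately after sending the promotion signal to the
-- postmaster, without waiting:
SELECT pg_promote(false);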
Re: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)
At Sat, 17 Nov 2018 11:15:54 -0300, Alvaro Herrera wrote in <20181117141554.4dkx2u4j6md3bqdh@alvherre.pgsql> > Is this patch committable now? I don't think so. We still need to make a decision on one point. I was a bit confused (sorry), but IIUC Haribabu suggested that the RBM_ZERO_ON_ERROR case should be counted as a read (not just ignored). I'm on it. I think that the RBM_ZERO_* cases other than ON_ERROR could be counted as a kind of hit, but I don't insist on it. So, I think we should decide on at least the ON_ERROR case before this becomes committable. The additional counter could be a separate issue. I don't object to adding the counter, but I'm not sure what the suitable name would be. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On Mon, Nov 19, 2018 at 12:28:06PM +0900, Kyotaro HORIGUCHI wrote: > Yeah, it is incorrect in some sense, but the comment was > suggesting the lifetime of the pointed-to record. So I'd like to > have a comment like this instead. I think that we can still live without it, as it is not the business of this routine to be careful about how the lifetime of a read record is handled; that's part of the internals of XLogReader. -- Michael
Re: [HACKERS] Restricting maximum keep segments by repslots
On Fri, Oct 26, 2018 at 11:26:36AM +0900, Kyotaro HORIGUCHI wrote: > The reason for doing that in the function is it can happen also > for physical replication when walsender is active but far > behind. The removed(renamed)-but-still-open segment may be > recycled and can be overwritten while reading, and it will be > caught by page/record validation. It is substantially lost in > that sense. I don't think the strictness is useful for anything.. I was just coming by to look a bit at the patch series, and bumped into that:

> +	/*
> +	 * A checkpoint can remove the segment we are currently looking for, so
> +	 * make sure the current segment still exists.  We check this only once
> +	 * per record.
> +	 */
> +	XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size);
> +	if (targetSegNo <= XLogGetLastRemovedSegno())
> +		ereport(ERROR,
> +				(errcode(ERRCODE_NO_DATA),
> +				 errmsg("WAL record at %X/%X no longer available",
> +						(uint32) (RecPtr >> 32), (uint32) RecPtr),
> +				 errdetail("The segment for the record has been removed.")));

ereport should not be called within xlogreader.c as a base rule: * This file is compiled as both front-end and backend code, so it * may not use ereport, server-defined static variables, etc. -- Michael
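[Editor's note] For reference, the frontend-safe way to surface a failure from inside xlogreader.c is to fill the reader's error buffer through its existing helper and return failure, roughly like the sketch below. Note that XLogGetLastRemovedSegno() is itself backend-only, so the check would likely have to move out of xlogreader.c entirely; this sketch only illustrates the error-reporting convention:

	/* Sketch only: no ereport() in code that must also build as frontend. */
	XLByteToSeg(targetPagePtr, targetSegNo, state->wal_segment_size);
	if (targetSegNo <= XLogGetLastRemovedSegno())
	{
		report_invalid_record(state,
							  "WAL record at %X/%X no longer available",
							  (uint32) (RecPtr >> 32), (uint32) RecPtr);
		return false;
	}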
Re: Postgres, fsync, and OSs (specifically linux)
On Fri, Nov 9, 2018 at 9:03 AM Thomas Munro wrote: > On Fri, Nov 9, 2018 at 7:07 AM Robert Haas wrote: > > On Wed, Nov 7, 2018 at 9:41 PM Thomas Munro > > wrote: > > > My plan is to do a round of testing and review of this stuff next week > > > once the dust is settled on the current minor releases (including > > > fixing a few typos I just spotted and some word-smithing). All going > > > well, I will then push the resulting patches to master and all > > > supported stable branches, unless other reviews or objections appear. ... On Sun, Nov 18, 2018 at 3:20 PM Thomas Munro wrote: > I think these patches are looking good now. If I don't spot any other > problems or hear any objections, I will commit them tomorrow-ish. Hearing no objections, pushed to all supported branches. Thank you to Craig for all his work getting to the bottom of this, to Andres for his open source diplomacy, and the Linux guys for their change "errseq: Always report a writeback error once" which came out of that. Some more comments: * The promotion of errors from close() to PANIC may or may not be effective considering that it doesn't have interlocking with concurrent checkpoints. I'm not sure if it can really happen on local file systems anyway... this may fall under the category of "making PostgreSQL work reliably on NFS", a configuration that is not recommended currently, and a separate project IMV. * In 9.4 and 9.5 there is no checking of errors from sync_file_range(), and I didn't add any for now. It was claimed that sync_file_range() without BEFORE/AFTER can't consume errors[1]. Errors are promoted in 9.6+ for consistency because we already looked at the return code, so we won't be relying on that knowledge in the long term. * I personally believe it is safe to run with data_sync_retry = on on any file system on FreeBSD, and ZFS on any operating system... but I see no need to make recommendations about that in the documentation, other than that you should investigate the behaviour of your operating system if you really want to turn it on. * A PANIC (and possibly ensuing crash restart loop if the I/O error is not transient) is of course a very unpleasant failure mode, but it is one that we already had for the WAL and control file. So I'm not sure I'd personally bother to run with the non-default setting even on a system where I believe it to be safe (considering the low likelihood that I/O failure is isolated to certain files); at best it probably gives you a better experience if the fs underneath a non-default tablespace dies. * The GUC is provided primarily because this patch is so drastic in its effect that it seems like we owe our users a way to disable it on principle, and that seems to outweigh a desire not to add GUCs in back-branches. * If I/O errors happen, your system is probably toast and you need to fail over or restore from backups, but at least we won't tell you any lies about checkpoints succeeding. In rare scenarios, perhaps involving a transient failure of virtualised storage with thin provisioning as originally described by Craig, the system may actually be able to continue running, and with this change we should now be able to avoid data loss by recovering from the WAL. * As noted in the commit message, this isn't quite the end of the story. See the fsync queue redesign thread[2], WIP for master only.
[1] https://www.postgresql.org/message-id/20180430160945.5s5qfoqryhtmugxl%40alap3.anarazel.de [2] https://www.postgresql.org/message-id/flat/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmazxk0jykxw+b21f...@mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
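[Editor's note] In case it is useful to readers, opting out of the new behaviour is a one-line configuration change. A sketch follows; data_sync_retry is the GUC added by these commits and, being a postmaster-level setting, requires a server restart to change:

-- The shipped default: promote fsync() failure on a data file to PANIC,
-- so that recovery replays the WAL instead of trusting a failed write-back.
ALTER SYSTEM SET data_sync_retry = off;

-- Retry instead.  Only appropriate if your OS/filesystem is known to keep
-- dirty buffers around after a write-back error (e.g. as claimed above
-- for FreeBSD, and for ZFS on any operating system).
ALTER SYSTEM SET data_sync_retry = on;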
Re: WIP: Avoid creation of the free space map for small tables
On Mon, Nov 19, 2018 at 7:30 AM John Naylor wrote: > > On 11/16/18, Amit Kapila wrote: > > +/* Status codes for the local map. */ > > +#define FSM_LOCAL_ZERO 0x00 /* Beyond the end of the relation */ > > +#define FSM_LOCAL_AVAIL 0x01 /* Available to try */ > > +#define FSM_LOCAL_TRIED 0x02 /* Already tried, not enough space */ > > > > Instead of maintaining three states, can't we do with two states > > (Available and Not Available), basically combine 0 and 2 in your case. > > I think it will save some cycles in > > fsm_local_set, where each time you need to initialize all the entries > > in the map. I think we can argue that it is not much overhead, but I > > think it is better code-wise also if we can make it happen with fewer > > states. > > That'd work too, but let's consider this scenario: We have a 2-block > table that has no free space. After trying each block, the local cache > looks like > > 0123 > TT00 > > Let's say we have to wait to acquire a relation extension lock, > because another backend had already started extending the heap by 1 > block. We call GetPageWithFreeSpace() and now the local map looks like > > 0123 > TTA0 > > By using bitwise OR to set availability, the already-tried blocks > remain as they are. With only 2 states, the map would look like this > instead: > > 0123 > AAAN > I expect the below part of the code to go away. +fsm_local_set(Relation rel, BlockNumber nblocks) { .. + /* + * If the blkno is beyond the end of the relation, the status should + * be zero already, but make sure it is. If the blkno is within the + * relation, mark it available unless it's already been tried. + */ + for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++) + { + if (blkno < nblocks) + FSMLocalMap[blkno] |= FSM_LOCAL_AVAIL; + else + FSMLocalMap[blkno] = FSM_LOCAL_ZERO; + } .. } In my mind, for such a case it should look like below: 0123 NNAN > If we assume that an insert into the newly-created block 2 will almost > always succeed, we don't have to worry about wasting time re-checking > the first 2 full blocks. Does that sound right to you? > As explained above, such a situation won't exist. > > > Some assorted comments: > > 1. > > > > -Each heap and index relation, except for hash indexes, has a Free Space > > Map > > +Each heap relation, unless it is very small, and each index relation, > > +except for hash indexes, has a Free Space Map > > (FSM) to keep track of available space in the relation. It's stored > > > > It appears that line has ended abruptly. > > Not sure what you're referring to here. > There is a space after "has a Free Space Map " so you can combine it with the next line. > > 2. > > page = BufferGetPage(buffer); > > + targetBlock = BufferGetBlockNumber(buffer); > > > > if (!PageIsNew(page)) > > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > > - BufferGetBlockNumber(buffer), > > + targetBlock, > > RelationGetRelationName(relation)); > > > > PageInit(page, BufferGetPageSize(buffer), 0); > > @@ -623,7 +641,18 @@ loop: > > * current backend to make more insertions or not, which is probably a > > * good bet most of the time. So for now, don't add it to FSM yet. > > */ > > - RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer)); > > + RelationSetTargetBlock(relation, targetBlock); > > > > Is this related to this patch? If not, I suggest let's do it > > separately if required. > > I will separate this out. > > > 3.
> > static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, > > -uint8 newValue, uint8 minValue); > > + uint8 newValue, uint8 minValue); > > > > This appears to be a spurious change. > > It was intentional, but I will include it separately as above. > > > 4. > > @@ -378,24 +386,15 @@ RelationGetBufferForTuple(Relation relation, Size > > len, > > * target. > > */ > > targetBlock = GetPageWithFreeSpace(relation, len + saveFreeSpace); > > + > > + /* > > + * In case we used an in-memory map of available blocks, reset > > + * it for next use. > > + */ > > + if (targetBlock < HEAP_FSM_CREATION_THRESHOLD) > > + ClearLocalMap(); > > > > How will you clear the local map during error? I think you need to > > clear it in abort path and you can name the function as > > FSMClearLocalMap or something like that. > > That sounds right, and I will rename the function that way. For the > abort path, were you referring to this or somewhere else? > > if (!PageIsNew(page)) > elog(ERROR, "page %u of relation \"%s\" should be empty but is not", > targetBlock, > RelationGetRelationName(relation)); > I think it might come from any other place between when you set it and before it got cleared (like any intermediate buffer and pin related API's). > > 5. > > +/*#define TRACE_TARGETBLOCK */ > > > > Debugging leftover, do you want to retain this and related stuff > > during the development of patch? > > I modeled this after TRACE_VISIBILITYMAP in visibilitymap.c. It's > useful for dev
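[Editor's note] To make the design point concrete, here is a small compilable sketch of the bitwise-OR marking John describes above; the identifiers mirror the patch under review, but the surrounding code is illustrative rather than the patch itself:

#define HEAP_FSM_CREATION_THRESHOLD 4	/* illustrative value */

/* Three-state local map, as in the patch under review. */
#define FSM_LOCAL_ZERO	0x00	/* beyond the end of the relation */
#define FSM_LOCAL_AVAIL	0x01	/* available to try */
#define FSM_LOCAL_TRIED	0x02	/* already tried, not enough space */

static unsigned char FSMLocalMap[HEAP_FSM_CREATION_THRESHOLD];

/*
 * Mark every existing block available without clobbering the TRIED bit:
 * the bitwise OR lets a newly extended block become available (TT00 ->
 * TTA0 in the scenario above) while full blocks keep the TRIED bit and
 * can still be recognized and skipped.
 */
static void
fsm_local_set(unsigned nblocks)
{
	unsigned	blkno;

	for (blkno = 0; blkno < HEAP_FSM_CREATION_THRESHOLD; blkno++)
	{
		if (blkno < nblocks)
			FSMLocalMap[blkno] |= FSM_LOCAL_AVAIL;
		else
			FSMLocalMap[blkno] = FSM_LOCAL_ZERO;
	}
}

Amit's two-state alternative would drop the TRIED bit, and, as he argues above, would let this re-marking loop disappear entirely.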
Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query
On Mon, Nov 19, 2018 at 1:37 PM Alvaro Herrera wrote: > On 2018-Nov-19, Michael Paquier wrote: > > > On Mon, Nov 19, 2018 at 10:41:22AM +1100, Haribabu Kommi wrote: > > > So 6 new functions need to be added to cover all the above cases, > > > IMO, we may need functions for all combinations, because I feel some > > > user may have the requirement of a left-out combination, in case we choose > > > only some combinations. > > > > That's bloating the interface in my opinion. > > I understand. > > Let's call for a vote from a larger audience. It's important to get > this interface right, ISTM. > Amit suggested another option in another mail, so the viable solutions discussed so far are:

1. Single API where a NULL input is treated as an invalid value.
2. Multiple APIs to deal with NULL inputs for the other values.
3. Single API where NULL values are treated as current user, current database and NULL queryid.
4. Single API with -1 as the invalid value, treating NULL as matching nothing. (The only problem with this approach is that until now -1 has also been a valid queryid, so setting -1 as a queryid needs to be avoided.)

I prefer a single API. I can go with either 3 or 4. Opinions from others? Regards, Haribabu Kommi Fujitsu Australia
Re: [PATCH] XLogReadRecord returns pointer to currently read page
On 16.11.2018 11:23, Michael Paquier wrote: On Thu, Nov 15, 2018 at 06:12:38PM +0900, Kyotaro HORIGUCHI wrote: This patch eliminates unnecessary copying that was done for non-continued records. Now the return value of XLogReadRecord directly points into the page buffer held in XLogReaderState. It is safe because no caller site uses the returned pointer beyond the replacement of the buffer content at the next call to the same function. I was looking at this patch, and shouldn't we worry about compatibility with plugins or utilities which look directly at what's in readRecordBuf for the record contents? Let's not forget that the contents of XLogReaderState are public. Based on my experience, I've clarified some comments to avoid such mistakes in the future (see attachment). -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company

>From 7de252405ef5d5979fe2711515c0c6402abc0e06 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov"
Date: Mon, 19 Nov 2018 07:08:28 +0500
Subject: [PATCH] Some clarifications on readRecordBuf comments

---
 src/backend/access/transam/xlogreader.c | 4 ++--
 src/include/access/xlogreader.h         | 6 +++++-
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index c5e019bf77..88d0bcf48a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -209,8 +209,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
  * If the reading fails for some other reason, NULL is also returned, and
  * *errormsg is set to a string with details of the failure.
  *
- * The returned pointer (or *errormsg) points to an internal buffer that's
- * valid until the next call to XLogReadRecord.
+ * The returned pointer (or *errormsg) points to an internal read-only buffer
+ * that's valid until the next call to XLogReadRecord.
  */
 XLogRecord *
 XLogReadRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 40116f8ecb..0837625b95 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -185,7 +185,11 @@ struct XLogReaderState
 	 */
 	TimeLineID	nextTLI;

-	/* Buffer for current ReadRecord result (expandable) */
+	/*
+	 * Buffer for current ReadRecord result (expandable).
+	 * Used in the case when the current ReadRecord result doesn't fit
+	 * on the currently read WAL page.
+	 */
 	char	   *readRecordBuf;
 	uint32		readRecordBufSize;
-- 
2.17.1
Re: In-place updates and serializable transactions
On Fri, Nov 16, 2018 at 4:07 AM Kevin Grittner wrote: > > On Thu, Nov 15, 2018 at 3:03 AM Kuntal Ghosh > wrote: > > > The test multiple-row-versions is failing because of the > > above-discussed scenario. I've attached the regression diff file and > > the result output file for the same. Here is a brief summary of the > > test w.r.t. heap: > > > > Step 1: T1-> BEGIN; Read FROM t where id=100; > > Step 2: T2-> BEGIN; UPDATE t where id=100; COMMIT; (creates T1->T2) > > Step 3: T3-> BEGIN; UPDATE t where id=100; Read FROM t where id=50; > > Step 4: T4-> BEGIN; UPDATE t where id= 50; Read FROM t where id=1; > > COMMIT; (creates T3->T4) > > Step 5: T3-> COMMIT; > > Step 6: T1-> UPDATE t where id=1; COMMIT; (creates T4->T1,) > > > > At step 6, when the update statement is executed, T1 is rolled back > > because of T3->T4->T1. > > > > But for zheap, step 3 also creates a dependency T1->T3 because of > > in-place update. When T4 commits in step 4, it marks T3 as doomed > > because of T1 --> T3 --> T4. Hence, in step 5, T3 is rolled back. > > If I understand this, no permutation (order of execution of the > statements in a set of concurrent transactions vulnerable to > serialization anomalies) which have succeeded with the old storage > engine now fail with zheap; what we have with zheap is an earlier > failure in one case. More importantly, zheap doesn't create any false > negatives (cases where a serialization anomaly is missed). > Your understanding is correct. Thanks for sharing your feedback. > I would say this should be considered a resounding success. We should > probably add an alternative result file to cover this case, but > otherwise I don't see anything which requires action. > > Congratulations on making this work so well! > Thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: In-place updates and serializable transactions
On Fri, Nov 16, 2018 at 4:07 AM Kevin Grittner wrote: > > I would say this should be considered a resounding success. We should > probably add an alternative result file to cover this case, but > otherwise I don't see anything which requires action. > As of now, we've added an extra expected file for the same. > Congratulations on making this work so well! > Thanks for developing the serializable module in a nice decoupled architecture. -- Thanks & Regards, Kuntal Ghosh EnterpriseDB: http://www.enterprisedb.com
Re: NOTIFY and pg_notify performance when deduplicating notifications
On 10/10/2018 19:42, Catalin Iacob wrote: On Tue, Oct 9, 2018 at 2:17 PM wrote: I just caught an error in my patch, it's fixed in the attachment. The 'never' and 'maybe' collapse modes were mixed up in one location. Here's a partial review of this version, did not read the doc part very carefully. First of all, I agree that this is a desirable feature as, for a large number of notifications, the O(n^2) overhead quickly becomes very noticeable. I would expect the collapse mode to be an enum which is created from the string early on during parsing and used for the rest of the code. Instead the string is used all the way, leading to string comparisons in the notification dispatcher and to the need to hardcode special strings in various places, including the contrib module. This comment in the beginning of async.c should also be updated: * Duplicate notifications from the same transaction are sent out as one * notification only. This is done to save work when for example a trigger pg_notify_3args duplicates pg_notify, I would expect a helper function to be extracted and called from both. There are braces placed on the same line as the if, for example if (strlen(collapse_mode) != 0) { which seems to not be the project's style. Thank you for the review. I've addressed all your points in the attached patch. The patch was made against release 11.1. I couldn't find a way to make a good helper function for pg_notify_3args and pg_notify, I hope my proposed solution is acceptable. diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c index 0c274322bd..1494a35a5a 100644 --- a/contrib/tcn/tcn.c +++ b/contrib/tcn/tcn.c @@ -161,7 +161,7 @@ triggered_change_notification(PG_FUNCTION_ARGS) strcpy_quoted(payload, SPI_getvalue(trigtuple, tupdesc, colno), '\''); } - Async_Notify(channel, payload->data); + Async_Notify(channel, payload->data, NOTIFY_COLLAPSE_MODE_MAYBE); } ReleaseSysCache(indexTuple); break; diff --git a/doc/src/sgml/ref/notify.sgml b/doc/src/sgml/ref/notify.sgml index e0e125a2a2..96e0d7a990 100644 --- a/doc/src/sgml/ref/notify.sgml +++ b/doc/src/sgml/ref/notify.sgml @@ -21,7 +21,8 @@ PostgreSQL documentation -NOTIFY channel [ , payload ] +NOTIFY channel [ , payload [ , collapse_mode ] ] + @@ -93,20 +94,6 @@ NOTIFY channel [ , - - If the same channel name is signaled multiple times from the same - transaction with identical payload strings, the - database server can decide to deliver a single notification only. - On the other hand, notifications with distinct payload strings will - always be delivered as distinct notifications. Similarly, notifications from - different transactions will never get folded into one notification. - Except for dropping later instances of duplicate notifications, - NOTIFY guarantees that notifications from the same - transaction get delivered in the order they were sent. It is also - guaranteed that messages from different transactions are delivered in - the order in which the transactions committed. - - It is common for a client that executes NOTIFY to be listening on the same notification channel itself. In that case @@ -121,6 +108,41 @@ NOTIFY channel [ , + + + + + + Ordering and collapsing of notifications + + + If the same channel name is signaled multiple times from the same + transaction with identical payload strings, the + database server can decide to deliver a single notification only, + when the value of the collapse_mode parameter is + 'maybe' or '' (the empty string).
+ + If the 'never' collapse mode is specified, the server will + deliver all notifications, including duplicates. Turning off deduplication + in this way can considerably speed up transactions that emit large numbers + of notifications. + + Removal of duplicate notifications takes place within a transaction block, + finished with COMMIT, END or SAVEPOINT. + + + + Notifications with distinct payload strings will + always be delivered as distinct notifications. Similarly, notifications from + different transactions will never get folded into one notification. + Except for dropping later instances of duplicate notifications, + NOTIFY guarantees that notifications from the same + transaction get delivered in the order they were sent. It is also + guaranteed that messages from different transactions are delivered in + the order in which the transactions committed. + + + @@ -147,6 +169,16 @@ NOTIFY channel [ , + +collapse_mode + + + The collapse mode to apply when identical notifications are issued within + a transaction. The acceptable values are 'maybe' (the + default) and 'never'. + + + @@ -190,6 +222
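[Editor's note] Usage, per the syntax defined by this patch (not available in any released server; the channel name below is made up), would look like:

-- Emit notifications without the O(n^2) duplicate check:
NOTIFY jobs, 'payload', 'never';
SELECT pg_notify('jobs', 'payload', 'never');

-- Default behaviour: duplicates within a transaction may be collapsed:
NOTIFY jobs, 'payload';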