A really subtle lexer bug
Currently in PG, the precedence of = and <> is supposed to be equal, and the precedence of unary - is very high. So:

  (true=1 between 1 and 1)    parses as ((true)=(1 between 1 and 1))
  (true=-1 between 1 and 1)   parses as ((true)=((-1) between 1 and 1))

All good so far.

  (true<>-1 between 1 and 1)  parses as ((true<>(-1)) between 1 and 1)

???

The fault here is in the lexer. The input "<>-" is being lexed as an Op followed by '-' rather than as NOT_EQUAL followed by '-' because it looks like a match for a multi-character operator, with the - being thrown back into the input afterwards. So the precedence of <> gets inflated to that of Op, which is higher than BETWEEN.

More seriously, this also breaks named arguments:

  create function f(a integer) returns integer language sql as $$ select a; $$;

  select f(a => 1);   -- works
  select f(a => -1);  -- works
  select f(a =>-1);   -- ERROR: column "a" does not exist

I guess the fix is to extend the existing special case code that checks for one character left after removing trailing [+-] and also check for the two-character ops "<>" ">=" "<=" "=>" "!=".

-- 
Andrew (irc:RhodiumToad)
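A rough sketch of the kind of extra check being suggested, as it might sit alongside the existing trailing-[+-] special case in scan.l (simplified and untested; the real rule also needs the surrounding yyless()/location bookkeeping, though the two-character token names below do already exist in the grammar):

  /* nchars is the operator length left after stripping trailing [+-].  The
   * existing code only special-cases nchars == 1; the idea is to also hand
   * back the standard two-character operators as their own tokens so they
   * keep their proper precedence instead of generic Op precedence. */
  if (nchars == 2)
  {
      if (yytext[0] == '=' && yytext[1] == '>')
          return EQUALS_GREATER;
      if (yytext[0] == '>' && yytext[1] == '=')
          return GREATER_EQUALS;
      if (yytext[0] == '<' && yytext[1] == '=')
          return LESS_EQUALS;
      if ((yytext[0] == '<' && yytext[1] == '>') ||
          (yytext[0] == '!' && yytext[1] == '='))
          return NOT_EQUALS;
  }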
Calculate total_table_pages after set_base_rel_sizes()
I believe that we should be delaying the PlannerInfo's total_table_pages calculation until after constraint exclusion and partition pruning have taken place. Doing this calculation before we determine which relations we don't need to scan can lead to random_page_cost being applied to too many of the pages processed during an Index Scan.

We already exclude relations removed by join removal from this calculation, so not excluding pruned partitions seems like an omission.

The attached patch moves the calculation to after set_base_rel_sizes() is called and before set_base_rel_pathlists() is called, which is where the information is actually used.

I consider this a bug fix, but I'm proposing it for PG12 only, as I don't think destabilising plans in the back branches is a good idea. I'll add this to the September commitfest.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

v1-0001-Calculate-total_table_pages-after-set_base_rel_si.patch
Description: Binary data
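A minimal illustration of the scenario being described (the schema and names are made up purely to make the point concrete):

  CREATE TABLE logs (ts timestamptz, payload text) PARTITION BY RANGE (ts);
  CREATE TABLE logs_2018_07 PARTITION OF logs
      FOR VALUES FROM ('2018-07-01') TO ('2018-08-01');
  CREATE TABLE logs_2018_08 PARTITION OF logs
      FOR VALUES FROM ('2018-08-01') TO ('2018-09-01');
  CREATE INDEX ON logs_2018_08 (ts);

  -- Pruning leaves only logs_2018_08 to scan, but when total_table_pages
  -- is computed before pruning it still counts logs_2018_07's pages.
  -- That total feeds into index_pages_fetched(), so the index scan can be
  -- costed as if far more of the cache were occupied by other relations
  -- than is actually the case.
  SELECT * FROM logs WHERE ts >= '2018-08-15' AND ts < '2018-08-16';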
Re: Fix help option of contrib/oid2name
On Mon, Aug 20, 2018 at 03:51:07PM +0900, Tatsuro Yamada wrote:
> On 2018/08/20 13:54, Michael Paquier wrote:
> Therefore, "-P" is a bug in the manual. I investigated further using the
> git log command and understood the following:
>
> 1. The -P option was removed in 4192f2d85
> 2. The -P option was revived only in the manual in 2963d8228

Bruce did that to keep a trace of it in the docs. Let's nuke it then: we don't handle it, and the documentation has been mentioning the thing as deprecated since 2010.

>> oid2name also supports PGDATABASE.
>
> As far as I know, this environment variable is also unused, because
> I could not get the results I expected. So, I didn't add it to the manual.
> For example, the following command was expected to show information about
> "postgres", but it didn't.

Yep, you're right.

> For now, I'm not sure about TAP tests, but I know both utilities have no
> regression tests. It would be better to add some tests.

If you don't add them, I think I would just do that myself, just to have some basics. Not that it is a barrier for this patch anyway.

--
Michael

signature.asc
Description: PGP signature
Re: partitioning - changing a slot's descriptor is expensive
Thanks for the review. On 2018/08/17 15:00, Amit Khandekar wrote: > Thanks for the patch. I did some review of the patch (needs a rebase). > Below are my comments. > > @@ -1936,12 +1936,11 @@ ExecPartitionCheckEmitError(ResultRelInfo > *resultRelInfo, > + /* Input slot might be of a partition, which has a fixed tupdesc. */ > + slot = MakeTupleTableSlot(tupdesc); > if (map != NULL) > - { > tuple = do_convert_tuple(tuple, map); > - ExecSetSlotDescriptor(slot, tupdesc); > - ExecStoreTuple(tuple, slot, InvalidBuffer, false); > - } > + ExecStoreTuple(tuple, slot, InvalidBuffer, false); > > Both MakeTupleTableSlot() and ExecStoreTuple() can be inside the (map > != NULL) if condition. > This also applies for similar changes in ExecConstraints() and > ExecWithCheckOptions(). Ah, okay. I guess that means we'll allocate a new slot here only if we had to switch to a partition-specific slot in the first place. > + * Initialize an empty slot that will be used to manipulate tuples of any > + * this partition's rowtype. > of any this => of this > > + * Initialized the array where these slots are stored, if not already > Initialized => Initialize Fixed. > + proute->partition_tuple_slots_alloced = > + lappend(proute->partition_tuple_slots_alloced, > + proute->partition_tuple_slots[partidx]); > > Instead of doing the above, I think we can collect those slots in > estate->es_tupleTable using ExecInitExtraTupleSlot() so that they > don't have to be explicitly dropped in ExecCleanupTupleRouting(). And > also then we won't require the new field > proute->partition_tuple_slots_alloced. Although I was slightly uncomfortable of the idea at first, thinking that it's not good for tuple routing specific resources to be released by generic executor code, doesn't seem too bad to do it the way you suggest. Attached updated patch. By the way, I think it might be a good idea to try to merge this patch with the effort in the following thread. * Reduce partition tuple routing overheads * https://commitfest.postgresql.org/19/1690/ Thanks, Amit From a543b15f83f5e1adaef2a46de59022b0b21711e4 Mon Sep 17 00:00:00 2001 From: amit Date: Thu, 28 Jun 2018 15:47:47 +0900 Subject: [PATCH v2] Allocate dedicated slots of partition tuple conversion Currently there is just one slot called partition_tuple_slot and we do ExecSetSlotDescriptor every time a tuple is inserted into a partition that requires tuple conversion. That's excessively inefficient during bulk inserts. Fix that by allocating a dedicated slot for each partitions that requires it. --- src/backend/commands/copy.c| 17 + src/backend/executor/execMain.c| 24 +++--- src/backend/executor/execPartition.c | 45 +++--- src/backend/executor/nodeModifyTable.c | 27 src/include/executor/execPartition.h | 13 ++ 5 files changed, 88 insertions(+), 38 deletions(-) diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c index 9bc67ce60f..f61db3e173 100644 --- a/src/backend/commands/copy.c +++ b/src/backend/commands/copy.c @@ -2690,6 +2690,7 @@ CopyFrom(CopyState cstate) if (proute) { int leaf_part_index; + TupleConversionMap *map; /* * Away we go ... If we end up not finding a partition after all, @@ -2864,11 +2865,17 @@ CopyFrom(CopyState cstate) * partition rowtype. Don't free the already stored tuple as it * may still be required for a multi-insert batch. 
*/ - tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index], - tuple, - proute->partition_tuple_slot, - &slot, - false); + map = proute->parent_child_tupconv_maps[leaf_part_index]; + if (map != NULL) + { + TupleTableSlot *new_slot; + + Assert(proute->partition_tuple_slots != NULL && + proute->partition_tuple_slots[leaf_part_index] != NULL); + new_slot = proute->partition_tuple_slots[leaf_part_index]; + tuple = ConvertPartitionTupleSlot(map, tuple, new_slot, &slot, + false); + } tuple->t_tableO
Re: [FEATURE PATCH] pg_stat_statements with plans (v02)
Hi,

On 2018-08-20 15:21:07 +1200, Thomas Munro wrote:
> On Mon, Aug 20, 2018 at 2:55 PM, Tom Lane wrote:
> > Thomas Munro writes:
> > ... but I'm less excited about this one. Seems like a great opportunity
> > for unexpected stack overflows, and thence at least the chance for
> > DOS-causing security attacks. Can we prevent that from being allowed,
> > if we start using -std=c99?
>
> -Werror=vla in GCC, apparently.

How about detecting support for that in our configure script and automatically using it? If we're uncomfortable with raising an error, let's at least raise a warning?

- Andres
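For reference, the construct this flag rejects looks like the following (a made-up minimal example, not code from the tree); GCC with -Wvla or -Werror=vla flags the array whose size is only known at run time:

  #include <string.h>

  void
  copy_name(const char *src, int len)
  {
      char        buf[len + 1];       /* variable-length array: stack usage
                                       * depends on run-time input, which is
                                       * exactly what -Werror=vla rejects */

      memcpy(buf, src, len);
      buf[len] = '\0';
      /* ... use buf ... */
  }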
Re: remove ATTRIBUTE_FIXED_PART_SIZE
On 18/08/2018 23:05, Tom Lane wrote: > Possibly we need to be more careful than we are now about whether > there's padding at the end of the fixed-size fields ... but just > ripping out the code that attempts to deal with that is hardly > an improvement. I don't think the tuple packing issue has to do with the tuple descriptor code. The tuple descriptors already use allocations of size sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy and memset calls use (potentially) less. That might have saved a few bytes for omitting the varlena fields, but I don't think it affects alignment correctness. If we, say, added a trailing char field now, the only thing this code -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: remove ATTRIBUTE_FIXED_PART_SIZE
On 20/08/2018 12:32, Peter Eisentraut wrote: > On 18/08/2018 23:05, Tom Lane wrote: >> Possibly we need to be more careful than we are now about whether >> there's padding at the end of the fixed-size fields ... but just >> ripping out the code that attempts to deal with that is hardly >> an improvement. > > I don't think the tuple packing issue has to do with the tuple > descriptor code. The tuple descriptors already use allocations of size > sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy > and memset calls use (potentially) less. That might have saved a few > bytes for omitting the varlena fields, but I don't think it affects > alignment correctness. If we, say, added a trailing char field now, the > only thing this code [oops] ... the only thing the current code would accomplish is not copying the last three padding bytes, which might even be slower than copying all four. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: ALTER TABLE on system catalogs
On 20/08/2018 04:37, Michael Paquier wrote: > For 0002, indeed the patch is still > seems relevant. The comment block at the beginning of > create_toast_table should be updated. Some tests would also be nice to > have. Tests would require enabling allow_system_table_mods, which is PGC_POSTMASTER, so we couldn't do it inside the normal regression test suite. I'm not sure setting up a whole new test suite for this is worth it. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: remove ATTRIBUTE_FIXED_PART_SIZE
On 2018-08-20 12:34:15 +0200, Peter Eisentraut wrote:
> On 20/08/2018 12:32, Peter Eisentraut wrote:
> > On 18/08/2018 23:05, Tom Lane wrote:
> >> Possibly we need to be more careful than we are now about whether
> >> there's padding at the end of the fixed-size fields ... but just
> >> ripping out the code that attempts to deal with that is hardly
> >> an improvement.
> >
> > I don't think the tuple packing issue has to do with the tuple
> > descriptor code. The tuple descriptors already use allocations of size
> > sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy
> > and memset calls use (potentially) less. That might have saved a few
> > bytes for omitting the varlena fields, but I don't think it affects
> > alignment correctness. If we, say, added a trailing char field now, the
> > only thing this code
>
> [oops]
>
> ... the only thing the current code would accomplish is not copying the
> last three padding bytes, which might even be slower than copying all four.

That's not generally the case though, there's code like:

static VacAttrStats *
examine_attribute(Relation onerel, int attnum, Node *index_expr)
{
...
	/*
	 * Create the VacAttrStats struct.  Note that we only have a copy of the
	 * fixed fields of the pg_attribute tuple.
	 */
	stats = (VacAttrStats *) palloc0(sizeof(VacAttrStats));
	stats->attr = (Form_pg_attribute) palloc(ATTRIBUTE_FIXED_PART_SIZE);
	memcpy(stats->attr, attr, ATTRIBUTE_FIXED_PART_SIZE);

Accesses to stats->attr can legitimately assume that the padding at the tail end of attr is present (i.e. widening a load / store).

Greetings,

Andres Freund
Re: ALTER TABLE on system catalogs
On 2018-08-20 11:37:49 +0900, Michael Paquier wrote: > On Fri, Jul 13, 2018 at 11:05:10AM +0200, Peter Eisentraut wrote: > > After reviewing that thread, I think my patch would still be relevant. > > Because the pending proposal is to not add TOAST tables to some catalogs > > such as pg_attribute, so my original use case of allowing ALTER TABLE / > > SET on system catalogs would still be broken for those tables. > > Something has happened in this area in the shape of 96cdeae, and the > following catalogs do not have toast tables: pg_class, pg_index, > pg_attribute and pg_largeobject*. While 0001 proposed upthread is not > really relevant anymore because there is already a test to list which > catalogs do not have a toast table. For 0002, indeed the patch is still > seems relevant. The comment block at the beginning of > create_toast_table should be updated. I still object to the approach in 0002. Greetings, Andres Freund
Re: partitioning - changing a slot's descriptor is expensive
On 20 August 2018 at 14:45, Amit Langote wrote: > Thanks for the review. > > On 2018/08/17 15:00, Amit Khandekar wrote: >> Thanks for the patch. I did some review of the patch (needs a rebase). >> Below are my comments. >> >> @@ -1936,12 +1936,11 @@ ExecPartitionCheckEmitError(ResultRelInfo >> *resultRelInfo, >> + /* Input slot might be of a partition, which has a fixed tupdesc. */ >> + slot = MakeTupleTableSlot(tupdesc); >> if (map != NULL) >> - { >> tuple = do_convert_tuple(tuple, map); >> - ExecSetSlotDescriptor(slot, tupdesc); >> - ExecStoreTuple(tuple, slot, InvalidBuffer, false); >> - } >> + ExecStoreTuple(tuple, slot, InvalidBuffer, false); >> >> Both MakeTupleTableSlot() and ExecStoreTuple() can be inside the (map >> != NULL) if condition. >> This also applies for similar changes in ExecConstraints() and >> ExecWithCheckOptions(). > > Ah, okay. I guess that means we'll allocate a new slot here only if we > had to switch to a partition-specific slot in the first place. > >> + * Initialize an empty slot that will be used to manipulate tuples of any >> + * this partition's rowtype. >> of any this => of this >> >> + * Initialized the array where these slots are stored, if not already >> Initialized => Initialize > > Fixed. > >> + proute->partition_tuple_slots_alloced = >> + lappend(proute->partition_tuple_slots_alloced, >> + proute->partition_tuple_slots[partidx]); >> >> Instead of doing the above, I think we can collect those slots in >> estate->es_tupleTable using ExecInitExtraTupleSlot() so that they >> don't have to be explicitly dropped in ExecCleanupTupleRouting(). And >> also then we won't require the new field >> proute->partition_tuple_slots_alloced. > > Although I was slightly uncomfortable of the idea at first, thinking that > it's not good for tuple routing specific resources to be released by > generic executor code, doesn't seem too bad to do it the way you suggest. > > Attached updated patch. Thanks for the changes. The patch looks good to me. > By the way, I think it might be a good idea to > try to merge this patch with the effort in the following thread. > > * Reduce partition tuple routing overheads * > https://commitfest.postgresql.org/19/1690/ Yes. I guess, whichever commits later needs to be rebased on the earlier one. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Slotification of partition tuple conversion
On 17 August 2018 at 21:47, Amit Khandekar wrote: > Attached is a patch tup_convert.patch that does the conversion > directly using input and output tuple slots. This patch is to be > applied on an essential patch posted by Amit Langote in [1] that > arranges for dedicated per-partition slots rather than changing > tupdesc of a single slot. I have rebased that patch from [1] and > attached it here ( > Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch). Amit Langote has posted a new version of the Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch at [1] . Attached is version v2 of the tup_convert.patch, that is rebased over that patch. [1] https://www.postgresql.org/message-id/attachment/64354/v2-0001-Allocate-dedicated-slots-of-partition-tuple-conve.patch -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company tup_convert_v2.patch Description: Binary data
Re: partitioning - changing a slot's descriptor is expensive
Hi, On 2018-08-20 16:40:17 +0530, Amit Khandekar wrote: > On 20 August 2018 at 14:45, Amit Langote > wrote: > > By the way, I think it might be a good idea to > > try to merge this patch with the effort in the following thread. > > > > * Reduce partition tuple routing overheads * > > https://commitfest.postgresql.org/19/1690/ > > Yes. I guess, whichever commits later needs to be rebased on the earlier one. I think I'd rather keep this patch separate, it's pretty simple, so we should be able to commit it fairly soon. Greetings, Andres Freund
Re: partitioning - changing a slot's descriptor is expensive
On 20 August 2018 at 23:21, Andres Freund wrote: > On 2018-08-20 16:40:17 +0530, Amit Khandekar wrote: >> On 20 August 2018 at 14:45, Amit Langote >> wrote: >> > By the way, I think it might be a good idea to >> > try to merge this patch with the effort in the following thread. >> > >> > * Reduce partition tuple routing overheads * >> > https://commitfest.postgresql.org/19/1690/ >> >> Yes. I guess, whichever commits later needs to be rebased on the earlier one. > > I think I'd rather keep this patch separate, it's pretty simple, so we > should be able to commit it fairly soon. +1. I think that patch is already big enough. I'm perfectly fine with rebasing that if this one gets committed first. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
libpq host/hostaddr/conninfo inconsistencies
Hello devs,

While reviewing various patches by Tom which focus on libpq multi-host behavior,

https://commitfest.postgresql.org/19/1749/
https://commitfest.postgresql.org/19/1752/

it occurred to me that there are a few more problems with the documentation, the host/hostaddr feature, and the consistency of both. Namely:

* According to the documentation, either "host" or "hostaddr" can be specified: the former for names and socket directories, the latter for IP addresses. If both are specified, "hostaddr" supersedes "host", and "host" may still be used for various authentication purposes. However, the actual capability is slightly different: specifying an IP address as "host" does work, without incurring any name or reverse-name look-ups, even though this is undocumented. This means that the host/hostaddr dichotomy is somewhat moot, as host can already be used for the same purpose.

* \conninfo does not follow the implemented logic, and, as there is no sanity check performed on the specification, it can display wrong information, which is not going to be helpful to anyone with a problem to solve who is trying to figure out the current state:

  sh> psql "host=/tmp hostaddr=127.0.0.1"
  psql> \conninfo
  You are connected to database "fabien" as user "fabien" via socket in "/tmp" at port "5432"
  # wrong, it is really connected to 127.0.0.1 by TCP/IP

  sh> psql "host=127.0.0.2 hostaddr=127.0.0.1"
  psql> \conninfo
  You are connected to database "fabien" as user "fabien" on host "127.0.0.2" at port "5432".
  # wrong again, it is really connected to 127.0.0.1

  sh> psql "hostaddr=127.0.0.1"
  psql> \conninfo
  You are connected to database "fabien" as user "fabien" via socket in "/var/run/postgresql" at port "5432".
  # wrong again

* Another issue with \conninfo is that if a host resolves to multiple IPs, there is no way to know which one was chosen and/or worked, although on errors some messages show the failing IP.

* The host/hostaddr dichotomy worsens when several targets are specified, because according to the documentation you should specify names & dirs as host and IPs as hostaddr, which leads to pretty strange specs, each being a possible source of confusion and unhelpful messages as described above:

  sh> psql "host=localhost,127.0.0.2,, hostaddr=127.0.0.1,,127.0.0.3,"
  # attempt 1 is 127.0.0.1 identified as localhost
  # attempt 2 is 127.0.0.2
  # attempt 3 is 127.0.0.3 identified as the default, whatever it is
  # attempt 4 is really the default

* The documentation about host/hostaddr/port accepting lists really reads like an afterthought: the features are presented for a single value, and then the list form is mentioned. Moreover there are quite a few repeats between the paragraphs about defaults and so on.

Given this state of affairs, ISTM that the situation would be clarified by:

(1) describing "host"'s full capability to accept names, IPs and dirs.

(2) describing "hostaddr" as a look-up shortcut. Maybe "hostaddr" could be renamed in passing, e.g. to "resolve", to make clear that it is just a lookup shortcut, and not a partial alternative to "host".

(3) checking that non-empty hostaddr addresses are only accepted if the corresponding host is a name. The user must use the "host=ip" syntax to connect to an IP.

(4) teaching \conninfo to show the real connection, which probably requires extending libpq to expose the underlying IP, e.g. PQaddr or PQhostaddr or whatever.

The attached patch does 1-3 (2 without renaming, though).

Thoughts? 
-- Fabien.diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index 49de975e7f..6b3e0b0b87 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -955,23 +955,28 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname host -Name of host to connect to.host name -If a host name begins with a slash, it specifies Unix-domain -communication rather than TCP/IP communication; the value is the -name of the directory in which the socket file is stored. If -multiple host names are specified, each will be tried in turn in -the order given. The default behavior when host is -not specified, or is empty, is to connect to a Unix-domain +Comma-separated list of hosts to connect to.host name +Each item may be a host name that will be resolved with a look-up, +a numeric IP address (IPv4 in the standard format, e.g., +172.28.40.9, or IPv6 if supported by your machine) +that will be used directly, or +the name of a directory which contains the socket file for Unix-domain +communication rather than TCP/IP communication +(the specification must then begin with a slash); + + + +The default behavior when host is +not specified, or an item is empty, is to connect to a Unix-domain socketUnix domain socket in /tmp (or whatever socket directory was sp
Re: ALTER TABLE on system catalogs
On Mon, Aug 20, 2018 at 12:43:20PM +0200, Peter Eisentraut wrote: > Tests would require enabling allow_system_table_mods, which is > PGC_POSTMASTER, so we couldn't do it inside the normal regression test > suite. I'm not sure setting up a whole new test suite for this is worth it. I forgot this point. Likely that's not worth it. -- Michael signature.asc Description: PGP signature
Re: Conflict handling for COPY FROM
On Sat, Aug 4, 2018 at 9:10 AM Surafel Temesgen wrote:
> In order to prevent extreme conditions the patch also adds a new GUC
> variable called copy_max_error_limit that controls the number of errors to
> swallow before starting to error out, and a new failed-record file option for
> COPY to write failed records to, so the user can examine them.
>

Why should this be a GUC rather than a COPY option? In fact, try doing all of this by adding more options to COPY rather than new syntax.

COPY ... WITH (ignore_conflicts 1000, ignore_logfile '...')

It kind of stinks to use a log file written by the server as the dup-reporting channel though. That would have to be superuser-only.

...Robert

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: TupleTableSlot abstraction
On 2018-08-20 17:51:38 +0530, Ashutosh Bapat wrote:
> > I also noticed an independent issue in your changes to
> > ExecInitScanTupleSlot(): You can't assume that the plan belonging to the
> > ScanState has a Scan node in it. Look e.g. at Material, Sort
> > etc. So currently your scanrelid access is often just uninitialized
> > data.
>
> I have incorporated changes in your patches into the relevant patches
> in the updated patch-set. With this patch-set make check-world passes
> for me.

Have you addressed the issue I commented on above?

Greetings,

Andres Freund
WaitForOlderSnapshots refactoring
The attached patch factors out the CREATE INDEX CONCURRENTLY code that waits for transactions with older snapshots to finish into a new function WaitForOlderSnapshots(). This refactoring was part of a previously posted REINDEX CONCURRENTLY patch. But this code is now also appearing as a copy-and-paste in the ATTACH/DETACH PARTITION CONCURRENTLY thread, so it might be worth making it an official thing. The question is where to put it. This patch just leaves it static in indexcmds.c, which doesn't help other uses. A sensible place might be a new src/backend/commands/common.c. Or we make it non-static in indexcmds.c when the need arises. Thoughts? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services From 61c413200d9806127a72d13a197b32527b48d793 Mon Sep 17 00:00:00 2001 From: Peter Eisentraut Date: Mon, 20 Aug 2018 13:58:22 +0200 Subject: [PATCH] Factor out WaitForOlderSnapshots() This code from CREATE INDEX CONCURRENTLY might also be useful for other future concurrent DDL commands, so make it a separate function to avoid everyone copy-and-pasting it. --- src/backend/commands/indexcmds.c | 155 +-- 1 file changed, 86 insertions(+), 69 deletions(-) diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index d54c78c352..8beb8bb899 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -297,6 +297,90 @@ CheckIndexCompatible(Oid oldId, return ret; } + +/* + * WaitForOlderSnapshots + * + * Wait for transactions that might have an older snapshot than the given xmin + * limit, because it might not contain tuples deleted just before it has + * been taken. Obtain a list of VXIDs of such transactions, and wait for them + * individually. + * + * We can exclude any running transactions that have xmin > the xmin given; + * their oldest snapshot must be newer than our xmin limit. + * We can also exclude any transactions that have xmin = zero, since they + * evidently have no live snapshot at all (and any one they might be in + * process of taking is certainly newer than ours). Transactions in other + * DBs can be ignored too, since they'll never even be able to see this + * index. + * + * We can also exclude autovacuum processes and processes running manual + * lazy VACUUMs, because they won't be fazed by missing index entries + * either. (Manual ANALYZEs, however, can't be excluded because they + * might be within transactions that are going to do arbitrary operations + * later.) + * + * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not + * check for that. + * + * If a process goes idle-in-transaction with xmin zero, we do not need to + * wait for it anymore, per the above argument. We do not have the + * infrastructure right now to stop waiting if that happens, but we can at + * least avoid the folly of waiting when it is idle at the time we would + * begin to wait. We do this by repeatedly rechecking the output of + * GetCurrentVirtualXIDs. If, during any iteration, a particular vxid + * doesn't show up in the output, we know we can forget about it. 
+ */ +static void +WaitForOlderSnapshots(TransactionId limitXmin) +{ + int i; + int n_old_snapshots; + VirtualTransactionId *old_snapshots; + + old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false, + PROC_IS_AUTOVACUUM | PROC_IN_VACUUM, + &n_old_snapshots); + + for (i = 0; i < n_old_snapshots; i++) + { + if (!VirtualTransactionIdIsValid(old_snapshots[i])) + continue; /* found uninteresting in previous cycle */ + + if (i > 0) + { + /* see if anything's changed ... */ + VirtualTransactionId *newer_snapshots; + int n_newer_snapshots; + int j; + int k; + + newer_snapshots = GetCurrentVirtualXIDs(limitXmin, + true, false, + PROC_IS_AUTOVACUUM | PROC_IN_VACUUM, + &n_newer_snapshots); + for (j = i; j < n_old_snapshots; j++) + { + if (!VirtualTransactionIdIsValid(old_snapshots[j])) + continue; /* found uninteresting in previous cycle */
Re: ALTER TABLE on system catalogs
On 20/08/2018 12:48, Andres Freund wrote: > On 2018-08-20 11:37:49 +0900, Michael Paquier wrote: >> On Fri, Jul 13, 2018 at 11:05:10AM +0200, Peter Eisentraut wrote: >>> After reviewing that thread, I think my patch would still be relevant. >>> Because the pending proposal is to not add TOAST tables to some catalogs >>> such as pg_attribute, so my original use case of allowing ALTER TABLE / >>> SET on system catalogs would still be broken for those tables. >> >> Something has happened in this area in the shape of 96cdeae, and the >> following catalogs do not have toast tables: pg_class, pg_index, >> pg_attribute and pg_largeobject*. While 0001 proposed upthread is not >> really relevant anymore because there is already a test to list which >> catalogs do not have a toast table. For 0002, indeed the patch is still >> seems relevant. The comment block at the beginning of >> create_toast_table should be updated. > > I still object to the approach in 0002. Do you have an alternative in mind? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: WaitForOlderSnapshots refactoring
Hi, On 2018-08-20 14:35:34 +0200, Peter Eisentraut wrote: > The attached patch factors out the CREATE INDEX CONCURRENTLY code that > waits for transactions with older snapshots to finish into a new > function WaitForOlderSnapshots(). > This refactoring was part of a previously posted REINDEX CONCURRENTLY > patch. But this code is now also appearing as a copy-and-paste in the > ATTACH/DETACH PARTITION CONCURRENTLY thread, so it might be worth making > it an official thing. I'm doubtful that ATTACH should use this, but I think the refactoring is a good idea regardless. > The question is where to put it. This patch just leaves it static in > indexcmds.c, which doesn't help other uses. A sensible place might be a > new src/backend/commands/common.c. Or we make it non-static in > indexcmds.c when the need arises. Why not move it to procarray.c? Most of the referenced functionality resides there IIRC. Greetings, Andres Freund
Re: ALTER TABLE on system catalogs
On 2018-08-20 14:38:25 +0200, Peter Eisentraut wrote:
> On 20/08/2018 12:48, Andres Freund wrote:
> > On 2018-08-20 11:37:49 +0900, Michael Paquier wrote:
> >> On Fri, Jul 13, 2018 at 11:05:10AM +0200, Peter Eisentraut wrote:
> >>> After reviewing that thread, I think my patch would still be relevant.
> >>> Because the pending proposal is to not add TOAST tables to some catalogs
> >>> such as pg_attribute, so my original use case of allowing ALTER TABLE /
> >>> SET on system catalogs would still be broken for those tables.
> >>
> >> Something has happened in this area in the shape of 96cdeae, and the
> >> following catalogs do not have toast tables: pg_class, pg_index,
> >> pg_attribute and pg_largeobject*. While 0001 proposed upthread is not
> >> really relevant anymore because there is already a test to list which
> >> catalogs do not have a toast table. For 0002, indeed the patch is still
> >> seems relevant. The comment block at the beginning of
> >> create_toast_table should be updated.
> >
> > I still object to the approach in 0002.
>
> Do you have an alternative in mind?

One is to just not do anything. I'm not sure I'm on board with the goal of changing things to make DDL on system tables more palatable.

If we want to do something, the first thing to do is to move the

	if (shared_relation && !IsBootstrapProcessingMode())
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("shared tables cannot be toasted after initdb")));

bit below the

	/*
	 * Is it already toasted?
	 */

and

	/*
	 * Check to see whether the table actually needs a TOAST table.
	 */

blocks. There's no point in erroring out when we'd actually not do squat anyway. By that point there are just a few remaining tables where you couldn't do the ALTER TABLE.

After that I think there are two further options: the second is to just follow the rules and add toast tables for the relations that formally need them, and review the VACUUM FULL/CLUSTER codepath around relation swapping. That's imo the best course.

The third option is to move those checks to the layers where they more reasonably belong. IMO that's needs_toast_table(). I disfavor this, but it's better than what you did IMO.

Greetings,

Andres Freund
Re: [FEATURE PATCH] pg_stat_statements with plans (v02)
Andres Freund writes: > On 2018-08-20 15:21:07 +1200, Thomas Munro wrote: >> -Werror=vla in GCC, apparently. > How about detecting support for that in our configure script and > automatically using it? If we're uncomfortable with raising an error, > let's at least raise a warning? Yeah, we'd certainly auto-turn-on whatever options we think are needed. regards, tom lane
Re: remove ATTRIBUTE_FIXED_PART_SIZE
Peter Eisentraut writes: >> I don't think the tuple packing issue has to do with the tuple >> descriptor code. The tuple descriptors already use allocations of size >> sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy >> and memset calls use (potentially) less. That might have saved a few >> bytes for omitting the varlena fields, but I don't think it affects >> alignment correctness. If we, say, added a trailing char field now, the >> only thing this code > ... the only thing the current code would accomplish is not copying the > last three padding bytes, which might even be slower than copying all four. But, if the varlena fields are all null, those bytes might not be there ... at least, not according to valgrind. I agree this is all moot as long as there's no pad bytes. What I'm trying to figure out is if we need to put in place some provisions to prevent there from being pad bytes at the end of any catalog struct. According to what Andres is saying, it seems like we do (at least for ones with varlena fields). regards, tom lane
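To make the concern concrete, here is a hypothetical struct in the style of the catalog structs (not an actual catalog; the names and layout are invented purely for illustration):

  typedef struct FormData_example
  {
      int32       id;             /* offset 0, 4 bytes */
      int32       count;          /* offset 4, 4 bytes */
      char        flag;           /* offset 8, 1 byte; 3 pad bytes follow */
      /* varlena fields would follow here, excluded from the "fixed part" */
  } FormData_example;

  /*
   * The "fixed part" ends at offsetof(FormData_example, flag) + 1 == 9,
   * while sizeof(FormData_example) == 12.  If a copy is palloc'd and
   * memcpy'd using only the 9-byte fixed-part size, the trailing pad bytes
   * are never written (and, if the allocation is also only 9 bytes, not
   * even allocated), yet a compiler accessing the copy as a whole struct
   * may legitimately widen loads or stores to cover them.
   */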
Re: ALTER TABLE on system catalogs
Andres Freund writes: > On 2018-08-20 14:38:25 +0200, Peter Eisentraut wrote: >> Do you have an alternative in mind? > One is to just not do anything. I'm not sure I'm on board with the goal > of changing things to make DDL on system tables more palatable. FWIW, I agree with doing as little as possible here. I'd be on board with Andres' suggestion of just swapping the two code stanzas so that the no-op case doesn't error out. As soon as you go beyond that, you are in wildly unsafe and untested territory. regards, tom lane
Re: Temporary tables prevent autovacuum, leading to XID wraparound
On 2018-Aug-20, Michael Paquier wrote: > On Tue, Aug 14, 2018 at 06:53:32PM +0200, Michael Paquier wrote: > > I was thinking about adding "Even if it is not atomic" or such at the > > beginning of the paragraph, but at the end your phrasing sounds better > > to me. So I have hacked up the attached, which also reworks the comment > > in InitTempTableNamespace in the same spirit. Thoughts? > > It has been close to one week since I sent this patch. Anything? I > don't like to keep folks unhappy about anything I committed. Seems reasonable. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Two proposed modifications to the PostgreSQL FDW
Hi all;

I am looking at making two modifications to the PostgreSQL FDW and would like feedback on them before I do.

1. INSERTMETHOD=[insert|copy] option on foreign table.

One significant limitation of the PostgreSQL FDW is that it does a prepared-statement insert for each row written, which imposes a per-row latency. This hits environments where there is significant latency or few latency guarantees particularly hard, for example, writing to a foreign table that might be physically located on another continent. The idea is that INSERTMETHOD would default to insert, and therefore change nothing, but where needed people could specify COPY, which would stream the data out. Updates would still be unaffected.

2. TWOPHASECOMMIT=[off|on] option

The second major issue that I see with PostgreSQL's foreign data wrappers is that there is no two-phase commit, which means that a single transaction writing to a group of tables has no expectation that all backends will commit or roll back together. With this patch, an option would be applied to foreign tables such that they could be set to use two-phase commit. When this is done, the first write to each backend would register a connection with a global transaction handler, and pre-commit and commit hooks would be set up to properly process these.

On pre-commit, a per-global-transaction file would be opened in the data directory and prepare statements logged to the file. On error, we simply roll back our local transaction.

On the commit hook, we go through and start to commit the remote global transactions. At this point we make a best effort but track whether or not we were successful on all of them. If successful on all, we delete the file. If unsuccessful, we fire a background worker which re-reads the file and is responsible for cleanup. If global transactions persist, a SQL administration function will be made available to restart the cleanup process. On rollback, we do as on commit but roll back all transactions in the set. The file has enough information to determine whether we should be committing or rolling back on cleanup.

I would like to push both of these for Pg 12. Is there any feedback on the concepts and the problems first?

--
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin
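For concreteness, a hypothetical sketch of how the proposed per-table options might be spelled; the option names simply mirror the proposal above, and none of this exists today:

  CREATE FOREIGN TABLE remote_events (
      id      bigint,
      payload text
  )
  SERVER overseas_server
  OPTIONS (insertmethod 'copy', twophasecommit 'on');

  -- or, for an existing foreign table:
  ALTER FOREIGN TABLE remote_events
      OPTIONS (ADD insertmethod 'copy', ADD twophasecommit 'on');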
Re: Two proposed modifications to the PostgreSQL FDW
Hi, On 2018-08-20 16:28:01 +0200, Chris Travers wrote: > 1. INSERTMETHOD=[insert|copy] option on foreign table. > > One significant limitation of the PostgreSQL FDW is that it does a prepared > statement insert on each row written which imposes a per-row latency. This > hits environments where there is significant latency or few latency > guarantees particularly hard, for example, writing to a foreign table that > might be physically located on another continent. The idea is that > INSERTMETHOD would default to insert and therefore have no changes but > where needed people could specify COPY which would stream the data out. > Updates would still be unaffected. That has a *lot* of semantics issues, because you suddenly don't get synchronous error reports anymore. I don't think that's OK on a per-table basis. If we invented something like this, it IMO should be a per-statement explicit opt in that'd allow streaming. > 2. TWOPHASECOMMIT=[off|on] option > The second major issue that I see with PostgreSQL's foreign database > wrappers is the fact that there is no two phase commit which means that a > single transaction writing to a group of tables has no expectation that all > backends will commit or rollback together. With this patch an option would > be applied to foreign tables such that they could be set to use two phase > commit When this is done, the first write to each backend would register a > connection with a global transaction handler and a pre-commit and commit > hooks would be set up to properly process these. > > On recommit a per-global-transaction file would be opened in the data > directory and prepare statements logged to the file. On error, we simply > roll back our local transaction. > > On commit hook , we go through and start to commit the remote global > transactions. At this point we make a best effort but track whether or not > we were successfully on all. If successful on all, we delete the file. If > unsuccessful we fire a background worker which re-reads the file and is > responsible for cleanup. If global transactions persist, a SQL > administration function will be made available to restart the cleanup > process. On rollback, we do like commit but we roll back all transactions > in the set. The file has enough information to determine whether we should > be committing or rolling back on cleanup. > > I would like to push these both for Pg 12. Is there any feedback on the > concepts and the problems first There's been *substantial* work on this. You should at least read the discussion & coordinate with the relevant developers. See https://commitfest.postgresql.org/19/1574/ and the referenced discussions. Greetings, Andres Freund
Re: Two proposed modifications to the PostgreSQL FDW
Chris Travers writes: > I am looking at trying to make two modifications to the PostgreSQL FDW and > would like feedback on this before I do. > 1. INSERTMETHOD=[insert|copy] option on foreign table. > One significant limitation of the PostgreSQL FDW is that it does a prepared > statement insert on each row written which imposes a per-row latency. This > hits environments where there is significant latency or few latency > guarantees particularly hard, for example, writing to a foreign table that > might be physically located on another continent. The idea is that > INSERTMETHOD would default to insert and therefore have no changes but > where needed people could specify COPY which would stream the data out. > Updates would still be unaffected. It seems unlikely to me that an FDW option would be at all convenient for this. What about selecting it dynamically based on the planner's estimate of the number of rows to be inserted? A different thing we could think about is enabling COPY TO/FROM a foreign table. > 2. TWOPHASECOMMIT=[off|on] option > The second major issue that I see with PostgreSQL's foreign database > wrappers is the fact that there is no two phase commit which means that a > single transaction writing to a group of tables has no expectation that all > backends will commit or rollback together. With this patch an option would > be applied to foreign tables such that they could be set to use two phase > commit When this is done, the first write to each backend would register a > connection with a global transaction handler and a pre-commit and commit > hooks would be set up to properly process these. ENOINFRASTRUCTURE ... and the FDW pieces of that hardly seem like the place to start. regards, tom lane
Re: Two proposed modifications to the PostgreSQL FDW
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2018-08-20 16:28:01 +0200, Chris Travers wrote:
> > 1. INSERTMETHOD=[insert|copy] option on foreign table.
> >
> > One significant limitation of the PostgreSQL FDW is that it does a prepared
> > statement insert on each row written which imposes a per-row latency. This
> > hits environments where there is significant latency or few latency
> > guarantees particularly hard, for example, writing to a foreign table that
> > might be physically located on another continent. The idea is that
> > INSERTMETHOD would default to insert and therefore have no changes but
> > where needed people could specify COPY which would stream the data out.
> > Updates would still be unaffected.
>
> That has a *lot* of semantics issues, because you suddenly don't get
> synchronous error reports anymore. I don't think that's OK on a
> per-table basis. If we invented something like this, it IMO should be a
> per-statement explicit opt in that'd allow streaming.

Doing some kind of decoration on a per-statement level to do something different for FDWs doesn't really seem very clean.

On reading this, a thought I had was that maybe we should just perform a COPY to the FDW when COPY is what's been specified by the user (eg:

COPY my_foreign_table FROM STDIN;

), but that wouldn't help when someone wants to bulk copy data from a local table into a foreign table.

COPY is already non-standard though, so we can extend it, and one option might be to extend it like so:

COPY my_local_table TO TABLE my_foreign_table;

Which could be made to work for both foreign tables and local ones, where it'd basically be:

INSERT INTO my_foreign_table SELECT * FROM my_local_table;

The COPY TO case already supports queries, such that you could then do:

COPY (SELECT c1,c2,c3 FROM my_local_table) TO TABLE my_foreign_table;

I'd also think we'd want to support this kind of 'bulk COPY-like' operation for multiple FDWs (I wonder if maybe file_fdw could be made to support this new method, thus allowing users to write out to files with it, which we don't support today at all).

Just some brain-storming and ideas about where this could possibly go.

Thanks!

Stephen

signature.asc
Description: PGP signature
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
On Wed, Apr 18, 2018 at 11:49 PM Tom Lane wrote:
> I wrote:
> > Relation truncation throws away the page image in memory without ever
> > writing it to disk. Then, if the subsequent file truncate step fails,
> > we have a problem, because anyone who goes looking for that page will
> > fetch it afresh from disk and see the tuples as live.
>
> > There are WAL entries recording the row deletions, but that doesn't
> > help unless we crash and replay the WAL.
>
> > It's hard to see a way around this that isn't fairly catastrophic for
> > performance :-(.
>
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL. Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable. It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

We have only two complaints of data corruption in nine years. But I suspect that in the vast majority of cases the truncation error either didn't cause corruption or the corruption wasn't noticed. So, once we introduce a PANIC here, we would get way more complaints.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely. PANIC'ing again will not
> do. I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.
>
> Or maybe just do that in the mainline case too? That is, if ftruncate
> fails, handle it by zeroing the undeletable pages and pressing on?

I've just started really digging into this set of problems, but so far this idea looks good to me...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: libpq compression
On 13.08.2018 23:06, Andrew Dunstan wrote: On 08/13/2018 02:47 PM, Robert Haas wrote: On Fri, Aug 10, 2018 at 5:55 PM, Andrew Dunstan wrote: This thread appears to have gone quiet. What concerns me is that there appears to be substantial disagreement between the author and the reviewers. Since the last thing was this new patch it should really have been put back into "needs review" (my fault to some extent - I missed that). So rather than return the patch with feedfack I'm going to set it to "needs review" and move it to the next CF. However, if we can't arrive at a consensus about the direction during the next CF it should be returned with feedback. I agree with the critiques from Robbie Harwood and Michael Paquier that the way in that compression is being hooked into the existing architecture looks like a kludge. I'm not sure I know exactly how it should be done, but the current approach doesn't look natural; it looks like it was bolted on. I agree with the critique from Peter Eisentraut and others that we should not go around and add -Z as a command-line option to all of our tools; this feature hasn't yet proved itself to be useful enough to justify that. Better to let people specify it via a connection string option if they want it. I think Thomas Munro was right to ask about what will happen when different compression libraries are in use, and I think failing uncleanly is quite unacceptable. I think there needs to be some method for negotiating the compression type; the client can, for example, provide a comma-separated list of methods it supports in preference order, and the server can pick the first one it likes. In short, I think that a number of people have provided really good feedback on this patch, and I suggest to Konstantin that he should consider accepting all of those suggestions. Commit ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed tried to introduce some facilities that can be used for protocol version negotiation as new features are added, but this patch doesn't use them. It looks to me like it instead just breaks backward compatibility. The new 'compression' option won't be recognized by older servers. If it were spelled '_pq_.compression' then the facilities in that commit would cause a NegotiateProtocolVersion message to be sent by servers which have that commit but don't support the compression option. I'm not exactly sure what will happen on even-older servers that don't have that commit either, but I hope they'll treat it as a GUC name; note that setting an unknown GUC name with a namespace prefix is not an error, but setting one without such a prefix IS an ERROR. Servers which do support compression would respond with a message indicating that compression had been enabled or, maybe, just start sending back compressed-packet messages, if we go with including some framing in the libpq protocol. Excellent summary, and well argued recommendations, thanks. I've changed the status to waiting on author. New version of the patch is attached: I removed -Z options form pgbench and psql and add checking that server and client are implementing the same compression algorithm. 
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/configure b/configure index 0aafd9c..fc5685c 100755 --- a/configure +++ b/configure @@ -699,6 +699,7 @@ ELF_SYS EGREP GREP with_zlib +with_zstd with_system_tzdata with_libxslt with_libxml @@ -863,6 +864,7 @@ with_libxml with_libxslt with_system_tzdata with_zlib +with_zstd with_gnu_ld enable_largefile enable_float4_byval @@ -8017,6 +8019,86 @@ fi # +# ZStd +# + + + +# Check whether --with-zstd was given. +if test "${with_zstd+set}" = set; then : + withval=$with_zstd; + case $withval in +yes) + ;; +no) + : + ;; +*) + as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5 + ;; + esac + +else + with_zstd=no + +fi + + + + +if test "$with_zstd" = yes ; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5 +$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; } +if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then : + $as_echo_n "(cached) " >&6 +else + ac_check_lib_save_LIBS=$LIBS +LIBS="-lzstd $LIBS" +cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ + +/* Override any GCC internal prototype to avoid an error. + Use char because int might match the return type of a GCC + builtin and then its argument prototype would still apply. */ +#ifdef __cplusplus +extern "C" +#endif +char ZSTD_compress (); +int +main () +{ +return ZSTD_compress (); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + ac_cv_lib_zstd_ZSTD_compress=yes +else + ac_cv_lib_zstd_ZSTD_compress=no +fi +rm -f core conftest.err conftest.$ac_objext \ +conftest$ac_exeext conftest.$ac_ext +LIBS=$ac_check_lib_save_LIBS +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}
Re: Two proposed modifications to the PostgreSQL FDW
Hi, On 2018-08-20 10:56:39 -0400, Stephen Frost wrote: > * Andres Freund (and...@anarazel.de) wrote: > > On 2018-08-20 16:28:01 +0200, Chris Travers wrote: > > > 1. INSERTMETHOD=[insert|copy] option on foreign table. > > > > > > One significant limitation of the PostgreSQL FDW is that it does a > > > prepared > > > statement insert on each row written which imposes a per-row latency. > > > This > > > hits environments where there is significant latency or few latency > > > guarantees particularly hard, for example, writing to a foreign table that > > > might be physically located on another continent. The idea is that > > > INSERTMETHOD would default to insert and therefore have no changes but > > > where needed people could specify COPY which would stream the data out. > > > Updates would still be unaffected. > > > > That has a *lot* of semantics issues, because you suddenly don't get > > synchronous error reports anymore. I don't think that's OK on a > > per-table basis. If we invented something like this, it IMO should be a > > per-statement explicit opt in that'd allow streaming. > > Doing some kind of decoration on a per-statement level to do something > different for FDWs doesn't really seem very clean.. I think it's required. The semantics of an INSERT statement *drastically* change if you don't insert remotely. Constraints aren't evaluated once the command finished, sequences aren't increased until later, there'll be weird interactions with savepoints, ... Without executing immediately remotely it's basically isn't a normal INSERT anymore. Note bulk INSERT and single row INSERT are very different here. That's not to say it's not useful to pipeline. To the contrary. > On reading this, a thought I had was that maybe we should just perform a > COPY to the FDW when COPY is what's been specified by the user (eg: > > COPY my_foreign_table FROM STDIN; Right. There'd not even need to be a an option since that's already pipelined. > ), but that wouldn't help when someone wants to bulk copy data from a > local table into a foreign table. That possibly still is doable, just with INSERT, as you don't need (may not even, in plenty cases) to see the effects of the statement until the CommandCounterIncrement(). So we can delay the flush a bit. Greetings, Andres Freund
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
On Wed, Apr 18, 2018 at 10:04 PM Tom Lane wrote: > [ re-reads thread... ] The extra assumption you need in order to have > trouble is that the blocks in question are dirty in shared buffers and > have never been written to disk since their rows were deleted. Then > the situation is that the page image on disk shows the rows as live, > while the up-to-date page image in memory correctly shows them as dead. > Relation truncation throws away the page image in memory without ever > writing it to disk. Then, if the subsequent file truncate step fails, > we have a problem, because anyone who goes looking for that page will > fetch it afresh from disk and see the tuples as live. > > There are WAL entries recording the row deletions, but that doesn't > help unless we crash and replay the WAL. > > It's hard to see a way around this that isn't fairly catastrophic for > performance :-(. But in any case it's wrapped up in order-of-operations > issues. I've long since forgotten the details, but I seem to have thought > that there were additional order-of-operations hazards besides this one. Just for clarification. Do you mean zeroing of to-be-truncated blocks to be catastrophic for performance? Or something else? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: Truncation failure in autovacuum results in data corruption (duplicate keys)
Alexander Korotkov writes: > On Wed, Apr 18, 2018 at 10:04 PM Tom Lane wrote: >> It's hard to see a way around this that isn't fairly catastrophic for >> performance :-(. But in any case it's wrapped up in order-of-operations >> issues. I've long since forgotten the details, but I seem to have thought >> that there were additional order-of-operations hazards besides this one. > Just for clarification. Do you mean zeroing of to-be-truncated blocks > to be catastrophic for performance? Or something else? It would be pretty terrible to have to do that in the normal code path. The other idea that was in my mind was to force out dirty buffers, then discard them, then truncate ... but that's awful too if there's a lot of dirty buffers that we'd have to write only to throw the data away. I think it's all right to be slow if the truncation fails, though; that does not seem like a path that has to be fast, only correct. One thing to be concerned about is that as soon as we've discarded any page images from buffers, we have to be in a critical section all the way through till we've either truncated or zeroed those pages on-disk. Any failure in that has to result in a PANIC and recover-from-WAL, because we don't know what state we lost by dropping the buffers. Ugh. It's especially bad if the truncation fails because the file got marked read-only, because then the zeroing is also going to fail, making that a guaranteed PANIC case (with no clear path to recovery after the panic, either ...) I wonder if it could help to do something like zeroing the buffers in memory, then truncating, then discarding buffers. This is just a half-baked idea and I don't have time to think more right now, but maybe making two passes over the shared buffers could lead to a better solution. It's the useless I/O that we need to avoid, IMO. regards, tom lane
Re: Two proposed modifications to the PostgreSQL FDW
On Mon, Aug 20, 2018 at 4:41 PM Andres Freund wrote: > Hi, > > On 2018-08-20 16:28:01 +0200, Chris Travers wrote: > > 1. INSERTMETHOD=[insert|copy] option on foreign table. > > > > One significant limitation of the PostgreSQL FDW is that it does a > prepared > > statement insert on each row written which imposes a per-row latency. > This > > hits environments where there is significant latency or few latency > > guarantees particularly hard, for example, writing to a foreign table > that > > might be physically located on another continent. The idea is that > > INSERTMETHOD would default to insert and therefore have no changes but > > where needed people could specify COPY which would stream the data out. > > Updates would still be unaffected. > > That has a *lot* of semantics issues, because you suddenly don't get > synchronous error reports anymore. I don't think that's OK on a > per-table basis. If we invented something like this, it IMO should be a > per-statement explicit opt in that'd allow streaming. > I want to push back a bit and justify the per table decision here. This is primarily useful when two things are present: 1. There is significant lag between the servers 2. There are significant bulk operations on the table. Only when both are present does it make sense to use this option. If you have a low latency connection, don't use it. If you are only writing a few rows here and there, don't use it. If you have a high latency connection *and* you are inserting large numbers of rows at a time, then this is where you run into a problem and the current FDW approach is, frankly, unusable in these cases. These characteristics follow the server (and hence the table) in the first case and the table only in the second case. Therefore the characteristics of the foreign table determine whether it makes sense to consider this option. Doing this on the statement level poses more problems, in my view, than it fixes. Moreover the idea of batching inserts doesn't help either (and actually would be worse in many cases than the current approach) since the current approach uses a prepared statement which gets called for each row. Doing a batch insert might save you on the per-row round trip but it adds a great deal of complexity to the overall operation. Now, a per-statement opt-in poses a number of questions. Obviously we'd have to allow this on local tables too, but it would have no effect. So suddenly you have a semantic construct which *may* or *may not* pose the problems you mention and in fact may or may not pose the problem on foreign tables because whether the construct does anything depends on the FDW driver. I don't see what exactly one buys in shifting the problem in this way given that when this is useful it will be used on virtually all inserts and when it is not useful it won't be used at all. Now, the entire problem is the expectation of synchronous errors on a per row basis. This is what causes the problem with inserting a million rows over a transcontinental link. (insert row, check error, insert row, check error etc). So on to the question of containing the problems. I think the best way to contain this is to have one COPY statement per insert planner node. And that makes sure that the basic guarantee that a planner node succeeds or errors is met. For sequences, I would expect that supplying a column list without the relevant sequenced value should behave in a tolerable way (otherwise two concurrent copies would have problems). So that's the primary defense.
Now let's look at the main alternatives. 1. You can use a table where you logically replicate inserts, and where you insert then truncate or delete. This works fine until you have enough WAL traffic that you cannot tolerate a replication outage. 2. You can write an extension which allows you to COPY a current table into a foreign table. The problem here is that SPI has a bunch of protections to keep you from copying to stdout, so when I helped with this proof of concept we copied to a temp file in a ramdisk and copied from there, which is frankly quite a bit worse than this approach. If you want a per statement opt-in, I guess the questions that come to my mind are: 1. What would this opt-in look like? 2. How important is it that we avoid breaking current FDW interfaces to implement it? 3. How would you expect it to behave if directed at a local table or a foreign table on, say, a MySQL db? > > > > 2. TWOPHASECOMMIT=[off|on] option > > > The second major issue that I see with PostgreSQL's foreign database > > wrappers is the fact that there is no two phase commit which means that a > > single transaction writing to a group of tables has no expectation that all > > backends will commit or rollback together. With this patch an option would > > be applied to foreign tables such that they could be set to use two phase > > commit. When this is done, the first write to each backend would register a connection with a global transaction handler and pre-commit and commit hooks would be set up to properly process these.
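For reference, the remote-side primitives such a handler would have to drive are the existing two-phase-commit commands; a minimal sketch of what each participating foreign server would be asked to run (the transaction identifier naming is invented), noting that the remote servers must have max_prepared_transactions set above zero for any of this to work:

-- On each remote server, at local pre-commit time:
PREPARE TRANSACTION 'fdw_gtx_1234_node2';

-- Later, from the commit hook or the cleanup worker, one of:
COMMIT PREPARED 'fdw_gtx_1234_node2';
ROLLBACK PREPARED 'fdw_gtx_1234_node2';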
Re: A really subtle lexer bug
> "Andrew" == Andrew Gierth writes: Andrew> select f(a =>-1); -- ERROR: column "a" does not exist Andrew> I guess the fix is to extend the existing special case code Andrew> that checks for one character left after removing trailing [+-] Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>" Andrew> "!=". Patch attached. This fixes two bugs: first the mis-lexing of two-char ops as mentioned originally; second, the O(N^3) lexing time of strings of - or + characters is reduced to O(N^2) (in practice it's better than O(N^2) once N gets large because the bison stack gets blown out, ending the loop early). -- Andrew (irc:RhodiumToad) diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l index 0cd782827a..cc855ffb1c 100644 --- a/src/backend/parser/scan.l +++ b/src/backend/parser/scan.l @@ -885,20 +885,34 @@ other . * to forbid operator names like '?-' that could not be * sequences of SQL operators. */ - while (nchars > 1 && - (yytext[nchars - 1] == '+' || - yytext[nchars - 1] == '-')) + if (nchars > 1 && + (yytext[nchars - 1] == '+' || + yytext[nchars - 1] == '-')) { int ic; for (ic = nchars - 2; ic >= 0; ic--) { - if (strchr("~!@#^&|`?%", yytext[ic])) + char c = yytext[ic]; + if (c == '~' || c == '!' || c == '@' || +c == '#' || c == '^' || c == '&' || +c == '|' || c == '`' || c == '?' || +c == '%') break; } - if (ic >= 0) - break; /* found a char that makes it OK */ - nchars--; /* else remove the +/-, and check again */ + if (ic < 0) + { + /* + * didn't find a qualifying character, so remove + * all trailing [+-] + */ + for (nchars--; + nchars > 1 && + (yytext[nchars - 1] == '+' || + yytext[nchars - 1] == '-'); + nchars--) +continue; + } } SET_YYLLOC(); @@ -916,6 +930,25 @@ other . if (nchars == 1 && strchr(",()[].;:+-*/%^<>=", yytext[0])) return yytext[0]; + /* + * Likewise, if what we have left is two chars, and + * those match the tokens ">=", "<=", "=>", "<>" or + * "!=", then we must return the sppropriate token + * rather than the generic Op. + */ + if (nchars == 2) + { + if (yytext[0] == '=' && yytext[1] == '>') +return EQUALS_GREATER; + if (yytext[0] == '>' && yytext[1] == '=') +return GREATER_EQUALS; + if (yytext[0] == '<' && yytext[1] == '=') +return LESS_EQUALS; + if (yytext[0] == '<' && yytext[1] == '>') +return NOT_EQUALS; + if (yytext[0] == '!' && yytext[1] == '=') +return NOT_EQUALS; + } } /*
Re: Two proposed modifications to the PostgreSQL FDW
For the record, here's the proof of concept code for the transaction manager which works off libpq connections. It is not ready yet by any means. But it is included for design discussion. If the previous patch gets in instead, that's fine, but figure it is worth including here for discussion purposes. Two things which are currently missing are a) an ability to specify in the log file where the cleanup routine is located for a background worker and b) a separation of backend responsibility for restarting cleanup efforts on server start. > >> >> > > -- > Best Regards, > Chris Travers > Head of Database > > Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com > Saarbrücker Straße 37a, 10405 Berlin > > -- Best Regards, Chris Travers Head of Database Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com Saarbrücker Straße 37a, 10405 Berlin twophasecommit.h Description: Binary data twophasecommit.c Description: Binary data
Re: Getting NOT NULL constraint from pg_attribute
Hi Tom, Thanks for the quick response. Why are the output columns of a SELECT query never marked nullable? For nullable columns, when I call SPI_getvalue(), the result (in char*) is NULL. I don’t think I’m too clear on the definition of attnotnull. Can you give me an example in which the tupleTable can be marked nullable? Also, are there any other ways to get the nullability of each column while getting the data from SPI_cursor_fetch? The only way I can think of is to call another separate command to query the table schema, but it will be in a separate transaction in that case. Thank you again! Best, Ivy > On Aug 17, 2018, at 6:02 PM, Tom Lane wrote: > > Wu Ivy writes: >> I’m currently building a Postgres C extension that fetch data from a >> Postgres table. >> Since the table can be large, in order to prevent memory overrun, I use >> SPI_cursor_fetch to fetch chunks of data. The result rows are saved in >> SPITupleTable* SPI_tuptable and attributes are saved in >> SPI_tuptable->tupdesc. >> In order to process my data, I need to get information of column nullability >> (whether column has NOT NULL constrain). I can get this information by >> calling: > >> TupleDesc tupdesc = SPI_tuptable->tupdesc; >> bool is_nullable = TupleDescAttr(tupdesc, column_num - 1) -> attnotnull; >> However, the result (is_nullable) is always 0, meaning the column does not >> have NOT NULLl constraint, even for columns that do have the NOT NULL >> constraint. > > The output columns of a SELECT query are never marked nullable, regardless > of what the source data was. > > regards, tom lane
Re: Getting NOT NULL constraint from pg_attribute
On Monday, August 20, 2018, Wu Ivy wrote: > > Thanks for the quick respond. > Why are SELECT query never marked nullable? For nullable columns, when I > call SPI_getvalue(), the result (in char*) is NULL. I don’t think I’m too > clear on the definition of *attnotnull*. Can you give me a example in > which the tupleTable is can be marked nullable? > Also, is there any other ways to get nullability of each column while > getting the data from SPI_cursor_fetch? The only way I can think is to call > another separate command to query the table schema, but it will be in a > separate transaction in that case. > Basically the nullability property is used by the planner for optimization during the joining of physical tables. As soon as you try outputting columns the ability to enforce not null goes away because of, in particular, outer joins. While some changes could maybe be made the cost-benefit to do so doesn't seem favorable. David J.
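For the second part of the question (getting per-column nullability without relying on the SELECT result's tuple descriptor), the NOT NULL property can be read from the catalog directly; a small sketch, with the table name as a placeholder:

SELECT a.attname, a.attnotnull
FROM pg_attribute a
WHERE a.attrelid = 'my_table'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
ORDER BY a.attnum;

The same query can be issued through SPI from inside the extension, so it runs in the same backend and transaction as the cursor fetches rather than requiring a separate client connection.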
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On 2018-Aug-13, Robert Haas wrote: > I think this is a somewhat confused analysis. We don't use > SnapshotAny for catalog scans, and we never have. We used to use > SnapshotNow, and we now use a current MVCC snapshot. What you're > talking about, I think, is possibly using the transaction snapshot > rather than a current MVCC snapshot for the catalog scans. > > I've thought about similar things, but I think there's a pretty deep > can of worms. For instance, if you built a relcache entry using the > transaction snapshot, you might end up building a seemingly-valid > relcache entry for a relation that has been dropped or rewritten. > When you try to access the relation data, you'll be attempt to access > a relfilenode that's not there any more. Similarly, if you use an > older snapshot to build a partition descriptor, you might thing that > relation OID 12345 is still a partition of that table when in fact > it's been detached - and, maybe, altered in other ways, such as > changing column types. I wonder if this all stems from a misunderstanding of what I suggested to David offlist. My suggestion was that the catalog scans would continue to use the catalog MVCC snapshot, and that the relcache entries would contain all the partitions that appear to the catalog; but each partition's entry would carry the Xid of the creating transaction in a field (say xpart), and that field is compared to the regular transaction snapshot: if xpart is visible to the transaction snapshot, then the partition is visible, otherwise not. So you never try to access a partition that doesn't exist, because those just don't appear at all in the relcache entry. But if you have an old transaction running with an old snapshot, and the partitioned table just acquired a new partition, then whether the partition will be returned as part of the partition descriptor or not depends on the visibility of its entry. I think that works fine for ATTACH without any further changes. I'm not so sure about DETACH, particularly when snapshots persist for a "long time" (a repeatable-read transaction). ISTM that in the above design, the partition descriptor would lose the entry for the detached partition ahead of time, which means queries would silently fail to see their data (though they wouldn't crash). I first thought this could be fixed by waiting for those snapshots to finish, but then I realized that there's no actual place where waiting achieves anything. Certainly it's not useful to wait before commit (because other snapshots are going to be starting all the time), and it's not useful to start after the commit (because by then the catalog tuple is already gone). Maybe we need two transactions: mark partition as removed with an xmax of sorts, commit, wait for all snapshots, start transaction, remove partition catalog tuple, commit. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
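The user-visible behaviour this scheme aims at can be sketched as follows; the table and partition names are invented, and this describes the proposed semantics under discussion, not what any released version does:

-- Session 1
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM parted;   -- the transaction snapshot is taken here

-- Session 2 (hypothetically allowed to proceed without a blocking lock,
-- which is the point of the feature being designed)
ALTER TABLE parted ATTACH PARTITION parted_new FOR VALUES FROM (100) TO (200);

-- Session 1, same transaction: under the xpart scheme the attaching
-- transaction's xid is not visible to this snapshot, so the new partition is
-- simply absent from the partition descriptor this query uses.
SELECT count(*) FROM parted;
COMMIT;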
Re: Improve behavior of concurrent ANALYZE/VACUUM
Hi, Sorry for the delay! I looked through the latest patch. On 8/17/18, 1:43 AM, "Michael Paquier" wrote: > I have reworked the patch on this side, clarifying the use of the new > common API for the logs. One thing I am wondering about is what do we > want to do when VACUUM ANALYZE is used. As of HEAD, if vacuum_rel() > stops, then analyze_rel() is never called, and the only log showing up > to a non-owner user would be: > skipping "foo" --- only superuser can vacuum it > > With this patch, things become perhaps more confusing by generating two > WARNING log entries: > skipping "foo" --- only superuser can vacuum it > skipping "foo" --- only superuser can analyze it > > We could either combine both in a single message, or just generate the > message for vacuum as HEAD does now. I have also added some simple > regression tests triggering the skipping logs for shared catalogs, > non-shared catalogs and non-owners. This could be a separate patch as > well. I like the idea of emitting a separate WARNING for each for clarity on what operations are being skipped. However, I think it could be a bit confusing with the current wording. Perhaps something like "skipping vacuum of..." and "skipping analyze of..." would make things clearer. Another thing to keep in mind is how often only one of these messages will apply. IIUC the vast majority of VACUUM (ANALYZE) statements that need to emit such log statements would emit both. Plus, while VACUUM (ANALYZE) is clearly documented as performing both operations, I can easily see the argument that users may view it as one big command and that emitting multiple log entries could be a confusing change in behavior. In short, my vote would be to maintain the current behavior for now and to bring up any logging improvements separately. + /* +* Depending on the permission checks done while building the list +* of relations to work on, it could be possible that the list is +* empty, hence do nothing in this case. +*/ + if (relations == NIL) + return; It might be better to let the command go through normally so that we don't miss any cleanup at the end (e.g. deleting vac_context). +/* + * Check if a given relation can be safely vacuumed or not. If the + * user is not the relation owner, issue a WARNING log message and return + * false to let the caller decide what to do with this relation. + */ +bool +vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple, int options) +{ + Assert((options & (VACOPT_VACUUM | VACOPT_ANALYZE)) != 0); + + /* +* Check permissions. +* +* We allow the user to vacuum a table if he is superuser, the table +* owner, or the database owner (but in the latter case, only if it's not +* a shared relation). pg_class_ownercheck includes the superuser case. +* +* Note we choose to treat permissions failure as a WARNING and keep +* trying to vacuum the rest of the DB --- is this appropriate? +*/ Do you think it's worth adding ANALYZE to these comments as well? + if (!(pg_class_ownercheck(relid, GetUserId()) || + (pg_database_ownercheck(MyDatabaseId, GetUserId()) && !reltuple->relisshared))) Returning true right away when the role does have permissions might be a nice way to save a level of indentation. + /* +* To check whether the relation is a partitioned table and its +* ownership, fetch its syscache entry. 
+*/ + tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", relid); + classForm = (Form_pg_class) GETSTRUCT(tuple); + + /* check permissions of relation */ + if (!vacuum_is_relation_owner(relid, classForm, options)) + { + ReleaseSysCache(tuple); + + /* +* Release lock again with AccessShareLock -- see below for +* the reason why this lock is released. +*/ + UnlockRelationOid(relid, AccessShareLock); + return vacrels; + } I think this actually changes the behavior for partitioned tables. Presently, we still go through and collect all the partitions in the vacrels list. With this change, we will skip collecting a table's partitions if the current role doesn't have the required permissions. Perhaps we should skip adding the current relation to vacrels if vacuum_is_relation_owner() returns false, and then we could go through the partitions and check permissions on those as well. Since we don't take any locks on the individual partitions yet, getting the relation name and calling pg_class_ownercheck() safely might be tricky, though. Nathan
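To make the logging question above concrete, this is roughly the user-visible behaviour being discussed; the role and table names are invented and the message wording is approximate:

SET ROLE alice;                   -- alice is neither superuser nor owner of bobs_table
VACUUM (ANALYZE) bobs_table;
-- Today:   WARNING:  skipping "bobs_table" --- only table or database owner can vacuum it
-- Patched: the open question is whether a second
--          WARNING:  skipping "bobs_table" --- only table or database owner can analyze it
--          should also be emitted, or whether the vacuum message alone should cover both.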
Re: Temporary tables prevent autovacuum, leading to XID wraparound
On Mon, Aug 20, 2018 at 10:54:28AM -0300, Alvaro Herrera wrote: > Seems reasonable. Thanks Álvaro, pushed. -- Michael signature.asc Description: PGP signature
Re: Reopen logfile on SIGHUP
At Fri, 10 Aug 2018 15:33:26 +0300, Alexander Kuzmenkov wrote in <5142559a-82e6-b3e4-d6ed-8fd2d240c...@postgrespro.ru> > On 08/09/2018 10:33 AM, Kyotaro HORIGUCHI wrote: > > > > - Since I'm not sure unlink is signal-handler safe on all > >supported platforms, I moved unlink() call out of > >checkLogrotateSignal() to SysLoggerMain, just before rotating > >log file. > > Which platforms specifically do you have in mind? unlink() is required > to be async-signal-safe by POSIX.1-2001, which is required by UNIX 03, > and these are rather old. I suspect that something bad can happen on Windows. Another reason for the movement I didn't mention was that it is not necessary to be there. So I applied the principle that a signal handler should be as small and simple as possible. As a result, the flow of logrotate signal handling becomes similar to that for the promote signal. > For UNIX 03-certified distributions, see this list: > http://www.opengroup.org/csq/search/t=XY1.html > For FreeBSD, unlink() was signal-safe at least in 4.0, which was > released in 2000 > https://www.freebsd.org/cgi/man.cgi?query=sigaction&apropos=0&sektion=0&manpath=FreeBSD+4.0-RELEASE&arch=default&format=html > Debian 4.0, which was released in 2007 and had a 2.6 kernel, also > claims to have a signal-safe unlink(): > https://www.freebsd.org/cgi/man.cgi?query=signal&apropos=0&sektion=0&manpath=Debian+4.0.9&arch=default&format=html regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Improve behavior of concurrent ANALYZE/VACUUM
On Mon, Aug 20, 2018 at 08:57:00PM +, Bossart, Nathan wrote: > Sorry for the delay! I looked through the latest patch. Thanks a lot for the review! > I like the idea of emitting a separate WARNING for each for clarity on > what operations are being skipped. However, I think it could be a bit > confusing with the current wording. Perhaps something like "skipping > vacuum of..." and "skipping analyze of..." would make things clearer. > Another thing to keep in mind is how often only one of these messages > will apply. IIUC the vast majority of VACUUM (ANALYZE) statements > that need to emit such log statements would emit both. Plus, while > VACUUM (ANALYZE) is clearly documented as performing both operations, > I can easily see the argument that users may view it as one big > command and that emitting multiple log entries could be a confusing > change in behavior. > > In short, my vote would be to maintain the current behavior for now > and to bring up any logging improvements separately. On the other hand, it would be useful for the user to know exactly what is getting skipped. For example if VACUUM ANALYZE is used then both operations would happen, but now the user would only know that VACUUM has been skipped, and may miss the fact that ANALYZE was not attempted. Let's do as you suggest at the end, aka if both VACOPT_VACUUM and VACOPT_ANALYZE are passed down to vacuum_is_relation_owner, then only the log for VACUUM is generated, which is consistent. Any other changes could happen later on if necessary. > + /* > +* Depending on the permission checks done while building the list > +* of relations to work on, it could be possible that the list is > +* empty, hence do nothing in this case. > +*/ > + if (relations == NIL) > + return; > > It might be better to let the command go through normally so that we > don't miss any cleanup at the end (e.g. deleting vac_context). Right, that was a bad idea. Updating datfrozenxid can actually be a good thing. > +/* > + * Check if a given relation can be safely vacuumed or not. If the > + * user is not the relation owner, issue a WARNING log message and return > + * false to let the caller decide what to do with this relation. > + */ > +bool > +vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple, int options) > +{ > + Assert((options & (VACOPT_VACUUM | VACOPT_ANALYZE)) != 0); > + > + /* > +* Check permissions. > +* > +* We allow the user to vacuum a table if he is superuser, the table > +* owner, or the database owner (but in the latter case, only if it's > not > +* a shared relation). pg_class_ownercheck includes the superuser > case. > +* > +* Note we choose to treat permissions failure as a WARNING and keep > +* trying to vacuum the rest of the DB --- is this appropriate? > +*/ > > Do you think it's worth adding ANALYZE to these comments as well? Done. > + if (!(pg_class_ownercheck(relid, GetUserId()) || > + (pg_database_ownercheck(MyDatabaseId, GetUserId()) && > !reltuple->relisshared))) > > Returning true right away when the role does have permissions might be > a nice way to save a level of indentation. Done. > I think this actually changes the behavior for partitioned tables. > Presently, we still go through and collect all the partitions in the > vacrels list. With this change, we will skip collecting a table's > partitions if the current role doesn't have the required permissions. 
> Perhaps we should skip adding the current relation to vacrels if > vacuum_is_relation_owner() returns false, and then we could go through > the partitions and check permissions on those as well. Since we don't > take any locks on the individual partitions yet, getting the relation > name and calling pg_class_ownercheck() safely might be tricky, though. Yes, that's actually intentional on my side as this keeps the logic simpler, and we avoid risks around deadlocks when working on partitions. It seems to me that it is also more intuitive to *not* scan a full partition tree if the user does not have ownership of its root when the relation is listed, instead of trying to scan all the leaves to find perhaps some of them. In most data models it wouldn't matter much anyway, no? This is also more consistent with what is done for TRUNCATE where the ACLs of the parent are considered first. The documentation also actually mentions that: "To vacuum a table, one must ordinarily be the table's owner or a superuser." Perhaps we could make that clearer for partitions, with something like: "If listed explicitly, the vacuum of a partitioned table will include all its partitions if the user is the owner of the partitioned table." If we don't want to change the current behavior, then one simple solution would be close to what you mention, aka skip adding the partitioned table to the list, include *all* the partitions instead, and check the ownership of each partition individually.
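A small sketch of the scenario being debated; the role and table names are invented, and which of the two behaviours applies is exactly what is being decided here:

CREATE TABLE parted (i int) PARTITION BY RANGE (i);
CREATE TABLE parted_p1 PARTITION OF parted FOR VALUES FROM (0) TO (100);
ALTER TABLE parted OWNER TO alice;
ALTER TABLE parted_p1 OWNER TO bob;

SET ROLE alice;
VACUUM parted;
-- Behaviour A (the patch as written): alice owns the root, so the whole
-- partition tree, including bob's parted_p1, is processed.
-- Behaviour B (the alternative quoted above): each partition's ownership is
-- checked individually, so parted_p1 would be skipped with a WARNING.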
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera wrote: > I wonder if this all stems from a misunderstanding of what I suggested > to David offlist. My suggestion was that the catalog scans would > continue to use the catalog MVCC snapshot, and that the relcache entries > would contain all the partitions that appear to the catalog; but each > partition's entry would carry the Xid of the creating transaction in a > field (say xpart), and that field is compared to the regular transaction > snapshot: if xpart is visible to the transaction snapshot, then the > partition is visible, otherwise not. So you never try to access a > partition that doesn't exist, because those just don't appear at all in > the relcache entry. But if you have an old transaction running with an > old snapshot, and the partitioned table just acquired a new partition, > then whether the partition will be returned as part of the partition > descriptor or not depends on the visibility of its entry. Hmm. One question is where you're going to get the XID of the creating transaction. If it's taken from the pg_class row or the pg_inherits row or something of that sort, then you risk getting a bogus value if something updates that row other than what you expect -- and the consequences of that are pretty bad here; for this to work as you intend, you need an exactly-correct value, not newer or older. An alternative is to add an xid field that stores the value explicitly, and that might work, but you'll have to arrange for that value to be frozen at the appropriate time. A further problem is that there could be multiple changes in quick succession. Suppose that a partition is attached, then detached before the attach operation is all-visible, then reattached, perhaps with different partition bounds. > I think that works fine for ATTACH without any further changes. I'm not > so sure about DETACH, particularly when snapshots persist for a "long > time" (a repeatable-read transaction). ISTM that in the above design, > the partition descriptor would lose the entry for the detached partition > ahead of time, which means queries would silently fail to see their data > (though they wouldn't crash). I don't see why they wouldn't crash. If the partition descriptor gets rebuilt and some partitions disappear out from under you, the old partition descriptor is going to get freed, and the executor has a cached pointer to it, so it seems like you are in trouble. > I first thought this could be fixed by > waiting for those snapshots to finish, but then I realized that there's > no actual place where waiting achieves anything. Certainly it's not > useful to wait before commit (because other snapshots are going to be > starting all the time), and it's not useful to start after the commit > (because by then the catalog tuple is already gone). Maybe we need two > transactions: mark partition as removed with an xmax of sorts, commit, > wait for all snapshots, start transaction, remove partition catalog > tuple, commit. And what would that accomplish, exactly? Waiting for all snapshots would ensure that all still-running transactions see the fact the xmax with which the partition has been marked as removed, but what good does that do? In order to have a plausible algorithm, you have to describe both what the ATTACH/DETACH operation does and what the other concurrent transactions do and how those things interact. Otherwise, it's like saying that we're going to solve a problem with X and Y overlapping by having X take a lock. 
If Y doesn't take a conflicting lock, this does nothing. Generally, I think I see what you're aiming at: make ATTACH and DETACH have MVCC-like semantics with respect to concurrent transactions. I don't think that's a dumb idea from a theoretical perspective, but in practice I think it's going to be very difficult to implement. We have no other DDL that has such semantics, and there's no reason we couldn't; for example, TRUNCATE could work with SUEL and transactions that can't see the TRUNCATE as committed continue to operate on the old heap. While we could do such things, we don't. If you decide to do them here, you've probably got a lot of work ahead of you. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Problem while updating a foreign table pointing to a partitioned table on foreign server
Fujita-san, thank you for the comment. At Tue, 14 Aug 2018 20:49:02 +0900, Etsuro Fujita wrote in <5b72c1ae.8010...@lab.ntt.co.jp> > (2018/08/09 22:04), Etsuro Fujita wrote: > > (2018/08/08 17:30), Kyotaro HORIGUCHI wrote: > >> I choosed to expand tuple descriptor for junk column added to > >> foreign relaions. We might be better to have new member in > >> ForeignScan but I didn't so that we can backpatch it. > > > > I've not looked at the patch closely yet, but I'm not sure that it's a > > good idea to expand the tuple descriptor of the target relation on the > > fly so that it contains the remotetableoid as a non-system attribute > > of > > the target table. My concern is: is there not any risk in affecting > > some > > other part of the planner and/or the executor? (I was a bit surprised > > that the patch passes the regression tests successfully.) Yeah, me too. > I spent more time looking at the patch. ISTM that the patch well > suppresses the effect of the tuple-descriptor expansion by making > changes to code in the planner and executor (and ruleutils.c), but I'm > still not sure that the patch is the right direction to go in, because > ISTM that expanding the tuple descriptor on the fly might be a wart. The non-Var nodes seem to me the same as PARAM_EXEC, which works imperfectly for this purpose since tableoid must be in one-to-one correspondence with a tuple, but differently from joins the correspondence is stirred up by intermediate executor nodes in some cases. The expansion should be safe if the expanded descriptor has the same definitions for the base columns and all the extended columns are junk. The junk columns should be ignored by unrelated nodes, and they are passed safely as long as ForeignModify passes tuples as-is from the underlying ForeignScan to ForeignUpdate/Delete. > >> What the patch does are: > >> > >> - This abuses ForeignScan->fdw_scan_tlist to store the additional > >> junk columns when foreign simple relation scan (that is, not a > >> join). > > > > I think that this issue was introduced in 9.3, which added > > postgres_fdw > > in combination with support for writable foreign tables, but > > fdw_scan_tlist was added to 9.5 as part of join-pushdown > > infrastructure, > > so my concern is that we might not be able to backpatch your patch to > > 9.3 and 9.4. Right. So I'm thinking that the older versions should just raise an error for the failure case instead of trying to get it to work anyhow. Or we might be able to use tableoid in the tuple header without emitting the local oid to users, but I haven't found a way to do that. > Another concern about this: > > You wrote: > >Several places seems to be assuming that fdw_scan_tlist may be > >used foreign scan on simple relation but I didn't find that > >actually happens. > > Yeah, currently, postgres_fdw and file_fdw don't use that list for > simple foreign table scans, but it could be used to improve the > efficiency for those scans, as explained in fdwhandler.sgml: > > Another ForeignScan field that can be filled > by FDWs > is fdw_scan_tlist, which describes the > tuples returned by > the FDW for this plan node. For simple foreign table scans this can > be > set to NIL, implying that the returned tuples have > the > row type declared for the foreign table. A non-NIL > value must be a > target list (list of TargetEntrys) containing > Vars and/or > expressions representing the returned columns. This might be used, > for > example, to show that the FDW has omitted some columns that it noticed > won't be needed for the query.
Also, if the FDW can compute > expressions > used by the query more cheaply than can be done locally, it could add > those expressions to fdw_scan_tlist. Note > that join > plans (created from paths made by > GetForeignJoinPaths) must > always supply fdw_scan_tlist to describe > the set of > columns they will return. https://www.postgresql.org/docs/devel/static/fdw-planning.html Hmm. Thanks for the pointer, it seems to need a rewrite. However, it doesn't seem to work for non-join foreign scans, since the core ignores it and uses the local table definition. This "tweak" wouldn't be needed if that worked. > You wrote: > > I'm not sure whether the following ponits are valid. > > > > - If fdw_scan_tlist is used for simple relation scans, this would > >break the case. (ExecInitForeignScan, set_foreignscan_references) > > Some FDWs might already use that list for the improved efficiency for > simple foreign table scans as explained above, so we should avoid > breaking that. I considered using fdw_scan_tlist in that way, but the core is assuming that foreign scans with scanrelid > 0 use the relation descriptor. Do you have any example for that? > If we take the Param-based approach suggested by Tom, I suspect there > would be no need to worry about at least those things, so I'll try to > update your patch
Re: Reopen logfile on SIGHUP
On Tue, Aug 21, 2018 at 09:26:54AM +0900, Kyotaro HORIGUCHI wrote: > I suspect that something bad can happen on Windows. [troll mode] More and even worse things than that could happen on Windows. [/troll mode] -- Michael signature.asc Description: PGP signature
Re: Slotification of partition tuple conversion
On 2018/08/20 20:15, Amit Khandekar wrote: > On 17 August 2018 at 21:47, Amit Khandekar wrote: > >> Attached is a patch tup_convert.patch that does the conversion >> directly using input and output tuple slots. This patch is to be >> applied on an essential patch posted by Amit Langote in [1] that >> arranges for dedicated per-partition slots rather than changing >> tupdesc of a single slot. I have rebased that patch from [1] and >> attached it here ( >> Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch). > > Amit Langote has posted a new version of the > Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch at > [1] . Attached is version v2 of the tup_convert.patch, that is rebased > over that patch. Thanks. Here are some comments on the patch: +ConvertTupleSlot Might be a good idea to call this ConvertSlotTuple? +/* + * Get the tuple in the per-tuple context. Also, we cannot call + * ExecMaterializeSlot(), otherwise the tuple will get freed + * while storing the next tuple. + */ +oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate)); +tuple = ExecCopySlotTuple(slot); +MemoryContextSwitchTo(oldcontext); The comment about why we cannot call ExecMaterializeSlot is a bit unclear. Maybe, the comment doesn't need to mention that? Instead, expand a bit more on why the context switch here or how it interacts with recently tuple buffering (0d5f05cde01)? Seeing that all the sites in the partitioning code that previously called do_convert_tuple() now call ConvertTupleSlot, I wonder if we don't need a full TupleConversionMap, just the attribute map in it. We don't need the input/output Datum and bool arrays in it, because we'd be using the ones from input and output slots of ConvertTupleSlot. So, can we replace all instances of TupleConversionMap in the partitioning code and data structures by AttributeNumber arrays. Also, how about putting ConvertTupleSlot in execPartition.c and exporting it in execPartition.h, instead of tupconvert.c and tupconvert.h, resp.? Thanks, Amit
Re: NLS handling fixes.
At Mon, 20 Aug 2018 13:23:38 +0900, Michael Paquier wrote in <20180820042338.gh7...@paquier.xyz> > On Fri, Aug 10, 2018 at 04:50:55PM -0400, Tom Lane wrote: > > I would certainly *not* back-patch the GetConfigOptionByNum change, > > as that will be a user-visible behavioral change that people will > > not be expecting. We might get away with back-patching the other changes, > > but SHOW ALL output seems like something that people might be expecting > > not to change drastically in a minor release. > > Agreed, some folks may rely on that. > > > * In the patch fixing view_query_is_auto_updatable misuse, nothing's > > been done to remove the underlying cause of the errors, which IMO > > is that the header comment for view_query_is_auto_updatable fails to > > specify the return convention. I'd consider adding a comment along > > the lines of > > > > * view_query_is_auto_updatable - test whether the specified view definition > > * represents an auto-updatable view. Returns NULL (if the view can be > > updated) > > * or a message string giving the reason that it cannot be. > > +* > > +* The returned string has not been translated; if it is shown as an error > > +* message, the caller should apply _() to translate it. > > That sounds right. A similar comment should be added for > view_cols_are_auto_updatable and view_col_is_auto_updatable. > > > * Perhaps pg_GSS_error likewise could use a comment saying the error > > string must be translated already. Or we could leave its callers alone > > and put the _() into it, but either way the API contract ought to be > > written down. > > Both things are indeed possible. As pg_SSPI_error applies translation > beforehand, I have taken the approach to let the caller just apply _(). > For two messages that does not matter much one way or another. > > > * The proposed patch for psql/common.c seems completely wrong, or at > > least it has a lot of side-effects other than enabling header translation, > > and I doubt they are desirable. > > I doubted about it, and at closer look I think that you are right, so +1 > for discarding this one. > > Attached is a patch which should address all the issues reported and all > the remarks done. What do you think? Agreed on all of the above, but if we don't need translation in the header line of \gdesc, gettext_noop for the strings is useless. A period is missing in the comment for view_col_is_auto_updatable. The attached is the fixed version. 
-- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7cedc28c6b..2a12d64aeb 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -10826,7 +10826,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation, ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("WITH CHECK OPTION is supported only on automatically updatable views"), - errhint("%s", view_updatable_error))); + errhint("%s", _(view_updatable_error; } } diff --git a/src/backend/commands/view.c b/src/backend/commands/view.c index 7d4511c585..ffb71c0ea7 100644 --- a/src/backend/commands/view.c +++ b/src/backend/commands/view.c @@ -502,7 +502,7 @@ DefineView(ViewStmt *stmt, const char *queryString, ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("WITH CHECK OPTION is supported only on automatically updatable views"), - errhint("%s", view_updatable_error))); + errhint("%s", _(view_updatable_error; } /* diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c index cecd104b4a..18c4921d61 100644 --- a/src/backend/libpq/auth.c +++ b/src/backend/libpq/auth.c @@ -1037,6 +1037,10 @@ static GSS_DLLIMP gss_OID GSS_C_NT_USER_NAME = &GSS_C_NT_USER_NAME_desc; #endif +/* + * Generate an error for GSSAPI authentication. The caller needs to apply + * _() to errmsg to make it translatable. + */ static void pg_GSS_error(int severity, const char *errmsg, OM_uint32 maj_stat, OM_uint32 min_stat) { @@ -1227,7 +1231,7 @@ pg_GSS_recvauth(Port *port) { gss_delete_sec_context(&lmin_s, &port->gss->ctx, GSS_C_NO_BUFFER); pg_GSS_error(ERROR, - gettext_noop("accepting GSS security context failed"), + _("accepting GSS security context failed"), maj_stat, min_stat); } @@ -1253,7 +1257,7 @@ pg_GSS_recvauth(Port *port) maj_stat = gss_display_name(&min_stat, port->gss->name, &gbuf, NULL); if (maj_stat != GSS_S_COMPLETE) pg_GSS_error(ERROR, - gettext_noop("retrieving GSS user name failed"), + _("retrieving GSS user name failed"), maj_stat, min_stat); /* @@ -1317,6 +1321,11 @@ pg_GSS_recvauth(Port *port) * */ #ifdef ENABLE_SSPI + +/* + * Generate an error for SSPI authentication. The caller needs to apply + * _() to errmsg to make it translatable. + */ static void
A typo in guc.c
Hello, I found a typo in guc.c. > {"enable_parallel_hash", PGC_USERSET, QUERY_TUNING_METHOD, > - gettext_noop("Enables the planner's user of parallel hash plans."), > + gettext_noop("Enables the planner's use of parallel hash plans."), The "user" should be "use". As you (who?) know, this applies only to 11 and dev. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c index 9989d3a351..2a874dc78d 100644 --- a/src/backend/utils/misc/guc.c +++ b/src/backend/utils/misc/guc.c @@ -962,7 +962,7 @@ static struct config_bool ConfigureNamesBool[] = }, { {"enable_parallel_hash", PGC_USERSET, QUERY_TUNING_METHOD, - gettext_noop("Enables the planner's user of parallel hash plans."), + gettext_noop("Enables the planner's use of parallel hash plans."), NULL }, &enable_parallel_hash,
Re: [HACKERS] Optional message to user when terminating/cancelling backend
On Wed, Jul 25, 2018 at 7:27 PM, Daniel Gustafsson wrote: >> On 24 Jul 2018, at 22:57, Daniel Gustafsson wrote: >> >>> On 6 Jul 2018, at 02:18, Thomas Munro wrote: >>> >>> On Fri, Jun 15, 2018 at 9:56 AM, Daniel Gustafsson wrote: attached >>> >>> Hi Daniel, >>> >>> 6118 --select pg_cancel_backend(pg_backend_pid(), 'it brings on many >>> changes'); >>> 6119 select pg_cancel_backend(pg_backend_pid(), NULL); >>> 6120! ERROR: canceling statement due to user request >>> 6121--- 25,32 >>> 6122 >>> 6123 --select pg_cancel_backend(pg_backend_pid(), 'it brings on many >>> changes'); >>> 6124 select pg_cancel_backend(pg_backend_pid(), NULL); >>> 6125! pg_cancel_backend >>> 6126! --- >>> 6127! t >>> >>> Apparently Windows can take or leave it as it pleases. >> >> Well played =) >> >>> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.4488 >> >> That reads to me like it’s cancelling another backend than the current one, >> which clearly isn’t right as we’re passing pg_backend_pid(). I can’t really >> see what Windows specific bug was introduced by this patch though (or well, >> the >> bug exhibits itself on Windows but it may well be generic of course). >> >> Will continue to hunt. > > Seems the build of the updated patch built and tested Ok. Still have no idea > why the previous one didn’t. That problem apparently didn't go away. cfbot tested it 7 times in the past week, and it passed only once on Windows: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.9691 The other times all failed like this: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.9833 -- Thomas Munro http://www.enterprisedb.com
Re: A typo in guc.c
On Tue, Aug 21, 2018 at 11:58:41AM +0900, Kyotaro HORIGUCHI wrote: > The "user" should be "use". > > As you(who?) know, this applies only 11 and dev. Thanks, applied. -- Michael signature.asc Description: PGP signature
Re: Fix help option of contrib/oid2name
On 2018/08/20 17:38, Michael Paquier wrote: On Mon, Aug 20, 2018 at 03:51:07PM +0900, Tatsuro Yamada wrote: On 2018/08/20 13:54, Michael Paquier wrote: Therefore, "-P" is a manual bag. I investigated more using git log command and understood followings: 1. -P option was removed on 4192f2d85 2. -P option revived in only the manual on 2963d8228 Bruce did that to keep a trace of it in the docs, let's nuke it then, we don't handle it and the documentation is mentioning the thing as deprecated since 2010. Yep, right. I see, and will remove the -P option's explanation from the manual. For now, I'm not sure about TAP test but I knew both utilities have no regression tests. It would be better to use something tests. If you don't add them, I think that I would just that myself, just to have some basics. Not that it is a barrier for this patch anyway. Hmm... As long as I've come this far, I'll see it through. BTW, can I add the patch to the Commitfest September? The patch includes improvements and a bug fix, as you know, so I can divide the patch into 2 patches for that. Which one is better to create, 2 patches or 1 patch? I summed up the fixes of oid2name and vacuumlo so far; the next patch will include the following items: oid2name bug fix - Remove "-P password" option from the document improvements - Divide "Options section" into "Options" and "Connection Options" sections in the help message (--help) - Add long options only for Connection options (-d, -H, -h, -p and -U) - Add "-H host" and "-h host" options to the help message and the document - Add environment variable (PGHOST, PGPORT and PGUSER) to the document - Add TAP tests for checking command-line options vacuumlo improvements - Add long options only for Connection options (-h, -p and -U) - Add environment variable (PGHOST, PGPORT and PGUSER) to the document - Add TAP tests for checking command-line options Thanks, Tatsuro Yamada
Re: Fix help option of contrib/oid2name
On Tue, Aug 21, 2018 at 12:26:15PM +0900, Tatsuro Yamada wrote: > BTW, can I add the patch to the Commitfest September? You should. > The patch includes improvements and bug fix as you know, So, I can divide the > patch into 2 patches for that. I am not really seeing any bug fix, but if you think so then splitting into two patches would help in making the difference for sure. -- Michael signature.asc Description: PGP signature
Re: Fix help option of contrib/oid2name
On 2018/08/21 12:40, Michael Paquier wrote: On Tue, Aug 21, 2018 at 12:26:15PM +0900, Tatsuro Yamada wrote: BTW, can I add the patch to the Commitfest September? You should. Thanks, I'll do that. The patch includes improvements and a bug fix, as you know, so I can divide the patch into 2 patches for that. I am not really seeing any bug fix, but if you think so then splitting into two patches would help in making the difference for sure. Yep, I meant the bug fix is the following: oid2name - Remove "-P password" option from the document As I wrote before, the "-P" option does not work because there is no code to handle that option. So, it will be removed in the next patch. I'll send 2 patches this week, probably. Regards, Tatsuro Yamada
Re: Speeding up INSERTs and UPDATEs to partitioned tables
On 3 August 2018 at 17:58, Amit Langote wrote: > On 2018/07/31 16:03, David Rowley wrote: >> Maybe we can do that as a follow-on patch. > > We probably could, but I think it would be a good idea get rid of *all* > redundant allocations due to tuple routing in one patch, if that's the > mission of this thread and the patch anyway. I started looking at this patch today and I now agree that it should be included in the main patch. I changed a few things with the patch. For example, the map access macros you'd defined were not in CamelCase. I also fixed a bug where the child to parent map was not being initialised when on conflict transition capture was required. I added a test which was crashing the backend but fixed the code so it works correctly. I also got rid of the child_parent_map_not_required array since we now no longer need it. The code now always initialises the maps in cases where they're going to be required. I've attached a v3 version of your patch and also v6 of the main patch which includes the v3 patch. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services v3_Refactor-handling-of-child_parent_tupconv_maps.patch Description: Binary data v6-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch Description: Binary data
Re: A really subtle lexer bug
> "Andrew" == Andrew Gierth writes: Andrew> Patch attached. Andrew> This fixes two bugs: I'm going to split this into two patches, since the O(N^3) fix can be backpatched further than the operator token fix; the latter bug was introduced in 9.5 when operator precedences were corrected, while the former has been there forever. (But don't wait for me to post those before commenting on either issue) -- Andrew (irc:RhodiumToad)
Re: ATTACH/DETACH PARTITION CONCURRENTLY
On 21 August 2018 at 13:59, Robert Haas wrote: > On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera > wrote: >> I wonder if this all stems from a misunderstanding of what I suggested >> to David offlist. My suggestion was that the catalog scans would >> continue to use the catalog MVCC snapshot, and that the relcache entries >> would contain all the partitions that appear to the catalog; but each >> partition's entry would carry the Xid of the creating transaction in a >> field (say xpart), and that field is compared to the regular transaction >> snapshot: if xpart is visible to the transaction snapshot, then the >> partition is visible, otherwise not. So you never try to access a >> partition that doesn't exist, because those just don't appear at all in >> the relcache entry. But if you have an old transaction running with an >> old snapshot, and the partitioned table just acquired a new partition, >> then whether the partition will be returned as part of the partition >> descriptor or not depends on the visibility of its entry. > > Hmm. One question is where you're going to get the XID of the > creating transaction. If it's taken from the pg_class row or the > pg_inherits row or something of that sort, then you risk getting a > bogus value if something updates that row other than what you expect > -- and the consequences of that are pretty bad here; for this to work > as you intend, you need an exactly-correct value, not newer or older. > An alternative is to add an xid field that stores the value > explicitly, and that might work, but you'll have to arrange for that > value to be frozen at the appropriate time. > > A further problem is that there could be multiple changes in quick > succession. Suppose that a partition is attached, then detached > before the attach operation is all-visible, then reattached, perhaps > with different partition bounds. I should probably post the WIP I have here. In those, I do have the xmin array in the PartitionDesc. This gets taken from the new pg_partition table, which I don't think suffers from the same issue as taking it from pg_class, since nothing else will update the pg_partition record. However, I don't think the xmin array is going to work if we include it in the PartitionDesc. The problem, as I discovered from writing the code, is that the PartitionDesc must remain exactly the same between planning and execution. If there are any more or any fewer partitions found during execution than what we saw in planning then run-time pruning will access the wrong element in the PartitionPruneInfo array, or perhaps access off the end of the array. It might be possible to work around that by identifying partitions by Oid rather than PartitionDesc array index, but the run-time pruning code is already pretty complex. I think coding it to work when the PartitionDesc does not match between planning and execution is just going to be too difficult to get right. Tom is already unhappy with the complexity of ExecFindInitialMatchingSubPlans(). I think the solution will require that the PartitionDesc does not: a) Change between planning and execution. b) Change during a snapshot after the partitioned table has been locked. With b, it sounds like we'll need to take the most recent PartitionDesc even if the transaction is older than the one that did the ATTACH/DETACH operation, since if we use an old version then, as Robert mentions, there's nothing to stop another transaction making changes to the table that make it an incompatible partition, e.g. DROP COLUMN.
This wouldn't be possible if we update the PartitionDesc right after taking the first lock on the partitioned table since any transactions doing DROP COLUMN would be blocked until the other snapshot gets released. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
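To make the hazard concrete, a sketch of the interleaving being described; the object and column names are invented, and the second session is hypothetically allowed to proceed without an exclusive lock, which is the concurrency being designed here:

-- Transaction 1 (old snapshot, plan already built against a PartitionDesc
-- that still contains parted_old)
BEGIN;
SELECT * FROM parted WHERE ts < now();

-- Transaction 2, meanwhile
ALTER TABLE parted DETACH PARTITION parted_old;
ALTER TABLE parted_old DROP COLUMN payload;   -- now structurally incompatible

-- If transaction 1 executed against its stale PartitionDesc it could try to
-- scan parted_old with a tuple descriptor that no longer matches.  Taking the
-- newest PartitionDesc right after locking the parent avoids that, because
-- the DROP COLUMN blocks on transaction 1's lock on parted_old.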
Re: NLS handling fixes.
On Tue, Aug 21, 2018 at 11:49:05AM +0900, Kyotaro HORIGUCHI wrote: > Agreed on all of the above, but if we don't need translation in > the header line of \gdesc, gettext_noop for the strings is > useless. I have left that one alone, as all columns showing up in psql have the same, consistent way of handling things. > A period is missing in the comment for view_col_is_auto_updatable. Fixed this one, and pushed. All things are back-patched where they apply, except for the GetConfigOptionByNum part, which goes only on HEAD as it could be surprising after a minor upgrade, as Tom has suggested. -- Michael signature.asc Description: PGP signature
Re: Two proposed modifications to the PostgreSQL FDW
On Tue, Aug 21, 2018 at 1:47 AM Chris Travers wrote: > > > > On Mon, Aug 20, 2018 at 4:41 PM Andres Freund wrote: >> >> Hi, >> >> On 2018-08-20 16:28:01 +0200, Chris Travers wrote: >> > 2. TWOPHASECOMMIT=[off|on] option >> >> > The second major issue that I see with PostgreSQL's foreign database >> > wrappers is the fact that there is no two phase commit which means that a >> > single transaction writing to a group of tables has no expectation that all >> > backends will commit or rollback together. With this patch an option would >> > be applied to foreign tables such that they could be set to use two phase >> > commit When this is done, the first write to each backend would register a >> > connection with a global transaction handler and a pre-commit and commit >> > hooks would be set up to properly process these. >> > >> > On recommit a per-global-transaction file would be opened in the data >> > directory and prepare statements logged to the file. On error, we simply >> > roll back our local transaction. >> > >> > On commit hook , we go through and start to commit the remote global >> > transactions. At this point we make a best effort but track whether or not >> > we were successfully on all. If successful on all, we delete the file. If >> > unsuccessful we fire a background worker which re-reads the file and is >> > responsible for cleanup. If global transactions persist, a SQL >> > administration function will be made available to restart the cleanup >> > process. On rollback, we do like commit but we roll back all transactions >> > in the set. The file has enough information to determine whether we should >> > be committing or rolling back on cleanup. >> > >> > I would like to push these both for Pg 12. Is there any feedback on the >> > concepts and the problems first >> Thank you for the proposal. I agree that it's a major problem that postgres_fdw (or PostgreSQL core API) doesn't support two-phase commit. >> There's been *substantial* work on this. You should at least read the >> discussion & coordinate with the relevant developers. > > > I suppose I should forward this to them directly also. > > Yeah. Also the transaction manager code for this I wrote while helping with > a proof of concept for this copy-to-remote extension. > > There are a few big differences in implementation with the patches you > mention and the disagreement was part of why I thought about going this > direction. > > First, discussion of differences in implementation: > > 1. I treat the local and remote transactions symmetrically and I make no > assumptions about what might happen between prepare and an attempted local > commit. >prepare goes into the precommit hook >commit goes into the commit hook and we never raise errors if it fails > (because you cannot rollback at that point). Instead a warning is raised and > cleanup commences. >rollback goes into the rollback hook and we never raise errors if it fails > (because you are already rolling back). > > 2. By treating this as a property of a table rather than a property of a > foreign data wrapper or a server, we can better prevent prepared transactions > where they have not been enabled. >This also ensures that we know whether we are guaranteeing two phase > commit or not by looking at the table. > > 3. By making this opt-in it avoids a lot of problems with regards to > incorrect configuration etc since if the DBA says "use two phase commit" and > failed to enable prepared transactions on the other side... > > On to failure modes: > 1. 
Its possible that under high load too many foreign transactions are > prepared and things start rolling back instead of committing. Oh well > 2. In the event that a foreign server goes away between prepare and commit, > we continue to retry via the background worker. The background worker is > very pessimistic and checks every remote system for the named transaction. If some participant servers fail during COMMIT PREPARED, will the client get a "committed" or an "aborted"? If the client gets "aborted", that's not correct because the local changes are already committed at that point. On the other hand, if the client gets "committed" it might break the current user semantics because the subsequent reads may not be able to see their own committed writes. Also, since we don't want to wait for COMMIT PREPARED to complete, we need to consider that users could cancel the query anytime. To not break the current semantics we cannot raise an error during the 2nd phase of two-phase commit, but that's not realistic because even palloc() can raise an error. The design the patch chose is making backends do only PREPARE and wait for the background worker to complete COMMIT PREPARED. In this design the client gets a "committed" only when the commit succeeds on all participants or when it cancels the query explicitly. In other words, the client will wait for completion of the 2nd phase of two-phase commit.
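On the recovery side of that design, the orphaned remote transactions that the background worker (or an administrator) has to resolve are visible through the standard catalog view; a minimal sketch, with the gid naming convention invented:

-- On a participant server, list prepared transactions left behind by a
-- coordinator that crashed or lost connectivity between the two phases:
SELECT gid, prepared, owner, database
FROM pg_prepared_xacts;

-- Resolve one of them according to the coordinator's log of record:
COMMIT PREPARED 'fdw_gtx_1234_node2';
-- or
ROLLBACK PREPARED 'fdw_gtx_1234_node2';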
Re: Pluggable Storage - Andres's take
On Sun, Aug 5, 2018 at 7:48 PM Andres Freund wrote: > Hi, > > I'm currently in the process of rebasing zheap onto the pluggable > storage work. The goal, which seems to work surprisingly well, is to > find issues that the current pluggable storage patch doesn't yet deal > with. I plan to push a tree including a lot of fixes and improvements > soon. > Sorry for coming late to this thread. That's good. Did you find any problems in porting zheap into pluggable storage? Does it need any API changes or new API requirements? > On 2018-08-03 12:35:50 +1000, Haribabu Kommi wrote: > > while investing the crash, I observed that it is due to the lot of > FIXME's > > in > > the code. So I just fixed minimal changes and looking into correcting > > the FIXME's first. > > > > One thing I observed is lack relation pointer is leading to crash in the > > flow of EvalPlan* functions, because all ROW_MARK types doesn't > > contains relation pointer. > > > > will continue to check all FIXME fixes. > > Thanks. > I fixed some of the Isolation test problems. All the issues are related to EPQ slot handling. Still more needs to be fixed. Do the new TupleTableSlot abstraction patches in the recent thread [1] fix any of these issues? If so, I can look into changing the FDW API to return a slot instead of a tuple. > > > - COPY's multi_insert path should probably deal with a bunch of slots, > > > rather than forming HeapTuples > > > > > > > Implemented supporting of slots in the copy multi insert path. > > Cool. I've not yet looked at it, but I plan to do so soon. Will have to > rebase over the other copy changes first :( > OK. Understood. There are many changes in the COPY flow that conflict with my changes. Please let me know once you have done the rebase, and I can fix those conflicts and regenerate the patch. Attached is the patch with further fixes. [1] - https://www.postgresql.org/message-id/CAFjFpRcNPQ1oOL41-HQYaEF%3DNq6Vbg0eHeFgopJhHw_X2usA5w%40mail.gmail.com Regards, Haribabu Kommi Fujitsu Australia 0001-isolation-test-fixes-2.patch Description: Binary data