A really subtle lexer bug

2018-08-20 Thread Andrew Gierth
Currently in PG, the precedence of = and <> is supposed to be equal, and
the precedence of unary - is very high.

So, (true=1 between 1 and 1) parses as ((true)=(1 between 1 and 1)),
and (true=-1 between 1 and 1) parses as ((true)=((-1) between 1 and 1)).
All good so far.

(true<>-1 between 1 and 1) parses as ((true<>(-1)) between 1 and 1). ???

The fault here is in the lexer. The input "<>-" is being lexed as an Op
followed by '-' rather than as NOT_EQUAL followed by '-' because it
looks like a match for a multi-character operator, with the - being
thrown back into the input afterwards. So the precedence of <> gets
inflated to that of Op, which is higher than BETWEEN.

More seriously, this also breaks named arguments:

create function f(a integer) returns integer language sql
 as $$ select a; $$;

select f(a => 1);  -- works
select f(a => -1);  -- works
select f(a =>-1);  -- ERROR:  column "a" does not exist

I guess the fix is to extend the existing special case code that checks
for one character left after removing trailing [+-] and also check for
the two-character ops "<>" ">=" "<=" "=>" "!=". 

-- 
Andrew (irc:RhodiumToad)



Calculate total_table_pages after set_base_rel_sizes()

2018-08-20 Thread David Rowley
I believe that we should be delaying the PlannerInfo's
total_table_pages calculation until after constraint exclusion and
partition pruning have taken place. Doing this calculation before we
determine which relations we don't need to scan can lead to
incorrectly applying random_page_cost to too many pages processed
during an Index Scan.

We already exclude relations removed by join removal from this
calculation, so continuing to count pruned partitions looks like an
oversight.

The attached patch moves the calculation to after set_base_rel_sizes()
is called and before set_base_rel_pathlists() is called, where the
information is actually used.
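
In rough terms the ordering inside make_one_rel() becomes something like the
sketch below (schematic only; the real patch has to decide exactly which
RelOptInfos to count):

    int         rti;

    set_base_rel_sizes(root);   /* constraint exclusion and partition
                                 * pruning are applied in here */

    /* Only now add up the pages of the rels that actually survived. */
    root->total_table_pages = 0;
    for (rti = 1; rti < root->simple_rel_array_size; rti++)
    {
        RelOptInfo *brel = root->simple_rel_array[rti];

        if (brel == NULL || IS_DUMMY_REL(brel))
            continue;           /* pruned or excluded: contributes nothing */
        root->total_table_pages += (double) brel->pages;
    }

    set_base_rel_pathlists(root);   /* index costing then uses the total */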

I am considering this a bug fix, but I'm proposing this for PG12 only
as I don't think destabilising plans in the back branches is a good
idea. I'll add this to the September commitfest.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v1-0001-Calculate-total_table_pages-after-set_base_rel_si.patch
Description: Binary data


Re: Fix help option of contrib/oid2name

2018-08-20 Thread Michael Paquier
On Mon, Aug 20, 2018 at 03:51:07PM +0900, Tatsuro Yamada wrote:
> On 2018/08/20 13:54, Michael Paquier wrote:
> Therefore, "-P" is a manual bag. I investigated more using git log command and
> understood followings:
> 
>   1. -P option was removed on 4192f2d85
>   2. -P option revived in only the manual on 2963d8228

Bruce did that to keep a trace of it in the docs.  Let's nuke it then; we
don't handle it, and the documentation has been marking it as deprecated
since 2010.

>> oid2name supports also PGDATABASE.
> 
> As far as I know, this environment variable is also unused, because
> I could not get the results I expected, so I didn't add it to the manual.
> For example, I expected the following command to report on "postgres", but
> it didn't.

Yep, you're right.

> For now, I'm not sure about a TAP test, but I know both utilities have no
> regression tests. It would be better to have some tests.

If you don't add them, I think that I would just do that myself, just to
have some basics.  Not that it is a barrier for this patch anyway.
--
Michael




Re: partitioning - changing a slot's descriptor is expensive

2018-08-20 Thread Amit Langote
Thanks for the review.

On 2018/08/17 15:00, Amit Khandekar wrote:
> Thanks for the patch. I did some review of the patch (needs a rebase).
> Below are my comments.
> 
> @@ -1936,12 +1936,11 @@ ExecPartitionCheckEmitError(ResultRelInfo
> *resultRelInfo,
> +   /* Input slot might be of a partition, which has a fixed tupdesc. */
> +   slot = MakeTupleTableSlot(tupdesc);
> if (map != NULL)
> -   {
>   tuple = do_convert_tuple(tuple, map);
> - ExecSetSlotDescriptor(slot, tupdesc);
> - ExecStoreTuple(tuple, slot, InvalidBuffer, false);
> -   }
> +   ExecStoreTuple(tuple, slot, InvalidBuffer, false);
> 
> Both MakeTupleTableSlot() and ExecStoreTuple() can be inside the (map
> != NULL) if condition.
> This also applies for similar changes in ExecConstraints() and
> ExecWithCheckOptions().

Ah, okay.  I guess that means we'll allocate a new slot here only if we
had to switch to a partition-specific slot in the first place.

> + * Initialize an empty slot that will be used to manipulate tuples of any
> + * this partition's rowtype.
> of any this => of this
> 
> + * Initialized the array where these slots are stored, if not already
> Initialized => Initialize

Fixed.

> + proute->partition_tuple_slots_alloced =
> + lappend(proute->partition_tuple_slots_alloced,
> + proute->partition_tuple_slots[partidx]);
> 
> Instead of doing the above, I think we can collect those slots in
> estate->es_tupleTable using ExecInitExtraTupleSlot() so that they
> don't have to be explicitly dropped in ExecCleanupTupleRouting(). And
> also then we won't require the new field
> proute->partition_tuple_slots_alloced.

Although I was slightly uncomfortable with the idea at first, thinking that
it's not good for tuple-routing-specific resources to be released by
generic executor code, it doesn't seem too bad to do it the way you suggest.
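
Roughly like this, I suppose -- a sketch only, assuming the two-argument
ExecInitExtraTupleSlot() and that the partition's Relation (called partrel
here) is at hand:

    /*
     * Lazily create a dedicated slot for this partition.  Registering it
     * in estate->es_tupleTable means ExecResetTupleTable() releases it
     * with everything else, so no partition_tuple_slots_alloced list is
     * needed.
     */
    if (proute->partition_tuple_slots[partidx] == NULL)
        proute->partition_tuple_slots[partidx] =
            ExecInitExtraTupleSlot(estate, RelationGetDescr(partrel));
    slot = proute->partition_tuple_slots[partidx];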

Attached updated patch.  By the way, I think it might be a good idea to
try to merge this patch with the effort in the following thread.

* Reduce partition tuple routing overheads *
https://commitfest.postgresql.org/19/1690/

Thanks,
Amit
From a543b15f83f5e1adaef2a46de59022b0b21711e4 Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 28 Jun 2018 15:47:47 +0900
Subject: [PATCH v2] Allocate dedicated slots of partition tuple conversion

Currently there is just one slot called partition_tuple_slot and
we do ExecSetSlotDescriptor every time a tuple is inserted into a
partition that requires tuple conversion.  That's excessively
inefficient during bulk inserts.

Fix that by allocating a dedicated slot for each partition that
requires it.
---
 src/backend/commands/copy.c| 17 +
 src/backend/executor/execMain.c| 24 +++---
 src/backend/executor/execPartition.c   | 45 +++---
 src/backend/executor/nodeModifyTable.c | 27 
 src/include/executor/execPartition.h   | 13 ++
 5 files changed, 88 insertions(+), 38 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..f61db3e173 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2690,6 +2690,7 @@ CopyFrom(CopyState cstate)
if (proute)
{
int leaf_part_index;
+   TupleConversionMap *map;
 
/*
 * Away we go ... If we end up not finding a partition 
after all,
@@ -2864,11 +2865,17 @@ CopyFrom(CopyState cstate)
 * partition rowtype.  Don't free the already stored 
tuple as it
 * may still be required for a multi-insert batch.
 */
-   tuple = 
ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-   
  tuple,
-   
  proute->partition_tuple_slot,
-   
  &slot,
-   
  false);
+   map = 
proute->parent_child_tupconv_maps[leaf_part_index];
+   if (map != NULL)
+   {
+   TupleTableSlot *new_slot;
+
+   Assert(proute->partition_tuple_slots != NULL &&
+  
proute->partition_tuple_slots[leaf_part_index] != NULL);
+   new_slot = 
proute->partition_tuple_slots[leaf_part_index];
+   tuple = ConvertPartitionTupleSlot(map, tuple, 
new_slot, &slot,
+   
  false);
+   }
 
tuple->t_tableO

Re: [FEATURE PATCH] pg_stat_statements with plans (v02)

2018-08-20 Thread Andres Freund
Hi,

On 2018-08-20 15:21:07 +1200, Thomas Munro wrote:
> On Mon, Aug 20, 2018 at 2:55 PM, Tom Lane  wrote:
> > Thomas Munro  writes:
> > ... but I'm less excited about this one.  Seems like a great opportunity
> > for unexpected stack overflows, and thence at least the chance for
> > DOS-causing security attacks.  Can we prevent that from being allowed,
> > if we start using -std=c99?
> 
> -Werror=vla in GCC, apparently.

How about detecting support for that in our configure script and
automatically using it?  If we're uncomfortable with raising an error,
let's at least raise a warning?
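
For context, the construct being banned is a C99 variable-length array,
e.g. (function made up):

    #include <string.h>

    static void
    fill_buffer(int n)
    {
        int     buf[n];     /* VLA: stack usage depends on the runtime n */

        /*
         * A hostile or unexpectedly large n can exhaust the stack before
         * we ever get here -- the DoS concern above.  -Werror=vla makes
         * the declaration itself a compile-time error.
         */
        memset(buf, 0, sizeof(buf));
    }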

- Andres



Re: remove ATTRIBUTE_FIXED_PART_SIZE

2018-08-20 Thread Peter Eisentraut
On 18/08/2018 23:05, Tom Lane wrote:
> Possibly we need to be more careful than we are now about whether
> there's padding at the end of the fixed-size fields ... but just
> ripping out the code that attempts to deal with that is hardly
> an improvement.

I don't think the tuple packing issue has to do with the tuple
descriptor code.  The tuple descriptors already use allocations of size
sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy
and memset calls use (potentially) less.  That might have saved a few
bytes for omitting the varlena fields, but I don't think it affects
alignment correctness.  If we, say, added a trailing char field now, the
only thing this code

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: remove ATTRIBUTE_FIXED_PART_SIZE

2018-08-20 Thread Peter Eisentraut
On 20/08/2018 12:32, Peter Eisentraut wrote:
> On 18/08/2018 23:05, Tom Lane wrote:
>> Possibly we need to be more careful than we are now about whether
>> there's padding at the end of the fixed-size fields ... but just
>> ripping out the code that attempts to deal with that is hardly
>> an improvement.
> 
> I don't think the tuple packing issue has to do with the tuple
> descriptor code.  The tuple descriptors already use allocations of size
> sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy
> and memset calls use (potentially) less.  That might have saved a few
> bytes for omitting the varlena fields, but I don't think it affects
> alignment correctness.  If we, say, added a trailing char field now, the
> only thing this code

[oops]

... the only thing the current code would accomplish is not copying the
last three padding bytes, which might even be slower than copying all four.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: ALTER TABLE on system catalogs

2018-08-20 Thread Peter Eisentraut
On 20/08/2018 04:37, Michael Paquier wrote:
> For 0002, indeed the patch still
> seems relevant.  The comment block at the beginning of
> create_toast_table should be updated.  Some tests would also be nice to
> have.

Tests would require enabling allow_system_table_mods, which is
PGC_POSTMASTER, so we couldn't do it inside the normal regression test
suite.  I'm not sure setting up a whole new test suite for this is worth it.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: remove ATTRIBUTE_FIXED_PART_SIZE

2018-08-20 Thread Andres Freund
On 2018-08-20 12:34:15 +0200, Peter Eisentraut wrote:
> On 20/08/2018 12:32, Peter Eisentraut wrote:
> > On 18/08/2018 23:05, Tom Lane wrote:
> >> Possibly we need to be more careful than we are now about whether
> >> there's padding at the end of the fixed-size fields ... but just
> >> ripping out the code that attempts to deal with that is hardly
> >> an improvement.
> > 
> > I don't think the tuple packing issue has to do with the tuple
> > descriptor code.  The tuple descriptors already use allocations of size
> > sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy
> > and memset calls use (potentially) less.  That might have saved a few
> > bytes for omitting the varlena fields, but I don't think it affects
> > alignment correctness.  If we, say, added a trailing char field now, the
> > only thing this code
> 
> [oops]
> 
> ... the only thing the current code would accomplish is not copying the
> last three padding bytes, which might even be slower than copying all four.

That's not generally the case though, there's code like:

static VacAttrStats *
examine_attribute(Relation onerel, int attnum, Node *index_expr)
{
...
/*
 * Create the VacAttrStats struct.  Note that we only have a copy of the
 * fixed fields of the pg_attribute tuple.
 */
stats = (VacAttrStats *) palloc0(sizeof(VacAttrStats));
stats->attr = (Form_pg_attribute) palloc(ATTRIBUTE_FIXED_PART_SIZE);
memcpy(stats->attr, attr, ATTRIBUTE_FIXED_PART_SIZE);

Accesses to stats->attr can legitimately assume that the padding at the
tail end of attr is present (i.e. widening a load / store).
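
To make that concrete with a toy standalone example (the struct and names
are made up, not pg_attribute itself):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stand-in for a catalog struct: fixed fields, varlena fields omitted. */
    typedef struct ToyFormData
    {
        int32_t     fixed_a;
        char        fixed_b;    /* last fixed field */
        /* VARIABLE LENGTH FIELDS would follow in the on-disk tuple */
    } ToyFormData;

    /* Analogue of ATTRIBUTE_FIXED_PART_SIZE: 5 bytes, vs sizeof() == 8. */
    #define TOY_FIXED_PART_SIZE  (offsetof(ToyFormData, fixed_b) + sizeof(char))

    int
    main(void)
    {
        ToyFormData src = {42, 'x'};
        ToyFormData *copy = malloc(TOY_FIXED_PART_SIZE);

        memcpy(copy, &src, TOY_FIXED_PART_SIZE);

        /*
         * Reading *copy through the struct type (a whole-struct copy, or a
         * compiler-widened load of fixed_b) may touch the three trailing
         * pad bytes that were never allocated -- the hazard described
         * above.
         */
        free(copy);
        return 0;
    }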

Greetings,

Andres Freund



Re: ALTER TABLE on system catalogs

2018-08-20 Thread Andres Freund
On 2018-08-20 11:37:49 +0900, Michael Paquier wrote:
> On Fri, Jul 13, 2018 at 11:05:10AM +0200, Peter Eisentraut wrote:
> > After reviewing that thread, I think my patch would still be relevant.
> > Because the pending proposal is to not add TOAST tables to some catalogs
> > such as pg_attribute, so my original use case of allowing ALTER TABLE /
> > SET on system catalogs would still be broken for those tables.
> 
> Something has happened in this area in the shape of 96cdeae, and the
> following catalogs do not have toast tables: pg_class, pg_index,
> pg_attribute and pg_largeobject*.  0001 as proposed upthread is not
> really relevant anymore, because there is already a test to list which
> catalogs do not have a toast table.  For 0002, indeed the patch still
> seems relevant.  The comment block at the beginning of
> create_toast_table should be updated.

I still object to the approach in 0002.

Greetings,

Andres Freund



Re: partitioning - changing a slot's descriptor is expensive

2018-08-20 Thread Amit Khandekar
On 20 August 2018 at 14:45, Amit Langote  wrote:
> Thanks for the review.
>
> On 2018/08/17 15:00, Amit Khandekar wrote:
>> Thanks for the patch. I did some review of the patch (needs a rebase).
>> Below are my comments.
>>
>> @@ -1936,12 +1936,11 @@ ExecPartitionCheckEmitError(ResultRelInfo
>> *resultRelInfo,
>> +   /* Input slot might be of a partition, which has a fixed tupdesc. */
>> +   slot = MakeTupleTableSlot(tupdesc);
>> if (map != NULL)
>> -   {
>>   tuple = do_convert_tuple(tuple, map);
>> - ExecSetSlotDescriptor(slot, tupdesc);
>> - ExecStoreTuple(tuple, slot, InvalidBuffer, false);
>> -   }
>> +   ExecStoreTuple(tuple, slot, InvalidBuffer, false);
>>
>> Both MakeTupleTableSlot() and ExecStoreTuple() can be inside the (map
>> != NULL) if condition.
>> This also applies for similar changes in ExecConstraints() and
>> ExecWithCheckOptions().
>
> Ah, okay.  I guess that means we'll allocate a new slot here only if we
> had to switch to a partition-specific slot in the first place.
>
>> + * Initialize an empty slot that will be used to manipulate tuples of any
>> + * this partition's rowtype.
>> of any this => of this
>>
>> + * Initialized the array where these slots are stored, if not already
>> Initialized => Initialize
>
> Fixed.
>
>> + proute->partition_tuple_slots_alloced =
>> + lappend(proute->partition_tuple_slots_alloced,
>> + proute->partition_tuple_slots[partidx]);
>>
>> Instead of doing the above, I think we can collect those slots in
>> estate->es_tupleTable using ExecInitExtraTupleSlot() so that they
>> don't have to be explicitly dropped in ExecCleanupTupleRouting(). And
>> also then we won't require the new field
>> proute->partition_tuple_slots_alloced.
>
> Although I was slightly uncomfortable with the idea at first, thinking that
> it's not good for tuple-routing-specific resources to be released by
> generic executor code, it doesn't seem too bad to do it the way you suggest.
>
> Attached updated patch.

Thanks for the changes. The patch looks good to me.

> By the way, I think it might be a good idea to
> try to merge this patch with the effort in the following thread.
>
> * Reduce partition tuple routing overheads *
> https://commitfest.postgresql.org/19/1690/

Yes. I guess, whichever commits later needs to be rebased on the earlier one.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: Slotification of partition tuple conversion

2018-08-20 Thread Amit Khandekar
On 17 August 2018 at 21:47, Amit Khandekar  wrote:

> Attached is a patch tup_convert.patch that does the conversion
> directly using input and output tuple slots. This patch is to be
> applied on an essential patch posted by Amit Langote in [1] that
> arranges for dedicated per-partition slots rather than changing
> tupdesc of a single slot. I have rebased that patch from [1] and
> attached it here (
> Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch).

Amit Langote has posted a new version of the
Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch at
[1].  Attached is version v2 of the tup_convert.patch, which is rebased
over that patch.


[1] 
https://www.postgresql.org/message-id/attachment/64354/v2-0001-Allocate-dedicated-slots-of-partition-tuple-conve.patch



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


tup_convert_v2.patch
Description: Binary data


Re: partitioning - changing a slot's descriptor is expensive

2018-08-20 Thread Andres Freund
Hi,

On 2018-08-20 16:40:17 +0530, Amit Khandekar wrote:
> On 20 August 2018 at 14:45, Amit Langote  
> wrote:
> > By the way, I think it might be a good idea to
> > try to merge this patch with the effort in the following thread.
> >
> > * Reduce partition tuple routing overheads *
> > https://commitfest.postgresql.org/19/1690/
> 
> Yes. I guess, whichever commits later needs to be rebased on the earlier one.

I think I'd rather keep this patch separate, it's pretty simple, so we
should be able to commit it fairly soon.

Greetings,

Andres Freund



Re: partitioning - changing a slot's descriptor is expensive

2018-08-20 Thread David Rowley
On 20 August 2018 at 23:21, Andres Freund  wrote:
> On 2018-08-20 16:40:17 +0530, Amit Khandekar wrote:
>> On 20 August 2018 at 14:45, Amit Langote  
>> wrote:
>> > By the way, I think it might be a good idea to
>> > try to merge this patch with the effort in the following thread.
>> >
>> > * Reduce partition tuple routing overheads *
>> > https://commitfest.postgresql.org/19/1690/
>>
>> Yes. I guess, whichever commits later needs to be rebased on the earlier one.
>
> I think I'd rather keep this patch separate, it's pretty simple, so we
> should be able to commit it fairly soon.

+1.   I think that patch is already big enough.

I'm perfectly fine with rebasing that if this one gets committed first.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



libpq host/hostaddr/conninfo inconsistencies

2018-08-20 Thread Fabien COELHO


Hello devs,

While reviewing various patches by Tom which are focussing on libpq 
multi-host behavior,


https://commitfest.postgresql.org/19/1749/
https://commitfest.postgresql.org/19/1752/

it occurred to me that there are a few more problems with the 
documentation, the host/hostaddr feature, and the consistency of both. 
Namely:


* According to the documentation, either "host" or "hostaddr" can be 
specified: the former for names and socket directories, the latter for IP 
addresses. If both are specified, "hostaddr" supersedes "host", and it may 
be used for various authentication purposes.


However, the actual capability is slightly different: specifying an IP 
address for "host" does work, without triggering any name or reverse-name 
look-ups, even though this is undocumented.  This means that the 
host/hostaddr dichotomy is somewhat moot, as host can already be used for 
the same purpose.


* \conninfo does not follow the implemented logic, and, as there is no 
sanity check performed on the specification, it can display wrong 
information, which is not going to be helpful to anyone who has a problem 
to solve and is trying to figure out the current state:


  sh> psql "host=/tmp hostaddr=127.0.0.1"
  psql> \conninfo
  You are connected to database "fabien" as user "fabien" via socket in "/tmp" at port 
"5432"
  # wrong, it is really connected to 127.0.0.1 by TCP/IP

  sh> psql "host=127.0.0.2 hostaddr=127.0.0.1"
  psql> \conninfo
  You are connected to database "fabien" as user "fabien" on host "127.0.0.2" at port 
"5432".
  # wrong again, it is really connected to 127.0.0.1

  sh> psql "hostaddr=127.0.0.1"
  psql> \conninfo
  You are connected to database "fabien" as user "fabien" via socket in 
"/var/run/postgresql" at port "5432".
  # wrong again

* Another issue with \conninfo is that if a host resolves to multiple IPs, 
there is no way to know which one was chosen and/or worked, although on 
errors some messages show the failing IP.


* The host/hostaddr dichotomy gets worse when several targets are specified, 
because according to the documentation you should specify names & dirs as 
host and IPs as hostaddr, which leads to pretty strange specifications, each 
a possible source of confusion and of the unhelpful messages described 
above:


  sh> psql "host=localhost,127.0.0.2,, hostaddr=127.0.0.1,,127.0.0.3,"
  # attempt 1 is 127.0.0.1 identified as localhost
  # attempt 2 is 127.0.0.2
  # attempt 3 is 127.0.0.3 identified as the default, whatever it is
  # attempt 4 is really the default

* The documentation about host/hostaddr/port accepting lists really reads 
like an afterthought: the features are presented for a single value, and 
then the list form is mentioned.  Moreover there is quite a bit of 
repetition between the paragraphs about defaults and so on.



Given this state of affairs, ISTM that the situation would be clarified by:

(1) describing "host" full capability to accept names, ips and dirs.

(2) describing "hostaddr" as a look-up shortcut. Maybe the "hostaddr" 
could be renamed in passing, eg "resolve" to outline that it is just a 
lookup shortcut, and not a partial alternative to "host".


(3) checking that non-empty hostaddr entries are only accepted if the 
corresponding host is a name; the user must use the "host=ip" syntax to 
connect to an IP directly (see the examples after this list).

(4) teaching \conninfo to show the real connection, which probably requires 
extending libpq to expose the underlying IP, eg PQaddr or PQhostaddr or 
whatever.
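
To illustrate point (3) against the earlier examples (hypothetical
behavior of the proposed check, obviously not what the current code does):

  sh> psql "host=foo.example.org hostaddr=127.0.0.1"   # ok: name plus its address
  sh> psql "host=/tmp hostaddr=127.0.0.1"              # rejected: dir with hostaddr
  sh> psql "host=127.0.0.2 hostaddr=127.0.0.1"         # rejected: use host=127.0.0.1
  sh> psql "hostaddr=127.0.0.1"                        # rejected: no host name given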


The attached patch does 1-3 (2 without renaming, though).

Thoughts?

--
Fabien.

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index 49de975e7f..6b3e0b0b87 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -955,23 +955,28 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname
   host
   

-Name of host to connect to.host name
-If a host name begins with a slash, it specifies Unix-domain
-communication rather than TCP/IP communication; the value is the
-name of the directory in which the socket file is stored.  If
-multiple host names are specified, each will be tried in turn in
-the order given.  The default behavior when host is
-not specified, or is empty, is to connect to a Unix-domain
+Comma-separated list of hosts to connect to.host name
+Each item may be a host name that will be resolved with a look-up,
+a numeric IP address (IPv4 in the standard format, e.g.,
+172.28.40.9, or IPv6 if supported by your machine)
+that will be used directly, or
+the name of a directory which contains the socket file for Unix-domain
+communication rather than TCP/IP communication
+(the specification must then begin with a slash);
+   
+
+   
+The default behavior when host is
+not specified, or an item is empty, is to connect to a Unix-domain
 socketUnix domain socket in
 /tmp (or whatever socket directory was sp

Re: ALTER TABLE on system catalogs

2018-08-20 Thread Michael Paquier
On Mon, Aug 20, 2018 at 12:43:20PM +0200, Peter Eisentraut wrote:
> Tests would require enabling allow_system_table_mods, which is
> PGC_POSTMASTER, so we couldn't do it inside the normal regression test
> suite.  I'm not sure setting up a whole new test suite for this is worth it.

I forgot this point.  Likely that's not worth it.
--
Michael




Re: Conflict handling for COPY FROM

2018-08-20 Thread Robert Haas
On Sat, Aug 4, 2018 at 9:10 AM Surafel Temesgen 
wrote:

> In order to prevent extreme conditions the patch also adds a new GUC
> variable called copy_max_error_limit that controls the number of errors to
> swallow before starting to error out, and a new failed-record file option
> for COPY to write failed records to, so the user can examine them.
>

Why should this be a GUC rather than a COPY option?

In fact, try doing all of this by adding more options to COPY rather than
new syntax.

COPY ... WITH (ignore_conflicts 1000, ignore_logfile '...')

It kind of stinks to use a log file written by the server as the
dup-reporting channel though. That would have to be superuser-only.

...Robert
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: TupleTableSlot abstraction

2018-08-20 Thread Andres Freund
On 2018-08-20 17:51:38 +0530, Ashutosh Bapat wrote:
> > I also noticed an independent issue in your changes to
> > ExecInitScanTupleSlot(): You can't assume that the plan belonging to the
> > ScanState have a Scan node in their plan. Look e.g. at Material, Sort
> > etc. So currently your scanrelid access is often just uninitialized
> > data.
> 
> I have incorporated changes in your patches into the relevant patches
> in the updated patch-set. With this patch-set make check-world passes
> for me.

Have you addressed the issue I commented on above?

Greetings,

Andres Freund



WaitForOlderSnapshots refactoring

2018-08-20 Thread Peter Eisentraut
The attached patch factors out the CREATE INDEX CONCURRENTLY code that
waits for transactions with older snapshots to finish into a new
function WaitForOlderSnapshots().

This refactoring was part of a previously posted REINDEX CONCURRENTLY
patch.  But this code is now also appearing as a copy-and-paste in the
ATTACH/DETACH PARTITION CONCURRENTLY thread, so it might be worth making
it an official thing.

The question is where to put it.  This patch just leaves it static in
indexcmds.c, which doesn't help other uses.  A sensible place might be a
new src/backend/commands/common.c.  Or we make it non-static in
indexcmds.c when the need arises.

Thoughts?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 61c413200d9806127a72d13a197b32527b48d793 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 20 Aug 2018 13:58:22 +0200
Subject: [PATCH] Factor out WaitForOlderSnapshots()

This code from CREATE INDEX CONCURRENTLY might also be useful for other
future concurrent DDL commands, so make it a separate function to avoid
everyone copy-and-pasting it.
---
 src/backend/commands/indexcmds.c | 155 +--
 1 file changed, 86 insertions(+), 69 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index d54c78c352..8beb8bb899 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -297,6 +297,90 @@ CheckIndexCompatible(Oid oldId,
return ret;
 }
 
+
+/*
+ * WaitForOlderSnapshots
+ *
+ * Wait for transactions that might have an older snapshot than the given xmin
+ * limit, because it might not contain tuples deleted just before it has
+ * been taken. Obtain a list of VXIDs of such transactions, and wait for them
+ * individually.
+ *
+ * We can exclude any running transactions that have xmin > the xmin given;
+ * their oldest snapshot must be newer than our xmin limit.
+ * We can also exclude any transactions that have xmin = zero, since they
+ * evidently have no live snapshot at all (and any one they might be in
+ * process of taking is certainly newer than ours).  Transactions in other
+ * DBs can be ignored too, since they'll never even be able to see this
+ * index.
+ *
+ * We can also exclude autovacuum processes and processes running manual
+ * lazy VACUUMs, because they won't be fazed by missing index entries
+ * either. (Manual ANALYZEs, however, can't be excluded because they
+ * might be within transactions that are going to do arbitrary operations
+ * later.)
+ *
+ * Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
+ * check for that.
+ *
+ * If a process goes idle-in-transaction with xmin zero, we do not need to
+ * wait for it anymore, per the above argument.  We do not have the
+ * infrastructure right now to stop waiting if that happens, but we can at
+ * least avoid the folly of waiting when it is idle at the time we would
+ * begin to wait.  We do this by repeatedly rechecking the output of
+ * GetCurrentVirtualXIDs.  If, during any iteration, a particular vxid
+ * doesn't show up in the output, we know we can forget about it.
+ */
+static void
+WaitForOlderSnapshots(TransactionId limitXmin)
+{
+   int i;
+   int n_old_snapshots;
+   VirtualTransactionId *old_snapshots;
+
+   old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
+   
  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+   
  &n_old_snapshots);
+
+   for (i = 0; i < n_old_snapshots; i++)
+   {
+   if (!VirtualTransactionIdIsValid(old_snapshots[i]))
+   continue;   /* found uninteresting 
in previous cycle */
+
+   if (i > 0)
+   {
+   /* see if anything's changed ... */
+   VirtualTransactionId *newer_snapshots;
+   int n_newer_snapshots;
+   int j;
+   int k;
+
+   newer_snapshots = GetCurrentVirtualXIDs(limitXmin,
+   
true, false,
+   
PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
+   
&n_newer_snapshots);
+   for (j = i; j < n_old_snapshots; j++)
+   {
+   if 
(!VirtualTransactionIdIsValid(old_snapshots[j]))
+   continue;   /* found uninteresting 
in previous cycle */

Re: ALTER TABLE on system catalogs

2018-08-20 Thread Peter Eisentraut
On 20/08/2018 12:48, Andres Freund wrote:
> On 2018-08-20 11:37:49 +0900, Michael Paquier wrote:
>> On Fri, Jul 13, 2018 at 11:05:10AM +0200, Peter Eisentraut wrote:
>>> After reviewing that thread, I think my patch would still be relevant.
>>> Because the pending proposal is to not add TOAST tables to some catalogs
>>> such as pg_attribute, so my original use case of allowing ALTER TABLE /
>>> SET on system catalogs would still be broken for those tables.
>>
>> Something has happened in this area in the shape of 96cdeae, and the
>> following catalogs do not have toast tables: pg_class, pg_index,
>> pg_attribute and pg_largeobject*.  0001 as proposed upthread is not
>> really relevant anymore, because there is already a test to list which
>> catalogs do not have a toast table.  For 0002, indeed the patch still
>> seems relevant.  The comment block at the beginning of
>> create_toast_table should be updated.
> 
> I still object to the approach in 0002.

Do you have an alternative in mind?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WaitForOlderSnapshots refactoring

2018-08-20 Thread Andres Freund
Hi,

On 2018-08-20 14:35:34 +0200, Peter Eisentraut wrote:
> The attached patch factors out the CREATE INDEX CONCURRENTLY code that
> waits for transactions with older snapshots to finish into a new
> function WaitForOlderSnapshots().

> This refactoring was part of a previously posted REINDEX CONCURRENTLY
> patch.  But this code is now also appearing as a copy-and-paste in the
> ATTACH/DETACH PARTITION CONCURRENTLY thread, so it might be worth making
> it an official thing.

I'm doubtful that ATTACH should use this, but I think the refactoring is
a good idea regardless.


> The question is where to put it.  This patch just leaves it static in
> indexcmds.c, which doesn't help other uses.  A sensible place might be a
> new src/backend/commands/common.c.  Or we make it non-static in
> indexcmds.c when the need arises.

Why not move it to procarray.c?  Most of the referenced functionality
resides there IIRC.

Greetings,

Andres Freund



Re: ALTER TABLE on system catalogs

2018-08-20 Thread Andres Freund
On 2018-08-20 14:38:25 +0200, Peter Eisentraut wrote:
> On 20/08/2018 12:48, Andres Freund wrote:
> > On 2018-08-20 11:37:49 +0900, Michael Paquier wrote:
> >> On Fri, Jul 13, 2018 at 11:05:10AM +0200, Peter Eisentraut wrote:
> >>> After reviewing that thread, I think my patch would still be relevant.
> >>> Because the pending proposal is to not add TOAST tables to some catalogs
> >>> such as pg_attribute, so my original use case of allowing ALTER TABLE /
> >>> SET on system catalogs would still be broken for those tables.
> >>
> >> Something has happened in this area in the shape of 96cdeae, and the
> >> following catalogs do not have toast tables: pg_class, pg_index,
> >> pg_attribute and pg_largeobject*.  0001 as proposed upthread is not
> >> really relevant anymore, because there is already a test to list which
> >> catalogs do not have a toast table.  For 0002, indeed the patch still
> >> seems relevant.  The comment block at the beginning of
> >> create_toast_table should be updated.
> > 
> > I still object to the approach in 0002.
> 
> Do you have an alternative in mind?

One is to just not do anything. I'm not sure I'm on board with the goal
of changing things to make DDL on system tables more palatable.

If we want to do something, the first thing to do is to move the
if (shared_relation && !IsBootstrapProcessingMode())
ereport(ERROR,

(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 errmsg("shared tables cannot be toasted after 
initdb")));
bit below the
/*
 * Is it already toasted?
 */
and
/*
 * Check to see whether the table actually needs a TOAST table.
 */
blocks.  There's no point in erroring out when we'd actually not do
squat anyway.

By that point there's just a few remaining tables where you couldn't do
the ALTER TABLE.

After that I think there are two options.  The first is to just follow the
rules, add toast tables for the relations that formally need them, and
review the VACUUM FULL/CLUSTER codepath around relation swapping.  That's
imo the best course.

The other option is to move those checks to the layers where they more
reasonably belong.  IMO that's needs_toast_table().  I disfavor this, but
it's better than what you did IMO.

Greetings,

Andres Freund



Re: [FEATURE PATCH] pg_stat_statements with plans (v02)

2018-08-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-20 15:21:07 +1200, Thomas Munro wrote:
>> -Werror=vla in GCC, apparently.

> How about detecting support for that in our configure script and
> automatically using it?  If we're uncomfortable with raising an error,
> let's at least raise a warning?

Yeah, we'd certainly auto-turn-on whatever options we think are needed.

regards, tom lane



Re: remove ATTRIBUTE_FIXED_PART_SIZE

2018-08-20 Thread Tom Lane
Peter Eisentraut  writes:
>> I don't think the tuple packing issue has to do with the tuple
>> descriptor code.  The tuple descriptors already use allocations of size
>> sizeof(FormData_pg_attribute) (CreateTemplateTupleDesc), just the memcpy
>> and memset calls use (potentially) less.  That might have saved a few
>> bytes for omitting the varlena fields, but I don't think it affects
>> alignment correctness.  If we, say, added a trailing char field now, the
>> only thing this code
> ... the only thing the current code would accomplish is not copying the
> last three padding bytes, which might even be slower than copying all four.

But, if the varlena fields are all null, those bytes might not be there
... at least, not according to valgrind.

I agree this is all moot as long as there's no pad bytes.  What I'm
trying to figure out is if we need to put in place some provisions
to prevent there from being pad bytes at the end of any catalog struct.
According to what Andres is saying, it seems like we do (at least for
ones with varlena fields).

regards, tom lane



Re: ALTER TABLE on system catalogs

2018-08-20 Thread Tom Lane
Andres Freund  writes:
> On 2018-08-20 14:38:25 +0200, Peter Eisentraut wrote:
>> Do you have an alternative in mind?

> One is to just not do anything. I'm not sure I'm on board with the goal
> of changing things to make DDL on system tables more palatable.

FWIW, I agree with doing as little as possible here.  I'd be on board
with Andres' suggestion of just swapping the two code stanzas so that
the no-op case doesn't error out.  As soon as you go beyond that, you
are in wildly unsafe and untested territory.

regards, tom lane



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-20 Thread Alvaro Herrera
On 2018-Aug-20, Michael Paquier wrote:

> On Tue, Aug 14, 2018 at 06:53:32PM +0200, Michael Paquier wrote:
> > I was thinking about adding "Even if it is not atomic" or such at the
> > beginning of the paragraph, but at the end your phrasing sounds better
> > to me.  So I have hacked up the attached, which also reworks the comment
> > in InitTempTableNamespace in the same spirit.  Thoughts?
> 
> It has been close to one week since I sent this patch.  Anything?  I
> don't like to keep folks unhappy about anything I committed.

Seems reasonable.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Chris Travers
Hi all;

I am looking at trying to make two modifications to the PostgreSQL FDW and
would like feedback on this before I do.

1.  INSERTMETHOD=[insert|copy] option on foreign table.

One significant limitation of the PostgreSQL FDW is that it does a prepared
statement insert on each row written which imposes a per-row latency.  This
hits environments where there is significant latency or few latency
guarantees particularly hard, for example, writing to a foreign table that
might be physically located on another continent.  The idea is that
INSERTMETHOD would default to insert and therefore have no changes but
where needed people could specify COPY which would stream the data out.
Updates would still be unaffected.
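
To make the proposal concrete, the option would be set per table roughly as
below (hypothetical syntax and names; none of this exists yet):

CREATE FOREIGN TABLE measurements_remote (
    id       bigint,
    payload  text
) SERVER faraway_pg
  OPTIONS (insertmethod 'copy');   -- the default would be 'insert'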

2.  TWOPHASECOMMIT=[off|on] option

The second major issue that I see with PostgreSQL's foreign database
wrappers is the fact that there is no two-phase commit, which means that a
single transaction writing to a group of tables has no expectation that all
backends will commit or roll back together.  With this patch an option would
be applied to foreign tables such that they could be set to use two-phase
commit.  When this is done, the first write to each backend would register a
connection with a global transaction handler, and pre-commit and commit
hooks would be set up to properly process these.

On pre-commit, a per-global-transaction file would be opened in the data
directory and prepare statements logged to the file.  On error, we simply
roll back our local transaction.

In the commit hook, we go through and start to commit the remote global
transactions.  At this point we make a best effort but track whether or not
we were successful on all of them.  If successful on all, we delete the file.
If unsuccessful, we fire a background worker which re-reads the file and is
responsible for cleanup.  If global transactions persist, a SQL
administration function will be made available to restart the cleanup
process.  On rollback, we do as with commit but roll back all transactions
in the set.  The file has enough information to determine whether we should
be committing or rolling back on cleanup.
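
Again purely illustrative, opting an existing foreign table into the
behavior might look like:

ALTER FOREIGN TABLE measurements_remote
    OPTIONS (ADD twophasecommit 'on');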

I would like to push both of these for Pg 12.  Is there any feedback on the
concepts and the problems first?

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Andres Freund
Hi,

On 2018-08-20 16:28:01 +0200, Chris Travers wrote:
> 1.  INSERTMETHOD=[insert|copy] option on foreign table.
> 
> One significant limitation of the PostgreSQL FDW is that it does a prepared
> statement insert on each row written which imposes a per-row latency.  This
> hits environments where there is significant latency or few latency
> guarantees particularly hard, for example, writing to a foreign table that
> might be physically located on another continent.  The idea is that
> INSERTMETHOD would default to insert and therefore have no changes but
> where needed people could specify COPY which would stream the data out.
> Updates would still be unaffected.

That has a *lot* of semantics issues, because you suddenly don't get
synchronous error reports anymore.  I don't think that's OK on a
per-table basis.  If we invented something like this, it IMO should be a
per-statement explicit opt in that'd allow streaming.


> 2.  TWOPHASECOMMIT=[off|on] option

> The second major issue that I see with PostgreSQL's foreign database
> wrappers is the fact that there is no two phase commit which means that a
> single transaction writing to a group of tables has no expectation that all
> backends will commit or rollback together.  With this patch an option would
> be applied to foreign tables such that they could be set to use two phase
> commit  When this is done, the first write to each backend would register a
> connection with a global transaction handler and a pre-commit and commit
> hooks would be set up to properly process these.
> 
> On recommit a per-global-transaction file would be opened in the data
> directory and prepare statements logged to the file.  On error, we simply
> roll back our local transaction.
> 
> On commit hook , we go through and start to commit the remote global
> transactions.  At this point we make a best effort but track whether or not
> we were successfully on all.  If successful on all, we delete the file.  If
> unsuccessful we fire a background worker which re-reads the file and is
> responsible for cleanup.  If global transactions persist, a SQL
> administration function will be made available to restart the cleanup
> process.  On rollback, we do like commit but we roll back all transactions
> in the set.  The file has enough information to determine whether we should
> be committing or rolling back on cleanup.
> 
> I would like to push these both for Pg 12.  Is there any feedback on the
> concepts and the problems first

There's been *substantial* work on this. You should at least read the
discussion & coordinate with the relevant developers.

See https://commitfest.postgresql.org/19/1574/ and the referenced discussions.

Greetings,

Andres Freund



Re: Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Tom Lane
Chris Travers  writes:
> I am looking at trying to make two modifications to the PostgreSQL FDW and
> would like feedback on this before I do.

> 1.  INSERTMETHOD=[insert|copy] option on foreign table.

> One significant limitation of the PostgreSQL FDW is that it does a prepared
> statement insert on each row written which imposes a per-row latency.  This
> hits environments where there is significant latency or few latency
> guarantees particularly hard, for example, writing to a foreign table that
> might be physically located on another continent.  The idea is that
> INSERTMETHOD would default to insert and therefore have no changes but
> where needed people could specify COPY which would stream the data out.
> Updates would still be unaffected.

It seems unlikely to me that an FDW option would be at all convenient
for this.  What about selecting it dynamically based on the planner's
estimate of the number of rows to be inserted?

A different thing we could think about is enabling COPY TO/FROM a
foreign table.

> 2.  TWOPHASECOMMIT=[off|on] option

> The second major issue that I see with PostgreSQL's foreign database
> wrappers is the fact that there is no two phase commit which means that a
> single transaction writing to a group of tables has no expectation that all
> backends will commit or rollback together.  With this patch an option would
> be applied to foreign tables such that they could be set to use two phase
> commit  When this is done, the first write to each backend would register a
> connection with a global transaction handler and a pre-commit and commit
> hooks would be set up to properly process these.

ENOINFRASTRUCTURE ... and the FDW pieces of that hardly seem like the
place to start.

regards, tom lane



Re: Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2018-08-20 16:28:01 +0200, Chris Travers wrote:
> > 1.  INSERTMETHOD=[insert|copy] option on foreign table.
> > 
> > One significant limitation of the PostgreSQL FDW is that it does a prepared
> > statement insert on each row written which imposes a per-row latency.  This
> > hits environments where there is significant latency or few latency
> > guarantees particularly hard, for example, writing to a foreign table that
> > might be physically located on another continent.  The idea is that
> > INSERTMETHOD would default to insert and therefore have no changes but
> > where needed people could specify COPY which would stream the data out.
> > Updates would still be unaffected.
> 
> That has a *lot* of semantics issues, because you suddenly don't get
> synchronous error reports anymore.  I don't think that's OK on a
> per-table basis.  If we invented something like this, it IMO should be a
> per-statement explicit opt in that'd allow streaming.

Doing some kind of decoration on a per-statement level to do something
different for FDWs doesn't really seem very clean..

On reading this, a thought I had was that maybe we should just perform a
COPY to the FDW when COPY is what's been specified by the user (eg:

COPY my_foreign_table FROM STDIN;

), but that wouldn't help when someone wants to bulk copy data from a
local table into a foreign table.

COPY is already non-standard though, so we can extend it, and one option
might be to extend it like so:

COPY my_local_table TO TABLE my_foreign_table;

Which could be made to work for both foreign tables and local ones,
where it'd basically be:

INSERT INTO my_foreign_table SELECT * FROM my_local_table;

The COPY TO case already supports queries, such that you could then do:

COPY (SELECT c1,c2,c3 FROM my_local_table) TO TABLE my_foreign_table;

I'd also think we'd want to support this kind of 'bulk COPY-like'
operation for multiple FDWs (I wonder if maybe file_fdw could be made to
support this new method, thus allowing users to write out to files with
it, which we don't support today at all).

Just some brain-storming and ideas about where this could possibly go.

Thanks!

Stephen




Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-08-20 Thread Alexander Korotkov
On Wed, Apr 18, 2018 at 11:49 PM Tom Lane  wrote:
> I wrote:
> > Relation truncation throws away the page image in memory without ever
> > writing it to disk.  Then, if the subsequent file truncate step fails,
> > we have a problem, because anyone who goes looking for that page will
> > fetch it afresh from disk and see the tuples as live.
>
> > There are WAL entries recording the row deletions, but that doesn't
> > help unless we crash and replay the WAL.
>
> > It's hard to see a way around this that isn't fairly catastrophic for
> > performance :-(.
>
> Just to throw out a possibly-crazy idea: maybe we could fix this by
> PANIC'ing if truncation fails, so that we replay the row deletions from
> WAL.  Obviously this would be intolerable if the case were frequent,
> but we've had only two such complaints in the last nine years, so maybe
> it's tolerable.  It seems more attractive than taking a large performance
> hit on truncation speed in normal cases, anyway.

We have only two complaints of data corruption in nine years.  But I
suspect that in the vast majority of cases the truncation error either
didn't cause the corruption OR the corruption wasn't noticed.  So, once we
introduce a PANIC here, we would get way more complaints.

> A gotcha to be concerned about is what happens if we replay from WAL,
> come to the XLOG_SMGR_TRUNCATE WAL record, and get the same truncation
> failure again, which is surely not unlikely.  PANIC'ing again will not
> do.  I think we could probably handle that by having the replay code
> path zero out all the pages it was unable to delete; as long as that
> succeeds, we can call it good and move on.
>
> Or maybe just do that in the mainline case too?  That is, if ftruncate
> fails, handle it by zeroing the undeletable pages and pressing on?

I've just started really digging into this set of problems.  But so far
this idea looks good to me...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: libpq compression

2018-08-20 Thread Konstantin Knizhnik



On 13.08.2018 23:06, Andrew Dunstan wrote:



On 08/13/2018 02:47 PM, Robert Haas wrote:

On Fri, Aug 10, 2018 at 5:55 PM, Andrew Dunstan
 wrote:

This thread appears to have gone quiet. What concerns me is that there
appears to be substantial disagreement between the author and the
reviewers. Since the last thing was this new patch it should really have
been put back into "needs review" (my fault to some extent - I missed
that). So rather than return the patch with feedback I'm going to set it
to "needs review" and move it to the next CF. However, if we can't arrive
at a consensus about the direction during the next CF it should be
returned with feedback.

I agree with the critiques from Robbie Harwood and Michael Paquier
that the way in that compression is being hooked into the existing
architecture looks like a kludge.  I'm not sure I know exactly how it
should be done, but the current approach doesn't look natural; it
looks like it was bolted on.  I agree with the critique from Peter
Eisentraut and others that we should not go around and add -Z as a
command-line option to all of our tools; this feature hasn't yet
proved itself to be useful enough to justify that.  Better to let
people specify it via a connection string option if they want it.  I
think Thomas Munro was right to ask about what will happen when
different compression libraries are in use, and I think failing
uncleanly is quite unacceptable.  I think there needs to be some
method for negotiating the compression type; the client can, for
example, provide a comma-separated list of methods it supports in
preference order, and the server can pick the first one it likes.  In
short, I think that a number of people have provided really good
feedback on this patch, and I suggest to Konstantin that he should
consider accepting all of those suggestions.

Commit ae65f6066dc3d19a55f4fdcd3b30003c5ad8dbed tried to introduce
some facilities that can be used for protocol version negotiation as
new features are added, but this patch doesn't use them.  It looks to
me like it instead just breaks backward compatibility.  The new
'compression' option won't be recognized by older servers.  If it were
spelled '_pq_.compression' then the facilities in that commit would
cause a NegotiateProtocolVersion message to be sent by servers which
have that commit but don't support the compression option.  I'm not
exactly sure what will happen on even-older servers that don't have
that commit either, but I hope they'll treat it as a GUC name; note
that setting an unknown GUC name with a namespace prefix is not an
error, but setting one without such a prefix IS an ERROR. Servers
which do support compression would respond with a message indicating
that compression had been enabled or, maybe, just start sending back
compressed-packet messages, if we go with including some framing in
the libpq protocol.




Excellent summary, and well argued recommendations, thanks. I've 
changed the status to waiting on author.


New version of the patch is attached: I removed the -Z options from pgbench 
and psql and added a check that the server and client are using the 
same compression algorithm.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/configure b/configure
index 0aafd9c..fc5685c 100755
--- a/configure
+++ b/configure
@@ -699,6 +699,7 @@ ELF_SYS
 EGREP
 GREP
 with_zlib
+with_zstd
 with_system_tzdata
 with_libxslt
 with_libxml
@@ -863,6 +864,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 enable_float4_byval
@@ -8017,6 +8019,86 @@ fi
 
 
 #
+# ZStd
+#
+
+
+
+# Check whether --with-zstd was given.
+if test "${with_zstd+set}" = set; then :
+  withval=$with_zstd;
+  case $withval in
+yes)
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --with-zstd option" "$LINENO" 5
+  ;;
+  esac
+
+else
+  with_zstd=no
+
+fi
+
+
+
+
+if test "$with_zstd" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for ZSTD_compress in -lzstd" >&5
+$as_echo_n "checking for ZSTD_compress in -lzstd... " >&6; }
+if ${ac_cv_lib_zstd_ZSTD_compress+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lzstd  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char ZSTD_compress ();
+int
+main ()
+{
+return ZSTD_compress ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_zstd_ZSTD_compress=yes
+else
+  ac_cv_lib_zstd_ZSTD_compress=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}

Re: Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Andres Freund
Hi,

On 2018-08-20 10:56:39 -0400, Stephen Frost wrote:
> * Andres Freund (and...@anarazel.de) wrote:
> > On 2018-08-20 16:28:01 +0200, Chris Travers wrote:
> > > 1.  INSERTMETHOD=[insert|copy] option on foreign table.
> > > 
> > > One significant limitation of the PostgreSQL FDW is that it does a 
> > > prepared
> > > statement insert on each row written which imposes a per-row latency.  
> > > This
> > > hits environments where there is significant latency or few latency
> > > guarantees particularly hard, for example, writing to a foreign table that
> > > might be physically located on another continent.  The idea is that
> > > INSERTMETHOD would default to insert and therefore have no changes but
> > > where needed people could specify COPY which would stream the data out.
> > > Updates would still be unaffected.
> > 
> > That has a *lot* of semantics issues, because you suddenly don't get
> > synchronous error reports anymore.  I don't think that's OK on a
> > per-table basis.  If we invented something like this, it IMO should be a
> > per-statement explicit opt in that'd allow streaming.
> 
> Doing some kind of decoration on a per-statement level to do something
> different for FDWs doesn't really seem very clean..

I think it's required.  The semantics of an INSERT statement
*drastically* change if you don't insert remotely right away: constraints
aren't evaluated by the time the command has finished, sequences aren't
advanced until later, there'll be weird interactions with savepoints, ...
Without executing immediately on the remote side it basically isn't a
normal INSERT anymore.

Note bulk INSERT and single row INSERT are very different here.

That's not to say it's not useful to pipeline. To the contrary.


> On reading this, a thought I had was that maybe we should just perform a
> COPY to the FDW when COPY is what's been specified by the user (eg:
> 
> COPY my_foreign_table FROM STDIN;

Right. There'd not even need to be an option since that's already
pipelined.


> ), but that wouldn't help when someone wants to bulk copy data from a
> local table into a foreign table.

That possibly is still doable, just with INSERT, as you don't need (and in
plenty of cases may not even want) to see the effects of the statement until
the CommandCounterIncrement().  So we can delay the flush a bit.


Greetings,

Andres Freund



Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-08-20 Thread Alexander Korotkov
On Wed, Apr 18, 2018 at 10:04 PM Tom Lane  wrote:
> [ re-reads thread... ]  The extra assumption you need in order to have
> trouble is that the blocks in question are dirty in shared buffers and
> have never been written to disk since their rows were deleted.  Then
> the situation is that the page image on disk shows the rows as live,
> while the up-to-date page image in memory correctly shows them as dead.
> Relation truncation throws away the page image in memory without ever
> writing it to disk.  Then, if the subsequent file truncate step fails,
> we have a problem, because anyone who goes looking for that page will
> fetch it afresh from disk and see the tuples as live.
>
> There are WAL entries recording the row deletions, but that doesn't
> help unless we crash and replay the WAL.
>
> It's hard to see a way around this that isn't fairly catastrophic for
> performance :-(.  But in any case it's wrapped up in order-of-operations
> issues.  I've long since forgotten the details, but I seem to have thought
> that there were additional order-of-operations hazards besides this one.

Just for clarification.  Do you mean zeroing of to-be-truncated blocks
to be catastrophic for performance?  Or something else?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Truncation failure in autovacuum results in data corruption (duplicate keys)

2018-08-20 Thread Tom Lane
Alexander Korotkov  writes:
> On Wed, Apr 18, 2018 at 10:04 PM Tom Lane  wrote:
>> It's hard to see a way around this that isn't fairly catastrophic for
>> performance :-(.  But in any case it's wrapped up in order-of-operations
>> issues.  I've long since forgotten the details, but I seem to have thought
>> that there were additional order-of-operations hazards besides this one.

> Just for clarification.  Do you mean zeroing of to-be-truncated blocks
> to be catastrophic for performance?  Or something else?

It would be pretty terrible to have to do that in the normal code path.
The other idea that was in my mind was to force out dirty buffers, then
discard them, then truncate ... but that's awful too if there's a lot
of dirty buffers that we'd have to write only to throw the data away.

I think it's all right to be slow if the truncation fails, though; that
does not seem like a path that has to be fast, only correct.

One thing to be concerned about is that as soon as we've discarded any
page images from buffers, we have to be in a critical section all the way
through till we've either truncated or zeroed those pages on-disk.  Any
failure in that has to result in a PANIC and recover-from-WAL, because
we don't know what state we lost by dropping the buffers.  Ugh.  It's
especially bad if the truncation fails because the file got marked
read-only, because then the zeroing is also going to fail, making that
a guaranteed PANIC case (with no clear path to recovery after the
panic, either ...)

I wonder if it could help to do something like zeroing the buffers in
memory, then truncating, then discarding buffers.  This is just a
half-baked idea and I don't have time to think more right now, but maybe
making two passes over the shared buffers could lead to a better solution.
It's the useless I/O that we need to avoid, IMO.

regards, tom lane



Re: Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Chris Travers
On Mon, Aug 20, 2018 at 4:41 PM Andres Freund  wrote:

> Hi,
>
> On 2018-08-20 16:28:01 +0200, Chris Travers wrote:
> > 1.  INSERTMETHOD=[insert|copy] option on foreign table.
> >
> > One significant limitation of the PostgreSQL FDW is that it does a
> prepared
> > statement insert on each row written which imposes a per-row latency.
> This
> > hits environments where there is significant latency or few latency
> > guarantees particularly hard, for example, writing to a foreign table
> that
> > might be physically located on another continent.  The idea is that
> > INSERTMETHOD would default to insert and therefore have no changes but
> > where needed people could specify COPY which would stream the data out.
> > Updates would still be unaffected.
>
> That has a *lot* of semantics issues, because you suddenly don't get
> synchronous error reports anymore.  I don't think that's OK on a
> per-table basis.  If we invented something like this, it IMO should be a
> per-statement explicit opt in that'd allow streaming.
>

I want to push back a bit and justify the per table decision here.

This is primarily useful when two things are present:
1.  There is significant lag between the servers
2.  There are significant bulk operations on the table.

Only when both are present does it make sense to use this option.  If you
have a low latency connection, don't use it.  If you are only writing a few
rows here and there, don't use it.  If you have a high latency connection
*and* you are inserting large numbers of rows at a time, then this is where
you run into a problem and the current FDW approach is, frankly, unusable
in these cases.  These characteristics follow the server (and hence the
table) in the first case and the table only in the second case.  Therefore
the characteristics of the foreign table determine whether it makes sense
to consider this option.  Doing this on the statement level poses more
problems, in my view, than it fixes.
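
To make the per-table opt-in concrete, the knob could look something like the
sketch below.  The insertmethod option is only the syntax being proposed here,
not something postgres_fdw understands today, and the table and server names
are made up:

CREATE FOREIGN TABLE remote_events (
    id      bigint,
    payload jsonb
) SERVER transatlantic_server
  OPTIONS (table_name 'events', insertmethod 'copy');

-- or enabled later, only for tables that see bulk loads over a slow link:
ALTER FOREIGN TABLE remote_events
    OPTIONS (ADD insertmethod 'copy');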

Moreover, the idea of batching inserts doesn't help either (and actually
would be worse in many cases than the current approach) since the current
approach uses a prepared statement which gets called for each row.  Doing
a batch insert might save you on the per-row round trip but it adds a great
deal of complexity to the overall operation.

Now, a per-statement opt-in poses a number of questions.  Obviously we'd
have to allow this on local tables too, but it would have no effect.  So
suddenly you have a semantic construct which *may* or *may not* pose the
problems you mention and in fact may or may not pose the problem on foreign
tables because whether the construct does anything depends on the FDW
driver.  I don't see what exactly one buys in shifting the problem in this
way given that when this is useful it will be used on virtually all inserts
and when it is not useful it won't be used at all.

Now, the entire problem is the expectation of synchronous errors on a per
row basis.  This is what causes the problem with inserting a million rows
over a transcontinental link. (insert row, check error, insert row, check
error etc).

So on to the question of containing the problems.  I think the best way to
contain this is to have one COPY statement per insert planner node.  And
that makes sure that the basic guarantee that a planner node succeeds or
errors is met.  For sequences, I would expect that supplying a column list
without the relevant sequence value should behave in a tolerable way
(otherwise two concurrent copies would have problems).

So that's the primary defense.  Now let's look at the main alternatives.

1.  You can use a table where you logically replicate inserts, and where
you insert then truncate or delete.  This works fine until you have enough
WAL traffic that you cannot tolerate a replication outage.
2.  You can write an extension which allows you to COPY a current table
into a foreign table.  The problem here is that SPI has a bunch of
protections to keep you from copying to stdout, so when I helped with this
proof of concept we copied to a temp file in a ramdisk and copied from
there, which is frankly quite a bit worse than this approach.

If you want a per statement opt-in, I guess the questions that come to my
mind are:

1.  What would this opt-in look like?
2.  How important is it that we avoid breaking current FDW interfaces to
implement it?
3.  How would you expect it to behave if directed at a local table or a
foreign table on, say, a MySQL db?

>
>
> > 2.  TWOPHASECOMMIT=[off|on] option
>
> > The second major issue that I see with PostgreSQL's foreign database
> > wrappers is the fact that there is no two phase commit which means that a
> > single transaction writing to a group of tables has no expectation that
> all
> > backends will commit or rollback together.  With this patch an option
> would
> > be applied to foreign tables such that they could be set to use two phase
> > commit  When this is done, the first write to each backend would
> regist

Re: A really subtle lexer bug

2018-08-20 Thread Andrew Gierth
> "Andrew" == Andrew Gierth  writes:

 Andrew> select f(a =>-1);  -- ERROR:  column "a" does not exist

 Andrew> I guess the fix is to extend the existing special case code
 Andrew> that checks for one character left after removing trailing [+-]
 Andrew> and also check for the two-character ops "<>" ">=" "<=" "=>"
 Andrew> "!=".

Patch attached.

This fixes two bugs: first the mis-lexing of two-char ops as mentioned
originally; second, the O(N^3) lexing time of strings of - or +
characters is reduced to O(N^2) (in practice it's better than O(N^2)
once N gets large because the bison stack gets blown out, ending the
loop early).

-- 
Andrew (irc:RhodiumToad)

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 0cd782827a..cc855ffb1c 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -885,20 +885,34 @@ other			.
 	 * to forbid operator names like '?-' that could not be
 	 * sequences of SQL operators.
 	 */
-	while (nchars > 1 &&
-		   (yytext[nchars - 1] == '+' ||
-			yytext[nchars - 1] == '-'))
+	if (nchars > 1 &&
+		(yytext[nchars - 1] == '+' ||
+		 yytext[nchars - 1] == '-'))
 	{
 		int			ic;
 
 		for (ic = nchars - 2; ic >= 0; ic--)
 		{
-			if (strchr("~!@#^&|`?%", yytext[ic]))
+			char c = yytext[ic];
+			if (c == '~' || c == '!' || c == '@' ||
+				c == '#' || c == '^' || c == '&' ||
+				c == '|' || c == '`' || c == '?' ||
+				c == '%')
 				break;
 		}
-		if (ic >= 0)
-			break; /* found a char that makes it OK */
-		nchars--; /* else remove the +/-, and check again */
+		if (ic < 0)
+		{
+			/*
+			 * didn't find a qualifying character, so remove
+			 * all trailing [+-]
+			 */
+			for (nchars--;
+				 nchars > 1 &&
+				 (yytext[nchars - 1] == '+' ||
+				  yytext[nchars - 1] == '-');
+				 nchars--)
+				continue;
+		}
 	}
 
 	SET_YYLLOC();
@@ -916,6 +930,25 @@ other			.
 		if (nchars == 1 &&
 			strchr(",()[].;:+-*/%^<>=", yytext[0]))
 			return yytext[0];
+		/*
+		 * Likewise, if what we have left is two chars, and
+		 * those match the tokens ">=", "<=", "=>", "<>" or
+		 * "!=", then we must return the appropriate token
+		 * rather than the generic Op.
+		 */
+		if (nchars == 2)
+		{
+			if (yytext[0] == '=' && yytext[1] == '>')
+				return EQUALS_GREATER;
+			if (yytext[0] == '>' && yytext[1] == '=')
+				return GREATER_EQUALS;
+			if (yytext[0] == '<' && yytext[1] == '=')
+				return LESS_EQUALS;
+			if (yytext[0] == '<' && yytext[1] == '>')
+				return NOT_EQUALS;
+			if (yytext[0] == '!' && yytext[1] == '=')
+				return NOT_EQUALS;
+		}
 	}
 
 	/*


Re: Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Chris Travers
For the record, here's the proof of concept code for the transaction
manager which works off libpq connections.

It is not ready yet by any means.  But it is included for design
discussion.  If the previous patch gets in instead, that's fine, but figure
it is worth including here for discussion purposes.

Two things which are currently missing are a) an ability to specify in the
log file where the cleanup routine is located for a background worker and
b) a separation of backend responsibility for restarting cleanup efforts on
server start.

>
>>
>>
>
> --
> Best Regards,
> Chris Travers
> Head of Database
>
> Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
> Saarbrücker Straße 37a, 10405 Berlin
>
>

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


twophasecommit.h
Description: Binary data


twophasecommit.c
Description: Binary data


Re: Getting NOT NULL constraint from pg_attribute

2018-08-20 Thread Wu Ivy
Hi Tom,

Thanks for the quick response.
Why are SELECT query results never marked nullable? For nullable columns, when I call
SPI_getvalue(), the result (in char*) is NULL. I don't think I'm too clear on
the definition of attnotnull. Can you give me an example in which the tupleTable
can be marked nullable?
Also, is there any other way to get the nullability of each column while getting
the data from SPI_cursor_fetch? The only way I can think of is to call another
separate command to query the table schema, but it will be in a separate
transaction in that case.

Thank you again!
Best,
Ivy
> On Aug 17, 2018, at 6:02 PM, Tom Lane  wrote:
> 
> Wu Ivy  writes:
>> I’m currently building a Postgres C extension that fetch data from a 
>> Postgres table.
>> Since the table can be large, in order to prevent memory overrun, I use 
>> SPI_cursor_fetch to fetch chunks of data. The result rows are saved in 
>> SPITupleTable* SPI_tuptable and attributes are saved in 
>> SPI_tuptable->tupdesc. 
>> In order to process my data, I need to get information of column nullability 
>> (whether column has NOT NULL constrain). I can get this information by 
>> calling:
> 
>>  TupleDesc tupdesc = SPI_tuptable->tupdesc;
>>  bool is_nullable = TupleDescAttr(tupdesc, column_num - 1) -> attnotnull;
>> However, the result (is_nullable) is always 0, meaning the column does not 
>> have NOT NULLl constraint, even for columns that do have the NOT NULL 
>> constraint.
> 
> The output columns of a SELECT query are never marked nullable, regardless
> of what the source data was.
> 
>   regards, tom lane



Re: Getting NOT NULL constraint from pg_attribute

2018-08-20 Thread David G. Johnston
On Monday, August 20, 2018, Wu Ivy  wrote:
>
> Thanks for the quick response.
> Why are SELECT query results never marked nullable? For nullable columns, when I
> call SPI_getvalue(), the result (in char*) is NULL. I don't think I'm too
> clear on the definition of *attnotnull*. Can you give me an example in
> which the tupleTable can be marked nullable?
> Also, is there any other way to get the nullability of each column while
> getting the data from SPI_cursor_fetch? The only way I can think of is to call
> another separate command to query the table schema, but it will be in a
> separate transaction in that case.
>

Basically the nullability property is used by the planner for optimization
during the joining of physical tables.  As soon as you try outputting
columns, the ability to enforce not null goes away because of, in
particular, outer joins.  While some changes could maybe be made, the
cost-benefit to do so doesn't seem favorable.
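
For what it's worth, the declared constraints can still be read straight from
the catalog in the same backend, and therefore the same transaction, as the
cursor fetch.  A minimal sketch, assuming the source table is called my_table:

SELECT a.attname, a.attnotnull
FROM pg_attribute a
WHERE a.attrelid = 'my_table'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped;

Run through SPI_execute() inside the extension, this stays within the
surrounding transaction, so no separate transaction is needed.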

David J.


Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-20 Thread Alvaro Herrera
On 2018-Aug-13, Robert Haas wrote:

> I think this is a somewhat confused analysis.  We don't use
> SnapshotAny for catalog scans, and we never have.  We used to use
> SnapshotNow, and we now use a current MVCC snapshot.  What you're
> talking about, I think, is possibly using the transaction snapshot
> rather than a current MVCC snapshot for the catalog scans.
> 
> I've thought about similar things, but I think there's a pretty deep
> can of worms.  For instance, if you built a relcache entry using the
> transaction snapshot, you might end up building a seemingly-valid
> relcache entry for a relation that has been dropped or rewritten.
> When you try to access the relation data, you'll attempt to access
> a relfilenode that's not there any more.  Similarly, if you use an
> older snapshot to build a partition descriptor, you might think that
> relation OID 12345 is still a partition of that table when in fact
> it's been detached - and, maybe, altered in other ways, such as
> changing column types.

I wonder if this all stems from a misunderstanding of what I suggested
to David offlist.  My suggestion was that the catalog scans would
continue to use the catalog MVCC snapshot, and that the relcache entries
would contain all the partitions that appear to the catalog; but each
partition's entry would carry the Xid of the creating transaction in a
field (say xpart), and that field is compared to the regular transaction
snapshot: if xpart is visible to the transaction snapshot, then the
partition is visible, otherwise not.  So you never try to access a
partition that doesn't exist, because those just don't appear at all in
the relcache entry.  But if you have an old transaction running with an
old snapshot, and the partitioned table just acquired a new partition,
then whether the partition will be returned as part of the partition
descriptor or not depends on the visibility of its entry.

I think that works fine for ATTACH without any further changes.  I'm not
so sure about DETACH, particularly when snapshots persist for a "long
time" (a repeatable-read transaction).  ISTM that in the above design,
the partition descriptor would lose the entry for the detached partition
ahead of time, which means queries would silently fail to see their data
(though they wouldn't crash).  I first thought this could be fixed by
waiting for those snapshots to finish, but then I realized that there's
no actual place where waiting achieves anything.  Certainly it's not
useful to wait before commit (because other snapshots are going to be
starting all the time), and it's not useful to start after the commit
(because by then the catalog tuple is already gone).  Maybe we need two
transactions: mark partition as removed with an xmax of sorts, commit,
wait for all snapshots, start transaction, remove partition catalog
tuple, commit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Improve behavior of concurrent ANALYZE/VACUUM

2018-08-20 Thread Bossart, Nathan
Hi,

Sorry for the delay!  I looked through the latest patch.

On 8/17/18, 1:43 AM, "Michael Paquier"  wrote:
> I have reworked the patch on this side, clarifying the use of the new
> common API for the logs.  One thing I am wondering about is what do we
> want to do when VACUUM ANALYZE is used.  As of HEAD, if vacuum_rel()
> stops, then analyze_rel() is never called, and the only log showing up
> to a non-owner user would be:
> skipping "foo" --- only superuser can vacuum it
>
> With this patch, things become perhaps more confusing by generating two
> WARNING log entries:
> skipping "foo" --- only superuser can vacuum it
> skipping "foo" --- only superuser can analyze it
>
> We could either combine both in a single message, or just generate the
> message for vacuum as HEAD does now.  I have also added some simple
> regression tests triggering the skipping logs for shared catalogs,
> non-shared catalogs and non-owners.  This could be a separate patch as
> well.

I like the idea of emitting a separate WARNING for each for clarity on
what operations are being skipped.  However, I think it could be a bit
confusing with the current wording.  Perhaps something like "skipping
vacuum of..." and "skipping analyze of..." would make things clearer.
Another thing to keep in mind is how often only one of these messages
will apply.  IIUC the vast majority of VACUUM (ANALYZE) statements
that need to emit such log statements would emit both.  Plus, while
VACUUM (ANALYZE) is clearly documented as performing both operations,
I can easily see the argument that users may view it as one big
command and that emitting multiple log entries could be a confusing
change in behavior.

In short, my vote would be to maintain the current behavior for now
and to bring up any logging improvements separately.

+   /*
+* Depending on the permission checks done while building the list
+* of relations to work on, it could be possible that the list is
+* empty, hence do nothing in this case.
+*/
+   if (relations == NIL)
+   return;

It might be better to let the command go through normally so that we
don't miss any cleanup at the end (e.g. deleting vac_context).

+/*
+ * Check if a given relation can be safely vacuumed or not.  If the
+ * user is not the relation owner, issue a WARNING log message and return
+ * false to let the caller decide what to do with this relation.
+ */
+bool
+vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple, int options)
+{
+   Assert((options & (VACOPT_VACUUM | VACOPT_ANALYZE)) != 0);
+
+   /*
+* Check permissions.
+*
+* We allow the user to vacuum a table if he is superuser, the table
+* owner, or the database owner (but in the latter case, only if it's 
not
+* a shared relation).  pg_class_ownercheck includes the superuser case.
+*
+* Note we choose to treat permissions failure as a WARNING and keep
+* trying to vacuum the rest of the DB --- is this appropriate?
+*/

Do you think it's worth adding ANALYZE to these comments as well?

+   if (!(pg_class_ownercheck(relid, GetUserId()) ||
+ (pg_database_ownercheck(MyDatabaseId, GetUserId()) && 
!reltuple->relisshared)))

Returning true right away when the role does have permissions might be
a nice way to save a level of indentation.

+   /*
+* To check whether the relation is a partitioned table and its
+* ownership, fetch its syscache entry.
+*/
+   tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(relid));
+   if (!HeapTupleIsValid(tuple))
+   elog(ERROR, "cache lookup failed for relation %u", 
relid);
+   classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+   /* check permissions of relation */
+   if (!vacuum_is_relation_owner(relid, classForm, options))
+   {
+   ReleaseSysCache(tuple);
+
+   /*
+* Release lock again with AccessShareLock -- see below 
for
+* the reason why this lock is released.
+*/
+   UnlockRelationOid(relid, AccessShareLock);
+   return vacrels;
+   }

I think this actually changes the behavior for partitioned tables.
Presently, we still go through and collect all the partitions in the
vacrels list.  With this change, we will skip collecting a table's
partitions if the current role doesn't have the required permissions.
Perhaps we should skip adding the current relation to vacrels if
vacuum_is_relation_owner() returns false, and then we could go through
the partitions and check permissions on those as well.  Since we don't
take any locks on the individual partitions yet, getting the relation
name and calling pg_class_ownercheck() safely might be tricky, though.

Nathan



Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-08-20 Thread Michael Paquier
On Mon, Aug 20, 2018 at 10:54:28AM -0300, Alvaro Herrera wrote:
> Seems reasonable.
Thanks Álvaro, pushed.
--
Michael


signature.asc
Description: PGP signature


Re: Reopen logfile on SIGHUP

2018-08-20 Thread Kyotaro HORIGUCHI
At Fri, 10 Aug 2018 15:33:26 +0300, Alexander Kuzmenkov 
 wrote in 
<5142559a-82e6-b3e4-d6ed-8fd2d240c...@postgrespro.ru>
> On 08/09/2018 10:33 AM, Kyotaro HORIGUCHI wrote:
> >
> > - Since I'm not sure unlink is signal-handler safe on all
> >supported platforms, I moved unlink() call out of
> >checkLogrotateSignal() to SysLoggerMain, just before rotating
> >log file.
> 
> Which platforms specifically do you have in mind? unlink() is required
> to be async-signal-safe by POSIX.1-2001, which is required by UNIX 03,
> and these are rather old.

I suspect that something bad can happen on Windows. Another
reason for the movement, which I didn't mention, is that it is not necessary
to be there. So I applied the principle that a signal handler
should be as small and simple as possible.  As a result the
flow of logrotate signal handling becomes similar to that for
the promote signal.

> For UNIX 03-certified distributions, see this list:
> http://www.opengroup.org/csq/search/t=XY1.html
> For FreeBSD, unlink() was signal-safe at least in 4.0, which was
> released in 2000
> https://www.freebsd.org/cgi/man.cgi?query=sigaction&apropos=0&sektion=0&manpath=FreeBSD+4.0-RELEASE&arch=default&format=html
> Debian 4.0, which was released in 2007 and had a 2.6 kernel, also
> claims to have a signal-safe unlink():
> https://www.freebsd.org/cgi/man.cgi?query=signal&apropos=0&sektion=0&manpath=Debian+4.0.9&arch=default&format=html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Improve behavior of concurrent ANALYZE/VACUUM

2018-08-20 Thread Michael Paquier
On Mon, Aug 20, 2018 at 08:57:00PM +, Bossart, Nathan wrote:
> Sorry for the delay!  I looked through the latest patch.

Thanks a lot for the review!

> I like the idea of emitting a separate WARNING for each for clarity on
> what operations are being skipped.  However, I think it could be a bit
> confusing with the current wording.  Perhaps something like "skipping
> vacuum of..." and "skipping analyze of..." would make things clearer.
> Another thing to keep in mind is how often only one of these messages
> will apply.  IIUC the vast majority of VACUUM (ANALYZE) statements
> that need to emit such log statements would emit both.  Plus, while
> VACUUM (ANALYZE) is clearly documented as performing both operations,
> I can easily see the argument that users may view it as one big
> command and that emitting multiple log entries could be a confusing
> change in behavior.
> 
> In short, my vote would be to maintain the current behavior for now
> and to bring up any logging improvements separately.

On the other hand, it would be useful for the user to know exactly what
is getting skipped.  For example if VACUUM ANALYZE is used then both
operations would happen, but now the user would only know that VACUUM
has been skipped, and may miss the fact that ANALYZE was not attempted.
Let's do as you suggest at the end, aka if both VACOPT_VACUUM and
VACOPT_ANALYZE are passed down to vacuum_is_relation_owner, then only
the log for VACUUM is generated, which is consistent.  Any other changes
could happen later on if necessary.

> +   /*
> +* Depending on the permission checks done while building the list
> +* of relations to work on, it could be possible that the list is
> +* empty, hence do nothing in this case.
> +*/
> +   if (relations == NIL)
> +   return;
> 
> It might be better to let the command go through normally so that we
> don't miss any cleanup at the end (e.g. deleting vac_context).

Right, that was a bad idea.  Updating datfrozenxid can actually be a
good thing.

> +/*
> + * Check if a given relation can be safely vacuumed or not.  If the
> + * user is not the relation owner, issue a WARNING log message and return
> + * false to let the caller decide what to do with this relation.
> + */
> +bool
> +vacuum_is_relation_owner(Oid relid, Form_pg_class reltuple, int options)
> +{
> +   Assert((options & (VACOPT_VACUUM | VACOPT_ANALYZE)) != 0);
> +
> +   /*
> +* Check permissions.
> +*
> +* We allow the user to vacuum a table if he is superuser, the table
> +* owner, or the database owner (but in the latter case, only if it's 
> not
> +* a shared relation).  pg_class_ownercheck includes the superuser 
> case.
> +*
> +* Note we choose to treat permissions failure as a WARNING and keep
> +* trying to vacuum the rest of the DB --- is this appropriate?
> +*/
> 
> Do you think it's worth adding ANALYZE to these comments as well?

Done.

> +   if (!(pg_class_ownercheck(relid, GetUserId()) ||
> + (pg_database_ownercheck(MyDatabaseId, GetUserId()) && 
> !reltuple->relisshared)))
> 
> Returning true right away when the role does have permissions might be
> a nice way to save a level of indentation.

Done.

> I think this actually changes the behavior for partitioned tables.
> Presently, we still go through and collect all the partitions in the
> vacrels list.  With this change, we will skip collecting a table's
> partitions if the current role doesn't have the required permissions.
> Perhaps we should skip adding the current relation to vacrels if
> vacuum_is_relation_owner() returns false, and then we could go through
> the partitions and check permissions on those as well.  Since we don't
> take any locks on the individual partitions yet, getting the relation
> name and calling pg_class_ownercheck() safely might be tricky, though.

Yes, that's actually intentional on my side as this keeps the logic
simpler, and we avoid risks around deadlocks when working on partitions.
It seems to me that it is also more intuitive to *not* scan a full
partition tree if the user does not have ownership of its root when the
relation is listed, instead of trying to scan all the leaves to find perhaps
some of them.  In most data models it wouldn't matter much anyway, no?
This is also more consistent with what is done for TRUNCATE where the
ACLs of the parent are considered first.  The documentation also
actually mentions that:
"To vacuum a table, one must ordinarily be the table's owner or a
superuser."
Perhaps we could make that clearer for partitions, with something like:
"If listed explicitly, the vacuum of a partitioned table will include
all its partitions if the user is the owner of the partitioned table."
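
As an illustrative sketch of that intended behavior (the table name is made
up), a role owning neither the partitioned table nor its partitions that lists
the table explicitly would simply see the whole tree skipped with one warning:

VACUUM parted_tbl;
-- WARNING:  skipping "parted_tbl" --- only table or database owner can vacuum it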

If we don't want to change the current behavior, then one simple
solution would be close to what you mention, aka skip adding the
partitioned table to the list, include *all* the 

Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-20 Thread Robert Haas
On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
 wrote:
> I wonder if this all stems from a misunderstanding of what I suggested
> to David offlist.  My suggestion was that the catalog scans would
> continue to use the catalog MVCC snapshot, and that the relcache entries
> would contain all the partitions that appear to the catalog; but each
> partition's entry would carry the Xid of the creating transaction in a
> field (say xpart), and that field is compared to the regular transaction
> snapshot: if xpart is visible to the transaction snapshot, then the
> partition is visible, otherwise not.  So you never try to access a
> partition that doesn't exist, because those just don't appear at all in
> the relcache entry.  But if you have an old transaction running with an
> old snapshot, and the partitioned table just acquired a new partition,
> then whether the partition will be returned as part of the partition
> descriptor or not depends on the visibility of its entry.

Hmm.  One question is where you're going to get the XID of the
creating transaction.  If it's taken from the pg_class row or the
pg_inherits row or something of that sort, then you risk getting a
bogus value if something updates that row other than what you expect
-- and the consequences of that are pretty bad here; for this to work
as you intend, you need an exactly-correct value, not newer or older.
An alternative is to add an xid field that stores the value
explicitly, and that might work, but you'll have to arrange for that
value to be frozen at the appropriate time.

A further problem is that there could be multiple changes in quick
succession.  Suppose that a partition is attached, then detached
before the attach operation is all-visible, then reattached, perhaps
with different partition bounds.

> I think that works fine for ATTACH without any further changes.  I'm not
> so sure about DETACH, particularly when snapshots persist for a "long
> time" (a repeatable-read transaction).  ISTM that in the above design,
> the partition descriptor would lose the entry for the detached partition
> ahead of time, which means queries would silently fail to see their data
> (though they wouldn't crash).

I don't see why they wouldn't crash.  If the partition descriptor gets
rebuilt and some partitions disappear out from under you, the old
partition descriptor is going to get freed, and the executor has a
cached pointer to it, so it seems like you are in trouble.

> I first thought this could be fixed by
> waiting for those snapshots to finish, but then I realized that there's
> no actual place where waiting achieves anything.  Certainly it's not
> useful to wait before commit (because other snapshots are going to be
> starting all the time), and it's not useful to start after the commit
> (because by then the catalog tuple is already gone).  Maybe we need two
> transactions: mark partition as removed with an xmax of sorts, commit,
> wait for all snapshots, start transaction, remove partition catalog
> tuple, commit.

And what would that accomplish, exactly?  Waiting for all snapshots
would ensure that all still-running transactions see the xmax with
which the partition has been marked as removed, but what good
does that do?  In order to have a plausible algorithm, you have to
describe both what the ATTACH/DETACH operation does and what the other
concurrent transactions do and how those things interact.  Otherwise,
it's like saying that we're going to solve a problem with X and Y
overlapping by having X take a lock.  If Y doesn't take a conflicting
lock, this does nothing.

Generally, I think I see what you're aiming at: make ATTACH and DETACH
have MVCC-like semantics with respect to concurrent transactions.  I
don't think that's a dumb idea from a theoretical perspective, but in
practice I think it's going to be very difficult to implement.  We
have no other DDL that has such semantics, and there's no reason we
couldn't; for example, TRUNCATE could work with SUEL and transactions
that can't see the TRUNCATE as committed continue to operate on the
old heap.  While we could do such things, we don't.  If you decide to
do them here, you've probably got a lot of work ahead of you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problem while updating a foreign table pointing to a partitioned table on foreign server

2018-08-20 Thread Kyotaro HORIGUCHI
Fujita-san thank you for the comment.

At Tue, 14 Aug 2018 20:49:02 +0900, Etsuro Fujita  
wrote in <5b72c1ae.8010...@lab.ntt.co.jp>
> (2018/08/09 22:04), Etsuro Fujita wrote:
> > (2018/08/08 17:30), Kyotaro HORIGUCHI wrote:
> >> I chose to expand the tuple descriptor for the junk column added to
> >> foreign relations. We might be better off having a new member in
> >> ForeignScan but I didn't so that we can backpatch it.
> >
> > I've not looked at the patch closely yet, but I'm not sure that it's a
> > good idea to expand the tuple descriptor of the target relation on the
> > fly so that it contains the remotetableoid as a non-system attribute
> > of
> > the target table. My concern is: is there not any risk in affecting
> > some
> > other part of the planner and/or the executor? (I was a bit surprised
> > that the patch passes the regression tests successfully.)

Yeah, me too.

> I spent more time looking at the patch.  ISTM that the patch well
> suppresses the effect of the tuple-descriptor expansion by making
> changes to code in the planner and executor (and ruleutils.c), but I'm
> still not sure that the patch is the right direction to go in, because
> ISTM that expanding the tuple descriptor on the fly might be a wart.

The non-Var nodes seem to me the same as PARAM_EXEC, which works
imperfectly for this purpose since tableoid must be in one-to-one
correspondence with a tuple, but unlike joins the
correspondence is stirred up by intermediate executor nodes in some
cases.

The expansion should be safe if the expanded descriptor has the
same definitions for the base columns and all the extended columns are
junk. The junk columns should be ignored by unrelated nodes and
they are passed safely as long as ForeignModify passes tuples as
is from the underlying ForeignScan to ForeignUpdate/Delete.

> >> What the patch does are:
> >>
> >> - This abuses ForeignScan->fdw_scan_tlist to store the additional
> >> junk columns when foreign simple relation scan (that is, not a
> >> join).
> >
> > I think that this issue was introduced in 9.3, which added
> > postgres_fdw
> > in combination with support for writable foreign tables, but
> > fdw_scan_tlist was added to 9.5 as part of join-pushdown
> > infrastructure,
> > so my concern is that we might not be able to backpatch your patch to
> > 9.3 and 9.4.

Right. So I'm thinking that the older versions should just raise an error
for the failure case instead of making it work somehow. Or we might be
able to use the tableoid in the tuple header without exposing the local
OID to users, but I haven't found a way to do that.

> Another concern about this:
> 
> You wrote:
> >Several places seems to be assuming that fdw_scan_tlist may be
> >used foreign scan on simple relation but I didn't find that
> >actually happens.
> 
> Yeah, currently, postgres_fdw and file_fdw don't use that list for
> simple foreign table scans, but it could be used to improve the
> efficiency for those scans, as explained in fdwhandler.sgml:
> 
>  Another ForeignScan field that can be filled
>  by FDWs
>  is fdw_scan_tlist, which describes the
>  tuples returned by
>  the FDW for this plan node.  For simple foreign table scans this can
>  be
>  set to NIL, implying that the returned tuples have
>  the
>  row type declared for the foreign table.  A non-NIL
>  value must be a
>  target list (list of TargetEntrys) containing
>  Vars and/or
>  expressions representing the returned columns.  This might be used,
>  for
>  example, to show that the FDW has omitted some columns that it noticed
>  won't be needed for the query.  Also, if the FDW can compute
>  expressions
>  used by the query more cheaply than can be done locally, it could add
>  those expressions to fdw_scan_tlist. Note
>  that join
>  plans (created from paths made by
>  GetForeignJoinPaths) must
>  always supply fdw_scan_tlist to describe
>  the set of
>  columns they will return.

https://www.postgresql.org/docs/devel/static/fdw-planning.html

Hmm. Thanks for the pointer; it seems to need a rewrite. However,
it doesn't seem to work for non-join foreign scans, since the
core ignores it and uses the local table definition. This "tweak"
wouldn't be needed if that worked.

> You wrote:
> > I'm not sure whether the following ponits are valid.
> >
> > - If fdw_scan_tlist is used for simple relation scans, this would
> >break the case. (ExecInitForeignScan,  set_foreignscan_references)
> 
> Some FDWs might already use that list for the improved efficiency for
> simple foreign table scans as explained above, so we should avoid
> breaking that.

I considered using fdw_scan_tlist in that way, but the core
assumes that foreign scans with scanrelid > 0 use the relation
descriptor. Do you have any example of that?

> If we take the Param-based approach suggested by Tom, I suspect there
> would be no need to worry about at least those things, so I'll try to
> update your patc

Re: Reopen logfile on SIGHUP

2018-08-20 Thread Michael Paquier
On Tue, Aug 21, 2018 at 09:26:54AM +0900, Kyotaro HORIGUCHI wrote:
> I suspect that something bad can happen on Windows.

[troll mode]
More and even worse things than that could happen on Windows.
[/troll mode]
--
Michael


signature.asc
Description: PGP signature


Re: Slotification of partition tuple conversion

2018-08-20 Thread Amit Langote
On 2018/08/20 20:15, Amit Khandekar wrote:
> On 17 August 2018 at 21:47, Amit Khandekar  wrote:
> 
>> Attached is a patch tup_convert.patch that does the conversion
>> directly using input and output tuple slots. This patch is to be
>> applied on an essential patch posted by Amit Langote in [1] that
>> arranges for dedicated per-partition slots rather than changing
>> tupdesc of a single slot. I have rebased that patch from [1] and
>> attached it here (
>> Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch).
> 
> Amit Langote has posted a new version of the
> Allocate-dedicated-slots-of-partition-tuple_amitlan_rebased.patch at
> [1] . Attached is version v2 of the tup_convert.patch, that is rebased
> over that patch.

Thanks.

Here are some comments on the patch:

+ConvertTupleSlot

Might be a good idea to call this ConvertSlotTuple?

+/*
+ * Get the tuple in the per-tuple context. Also, we
cannot call
+ * ExecMaterializeSlot(), otherwise the tuple will get freed
+ * while storing the next tuple.
+ */
+oldcontext =
MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+tuple = ExecCopySlotTuple(slot);
+MemoryContextSwitchTo(oldcontext);

The comment about why we cannot call ExecMaterializeSlot is a bit unclear.
Maybe the comment doesn't need to mention that?  Instead, expand a bit
more on why the context switch is needed here or how it interacts with the
recently added tuple buffering (0d5f05cde01)?

Seeing that all the sites in the partitioning code that previously called
do_convert_tuple() now call ConvertTupleSlot, I wonder if we don't need a
full TupleConversionMap, just the attribute map in it.  We don't need the
input/output Datum and bool arrays in it, because we'd be using the ones
from input and output slots of ConvertTupleSlot.  So, can we replace all
instances of TupleConversionMap in the partitioning code and data
structures with AttributeNumber arrays?

Also, how about putting ConvertTupleSlot in execPartition.c and exporting
it in execPartition.h, instead of tupconvert.c and tupconvert.h, resp.?

Thanks,
Amit




Re: NLS handling fixes.

2018-08-20 Thread Kyotaro HORIGUCHI
At Mon, 20 Aug 2018 13:23:38 +0900, Michael Paquier  wrote 
in <20180820042338.gh7...@paquier.xyz>
> On Fri, Aug 10, 2018 at 04:50:55PM -0400, Tom Lane wrote:
> > I would certainly *not* back-patch the GetConfigOptionByNum change,
> > as that will be a user-visible behavioral change that people will
> > not be expecting.  We might get away with back-patching the other changes,
> > but SHOW ALL output seems like something that people might be expecting
> > not to change drastically in a minor release.
> 
> Agreed, some folks may rely on that.
> 
> > * In the patch fixing view_query_is_auto_updatable misuse, nothing's
> > been done to remove the underlying cause of the errors, which IMO
> > is that the header comment for view_query_is_auto_updatable fails to
> > specify the return convention.  I'd consider adding a comment along
> > the lines of
> > 
> >  * view_query_is_auto_updatable - test whether the specified view definition
> >  * represents an auto-updatable view. Returns NULL (if the view can be 
> > updated)
> >  * or a message string giving the reason that it cannot be.
> > +*
> > +* The returned string has not been translated; if it is shown as an error
> > +* message, the caller should apply _() to translate it.
> 
> That sounds right.  A similar comment should be added for
> view_cols_are_auto_updatable and view_col_is_auto_updatable.
> 
> > * Perhaps pg_GSS_error likewise could use a comment saying the error
> > string must be translated already.  Or we could leave its callers alone
> > and put the _() into it, but either way the API contract ought to be
> > written down.
> 
> Both things are indeed possible.  As pg_SSPI_error applies translation
> beforehand, I have taken the approach to let the caller just apply _().
> For two messages that does not matter much one way or another.
> 
> > * The proposed patch for psql/common.c seems completely wrong, or at
> > least it has a lot of side-effects other than enabling header translation,
> > and I doubt they are desirable.
> 
> I doubted about it, and at closer look I think that you are right, so +1
> for discarding this one.
> 
> Attached is a patch which should address all the issues reported and all
> the remarks done.  What do you think?

Agreed on all of the above, but if we don't need translation in
the header line of \gdesc, gettext_noop for the strings is
useless.

A period is missing in the comment for
view_col_is_auto_updatable.

The attached is the fixed version.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7cedc28c6b..2a12d64aeb 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -10826,7 +10826,7 @@ ATExecSetRelOptions(Relation rel, List *defList, AlterTableType operation,
 ereport(ERROR,
 		(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 		 errmsg("WITH CHECK OPTION is supported only on automatically updatable views"),
-		 errhint("%s", view_updatable_error)));
+		 errhint("%s", _(view_updatable_error))));
 		}
 	}
 
diff --git a/src/backend/commands/view.c b/src/backend/commands/view.c
index 7d4511c585..ffb71c0ea7 100644
--- a/src/backend/commands/view.c
+++ b/src/backend/commands/view.c
@@ -502,7 +502,7 @@ DefineView(ViewStmt *stmt, const char *queryString,
 			ereport(ERROR,
 	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 	 errmsg("WITH CHECK OPTION is supported only on automatically updatable views"),
-	 errhint("%s", view_updatable_error)));
+	 errhint("%s", _(view_updatable_error))));
 	}
 
 	/*
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index cecd104b4a..18c4921d61 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -1037,6 +1037,10 @@ static GSS_DLLIMP gss_OID GSS_C_NT_USER_NAME = &GSS_C_NT_USER_NAME_desc;
 #endif
 
 
+/*
+ * Generate an error for GSSAPI authentication.  The caller needs to apply
+ * _() to errmsg to make it translatable.
+ */
 static void
 pg_GSS_error(int severity, const char *errmsg, OM_uint32 maj_stat, OM_uint32 min_stat)
 {
@@ -1227,7 +1231,7 @@ pg_GSS_recvauth(Port *port)
 		{
 			gss_delete_sec_context(&lmin_s, &port->gss->ctx, GSS_C_NO_BUFFER);
 			pg_GSS_error(ERROR,
-		 gettext_noop("accepting GSS security context failed"),
+		 _("accepting GSS security context failed"),
 		 maj_stat, min_stat);
 		}
 
@@ -1253,7 +1257,7 @@ pg_GSS_recvauth(Port *port)
 	maj_stat = gss_display_name(&min_stat, port->gss->name, &gbuf, NULL);
 	if (maj_stat != GSS_S_COMPLETE)
 		pg_GSS_error(ERROR,
-	 gettext_noop("retrieving GSS user name failed"),
+	 _("retrieving GSS user name failed"),
 	 maj_stat, min_stat);
 
 	/*
@@ -1317,6 +1321,11 @@ pg_GSS_recvauth(Port *port)
  *
  */
 #ifdef ENABLE_SSPI
+
+/*
+ * Generate an error for SSPI authentication.  The caller needs to apply
+ * _() to errmsg to make it translatable.
+ */
 static void

A typo in guc.c

2018-08-20 Thread Kyotaro HORIGUCHI
Hello, I found a typo in guc.c.

>  {"enable_parallel_hash", PGC_USERSET, QUERY_TUNING_METHOD,
> -  gettext_noop("Enables the planner's user of parallel hash plans."),
> +  gettext_noop("Enables the planner's use of parallel hash plans."),

The "user" should be "use".

As you (who?) know, this applies only to 11 and dev.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 9989d3a351..2a874dc78d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -962,7 +962,7 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 	{
 		{"enable_parallel_hash", PGC_USERSET, QUERY_TUNING_METHOD,
-			gettext_noop("Enables the planner's user of parallel hash plans."),
+			gettext_noop("Enables the planner's use of parallel hash plans."),
 			NULL
 		},
 		&enable_parallel_hash,


Re: [HACKERS] Optional message to user when terminating/cancelling backend

2018-08-20 Thread Thomas Munro
On Wed, Jul 25, 2018 at 7:27 PM, Daniel Gustafsson  wrote:
>> On 24 Jul 2018, at 22:57, Daniel Gustafsson  wrote:
>>
>>> On 6 Jul 2018, at 02:18, Thomas Munro  wrote:
>>>
>>> On Fri, Jun 15, 2018 at 9:56 AM, Daniel Gustafsson  wrote:
 attached
>>>
>>> Hi Daniel,
>>>
>>> 6118  --select pg_cancel_backend(pg_backend_pid(), 'it brings on many 
>>> changes');
>>> 6119  select pg_cancel_backend(pg_backend_pid(), NULL);
>>> 6120! ERROR:  canceling statement due to user request
>>> 6121--- 25,32 
>>> 6122
>>> 6123  --select pg_cancel_backend(pg_backend_pid(), 'it brings on many 
>>> changes');
>>> 6124  select pg_cancel_backend(pg_backend_pid(), NULL);
>>> 6125!  pg_cancel_backend
>>> 6126! ---
>>> 6127!  t
>>>
>>> Apparently Windows can take or leave it as it pleases.
>>
>> Well played =)
>>
>>> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.4488
>>
>> That reads to me like it’s cancelling another backend than the current one,
>> which clearly isn’t right as we’re passing pg_backend_pid().  I can’t really
>> see what Windows specific bug was introduced by this patch though (or well, 
>> the
>> bug exhibits itself on Windows but it may well be generic of course).
>>
>> Will continue to hunt.
>
> Seems the updated patch built and tested OK.  Still have no idea
> why the previous one didn’t.

That problem apparently didn't go away.  cfbot tested it 7 times in
the past week, and it passed only once on Windows:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.9691

The other times all failed like this:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.9833

-- 
Thomas Munro
http://www.enterprisedb.com



Re: A typo in guc.c

2018-08-20 Thread Michael Paquier
On Tue, Aug 21, 2018 at 11:58:41AM +0900, Kyotaro HORIGUCHI wrote:
> The "user" should be "use".
> 
> As you(who?) know, this applies only 11 and dev.

Thanks, applied.
--
Michael


signature.asc
Description: PGP signature


Re: Fix help option of contrib/oid2name

2018-08-20 Thread Tatsuro Yamada

On 2018/08/20 17:38, Michael Paquier wrote:

On Mon, Aug 20, 2018 at 03:51:07PM +0900, Tatsuro Yamada wrote:

On 2018/08/20 13:54, Michael Paquier wrote:
Therefore, "-P" is a manual bag. I investigated more using git log command and
understood followings:

   1. -P option was removed on 4192f2d85
   2. -P option revived in only the manual on 2963d8228


Bruce did that to keep a trace of it in the docs, let's nuke it then, we
don't handle it and the documentation is mentioning the thing as
deprecated since 2010.


Yep, right.
I see, and will remove the -P option's explanation from the manual.



For now, I'm not sure about TAP test but I knew both utilities have no
regression tests. It would be better to use something tests.


If you don't add them, I think that I would just that myself, just to
have some basics.  Not that it is a barrier for this patch anyway.


Hmm...
As long as I've come this far, I'll see it through.


BTW, can I add the patch to the September commitfest?
The patch includes improvements and a bug fix as you know, so I can divide the
patch into 2 patches for that.

Which one is better to create, 2 patches or 1 patch?
I summed up the fixes for oid2name and vacuumlo so far; the next patch will include
the following:

oid2name
  bug fix
- Remove "-P password" option from the document

  improvements
- Divide "Options section" into "Options" and "Connection Options" sections 
in
  the help message (--help)
- Add long options only for Connection options (-d, -H, -h, -p and -U)
- Add "-H host" and "-h host" options to the help message and the document
- Add environment variable (PGHOST, PGPORT and PGUSER) to the document
- Add TAP tests for checking command-line options

vacuumlo
  improvements
- Add long options only for Connection options (-h, -p and -U)
- Add environment variable (PGHOST, PGPORT and PGUSER) to the document
- Add TAP tests for checking command-line options


Thanks,
Tatsuro Yamada




Re: Fix help option of contrib/oid2name

2018-08-20 Thread Michael Paquier
On Tue, Aug 21, 2018 at 12:26:15PM +0900, Tatsuro Yamada wrote:
> BTW, can I add the patch to the September commitfest?

You should.

> The patch includes improvements and a bug fix as you know, so I can divide the
> patch into 2 patches for that.

I am not really seeing any bug fix, but if you think so then splitting
into two patches would help in making the difference for sure.
--
Michael


signature.asc
Description: PGP signature


Re: Fix help option of contrib/oid2name

2018-08-20 Thread Tatsuro Yamada

On 2018/08/21 12:40, Michael Paquier wrote:

On Tue, Aug 21, 2018 at 12:26:15PM +0900, Tatsuro Yamada wrote:

BTW, can I add the patch to the September commitfest?


You should.


Thanks, I'll do that.



The patch includes improvements and a bug fix as you know, so I can divide the
patch into 2 patches for that.


I am not really seeing any bug fix, but if you think so then splitting
into two patches would help in making the difference for sure.


Yep, I meant the bug fix is below:
   oid2name
- Remove "-P password" option from the document
As I wrote before, the "-P" option does not work because there is no code to
handle that option. So, it will be removed in the next patch.

I'll send 2 patches this week, probably.

Regards,
Tatsuro Yamada




Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-08-20 Thread David Rowley
On 3 August 2018 at 17:58, Amit Langote  wrote:
> On 2018/07/31 16:03, David Rowley wrote:
>> Maybe we can do that as a follow-on patch.
>
> We probably could, but I think it would be a good idea to get rid of *all*
> redundant allocations due to tuple routing in one patch, if that's the
> mission of this thread and the patch anyway.

I started looking at this patch today and I now agree that it should
be included in the main patch.

I changed a few things with the patch. For example, the map access
macros you'd defined were not in CamelCase. I also fixed a bug where
the child to parent map was not being initialised when on conflict
transition capture was required. I added a test which was crashing the
backend but fixed the code so it works correctly. I also got rid of
the child_parent_map_not_required array since we now no longer need
it.  The code now always initialises the maps in cases where they're
going to be required.

I've attached a v3 version of your patch and also v6 of the main patch
which includes the v3 patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v3_Refactor-handling-of-child_parent_tupconv_maps.patch
Description: Binary data


v6-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
Description: Binary data


Re: A really subtle lexer bug

2018-08-20 Thread Andrew Gierth
> "Andrew" == Andrew Gierth  writes:

 Andrew> Patch attached.

 Andrew> This fixes two bugs:

I'm going to split this into two patches, since the O(N^3) fix can be
backpatched further than the operator token fix; the latter bug was
introduced in 9.5 when operator precedences were corrected, while the
former has been there forever.

(But don't wait for me to post those before commenting on either issue)

-- 
Andrew (irc:RhodiumToad)



Re: ATTACH/DETACH PARTITION CONCURRENTLY

2018-08-20 Thread David Rowley
On 21 August 2018 at 13:59, Robert Haas  wrote:
> On Mon, Aug 20, 2018 at 4:21 PM, Alvaro Herrera
>  wrote:
>> I wonder if this all stems from a misunderstanding of what I suggested
>> to David offlist.  My suggestion was that the catalog scans would
>> continue to use the catalog MVCC snapshot, and that the relcache entries
>> would contain all the partitions that appear to the catalog; but each
>> partition's entry would carry the Xid of the creating transaction in a
>> field (say xpart), and that field is compared to the regular transaction
>> snapshot: if xpart is visible to the transaction snapshot, then the
>> partition is visible, otherwise not.  So you never try to access a
>> partition that doesn't exist, because those just don't appear at all in
>> the relcache entry.  But if you have an old transaction running with an
>> old snapshot, and the partitioned table just acquired a new partition,
>> then whether the partition will be returned as part of the partition
>> descriptor or not depends on the visibility of its entry.
>
> Hmm.  One question is where you're going to get the XID of the
> creating transaction.  If it's taken from the pg_class row or the
> pg_inherits row or something of that sort, then you risk getting a
> bogus value if something updates that row other than what you expect
> -- and the consequences of that are pretty bad here; for this to work
> as you intend, you need an exactly-correct value, not newer or older.
> An alternative is to add an xid field that stores the value
> explicitly, and that might work, but you'll have to arrange for that
> value to be frozen at the appropriate time.
>
> A further problem is that there could be multiple changes in quick
> succession.  Suppose that a partition is attached, then detached
> before the attach operation is all-visible, then reattached, perhaps
> with different partition bounds.

I should probably post the WIP I have here.  In those, I do have the
xmin array in the PartitionDesc. This gets taken from the new
pg_partition table, which I don't think suffers from the same issue as
taking it from pg_class, since nothing else will update the
pg_partition record.

However, I don't think the xmin array is going to work if we include
it in the PartitionDesc.  The problem is, as I discovered from writing
the code was that the PartitionDesc must remain exactly the same
between planning an execution. If there are any more or any fewer
partitions found during execution than what we saw in planning then
run-time pruning will access the wrong element in the
PartitionPruneInfo array, or perhaps access of the end of the array.
It might be possible to work around that by identifying partitions by
Oid rather than PartitionDesc array index, but the run-time pruning
code is already pretty complex. I think coding it to work when the
PartitionDesc does not match between planning and execution is just
going to too difficult to get right.  Tom is already unhappy with the
complexity of ExecFindInitialMatchingSubPlans().

I think the solution will require that the PartitionDesc does not:

a) Change between planning and execution.
b) Change during a snapshot after the partitioned table has been locked.

With b, it sounds like we'll need to take the most recent
PartitionDesc even if the transaction is older than the one that did
the ATTACH/DETACH operation, since if we use an old version then, as
Robert mentions, there's nothing to stop another transaction making
changes to the table that make it an incompatible partition, e.g. DROP
COLUMN.  This wouldn't be possible if we updated the PartitionDesc
right after taking the first lock on the partitioned table, since any
transaction doing DROP COLUMN would be blocked until the other
snapshot gets released.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: NLS handling fixes.

2018-08-20 Thread Michael Paquier
On Tue, Aug 21, 2018 at 11:49:05AM +0900, Kyotaro HORIGUCHI wrote:
> Agreed on all of the above, but if we don't need translation in
> the header line of \gdesc, gettext_noop for the strings is
> useless.

I have let that one alone, as all columns showing up in psql have the
same, consistent way of handling things.
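
For reference, the convention is just the standard gettext split
between marking and lookup; a minimal sketch (not the actual psql
code, header names invented for illustration):

#include "postgres_fe.h"

/*
 * gettext_noop() only marks a string for extraction into the message
 * catalog; the translation lookup happens later, via _(), at the point
 * where the header is printed.
 */
static const char *const example_headers[] = {
    gettext_noop("Column"),
    gettext_noop("Type"),
    NULL
};

static void
print_headers(void)
{
    int     i;

    for (i = 0; example_headers[i] != NULL; i++)
        printf("%s\n", _(example_headers[i]));
}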

> A period is missing in the comment for view_col_is_auto_updatable.

Fixed this one, and pushed.  Everything is back-patched where it
applies, except for the GetConfigOptionByNum part, which goes only to
HEAD, as it could be surprising after a minor upgrade, as Tom has
suggested.
--
Michael


signature.asc
Description: PGP signature


Re: Two proposed modifications to the PostgreSQL FDW

2018-08-20 Thread Masahiko Sawada
On Tue, Aug 21, 2018 at 1:47 AM Chris Travers  wrote:
>
>
>
> On Mon, Aug 20, 2018 at 4:41 PM Andres Freund  wrote:
>>
>> Hi,
>>
>> On 2018-08-20 16:28:01 +0200, Chris Travers wrote:
>> > 2.  TWOPHASECOMMIT=[off|on] option
>>
>> > The second major issue that I see with PostgreSQL's foreign database
>> > wrappers is the fact that there is no two phase commit which means that a
>> > single transaction writing to a group of tables has no expectation that all
>> > backends will commit or rollback together.  With this patch an option would
>> > be applied to foreign tables such that they could be set to use two phase
>> > commit.  When this is done, the first write to each backend would register a
>> > connection with a global transaction handler and a pre-commit and commit
>> > hooks would be set up to properly process these.
>> >
>> > On precommit a per-global-transaction file would be opened in the data
>> > directory and prepare statements logged to the file.  On error, we simply
>> > roll back our local transaction.
>> >
>> > On the commit hook, we go through and start to commit the remote global
>> > transactions.  At this point we make a best effort but track whether or not
>> > we were successful on all.  If successful on all, we delete the file.  If
>> > unsuccessful, we fire a background worker which re-reads the file and is
>> > responsible for cleanup.  If global transactions persist, a SQL
>> > administration function will be made available to restart the cleanup
>> > process.  On rollback, we do as with commit but we roll back all transactions
>> > in the set.  The file has enough information to determine whether we should
>> > be committing or rolling back on cleanup.
>> >
>> > I would like to push these both for Pg 12.  Is there any feedback on the
>> > concepts and the problems first?
>>

Thank you for the proposal. I agree that it's a major problem that
postgres_fdw (or PostgreSQL core API) doesn't support two-phase
commit.

>> There's been *substantial* work on this. You should at least read the
>> discussion & coordinate with the relevant developers.
>
>
> I suppose I should forward this to them directly also.
>
> Yeah.  Also, I wrote the transaction manager code for this while helping
> with a proof of concept for this copy-to-remote extension.
>
> There are a few big differences in implementation with the patches you 
> mention and the disagreement was part of why I thought about going this 
> direction.
>
> First, discussion of differences in implementation:
>
> 1.  I treat the local and remote transactions symmetrically and I make no 
> assumptions about what might happen between prepare and an attempted local 
> commit.
>prepare goes into the precommit hook
>commit goes into the commit hook and we never raise errors if it fails 
> (because you cannot rollback at that point).  Instead a warning is raised and 
> cleanup commences.
>rollback goes into the rollback hook and we never raise errors if it fails 
> (because you are already rolling back).
>
> 2.  By treating this as a property of a table rather than a property of a 
> foreign data wrapper or a server, we can better prevent prepared transactions 
> where they have not been enabled.
>This also ensures that we know whether we are guaranteeing two phase 
> commit or not by looking at the table.
>
> 3.  By making this opt-in, it avoids a lot of problems with regard to
> incorrect configuration, etc., since if the DBA says "use two phase commit"
> but failed to enable prepared transactions on the other side...
>
> On to failure modes:
>  1.  It's possible that under high load too many foreign transactions are
> prepared and things start rolling back instead of committing.  Oh well
>  2.  In the event that a foreign server goes away between prepare and commit, 
> we continue to retry via the background worker.  The background worker is 
> very pessimistic and checks every remote system for the named transaction.

If some participant servers fail during COMMIT PREPARED, will the
client get a "committed" or an "aborted"? If the client gets
"aborted", that's not correct, because the local changes are already
committed at that point. On the other hand, if the client gets
"committed", it might break the current user semantics, because
subsequent reads may not be able to see their own committed writes.
Also, since we don't want to wait for COMMIT PREPARED to complete, we
need to consider that users could cancel the query at any time. To not
break the current semantics we cannot raise an error during the 2nd
phase of two-phase commit, but that's not realistic, because even
palloc() can raise an error.
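
For readers following along, the SQL-level building blocks under
discussion are just PREPARE TRANSACTION and COMMIT PREPARED issued on
each participant (which requires max_prepared_transactions > 0 there).
A minimal client-side illustration with libpq, not taken from either
patch; the connection string, GID and table name are made up:

#include <stdio.h>
#include "libpq-fe.h"

static int
exec_or_report(PGconn *conn, const char *sql)
{
    PGresult   *res = PQexec(conn, sql);
    int         ok = (PQresultStatus(res) == PGRES_COMMAND_OK);

    if (!ok)
        fprintf(stderr, "%s failed: %s", sql, PQerrorMessage(conn));
    PQclear(res);
    return ok;
}

int
main(void)
{
    PGconn     *conn = PQconnectdb("host=remote1 dbname=postgres");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Remote work happens inside an ordinary transaction block. */
    exec_or_report(conn, "BEGIN");
    exec_or_report(conn, "INSERT INTO some_table VALUES (1)");

    /* Phase 1: after this the transaction survives disconnects and crashes. */
    if (!exec_or_report(conn, "PREPARE TRANSACTION 'demo_gid_1'"))
    {
        exec_or_report(conn, "ROLLBACK");
        PQfinish(conn);
        return 1;
    }

    /*
     * Phase 2: if this fails we can only retry (or hand off to a cleanup
     * worker); rolling back is no longer an option once any participant
     * has committed, which is exactly the window discussed above.
     */
    exec_or_report(conn, "COMMIT PREPARED 'demo_gid_1'");

    PQfinish(conn);
    return 0;
}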

The design the patch chose is to make backends do only PREPARE and wait
for the background worker to complete COMMIT PREPARED. In this design
the clients get a "committed" only either when the commit succeeds on
all participants or when they cancel the query explicitly. In other
words, the client will wait for completion of 2nd phase of two-phase

Re: Pluggable Storage - Andres's take

2018-08-20 Thread Haribabu Kommi
On Sun, Aug 5, 2018 at 7:48 PM Andres Freund  wrote:

> Hi,
>
> I'm currently in the process of rebasing zheap onto the pluggable
> storage work. The goal, which seems to work surprisingly well, is to
> find issues that the current pluggable storage patch doesn't yet deal
> with.  I plan to push a tree including a lot of fixes and improvements
> soon.
>

Sorry for coming late to this thread.

That's good. Did you find any problems in porting zheap onto pluggable
storage? Does it need any API changes or new API requirements?


> On 2018-08-03 12:35:50 +1000, Haribabu Kommi wrote:
> > While investigating the crash, I observed that it is due to the lot of
> > FIXMEs in the code. So I just made minimal changes and am looking into
> > correcting the FIXMEs first.
> >
> > One thing I observed is that the lack of a relation pointer is leading to
> > a crash in the flow of EvalPlan* functions, because not all ROW_MARK types
> > contain a relation pointer.
> >
> > I will continue to check all the FIXME fixes.
>
> Thanks.
>

I fixed some of the isolation test problems. All the issues are related
to EPQ slot handling. Still more needs to be fixed.

Do the new TupleTableSlot abstraction patches in the recent thread [1]
fix any of these issues? If so, I can look into changing the FDW API to
return a slot instead of a tuple.


> > > - COPY's multi_insert path should probably deal with a bunch of slots,
> > >   rather than forming HeapTuples
> > >
> >
> > Implemented supporting of slots in the copy multi insert path.
>
> Cool. I've not yet looked at it, but I plan to do so soon.  Will have to
> rebase over the other copy changes first :(
>

OK, understood. There are many changes in the COPY flow that conflict
with my changes. Please let me know once you have done the rebase; I
can then fix those conflicts and regenerate the patch.
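
As background for the multi-insert-with-slots work mentioned above,
this is roughly the buffering pattern, sketched independently of the
attached patch; the struct, function and batch size below are made-up
names, and the real code is certainly shaped differently:

#include "postgres.h"
#include "executor/tuptable.h"

#define SKETCH_BATCH_SLOTS 1000     /* made-up batch size */

typedef struct SlotBuffer
{
    TupleTableSlot *slots[SKETCH_BATCH_SLOTS];
    int             nused;
} SlotBuffer;

/*
 * Store one parsed row into the next free slot as a virtual tuple,
 * instead of forming a HeapTuple up front.  The caller is expected to
 * flush the buffer (and reset nused) once nused reaches
 * SKETCH_BATCH_SLOTS.
 */
static void
buffer_one_row(SlotBuffer *buf, TupleDesc tupdesc,
               Datum *values, bool *nulls, int natts)
{
    TupleTableSlot *slot;

    if (buf->slots[buf->nused] == NULL)
        buf->slots[buf->nused] = MakeSingleTupleTableSlot(tupdesc);

    slot = buf->slots[buf->nused];
    ExecClearTuple(slot);

    /* Copy the datums/null flags and mark the slot as a virtual tuple. */
    memcpy(slot->tts_values, values, natts * sizeof(Datum));
    memcpy(slot->tts_isnull, nulls, natts * sizeof(bool));
    ExecStoreVirtualTuple(slot);

    buf->nused++;
}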

Attached is the patch with further fixes.

[1] -
https://www.postgresql.org/message-id/CAFjFpRcNPQ1oOL41-HQYaEF%3DNq6Vbg0eHeFgopJhHw_X2usA5w%40mail.gmail.com

Regards,
Haribabu Kommi
Fujitsu Australia


0001-isolation-test-fixes-2.patch
Description: Binary data