Unusable index

2018-07-03 Thread Konstantin Knizhnik

Hi hackers,

HOT experts' help is highly needed :)
One of our customers is faced with the following problem:

Session 1:
create table x (i int);
begin;
insert into x values(1);
...


Session 2:
select i as id, 0 as v into t from generate_series(1, 10) i;
create unique index idx on t (id);
explain analyze select v from t where id = 1;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Index Scan using idx on t  (cost=0.29..8.31 rows=1 width=4) (actual time=0.062..0.064 rows=1 loops=1)
   Index Cond: (id = 1)
-- Everything is Ok here
-- Now some magic
update t set v = v + 1 where id = 1;
update t set v = v + 1 where id = 1;
update t set v = v + 1 where id = 1;
update t set v = v + 1 where id = 1;
update t set v = v + 1 where id = 1;
drop index idx;
create unique index idx on t (id);
explain analyze select v from t where id = 1;
explain analyze select v from t where id = 1;
                                          QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..1693.00 rows=1 width=4) (actual time=27.557..27.558 rows=1 loops=1)

Now no backend can use this index until the transaction in Session 1 completes.


According to README.HOT, this is more or less expected behavior:

-

Practically, we prevent certain transactions from using the new index by
setting pg_index.indcheckxmin to TRUE.  Transactions are allowed to use
such an index only after pg_index.xmin is below their TransactionXmin
horizon, thereby ensuring that any incompatible rows in HOT chains are
dead to them. (pg_index.xmin will be the XID of the CREATE INDEX
transaction.  The reason for using xmin rather than a normal column is
that the regular vacuum freezing mechanism will take care of converting
xmin to FrozenTransactionId before it can wrap around.)

This means in particular that the transaction creating the index will be
unable to use the index if the transaction has old snapshots.  We
alleviate that problem somewhat by not setting indcheckxmin unless the
table actually contains HOT chains with RECENTLY_DEAD members.

-

But a few notes:

1. Not only is the transaction that created the index unable to use it; any
other transaction will be unable to use it as well.
2. It happens even if Session 1 and Session 2 work with different databases,
which can really confuse users!
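
As a side note, a quick way to see which backend is holding back the horizon
that gates the index (just a diagnostic sketch, not part of the reproduction)
is to look at backend_xmin in pg_stat_activity:

-- Backends ordered by how far back they hold the xmin horizon; the open
-- transaction from Session 1 shows up here regardless of its database.
select datname, pid, state, backend_xmin
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc;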


OK, isolation of databases in Postgres is a separate topic.
I want to clarify why this index is disabled for any new transaction.

The code disabling use of the index is the following:

            /*
             * If the index is valid, but cannot yet be used, ignore it; but
             * mark the plan we are generating as transient. See
             * src/backend/access/heap/README.HOT for discussion.
             */
            if (index->indcheckxmin &&
                !TransactionIdPrecedes(HeapTupleHeaderGetXmin(indexRelation->rd_indextuple->t_data),
                                       TransactionXmin))
            {
                root->glob->transientPlan = true;
                index_close(indexRelation, NoLock);
                continue;
            }
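
For reference, a small diagnostic query (a sketch, not taken from the server
code) that lists indexes still gated this way; pg_index.indcheckxmin is the
flag tested above, and the pg_index row's xmin system column holds the XID of
the CREATE INDEX transaction:

-- Indexes whose use is still deferred because indcheckxmin is set.
select c.relname as index_name,
       i.indcheckxmin,
       i.xmin as index_creation_xid
from pg_index i
join pg_class c on c.oid = i.indexrelid
where i.indcheckxmin;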

And "indcheckmin" is set here:

    /*
     * If we found any potentially broken HOT chains, mark the index as not
     * being usable until the current transaction is below the event horizon.
     * See src/backend/access/heap/README.HOT for discussion.  Also set this
     * if early pruning/vacuuming is enabled for the heap relation.  While it
     * might become safe to use the index earlier based on actual cleanup
     * activity and other active transactions, the test for that would be much
     * more complex and would require some form of blocking, so keep it simple
     * and fast by just using the current transaction.
     *
     * However, when reindexing an existing index, we should do nothing here.
     * Any HOT chains that are broken with respect to the index must predate
     * the index's original creation, so there is no need to change the
     * index's usability horizon.  Moreover, we *must not* try to change the
     * index's pg_index entry while reindexing pg_index itself, and this
     * optimization nicely prevents that.  The more complex rules needed for a
     * reindex are handled separately after this function returns.
     *
     * We also need not set indcheckxmin during a concurrent index build,
     * because we won't set indisvalid true until all transactions that care
     * about the broken HOT chains or early pruning/vacuuming are gone.
     *
     * Therefore, this code path can only be taken during non-concurrent
     * CREATE INDEX.  Thus the fact that heap_update will set the pg_index
     * tuple's xmin doesn't matter, because that tuple was cre

Re: pgsql: Clarify use of temporary tables within partition trees

2018-07-03 Thread Michael Paquier
On Tue, Jul 03, 2018 at 03:49:44PM +0900, Amit Langote wrote:
> I forgot that expand_partitioned_rtentry() will recursively call itself if
> a partition is itself a partitioned table, in which case the above
> code helps.

Actually, looking at the coverage report:
https://coverage.postgresql.org/src/backend/optimizer/prep/prepunion.c.gcov.html
1742              :     /*
1743              :      * If the partitioned table has no partitions or all the partitions are
1744              :      * temporary tables from other backends, treat this as non-inheritance
1745              :      * case.
1746              :      */
1747         4920 :     if (!has_child)
1748            0 :         parentrte->inh = false;
1749         4920 : }

expand_partitioned_rtentry() never disables this flag on recursive calls
with a multi-level tree.  Could it be possible to get a test which
closes the gap?
--
Michael




Pluggable Storage - Andres's take

2018-07-03 Thread Andres Freund
Hi,

As I've previously mentioned I had planned to spend some time to polish
Haribabu's version of the pluggable storage patch and rebase it on the
vtable based slot approach from [1]. While doing so I found more and
more things that I previously hadn't noticed. I started rewriting things
into something closer to what I think we want architecturally.

The current state of my version of the patch is *NOT* ready for proper
review (it doesn't even pass all tests, there's FIXME / elog()s).  But I
think it's getting close enough to its eventual shape that more eyes,
and potentially more hands on keyboards, can be useful.

The most fundamental issues I had with Haribabu's last version from [2]
are the following:

- The use of TableTuple, a typedef from void *, is bad from multiple
  fronts. For one it reduces just about all type safety. There were
  numerous bugs in the patch where things were just cast from HeapTuple
  to TableTuple to HeapTuple (and even to TupleTableSlot).  I think it's
  a really, really bad idea to introduce a vague type like this for
  development purposes alone, it makes it way too hard to refactor -
  essentially throwing the biggest benefit of type safe languages out of
  the window.

  Additionally I think it's also the wrong approach architecturally. We
  shouldn't assume that a tuple can efficiently be represented as a
  single palloc'ed chunk. In fact, we should move *away* from relying on
  that so much.

  I've thus removed the TableTuple type entirely.


- Previous versions of the patchset exposed Buffers in the tableam.h API,
  performed buffer locking / pinning / ExecStoreTuple() calls outside of
  it.  That is wrong in my opinion, as various AMs will deal very
  differently with buffer pinning & locking. The relevant logic is
  largely moved within the AM.  Bringing me to the next point:


- tableam exposed various operations based on HeapTuple/TableTuple's
  (and their Buffers). This all needs to be slot based, as we can't
  represent the way each AM will deal with this.  I've largely converted
  the API to be slot based.  That has some fallout, but I think largely
  works.  Lots of outdated comments.


- I think the move of the indexing from outside the table layer into the
  storage layer isn't a good idea. It lead to having to pass EState into
  the tableam, a callback API to perform index updates, etc.  This seems
  to have at least partially been triggered by the speculative insertion
  codepaths.  I've reverted this part of the changes.  The speculative
  insertion / confirm codepaths are now exposed to tableam.h - I think
  that's the right thing because we'll likely want to have that
  functionality across more than a single tuple in the future.


- The visibility functions relied on the *caller* performing buffer
  locking. That's not a great idea, because generic code shouldn't know
  about the locking scheme a particular AM needs.  I've changed the
  external visibility functions to instead take a slot, and perform the
  necessary locking inside.


- There were numerous tableam callback uses inside heapam.c - that makes
  no sense, we know what the storage is therein.  The relevant


- The integration between index lookups and heap lookups based on the
  results of an index lookup was IMO too tight.  The index code dealt
  with heap tuples, which isn't great.  I've introduced a new concept, a
  'IndexFetchTableData' scan. It's initialized when building an index
  scan, and provides the necessary state (say current heap buffer), to
  do table lookups from within a heap.


- The am of relations required for bootstrapping was set to 0 - I don't
  think that's a good idea. I changed it so it's set to the heap AM as
  well.


- HOT was encoded in the API in a bunch of places. That doesn't look
  right to me. I tried to improve a bit on that, but I'm not yet quite
  sure I like it. Needs written explanation & arguments...


- the heap tableam did a heap_copytuple() nearly everywhere. Leading to
  a higher memory usage, because the resulting tuples weren't freed or
  anything. There might be a reason for doing such a change - we've
  certainly discussed that before - but I'm *vehemently* against doing
  that at the same time we introduce pluggable storage. Analyzing the
  performance effects will be hard enough without changes like this.


- I've for now backed out the heap rewrite changes, partially.  Mostly
  because I didn't like the way the abstraction looks, but haven't quite
  figured out how it should look like.


- I did not like that speculative tokens were moved to slots. There's
  really no reason for them to live outside parameters to tableam.h
  functions.


- lotsa additional smaller changes.


- lotsa new bugs


My current working state is at [3] (urls to clone repo are at [4]).
This is *HEAVILY WIP*. I plan to continue working on it over the next
days, but I'll temporarily focus onto v11 work.  If others want I could
move repo to github and grant others write 

Re: Should contrib modules install .h files?

2018-07-03 Thread Peter Eisentraut
On 02.07.18 15:26, Tom Lane wrote:
> FWIW, I agree with Andres' thought that each contrib module should have
> its own subdirectory under $(includedir_server).  Otherwise we're going
> to be faced with questions about whether .h files need to be renamed
> because they're not globally unique enough.

Then they perhaps should be renamed.  That seems like a much simpler
solution.

The use case being discussed here is installing a data type extension's
header so you can write a transform for it.  The extension's name as
well as the data type's own name already have to be pretty much globally
unique if you want it to be useful.  So it doesn't seem very difficult
to me to have the extension install a single header file with that same
name.

The other side of this is that the PLs have to install their header
files.  Which the in-core PLs already do.  Would we want to move
their header files under a new per-extension directory scheme?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Copy function for logical replication slots

2018-07-03 Thread Masahiko Sawada
On Tue, Jul 3, 2018 at 1:01 PM, Michael Paquier  wrote:
> On Mon, Jul 02, 2018 at 04:31:32PM +0900, Masahiko Sawada wrote:
>> Attached an updated patch including copy function support for logical
>> slots as well as physical slots. Please review it.
>
> I had a look at this patch.

Thank you for the reviews.

>
> As the output plugin can be changed for logical slots, having two
> functions is a good thing.
>
> +# Basic decoding works using the copied slot
> +$result = $node_master->safe_psql('postgres',
> +   qq[SELECT pg_logical_slot_get_changes('copy_slot', NULL, NULL);]);
> +is(scalar(@foobar = split /^/m, $result),
> +   12, 'Decoding produced 12 rows inc BEGIN/COMMIT using the copied slot');
>
> This could live as well as part of the test suite for test_decoding, and
> that would be faster as well.  There are more scenarios that could be
> tested as well:
> - Copy a temporary slot to become permanent and switch its plugin type,
> then check the state of the slot in pg_replication_slots.
> - Do the reverse, aka copy a permanent slot and make it temporary.
> - Checking that the default behaviors are preserved: if persistent is
> not changed then the new slot should have the same persistence as the
> origin.  The same applies for the output plugin, and for both logical
> and replication slots.
> - Check that the reversed LSN is the same for both the origin and the
> target.

Will fix.

>
> +Copy a existing src_slot_name physical slot
> +to dst_slot_name. The copied physical slot starts
> +to reserve WAL from the same LSN as the source slot if the source slot
> +already reserves WAL. temporary is optional. If
> +temporary is omitted, the same value of the
> +source slot is used.
>
> Typo here.  Copy AN existing slot.  I would reword that a bit:
> Copies an existing physical replication slot name src_slot_name to a
> physical replication slot named dst_slot_name.
> Extra one: "the same value AS the source slot is used."
>
> +Copy a existing src_slot_name logical (decoding) 
> slot
> +to dst_slot_name while changing the output plugin
> +and persistence.
>
> There may be no need for "decoding" here, the same phrasing as suggested
> above would be nicer I think.  For LSN I would suggest to add an
>  markup.

Will fix.

>
> I am not sure if it makes much sense to be able to copy from a slot
> which has not yet been used to reserve WAL, but to change the properties
> of a slot I could get that forbidding the behavior is not really
> intuitive either.
>
> -   ReplicationSlotReserveWal();
> +   /* Find start location to read WAL if not specified */
> +   if (XLogRecPtrIsInvalid(start_lsn))
> +   ReplicationSlotReserveWal();
> +   else
> +   {
> +   SpinLockAcquire(&slot->mutex);
> +   slot->data.restart_lsn = start_lsn;
> +   SpinLockRelease(&slot->mutex);
> +   }
> Hmm.  Instead of all this stanza in CreateInitDecodingContext(), I would
> have imagined that what should happen is that the new fresh slot gets
> created with the same status data as the origin.  This saves quite a bit
> of extra post-creation computing, and we are also sure that the origin
> slot has consistent data as it is owned by the process doing the copy.

Hmm, is such post-creation computing not necessary for the copied
slot? I'm concerned we might miss some operations required for the
logical replication slot.

>
> One property which seems important to me is to make sure that the target
> slot has the same data as the origin slot once the caller knows that the
> copy has completed, and not that the target slot may perhaps have the
> same data as the origin while creating the target.  Do you see the
> difference?  Your patch does the latter, because it creates the new slot
> after releasing the lock it held on the origin, while I would think that
> the former is more important, aka keep the lock for the duration of the
> copy.

Did you mean "the caller" is clients who call
pg_copy_logical_replication_slot function? If so, I think it's very
difficult to ensure the former because the origin slot can advance
during returning the result to the client. Also if we keep the lock on
the origin slot during the copy I think that one problem is that we
will own two replication slots at a time. But there are a lot of code
that premise a process holds at most one replication slot.

>I am not completely sure if that's a property we want to keep,
> but that deserves discussion as we should not do a copy while the origin
> slot may still be consumed in parallel.  For physical slot the copy is
> straight-forward, less for logical slots.

I think these copy functions ensure that the target slot has the same
values as the origin slot at the time the function is called.
Isn't that clear?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Add --include-table-data-where option to pg_dump, to export only a subset of table data

2018-07-03 Thread Carter Thaxton
>
>
> >   pg_dump --where "bar:created_at >= 2018-05-01'"
>
> I am wondering how this works at parsing if the table name, or one of
> the columns includes a colon character :)
>

The proposed patch will handle quoted identifiers.  E.g. the following will
work just fine:

  pg_dump --where 'table:"column:with:colons" = 5'

Note the use of single quotes in the shell, and then double quotes in the
WHERE clause.  There are also many other options for quoting in the shell,
of course.



> Please don't top-post on the PostgreSQL lists. See <
http://idallen.com/topposting.html>

Sorry.  Thanks for the reminder.


Re: pgsql: Clarify use of temporary tables within partition trees

2018-07-03 Thread Amit Langote
On 2018/07/03 17:31, Amit Langote wrote:
> On 2018/07/03 16:05, Michael Paquier wrote:
>> On Tue, Jul 03, 2018 at 03:49:44PM +0900, Amit Langote wrote:
>>> I forgot that expand_partitioned_rtentry() will recursively call itself if
>>> a partition is itself a partitioned table, in which case the above
>>> code helps.
>>
>> Actually look at the coverage reports:
>> https://coverage.postgresql.org/src/backend/optimizer/prep/prepunion.c.gcov.html
>> 1742              :     /*
>> 1743              :      * If the partitioned table has no partitions or all the partitions are
>> 1744              :      * temporary tables from other backends, treat this as non-inheritance
>> 1745              :      * case.
>> 1746              :      */
>> 1747         4920 :     if (!has_child)
>> 1748            0 :         parentrte->inh = false;
>> 1749         4920 : }
>>
>> expand_partitioned_rtentry() never disables this flag on recursive calls
>> with a multi-level tree.  Could it be possible to get a test which
>> closes the gap?
> 
> I guess that it will be hard as you mentioned before on the thread that
> led to this commit.  We can't write regression tests which require using
> temporary partitions from other sessions.
> 
> Anyway, I just wanted to say that I was wrong when I said that the block
> added by the patch is unreachable.  It *is* reachable for multi-level
> partitioning.  For example, it will execute in the following case:
> 
> create table p (a int, b int) partition by list (a);
> create table pd partition of p default partition by list (b);
> select * from p;
> 
> expand_partitioned_rtentry will get called twice, and the newly added code
> would result in an early return from the function in the 2nd invocation,
> which is for 'pd'.
> 
> But,
> 
> 1. I still insist that it's better for the newly added code to be near the
> top of the function body than in the middle, which brings me to...
> 
> 2. While we're at fixing the code around here, I think we should think
> about trying to get rid of the *non-dead* code that produces a structure
> that isn't used anywhere, which I was under the impression, 0a480502b09
> [1] already did (cc'ing Ashutosh).  To clarify, we still unnecessarily
> create a "duplicate" RTE for partitioned tables in a partition tree
> (non-leaf tables) in its role as a child.  So, for the above query, there
> end up being created 4 entries in the query's range table (2 for 'p' and 2
> for 'pd').  That makes sense for plain inheritance, because even the root
> parent table in a plain inheritance tree is a regular table containing
> data.  That's not true for partition inheritance trees, where non-leaf
> tables contain no data, so we don't create a plan to scan them (see
> d3cc37f1d80 [2]), which in turn means we don't need to create the
> redundant "duplicate" child RTEs for them either.
> 
> See attached my delta patch to address both 1 and 2.
> 
> Thanks,
> Amit
> 
> [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0a480502b09
> 
> [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=d3cc37f1d80

For some reason, the ML address got removed from the list of addresses when
sending the above message.

Thanks,
Amit




Re: Copy function for logical replication slots

2018-07-03 Thread Craig Ringer
On 28 June 2018 at 10:51, Masahiko Sawada  wrote:

> Hi,
>
> I'd like to propose a copy function for logical replication slots.
> Currently when we create a new logical replication slot it starts to
> read WAL from the LSN of the current insert. This function copies an
> existing logical replication slot while changing the output plugin and
> persistence. That is, the copied new replication slot starts from the
> same LSN as the source one. Since a new copied slot starts from the
> same LSN as the existing one, we don't need to care about WAL reservation.
>

Strong agreement from me.

The inability to switch plugins is a massive pain. I've worked around it
with similar functions bundled into the extensions I work with that do
logical decoding related work. It makes sense to have it in core.
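
For illustration, usage would presumably look something like this (the copy
function's name and arguments follow the proposal in this thread and are not
a committed API):

-- Create a slot, then clone it with the proposed function, and check the
-- result; both slots should report the same restart_lsn.
SELECT pg_create_logical_replication_slot('orig_slot', 'test_decoding');
SELECT pg_copy_logical_replication_slot('orig_slot', 'copy_slot');
SELECT slot_name, plugin, restart_lsn FROM pg_replication_slots;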

> A use case I imagined is for investigations for example. I mean that
> when replication collision occurs on subscriber there is no way to see
> what replicated data is conflicting (perhaps error log helps it but is
> not detailed) and there is no way to advance a replication origin in
> order to exactly skip to apply conflicting data.


pglogical handles this by letting you peek the slot in a separate json
protocol output mode, but that doesn't help with pgoutput.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: shared-memory based stats collector

2018-07-03 Thread Kyotaro HORIGUCHI
Hello. Thanks for the comment.

At Mon, 2 Jul 2018 14:25:58 -0400, Robert Haas  wrote in 

> On Fri, Jun 29, 2018 at 4:34 AM, Kyotaro HORIGUCHI
>  wrote:
> > Nowadays PostgreSQL has dynamic shared hash (dshash) so we can
> > use this as the main storage of statistics. We can share data
> > without a stress using this.
> >
> > A PoC previously posted tried to use "locally copied" dshash but
> > it doesn't looks fine so I steered to different direction.
> >
> > With this patch dshash can create a local copy based on dynhash.
> 
> Copying the whole hash table kinds of sucks, partly because of the
> time it will take to copy it, but also because it means that memory
> usage is still O(nbackends * ntables).  Without looking at the patch,
> I'm guessing that you're doing that because we need a way to show each
> transaction a consistent snapshot of the data, and I admit that I
> don't see another obvious way to tackle that problem.  Still, it would
> be nice if we had a better idea.

The consistency here means "repeatable read" of an object's stats
entry, not a snapshot covering all objects. We don't need to copy
all the entries at once following this definition. The attached
version makes a cache entry only for requested objects.

In addition to that, vacuum doesn't require even repeatable-read
consistency, so we don't need to cache the entries at all.
backend_get_tab_entry now returns an isolated (that means
not-stored-in-hash) palloc'ed copy without making a local copy in
that case.

As a result, this version behaves as follows.

- Stats collector stores the results in shared memory.

- In backend, cache is created only for requested objects and
  lasts for the transaction.

- Vacuum directly reads the shared stats and doesn't create a
  local copy.

The non-behavioral difference from v1 is as follows.

- snapshot feature of dshash is removed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 0f29e118092b85882dfa7a89f5de5db35f576ad5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 29 Jun 2018 16:41:04 +0900
Subject: [PATCH 1/3] sequential scan for dshash

Add sequential scan feature to dshash.
---
 src/backend/lib/dshash.c | 122 +++
 src/include/lib/dshash.h |  20 +++-
 2 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index b46f7c4cfd..e6c1ef44f1 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -592,6 +592,128 @@ dshash_memhash(const void *v, size_t size, void *arg)
 	return tag_hash(v, size);
 }
 
+/*
+ * Initialize a sequential scan on the hash_table. Allows no modifications
+ * during a scan if consistent = true.
+ */
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table,
+bool consistent)
+{
+	status->hash_table = hash_table;
+	status->curbucket = 0;
+	status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+	status->curitem = NULL;
+	status->curpartition = -1;
+	status->consistent = consistent;
+
+	/*
+	 * Protect all partitions from modification if the caller wants a
+	 * consistent result.
+	 */
+	if (consistent)
+	{
+		int i;
+
+		for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+		{
+			Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+
+			LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+		}
+	}
+	ensure_valid_bucket_pointers(hash_table);
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+	dsa_pointer next_item_pointer;
+
+	if (status->curitem == NULL)
+	{
+		Assert (status->curbucket == 0);
+		Assert(!status->hash_table->find_locked);
+
+		/* first shot. grab the first item. */
+		next_item_pointer = status->hash_table->buckets[status->curbucket];
+		status->hash_table->find_locked = true;
+	}
+	else
+		next_item_pointer = status->curitem->next;
+
+	/* Move to the next bucket if we finished the current bucket */
+	while (!DsaPointerIsValid(next_item_pointer))
+	{
+		if (++status->curbucket >= status->nbuckets)
+		{
+			/* all buckets have been scanned. finish. */
+			dshash_seq_release(status);
+			return NULL;
+		}
+		Assert(status->hash_table->find_locked);
+
+		next_item_pointer = status->hash_table->buckets[status->curbucket];
+
+		/*
+		 * we need a lock on the scanning partition even if the caller didn't
+		 * request a consistent snapshot.
+		 */
+		if (!status->consistent && DsaPointerIsValid(next_item_pointer))
+		{
+			dshash_table_item  *item = dsa_get_address(status->hash_table->area,
+	   next_item_pointer);
+			int next_partition = PARTITION_FOR_HASH(item->hash);
+			if (status->curpartition != next_partition)
+			{
+if (status->curpartition >= 0)
+	LWLockRelease(PARTITION_LOCK(status->hash_table,
+ status->curpartition));
+LWLockAcquire(PARTITION_LOCK(status->hash_table,
+			 next_partition),
+			  LW_SHARED);
+status->curpartition = next_partition;
+			}
+		}
+	}
+
+	status->curitem =
+		dsa_get_addres

How to use public key file to encrypt data

2018-07-03 Thread ROS Didier
Hi
   I would like to know how to encrypt data with physical public
key files. I can't find any documentation about this subject.
   Thanks in advance

Best Regards


Didier ROS
Expertise SGBD
DS IT/IT DMA/Solutions Groupe EDF/Expertise Applicative - SGBD
Nanterre Picasso - E2 565D (aile nord-est)
32 Avenue Pablo Picasso
92000 Nanterre
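
For reference, the pgcrypto extension's PGP functions cover this case; a
minimal sketch (the key path is a placeholder, the file must be an
ASCII-armored public key readable by the server, and pg_read_file needs
appropriate privileges):

-- Encrypt a value with a PGP public key read from a file on the server.
CREATE EXTENSION IF NOT EXISTS pgcrypto;
SELECT pgp_pub_encrypt('secret data',
                       dearmor(pg_read_file('/path/to/public.key')));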








Re: Avoid extra Sort nodes between WindowAggs when sorting can be reused

2018-07-03 Thread Masahiko Sawada
On Tue, Jul 3, 2018 at 6:19 AM, Daniel Gustafsson  wrote:
>> On 2 Jul 2018, at 14:01, Masahiko Sawada  wrote:
>
>> Thank you for updating the patch! There are two review comments.
>
> Thanks for reviewing!
>
>> The current select_active_windows() function compares the all fields
>> of WindowClause for the sorting but with this patch we compare only
>> tleSortGroupRef, sortop and the number of uniqueOrder. I think this
>> leads a degradation as follows.
>
> You are right, that was an oversight.  The attached patch takes a stab at
> fixing this.
>
>> s/readibility/readability/
>
> Fixed.

Thank you for updating the patch.

+   if (sca->tleSortGroupRef > scb->tleSortGroupRef)
+   return -1;
+   else if (sca->tleSortGroupRef < scb->tleSortGroupRef)
+   return 1;
+   else if (sca->sortop > scb->sortop)
+   return -1;
+   else if (sca->sortop < scb->sortop)
+   return 1;
+   else if (sca->nulls_first && !scb->nulls_first)
+   return -1;
+   else if (!sca->nulls_first && scb->nulls_first)
+   return 1;

Hmm, this is missing the eqop fields of SortGroupClause. I haven't
tested it yet, but does a similar degradation happen if two
SortGroupClauses have different eqop and the same other values?

--
Source code comments for the common_prefix_cmp() function and the
WindowClauseSortNode struct are needed.

--
+-- Test Sort node reordering
+EXPLAIN (COSTS OFF)
+SELECT
+  lead(1) OVER (PARTITION BY depname ORDER BY salary, enroll_date),
+  lag(1) OVER (PARTITION BY depname ORDER BY salary,enroll_date,empno)
+  from empsalary;

I think it's better to change "from empsalary" to "FROM empsalary" for
consistency with other code.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Psql patch to show access methods info

2018-07-03 Thread s . cherkashin

Following issues are solved:

\dAf[+]  [AMPTRN [OPFPTRN]]  list operator families of access method.
                             + prints owner of operator family. (Table pg_opfamily)

\dAfp    [AMPTRN [OPFPTRN]]  list procedures of operator family related
                             to access method (Table pg_amproc)

 * Reorder "Left"/"Right" and "Strategy"/"Proc name" columns.
 * Include "Left"/"Right" columns into ORDER BY clause.
 * Show procedure's argument types, because procedure's name does not
   completely identify procedure (for example, in_range() is used in several
   opclasses with different signatures).  Or maybe show arguments only if
   procedure name is not unique?

\dAfo    [AMPTRN [OPFPTRN]]  list operators of family related to access
                             method (Table pg_amop)

 * Reorder "Left"/"Right" and "Strategy"/"Operator" columns.
 * Include "Left"/"Right" columns into ORDER BY clause.

\dAoc[+] [AMPTRN [OPCPTRN]]  list operator classes of index access methods.
                             + prints owner of operator class. (Table pg_opclass)

 * Maybe it would be better to show stored type only if it differs from the
   indexed type?

\dip[S]  [PATTERN]           list indexes with properties (Table pg_class)

\dicp[S] [IDXNAME [COLNAME]] show index column properties (Table pg_class)

 * Fix duplicate rows that appear in the table for composite indices.
 * Include "Column #" into ORDER BY clause.
 * Rename column "Null first" to "Nulls First" or "NULLS LAST".
 * Maybe it is not necessary to show "Access method" column here?
 * I think we should show column's properties in the separate table for each
   index, because it is not so easy to understand the combined table.
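
For context, a \dAf listing boils down to a catalog query roughly like the
following (a sketch against pg_opfamily and pg_am, not the patch's exact SQL):

-- Operator families per access method, with owner (roughly what \dAf+ shows).
SELECT am.amname AS "AM",
       opf.opfname AS "Operator family",
       pg_get_userbyid(opf.opfowner) AS "Owner"
FROM pg_opfamily opf
JOIN pg_am am ON am.oid = opf.opfmethod
ORDER BY 1, 2;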



Following issues require discussion:

\dAp
 * Should we rename it to \dAip and include "index" word into the table
   header?  As you know, we are going to support table AMs in the future.

\dAfo
 * Operator's schema is shown only if operator is invisible for the current
   user -- I'm not sure if this is correct.

\dAfo and \dAfp
 * Should we put info in separate table for each operator family?

\dicp
 * ASC, NULLS are shown as TRUE/FALSE only if the index is orderable, and as
   NULL if unorderable -- I'm not sure if this is correct.  Maybe we should
   simply show these properties in the literal form, not as booleans
   (as strings 'ASC'/'DESC', 'NULLS FIRST'/'NULLS LAST')?





I also have a question about testing the commands \dAf+ and \dAoc+: is it a
good idea to test them by changing the owner of one operator family or
class to a newly created one, checking the output, and restoring the owner
back? Or should we create a new opclass or opfamily with the proper owner?
Or maybe it is not necessary to test these commands?
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3ed9021..b699548 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -675,7 +675,7 @@
search and ordering purposes.)
   
 
-  
+  
pg_amop Columns
 

@@ -818,7 +818,7 @@
is one row for each support procedure belonging to an operator family.
   
 
-  
+  
pg_amproc Columns
 

@@ -4458,7 +4458,7 @@ SCRAM-SHA-256$:&l
Operator classes are described at length in .
   
 
-  
+  
pg_opclass Columns
 

@@ -4720,7 +4720,7 @@ SCRAM-SHA-256$:&l
Operator families are described at length in .
   
 
-  
+  
pg_opfamily Columns
 

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index b17039d..273c3f7 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1203,6 +1203,95 @@ testdb=>
 
 
   
+  
+  
+
+  \dAf
+  [ access-method-pattern 
+[operator-family-pattern]]
+  
+
+
+
+
+Lists operator families (). If access-method-pattern is specified, only
+families whose access method name matches the pattern are shown.
+If operator-family-pattern is specified, only
+operator families whose name matches the pattern are shown.
+If + is appended to the command name, each operator
+family is listed with its owner.
+
+
+  
+
+  
+
+  \dAfo
+[access-method-pattern
+  [operator-family-pattern]]
+  
+
+
+
+
+Lists operators () associated with access method operator families.
+If access-method-pattern is specified,
+only operators associated with access methods whose name matches the pattern are shown.
+If operator-family-pattern is specified, only
+operators associated with families whose name matches the pattern are shown.
+
+
+  
+
+  
+
+  \dAfp
+[access-method-pattern
+  [operator-family-pattern]]
+  
+
+
+
List procedures () associated with access method operator famili

Re: Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key) Management Service (KMS)

2018-07-03 Thread Masahiko Sawada
On Tue, Jul 3, 2018 at 7:16 AM, Nico Williams  wrote:
> On Mon, Jul 02, 2018 at 06:22:46PM +0900, Masahiko Sawada wrote:
>> On Fri, Jun 22, 2018 at 2:31 PM, Tsunakawa, Takayuki
>>  wrote:
>> > From: Nico Williams [mailto:n...@cryptonector.com]
>> >
>> >> One shortcoming of relying on OS functionality for protection against
>> >> malicious storage is that not all OSes may provide such functionality.
>> >> This could be an argument for implementing full, transparent encryption
>> >> for an entire DB in the postgres server.  Not a very compelling
>> >> argument, but that's just my opinion -- reasonable people could differ
>> >> on this.
>> >
>> > Yes, this is one reason I developed TDE in our product.  And
>> > in-database encryption allows optimization by encrypting only user
>> > data.
>
> You're likely getting some things terribly wrong.  E.g., integrity
> protection.  Most likely you're getting a false sense of security.
>
>> Me too. In-database encryption is helpful in practice. I think 1) and
>> 2) seem to cover the threat models which the data encryption in
>> database needs to defend against.
>
> Yes, but piecemeal encryption seems like a bad idea to me.
>

What do you mean by "piecemeal encryption"? Is it not-whole database
encryption such as per-table or per-tablespace? If so could you please
elaborate on the reason why you think so?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: Add --include-table-data-where option to pg_dump, to export only a subset of table data

2018-07-03 Thread Surafel Temesgen
On Mon, Jul 2, 2018 at 9:27 PM, Robert Haas  wrote:

>
> But you can specify multiple tables.  You wouldn't want the same WHERE
> clause to apply to all of them.
>
Also, with this new --where option you can specify multiple tables using a
wildcard, and it tries to apply the same WHERE clause to each table. Maybe
that's a desirable feature, because such tables can be structurally similar
too.

regards
Surafel


Re: Explain buffers wrong counter with parallel plans

2018-07-03 Thread Amit Kapila
On Mon, Jul 2, 2018 at 6:02 PM, Robert Haas  wrote:
>
> I think the core problem here is this hunk from gather_readnext:
>
>  {
>  Assert(!tup);
> -DestroyTupleQueueReader(reader);
>  --gatherstate->nreaders;
>  if (gatherstate->nreaders == 0)
> -{
> -ExecShutdownGatherWorkers(gatherstate);
>  return NULL;
> -}
>  memmove(&gatherstate->reader[gatherstate->nextreader],
>  &gatherstate->reader[gatherstate->nextreader + 1],
>  sizeof(TupleQueueReader *)
>
> Since ExecShutdownGatherWorkers() is no longer called there, the
> instrumentation data isn't accumulated into the Gather node when the
> workers are shut down.  I think that's a bug and we should fix it.
>

Yeah, previously, I have also pointed out the same code [1].  However,
I have not done any testing to prove it.

> To fix the problem with Limit that you mention, we could just modify
> nodeLimit.c so that when the state is changed from LIMIT_INWINDOW to
> LIMIT_WINDOWEND, we also call ExecShutdownNode on the child plan.
>

It should work.
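
For reference, a minimal repro sketch of the Limit case (table name and sizes
are arbitrary; the point is that the Gather node's buffer counts can come out
too low when the Limit stops pulling rows before the workers are shut down):

-- Force a parallel plan under a Limit and compare the buffer counts reported
-- at the Gather node with what the workers actually read.
CREATE TABLE big AS SELECT g AS a FROM generate_series(1, 1000000) g;
SET parallel_setup_cost = 0;
SET parallel_tuple_cost = 0;
SET min_parallel_table_scan_size = 0;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big WHERE a % 10 = 0 LIMIT 100;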

> We can fix other cases as we find them.
>

I think we have a similar problem with GatherMerge, but that also
appears to be fixable.

Are you planning to work on it?  If not, then I can look into it.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1KZEbYKj9HHP-6WqqjAXuoB%2BWJu-w1s9uovj%3DeeBxC48Q%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Failed assertion due to procedure created with SECURITY DEFINER option

2018-07-03 Thread amul sul
On Fri, Jun 29, 2018 at 5:26 PM Peter Eisentraut
 wrote:
>
> On 6/29/18 13:07, amul sul wrote:
> > This happens because of in fmgr_security_definer() function we are
> > changing  global variable SecurityRestrictionContext and in the
> > StartTransaction() insisting it should be zero, which is the problem.
>
> Hmm, what is the reason for this insistation?
>
> We could work around this for now by prohibiting transaction commands in
> security definer procedures, similar to what we do in procedures with
> GUC settings attached.
>

I am not sure that I have understood this, apologies. Do you mean
the following case:

postgres=# CREATE PROCEDURE transaction_test1() LANGUAGE plpgsql
SECURITY DEFINER SET work_mem to '16MB'
AS $$ BEGIN
  COMMIT;
 END $$;
CREATE PROCEDURE

postgres=# CALL transaction_test1();
ERROR:  invalid transaction termination
CONTEXT:  PL/pgSQL function transaction_test1() line 2 at COMMIT
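
For contrast, the originally reported shape (a sketch with an arbitrary
procedure name) has SECURITY DEFINER but no GUC clause, so the clean error
path above does not apply and the assertion in StartTransaction is what fires
on assert-enabled builds:

CREATE PROCEDURE transaction_test2() LANGUAGE plpgsql
SECURITY DEFINER
AS $$ BEGIN
  COMMIT;
 END $$;
CALL transaction_test2();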

Thanks.

Regards,
Amul



RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Antonin Houska.

> -Original Message-
> From: Antonin Houska [mailto:a...@cybertec.at]
> Sent: Tuesday, May 29, 2018 3:23 PM
> To: Moon, Insung
> Cc: pgsql-hack...@postgresql.org
> Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> Moon, Insung  wrote:
> 
> This patch seems to implement some of the features you propose, especially 
> encryption of buffers and WAL. I recommend
> you to check so that no effort is
> duplicated:

Yes, encrypting/decrypting between buffer and disk is the same architecture.
But this idea is not to encrypt all tables; to minimize the performance
overhead, it encrypts only the necessary tables (including the XLOG).

Thank you and Best regards.
Moon.

> 
> > [4] Recently discussed mail
> >
> > https://www.postgresql.org/message-id/CA%2BCSw_tb3bk5i7if6inZFc3yyf%2B
> > 9HEVNTy51QFBoeUk7UE_V%3Dw%40mail.gmail.com
> 
> 
> 
> --
> Antonin Houska
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26, A-2700 Wiener Neustadt
> Web: https://www.cybertec-postgresql.com





RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Aleksander Alekseev.

> -Original Message-
> From: Aleksander Alekseev [mailto:a.aleks...@postgrespro.ru]
> Sent: Thursday, May 31, 2018 10:33 PM
> To: Moon, Insung
> Cc: pgsql-hack...@postgresql.org
> Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> Hello Moon,
> 
> I promised to email links to the articles I mentioned during your talk on the 
> PGCon Unconference to this thread. Here
> they are:
> 
> * http://cryptowiki.net/index.php?title=Order-preserving_encryption
> * https://en.wikipedia.org/wiki/Homomorphic_encryption
> 
> Also I realized that I was wrong regarding encryption of the indexes since 
> they will be encrypted on the block level the
> same way the heap will be.

Sorry, I did not explain this correctly at PGCon.
Yes, this idea is to encrypt at the block level as you said, so there is
probably not a big problem with index encryption.
I will test index encryption with the PoC later.

Thank you and Best regards.
Moon.


> 
> --
> Best regards,
> Aleksander Alekseev





RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Masahiko Sawada.

> -Original Message-
> From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> Sent: Monday, June 11, 2018 6:22 PM
> To: Moon, Insung
> Cc: PostgreSQL-development; Joe Conway
> Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> On Fri, May 25, 2018 at 8:41 PM, Moon, Insung  
> wrote:
> > Hello Hackers,
> >
> > This propose a way to develop "Table-level" Transparent Data
> > Encryption (TDE) and Key Management Service (KMS) support in PostgreSQL.
> >
> >
> > Issues on data encryption of PostgreSQL == Currently, in
> > PostgreSQL, data encryption can be using pgcrypto Tool.
> > However, it is inconvenient to use pgcrypto to encrypts data in some cases.
> >
> > There are two significant inconveniences.
> >
> > First, if we use pgcrypto to encrypt/decrypt data, we must call pgcrypto 
> > functions everywhere we encrypt/decrypt.
> > Second, we must modify application program code much if we want to do
> > database migration to PostgreSQL from other databases that is using TDE.
> >
> > To resolved these inconveniences, many users want to support TDE.
> > There have also been a few proposals, comments, and questions to support 
> > TDE in the PostgreSQL community.
> >
> > However, currently PostgreSQL does not support TDE, so in development
> > community, there are discussions whether it's necessary to support TDE or 
> > not.
> >
> > In these discussions, there were requirements necessary to support TDE in 
> > PostgreSQL.
> >
> > 1) The performance overhead of encryption and decryption database data
> > must be minimized
> > 2) Need to support WAL encryption.
> > 3) Need to support Key Management Service.
> >
> > Therefore, I'd like to propose the new design of TDE that deals with both 
> > above requirements.
> > Since this feature will become very large, I'd like to hear opinions from 
> > community before starting making the patch.
> >
> > First, my proposal is table-level TDE which is that user can specify tables 
> > being encrypted.
> > Indexes, TOAST table and WAL associated with the table that enables TDE are 
> > also encrypted.
> >
> > Moreover, I want to support encryption for large object as well.
> > But I haven't found a good way for it so far. So I'd like to remain it as 
> > future TODO.
> >
> > My proposal has five characteristics features of "table-level TDE".
> >
> > 1) Buffer-level data encryption and decryption
> > 2) Per-table encryption
> > 3) 2-tier encryption key management
> > 4) Working with external key management services(KMS)
> > 5) WAL encryption
> >
> > Here are more details for each items.
> >
> >
> > 1. Buffer-level data encryption and decryption ==
> > Transparent data encryption and decryption accompany by storage
> > operation With ordinally way like using pgcrypto, the biggest problem
> > with encrypted data is the performance overhead of decrypting the data each 
> > time the run to queries.
> >
> > My proposal is to encrypt and decrypt data when performing DISK I/O 
> > operation to minimize performance overhead.
> > Therefore, the data in the shared memory layer is unencrypted so that 
> > performance overhead can minimize.
> >
> > With this design, data encryption/decryption implementations can be
> > developed by modifying the codes of the storage and buffer manager
> > modules, which are responsible for performing DISK I/O operation.
> >
> >
> > 2. Per-table encryption
> > ==
> > User can enable TDE per table as they want.
> > I introduce new storage parameter "encryption_enabled" which enables TDE at 
> > table-level.
> >
> > // Generate  the encryption table
> >CREATE TABLE foo WITH ( ENCRYPTION_ENABLED = ON );
> >
> > // Change to the non-encryption table
> >ALTER TABLE foo SET ( ENCRYPTION_ENABLED = OFF );
> >
> > This approach minimizes the overhead for tables that do not require 
> > encryption options.
> > For tables that enable TDE, the corresponding table key will be
> > generated with random values, and it's stored into the new system catalog 
> > after being encrypted by the master key.
> >
> > BTW, I want to support CBC mode encryption[3]. However, I'm not sure how to 
> > use the IV in CBC mode for this proposal.
> > I'd like to hear opinions by security engineer.
> >
> >
> > 3. 2-tier encryption key management
> > ==
> > when it comes time to change cryptographic keys, there is a performance 
> > overhead to decryption and re-encryption to
> all data.
> >
> > To solve this problem we employee 2-tier encryption.
> > 2-tier encryption is All table keys can be stored in the database
> > cluster after being encrypted by the master key, And master keys must be 
> > stored at external of PostgreSQL.
> >
> > Therefore, without master key, it is impossible to decrypt the table key. 
> > Thus, It is impossible to decrypt the database
> data.
> >
> > When changing the key, it's not necessary to re

RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Tomas Vondra.

> -Original Message-
> From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> Sent: Wednesday, June 13, 2018 10:15 PM
> To: Moon, Insung; pgsql-hack...@postgresql.org
> Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> Hi,
> 
> On 05/25/2018 01:41 PM, Moon, Insung wrote:
> > Hello Hackers,
> >
> > ...
> >
> > BTW, I want to support CBC mode encryption[3]. However, I'm not sure
> > how to use the IV in CBC mode for this proposal. I'd like to hear
> > opinions by security engineer.
> >
> 
> I'm not a cryptographer either, but this is exactly where you need a prior 
> discussion about the threat models - there
> are a couple of chaining modes, each with different weaknesses.
> 

Thank you for your advice.
First, I researched the security problem more and found that CBC mode is
not a safe encryption mode.
Later, when I create a PoC, I will use GCM or XTS encryption mode.
And this time I learned that using the same IV is dangerous, and I'm doing
some more research on this.

Thank you and Best regards.
Moon.


> FWIW it may also matter if data_checksums are enabled, because that may 
> prevent malleability attacks affecting of the
> modes. Assuming active attacker (with the ability to modify the data files) 
> is part of the threat model, of course.
> 
> regards
> 
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Explain buffers wrong counter with parallel plans

2018-07-03 Thread Robert Haas
On Tue, Jul 3, 2018 at 6:48 AM, Amit Kapila  wrote:
> Yeah, previously, I have also pointed out the same code [1].  However,
> I have not done any testing to prove it.

Ah, OK.  Sorry, forgot about that email.

> Are you planning to work on it?  If not, then I can look into it.

I won't have time for a few weeks, so if you have time sooner please have at it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Tomas Vondra.

> -Original Message-
> From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> Sent: Wednesday, June 13, 2018 10:03 PM
> To: Masahiko Sawada; Moon, Insung
> Cc: PostgreSQL-development; Joe Conway
> Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> On 06/11/2018 11:22 AM, Masahiko Sawada wrote:
> > On Fri, May 25, 2018 at 8:41 PM, Moon, Insung
> >  wrote:
> >> Hello Hackers,
> >>
> >> This propose a way to develop "Table-level" Transparent Data
> >> Encryption (TDE) and Key Management Service (KMS) support in
> >> PostgreSQL.
> >>
> >> ...
> >
> > As per discussion at PGCon unconference, I think that firstly we need
> > to discuss what threats we want to defend database data against.
> > If user wants to defend against a threat that is malicious user who
> > logged in OS or database steals an important data on database this
> > design TDE would not help. Because such user can steal the data by
> > getting a memory dump or by SQL. That is of course differs depending
> > on system requirements or security compliance but what threats do you
> > want to defend database data against? and why?
> >
> 
> I do agree with this - a description of the threat model needs to be part of 
> the design discussion, otherwise it's not
> possible to compare it to alternative solutions (e.g. full-disk encryption 
> using LUKS or using existing privilege controls
> and/or RLS).
> 
> TDE was proposed/discussed repeatedly in the past, and every time it died 
> exactly because it was not very clear which
> issue it was attempting to solve.
> 
> Let me share some of the issues mentioned as possibly addressed by TDE (I'm 
> not entirely sure TDE actually solves them,
> I'm just saying those were mentioned in previous discussions):
> 
> 1) enterprise requirement - Companies want in-database encryption, for 
> various reasons (because "enterprise solution"
> or something).

Yes, I do not know the enterprise encryption requirements clearly.
Typically, I identified the encryption requirements of PCI-DSS and posted
these ideas (storage encryption).
Therefore, following your opinion, I will try to research the enterprise
encryption requirements more.

> 
> 2) like FDE, but OS/filesystem independent - Same config on any OS and 
> filesystem, which may make maintenance easier.
> 
> 3) does not require special OS/filesystem setup - Does not require help from 
> system administrators, setup of LUKS devices
> or whatever.

Yes, we can use disk encryption like LUKS on Linux, but it does not apply to
all OSes, so I proposed TDE.

> 
> 4) all filesystem access (basebackups/rsync) is encrypted anyway
> 
> 5) solves key management (the main challenge with pgcrypto)

In fact, key management is my biggest worry.
First, I think of 2-tier encryption as I wrote in my idea, and I am thinking
of using a KMS to manage the master key.
However, I am also worried about security problems when managing the table
keys and the master key.
Therefore, I want to discuss key management further and develop the KMS
simultaneously with TDE.


Thank you and Best regards.
Moon.


> 
> 6) allows encrypting only some of the data (tables, columns) to minimize 
> performance impact
> 
> IMHO it makes sense to have TDE even if it provides the same "security"
> as disk-level encryption, assuming it's more convenient to setup/use from the 
> database.
> 
> > Also, if I understand correctly, at unconference session there also
> > were two suggestions about the design other than the suggestion by
> > Alexander: implementing TDE at column level using POLICY, and
> > implementing TDE at table-space level. The former was suggested by Joe
> > but I'm not sure the detail of that suggestion. I'd love to hear the
> > deal of that suggestion. The latter was suggested by Tsunakawa-san.
> > Have you considered that?
> >
> > You mentioned that encryption of temporary data for query processing
> > and large objects are still under the consideration. But other than
> > them you should consider the temporary data generated by other
> > subsystems such as reorderbuffer and transition table as well.
> >
> 
> The severity of those limitations is likely related to the threat model.
> I don't think encrypting temporary data would be a big problem, assuming you 
> know which key to use.
> 
> regards
> 
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services






Re: Threat models for DB cryptography (Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key) Management Service (KMS)

2018-07-03 Thread Nico Williams
On Tue, Jul 03, 2018 at 07:28:42PM +0900, Masahiko Sawada wrote:
> On Tue, Jul 3, 2018 at 7:16 AM, Nico Williams  wrote:
> > Yes, but piecemeal encryption seems like a bad idea to me.
> 
> What do you mean by "piecemeal encryption"? Is it not-whole database
> encryption such as per-table or per-tablespace? If so could you please
> elaborate on the reason why you think so?

I mean that encrypting some columns only, or some tables only, has
integrity protection issues.  See earlier posts in this thread.

Encrypting the whole DB has no such problems, assuming you're doing the
crypto correctly anyways.  But for full DB encryption it's easier to
leave the crypto to the filesystem or device drivers.  (If the devices
are physically in the host and cannot be removed easily, then FDE at the
device works well too.)

Nico
-- 



RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Takayuki Tsunakawa.

> -Original Message-
> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> Sent: Thursday, June 14, 2018 9:58 AM
> To: 'Tomas Vondra'; Moon, Insung; pgsql-hack...@postgresql.org
> Subject: RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> > From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> > On 05/25/2018 01:41 PM, Moon, Insung wrote:
> > > BTW, I want to support CBC mode encryption[3]. However, I'm not sure
> > > how to use the IV in CBC mode for this proposal. I'd like to hear
> > > opinions by security engineer.
> > >
> >
> > I'm not a cryptographer either, but this is exactly where you need a
> > prior discussion about the threat models - there are a couple of
> > chaining modes, each with different weaknesses.
> Our products uses XTS, which recent FDE software like BitLocker and TrueCrypt 
> uses instead of CBC.
> 
> https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS
> 
> "According to SP 800-38E, "In the absence of authentication or access 
> control, XTS-AES provides more protection than the
> other approved confidentiality-only modes against unauthorized manipulation 
> of the encrypted data.""

Thank you for your advice!

Yes. I found that CBC is not safe at this time.
So let's use XTS mode or GCM mode as you mentioned.

Thank you and Best regards.
Moon.

> 
> 
> 
> > FWIW it may also matter if data_checksums are enabled, because that
> > may prevent malleability attacks affecting some of the modes. Assuming
> > active attacker (with the ability to modify the data files) is part of
> > the threat model, of course.
> 
> Encrypt the page after embedding its checksum value.  If a malicious attacker 
> modifies a page on disk, then the decrypted
> page would be corrupt anyway, which can be detected by checksum.
> 
> 
> Regards
> Takayuki Tsunakawa
> 
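
For illustration, here is a minimal sketch of the ordering described in the
quoted paragraph: stamp the checksum into the page header first, then encrypt
the whole page, so that after decryption the ordinary checksum verification
catches any tampering with the ciphertext.  pg_checksum_page() and the page
header are existing PostgreSQL interfaces; page_encrypt()/page_decrypt() are
hypothetical placeholders for whatever cipher a TDE patch would supply.

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"

/* Hypothetical cipher entry points -- not existing PostgreSQL functions. */
extern void page_encrypt(char *page, Size len, BlockNumber blkno);
extern void page_decrypt(char *page, Size len, BlockNumber blkno);

static void
encrypt_page_for_write(char *page, BlockNumber blkno)
{
	/* same assignment PageSetChecksumInplace() performs today */
	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
	page_encrypt(page, BLCKSZ, blkno);
}

static bool
decrypt_and_verify_page(char *page, BlockNumber blkno)
{
	page_decrypt(page, BLCKSZ, blkno);

	/*
	 * Any modification of the ciphertext scrambles the decrypted page, so
	 * the checksum comparison fails with very high probability.
	 */
	return ((PageHeader) page)->pd_checksum == pg_checksum_page(page, blkno);
}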






Re: Add --include-table-data-where option to pg_dump, to export only a subset of table data

2018-07-03 Thread Robert Haas
On Tue, Jul 3, 2018 at 6:31 AM, Surafel Temesgen  wrote:
> On Mon, Jul 2, 2018 at 9:27 PM, Robert Haas  wrote:
>> But you can specify multiple tables.  You wouldn't want the same WHERE
>> clause to apply to all of them.
>>
> also, with this new --where option you can specify multiple tables using a
> wildcard, and it tries to apply the same WHERE clause to each table. Maybe
> that is a desirable feature, because such tables can be structurally
> similar too.

I don't think that's likely to be very useful.  I think Carter Thaxton
has the right idea, although using foo:bar to mean foo.bar doesn't
seem like a great plan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-07-03 Thread Andrey V. Lepikhov



On 03.07.2018 00:40, Peter Geoghegan wrote:

On Mon, Jul 2, 2018 at 9:28 AM, Peter Geoghegan  wrote:

Execution time of last "VACUUM test;" command on my notebook was:

with bulk deletion: 1.6 s;
with Quick Vacuum Strategy: 5.2 s;
with Quick Vacuum Strategy & TID sorting: 0.6 s.


I'm glad that you looked into this. You could make this faster still,
by actually passing the lowest heap TID in the list of TIDs to kill to
_bt_search() and _bt_binsrch(). You might have to work through several
extra B-Tree leaf pages per bttargetdelete() call without this (you'll
move right multiple times within bttargetdelete()).


I should add: I think that this doesn't matter so much in your
original test case with v1 of my patch, because you're naturally
accessing the index tuples in almost the most efficient way already,
since your VACUUM works its way from the start of the table until the
end of the table. You'll definitely need to pass a heap TID to
routines like _bt_search() once you start using my v2, though, since
that puts the heap TIDs in DESC sort order. Otherwise, it'll be almost
as slow as the plain "Quick Vacuum Strategy" case was.

In general, the big idea with my patch is that heap TID is just
another attribute. I am not "cheating" in any way; if it's not
possible to descend the tree and arrive at the right leaf page without
looking through several leaf pages, then my patch is broken.

You might also use _bt_moveright() with my patch. That way, you can
quickly detect that you need to move right immediately, without going
through all the items on the page. This should only be an issue in the
event of a concurrent page split, though. In my patch, I use
_bt_moveright() in a special way for unique indexes: I need to start
at the first leaf page a duplicate could be on for duplicate checking,
but once that's over I want to "jump immediately" to the leaf page the
index tuple actually needs to be inserted on. That's when
_bt_moveright() is called. (Actually, that looks like it breaks unique
index enforcement in the case of my patch, which I need to fix, but
you might still do this.)



Done.
The attachment contains an update for use with v2 of the 'Ensure nbtree 
leaf tuple keys are always unique' patch.


Apply order:
1. 0001-Retail-IndexTuple-Deletion-Access-Method.patch - from previous email
2. 0002-Quick-vacuum-strategy.patch - from previous email
3. v2-0001-Ensure-nbtree-leaf-tuple-keys-are-always-unique.patch - from [1]
4. 0004-Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf.patch

[1] 
https://www.postgresql.org/message-id/CAH2-Wzm6D%3DKnV%2BP8bZE-ZtP4e%2BW64HtVTdOenqd1d7HjJL3xZQ%40mail.gmail.com


--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company
>From 1c8569abe9479e547911ec3079633f79056eff96 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" 
Date: Tue, 3 Jul 2018 16:54:46 +0500
Subject: [PATCH 4/4] Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf

---
 src/backend/access/nbtree/nbtree.c | 75 +-
 src/backend/commands/vacuumlazy.c  |  8 ++--
 src/include/access/genam.h |  2 +-
 3 files changed, 55 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c54aeac..7c617e9 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -887,15 +887,19 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	return stats;
 }
 
-static int
-tid_list_search(ItemPointer tid, ItemPointer tid_list, int ntid)
-{
-	for (int i = 0; i < ntid; i++)
-		if (ItemPointerEquals(tid, &(tid_list[i])))
-			return i;
-	return -1;
-}
-
+/*
+ * Deletion of index entries pointing to heap tuples.
+ *
+ * Constraints:
+ * 1. The TID list info->dead_tuples is arranged in ASC order.
+ * 2. Logical duplicates of index tuples are stored in DESC order.
+ *
+ * The function generates an insertion scan key and descends the btree to the
+ * first index tuple that satisfies the scan key and the last TID in the
+ * info->dead_tuples list.  From the scan results it deletes all index entries
+ * matched to the TID list.
+ *
+ * Result: a palloc'd struct containing statistical info.
+ */
 IndexTargetDeleteResult*
 bttargetdelete(IndexTargetDeleteInfo *info,
 			   IndexTargetDeleteResult *stats,
@@ -914,20 +918,21 @@ bttargetdelete(IndexTargetDeleteInfo *info,
	int			ndeletable = 0;
 	OffsetNumber	deletable[MaxOffsetNumber];
 	IndexTuple		itup;
+	int			pos = info->last_dead_tuple;
 
 	if (stats == NULL)
 		stats = (IndexTargetDeleteResult *) palloc0(sizeof(IndexTargetDeleteResult));
 
+	/* Assemble scankey */
 	itup = index_form_tuple(RelationGetDescr(irel), values, isnull);
 	skey = _bt_mkscankey(irel, itup);
 
 	/* Descend the tree and position ourselves on the target leaf page. */
-	stack = _bt_search(irel, keysCount, skey, false, &buf, BT_READ, NULL);
-	_bt_freestack(stack);
+	stack = _bt_search(irel, keysCount, skey, &info->dead_tuples[pos], false, &buf, BT_READ, NULL);
 
 	/* T

RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Joe.

> -Original Message-
> From: Joe Conway [mailto:m...@joeconway.com]
> Sent: Monday, June 18, 2018 9:30 PM
> To: Masahiko Sawada
> Cc: Moon, Insung; PostgreSQL-development
> Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> On 06/14/2018 12:19 PM, Masahiko Sawada wrote:
> > On Wed, Jun 13, 2018 at 10:20 PM, Joe Conway  wrote:
> >> The idea has not been extensively fleshed out yet, but the thought
> >> was that we create column level POLICY, which would transparently
> >> apply some kind of transform on input and/or output. The transforms
> >> would presumably be expressions, which in turn could use functions
> >> (extension or builtin) to do their work. That would allow
> >> encryption/decryption, DLP (data loss prevention) schemes (masking,
> >> redacting), etc. to be applied based on the policies.
> >
> > Which does this design encrypt data on, buffer or both buffer and
> > disk?
> 
> 
> The point of the design is simply to provide a mechanism for input and output 
> transformation, not to provide the transform
> function itself.
> 
> How you use that transformation would be entirely up to you, but if you were 
> providing an encryption transform on input
> the data would be encrypted both buffer and disk.
> 
> > And does this design (per-column encryption) aim to satisfy something
> > specific security compliance?
> 
> 
> Again, entirely up to you and dependent on what type of transformation you 
> provide. If, for example you provided input
> encryption and output decryption based on some in memory session variable 
> key, that would be essentially TDE and would
> satisfy several common sets of compliance requirements.
> 
> 
> >> This, in and of itself, would not address key management. There is
> >> probably a separate need for some kind of built in key management --
> >> perhaps a flexible way to integrate with external systems such as
> >> Vault for example, or maybe something self contained, or perhaps both.
> >
> > I agree to have a flexible way in order to address different
> > requirements. I thought that having a GUC parameter to which we store
> > a shell command to get encryption key is enough but considering
> > integration with various key managements seamlessly I think that we
> > need to have APIs for key managements. (fetching key, storing key,
> > generating key etc)
> 
> 
> I don't like the idea of yet another path for arbitrary shell code execution. 
> An API for extension code would be preferable.

Thank you for your advice on key management.
In fact, how to implement key management has been a big worry.
Basically, we will look at the KMIP rules, and I'll try to create an 
extension API that can work with most KMSs.

And I have a question.
You said you do not like the idea of another path for arbitrary shell code 
execution; is there any special reason?
For example, I think being able to specify a path to shell code would make it 
easy to use several KMSs. Is there a potential security issue?

Thank you and Best regards.
Moon.
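
Purely for illustration, an extension-based key-management API of the kind
being discussed here could expose a hook such as the one sketched below.  None
of these names exist in PostgreSQL today; they are made up for the sketch.

#include "postgres.h"

typedef struct KmsRoutine
{
	/* fetch key material for key_id into buf (buflen bytes); false on failure */
	bool		(*get_key) (const char *key_id, unsigned char *buf, int buflen);

	/* ask the KMS to create a new key and return its identifier */
	char	   *(*generate_key) (const char *key_name);

	/* wipe any cached key material, e.g. at backend exit */
	void		(*forget_keys) (void);
} KmsRoutine;

/* A hook the server would call; an extension would set it from _PG_init(). */
extern KmsRoutine *kms_routine_hook;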


> 
> 
> >> Or
> >> maybe key management is really tied into the separately discussed
> >> effort to create SQL VARIABLEs somehow.
> >
> > Could you elaborate on how key management is tied into SQL VARIABLEs?
> 
> Well, the key management probably is not, but the SQL VARIABLE might be where 
> the key is stored for use.
> 
> Joe
> 
> --
> Crunchy Data - http://crunchydata.com
> PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source 
> Development






RE: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2018-07-03 Thread Moon, Insung
Dear Tom Lane.

> -Original Message-
> From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Sent: Monday, June 18, 2018 11:52 PM
> To: Robert Haas
> Cc: Joe Conway; Masahiko Sawada; Moon, Insung; PostgreSQL-development
> Subject: Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key 
> Management Service (KMS)
> 
> Robert Haas  writes:
> > On Mon, Jun 18, 2018 at 10:12 AM, Joe Conway  wrote:
> >> Not necessarily. Our pages probably have enough predictable bytes to
> >> aid cryptanalysis, compared to user data in a column which might not
> >> be very predictable.
> 
> > Really?  I would guess that the amount of entropy in a page is WAY
> > higher than in an individual column value.
> 
> Depending on the specifics of the encryption scheme, having some amount of 
> known (or guessable) plaintext may allow breaking
> the cipher, even if much of the plaintext is not known.  This is cryptology 
> 101, really.
> 
> At the same time, having to have a bunch of independently-decipherable short 
> field values is not real secure either, especially
> if they're known to all be encrypted with the same key.  But what you know or 
> can guess about the plaintext in such cases
> would be target-specific, rather than an attack that could be built once and 
> used against any PG database.

Yes. If some of the encrypted data is known or guessable, there may be a 
possibility of decrypting the encrypted data.

But would it be safe to use a different encryption mode such as GCM or XTS to 
solve this problem?
(And do not reuse the same IV.)

Thank you and Best regards.
Moon.
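
As an illustration of "do not reuse the same IV": with XTS the tweak is
normally derived from the block's physical location, so no two pages ever
share a tweak (for GCM that alone would not be enough, because the nonce must
also change every time the same page is rewritten).  The 16-byte layout below
is made up for the sketch; only the PostgreSQL types are real.

#include "postgres.h"
#include "common/relpath.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

/* Sketch: build a per-page tweak from the page's physical address. */
static void
build_page_tweak(const RelFileNode *rnode, ForkNumber forknum,
				 BlockNumber blkno, unsigned char tweak[16])
{
	memset(tweak, 0, 16);
	memcpy(tweak, &rnode->dbNode, sizeof(Oid));
	memcpy(tweak + 4, &rnode->relNode, sizeof(Oid));
	memcpy(tweak + 8, &forknum, sizeof(ForkNumber));
	memcpy(tweak + 12, &blkno, sizeof(BlockNumber));
}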


> 
>   regards, tom lane






Re: Desirability of client-side expressions in psql?

2018-07-03 Thread Ashutosh Bapat
On Mon, Jun 25, 2018 at 1:29 AM, Fabien COELHO  wrote:
>
> I do not mind spending some time for elegance, but I mind spending time for
> nothing.
>
> I hope some committers will also express their views about the feature.
>

There's a commitfest entry for this but no patch, so I am marking this
as RWF (Returned with Feedback). The discussion here can continue. Please
submit a patch (maybe a PoC) to the commitfest whenever there is one.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: SQL/JSON: functions

2018-07-03 Thread Pavel Stehule
2018-07-03 14:30 GMT+02:00 Nikita Glukhov :

> Attached 16th version of the patches:
>  * changed type of new SQL keyword STRING
>(STRING is used as a function parameter name in Pl/Tcl tests)
>  * removed implicit coercion via I/O from JSON_VALUE (see below)
>
>
> On 28.06.2018 07:25, Pavel Stehule wrote:
>
> 2018-06-28 2:18 GMT+02:00 Nikita Glukhov :
>
>> On 15.03.2018 20:04, Nikita Glukhov wrote:
>>
>>> Attached 13th version of the patches:
>>>
>>> * Subtransactions in PG_TRY/CATCH in ExecEvalJsonExpr() were made
>>> unconditional,
>>>   regardless of the volatility of expressions.
>>>
>>> * PG_TRY/CATCH in ExecEvalExprPassingCaseValue() was removed along with
>>> the
>>>   entire function.
>>>
>>
>>
>> Attached 15th version of the patches:
>>  * disabled parallel execution of SQL/JSON query functions when internal
>>subtransactions are used (if ERROR ON ERROR is not specified)
>>  * added experimental optimization of internal subtransactions (see below)
>>
>>
>> The new patch #14 is an experimental attempt to reduce overhead of
>> subtransaction start/commit which can result in 2x-slowdown in the
>> simplest
>> cases.  By the idea of Alexander Korotkov, subtransaction is not really
>> committed if it has not touched the database and its XID has not been
>> assigned
>> (DB modification is not expected in type casts functions) and then can be
>> reused
>> when the next subtransaction is started.  So, all rows in JsonExpr can be
>> executed in the single cached subtransaction.  This optimization really
>> helps
>> to reduce overhead from 100% to 5-10%:
>>
>
> I read a technical report for SQL/JSON. If I understand it well, the ON
> ERROR clause is primarily related to structural errors, not to all errors.
>
> So your implementation may be too tolerant in this respect. There was no
> example showing that this clause should handle cast errors or any errors
> other than JSON structural ones.
>
> Playing with another implementation of subtransactions doesn't look like a
> safe way to go, especially if it is not necessary.
>
> The other possible errors are cast errors. We can introduce new
> exception-safe input functions. These functions could be interesting for
> fault-tolerant COPY, for example.
>
> SQL/JSON standard requires handling of cast errors too.
>
> I didn't mean anything else. But a cast (and in this case it is from JSON
to something else) can be exception safe.

Regards

Pavel


>
> 9.40 Casting an SQL/JSON sequence to an SQL type (pages 724-725):
>
> 4) If TEMPST is successful completion, then:
>   b) If the length of SEQ is 1 (one), then let I be the SQL/JSON item in SEQ.
> Case:
>   ...
> iii) Otherwise, let IDT be the data type of I.
>   Case:
>   1) If IDT cannot be cast to target type DT according to the Syntax Rules
>  of Subclause 6.13, "", then let TEMPST be data
>  exception — SQL/JSON item cannot be cast to target type.
>   2) Otherwise, let X be an SQL variable whose value is I. Let V be the
>  value of CAST (X AS DT).  If an exception condition is raised by this
>  , then let TEMPST be that exception condition.
>   ...
> 5) Case:
>   a) If TEMPST is successful completion, then let OUTST be successful
>  completion.
>   b) If ONERROR is ERROR, then let OUTST be TEMPST.
>   c) If ONERROR is NULL, then let V be the SQL null value and let OUTST be
>  successful completion.
>   d) If ONERROR immediately contains DEFAULT, then let VE be the
>   immediately contained in ONERROR. Let V be the value 
> of
>  CAST (VE AS DT)
> Case:
> i) If an exception condition is raised by this , then
>let OUTST be that exception condition.
> ii) Otherwise, let OUTST be successful completion.
>
>
> In 4.b.iii.1 it is said that there should be an error if the desired cast 
> does not exist.
> In the previous versions of the patches there was implicit coercion via I/O 
> here instead of an error, so I decided to fix that in the last version (the 
> fix is combined with a minor refactoring of ExecEvalJsonExpr()).
>
> --
> Nikita Glukhov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [HACKERS] plpgsql - additional extra checks

2018-07-03 Thread Tomas Vondra

On 03/20/2018 01:35 PM, Tomas Vondra wrote:



On 03/20/2018 05:36 AM, Pavel Stehule wrote:



2018-03-19 21:47 GMT+01:00 Tomas Vondra :

 Hi,

 I'm looking at the updated patch (plpgsql-extra-check-180316.patch), and
 this time it applies and builds OK. The one thing I noticed is that the
 documentation still uses the old wording for strict_multi_assignement:

 WARNING:  Number of evaluated fields does not match expected.
 HINT:  strict_multi_assignement check of extra_warnings is active.
 WARNING:  Number of evaluated fields does not match expected.
 HINT:  strict_multi_assignement check of extra_warnings is active.

 This was reworded to "Number of source and target fields in assignment
 does not match."


fixed



OK, thanks. PFA I've marked it as ready for committer.



Stephen, what are your thoughts about this patch? I remember discussing 
it with you at pgcon, but I don't recall what exactly your complaints 
were. Do you see any problems with the current version of the patch?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Server crashed with dense_rank on partition table.

2018-07-03 Thread Andres Freund
On 2018-06-13 16:35:58 +0900, Amit Langote wrote:
> Hi.
> 
> On 2018/06/13 14:55, Michael Paquier wrote:
> > On Wed, Jun 13, 2018 at 11:08:38AM +0530, Rajkumar Raghuwanshi wrote:
> >> postgres=# SELECT dense_rank(b) WITHIN GROUP (ORDER BY a) FROM pagg_tab
> >> GROUP BY b ORDER BY 1;
> >> server closed the connection unexpectedly
> >> This probably means the server terminated abnormally
> >> before or while processing the request.
> >> The connection to the server was lost. Attempting reset: Failed.
> > 
> > Indeed, thanks for the test case.  This used to work in v10 but this is
> > failing with v11 so I am adding an open item.  The plans of the pre-10
> > query and the query on HEAD are rather similar, and the memory context
> > at execution time looks messed up.
> 
> Fwiw, I see that the crash can also occur even when using a
> non-partitioned table in the query, as shown in the following example
> which reuses Rajkumar's test data and query:
> 
> create table foo (a int, b int, c text);
> postgres=# insert into foo select i%20, i%30, to_char(i%12, 'FM') from
> generate_series(0, 36) i;
> 
> select dense_rank(b) within group (order by a) from foo group by b order by 1;
> server closed the connection unexpectedly
>   This probably means the server terminated abnormally
>   before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> 
> Following query in the regression test suite can also be made to crash by
> adding a group by clause:
> 
> select dense_rank(3) within group (order by x) from (values
> (1),(1),(2),(2),(3),(3),(4)) v(x) group by (x);
> server closed the connection unexpectedly
>   This probably means the server terminated abnormally
>   before or while processing the request.
> The connection to the server was lost. Attempting reset: Failed.
> 
> Looking at the core dump of this, it seems the following commit may be
> relevant:
> 
> commit bf6c614a2f2c58312b3be34a47e7fb7362e07bcb
> Author: Andres Freund 
> Date:   Thu Feb 15 21:55:31 2018 -0800
> 
> Do execGrouping.c via expression eval machinery, take two.

Andres, with RMT hat on: Andres, this needs looking at ASAP.
Andres, without RMT hat on: Oh, I had first missed it, and then was
  distracted reviewing pluggable storage.
Andres, with RMT hat on: that's not really an excuse
Andres, without RMT hat on: sorry, will start looking now.

Greetings,

Andres Freund



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-03 Thread Andres Freund
Hi Ashutosh, Etsuro, Robert,

On 2018-06-22 10:58:28 -0400, Robert Haas wrote:
> On Tue, Jun 19, 2018 at 8:46 AM, Etsuro Fujita
>  wrote:
> > Here is a patch for that.
> >
> > * As I said upthread, the patch makes code much more simple; I removed all
> > the changes to setrefs.c added by the partitionwise-join patch.  I also
> > simplified the logic for building a tlist for a child-join rel. The original
> > PWJ computes attr_needed data even for child rels, and build the tlist for a
> > child-join by passing to build_joinrel_tlist that data for input child rels
> > for the child-join.  But I think that's redundant, and it's more
> > straightforward to apply adjust_appendrel_attrs to the parent-join's tlist
> > to get the child-join's tlist.  So, I changed that way, which made
> > unnecessary all the changes to build_joinrel_tlist and placeholder.c added
> > by the PWJ patch, so I removed those as well.
> >
> > * The patch contains all of the regression tests in the original patch
> > proposed by Ashutosh.
> 
> I think this approach is going to run into trouble if the level at
> which we have to apply the ConvertRowTypeExpr happens not to be a
> projection-capable node.

What's the plan forward here? This has been an open item for quite a
while. Robert, are you in agreement with this approach on a high level?
Who is going to drive this forward?

Greetings,

Andres Freund



Re: Non-reserved replication slots and slot advancing

2018-07-03 Thread Andres Freund
Hi,

On 2018-06-26 16:13:05 +0900, Michael Paquier wrote:
> I have been chewing for the last couple of days on this email from
> Horiguchi-san:
> https://www.postgresql.org/message-id/20180622.163312.254556300.horiguchi.kyot...@lab.ntt.co.jp
> 
> As summarized, it is actually strange to be able to advance a slot which
> has a non-reserved restart_lsn.  For example, take that which can happen
> on HEAD:
> =# select pg_create_physical_replication_slot('toto');
>  pg_create_physical_replication_slot
> -
>  (toto,)
> (1 row)
> =# select pg_replication_slot_advance('toto', '0/1');
>  pg_replication_slot_advance
> -
>  (toto,0/1)
> (1 row)
> =# select slot_name, restart_lsn from pg_replication_slots ;
>  slot_name | restart_lsn
> ---+-
>  toto  | 0/1
> (1 row)

I'm not clear to why this is a problem? Seems like either behaviour can
be argued for. I don't really have an opinion either way. I'd just
remove the item from the open items list, I don't think we need to hold
up the release for it?

Greetings,

Andres Freund



Re: Failed assertion due to procedure created with SECURITY DEFINER option

2018-07-03 Thread Andres Freund
On 2018-06-29 10:19:17 -0700, Andres Freund wrote:
> Hi,
> 
> On 2018-06-29 13:56:12 +0200, Peter Eisentraut wrote:
> > On 6/29/18 13:07, amul sul wrote:
> > > This happens because of in fmgr_security_definer() function we are
> > > changing  global variable SecurityRestrictionContext and in the
> > > StartTransaction() insisting it should be zero, which is the problem.
> > 
> > Hmm, what is the reason for this insistation?
> 
> Because it's supposed to be reset by AbortTransaction(), after an error.

Does that make sense Peter?

I've added this thread to the open items list.

Greetings,

Andres Freund



Re: Non-reserved replication slots and slot advancing

2018-07-03 Thread Alvaro Herrera
On 2018-Jul-03, Andres Freund wrote:

> I'm not clear to why this is a problem? Seems like either behaviour can
> be argued for. I don't really have an opinion either way. I'd just
> remove the item from the open items list, I don't think we need to hold
> up the release for it?

After reading this more carefully, isn't the problem that as soon as you
get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no
longer happens?  That does sound like a bad thing to me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Non-reserved replication slots and slot advancing

2018-07-03 Thread Andres Freund
On 2018-07-03 13:23:50 -0400, Alvaro Herrera wrote:
> On 2018-Jul-03, Andres Freund wrote:
> 
> > I'm not clear to why this is a problem? Seems like either behaviour can
> > be argued for. I don't really have an opinion either way. I'd just
> > remove the item from the open items list, I don't think we need to hold
> > up the release for it?
> 
> After reading this more carefully, isn't the problem that as soon as you
> get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no
> longer happens?  That does sound like a bad thing to me.

Fair enough, but that's what a plain slot allows you as well, pretty
fundamentally, no? The precise point at which recycling will be blocked
will differer, sure.

Greetings,

Andres Freund



Re: Non-reserved replication slots and slot advancing

2018-07-03 Thread Alvaro Herrera
On 2018-Jul-03, Andres Freund wrote:

> On 2018-07-03 13:23:50 -0400, Alvaro Herrera wrote:
> > On 2018-Jul-03, Andres Freund wrote:
> > 
> > > I'm not clear to why this is a problem? Seems like either behaviour can
> > > be argued for. I don't really have an opinion either way. I'd just
> > > remove the item from the open items list, I don't think we need to hold
> > > up the release for it?
> > 
> > After reading this more carefully, isn't the problem that as soon as you
> > get a slot into the 0/1 restart_lsn state, WAL recycling/deletion no
> > longer happens?  That does sound like a bad thing to me.
> 
> Fair enough, but that's what a plain slot allows you as well, pretty
> fundamentally, no? The precise point at which recycling will be blocked
> will differer, sure.

Yeah, well, I suppose that other mechanisms to use slots are less of a
foot-gun -- by creating one their start_lsn is set to some reasonable
value.  With slot advancing, it seems easier to get into trouble.  Of
course, you can set the slot to a LSN that is valid now, and then not do
anything with it, in which case you're also screwed.

As I recall, this slot advancing business is new in pg11, and I think it
makes sense to provide a decent API that prevents you from doing
something extremely stupid.

Getting this fixed is +0.2 from me -- I'm not really on the side of this
being a severe bug as all that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: AtEOXact_ApplyLauncher() and subtransactions

2018-07-03 Thread Robert Haas
On Tue, Jun 26, 2018 at 6:25 AM, Amit Khandekar  wrote:
> Added this into the July 2018 commitfest :
>
> https://commitfest.postgresql.org/18/1696/

It seems to me that it would probably be better to separate this into
two patches, because I think there are really two separate issues.
With regard to the lack of proper subtransaction handling, I think it
would be best if we could avoid introducing a new AtSubStart function.
I wrote a patch for this issue that works that uses a slightly
different kind of stack than yours, which I have attached to this
email, and it creates stack levels lazily so that it doesn't need an
AtSubStart function. It would probably also be possible to adapt your
patch to create new stack levels on demand instead of at
subtransaction start.  I'm not sure which approach is better, but I do
think it would be best not to use your patch as you have it now,
because that does unnecessary work at the beginning and end of every
subtransaction if there is an ALTER SUBSCRIPTION command pending at an
outer level, even though the subtransaction may never touch the
subscription state.
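
To illustrate the "create stack levels lazily" idea in the abstract, here is a
minimal sketch.  The entry layout and names below are invented for the
illustration and are not the ones used in either attached patch; only the
list, memory-context and xact.h calls are existing PostgreSQL interfaces.

#include "postgres.h"
#include "access/xact.h"		/* GetCurrentTransactionNestLevel() */
#include "nodes/pg_list.h"
#include "utils/memutils.h"		/* TopTransactionContext */

typedef struct StopWorkersLevel
{
	int			nest_level;		/* subtransaction nesting depth */
	List	   *workers;		/* workers to stop if this level commits */
} StopWorkersLevel;

static List *stop_workers_stack = NIL;	/* newest level first */

/* Called only when subscription state is actually touched. */
static StopWorkersLevel *
get_level_for_current_subxact(void)
{
	int			level = GetCurrentTransactionNestLevel();
	StopWorkersLevel *top = (stop_workers_stack == NIL) ? NULL :
		(StopWorkersLevel *) linitial(stop_workers_stack);

	/* Push a new entry only when this nesting level first needs one. */
	if (top == NULL || top->nest_level < level)
	{
		MemoryContext oldcxt = MemoryContextSwitchTo(TopTransactionContext);

		top = palloc0(sizeof(StopWorkersLevel));
		top->nest_level = level;
		top->workers = NIL;
		stop_workers_stack = lcons(top, stop_workers_stack);
		MemoryContextSwitchTo(oldcxt);
	}

	return top;
}

Untouched subtransactions never allocate anything here; at subcommit the top
entry, if it belongs to the current level, would simply be merged into the
entry below it (or into the top-level state), and at subabort it would be
discarded.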

As for the other part of your fix, which I think basically boils down
to comparing the final states instead of just looking at what got
changed, the logic looks complicated and I don't think I fully
understand it, but here are a few comments.

+   subrelids = GetSubscriptionRelids(sub->oid,
+   committed_subrels ?
+   CurrentMemoryContext : TopTransactionContext);

This looks ugly and dangerous.  In the first place, if
GetSubscriptionRelids() needs to work in one of several memory
contexts, the best thing would probably be for the caller to be
responsible for saving and restoring the memory context as needed,
rather than passing it as an argument.  Secondly, it's not very clear
why we need to do this.  The comment says we have to do it, but it
doesn't give a reason.

+* Merge the current list into the immediate parent.
+* So say, parent has sub1(tab1, tab2), sub2(tab2, tab3),
+* and current on_commit_workers has sub2(tab4) and sub3(tab1),
+* then the merged list will have:
+* sub1(tab1, tab2), sub2(tab4), sub3(tab1)

I don't think this is very clear.  Also, why is that the correct
answer?  Why not sub2(tab2, tab3, tab4)?

+   foreach(lc, on_commit_stop_workers)
+   {
+   SubscriptionRels *subrels = lfirst(lc);
+   ListCell *lc1;
+
+   /* Search this subrel in the subrels of the top of stack. */
+   foreach(lc1, subtrans_stop_workers->stop_workers)

This might be very expensive if both lists are long.  I guess that's
probably not very likely.  You would need to have modify a lot of
subscriptions and then, within a subtransaction, modify a lot of
subscriptions again.

+   foreach(lc, committed_subrels_list)
+   {
+   SubscriptionRels *subrels = (SubscriptionRels *) lfirst(lc);
+
+   if (sub->oid == subrels->subid)
+   {
+   committed_subrels = subrels;
+   break;
+   }
+   }

This looks to be O(n^2) in the number of subscriptions modified in a
single transaction.

Like Peter, I'm not entirely sure how much we should worry about this
problem.  However, if we're going to worry about it, I wonder if we
should worry harder and try to come up with a better data structure
than a bunch of linked lists.  Maybe a hash table indexed by subid?
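
For illustration, the kind of structure suggested here could be a dynahash
table keyed by the subscription OID.  The entry layout and function names
below are made up; only the hsearch.h calls are existing interfaces.

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"

typedef struct SubXactSubState
{
	Oid			subid;			/* hash key */
	List	   *relids;			/* tables whose workers must be stopped */
} SubXactSubState;

static HTAB *
create_subid_table(void)
{
	HASHCTL		ctl;

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(SubXactSubState);
	ctl.hcxt = TopTransactionContext;

	return hash_create("subscription state by subid", 16, &ctl,
					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}

/* Lookup-or-create becomes O(1) per subscription instead of a list scan. */
static SubXactSubState *
get_subid_entry(HTAB *tab, Oid subid)
{
	bool		found;
	SubXactSubState *entry;

	entry = (SubXactSubState *) hash_search(tab, &subid, HASH_ENTER, &found);
	if (!found)
		entry->relids = NIL;

	return entry;
}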

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


apply-launcher-subtrans-rmh.patch
Description: Binary data


Cache invalidation after authentication (on-the-fly role creation)

2018-07-03 Thread Thomas Munro
Hello hackers,

I'd like to do this to postinit.c:

PerformAuthentication(MyProcPort);
+   AcceptInvalidationMessages();
InitializeSessionUserId(username, useroid);

Any objections?  Motivation:

Many people use custom scripts, ldap2pg or other similar tools to
synchronise or manage their PostgreSQL roles from an LDAP server.
Doing that in a cron job works pretty well, but you have to wait for
the next cron job run after you make a change.  Based on several
requests from the field (basically, ex-SQL Server shops, where they
love Active Directory group-based authorisation), I started wondering
if it would be crazy to do incremental per-user synchronisation at
the moment of authentication.  New hooks don't seem to be necessary,
since PAM is designed for things like that -- for example creating
your Unix home directory on demand.  But I ran into one small problem:
PostgreSQL doesn't see catalog changes that happened during
authentication because of caching.

For example, you might put this in pg_hba.conf:

host all all all pam pamservice=postgresql

... and then put this in /etc/pam.d/postgresql:

auth required pam_ldap.so config=/path/to/ldap.conf
auth required pam_exec.so /path/to/ldap2pg --sync-pam-user
account required pam_permit.so

That's a fictional ldap2pg option that would synchronise only the user
passed to it in $PAM_USER.  A high performance version could do
authentication and role synchronisation at the same time (perhaps
using the LDAP server's change notification/counter to skip having to
connect back to PostgreSQL to inspect catalogs in the common case that
nothing changed).

If you try things like that today, it works but any roles created
during authentication are (sometimes?) not visible to PostgreSQL until
your *next* attempt to log in.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Non-reserved replication slots and slot advancing

2018-07-03 Thread Michael Paquier
On Tue, Jul 03, 2018 at 01:51:48PM -0400, Alvaro Herrera wrote:
> Getting this fixed is +0.2 from me -- I'm not really on the side of this
> being a severe bug as all that.

This can also cause incorrect results for clients querying
pg_replication_slots when measuring bloat in pg_wal/, hence as this
thing is new I would think that restricting the API first, and then
perhaps relaxing it with future improvements makes the most sense. 
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] [PATCH] kNN for SP-GiST

2018-07-03 Thread Nikita Glukhov

Attached 5th version of the patches, where minor refactoring of distance
handling was done (see below).

On 02.07.2018 19:06, Alexander Korotkov wrote:


Hi!

On Fri, Jun 29, 2018 at 5:37 PM Nikita Glukhov  wrote:

On 06.03.2018 17:30, David Steele wrote:


I agree with Andres.  Pushing this patch to the next CF.

Attached 4th version of the patches rebased onto the current master.
Nothing interesting has changed from the previous version.

I took a look to this patchset.  In general, it looks good for me, but
I've following notes about it.

* We're going to replace scan stack with pairing heap not only for
KNN-search, but for regular search as well.  Did you verify that it
doesn't cause regression for regular search case, because insertion
into pairing heap might be slower than insertion into stack?  One may
say that it didn't cause a regression in GiST, and that's why it
shouldn't cause regression in SP-GiST.  However, SP-GiST trees might
be much higher and these performance aspects might be different.


I decided to bring back the List-based scan stack in the previous version
of the patches, and here is what spgAddSearchItemToQueue() looks like now:

static void
spgAddSearchItemToQueue(SpGistScanOpaque so, SpGistSearchItem *item)
{
if (so->queue)
pairingheap_add(so->queue, &item->phNode);
else
so->scanStack = lcons(item, so->scanStack);
}

so->queue is initialized in spgrescan() only for ordered searches.


* I think null handling requires better explanation. Nulls are
specially handled in pairingheap_SpGistSearchItem_cmp().  At the same
time you explicitly set infinity distances for leaf nulls.  You
probably have reasons to implement it this way, but I think this
should be explained.  Also isnull property of SpGistSearchItem doesn't
have comment.


Distances for NULL items are expected to be NULL (this would not be true for
non-strict distance operators, so we do not seem to support them here),
and so NULL items are considered to be greater than any non-NULL items in
pairingheap_SpGistSearchItem_cmp().  Distances are copied into SpGistSearchItem
only if it is a non-NULL item.  Also, in the last version of the patch I have
introduced spgAllocSearchItem(), which conditionally allocates the distance
array in SpGistSearchItem.  Now infinity distances are used in only one place,
but if we require that innerConsistent() always return distances, then
we can completely get rid of the so->infDistances field.



* I think KNN support should be briefly described in
src/backed/access/spgist/README.


A minimal description written by the original author Vlad Sterzhanov is 
already present in the README.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index f57380a..a4345ef 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -41,7 +41,6 @@ enum path_delim
 static int	point_inside(Point *p, int npts, Point *plist);
 static int	lseg_crossing(double x, double y, double px, double py);
 static BOX *box_construct(double x1, double x2, double y1, double y2);
-static BOX *box_fill(BOX *result, double x1, double x2, double y1, double y2);
 static bool box_ov(BOX *box1, BOX *box2);
 static double box_ht(BOX *box);
 static double box_wd(BOX *box);
@@ -451,7 +450,7 @@ box_construct(double x1, double x2, double y1, double y2)
 
 /*		box_fill		-		fill in a given box struct
  */
-static BOX *
+BOX *
 box_fill(BOX *result, double x1, double x2, double y1, double y2)
 {
 	if (x1 > x2)
@@ -482,7 +481,7 @@ box_fill(BOX *result, double x1, double x2, double y1, double y2)
 /*		box_copy		-		copy a box
  */
 BOX *
-box_copy(BOX *box)
+box_copy(const BOX *box)
 {
 	BOX		   *result = (BOX *) palloc(sizeof(BOX));
 
diff --git a/src/include/utils/geo_decls.h b/src/include/utils/geo_decls.h
index a589e42..94806c2 100644
--- a/src/include/utils/geo_decls.h
+++ b/src/include/utils/geo_decls.h
@@ -182,6 +182,7 @@ typedef struct
 extern double point_dt(Point *pt1, Point *pt2);
 extern double point_sl(Point *pt1, Point *pt2);
 extern double pg_hypot(double x, double y);
-extern BOX *box_copy(BOX *box);
+extern BOX *box_copy(const BOX *box);
+extern BOX *box_fill(BOX *box, double xlo, double xhi, double ylo, double yhi);
 
 #endif			/* GEO_DECLS_H */
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index c4e8a3b..9279756 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -14,9 +14,9 @@
  */
 #include "postgres.h"
 
+#include "access/genam.h"
 #include "access/gist_private.h"
 #include "access/relscan.h"
-#include "catalog/pg_type.h"
 #include "miscadmin.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -543,7 +543,6 @@ getNextNearest(IndexScanDesc scan)
 {
 	GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
 	bool		res = false;
-	int			i;
 
 	if (scan->xs_hitup)
 	{
@@ -564,45 +

Re: Bulk Insert into PostgreSQL

2018-07-03 Thread Srinivas Karthik V
@Peter: I was indexing the primary key of all the tables in TPC-DS. Some of
the fact tables have multiple columns as part of the primary key. Also, most
of them are of numeric type.

On Mon, Jul 2, 2018 at 7:09 AM, Peter Geoghegan  wrote:

> On Sun, Jul 1, 2018 at 5:19 PM, Tsunakawa, Takayuki
>  wrote:
> > 400 GB / 15 hours = 7.6 MB/s
> >
> > That looks too slow.  I experienced a similar slowness.  While our user
> tried to INSERT (not COPY) a billion records, they reported INSERTs slowed
> down by 10 times or so after inserting about 500 million records.  Periodic
> pstack runs on Linux showed that the backend was busy in btree operations.
> I didn't pursue the cause due to other businesses, but there might be
> something to be improved.
>
> What kind of data was indexed? Was it a bigserial primary key, or
> something else?
>
> --
> Peter Geoghegan
>


Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-03 Thread Alvaro Herrera
On 2018-Jul-04, Thomas Munro wrote:

> Hello hackers,
> 
> I'd like to do this to postinit.c:
> 
> PerformAuthentication(MyProcPort);
> +   AcceptInvalidationMessages();
> InitializeSessionUserId(username, useroid);
> 
> Any objections?

Is there a measurable performance overhead to this change?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PATCH: pgbench - option to build using ppoll() for larger connection counts

2018-07-03 Thread Andrew Dunstan




On 05/17/2018 01:23 AM, Thomas Munro wrote:

On Tue, Mar 27, 2018 at 9:23 AM, Rady, Doug  wrote:

pgbench11-ppoll-v12.patch

Hi Doug,

FYI this patch is trying and failing to use ppoll() on Windows:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.30




It's still failing -  see 



I'm setting this back to "Waiting on Author" until that's fixed.

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-03 Thread Andres Freund
Hi,

On 2018-07-03 19:44:21 -0400, Alvaro Herrera wrote:
> On 2018-Jul-04, Thomas Munro wrote:
> 
> > Hello hackers,
> > 
> > I'd like to do this to postinit.c:
> > 
> > PerformAuthentication(MyProcPort);
> > +   AcceptInvalidationMessages();
> > InitializeSessionUserId(username, useroid);
> > 
> > Any objections?
> 
> Is there a measurable performance overhead to this change?

I can't see it being relevant here. We accept inval messages in plenty
of other places, and in comparison to session startup it should be just
about immeasurable.

Greetings,

Andres Freund



How can we submit code patches that implement our (pending) patents?

2018-07-03 Thread Tsunakawa, Takayuki
Hello,

As I asked at the PGCon developer meeting this year, we'd like to offer our 
company's patents and patent applications to the PostgreSQL community under a 
license free of charge.  If I heard correctly at that time, we could continue 
this discussion during the unconference, but I missed that opportunity (I'm 
sorry).  So, please let me continue the consultation here.  If some other 
mailing list is more appropriate, such as pgsql-core, let me know (but I hope 
open discussion will lead to better and fairer ideas and conclusions).


There are three ideas.  Would any of them be effective?



(1) 
Share patents through a patent pool for open source communities.  Our company, 
Fujitsu Limited, is a licensee member of Open Invention Network (OIN).  And 
PostgreSQL is a protected target of OIN, as listed here:

http://www.openinventionnetwork.com/joining-oin/linux-system/linux-system-table/?cat_id=14&type=table

Google, IBM, Red Hat, Toyota, and other big names are the big sponsors.  The 
basic membership is free.


(2)
For each patch we submit to the community that implements our patents, include 
in the mail body something like "3. Grant of Patent License" in Apache License 
2.0:

Apache License, Version 2.0
https://www.apache.org/licenses/LICENSE-2.0


(3)
Put up a page on our company web site that's similar to Red Hat's "Patent 
Promise", which is restricted to PostgreSQL.


FYI, I've consulted SFLC (Software Freedom Law Center) about this matter, and 
I've just got a reply "We'll be in touch."  I'm waiting for the next response.


Regards
Takayuki Tsunakawa





Re: Bulk Insert into PostgreSQL

2018-07-03 Thread Ashwin Agrawal
On Sat, Jun 30, 2018 at 6:27 AM Craig Ringer  wrote:

>
> You can also gain a bit by running with wal_level = minimal. On newer
> version you can use UNLOGGED tables then convert them to logged, but that
> won't be an option for 9.4.
>

Curious to know more about this: is it also faster with a standby, or can
this option be faster only without a standby?


Re: How can we submit code patches that implement our (pending) patents?

2018-07-03 Thread Craig Ringer
On 4 July 2018 at 08:27, Tsunakawa, Takayuki  wrote:

> Hello,
>
> As I asked at the PGCon developer meeting this year, we'd like to offer
> our company's patents and patentapplications license to the
> PostgreSQL community free of charge.  If I heard correctly at that time, we
> could continue this discussion during the unconference, but I missed that
> opportunity (I'm sorry).  So,  please let me continue the consultation
> here.  If some other mailing list is appropriate such as pgsql-core, let me
> know (but I hope open discussion will lead to better and fair ideas and
> conclusion.)
>
>
> There are three ideas.  Is there any effective idea?
>

My big hesitation with all those approaches is that they seem to exclude
derivative and transformative works.

PostgreSQL is BSD-licensed. Knowingly implementing patented work with a
patent grant scoped to PostgreSQL would effectively change that license,
require that derivatives identify and remove the patented parts, or require
that derivatives license them.

I'm assuming you don't want to offer a grant that lets anyone use them for
anything. But if you have a really broad grant to PostgreSQL, all someone
would have to do to inherit the grant is re-use some part of PostgreSQL.

I guess there's a middle ground somewhere that protects substantial
derivatives and extracts but stops you using some Pg code snippets as a
freebie license.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Possible bug in logical replication.

2018-07-03 Thread Michael Paquier
On Tue, Jul 03, 2018 at 01:17:48AM -0400, Alvaro Herrera wrote:
> Let me review tomorrow.

Of course, please feel free.
--
Michael


signature.asc
Description: PGP signature


Re: Non-reserved replication slots and slot advancing

2018-07-03 Thread Michael Paquier
On Tue, Jul 03, 2018 at 01:51:48PM -0400, Alvaro Herrera wrote:
> On 2018-Jul-03, Andres Freund wrote:
>> Fair enough, but that's what a plain slot allows you as well, pretty
>> fundamentally, no? The precise point at which recycling will be blocked
>> will differer, sure.

ReplicationSlotsComputeRequiredLSN() is careful enough to discard slots
which have their restart_lsn set to InvalidXLogRecPtr, so they are not
accounted within the minimum LSN calculated for segment retention.  Any
fake value added by a user advancing a non-reserved slot is.

In the end, are there any objections to fixing the issue and
tightening the advancing API?
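
For illustration, the tightened API could amount to a check like the one
below early in pg_replication_slot_advance(); the error wording is only a
suggestion.

#include "postgres.h"
#include "replication/slot.h"

static void
check_slot_has_reserved_wal(ReplicationSlot *slot)
{
	if (XLogRecPtrIsInvalid(slot->data.restart_lsn))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("cannot advance replication slot that has not previously reserved WAL")));
}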
--
Michael


signature.asc
Description: PGP signature


Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-03 Thread Etsuro Fujita

(2018/07/04 1:35), Andres Freund wrote:

On 2018-06-22 10:58:28 -0400, Robert Haas wrote:

On Tue, Jun 19, 2018 at 8:46 AM, Etsuro Fujita
  wrote:

Here is a patch for that.



I think this approach is going to run into trouble if the level at
which we have to apply the ConvertRowTypeExpr happens not to be a
projection-capable node.


What's the plan forward here?


I still think that this approach would be the right way to go, so I plan 
to polish the patch.



This has been an open item for quite a
while. Robert, are you in agreement with this approach on a high level?
Who is going to drive this forward?


I'm willing to do that if it's OK, but I'd like to get more feedback 
from Robert, Ashutosh or anyone else.


Best regards,
Etsuro Fujita



RE: libpq example doesn't work

2018-07-03 Thread Ideriha, Takeshi



>-Original Message-
>From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
>Sent: Sunday, July 1, 2018 9:17 PM
>To: Ideriha, Takeshi/出利葉 健 ; pgsql-hackers
>
>Subject: Re: libpq example doesn't work
>
>On 27.06.18 08:29, Ideriha, Takeshi wrote:
>> When I tried to use libpq, I found that libpq example[1] does not work.
>> That's because SELECT pg_catalog.set_config() never returns
>> PGRES_COMMAND_OK which is expected, but actually returns PGRES_TUPLES_OK.
>
>Fixed.  There were additional similar cases that I fixed also.
>
>--
>Peter Eisentraut  http://www.2ndQuadrant.com/
>PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Thank you for the commit. I totally missed the large object examples.

Regards,
Takeshi Ideriha




move PartitionDispatchData definition to execPartition.c

2018-07-03 Thread Amit Langote
Hi.

I think we may have simply forgotten to do $subject in the following commit.

commit da6f3e45ddb68ab3161076e120e7c32cfd46d1db
Author: Alvaro Herrera 
Date:   Sat Apr 14 21:12:14 2018 -0300

Reorganize partitioning code

Attached a patch.

Thanks,
Amit
From 82113b74c2b385eb66be0c6be6e47aa639eaa3e7 Mon Sep 17 00:00:00 2001
From: amit 
Date: Wed, 4 Jul 2018 11:34:30 +0900
Subject: [PATCH v1] Move PartitionDispatchData definition to execPartition.c.

It has no business being exposed for the whole world to see what's
inside it.
---
 src/backend/executor/execPartition.c | 28 
 src/include/executor/execPartition.h | 30 +-
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 7a4665cc4e..8b7342ae52 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,6 +31,34 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+/*---
+ * PartitionDispatch - information about one partitioned table in a partition
+ * hierarchy required to route a tuple to one of its partitions
+ *
+ *	reldesc		Relation descriptor of the table
+ *	key			Partition key information of the table
+ *	keystate	Execution state required for expressions in the partition key
+ *	partdesc	Partition descriptor of the table
+ *	tupslot		A standalone TupleTableSlot initialized with this table's tuple
+ *				descriptor
+ *	tupmap		TupleConversionMap to convert from the parent's rowtype to
+ *				this table's rowtype (when extracting the partition key of a
+ *				tuple just before routing it through this table)
+ *	indexes		Array with partdesc->nparts members (for details on what
+ *				individual members represent, see how they are set in
+ *				get_partition_dispatch_recurse())
+ *---
+ */
+typedef struct PartitionDispatchData
+{
+	Relation	reldesc;
+	PartitionKey key;
+	List	   *keystate;	/* list of ExprState */
+	PartitionDesc partdesc;
+	TupleTableSlot *tupslot;
+	TupleConversionMap *tupmap;
+	int		   *indexes;
+} PartitionDispatchData;
 
 static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 int 
*num_parted, List **leaf_part_oids);
diff --git a/src/include/executor/execPartition.h 
b/src/include/executor/execPartition.h
index 862bf65060..8d6500dcfc 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -18,35 +18,7 @@
 #include "nodes/plannodes.h"
 #include "partitioning/partprune.h"
 
-/*---
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
- *
- *	reldesc		Relation descriptor of the table
- *	key			Partition key information of the table
- *	keystate	Execution state required for expressions in the partition key
- *	partdesc	Partition descriptor of the table
- *	tupslot		A standalone TupleTableSlot initialized with this table's tuple
- *				descriptor
- *	tupmap		TupleConversionMap to convert from the parent's rowtype to
- *				this table's rowtype (when extracting the partition key of a
- *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
- *---
- */
-typedef struct PartitionDispatchData
-{
-	Relation	reldesc;
-	PartitionKey key;
-	List	   *keystate;	/* list of ExprState */
-	PartitionDesc partdesc;
-	TupleTableSlot *tupslot;
-	TupleConversionMap *tupmap;
-	int		   *indexes;
-} PartitionDispatchData;
-
+/* See execPartition.c for the definition. */
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*---
-- 
2.11.0



Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-03 Thread Thomas Munro
On Wed, Jul 4, 2018 at 12:10 PM, Andres Freund  wrote:
> On 2018-07-03 19:44:21 -0400, Alvaro Herrera wrote:
>> On 2018-Jul-04, Thomas Munro wrote:
>> > PerformAuthentication(MyProcPort);
>> > +   AcceptInvalidationMessages();
>> > InitializeSessionUserId(username, useroid);
>> >
>> > Any objections?
>>
>> Is there a measurable performance overhead to this change?
>
> I can't see it being relevant here. We accept inval messages in plenty
> of other places, and in comparison to session startup it should be just
> about immeasurable.

Yeah, using "pgbench -c 8 -j 8 -T 60 --connect -S -M prepared
postgres" I wasn't able to measure a significant difference on my
laptop.  The performance was equally terrible, at around 940 TPS +/-
10 including connection time.  Adding to open commitfest.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Call-AcceptInvalidationMessages-after-authenticating.patch
Description: Binary data


Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-03 Thread Andres Freund
On 2018-07-04 16:25:18 +1200, Thomas Munro wrote:
> @@ -745,6 +746,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char 
> *username,
>   /* normal multiuser case */
>   Assert(MyProcPort != NULL);
>   PerformAuthentication(MyProcPort);
> + AcceptInvalidationMessages();
>   InitializeSessionUserId(username, useroid);
>   am_superuser = superuser();

FWIW, a comment explaining why it's done there seems appropriate.

- Andres
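
For illustration only, the kind of comment being asked for could read roughly
like this; the surrounding lines are from the proposed change, and the comment
wording is just a suggestion:

	/* normal multiuser case */
	Assert(MyProcPort != NULL);
	PerformAuthentication(MyProcPort);

	/*
	 * Process any catalog changes that were made while authentication ran
	 * (for example, roles created on the fly by a PAM module), so that the
	 * login role lookup below sees them.
	 */
	AcceptInvalidationMessages();
	InitializeSessionUserId(username, useroid);
	am_superuser = superuser();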



Re: Cache invalidation after authentication (on-the-fly role creation)

2018-07-03 Thread Michael Paquier
On Wed, Jul 04, 2018 at 04:25:18PM +1200, Thomas Munro wrote:
> Yeah, using "pgbench -c 8 -j 8 -T 60 --connect -S -M prepared
> postgres" I wasn't able to measure a significant difference on my
> laptop.  The performance was equally terrible, at around 940 TPS +/-
> 10 including connection time.  Adding to open commitfest.

I wanted to comment on that this morning but forgot as my mind was
driven away by another problem.  What if you used Julien Rouhaud's
method of a custom script with only ";" as the query, together with -c?
This won't run any queries, and will stress authentication.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] WAL logging problem in 9.4.3?

2018-07-03 Thread Michael Paquier
On Fri, Mar 30, 2018 at 10:06:46AM +0900, Kyotaro HORIGUCHI wrote:
> Hello.  I found that c203d6cf81 hit this and this is the rebased
> version on the current master.

Okay, as this is visibly the oldest item in this commit fest, Andrew has
asked me to look at a solution which would allow us to definitely close
the loop for all maintained branches.  In consequence, I have been
looking at this problem.  Here are my thoughts:
- The set of errors reported on this thread are alarming, depending on
the scenarios used, we could have "could not read file" stuff, or even
data loss after WAL replay comes and wipes out everything.
- Disabling completely the TRUNCATE optimization is definitely not cool,
as there could be an impact for users.
- Removing wal_level = minimal is not acceptable as well, as some people
rely on this feature.
- Rewriting the sync handling of heap relation files in an invasive way
may be something to investigate and improve on HEAD (I am not really
convinced about that actually for the optimizations discussed on this
thread as this may result in more bugs than actual fixes), but that
would do nothing for back-branches.

Hence I propose the patch attached which disables the TRUNCATE and COPY
optimizations for two cases, which are the ones actually causing
problems.  One solution has been presented by Simon here for COPY, which
is to disable the optimization when there are no blocks on a relation
with wal_level = minimal:
https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=yetxvmuu1yk4umxqy2g...@mail.gmail.com
For back-patching, I find that really appealing.

The second thing that the patch attached does is to tweak
ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
wal_level = minimal.

Another thing that this patch adds is a set of regression tests to
stress all the various scenarios presented on this thread with table
creation, INSERT, COPY and TRUNCATE running in the same transactions for
both wal_level = minimal and replica, which make sure that there are no
failures and no actual data loss.  The test is useful anyway, as none of
the patches presented so far included a way to test all the scenarios
easily; the only exception was a bash script upthread, which discarded
some of the cases.

I would propose that for a back-patch, except for the test, which can go
down easily to 9.6, though I have not tested that yet.

Thoughts?
--
Michael
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..78f7db07f0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -42,6 +42,7 @@
 #include "parser/parse_relation.h"
 #include "port/pg_bswap.h"
 #include "rewrite/rewriteHandler.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
@@ -2404,7 +2405,16 @@ CopyFrom(CopyState cstate)
 		cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId)
 	{
 		hi_options |= HEAP_INSERT_SKIP_FSM;
-		if (!XLogIsNeeded())
+
+		/*
+		 * Skip writing WAL if there have been no actions that write an init
+		 * block for any of the buffers that will be touched during COPY.
+		 * Since there is no way of knowing at present which ones these are,
+		 * we must use a simple but effective heuristic to ensure safety of
+		 * the COPY operation for all cases, which is in this case to check
+		 * that the relation copied to has zero blocks.
+		 */
+		if (!XLogIsNeeded() && RelationGetNumberOfBlocks(cstate->rel) == 0)
 			hi_options |= HEAP_INSERT_SKIP_WAL;
 	}
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7c0cf0d7ee..90fe27fbf9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1562,10 +1562,15 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
 		 * the table was either created in the current (sub)transaction or has
 		 * a new relfilenode in the current (sub)transaction, then we can just
 		 * truncate it in-place, because a rollback would cause the whole
-		 * table or the current physical file to be thrown away anyway.
+		 * table or the current physical file to be thrown away anyway.  This
+		 * optimization is not safe with wal_level = minimal as there is no
+		 * actual way to know which are the blocks that could have been
+		 * touched by another operation done within this same transaction, be
+		 * it INSERT or COPY.
 		 */
-		if (rel->rd_createSubid == mySubid ||
-			rel->rd_newRelfilenodeSubid == mySubid)
+		if (XLogIsNeeded() &&
+			(rel->rd_createSubid == mySubid ||
+			 rel->rd_newRelfilenodeSubid == mySubid))
 		{
 			/* Immediate, non-rollbackable truncation is OK */
 			heap_truncate_one_rel(rel);
diff --git a/src/test/recovery/t/015_wal_optimize.pl b/src/test/recovery/t/015_wal_optimize.pl
new file mode 100644
index 00..98a410b125
--- /dev/null
+++ b/src/test/recovery/t/015_wal_optimize.pl
@@ -0,0 +1,120 @@
+# Test WAL replay for optimized TRUNCATE and COPY records

Legacy GiST invalid tuples

2018-07-03 Thread Andrey Borodin
Hi, hackers!

There is a bunch of code in the current GiST implementation checking for
GistTupleIsInvalid().  PostgreSQL has not created invalid tuples since
9.1.  Should we support these tuples forever?

Invalid tuples were created when a crash happened during a GiST page
split; they were used as a recovery measure.  During an index scan,
invalid tuples are treated as an unconditional match.  Any update in the
subtree of such a tuple will fail with errhint("Please REINDEX it."),
and any VACUUM will log the same complaint (but succeed).

I think that there are two options:
1. The bold one: just assume that there are no more invalid tuples.
2. For v12, raise the severity to ERROR on any encounter, then go to
option 1 in vNext.

What is the purpose of dropping invalid tuples?
1. Make the code cleaner.
2. Free some spare bytes in t_tid for advancing GiST (for example, see
my other patch on GiST intrapage indexing [0]).

Best regards, Andrey Borodin.

[0] 
https://www.postgresql.org/message-id/7780a07b-4d04-41e2-b228-166b41d07...@yandex-team.ru


When using the prepared protocol, a transaction will hold backend_xmin until the end of the transaction.

2018-07-03 Thread chenhj
Hi, hackers!

When executing SQL with the prepared protocol, a read committed
transaction will hold backend_xmin until the end of the transaction.

Is this behavior normal?

Should a read committed transaction release backend_xmin immediately
after the statement finishes executing, just as it does when executing
SQL with the simple protocol?

# reproduction
## env
- PostgreSQL 9.2
- CentOS 7.2

## test script

$ cat test.sql
begin;
select 1;
\sleep 1000s


## execute with simple protocol

$ pgbench -n -t 1 -f test.sql "service=admin"

postgres=# select * from pg_stat_activity where query='select 1;';
-[ RECORD 1 ]----+------------------------------
datid            | 13805
datname          | postgres
pid              | 19641
usesysid         | 16388
usename          | admin
application_name | pgbench
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-07-04 13:27:10.62635+08
xact_start       | 2018-07-04 13:27:10.629609+08
query_start      | 2018-07-04 13:27:10.629845+08
state_change     | 2018-07-04 13:27:10.63035+08
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      |
backend_xmin     |
query            | select 1;
backend_type     | client backend

## execute with prepared protocol

$ pgbench -n -t 1 -f test.sql "service=admin" -M prepared

postgres=# select * from pg_stat_activity where query='select 1;';
-[ RECORD 1 ]----+------------------------------
datid            | 13805
datname          | postgres
pid              | 19662
usesysid         | 16388
usename          | admin
application_name | pgbench
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2018-07-04 13:27:46.637134+08
xact_start       | 2018-07-04 13:27:46.641348+08
query_start      | 2018-07-04 13:27:46.64174+08
state_change     | 2018-07-04 13:27:46.641778+08
wait_event_type  | Client
wait_event       | ClientRead
state            | idle in transaction
backend_xid      |
backend_xmin     | 3930269815
query            | select 1;
backend_type     | client backend

backend_xmin affects the removal of dead tuples:

postgres=# create table tbchj(id int);
CREATE TABLE
postgres=# insert into tbchj values(1);
INSERT 0 1
postgres=# delete from tbchj;
DELETE 1
postgres=# vacuum VERBOSE tbchj;
INFO:  vacuuming "public.tbchj"
INFO:  "tbchj": found 0 removable, 1 nonremovable row versions in 1 out of 
1 pages
DETAIL:  1 dead row versions cannot be removed yet, oldest xmin: 3930269815
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM
 
Regards
Chen Huajun

Re: why partition pruning doesn't work?

2018-07-03 Thread Amit Langote
On 2018/06/19 2:05, Tom Lane wrote:
> Amit Langote  writes:
>> [ 0001-Open-partitioned-tables-during-Append-initialization.patch ]
> 
> I took a look at this.  While I'm in agreement with the general idea of
> holding open the partitioned relations' relcache entries throughout the
> query, I do not like anything about this patch:

Thanks for taking a look at it and sorry about the delay in replying.

> * It seems to be outright broken for the case that any of the partitioned
> relations are listed in nonleafResultRelations.  If we're going to do it
> like this, we have to open every one of the partrels regardless of that.

Yes, that was indeed wrong.

> (I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations
> altogether, in favor of merging the related code in InitPlan with this.

Hmm, PlannedStmt.nonleafResultRelations exists for the same reason that
PlannedStmt.resultRelations does, that is:

/*
 * initialize result relation stuff, and open/lock the result rels.
 *
 * We must do this before initializing the plan tree, else we might try to
 * do a lock upgrade if a result rel is also a source rel.
 */

nonleafResultRelations contains the members of the partitioned_rels
lists of all ModifyTable nodes contained in the plan.

> That existing code is already a mighty ugly wart, and this patch makes
> it worse by adding new, related warts elsewhere.)

I just realized that there is a thinko in the following piece of code in
ExecLockNonLeafAppendTables

/* If this is a result relation, already locked in InitPlan */
foreach(l, stmt->nonleafResultRelations)
{
if (rti == lfirst_int(l))
{
is_result_rel = true;
break;
}
}

It should actually be:

/* If this is a result relation, already locked in InitPlan */
foreach(l, stmt->nonleafResultRelations)
{
Index   nonleaf_rti = lfirst_int(l);
Oid nonleaf_relid = getrelid(nonleaf_rti,
 estate->es_range_table);

if (relid == nonleaf_relid)
{
is_result_rel = true;
break;
}
}

RT indexes in, say, Append.partitioned_rels are distinct from those in
PlannedStmt.nonleafResultRelations, so the existing test never succeeds,
as is also evident from the coverage report:

https://coverage.postgresql.org/src/backend/executor/execUtils.c.gcov.html#864


I'm wondering if we couldn't just get rid of this code.  If an input
partitioned table is indeed also a result relation, then we would have
locked it in InitPlan with RowExclusiveLock already, and heap_opening it
here with a weaker lock (RowShare/AccessShare) wouldn't hurt.

> * You've got *far* too much intimate knowledge of the possible callers
> in ExecOpenAppendPartitionedTables.
> 
> Personally, what I would have this function do is return a List of
> the opened Relation pointers, and add a matching function to run through
> such a List and close the entries again, and make the callers responsible
> for stashing the List pointer in an appropriate field in their planstate.

OK, I rewrote it to work that way.
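
For reference, a minimal sketch of that shape follows.  The names and
signatures are illustrative only (the close counterpart in particular is
a made-up name), so this is not necessarily what the attached patch does
verbatim:

/* Open all partitioned tables listed by RT index; return their Relations. */
static List *
ExecOpenAppendPartitionedTables(List *partitioned_rels, EState *estate,
                                LOCKMODE lockmode)
{
    List       *rels = NIL;
    ListCell   *lc;

    foreach(lc, partitioned_rels)
    {
        Index       rti = lfirst_int(lc);
        Oid         relid = getrelid(rti, estate->es_range_table);

        rels = lappend(rels, heap_open(relid, lockmode));
    }

    return rels;
}

/* Matching cleanup: close every Relation opened above. */
static void
ExecCloseAppendPartitionedTables(List *rels)
{
    ListCell   *lc;

    foreach(lc, rels)
        heap_close((Relation) lfirst(lc), NoLock);
}

Each caller would then stash the returned List in its planstate and call
the cleanup function from its ExecEnd routine.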

> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.

Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
have to have all the partitioned tables (contained in the
partitioned_rels fields of all Append/MergeAppend/ModifyTable nodes of a
plan) also listed in a global list, like the rootResultRelations and
nonleafResultRelations lists of PlannedStmt.

Attached updated patch.

Thanks,
Amit
From 0fd93b0b2108d4d12f483b96aab2c25c120173cb Mon Sep 17 00:00:00 2001
From: amit 
Date: Thu, 14 Jun 2018 14:50:22 +0900
Subject: [PATCH v2] Fix opening/closing of partitioned tables in Append plans

---
 src/backend/executor/execPartition.c   |  49 --
 src/backend/executor/execUtils.c   | 114 +
 src/backend/executor/nodeAppend.c  |  26 
 src/backend/executor/nodeMergeAppend.c |  11 ++--
 src/include/executor/execPartition.h   |   4 +-
 src/include/executor/executor.h|   4 +-
 src/include/nodes/execnodes.h  |   4 ++
 7 files changed, 100 insertions(+), 112 deletions(-)

diff --git a/src/backend/executor/execPartition.c 
b/src/backend/executor/execPartition.c
index 7a4665cc4e..b9bc157bfa 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1399,13 +1399,19 @@ adjust_partition_tlist(List *tlist, TupleConversionMap 
*map)
  * PartitionPruningData for each item in that List.  This data can be re-used
  * each time we re-evaluate which partitions match the pruning steps provided
  * in each PartitionPruneInfo.
+ *
+ * 'partitioned_rels' is a List of same number of elements as there are in
+ * 'partitionpruneinfo' containing the Relation pointers of corresponding
+ * partitioned tables.
  *