Re: Strange decreasing value of pg_last_wal_receive_lsn()

2020-06-03 Thread godjan β€’
I got it: it should be LSN + MAXALIGN(xlogrecord length) πŸ‘
Thanks a lot.

> On 2 Jun 2020, at 19:11, Jehan-Guillaume de Rorthais  wrote:
> 
> Nope, just sum the xlogrecord length.
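
For reference, a minimal sketch (not from the thread) of the arithmetic in the conclusion above, ignoring the case where a record crosses a WAL page boundary:

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Sketch only: the position just past a WAL record, i.e. where the next
 * record would start, given the record's start LSN and its total length
 * (XLogRecord.xl_tot_len).  Records are MAXALIGN'ed, hence the rounding.
 */
static XLogRecPtr
record_end_lsn(XLogRecPtr start_lsn, uint32 xl_tot_len)
{
	return start_lsn + MAXALIGN(xl_tot_len);
}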



Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-03 Thread Amit Kapila
On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar  wrote:
>
> On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila  wrote:
> >
> > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar  wrote:
> > >
> > > I think for our use case BufFileCreateShared is more suitable.  I think
> > > we need to do some modifications so that we can use these APIs without
> > > SharedFileSet. Otherwise, we would unnecessarily need to create a
> > > SharedFileSet for each transaction and also maintain it in an xid
> > > array or xid hash until transaction commit/abort.  So I suggest the
> > > following modifications in the shared fileset so that we can use it
> > > conveniently.
> > > 1. ChooseTablespace(const SharedFileSet *fileset, const char *name)
> > >   if fileset is NULL then select DEFAULTTABLESPACE_OID
> > > 2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace)
> > > If fileset is NULL then in the directory path we can use MyProcPid or
> > > something instead of fileset->creator_pid.
> > >
> >
> > Hmm, I find these modifications a bit ad-hoc.  So, not sure if it is
> > better than having the patch maintain the sharedfileset information.
>
> I think we might do something better here, maybe by supplying a function
> pointer or so, but maintaining a sharedfileset which contains a
> tablespace/mutex that we don't need at all for our purpose also
> doesn't sound very appealing.
>

I think we can say something similar for Relation (rel cache entry as
well) maintained in LogicalRepRelMapEntry.  I think we only need a
pointer to that information.

>  Let me see if I can come up with
> some clean way of avoiding the need for a shared fileset; if not, maybe we
> can go with the shared fileset idea.
>

Fair enough.
..

> > >
> > > But, even if nsubxacts becomes 0, we want to write the file so that we
> > > can overwrite the previous info.
> > >
> >
> > Can't we just remove the file for such a case?
>
> But, as of now, we expect that if it is not a first-time stream start then
> the file exists.
>

Isn't it primarily because we do subxact_info_write in stream stop,
which will create such a file irrespective of whether we have any
subxacts?  If so, isn't that an unnecessary write?

> Actually, currently it's quite simple: if it is
> not the first segment we always expect that the file exists,
> otherwise it's an error.
>

I think we can check whether the file exists and, if it doesn't, then
initialize nsubxacts as 0.
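
A minimal sketch of that idea (not the actual patch; the path argument and the nsubxacts variable stand in for whatever state the patch under discussion uses):

#include "postgres.h"

#include <fcntl.h>

#include "storage/fd.h"

static uint32 nsubxacts;		/* stand-in for the patch's state */

static void
subxact_info_read_sketch(const char *path)
{
	int			fd;

	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
	if (fd < 0)
	{
		if (errno == ENOENT)
		{
			/* nothing was written for this xid, so treat it as "no subxacts" */
			nsubxacts = 0;
			return;
		}
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not open file \"%s\": %m", path)));
	}

	/* ... read the subxact information as before ... */
	CloseTransientFile(fd);
}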

>   Now if it is not the first segment then we will
> need to handle multiple cases.
>
> a) subxact_info_read needs to handle the missing-file case, because the file
> may not exist either because there was no subxact in the last stream or
> because it was deleted when nsubxacts became 0.
> b) subxact_info_write will have multiple cases: if nsubxacts
> was already 0 then we can avoid writing the file, but if it becomes 0
> now we need to remove the file.
>
> Let me think more on that.
>

I feel we should be able to deal with these cases, but if you find any
difficulty then let us discuss.  I understand there is some ease if we
always have the subxacts file, but OTOH it sounds quite awkward that we
need so many file operations just to detect whether the
transaction has any subtransactions.

> >
> > Here, we have performed Rollback to savepoint s1 which doesn't have
> > any change of its own.  I think this would have been handled, but I just
> > wanted to confirm.
>
> But internally, that will send an abort for s2 first, and for that
> we will find the xid and truncate; later we will send an abort for s1,
> which we will not find, and do nothing?  Anyway, I will test it and let
> you know.
>

It would be good if we can test and confirm this behavior once.  If it
is not very inconvenient then we can even try to include a test for
the same in the patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Asynchronous Append on postgres_fdw nodes.

2020-06-03 Thread Andrey Lepikhov

On 3/30/20 1:15 PM, Etsuro Fujita wrote:

Hi,

On Wed, Mar 11, 2020 at 10:47 AM movead li  wrote:


I redid make installcheck-world as Kyotaro Horiguchi pointed out, and the
result shows nothing wrong.  I think the patch is good in terms of features
and performance; here is the test result thread I made before:
https://www.postgresql.org/message-id/CA%2B9bhCK7chd0qx%2Bmny%2BU9xaOs2FDNJ7RaxG4%3D9rpgT6oAKBgWA%40mail.gmail.com

The new status of this patch is: Ready for Committer


As discussed upthread, this is material for PG14, so I moved this to
the next commitfest, keeping the same status.  I've not looked at the
patch in any detail yet, so I'm not sure that that is the right status
for the patch, though.  I'd like to work on this for PG14 if I have
time.


Hi,
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.

--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company




Re: Should we remove a fallback promotion? take 2

2020-06-03 Thread Fujii Masao




On 2020/06/03 12:06, Kyotaro Horiguchi wrote:

At Wed, 3 Jun 2020 09:43:17 +0900, Fujii Masao  
wrote in

I will change the status back to Needs Review.


Thanks for the review!


  record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
  if (record != NULL)
  {
-  fast_promoted = true;
+  promoted = true;

Even if we miss the last checkpoint record, we don't give up on
promotion and continue with fallback promotion, but the variable "promoted"
stays false.  That is confusing.

How about changing it to fallback_promotion, or to a more
behavior-specific name like immediate_checkpoint_needed?



I like doEndOfRecoveryCkpt or something like that, but I have no strong opinion
about the flag naming.  So I'm OK with immediate_checkpoint_needed
if others like that too.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Internal key management system

2020-06-03 Thread Masahiko Sawada
On Wed, 3 Jun 2020 at 16:16, Fabien COELHO  wrote:
>
>
> Hello Masahiko-san,
>
> > This key manager is aimed to manage cryptographic keys used for
> > transparent data encryption. As a result of the discussion, we
> > concluded it's safer to use multiple keys to encrypt database data
> > rather than using one key to encrypt the whole thing, for example, in
> > order to make sure different data is not encrypted with the same key
> > and IV. Therefore, in terms of TDE, the minimum requirement is that
> > PostgreSQL can use multiple keys.
> >
> > Using multiple keys in PG, there are roughly two possible designs:
> >
> > 1. Store all keys for TDE into the external KMS, and PG gets them from
> > it as needed.
>
> +1

In this approach, encryption keys obtained from the external KMS are
directly used to encrypt/decrypt data. What KEK and DEK are you
referring to in this approach?

>
> > 2. PG manages all keys for TDE inside and protect these keys on disk
> > by the key (i.e., KEK) stored in the external KMS.
>
> -1, this is the one where you would need arguing.
>
> > There are pros and cons to each design. If I take one con of #1 as an
> > example, the interaction between PG and the external KMS could be
> > complex. The operations could be creating, removing and rotating keys and
> > so on.
>
> ISTM that only create (delete?) are really needed. Rotating is the problem
> of the KMS itself, thus does not need to be managed by pg under #1.

With your idea, how is key rotation going to be performed? After
invoking key rotation on the external KMS, do we need to re-encrypt all
data encrypted with the old keys? Or do you assume that the external KMS
employs something like a 2-tier key hierarchy?

>
> > We can implement these operations in an extension to interact
> > with different kinds of external KMS, and perhaps we can use KMIP.
>
> I would even put that (KMIP protocol stuff) outside pg core.
>
> Even under #2, if some KMS is implemented and managed by pg, I would put
> the stuff in a separate process which I would probably run with a
> different uid, so that the KEK is not accessible directly by pg, ever.
>
> Once KMS interactions are managed with an outside process, then what this
> process does becomes an interface, and whether this process actually
> manages the keys or discuss with some external KMS with some KMIP or
> whatever is irrelevant to pg. Providing an interface means that anyone
> could implement their KMS fitting their requirements if they comply with
> the interface/protocol.

Just to be clear, we don't keep the KEK in either shared memory or on disk.
The postmaster and a backend that executes pg_rotate_cluster_passphrase()
get the KEK and use it to (re-)encrypt the internal keys, but after that they
immediately free it. The encryption keys we need to store inside
PostgreSQL are the DEKs.
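
For illustration only (this is not code from the patch; wrap_key() and
unwrap_key() are hypothetical helpers standing in for whatever KEK-based
wrapping is used), the 2-tier idea looks roughly like this:

/*
 * Sketch of a 2-tier key hierarchy: the KEK is obtained transiently (from a
 * passphrase or an external KMS) and is only ever used to wrap/unwrap DEKs;
 * what PostgreSQL persists are the wrapped DEKs.
 */
typedef struct WrappedKey
{
	unsigned char data[64];		/* DEK encrypted with the KEK */
	size_t		len;
} WrappedKey;

static void
rotate_kek(WrappedKey *keys, int nkeys,
		   const unsigned char *old_kek, const unsigned char *new_kek)
{
	for (int i = 0; i < nkeys; i++)
	{
		unsigned char dek[32];

		unwrap_key(old_kek, &keys[i], dek);	/* hypothetical helper */
		wrap_key(new_kek, dek, &keys[i]);	/* hypothetical helper */
		explicit_bzero(dek, sizeof(dek));	/* scrub the plaintext DEK */
	}
}

Rotation only re-wraps the stored DEKs with the new KEK; the user data
encrypted with the DEKs does not need to be touched.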

>
> Note that I'd be fine with having the current implementation somehow
> wrapped up as an example KMS.
>
> > But the development cost could become high because we might need
> > different extensions for each key management solutions/services.
>
> Yes and no. What I suggest is, I think, pretty simple, and I think I can
> implement it in a few lines of script, so the cost is not high, and having
> a separate process looks, to me, like a security win and an extensibility
> win (i.e. another implementation can be provided).

How can we get multiple keys from the external KMS? I think we will
need to save something like identifiers for each encryption key
Postgres needs in the core and ask the external KMS for the key by the
identifier via an extension. Is that right?

>
> > #2 is better at that point; the interaction between PG and KMS is only
> > GET.
>
> I think that it could be the same with #1. I think that having a separate
> process is a reasonable security requirement, and if you do that #1 and #2
> are more or less the same.
>
> > Other databases that employ a similar approach are SQL Server and DB2.
>
> Too bad for them:-) I'd still disagree with having the master key inside
> the database process, even if Microsoft, IBM and Oracle think it is a good
> idea.
>
> > In terms of the necessity of introducing the key manager into PG core,
> > I think at least TDE needs to be implemented in PG core. And as this
> > key manager is for managing keys for TDE, I think the key manager also
> > needs to be introduced into the core so that TDE functionality doesn't
> > depend on external modules.
>
> Hmmm.
>
> My point is that only interactions should be in core.
>
> The implementation could be in core, but as a separate process.
>
> I agree that pg needs to be able to manage the DEK, so it needs to store
> data keys.
>
> I still do not understand why an extension, possibly distributed with pg,
> would not be ok. There may be good arguments for that, but I do not think
> you provided any yet.

Hmm, I think I don't fully understand your idea yet. With the current
patch, the KEK could be obtained by either the postmaster or backend processes
that execute pg_rotate_clu

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-03 Thread Amit Kapila
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar  wrote:
>

The fixes in the latest patchset are correct.  Few minor comments:
v26-0005-Implement-streaming-mode-in-ReorderBuffer
+ /*
+ * Mark toplevel transaction as having catalog changes too if one of its
+ * children has so that the ReorderBufferBuildTupleCidHash can conveniently
+ * check just toplevel transaction and decide whethe we need to build the
+ * hash table or not.  In non-streaming mode we mark the toplevel
+ * transaction in DecodeCommit as we only stream on commit.

Typo, /whethe/whether
missing comma, /In non-streaming mode we/In non-streaming mode, we

v26-0008-Add-support-for-streaming-to-built-in-replicatio
+ /*
+ * This memory context used for per stream data when streaming mode is
+ * enabled.  This context is reeset on each stream stop.
+ */

Can we slightly modify the above comment as "This is used in the
streaming mode for the changes between the start and stop stream
messages.  We reset this context on the stream stop message."?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Dave Cramer
On Wed, 3 Jun 2020 at 01:19, Michael Paquier  wrote:

> On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
> > On 2020/06/02 13:24, Michael Paquier wrote:
> >> Still unconvinced as this restriction stands for logical decoding
> >> requiring a database connection but it is not necessarily true now as
> >> physical replication has less restrictions than a logical one.
> >
> > Could you tell me what the benefit for supporting physical replication on
> > logical rep connection is? If it's only for "undocumented"
> > backward-compatibility, IMO it's better to reject such "tricky" set up.
> > But if there are some use cases for that, I'm ok to support that.
>
> Well, I don't really think that we can just break a behavior that
> exists since 9.4 as you could break applications relying on the
> existing behavior, and that's also the point of Vladimir upthread.
>

I don't see this as a valid reason to keep doing something. If it is broken
then fix it.
Clients can deal with the change.

Dave Cramer
https://www.postgres.rocks


Re: what can go in root.crt ?

2020-06-03 Thread Ants Aasma
On Tue, 2 Jun 2020 at 20:14, Bruce Momjian  wrote:

> The server certificate should be issued by a certificate authority root
> outside of your organization only if you want people outside of your
> organization to trust your server certificate, but you are then asking
> for the client to only trust an intermediate inside your organization.
> The big question is why bother having the server certificate chain to a
> root certificate you don't trust when you have no intention of having
> clients outside of your organization trust the server certificate.
> Postgres could be made to handle such cases, but is it really a valid
> configuration we should support?
>

I think the "why" the org cert is not root was already made clear, that is
the copmany policy. I don't think postgres should take a stance whether the
certificate designated as the root of trust is self-signed or claims to get
its power from somewhere else.

It's pretty easy to conceive of certificate management procedures that make
use of this chain to implement certificate replacement securely. For
example, one might trust the global issuer to verify that a CSR is coming
from the O= value it claims to come from, in order to automate replacement
of intermediate certificates, but not trust that every other sub-CA signed
by the root and their sub-sub-CAs are completely honest and secure.

Regards,
Ants Aasma


how to create index concurrently on partitioned table

2020-06-03 Thread 李杰(ζ…ŽθΏ½)
Hi hackers,

Partitioning is necessary for very large tables.
However, I found that PostgreSQL does not support CREATE INDEX CONCURRENTLY on
partitioned tables.
The documentation shows that we need to create an index on each partition
individually and then finally create the partitioned index non-concurrently.
This is undoubtedly a complex operation for a DBA, especially when there are
many partitions.

Therefore, I wonder why pg does not support concurrent index creation on
partitioned tables.
What are the difficulties with this functionality?
If I want to implement it, what should I pay attention to?

Sincerely look forward to your reply. 

Regards & Thanks Adger

Re: Incorrect comment in be-secure-openssl.c

2020-06-03 Thread Michael Paquier
On Mon, Jun 01, 2020 at 10:39:45AM +0200, Daniel Gustafsson wrote:
> I don't have a problem with the existing wording of the first sentence, and I
> don't have a problem with your suggestion either (as long as it's parameters 
> in
> plural).

Thanks, that's why I kept the word plural in my own suggestion.  I was
just reading through the whole set again, and still kind of prefer the
last flavor, so I think that I'll just fix it this way tomorrow and
call it a day.
--
Michael


signature.asc
Description: PGP signature


Re: Incorrect comment in be-secure-openssl.c

2020-06-03 Thread Daniel Gustafsson
> On 3 Jun 2020, at 14:38, Michael Paquier  wrote:
> 
> On Mon, Jun 01, 2020 at 10:39:45AM +0200, Daniel Gustafsson wrote:
>> I don't have a problem with the existing wording of the first sentence, and I
>> don't have a problem with your suggestion either (as long as it's parameters 
>> in
>> plural).
> 
> Thanks, that's why I kept the word plural in my own suggestion.  I was
> just reading through the whole set again, and still kind of prefer the
> last flavor, so I think that I'll just fix it this way tomorrow and
> call it a day.

Sounds good, thanks!

cheers ./daniel



Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Fujii Masao




On 2020/06/03 20:33, Dave Cramer wrote:




On Wed, 3 Jun 2020 at 01:19, Michael Paquier  wrote:

On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote:
 > On 2020/06/02 13:24, Michael Paquier wrote:
 >> Still unconvinced as this restriction stands for logical decoding
 >> requiring a database connection but it is not necessarily true now as
 >> physical replication has less restrictions than a logical one.
 >
 > Could you tell me what the benefit for supporting physical replication on
 > logical rep connection is? If it's only for "undocumented"
 > backward-compatibility, IMO it's better to reject such "tricky" set up.
 > But if there are some use cases for that, I'm ok to support that.

Well, I don't really think that we can just break a behavior that
exists since 9.4 as you could break applications relying on the
existing behavior, and that's also the point of Vladimir upthread.


For the back branches, I agree with you. Even if it's undocumented behavior,
basically we should not get rid of it from the back branches unless there is
a very special reason.

For v13, if it has no functional merit, I don't think it's so bad to get rid of
that undocumented (and maybe not fully tested) behavior. If there are
applications depending on it, I think that they can be updated.


I don't see this as a valid reason to keep doing something. If it is broken
then fix it.
Clients can deal with the change.


+1

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Removal of currtid()/currtid2() and some table AM cleanup

2020-06-03 Thread Inoue, Hiroshi

Hi,

On 2020/06/03 11:14, Michael Paquier wrote:

Hi all,



I have been looking at the ODBC driver and the need for currtid() as
well as currtid2(), and as mentioned already in [1], from my lookup of
things these are actually not needed by the driver as long
as we connect to a server newer than 8.2 that supports RETURNING.


Though currtid2() is necessary even for servers which support RETURNING,
I don't object to removing it.

regards,
Hiroshi Inoue


I am adding in CC of this thread Saito-san and Inoue-san, who are the
two main maintainers of the driver, for comments.  It is worth noting
that on its latest HEAD the ODBC driver requires libpq from at least
9.2.

I would like to remove those two functions and the surrounding code
for v14, leading to some cleanup:
  6 files changed, 326 deletions(-)

While on it, I have noticed that heap_get_latest_tid() is still
located within heapam.c, but we can just move it within
heapam_handler.c.

Attached are two patches to address both points.  Comments are
welcome.

Thanks,

[1]: 
https://www.postgresql.org/message-id/20200529005559.jl2gsolomyro4...@alap3.anarazel.de
--
Michael






Re: More tests with USING INDEX replident and dropped indexes

2020-06-03 Thread Euler Taveira
On Wed, 3 Jun 2020 at 03:14, Michael Paquier  wrote:

> On Tue, Jun 02, 2020 at 04:46:55PM +0900, Masahiko Sawada wrote:
> > How about avoiding such an inconsistent situation? In that case,
> > replica identity works as NOTHING, but pg_class.relreplident is still
> > β€˜i’, confusing users. It seems to me that dropping an index specified
> > by REPLICA IDENTITY USING INDEX is not a valid operation.
>
> This looks first like complicating RemoveRelations() or the internal
> object removal APIs with a dedicated lookup at this index's pg_index
> tuple, but you could just put that in index_drop when REINDEX
> CONCURRENTLY is not used.  Still, I am not sure if it is worth
> complicating those code paths.  It would be better to get more
> opinions about that first.
>
>
Consistency is a good goal. Why don't we clear relreplident for the
relation while dropping the index? relation_mark_replica_identity() already
does that but does other things too. Let's move the first code block from
relation_mark_replica_identity into another function and call this new
function while dropping the index.
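
A rough sketch of that refactoring (hypothetical function name, based on the usual catalog-update pattern; not the actual patch):

#include "postgres.h"

#include "access/htup_details.h"
#include "access/table.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"
#include "utils/syscache.h"

/* Reset pg_class.relreplident to 'n' (NOTHING) for the given relation. */
static void
reset_replica_identity(Oid relid)
{
	Relation	pg_class;
	HeapTuple	tuple;
	Form_pg_class classForm;

	pg_class = table_open(RelationRelationId, RowExclusiveLock);
	tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for relation %u", relid);
	classForm = (Form_pg_class) GETSTRUCT(tuple);

	if (classForm->relreplident != REPLICA_IDENTITY_NOTHING)
	{
		classForm->relreplident = REPLICA_IDENTITY_NOTHING;
		CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
	}

	heap_freetuple(tuple);
	table_close(pg_class, RowExclusiveLock);
}

The index-drop path could then call such a helper so that pg_class stays consistent with the actual replica identity behavior.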


-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Why is pq_begintypsend so slow?

2020-06-03 Thread Robert Haas
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund  wrote:
> The biggest problem after that is that we waste a lot of time memcpying
> stuff around repeatedly. There is:
> 1) send function: datum -> per datum stringinfo
> 2) printtup: per datum stringinfo -> per row stringinfo
> 3) socket_putmessage: per row stringinfo -> PqSendBuffer
> 4) send(): PqSendBuffer -> kernel buffer
>
> It's obviously hard to avoid 1) and 4) in the common case, but the
> number of other copies seem pretty clearly excessive.

I too have seen recent benchmarking data where this was a big problem.
Basically, you need a workload where the server doesn't have much or
any actual query processing to do, but is just returning a lot of
stuff to a really fast client - e.g. a locally connected client.
That's not necessarily the most common case but, if you have it, all
this extra copying is really pretty expensive.

My first thought was to wonder about changing all of our send/output
functions to write into a buffer passed as an argument rather than
returning something which we then have to copy into a different
buffer, but that would be a somewhat painful change, so it is probably
better to first pursue the idea of getting rid of some of the other
copies that happen in more centralized places (e.g. printtup). I
wonder if we could replace the whole
pq_beginmessage...()/pq_send()/pq_endmessage...() system with
something a bit better-designed. For instance, suppose we get rid of
the idea that the caller supplies the buffer, and we move the
responsibility for error recovery into the pqcomm layer. So you do
something like:

my_message = xyz_beginmessage('D');
xyz_sendint32(my_message, 42);
xyz_endmessage(my_message);

Maybe what happens here under the hood is we keep a pool of free
message buffers sitting around, and you just grab one and put your
data into it. When you end the message we add it to a list of used
message buffers that are waiting to be sent, and once we send the data
it goes back on the free list. If an error occurs after
xyz_beginmessage() and before xyz_endmessage(), we put the buffer back
on the free list. That would allow us to merge (2) and (3) into a
single copy. To go further, we could allow send/output functions to
opt in to receiving a message buffer rather than returning a value,
and then we could get rid of (1) for types that participate. (4) seems
unavoidable AFAIK.
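
A rough sketch of the buffer-pool idea described above (names are made up; error handling and the eventual flush of the pending list to the socket are omitted):

#include "postgres.h"

#include "lib/stringinfo.h"

typedef struct MsgBuf
{
	struct MsgBuf *next;		/* link in the free or pending list */
	StringInfoData buf;			/* accumulated message contents */
} MsgBuf;

static MsgBuf *free_bufs;		/* buffers available for reuse */
static MsgBuf *pending_bufs;	/* finished messages waiting to be sent */

static MsgBuf *
xyz_beginmessage(char msgtype)
{
	MsgBuf	   *m;

	if (free_bufs != NULL)
	{
		m = free_bufs;
		free_bufs = m->next;
		resetStringInfo(&m->buf);	/* reuse the existing allocation */
	}
	else
	{
		m = (MsgBuf *) palloc(sizeof(MsgBuf));
		initStringInfo(&m->buf);
	}
	appendStringInfoChar(&m->buf, msgtype);
	return m;
}

static void
xyz_endmessage(MsgBuf *m)
{
	m->next = pending_bufs;		/* queue for sending; once written to the */
	pending_bufs = m;			/* socket the buffer returns to free_bufs */
}

On error between begin and end, the buffer would simply be pushed back onto free_bufs instead of pending_bufs, which is the error-recovery behavior described above.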

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




significant slowdown of HashAggregate between 9.6 and 10

2020-06-03 Thread Pavel Stehule
Hi

One Czech Postgres user reported a performance issue related to the speed of
HashAggregate in a nested loop.

The speed of 9.6

HashAggregate  (cost=27586.10..27728.66 rows=14256 width=24)
(actual time=0.003..0.049 rows=39 loops=599203)

The speed of 10.7

HashAggregate  (cost=27336.78..27552.78 rows=21600 width=24)
(actual time=0.011..0.156 rows=38 loops=597137)

So it looks like HashAgg is about 3x slower - with a brutal nested loop that is
a problem.

I wrote a simple benchmark, and it really looks like our hash aggregate is
getting slower and slower.

create table foo(a int, b int, c int, d int, e int, f int);
insert into foo select random()*1000, random()*4, random()*4, random()* 2,
random()*100, random()*100 from generate_series(1,200);

analyze foo;

9.6.7
postgres=# explain (analyze, buffers) select i from
generate_series(1,50) g(i) where exists (select count(*) cx from foo
group by b, c, d having count(*) = i);
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=807.485..3364.515 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.006..0.006 rows=0 loops=50)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=200 width=12) (actual time=0.015..139.736 rows=200 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.276 ms
 Execution time: 3365.758 ms
(14 rows)

10.9

postgres=# explain (analyze, buffers) select i from
generate_series(1,50) g(i) where exists (select count(*) cx from foo
group by b, c, d having count(*) = i);
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=825.468..4919.063 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.009..0.009 rows=0 loops=50)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=200 width=12) (actual time=0.025..157.887 rows=200 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.829 ms
 Execution time: 4920.800 ms
(14 rows)

master


postgres=# explain (analyze, buffers) select i from
generate_series(1,50) g(i) where exists (select c

Re: REINDEX CONCURRENTLY and indisreplident

2020-06-03 Thread Euler Taveira
On Wed, 3 Jun 2020 at 03:54, Michael Paquier  wrote:

> Hi all,
>
> I have bumped into $subject, causing a replica identity index to
> be considered as dropped if running REINDEX CONCURRENTLY on it.  This
> means that the old tuple information would get lost in this case, as
> a REPLICA IDENTITY USING INDEX without a dropped index is the same as
> NOTHING.
>
LGTM. I tested in both versions (12, master) and it works accordingly.


-- 
Euler Taveira http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-03 Thread Pavel Stehule
On Wed, 3 Jun 2020 at 17:32, Pavel Stehule  wrote:

> Hi
>
> One czech Postgres user reported performance issue related to speed
> HashAggregate in nested loop.
>
> The speed of 9.6
>
> HashAggregate  (cost=27586.10..27728.66 rows=14256 width=24)
> (actual time=0.003..0.049 rows=39 loops=599203)
>
> The speed of 10.7
>
> HashAggregate  (cost=27336.78..27552.78 rows=21600 width=24)
> (actual time=0.011..0.156 rows=38 loops=597137)
>
> So it looks so HashAgg is about 3x slower - with brutal nested loop it is
> a problem.
>
> I wrote simple benchmark and really looks so our hash aggregate is slower
> and slower.
>
> create table foo(a int, b int, c int, d int, e int, f int);
> insert into foo select random()*1000, random()*4, random()*4, random()* 2,
> random()*100, random()*100 from generate_series(1,200);
>
> analyze foo;
>
> 9.6.7
> postgres=# explain (analyze, buffers) select i from
> generate_series(1,50) g(i) where exists (select count(*) cx from foo
> group by b, c, d having count(*) = i);
>
> β”Œβ”€β”€β”€β”
> β”‚  QUERY PLAN
>   β”‚
>
> β•žβ•β•β•β•‘
> β”‚ Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500
> width=4) (actual time=807.485..3364.515 rows=74 loops=1) β”‚
> β”‚   Filter: (SubPlan 1)
>   β”‚
> β”‚   Rows Removed by Filter: 499926
>  β”‚
> β”‚   Buffers: shared hit=12739, temp read=856 written=855
>  β”‚
> β”‚   SubPlan 1
>   β”‚
> β”‚ ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20)
> (actual time=0.006..0.006 rows=0 loops=50)  β”‚
> β”‚   Group Key: foo.b, foo.c, foo.d
>  β”‚
> β”‚   Filter: (count(*) = g.i)
>  β”‚
> β”‚   Rows Removed by Filter: 75
>  β”‚
> β”‚   Buffers: shared hit=12739
>   β”‚
> β”‚   ->  Seq Scan on foo  (cost=0.00..32739.00 rows=200
> width=12) (actual time=0.015..139.736 rows=200 loops=1)  β”‚
> β”‚ Buffers: shared hit=12739
>   β”‚
> β”‚ Planning time: 0.276 ms
>   β”‚
> β”‚ Execution time: 3365.758 ms
>   β”‚
>
> β””β”€β”€β”€β”˜
> (14 rows)
>
> 10.9
>
> postgres=# explain (analyze, buffers) select i from
> generate_series(1,50) g(i) where exists (select count(*) cx from foo
> group by b, c, d having count(*) = i);
>
> β”Œβ”€β”€β”€β”
> β”‚  QUERY PLAN
>   β”‚
>
> β•žβ•β•β•β•‘
> β”‚ Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500
> width=4) (actual time=825.468..4919.063 rows=74 loops=1) β”‚
> β”‚   Filter: (SubPlan 1)
>   β”‚
> β”‚   Rows Removed by Filter: 499926
>  β”‚
> β”‚   Buffers: shared hit=12739, temp read=856 written=855
>  β”‚
> β”‚   SubPlan 1
>   β”‚
> β”‚ ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20)
> (actual time=0.009..0.009 rows=0 loops=50)  β”‚
> β”‚   Group Key: foo.b, foo.c, foo.d
>  β”‚
> β”‚   Filter: (count(*) = g.i)
>  β”‚
> β”‚   Rows Removed by Filter: 75
>  β”‚
> β”‚   Buffers: shared hit=12739
>   β”‚
> β”‚   ->  Seq Scan on foo  (cost=0.00..32739.00 rows=200
> width=12) (actual time=0.025..157.887 rows=200 loops=1)  β”‚
> β”‚ Buffers: shared hit=12739
>   β”‚
> β”‚ Planning time: 0.829 ms
>   β”‚
> β”‚ Execution time: 4920.800 ms
>   

Re: Parallel copy

2020-06-03 Thread Robert Haas
On Mon, May 18, 2020 at 12:48 AM Amit Kapila  wrote:
> In the above case, even though we are executing a single command from
> the user perspective, but the currentCommandId will be four after the
> command.  One increment will be for the copy command and the other
> three increments are for locking tuple in PK table
> (tab_fk_referenced_chk) for three tuples in FK table
> (tab_fk_referencing_chk).  Now, for parallel workers, it is
> (theoretically) possible that the three tuples are processed by three
> different workers which don't get synced as of now.  The question was
> do we see any kind of problem with this and if so can we just sync it
> up at the end of parallelism.

I strongly disagree with the idea of "just sync(ing) it up at the end
of parallelism". That seems like a completely unprincipled approach to
the problem. Either the command counter increment is important or it's
not. If it's not important, maybe we can arrange to skip it in the
first place. If it is important, then it's probably not OK for each
backend to be doing it separately.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Expand the use of check_canonical_path() for more GUCs

2020-06-03 Thread Robert Haas
On Tue, Jun 2, 2020 at 5:04 AM Peter Eisentraut
 wrote:
> The archeology reveals that these calls were originally added to
> canonicalize the data_directory and config_file settings (7b0f060d54),
> but that was then moved out of guc.c to be done early during postmaster
> startup (337ffcddba).  The remaining calls of check_canonical_path() in
> guc.c appear to be leftovers from a previous regime.

Thanks for looking into it. Sounds like it can just be ripped out,
then, unless someone knows of a reason to do otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Read access for pg_monitor to pg_replication_origin_status view

2020-06-03 Thread MartΓ­n MarquΓ©s
Hi Kyotaro-san,

Thank you for taking the time to review my patches. Would you like to
set yourself as a reviewer in the commit entry here?
https://commitfest.postgresql.org/28/2577/

> 0002:
>
> It forgets to grant pg_read_all_stats the right to execute
> pg_show_replication_origin_status.  As a result, pg_read_all_stats
> gets an error when executing the function, not when selecting from the view.

Seems I was testing on a cluster where I had already been digging, so
pg_read_all_stats had execute privileges on
pg_show_replication_origin_status (I had manually granted that) and I
didn't notice because I forgot to drop the cluster and initialize it
again.

Thanks for the pointer here!

> 0003:
>
> It seems to be modeled after adminpack's documentation, but a
> superuser is not the only such role on PostgreSQL.  Something like the
> description in 27.2.2 Viewing Statistics looks more suitable.
>
> > Superusers and members of the built-in role pg_read_all_stats (see
> > also Section 21.5) can see all the information about all sessions.
>
> Section 21.5 is already saying as follows.
>
> > pg_monitor
> >   Read/execute various monitoring views and functions. This role is a
> >   member of pg_read_all_settings, pg_read_all_stats and
> >   pg_stat_scan_tables.

I'm not sure if I got this right, but I added some more text to point
out that the pg_read_all_stats role can also access one specific
function. I personally think it's a bit too detailed, and if we wanted
to add details it should be formatted differently, which would require
a more invasive patch (would prefer leaving that out, as it might even
mean moving parts which are not part of this patch).

In any case, I hope the change fits what you've kindly pointed out.

> 0004:
>
> Looks fine by me.

Here goes v3 of the patch

-- 
MartΓ­n MarquΓ©s    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From 87c00782c691f2c6c13768cd6d96b19de69cab16 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mart=C3=ADn=20Marqu=C3=A9s?=
 
Date: Tue, 2 Jun 2020 10:44:42 -0300
Subject: [PATCH v3 1/4] Access to pg_replication_origin_status view has
 restricted access only for superusers due to it using
 pg_show_replication_origin_status(), and all replication origin functions
 requiring superuser rights.

This patch removes the check for superuser priviledges when executing
replication origin functions, which is hardcoded, and instead rely on
ACL checks.

This patch will also revoke execution of such functions from PUBLIC
---
 src/backend/catalog/system_views.sql | 16 
 src/backend/replication/logical/origin.c |  5 -
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 56420bbc9d6..6658f0c2eb2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1469,6 +1469,22 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public;
 REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public;
 
+--
+-- Permision to execute Replication Origin functions should be revoked from public
+--
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_advance(text, pg_lsn) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_create(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_drop(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_oid(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_progress(text, boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_is_setup() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_progress(boolean) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_reset() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_setup(text) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_reset() FROM public;
+REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_setup(pg_lsn, timestamp with time zone) FROM public;
+REVOKE EXECUTE ON FUNCTION pg_show_replication_origin_status() FROM public;
+
 --
 -- We also set up some things as accessible to standard roles.
 --
diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c
index 981d60f135d..1f223daf21f 100644
--- a/src/backend/replication/logical/origin.c
+++ b/src/backend/replication/logical/origin.c
@@ -182,11 +182,6 @@ static ReplicationState *session_replication_state = NULL;
 static void
 replorigin_check_prerequisites(bool check_slots, bool recoveryOK)
 {
-	if (!superuser())
-		ereport(ERROR,
-(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- errmsg("only superusers can query or manipulate replication origins")));
-
 	if (check_slots && max_replication_slots == 0)
 		ereport(ERROR,
 (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-- 
2.21.3

From 0d0ef6aa7e563bd3b9

Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-06-03 Thread Mark Dilger
Hackers,

The name "relkind" normally refers to a field of type 'char' with values like 
'r' for "table" and 'i' for "index".  In AlterTableStmt and CreateTableAsStmt, 
this naming convention was abused for a field of type enum ObjectType.  Often, 
such fields are named "objtype", though also "kind", "removeType", 
"renameType", etc.

I don't care to debate those other names, though in passing I'll say that 
"kind" seems not great.  The "relkind" name is particularly bad, though. It is 
confusing in functions that also operate on a RangeTblEntry object, which also 
has a field named "relkind", and is confusing in light of the function 
get_relkind_objtype() which maps from "relkind" to "objtype", implying quite 
correctly that those two things are distinct.

The attached patch cleans this up.  How many toes am I stepping on here?  
Changing the names was straightforward and doesn't seem to cause any issues 
with 'make check-world'.  Any objection?

For those interested in the larger context of this patch, I am trying to clean 
up any part of the code that makes it harder to write and test new access 
methods.  When implementing a new AM, one currently needs to `grep -i relkind` 
to find a long list of files that need special treatment.  One then needs to 
consider whether special logic for the new AM needs to be inserted into all 
these spots.  As such, it is nice if these spots have as little confusing 
naming as possible.  This patch makes that process a little easier.  I have 
another patch (to be posted shortly) that cleans up the #define RELKIND_XXX 
stuff using a new RelKind enum and special macros while keeping the relkind 
fields as type 'char'.  Along with converting code to use switch(relkind) 
rather than if (relkind...) statements, the compiler now warns on unhandled 
cases when you add a new RelKind to the list, making it easier to find all the 
places you need to update.  I decided to keep that work independent of this 
patch, as the code is logically distinct.
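
For readers unfamiliar with the trick being described, here is a tiny standalone illustration (hypothetical enum values, not the actual patch): when a switch has no default clause, compilers warn (e.g. gcc/clang with -Wswitch, part of -Wall) about any enum member that is not handled, so adding a new RelKind surfaces every place that needs updating.

typedef enum RelKind
{
	RELKIND_TABLE_E,
	RELKIND_INDEX_E,
	RELKIND_SEQUENCE_E
	/* adding a member here makes every non-exhaustive switch warn */
} RelKind;

static const char *
relkind_name(RelKind kind)
{
	switch (kind)
	{
		case RELKIND_TABLE_E:
			return "table";
		case RELKIND_INDEX_E:
			return "index";
		case RELKIND_SEQUENCE_E:
			return "sequence";
			/* intentionally no default: keeps -Wswitch useful */
	}
	return "???";				/* not reached */
}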



v1-0001-Renaming-relkind-as-objtype-where-appropriate.patch
Description: Binary data


β€”
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: Why is pq_begintypsend so slow?

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 11:30:42 -0400, Robert Haas wrote:
> I too have seen recent benchmarking data where this was a big problem.
> Basically, you need a workload where the server doesn't have much or
> any actual query processing to do, but is just returning a lot of
> stuff to a really fast client - e.g. a locally connected client.
> That's not necessarily the most common case but, if you have it, all
> this extra copying is really pretty expensive.

Even when the query actually is doing something, it's still quite
possible to get the memcpies to be measurable (say > 10% of
cycles). Obviously not in a huge aggregating query. Even in something
like pgbench -M prepared -S, which is obviously spending most of its
cycles elsewhere, the patches upthread improve throughput by ~1.5% (and
that's not eliding several unnecessary copies).


> My first thought was to wonder about changing all of our send/output
> functions to write into a buffer passed as an argument rather than
> returning something which we then have to copy into a different
> buffer, but that would be a somewhat painful change, so it is probably
> better to first pursue the idea of getting rid of some of the other
> copies that happen in more centralized places (e.g. printtup).

For those I think the allocator overhead is the bigger issue than the
memcpy itself. I wonder how much we could transparently hide in
pq_begintypsend()/pq_endtypsend().


> I
> wonder if we could replace the whole
> pq_beginmessage...()/pq_send()/pq_endmessage...() system with
> something a bit better-designed. For instance, suppose we get rid of
> the idea that the caller supplies the buffer, and we move the
> responsibility for error recovery into the pqcomm layer. So you do
> something like:
> 
> my_message = xyz_beginmessage('D');
> xyz_sendint32(my_message, 42);
> xyz_endmessage(my_message);
> 
> Maybe what happens here under the hood is we keep a pool of free
> message buffers sitting around, and you just grab one and put your
> data into it.

Why do we need multiple buffers? ISTM we don't want to just send
messages at endmsg() time, because that implies unnecessary syscall
overhead. Nor do we want to imply the overhead of the copy from the
message buffer to the network buffer.

To me that seems to imply that the best approach would be to have
PqSendBuffer be something stringbuffer like, and have pg_beginmessage()
record the starting position of the current message somewhere
(->cursor?). When an error is thrown, we reset the position to be where
the in-progress message would have begun.
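
A minimal sketch of that approach (assuming a StringInfo-backed send buffer that was initialized elsewhere with initStringInfo(); this is illustrative, not the actual pqcomm.c code):

#include "postgres.h"

#include <string.h>

#include "lib/stringinfo.h"
#include "port/pg_bswap.h"

static StringInfoData SendBuf;	/* stringbuffer-like PqSendBuffer */
static int	CurMsgStart = -1;	/* offset where the in-progress message began */

static void
sketch_beginmessage(char msgtype)
{
	CurMsgStart = SendBuf.len;				/* remember the rollback point */
	appendStringInfoChar(&SendBuf, msgtype);
	appendStringInfoSpaces(&SendBuf, 4);	/* placeholder for the length word */
}

static void
sketch_endmessage(void)
{
	uint32		netlen = pg_hton32(SendBuf.len - CurMsgStart - 1);

	memcpy(SendBuf.data + CurMsgStart + 1, &netlen, sizeof(netlen));
	CurMsgStart = -1;						/* message is complete */
}

static void
sketch_abortmessage(void)
{
	if (CurMsgStart >= 0)
		SendBuf.len = CurMsgStart;			/* drop the partial message */
	CurMsgStart = -1;
}

Here an error thrown mid-message just truncates the buffer back to the recorded start, so nothing partial ever reaches the client.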

I've previously outlined a slightly more complicated scheme, where we
have "proxy" stringinfos that point into another stringinfo, instead of
their own buffer. And know how to resize the "outer" buffer when
needed. That'd have some advantages, but I'm not sure it's really
needed.


There's some disadvantages with what I describe above, in particular
when dealing with send() sending only parts of our network buffer. We
couldn't cheaply reuse the already sent memory in that case.

I've before wondered / suggested that we should have StringInfos not
insist on having one consecutive buffer (which obviously implies needing
to copy contents when growing). Instead it should have a list of buffers
containing chunks of the data, and never copy contents around while the
string is being built. We'd only allocate a buffer big enough for all
data when the caller actually wants to have all the resulting data in
one string (rather than using an API that can iterate over chunks).

For the network buffer case that'd allow us to reuse the earlier buffers
even in the "partial send" case. And more generally it'd allow us to be
less wasteful with buffer sizes, and perhaps even have a small "inline"
buffer inside StringInfoData avoiding unnecessary memory allocations in
a lot of cases where the common case is only a small amount of data
being used. And I think the overhead while appending data to such a
stringinfo should be neglegible, because it'd just require the exact
same checks we already have to do for enlargeStringInfo().


> (4) seems unavoidable AFAIK.

Not entirely. Linux can do zero-copy sends, but it does require somewhat
complicated black magic rituals. Including more complex buffer
management for the application, because the memory containing the
to-be-sent data cannot be reused until the kernel notifies that it's
done with the buffer.

See https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html
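
For concreteness, the core of that interface looks roughly like this (Linux >= 4.14; error handling and the MSG_ERRQUEUE completion loop that tells you when the buffer may be reused are omitted):

#include <sys/socket.h>

static ssize_t
send_zerocopy(int sock, const char *buf, size_t len)
{
	int			one = 1;

	/* opt in (in real code this would be done once, at socket setup) */
	setsockopt(sock, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

	/*
	 * The kernel may keep referencing buf after send() returns; it must not
	 * be reused until a completion notification arrives via MSG_ERRQUEUE.
	 */
	return send(sock, buf, len, MSG_ZEROCOPY);
}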

That might be something worth pursuing in the future (since it, I think,
basically avoids spending any cpu cycles on copying data around in the
happy path, relying on DMA instead), but I think for now there's much
bigger fish to fry.

I am hoping that somebody will write a nicer abstraction for zero-copy
sends using io_uring. avoiding the need of a separate completion queue,
by simply only signalling completion for the sendmsg operation once the
buffer isn't needed anymore. There's 

Atomic operations within spinlocks

2020-06-03 Thread Tom Lane
In connection with the nearby thread about spinlock coding rule
violations, I noticed that we have several places that are doing
things like this:

SpinLockAcquire(&WalRcv->mutex);
...
written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
...
SpinLockRelease(&WalRcv->mutex);

That's from pg_stat_get_wal_receiver(); there is similar code in
freelist.c's ClockSweepTick() and StrategySyncStart().

This seems to me to be very bad code.  In the first place, on machines
without the appropriate type of atomic operation, atomics.c is going
to be using a spinlock to emulate atomicity, which means this code
tries to take a spinlock while holding another one.  That's not okay,
either from the standpoint of performance or error-safety.  In the
second place, this coding seems to me to indicate serious confusion
about which lock is protecting what.  In the above example, if
writtenUpto is only accessed through atomic operations then it seems
like we could just move the pg_atomic_read_u64 out of the spinlock
section; or if the spinlock is adequate protection then we could just
do a normal fetch.  If we actually need both locks then this needs
significant re-thinking, IMO.
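
Concretely, for the first example the rearrangement being suggested would look something like this (sketch only, not a proposed patch):

    SpinLockAcquire(&WalRcv->mutex);
    /* ... copy only the fields the spinlock actually protects ... */
    SpinLockRelease(&WalRcv->mutex);

    /* writtenUpto has its own concurrency protocol, so read it unlocked */
    written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);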

Comments?

regards, tom lane




Re: question regarding copyData containers

2020-06-03 Thread Tom Lane
Jerome Wagner  writes:
> now my question is the following :
> is it ok to consider that over the long term copyData is simply a transport
> container that exists only to allow the multiplexing of events in the
> protocol but that messages inside could be chunked over several copyData
> events ?

Yes, the expectation is that clients can send CopyData messages that are
split up however they choose; the message boundaries needn't correspond
to any semantic boundaries in the data stream.

The rule in the other direction, that a message corresponds to one table
row, is something that might not last forever either.  As we get more
people working with large data values, there's going to be pressure to
set some smaller limit on message size.

regards, tom lane




Re: elog(DEBUG2 in SpinLocked section.

2020-06-03 Thread Tom Lane
Kyotaro Horiguchi  writes:
> I looked through 224 locations where SpinLockAcquire and found some.

Yeah, I made a similar scan and arrived at about the same conclusions.
I think that the memcpy and strlcpy calls are fine; at least, we've got
to transport data somehow and it's not apparent why those aren't OK ways
to do it.  The one use of StrNCpy is annoying from a cosmetic standpoint
(mainly because it's Not Like Anywhere Else) but I'm not sure it's worth
changing.

The condition-variable code has a boatload of spinlocked calls of the
proclist functions in proclist.h.  All of those are straight-line code
so they're okay performance wise, but I wonder if we shouldn't add a
comment to that header pointing out that its functions must not throw
errors.

The only other thing I remain concerned about is some instances of atomic
operations inside spinlocks, which I started a separate thread about [1].

regards, tom lane

[1] https://www.postgresql.org/message-id/1141819.1591208385%40sss.pgh.pa.us




Re: Parallel copy

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> On Mon, May 18, 2020 at 12:48 AM Amit Kapila  wrote:
> > In the above case, even though we are executing a single command from
> > the user perspective, but the currentCommandId will be four after the
> > command.  One increment will be for the copy command and the other
> > three increments are for locking tuple in PK table
> > (tab_fk_referenced_chk) for three tuples in FK table
> > (tab_fk_referencing_chk).  Now, for parallel workers, it is
> > (theoretically) possible that the three tuples are processed by three
> > different workers which don't get synced as of now.  The question was
> > do we see any kind of problem with this and if so can we just sync it
> > up at the end of parallelism.

> I strongly disagree with the idea of "just sync(ing) it up at the end
> of parallelism". That seems like a completely unprincipled approach to
> the problem. Either the command counter increment is important or it's
> not. If it's not important, maybe we can arrange to skip it in the
> first place. If it is important, then it's probably not OK for each
> backend to be doing it separately.

That scares me too. These command counter increments definitely aren't
unnecessary in the general case.

Even in the example you share above, aren't we potentially going to
actually lock rows multiple times from within the same transaction,
instead of once?  If the command counter increments from within
ri_trigger.c aren't visible to other parallel workers/leader, we'll not
correctly understand that a locked row is invisible to heap_lock_tuple,
because we're not using a new enough snapshot (by dint of not having a
new enough cid).

I've not dug through everything that'd potentially cause, but it seems
pretty clearly a no-go from here.

Greetings,

Andres Freund




Re: Expand the use of check_canonical_path() for more GUCs

2020-06-03 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jun 2, 2020 at 5:04 AM Peter Eisentraut
>  wrote:
>> The archeology reveals that these calls were originally added to
>> canonicalize the data_directory and config_file settings (7b0f060d54),
>> but that was then moved out of guc.c to be done early during postmaster
>> startup (337ffcddba).  The remaining calls of check_canonical_path() in
>> guc.c appear to be leftovers from a previous regime.

> Thanks for looking into it. Sounds like it can just be ripped out,
> then, unless someone knows of a reason to do otherwise.

In the abstract, I agree with Peter's point that we shouldn't alter
user-given strings without need.  However, I think there's strong
reason for canonicalizing the data directory and config file locations.
We access those both before and after chdir'ing into the datadir, so
we'd better have absolute paths to them --- and at least for the
datadir, it's documented that you can initially give it as a path
relative to wherever you started the postmaster from.  If the other
files are only accessed after the chdir happens then we could likely
do without canonicalizing them.  But ... do we know which directory
the user (thought he) specified them with reference to?  Forced
canonicalization does have the advantage that it's clear to all
onlookers how we are interpreting the paths.

regards, tom lane




Re: Parallel copy

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 15:53:24 +0530, vignesh C wrote:
> Workers / Exec time (seconds), five scenarios:
>   (A) copy from file, 2 indexes on integer columns, 1 index on text column
>   (B) copy from stdin, 2 indexes on integer columns, 1 index on text column
>   (C) copy from file, 1 gist index on text column
>   (D) copy from file, 3 indexes on integer columns
>   (E) copy from stdin, 3 indexes on integer columns
>
> Workers  (A)               (B)               (C)               (D)               (E)
> 0        1162.772(1X)      1176.035(1X)      827.669(1X)       216.171(1X)       217.376(1X)
> 1        1110.288(1.05X)   1120.556(1.05X)   747.384(1.11X)    174.242(1.24X)    163.492(1.33X)
> 2        635.249(1.83X)    668.18(1.76X)     435.673(1.9X)     133.829(1.61X)    126.516(1.72X)
> 4        336.835(3.45X)    346.768(3.39X)    236.406(3.5X)     105.767(2.04X)    107.382(2.02X)
> 8        188.577(6.17X)    194.491(6.04X)    148.962(5.56X)    100.708(2.15X)    107.72(2.01X)
> 16       126.819(9.17X)    146.402(8.03X)    119.923(6.9X)     97.996(2.2X)      106.531(2.04X)
> 20       *117.845(9.87X)*  149.203(7.88X)    138.741(5.96X)    97.94(2.21X)      107.5(2.02)
> 30       127.554(9.11X)    161.218(7.29X)    172.443(4.8X)     98.232(2.2X)      108.778(1.99X)

Hm, you don't explicitly mention that in your design, but given how
small the benefit of going from 0 to 1 workers is, I assume the leader
doesn't do any "chunk processing" on its own?


> Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> submitted acts as leader with the responsibility of reading data from the
> file/stdin, launching at most n number of workers as specified with
> PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> common data required for the workers execution in the DSM and shares it
> with the workers. The leader then executes before statement triggers if
> there exists any. Leader populates DSM chunks which includes the start
> offset and chunk size, while populating the chunks it reads as many blocks
> as required into the DSM data blocks from the file. Each block is of 64K
> size. The leader parses the data to identify a chunk, the existing logic
> from CopyReadLineText which identifies the chunks with some changes was
> used for this. Leader checks if a free chunk is available to copy the
> information, if there is no free chunk it waits till the required chunk is
> freed up by the worker and then copies the identified chunks information
> (offset & chunk size) into the DSM chunks. This process is repeated till
> the complete file is processed. Simultaneously, the workers cache the
> chunks(50) locally into the local memory and release the chunks to the
> leader for further populating. Each worker processes the chunk that it
> cached and inserts it into the table. The leader waits till all the chunks
> populated are processed by the workers and exits.

Why do we need the local copy of 50 chunks? Copying memory around is far
from free. I don't see why it'd be better to add per-process caching,
rather than making the DSM bigger? I can see some benefit in marking
multiple chunks as being processed with one lock acquisition, but I
don't think adding a memory copy is a good idea.


This patch *desperately* needs to be split up. It imo is close to
unreviewable, due to a large amount of changes that just move code
around without other functional changes being mixed in with the actual
new stuff.


>  /*
> + * State of the chunk.
> + */
> +typedef enum ChunkState
> +{
> + CHUNK_INIT, /* initial state of 
> chunk */
> + CHUNK_LEADER_POPULATING,/* leader processing chunk */
> + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */
> + CHUNK_WORKER_PROCESSING,/* worker processing chunk */
> + CHUNK_WORKER_PROCESSED  /* worker completed processing chunk */
> +}ChunkState;
> +
> +#define RAW_BUF_SIZE 65536   /* we palloc RAW_BUF_SIZE+1 bytes */
> +
> +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> +#define RINGSIZE (10 * 1000)
> +#define MAX_BLOCKS_COUNT 1000
> +#define WORKER_CHUNK_COUNT 50/* should be mod of RINGSIZE */
> +
> +#define  IsParallelCopy()(cstate->is_parallel)
> +#define IsLeader()   (cstate->pcdata->is_leader)
> +#define IsHeaderLine()   (cstate->header_line && 
> cstate->cur_lineno == 1)
> +
> +/*
> + * Copy data block information.
> + */
> +typedef struct CopyDataBlock
> +{
> + /* The number of unprocessed chunks in the current block. */
> + pg_atomic_uint32 unprocessed_chunk_parts;
> +
> + /*
> +  * If the current chunk data is continued into another block,
> +  * following_block will have the position where the remaining data need 
> to
> +  * be read.
> +  */
> + uint32  following_block;
> +
> + /*
> +  * This flag will be set, when the leader finds out this block can be 
> read
> +  * safely by the worker. This helps the worker to start processing the 
> chunk
> +  * early where the chunk will be spread across many blocks and the 
> worker
> +  * need not wait for the complete chunk to be processed.
> +  */
> + bool   curr_blk_completed;
> +  

Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-03 Thread Tomas Vondra

Hi,

Not sure what's the root cause, but I can reproduce it. Timings for 9.6,
10 and master (all built from git with the same options) without explain
analyze look like this:

9.6
-
Time: 1971.314 ms
Time: 1995.875 ms
Time: 1997.408 ms
Time: 2069.913 ms
Time: 2004.196 ms

10
-
Time: 2815.434 ms (00:02.815)
Time: 2862.589 ms (00:02.863)
Time: 2841.126 ms (00:02.841)
Time: 2803.040 ms (00:02.803)
Time: 2805.527 ms (00:02.806)

master
-
Time: 3479.233 ms (00:03.479)
Time: 3537.901 ms (00:03.538)
Time: 3459.314 ms (00:03.459)
Time: 3542.810 ms (00:03.543)
Time: 3482.141 ms (00:03.482)

So there seems to be +40% between 9.6 and 10, and further +25% between
10 and master. However, plain hashagg, measured e.g. like this:

  select count(*) cx from foo group by b, c, d having count(*) = 1;

does not indicate any slowdown at all, so I think you're right it has
something to do with the looping.

Profiles from those versions look like this:

9.6
-
Samples
Overhead  Shared Objec  Symbol
  14.19%  postgres  [.] ExecMakeFunctionResultNoSets
  13.65%  postgres  [.] finalize_aggregates
  12.54%  postgres  [.] hash_seq_search
   6.70%  postgres  [.] finalize_aggregate.isra.0
   5.71%  postgres  [.] ExecEvalParamExec
   5.54%  postgres  [.] ExecEvalAggref
   5.00%  postgres  [.] ExecStoreMinimalTuple
   4.34%  postgres  [.] ExecAgg
   4.08%  postgres  [.] ExecQual
   2.67%  postgres  [.] slot_deform_tuple
   2.24%  postgres  [.] pgstat_init_function_usage
   2.22%  postgres  [.] check_stack_depth
   2.14%  postgres  [.] MemoryContextReset
   1.89%  postgres  [.] hash_search_with_hash_value
   1.72%  postgres  [.] project_aggregates
   1.68%  postgres  [.] pgstat_end_function_usage
   1.59%  postgres  [.] slot_getattr


10

Samples
Overhead  Shared Object   Symbol
  15.18%  postgres[.] slot_deform_tuple
  13.09%  postgres[.] agg_retrieve_hash_table
  12.02%  postgres[.] ExecInterpExpr
   7.47%  postgres[.] finalize_aggregates
   7.38%  postgres[.] tuplehash_iterate
   5.13%  postgres[.] prepare_projection_slot
   4.86%  postgres[.] finalize_aggregate.isra.0
   4.05%  postgres[.] bms_is_member
   3.97%  postgres[.] slot_getallattrs
   3.59%  postgres[.] ExecStoreMinimalTuple
   2.85%  postgres[.] project_aggregates
   1.95%  postgres[.] ExecClearTuple
   1.71%  libc-2.30.so[.] __memset_avx2_unaligned_erms
   1.69%  postgres[.] ExecEvalParamExec
   1.58%  postgres[.] MemoryContextReset
   1.17%  postgres[.] slot_getattr
   1.03%  postgres[.] slot_getsomeattrs


master
--
Samples
Overhead  Shared Object   Symbol
  17.07%  postgres[.] agg_retrieve_hash_table
  15.46%  postgres[.] tuplehash_iterate
  11.83%  postgres[.] tts_minimal_getsomeattrs
   9.39%  postgres[.] ExecInterpExpr
   6.94%  postgres[.] prepare_projection_slot
   4.85%  postgres[.] finalize_aggregates
   4.27%  postgres[.] bms_is_member
   3.80%  postgres[.] finalize_aggregate.isra.0
   3.80%  postgres[.] tts_minimal_store_tuple
   2.22%  postgres[.] project_aggregates
   2.07%  postgres[.] tts_virtual_clear
   2.07%  postgres[.] MemoryContextReset
   1.78%  postgres[.] tts_minimal_clear
   1.61%  postgres[.] ExecEvalParamExec
   1.46%  postgres[.] slot_getsomeattrs_int
   1.34%  libc-2.30.so[.] __memset_avx2_unaligned_erms

Not sure what to think about this. Seems slot_deform_tuple got way more
expensive between 9.6 and 10, for some reason. 



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Internal key management system

2020-06-03 Thread Bruce Momjian
On Wed, Jun  3, 2020 at 09:16:03AM +0200, Fabien COELHO wrote:
> > > Also, I'm not at fully at ease with some of the underlying principles
> > > behind this proposal. Are we re-inventing/re-implementing kerberos or
> > > whatever? Are we re-implementing a brand new KMS inside pg? Why having
> > > our own?
> > 
> > As I explained above, this key manager is for managing internal keys
> > used by TDE. It's not an alternative to existing key management
> > solutions/services.
> 
> Hmmm. This seems to suggest that interacting with something outside should
> be an option.

Our goal is not to implement every possible security idea someone has,
because we will never finish, and the final result would be too complex
to be usable.  You will need to explain exactly why having a separate
process has value over the coding/user complexity it adds, and you will
need to get agreement from a sufficient number of people to move that
idea forward.

> > The key rotation this key manager has is KEK rotation, which is very
> > important. Without KEK rotation, when KEK is leaked an attacker can
> > get database data by disk theft. Since KEK is responsible for
> > encrypting all internal keys it's necessary to re-encrypt the internal
> > keys when KEK is rotated. I think PG is the only role that can do that
> > job.
> 
> I'm not claiming that KEK rotation is a bad thing, I'm saying that it should
> not be postgres problem. My issue is where you put the thing, not about the
> thing itself.
> 
> > I think this key manager satisfies the first point by
> > cluster_passphrase_command. For the second point, the key manager
> > stores local keys inside PG while protecting them by KEK managed
> > outside of PG.
> 
> I do not understand. From what I understood from the code, the KEK is loaded
> into the postgres process. That is what I'm disagreeing with; only the needed
> DEKs should be there.

One option would be to send the data needing to be encrypted to an
external command, and get the decrypted data back.  In that way, the KEK
is never on the Postgres server.  However, the API for doing such an
interface seems very complex and could lead to failures.
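
To illustrate why, here is a toy sketch (everything below is hypothetical,
not a proposed API): pipe the blob through an external command and read the
result back.  Even this minimal version leaves out partial reads/writes,
SIGPIPE handling, timeouts, deadlocks on large payloads, and error
reporting, which is exactly where the complexity creeps in.

#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch: feed 'in' to an external command's stdin and read
 * its stdout into 'out'.  Only plausible for small key material.
 */
static ssize_t
run_wrap_command(const char *cmd, const void *in, size_t inlen,
				 void *out, size_t outsize)
{
	int			topipe[2];
	int			frompipe[2];
	pid_t		pid;
	bool		ok;
	ssize_t		n;

	if (pipe(topipe) < 0)
		return -1;
	if (pipe(frompipe) < 0)
	{
		close(topipe[0]);
		close(topipe[1]);
		return -1;
	}

	pid = fork();
	if (pid < 0)
	{
		close(topipe[0]);
		close(topipe[1]);
		close(frompipe[0]);
		close(frompipe[1]);
		return -1;
	}
	if (pid == 0)
	{
		/* child: wire up stdin/stdout and run the command */
		dup2(topipe[0], STDIN_FILENO);
		dup2(frompipe[1], STDOUT_FILENO);
		close(topipe[0]);
		close(topipe[1]);
		close(frompipe[0]);
		close(frompipe[1]);
		execl("/bin/sh", "sh", "-c", cmd, (char *) NULL);
		_exit(127);
	}

	/* parent: write the input, signal EOF, read the transformed result */
	close(topipe[0]);
	close(frompipe[1]);
	ok = (write(topipe[1], in, inlen) == (ssize_t) inlen);
	close(topipe[1]);
	n = ok ? read(frompipe[0], out, outsize) : -1;
	close(frompipe[0]);
	waitpid(pid, NULL, 0);
	return n;
}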

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





libpq copy error handling busted

2020-06-03 Thread Andres Freund
Hi,

When libpq is used to COPY data to the server, it doesn't properly
handle errors.

An easy way to trigger the problem is to start pgbench -i with a
sufficiently large scale, and then just shut the server down. pgbench
will happily use 100% of the cpu trying to send data to the server, even
though libpq knows that the connection is broken.

It can't even be cancelled using ctrl-c anymore, because the cancel
request cannot be sent:

andres@awork3:~/src/postgresql$ pgbench -i -s 4000 -q
dropping old tables...
creating tables...
generating data (client-side)...
80889300 of 400000000 tuples (20%) done (elapsed 85.00 s, remaining 335.33 s)
^CCould not send cancel request: PQcancel() -- connect() failed: No such file 
or directory


This is partially an old problem, and partially got recently
worse. Before the below commit we detected terminated connections, but
we didn't handle copy failing.


The failure to handle terminated connections originates in:

commit 1f39a1c0641531e0462a4822f2dba904c5d4d699
Author: Tom Lane 
Date:   2019-03-19 16:20:20 -0400

Restructure libpq's handling of send failures.


The problem is basically that pqSendSome() returns *success* in all
failure cases. Both when a failure is already known:

+/*
+ * If we already had a write failure, we will never again try to send data
+ * on that connection.  Even if the kernel would let us, we've probably
+ * lost message boundary sync with the server.  conn->write_failed
+ * therefore persists until the connection is reset, and we just discard
+ * all data presented to be written.
+ */
+if (conn->write_failed)
+{
+/* conn->write_err_msg should be set up already */
+conn->outCount = 0;
+return 0;
+}
+

and when initially "diagnosing" the failure:
			/* Anything except EAGAIN/EWOULDBLOCK/EINTR is trouble */
			switch (SOCK_ERRNO)
			...
				/* Discard queued data; no chance it'll ever be sent */
				conn->outCount = 0;
				return 0;

The idea of the above commit was:
Instead, let's fix this in a more general fashion by eliminating
pqHandleSendFailure altogether, and instead arranging to postpone
all reports of send failures in libpq until after we've made an
attempt to read and process server messages.  The send failure won't
be reported at all if we find a server message or detect input EOF.

but that doesn't work here, because we never process the error
message. There's no code in pqParseInput3() to process server messages
while doing copy.


I'm honestly a bit baffled. How can we not have noticed that COPY FROM
STDIN doesn't handle errors before the input is exhausted? It's not just
pgbench, it's psql (and I assume pg_restore) etc. as well.

Greetings,

Andres Freund




Re: significant slowdown of HashAggregate between 9.6 and 10

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote:
> So there seems to be +40% between 9.6 and 10, and further +25% between
> 10 and master. However, plain hashagg, measured e.g. like this:

Ugh.

Since I am a likely culprit, I'm taking a look.


> Not sure what to think about this. Seems slot_deform_tuple got way more
> expensive between 9.6 and 10, for some reason.

Might indicate too many calls instead. Or it could just be the fact that
we have expressions deform all columns once, instead of deforming
columns one-by-one now.

Greetings,

Andres Freund




Re: what can go in root.crt ?

2020-06-03 Thread Bruce Momjian
On Wed, Jun  3, 2020 at 03:07:30PM +0300, Ants Aasma wrote:
> On Tue, 2 Jun 2020 at 20:14, Bruce Momjian  wrote:
> 
> The server certificate should be issued by a certificate authority root
> outside of your organization only if you want people outside of your
> organization to trust your server certificate, but you are then asking
> for the client to only trust an intermediate inside your organization.
> The big question is why bother having the server certificate chain to a
> root certificate you don't trust when you have no intention of having
> clients outside of your organization trust the server certificate.
> Postgres could be made to handle such cases, but is it really a valid
> configuration we should support?
> 
> 
> I think the "why" the org cert is not root was already made clear, that is the
> copmany policy. I don't think postgres should take a stance whether the
> certificate designated as the root of trust is self-signed or claims to get 
> its
> power fromΒ somewhere else.

Uh, we sure can.  We disallow many configurations that we consider
unsafe.  openssl allowed a lot of things, and that flexibility makes
them less secure.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: Atomic operations within spinlocks

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 14:19:45 -0400, Tom Lane wrote:
> In connection with the nearby thread about spinlock coding rule
> violations, I noticed that we have several places that are doing
> things like this:
>
>   SpinLockAcquire(&WalRcv->mutex);
>   ...
>   written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);
>   ...
>   SpinLockRelease(&WalRcv->mutex);
>
> That's from pg_stat_get_wal_receiver(); there is similar code in
> freelist.c's ClockSweepTick() and StrategySyncStart().
>
> This seems to me to be very bad code.  In the first place, on machines
> without the appropriate type of atomic operation, atomics.c is going
> to be using a spinlock to emulate atomicity, which means this code
> tries to take a spinlock while holding another one.  That's not okay,
> either from the standpoint of performance or error-safety.

I'm honestly not particularly concerned about either of those in
general:

- WRT performance: Which platforms where we care about performance can't
  do a tear-free read of a 64bit integer, and thus needs a spinlock for
  this? And while the cases in freelist.c aren't just reads, they are
  really cold paths (clock wrapping around).
- WRT error safety: What could happen here? The atomics codepath is
  no-fail code? And nothing should ever nest inside the atomic-emulation
  spinlocks. What am I missing?

I think straight out prohibiting use of atomics inside a spinlock will
lead to more complicated and slower code, rather than improving
anything. I'm to blame for the ClockSweepTick() case, and I don't really
see how we could avoid the atomic-while-spinlocked without relying on
64bit atomics, which are emulated on more platforms, and without having
a more complicated retry loop.


> In the second place, this coding seems to me to indicate serious
> confusion about which lock is protecting what.  In the above example,
> if writtenUpto is only accessed through atomic operations then it
> seems like we could just move the pg_atomic_read_u64 out of the
> spinlock section; or if the spinlock is adequate protection then we
> could just do a normal fetch.  If we actually need both locks then
> this needs significant re-thinking, IMO.

Yea, it doesn't seem necessary at all that writtenUpto is read with the
spinlock held. That's very recent:

commit 2c8dd05d6cbc86b7ad21cfd7010e041bb4c3950b
Author: Michael Paquier 
Date:   2020-05-17 09:22:07 +0900

Make pg_stat_wal_receiver consistent with the WAL receiver's shmem info

and I assume it just was caused by mechanical replacement, rather than
intentionally doing so while holding the spinlock. As far as I can tell
none of the other writtenUpto accesses are under the spinlock.

Greetings,

Andres Freund




question regarding copyData containers

2020-06-03 Thread Jerome Wagner
Hello,

I have been working on a node.js streaming client for different COPY
scenarios.
usually, during CopyOut, clients tend to buffer network chunks until they
have gathered a full copyData message and pass that to the user.

In some cases, this can lead to very large copyData messages. when there
are very long text fields or bytea fields it will require a lot of memory
to be handled (up to 1GB I think in the worst case scenario)

In COPY TO, I managed to relax that requirement, considering that copyData
is simply a transparent container. For each network chunk, the relevant
message content is forwarded, which makes for 64KB chunks at most.

If that makes things clearer, here is an example scenario, with 4 network
chunks received and the way they are forwarded to the client.

in: CopyData Int32Len Byten1
in: Byten2
in: Byten3
in: CopyData Int32Len Byten4

out: Byten1
out: Byten2
out: Byten3
out: Byten4

We lose the semantics of the "row" that copyData has according to the
documentation
https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY
> The backend sends a CopyOutResponse message to the frontend, followed by
> zero or more CopyData messages (**always one per row**), followed by
> CopyDone

but it is not a problem because the raw bytes are still parsable (rows +
fields) in text mode (tsv) and in binary mode.

Now I started working on copyBoth and logical decoding scenarios. In this
case, the server sends a series of copyData messages, 1 copyData containing
1 message:

at the network chunk level, in the case of large fields, we can observe

in: CopyData Int32 XLogData Int64  Int64  Int64 Byten1
in: Byten2
in: CopyData Int32 XLogData Int64  Int64  Int64 Byten3
in: CopyData Int32 XLogData Int64  Int64  Int64 Byten4

out: XLogData Int64  Int64  Int64 Byten1
out: Byten2
out: XLogData Int64  Int64  Int64 Byten3
out: XLogData Int64  Int64  Int64 Byten4

but at the XLogData level, the protocol is not self-describing its length,
so there is no real way of knowing where the first XLogData ends apart from
 - knowing the length of the first copyData (4 + 1 + 3*8 + n1 + n2)
 - knowing the internals of the output plugin and benefiting from a plugin
   that self-describes its span

when a network chunk contains several copyDatas
in: CopyData Int32 XLogData Int64  Int64  Int64 Byten1  CopyData Int32
XLogData Int64  Int64  Int64 Byten2
we have
out: XLogData Int64  Int64  Int64 Byten1 XLogData Int64  Int64  Int64 Byten2

and with test_decoding for example it is impossible to know where the
test_decoding output ends without remembering the original length of the
copyData.
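
To make that concrete, here is a minimal parsing sketch (v3 protocol; the
struct and helper names are mine, just for illustration): the plugin payload
length can only be recovered from the enclosing copyData length, not from
the XLogData message itself.

#include <stdbool.h>
#include <stdint.h>

typedef struct XLogDataMsg
{
	uint64_t	wal_start;		/* Int64 dataStart */
	uint64_t	wal_end;		/* Int64 walEnd */
	uint64_t	send_time;		/* Int64 sendTime */
	uint32_t	payload_len;	/* derived, not present on the wire */
} XLogDataMsg;

static uint64_t
read_be64(const uint8_t *p)
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * 'body' is the content of one CopyData ('d') message and 'body_len' is
 * its Int32 length minus the 4 length bytes.  The plugin output is simply
 * whatever remains of the copyData frame after the 25-byte 'w' header.
 */
static bool
parse_xlogdata(const uint8_t *body, uint32_t body_len,
			   XLogDataMsg *msg, const uint8_t **payload)
{
	if (body_len < 1 + 3 * 8 || body[0] != 'w')
		return false;

	msg->wal_start = read_be64(body + 1);
	msg->wal_end = read_be64(body + 9);
	msg->send_time = read_be64(body + 17);
	msg->payload_len = body_len - (1 + 3 * 8);
	*payload = body + 25;
	return true;
}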

now my question is the following:
is it ok to consider that, over the long term, copyData is simply a transport
container that exists only to allow the multiplexing of events in the
protocol, but that messages inside could be chunked over several copyData
events?

if we put test_decoding aside, do you consider that output plugins' XLogData
should be self-aware of their length? I suppose (but did not fully verify
yet) that this is the case for pgoutput? I suppose that wal2json could
also be parsed by balancing the brackets.

I am wondering because when a client sends copyData to the server, the
documentation says
> The message boundaries are not required to have anything to do with row
> boundaries, although that is often a reasonable choice.

I hope that my message will ring a bell on the list.
I tried the best I could to describe my very specific research.
Thank you for your help,
---
Jérôme


[PATCH] Leading minus for negative time interval in ISO 8601

2020-06-03 Thread Mikhail Titov
Hello!

I'd like to propose a simple patch to allow for negative ISO 8601
intervals with a leading minus, e.g. -PT1H besides PT-1H. It seems that
the standard isn't quite clear on negative durations. However, lots of
software uses a leading minus and expects/generates intervals in such forms,
making those incompatible with the current PostgreSQL decoding code.

All the patch does is note a leading minus and negate the pg_tm
components along with the fractional seconds. No other behavior change is
introduced.

--
Mikhail
From c1d96252ebd727b205e1ee66f34f8b564eb05c00 Mon Sep 17 00:00:00 2001
From: Mikhail Titov 
Date: Wed, 3 Jun 2020 02:29:38 -0500
Subject: [PATCH] Support leading minus in front of ISO 8601 interval string

It is a common extension for a standard to have negative intervals
prefixed with a leading minus instead of individual parts.
For example -PT1H as an alternative to PT-1H.
The standard itself is not specific on negative intervals.

This patch makes PostgreSQL a bit more compatible with 3-rd party software
that use ISO 8601 style intervals that might be negative, e.g.
* https://github.com/rails/rails/issues/39503
* https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html#parse-java.lang.CharSequence-
---
 src/backend/utils/adt/datetime.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 1914138a24..f79df368a2 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -3492,10 +3492,17 @@ DecodeISO8601Interval(char *str,
 {
 	bool		datepart = true;
 	bool		havefield = false;
+	bool		negative = false;

 	*dtype = DTK_DELTA;
 	ClearPgTm(tm, fsec);

+	if (str[0] == '-')
+	{
+		negative = true;
+		str++;
+	}
+
 	if (strlen(str) < 2 || str[0] != 'P')
 		return DTERR_BAD_FORMAT;

@@ -3681,6 +3688,19 @@ DecodeISO8601Interval(char *str,
 		havefield = true;
 	}

+	if (negative)
+	{
+		tm->tm_sec = -tm->tm_sec;
+		tm->tm_min = -tm->tm_min;
+		tm->tm_hour = -tm->tm_hour;
+		tm->tm_mday = -tm->tm_mday;
+		tm->tm_mon = -tm->tm_mon;
+		tm->tm_year = -tm->tm_year;
+		tm->tm_wday = -tm->tm_wday;
+		tm->tm_yday = -tm->tm_yday;
+		*fsec = -*fsec;
+	}
+
 	return 0;
 }

--
2.26.0.windows.1



Re: Atomic operations within spinlocks

2020-06-03 Thread Thomas Munro
On Thu, Jun 4, 2020 at 8:45 AM Andres Freund  wrote:
> On 2020-06-03 14:19:45 -0400, Tom Lane wrote:
> > In the second place, this coding seems to me to indicate serious
> > confusion about which lock is protecting what.  In the above example,
> > if writtenUpto is only accessed through atomic operations then it
> > seems like we could just move the pg_atomic_read_u64 out of the
> > spinlock section; or if the spinlock is adequate protection then we
> > could just do a normal fetch.  If we actually need both locks then
> > this needs significant re-thinking, IMO.
>
> Yea, it doesn't seem necessary at all that writtenUpto is read with the
> spinlock held. That's very recent:
>
> commit 2c8dd05d6cbc86b7ad21cfd7010e041bb4c3950b
> Author: Michael Paquier 
> Date:   2020-05-17 09:22:07 +0900
>
> Make pg_stat_wal_receiver consistent with the WAL receiver's shmem info
>
> and I assume just was caused by mechanical replacement, rather than
> intentionally doing so while holding the spinlock. As far as I can tell
> none of the other writtenUpto accesses are under the spinlock.

Yeah.  It'd be fine to move that after the spinlock release.  Although
it's really just for informational purposes, not for any data
integrity purpose, reading it before the spinlock acquisition would
theoretically allow it to appear (reportedly) to be behind
flushedUpto, which would be silly.
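
I.e. roughly (a sketch of the rearrangement, not a tested patch):

	SpinLockAcquire(&WalRcv->mutex);
	/* ... copy the other fields that the mutex protects ... */
	SpinLockRelease(&WalRcv->mutex);

	/*
	 * writtenUpto is atomic on its own; reading it after the release also
	 * means it can't spuriously appear to be behind flushedUpto.
	 */
	written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);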




Re: question regarding copyData containers

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 19:28:12 +0200, Jerome Wagner wrote:
> I have been working on a node.js streaming client for different COPY
> scenarios.
> usually, during CopyOut, clients tend to buffer network chunks until they
> have gathered a full copyData message and pass that to the user.
> 
> In some cases, this can lead to very large copyData messages. when there
> are very long text fields or bytea fields it will require a lot of memory
> to be handled (up to 1GB I think in the worst case scenario)
> 
> In COPY TO, I managed to relax that requirement, considering that copyData
> is simply a transparent container. For each network chunk, the relevant
> message content is forwarded, which makes for 64KB chunks at most.

Uhm.


> We lose the semantics of the "row" that copyData has according to the
> documentation
> https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY
> > The backend sends a CopyOutResponse message to the frontend, followed by
> > zero or more CopyData messages (**always one per row**), followed by
> > CopyDone
> 
> but it is not a problem because the raw bytes are still parsable (rows +
> fields) in text mode (tsv) and in binary mode.

This seems like an extremely bad idea to me. Are we really going to ask
clients to incur the overhead (both in complexity and runtime) to parse
incoming data just to detect row boundaries?  Given the number of
options there are for COPY, that's a seriously complicated task.

I think that's a completely no-go.


Leaving error handling aside (see para below), what does this actually
get you? Either your client cares about getting a row in one sequential
chunk, or it doesn't. If it doesn't care, then there's no need to
allocate a buffer that can contain the whole 'd' message. You can just
hand the clients the chunks incrementally. If it does, then you need to
reassemble either way (or worse, you force the client to reimplement
that).

I assume what you're trying to get at is being able to send CopyData
messages before an entire row is assembled? And you want to send
separate CopyData messages to allow for error handling?  I think that's
a quite worthwhile goal, but I don't think it can sensibly solved by
just removing protocol level framing of row boundaries. And that will
mean evolving the protocol in a non-compatible way.


> Now I started working on copyBoth and logical decoding scenarios. In this
> case, the server send series of copyData. 1 copyData containing 1 message :
> 
> at the network chunk level, in the case of large fields, we can observe
> 
> in: CopyData Int32 XLogData Int64  Int64  Int64 Byten1
> in: Byten2
> in: CopyData Int32 XLogData Int64  Int64  Int64 Byten3
> in: CopyData Int32 XLogData Int64  Int64  Int64 Byten4
> 
> out: XLogData Int64  Int64  Int64 Byten1
> out: Byten2
> out: XLogData Int64  Int64  Int64 Byten3
> out: XLogData Int64  Int64  Int64 Byten4

> but at the XLogData level, the protocol is not self-describing its length,

> so there is no real way of knowing where the first XLogData ends apart from
>  - knowing the length of the first copyData (4 + 1 + 3*8 + n1 + n2)
>  - knowing the internals of the output plugin and benefit from a plugin
> that self-describe its span
> when a network chunk contains several copyDatas
> in: CopyData Int32 XLogData Int64  Int64  Int64 Byten1  CopyData Int32
> XLogData Int64  Int64  Int64 Byten2
> we have
> out: XLogData Int64  Int64  Int64 Byten1 XLogData Int64  Int64  Int64 Byten2

Right now all 'w' messages should be contained in one CopyData/'d' that
doesn't contain anything but the XLogData/'w'.

Do you just mean that if we'd change the server side code to split 'w'
messages across multiple 'd' messages, then we couldn't make much sense
of the data anymore?  If so, then I don't really see a problem. Unless
you do a much larger change, what'd be the point in allowing to split
'w' across multiple 'd' chunks?  The input data exists in a linear
buffer already, so you're not going to reduce peak memory usage by
sending smaller CopyData chunks.

Sure, we could evolve the logical decoding output interface to be
able to send data in a much more incremental way than the typical
per-row basis. But I think that'd quite substantially increase
complexity. And the message framing seems to be the easier part of such
a change.

Greetings,

Andres Freund




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-02 14:23:50 +0900, Fujii Masao wrote:
> On 2020/06/02 13:24, Michael Paquier wrote:
> > On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote:
> > > Yes. Conversely, if we start logical replication in a physical
> > > replication connection (i.e. replication=true) we got an error before
> > > starting replication:
> > > 
> > > ERROR:  logical decoding requires a database connection
> > > 
> > > I think we can prevent that SEGV in a similar way.
> > 
> > Still unconvinced, as this restriction stands for logical decoding
> > requiring a database connection, but it is not necessarily true now as
> > physical replication has fewer restrictions than a logical one.
> 
> Could you tell me what the benefit for supporting physical replication on
> logical rep connection is? If it's only for "undocumented"
> backward-compatibility, IMO it's better to reject such "tricky" set up.
> But if there are some use cases for that, I'm ok to support that.

I don't think we should prohibit this. For one, it'd probably break some
clients, without a meaningful need.

But I think it's also actually quite useful to be able to access
catalogs before streaming data. You e.g. can look up configuration of
the primary before streaming WAL. With a second connection that's
actually harder to do reliably in some cases, because you need to be
sure that you actually reached the right server (consider a pooler,
automatic failover etc).

Greetings,

Andres Freund




Re: Parallel Seq Scan vs kernel read ahead

2020-06-03 Thread Soumyadeep Chakraborty
> It doesn't look like it's using table_block_parallelscan_nextpage() as
> a block allocator so it's not affected by the patch.  It has its own
> thing zs_parallelscan_nextrange(), which does
> pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids,
> ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x10.

My apologies, I was too hasty. Indeed, you are correct. Zedstore's
unit of work is chunks of the logical zstid space. There is a
correlation between the zstid and blocks: zstids near each other are
likely to lie in the same block or in neighboring blocks. It would be
interesting to try something like this patch for Zedstore.

Regards,
Soumyadeep




Re: Parallel Seq Scan vs kernel read ahead

2020-06-03 Thread Soumyadeep Chakraborty
On Sat, May 23, 2020 at 12:00 AM Robert Haas
 wrote:
> I think there's a significant difference. The idea I remember being
> discussed at the time was to divide the relation into equal parts at
> the very start and give one part to each worker. I think that carries
> a lot of risk of some workers finishing much sooner than others.

Was the idea of work-stealing considered? Here is what I have been
thinking about:

Each worker would be assigned a contiguous chunk of blocks at init time.
Then if a worker is finished with its work, it can inspect other
workers' remaining work and "steal" some of the blocks from the end of
the victim worker's allocation.

Considerations for such a scheme:

1. Victim selection: Who will be the victim worker? It can be selected at
random if nothing else.

2. How many blocks to steal? Stealing half of the victim's remaining
blocks seems to be fair.

3. Stealing threshold: We should disallow stealing if the amount of
remaining work is not enough in the victim worker.

4. Additional parallel state: Representing the chunk of "work". I guess
one variable for the current block and one for the last block in the
chunk allocated. The latter would have to be protected with atomic
fetches as it would be decremented by the stealing worker.

5. Point 4 implies that there might be more atomic fetch operations as
compared to this patch. Idk if that is a lesser evil than the workers
being idle... probably not? A way to salvage that a little would be to
forego atomic fetches when the amount of work remaining is less than the
threshold discussed in 3, as there is no possibility of work stealing then.
A rough sketch of 4 and 5 follows below.
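
Something like this, keeping each worker's remaining range packed into a
single 64-bit word so both the owner and a thief can update it with one
compare-and-swap (untested; the packing, the names, and STEAL_THRESHOLD
are all just for illustration -- the "remaining < STEAL_THRESHOLD" test is
the threshold from point 3):

#include "postgres.h"
#include "port/atomics.h"
#include "storage/block.h"

#define STEAL_THRESHOLD		64	/* don't bother stealing less than this */

#define RANGE_NEXT(r)	((uint32) ((r) >> 32))
#define RANGE_LAST(r)	((uint32) (r))
#define MAKE_RANGE(next, last)	(((uint64) (next) << 32) | (uint32) (last))

/* Owner claims its next block; false means its range is empty. */
static bool
claim_next_block(pg_atomic_uint64 *range, BlockNumber *blkno)
{
	uint64		old = pg_atomic_read_u64(range);

	for (;;)
	{
		uint32		next = RANGE_NEXT(old);
		uint32		last = RANGE_LAST(old);

		if (next >= last)
			return false;		/* nothing left, go look for a victim */
		if (pg_atomic_compare_exchange_u64(range, &old,
										   MAKE_RANGE(next + 1, last)))
		{
			*blkno = next;
			return true;
		}
		/* 'old' was refreshed by the failed CAS; retry */
	}
}

/* Thief takes the upper half of a victim's remaining blocks. */
static bool
steal_blocks(pg_atomic_uint64 *victim, BlockNumber *start, BlockNumber *end)
{
	uint64		old = pg_atomic_read_u64(victim);

	for (;;)
	{
		uint32		next = RANGE_NEXT(old);
		uint32		last = RANGE_LAST(old);
		uint32		remaining = (last > next) ? last - next : 0;
		uint32		new_last;

		if (remaining < STEAL_THRESHOLD)
			return false;
		new_last = last - remaining / 2;
		if (pg_atomic_compare_exchange_u64(victim, &old,
										   MAKE_RANGE(next, new_last)))
		{
			*start = new_last;
			*end = last;
			return true;
		}
	}
}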


Regards,

Soumyadeep




Re: libpq copy error handling busted

2020-06-03 Thread Tom Lane
Andres Freund  writes:
> When libpq is used to COPY data to the server, it doesn't properly
> handle errors.
> This is partially an old problem, and partially got recently
> worse. Before the below commit we detected terminated connections, but
> we didn't handle copy failing.

Yeah.  After poking at that for a little bit, there are at least three
problems:

* pqSendSome() is responsible not only for pushing out data, but for
calling pqReadData in any situation where it can't get rid of the data
promptly.  1f39a1c06 overlooked that requirement, and the upshot is
that we don't necessarily notice that the connection is broken (it's
pqReadData's job to detect that).  Putting a pqReadData call into
the early-exit path helps, but doesn't fix things completely.

* The more longstanding problem is that the PQputCopyData code path
doesn't have any mechanism for consuming an 'E' (error) message
once pqReadData has collected it.  AFAICS that's ancient.  (It does
not affect the behavior of this case if you use an immediate-mode
shutdown, because then the backend never issues an 'E' message;
but it does matter in a fast-mode shutdown.)  I think that the
idea was to let the client dump all its copy data and then report
the error message when PQendCopy is called, but as you say, that's
none too friendly when there's gigabytes of data involved.  I'm
not sure we can do anything about this without effectively changing
the client API for copy errors, though.

* As for control-C not getting out of it: there is

if (CancelRequested)
break;

in pgbench's loop, but this does nothing in this scenario because
fe-utils/cancel.c only sets that flag when it successfully sends a
Cancel ... which it certainly cannot if the postmaster is gone.
I suspect that this may be relatively recent breakage.  It doesn't look
too well thought out, in any case.  The places that are testing this
flag look like they'd rather not be bothered with the fine point of
whether the cancel request actually went anywhere.  (And aside from this
issue, I see no mechanism for that flag to become unset once it's set.
Current users of cancel.c probably don't care, but we'd have noticed if
we tried to make psql use it.)

regards, tom lane




Re: Parallel Seq Scan vs kernel read ahead

2020-06-03 Thread Soumyadeep Chakraborty
On Wed, Jun 3, 2020 at 3:18 PM Soumyadeep Chakraborty
 wrote:
> Idk if that is a lesser evil than the workers
> being idle..probably not?

Apologies, I meant that the extra atomic fetches is probably a lesser
evil than the workers being idle.

Soumyadeep




Re: what can go in root.crt ?

2020-06-03 Thread Chapman Flack
On 06/03/20 08:07, Ants Aasma wrote:
> I think the "why" the org cert is not root was already made clear, that is
> the copmany policy.

Thank you, yes, that was what I had intended to convey, and you have saved
me finishing a weedsier follow-up message hoping to convey it better.

> I don't think postgres should take a stance ...

> whether the
> certificate designated as the root of trust is self-signed or claims to get
> its power from somewhere else.

On 06/03/20 16:34, Bruce Momjian wrote:
> Uh, we sure can.  We disallow many configurations that we consider
> unsafe.

Ok, so a person in the situation described here, who is not in a position
to demand changes in an organizational policy (whether or not it seems
ill-conceived to you or even to him/her), is facing this question:

What are the "safest" things I /can/ do, under the existing constraints,
and /which of those will work in PostgreSQL/?

For example, we might agree that it is safe to trust nothing but the
end-entity cert of my server itself. I made a server, here is its cert,
here is a root.crt file for libpq containing only this exact cert, I
want libpq to connect only ever to this server with this cert and nothing
else. It's a pain because I have to roll out new root.crt files to everybody
whenever the cert changes, but it would be hard to call it unsafe.

Great! Can I do that? I think the answer is no. I haven't found it
documented, but I think libpq will fail such a connection, because
the cert it has found in root.crt is not self-signed.

Or, vary the scenario just enough that my organization, or even my
department in my organization, now has its own CA, as the first
intermediate, the issuer of the end-entity cert.

It might be entirely reasonable to put that CA cert into root.crt,
so libpq would only connect to things whose certs were issued in
my department, or at least my org. I trust the person who would be
issuing my department's certs (in all likelihood, me). I would
more-or-less trust my org to issue certs for my org.

Great! Can I do that? I think that answer is also no, for the same
reason.

My department's or org's CA cert isn't going to be self-signed, it's
going to be vouched for by a chain of more certs leading to a globally
recognized one. Why? Short answer, our org also has web sites. We like
for people's browsers to be able to see those. We have one group that
makes server certs and they follow one procedure, and the certs they make
come out the same way. That shouldn't be a problem for PostgreSQL, so it's
hard to argue they should have to use a different procedure just for my
cert.

> I don't think postgres should take a stance whether the
> certificate designated as the root of trust is self-signed or claims
> to get its power from somewhere else.

I'm inclined to agree but I would change the wording a bit for clarity.
*Any* certificate we are going to trust gets its power from somewhere
else. Being self-signed is not an exception to that rule (if it were,
every Snake Oil, Ltd. self-signed cert generated by every student in
every network security class ever would be a root of trust).

For us to trust a cert, it must be vouched for in some way. The most
important vouching that happens, as far as libpq is concerned, is
vouching by the administrator who controls the file system where root.crt
is found and the contents of that file.

If libpq is looking at a cert and finds it in that file, I have vouched
for it. That's why it's there.

If it is self-signed, then I'm the only person vouching for it, and that's
ok.

If it is not self-signed, that just means somebody else also has vouched
for it. Maybe for the same use, maybe for some other use. In any event,
the fact that somebody else has also vouched for it does not in any way
negate that I vouched for it, by putting it there in that file I control.

> It's pretty easy to conceive of certificate management procedures that make
> use of this chain to implement certificate replacement securely. For
> example one might trust the global issuer to verify that a CSR is coming
> from the O= value that it's claiming to come from to automate replacement
> of intermediate certificates, but not trust that every other sub-CA signed
> by root and their sub-sub-CA-s are completely honest and secure.

That's an example of the kind of policy design I think ought to be possible,
but a first step to getting there would be to just better document what
does and doesn't work in libpq now. There seem to be some possible
configurations that aren't available, not because of principled arguments
for disallowing them, but because they fail unstated assumptions.

In an ideal world, I think libpq would be using this algorithm:

  I'm looking at the server's certificate, s.
  Is s unexpired and in the trust file? If so, SUCCEED.

  otherwise, loop:
get issuer certificate i from s (if s is self-signed, FAIL).
does i have CA:TRUE and Certificate Sign bits? If not, FAIL.
does i's Domain Constrai

Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Alvaro Herrera
On 2020-Jun-03, Andres Freund wrote:

> I don't think we should prohibit this. For one, it'd probably break some
> clients, without a meaningful need.

There *is* a need, namely to keep complexity down.  This is quite
convoluted, it's got a lot of historical baggage because of the way it
was implemented, and it's very difficult to understand.  The greatest
motive I see is to make this easier to understand, so that it is easier
to modify and improve in the future.

> But I think it's also actually quite useful to be able to access
> catalogs before streaming data. You e.g. can look up configuration of
> the primary before streaming WAL. With a second connection that's
> actually harder to do reliably in some cases, because you need to be
> sure that you actually reached the right server (consider a pooler,
> automatic failover etc).

I don't think having a physical replication connection access catalog
data directly is a great idea.  We already have gadgets like
IDENTIFY_SYSTEM for physical replication that can do that, and if you
need particular settings you can use SHOW (commit d1ecd539477).  If
there was a strong need for even more than that, we can add something to
the grammar.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Alvaro Herrera
On 2020-Jun-02, Michael Paquier wrote:

> I can note as well that StartLogicalReplication() moves in this sense
> by setting xlogreader to be the one from logical_decoding_ctx once the
> decoding context has been created.
> 
> This results in the attached.  The extra test from upthread to check
> that logical decoding is not allowed in a non-database WAL sender is a
> good idea, so I have kept it.

I don't particularly disagree with your proposed patch -- in fact, it
seems to make things cleaner.  It is a little wasteful, but I don't
really mind that.  It's just some memory, and it's not a significant
amount.

That said, I would *also* apply Kyotaro's proposed patch to prohibit a
physical standby running with a logical slot, if only because that
reduces the number of combinations that we need to test and keep our
collective heads straight about.  Just reject the weird case and have
one type of slot for each type of replication.  I didn't even think this
was at all possible.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"

2020-06-03 Thread Alvaro Herrera
On 2020-Jun-03, Mark Dilger wrote:

> The name "relkind" normally refers to a field of type 'char' with
> values like 'r' for "table" and 'i' for "index".  In AlterTableStmt
> and CreateTableAsStmt, this naming convention was abused for a field
> of type enum ObjectType.

I agree that "relkind" here is a misnomer, and I bet that what happened
here is that the original patch Gavin developed was using the relkind
enum from pg_class and was later changed to the OBJECT_ defines after
patch review, but the struct member name remained.  I don't object to
the proposed renaming.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
> On 2020-Jun-03, Andres Freund wrote:
> > I don't think we should prohibit this. For one, it'd probably break some
> > clients, without a meaningful need.
> 
> There *is* a need, namely to keep complexity down.  This is quite
> convoluted, it's got a lot of historical baggage because of the way it
> was implemented, and it's very difficult to understand.  The greatest
> motive I see is to make this easier to understand, so that it is easier
> to modify and improve in the future.

That seems like a possibly convincing argument for not introducing the
capability, but doesn't seem strong enough to remove it. Especially not
if it was just broken as part of effectively a refactoring, as far as I
understand?


> > But I think it's also actually quite useful to be able to access
> > catalogs before streaming data. You e.g. can look up configuration of
> > the primary before streaming WAL. With a second connection that's
> > actually harder to do reliably in some cases, because you need to be
> > sure that you actually reached the right server (consider a pooler,
> > automatic failover etc).
> 
> I don't think having a physical replication connection access catalog
> data directly is a great idea.  We already have gadgets like
> IDENTIFY_SYSTEM for physical replication that can do that, and if you
> need particular settings you can use SHOW (commit d1ecd539477).  If
> there was a strong need for even more than that, we can add something to
> the grammar.

Those special case things are a bad idea, and we shouldn't introduce
more. It's unrealistic that we can ever make that support everything,
and since we already have to support the database connected thing, I
don't see the point.

Greetings,

Andres Freund




Re: libpq copy error handling busted

2020-06-03 Thread Tom Lane
I wrote:
> * pqSendSome() is responsible not only for pushing out data, but for
> calling pqReadData in any situation where it can't get rid of the data
> promptly.  1f39a1c06 overlooked that requirement, and the upshot is
> that we don't necessarily notice that the connection is broken (it's
> pqReadData's job to detect that).  Putting a pqReadData call into
> the early-exit path helps, but doesn't fix things completely.

Ah, it's better if I put the pqReadData call into *both* the paths
where 1f39a1c06 made pqSendSome give up.  The attached patch seems
to fix the issue for the "pgbench -i" scenario, with either fast-
or immediate-mode server stop.  I tried it with and without SSL too,
just to see.  Still, it's not clear to me whether this might worsen
any of the situations we discussed in the lead-up to 1f39a1c06 [1].
Thomas, are you in a position to redo any of that testing?

> * The more longstanding problem is that the PQputCopyData code path
> doesn't have any mechanism for consuming an 'E' (error) message
> once pqReadData has collected it.

At least with pgbench's approach (die immediately on PQputline failure)
this isn't very relevant once we apply the attached.  Perhaps we should
revisit this behavior anyway, but I'd be afraid to back-patch a change
of that nature.

> * As for control-C not getting out of it: there is
>   if (CancelRequested)
>   break;
> in pgbench's loop, but this does nothing in this scenario because
> fe-utils/cancel.c only sets that flag when it successfully sends a
> Cancel ... which it certainly cannot if the postmaster is gone.

I'll send a patch for this later.

regards, tom lane

[1] 
https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com
diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c
index 9afa0533a6..9273984727 100644
--- a/src/interfaces/libpq/fe-misc.c
+++ b/src/interfaces/libpq/fe-misc.c
@@ -823,6 +823,10 @@ definitelyFailed:
  * Return 0 on success, -1 on failure and 1 when not all data could be sent
  * because the socket would block and the connection is non-blocking.
  *
+ * Note that this is also responsible for consuming data from the socket
+ * (putting it in conn->inBuffer) in any situation where we can't send
+ * all the specified data immediately.
+ *
  * Upon write failure, conn->write_failed is set and the error message is
  * saved in conn->write_err_msg, but we clear the output buffer and return
  * zero anyway; this is because callers should soldier on until it's possible
@@ -842,12 +846,20 @@ pqSendSome(PGconn *conn, int len)
 	 * on that connection.  Even if the kernel would let us, we've probably
 	 * lost message boundary sync with the server.  conn->write_failed
 	 * therefore persists until the connection is reset, and we just discard
-	 * all data presented to be written.
+	 * all data presented to be written.  However, as long as we still have a
+	 * valid socket, we should continue to absorb data from the backend, so
+	 * that we can collect any final error messages.
 	 */
 	if (conn->write_failed)
 	{
 		/* conn->write_err_msg should be set up already */
 		conn->outCount = 0;
+		/* Absorb input data if any, and detect socket closure */
+		if (conn->sock != PGINVALID_SOCKET)
+		{
+			if (pqReadData(conn) < 0)
+return -1;
+		}
 		return 0;
 	}
 
@@ -917,6 +929,13 @@ pqSendSome(PGconn *conn, int len)
 
 	/* Discard queued data; no chance it'll ever be sent */
 	conn->outCount = 0;
+
+	/* Absorb input data if any, and detect socket closure */
+	if (conn->sock != PGINVALID_SOCKET)
+	{
+		if (pqReadData(conn) < 0)
+			return -1;
+	}
 	return 0;
 			}
 		}


Re: elog(DEBUG2 in SpinLocked section.

2020-06-03 Thread Michael Paquier
On Wed, Jun 03, 2020 at 12:36:34AM -0400, Tom Lane wrote:
> Should we think about adding automated detection of this type of
> mistake?  I don't like the attached as-is because of the #include
> footprint expansion, but maybe we can find a better way.

I think that this one first boils down to the FRONTEND dependency in
those headers.  Or in short, spin.h may get loaded by the frontend but
we have a backend-only API, no?

> It's actually worse in the back branches, because elog() did not have
> a good short-circuit path like ereport() does.  +1 for back-patch.

Thanks, got that fixed down to 9.5.
--
Michael




Re: libpq copy error handling busted

2020-06-03 Thread Thomas Munro
On Thu, Jun 4, 2020 at 1:35 PM Tom Lane  wrote:
> I wrote:
> > * pqSendSome() is responsible not only for pushing out data, but for
> > calling pqReadData in any situation where it can't get rid of the data
> > promptly.  1f39a1c06 overlooked that requirement, and the upshot is
> > that we don't necessarily notice that the connection is broken (it's
> > pqReadData's job to detect that).  Putting a pqReadData call into
> > the early-exit path helps, but doesn't fix things completely.
>
> Ah, it's better if I put the pqReadData call into *both* the paths
> where 1f39a1c06 made pqSendSome give up.  The attached patch seems
> to fix the issue for the "pgbench -i" scenario, with either fast-
> or immediate-mode server stop.  I tried it with and without SSL too,
> just to see.  Still, it's not clear to me whether this might worsen
> any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> Thomas, are you in a position to redo any of that testing?

Yes, sure.  The testing consisted of running on a system with OpenSSL
1.1.1a (older versions didn't have the problem).  It originally showed
up on eelpout, a very underpowered build farm machine running Linux on
ARM64, but then later we worked out we could make it happen on a Mac
or any other Linux system if we had bad enough luck or if we added a
sleep in a particular spot.  We could do it with psql running in a
loop using a bad certificate from the testing setup stuff, as shown
here:

https://www.postgresql.org/message-id/CA%2BhUKGJafyTgpsYBgQGt1EX0O8UnL4VGHSc7J0KZyMH4_jPGBw%40mail.gmail.com

I don't have access to eelpout from where I am right now, but I'll try
that test now on the Debian 10 amd64 system I have here.  OpenSSL has
since moved on to 1.1.1d-0+deb10u3, but that should be fine; the
triggering change was the move to TLS 1.3, so let me see what happens if
I do that with your patch applied...




> > * The more longstanding problem is that the PQputCopyData code path
> > doesn't have any mechanism for consuming an 'E' (error) message
> > once pqReadData has collected it.
>
> At least with pgbench's approach (die immediately on PQputline failure)
> this isn't very relevant once we apply the attached.  Perhaps we should
> revisit this behavior anyway, but I'd be afraid to back-patch a change
> of that nature.
>
> > * As for control-C not getting out of it: there is
> >   if (CancelRequested)
> >   break;
> > in pgbench's loop, but this does nothing in this scenario because
> > fe-utils/cancel.c only sets that flag when it successfully sends a
> > Cancel ... which it certainly cannot if the postmaster is gone.
>
> I'll send a patch for this later.
>
> regards, tom lane
>
> [1] 
> https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com




Re: elog(DEBUG2 in SpinLocked section.

2020-06-03 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Jun 03, 2020 at 12:36:34AM -0400, Tom Lane wrote:
>> Should we think about adding automated detection of this type of
>> mistake?  I don't like the attached as-is because of the #include
>> footprint expansion, but maybe we can find a better way.

> I think that this one first boils down to the FRONTEND dependency in
> those headers.  Or in short, spin.h may get loaded by the frontend but
> we have a backend-only API, no?

I think the #include bloat comes from wanting to declare the global
state variable as "slock_t *".  We could give up on that and write
something like this in a central place like c.h:

#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
extern void *held_spinlock;
#define NotHoldingSpinLock() Assert(held_spinlock == NULL)
#else
#define NotHoldingSpinLock() ((void) 0)
#endif

Then throwing NotHoldingSpinLock() into relevant places costs
nothing new include-wise.

regards, tom lane




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Michael Paquier
On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote:
> On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
>> There *is* a need, namely to keep complexity down.  This is quite
>> convoluted, it's got a lot of historical baggage because of the way it
>> was implemented, and it's very difficult to understand.  The greatest
>> motive I see is to make this easier to understand, so that it is easier
>> to modify and improve in the future.
> 
> That seems like a possibly convincing argument for not introducing the
> capability, but doesn't seem strong enough to remove it. Especially not
> if it was just broken as part of effectively a refactoring, as far as I
> understand?

Are there any objections to fixing the issue first, then?  As far as I
can see there is no objection to this part, like here:
https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql

>> I don't think having a physical replication connection access catalog
>> data directly is a great idea.  We already have gadgets like
>> IDENTIFY_SYSTEM for physical replication that can do that, and if you
>> need particular settings you can use SHOW (commit d1ecd539477).  If
>> there was a strong need for even more than that, we can add something to
>> the grammar.
> 
> Those special case things are a bad idea, and we shouldn't introduce
> more. It's unrealistic that we can ever make that support everything,
> and since we already have to support the database connected thing, I
> don't see the point.

Let's continue discussing this part as well.
--
Michael




Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762

2020-06-03 Thread Alvaro Herrera
On 2020-Jun-03, Andres Freund wrote:

> On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote:
> > On 2020-Jun-03, Andres Freund wrote:
> > > I don't think we should prohibit this. For one, it'd probably break some
> > > clients, without a meaningful need.
> > 
> > There *is* a need, namely to keep complexity down.  This is quite
> > convoluted, it's got a lot of historical baggage because of the way it
> > was implemented, and it's very difficult to understand.  The greatest
> > motive I see is to make this easier to understand, so that it is easier
> > to modify and improve in the future.
> 
> That seems like a possibly convincing argument for not introducing the
> capability, but doesn't seem strong enough to remove it.

This "capability" has never been introduced.  The fact that it's there
is just an accident.  In fact, it's not a capability, since the feature
(physical replication) is invoked differently -- namely, using a
physical replication connection.  JDBC uses a logical replication
connection for it only because they never realized that they were
supposed to do it differently, because we failed to throw the correct
error message in the first place.

> > I don't think having a physical replication connection access catalog
> > data directly is a great idea.  We already have gadgets like
> > IDENTIFY_SYSTEM for physical replication that can do that, and if you
> > need particular settings you can use SHOW (commit d1ecd539477).  If
> > there was a strong need for even more than that, we can add something to
> > the grammar.
> 
> Those special case things are a bad idea, and we shouldn't introduce
> more.

What special case things?  The replication connection has never been
supposed to run SQL.  That's why we have SHOW in the replication
grammar.

> It's unrealistic that we can ever make that support everything,
> and since we already have to support the database connected thing, I
> don't see the point.

A logical replication connection is not supposed to be used for physical
replication.  That's just going to make more bugs appear.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX CONCURRENTLY and indisreplident

2020-06-03 Thread Michael Paquier
On Wed, Jun 03, 2020 at 12:40:38PM -0300, Euler Taveira wrote:
> On Wed, 3 Jun 2020 at 03:54, Michael Paquier  wrote:
>> I have bumped into $subject, causing a replica identity index to
>> be considered as dropped if running REINDEX CONCURRENTLY on it.  This
>> means that the old tuple information would get lost in this case, as
>> a REPLICA IDENTITY USING INDEX without a dropped index is the same as
>> NOTHING.
>
> LGTM. I tested in both versions (12, master) and it works accordingly.

Thanks for the review.  I'll try to get that fixed soon.

By the way, your previous email was showing up as part of my own email
with the indentation that was used, so I missed it at first.  That's the
case as well here:
https://www.postgresql.org/message-id/cah503wdaejzhp7+wa-hhs6c7nze69owqe5zf_tyfu1epawp...@mail.gmail.com
--
Michael




Re: Parallel copy

2020-06-03 Thread Amit Kapila
On Thu, Jun 4, 2020 at 12:09 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> > On Mon, May 18, 2020 at 12:48 AM Amit Kapila  
> > wrote:
> > > In the above case, even though we are executing a single command from
> > > the user perspective, but the currentCommandId will be four after the
> > > command.  One increment will be for the copy command and the other
> > > three increments are for locking tuple in PK table
> > > (tab_fk_referenced_chk) for three tuples in FK table
> > > (tab_fk_referencing_chk).  Now, for parallel workers, it is
> > > (theoretically) possible that the three tuples are processed by three
> > > different workers which don't get synced as of now.  The question was
> > > do we see any kind of problem with this and if so can we just sync it
> > > up at the end of parallelism.
>
> > I strongly disagree with the idea of "just sync(ing) it up at the end
> > of parallelism". That seems like a completely unprincipled approach to
> > the problem. Either the command counter increment is important or it's
> > not. If it's not important, maybe we can arrange to skip it in the
> > first place. If it is important, then it's probably not OK for each
> > backend to be doing it separately.
>
> That scares me too. These command counter increments definitely aren't
> unnecessary in the general case.
>

Yeah, this is what we want to understand.  Can you explain how they
are useful here?  AFAIU, heap_lock_tuple doesn't use the commandid when
storing the xact's transaction information while locking the tuple.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Leading minus for negative time interval in ISO 8601

2020-06-03 Thread Tom Lane
Mikhail Titov  writes:
> I'd like to propose a simple patch to allow for negative ISO 8601
> intervals with leading minus, e.g. -PT1H besides PT-1H. It seems that
> standard isn't quite clear on negative duration.

"Isn't quite clear"?  ISTM that if the standard intended to allow that,
it'd be pretty clear.  I looked through the 8601 spec just now, and
I can't see any indication whatever that they intend to allow "-" before P.
It's hard to see why they'd bother with that introducer at all if
data can appear before it.

> However, lots of
> software use leading minus and expect/generate intervals in such forms
> making those incompatible with current PostgreSQL decoding code.

Which "lots of software" are you speaking of, exactly?  interval_in
has never had such a capability, and I don't recall previous complaints
about it.

The difference between a useful standard and a useless one is the
extent to which people obey the standard rather than adding random
extensions to it, so I'm not inclined to add such an extension
without a very well-grounded argument for it.

regards, tom lane




Re: libpq copy error handling busted

2020-06-03 Thread Thomas Munro
On Thu, Jun 4, 2020 at 1:53 PM Thomas Munro  wrote:
> On Thu, Jun 4, 2020 at 1:35 PM Tom Lane  wrote:
> > Ah, it's better if I put the pqReadData call into *both* the paths
> > where 1f39a1c06 made pqSendSome give up.  The attached patch seems
> > to fix the issue for the "pgbench -i" scenario, with either fast-
> > or immediate-mode server stop.  I tried it with and without SSL too,
> > just to see.  Still, it's not clear to me whether this might worsen
> > any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> > Thomas, are you in a position to redo any of that testing?

It seems to behave correctly in that scenario.

Here's what I tested.  First, I put this into pgdata/postgresql.conf:

ssl=on
ssl_ca_file='root+client_ca.crt'
ssl_cert_file='server-cn-only.crt'
ssl_key_file='server-cn-only.key'
ssl_crl_file='root+client.crl'
ssl_min_protocol_version='TLSv1.2'
ssl_max_protocol_version='TLSv1.1'
ssl_min_protocol_version='TLSv1.2'
ssl_max_protocol_version=''

I copied the named files from src/test/ssl/ssl/ into pgdata, and I ran
chmod 600 on the .key file.

I put this into pgdata/pg_hba.conf at the top:

hostssl all all 127.0.0.1/32 cert clientcert=verify-full

I made a copy of src/test/ssl/ssl/client-revoked.key and ran chmod 600 on it.

Now on unpatched master I get:

$ psql "host=127.0.0.1 port=5432 dbname=postgres user=tmunro
sslcert=src/test/ssl/ssl/client-revoked.crt sslkey=client-revoked.key
sslmode=require"
psql: error: could not connect to server: SSL error: sslv3 alert
certificate revoked

It's the same if I add in this sleep in fe-connect.c:

+sleep(1);
/*
 * Send the startup packet.
 *

If I revert 1f39a1c0641531e0462a4822f2dba904c5d4d699 "Restructure
libpq's handling of send failures.", I get the error that eelpout
showed intermittently:

psql: error: could not connect to server: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
could not send startup packet: Connection reset by peer

I go back to master, and apply your patch.  I get the expected error:

psql: error: could not connect to server: SSL error: sslv3 alert
certificate revoked




Re: Parallel copy

2020-06-03 Thread Andres Freund
Hi,

On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> On Thu, Jun 4, 2020 at 12:09 AM Andres Freund  wrote:
> > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > of parallelism". That seems like a completely unprincipled approach to
> > > the problem. Either the command counter increment is important or it's
> > > not. If it's not important, maybe we can arrange to skip it in the
> > > first place. If it is important, then it's probably not OK for each
> > > backend to be doing it separately.
> >
> > That scares me too. These command counter increments definitely aren't
> > unnecessary in the general case.
> >
> 
> Yeah, this is what we want to understand?  Can you explain how they
> are useful here?  AFAIU, heap_lock_tuple doesn't use commandid while
> storing the transaction information of xact while locking the tuple.

But the HeapTupleSatisfiesUpdate() call does use it?

And even if that weren't an issue, I don't see how it's defensible to
just randomly break the commandid coherency for parallel copy.

Greetings,

Andres Freund




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-03 Thread Amit Kapila
On Wed, Jun 3, 2020 at 12:02 PM Masahiko Sawada
 wrote:
>
> On Wed, 3 Jun 2020 at 14:50, Amit Kapila  wrote:
> >
> >
> > If the intention is to keep the first version simple, then why do we
> > want to support any mode other than 'required'?  I think it will limit
> > its usage for the cases where 2PC can be used only when all FDWs
> > involved support Prepare API but if that helps to keep the design and
> > patch simpler then why not just do that for the first version and then
> > extend it later.  OTOH, if you think it will be really useful to keep
> > other modes, then also we could try to keep those in separate patches
> > to facilitate the review and discussion of the core feature.
>
> 'disabled' is the fundamental mode. We also need 'disabled' mode,
> otherwise existing FDW won't work.
>

IIUC, if foreign_twophase_commit is 'disabled', we don't use a
two-phase protocol to commit distributed transactions, right?  So, do
we check whether we need to use a two-phase protocol at Prepare time or
at Commit time?  I think this should be checked at prepare time.
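
To make that concrete, here is a rough usage sketch of the feature as I
understand the patch (the GUC, its modes, and pg_resolve_fdwxact come from the
patch under review, not from released PostgreSQL; the foreign tables are
hypothetical):

SET foreign_twophase_commit = 'required';

BEGIN;
UPDATE ft_on_server_a SET val = 'x' WHERE id = 1;  -- foreign table on server A
UPDATE ft_on_server_b SET val = 'y' WHERE id = 1;  -- foreign table on server B
COMMIT;  -- with 'required', all involved FDWs must support the prepare API

-- A foreign transaction left in-doubt would later be resolved manually:
-- SELECT pg_resolve_fdwxact(...);   -- arguments as documented in the patch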

+
+ This parameter can be changed at any time; the behavior for any one
+ transaction is determined by the setting in effect when it commits.
+

This is written w.r.t foreign_twophase_commit.  If one changes this
between prepare and commit, will it have any impact?

>  I was concerned that many FDW
> plugins don't implement FDW transaction APIs yet when users start
> using this feature. But it seems to be a good idea to move 'prefer'
> mode to a separate patch while leaving 'required'. I'll do that in the
> next version patch.
>

Okay, thanks.  Please, see if you can separate out the documentation
for that as well.

Few other comments on v21-0003-Documentation-update:

1.
+  
+  
+   Numeric transaction identifier with that this foreign transaction
+   associates
+  

/with that this/with which this

2.
+  
+   The OID of the foreign server on that the foreign transaction
is prepared
+  

/on that the/on which the

3.
+  status
+  text
+  
+  
+   Status of foreign transaction. Possible values are:
+   
+
+ 
+  initial : Initial status.
+ 

What exactly "Initial status" means?

4.
+  in_doubt
+  boolean
+  
+  
+   If true this foreign transaction is
in-doubt status and
+   needs to be resolved by calling pg_resolve_fdwxact
+   function.
+  

It would be better if you could add an additional sentence to say when
and/or how foreign transactions can reach the in-doubt state.

5.
If N local transactions each
+ across K foreign server this value need to be set

This part of the sentence can be improved by saying something like:
"If a user expects N local transactions and each of those involves K
foreign servers, this value..".

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [PATCH] Leading minus for negative time interval in ISO 8601

2020-06-03 Thread Chapman Flack
On 06/03/20 22:46, Tom Lane wrote:
> "Isn't quite clear"?  ISTM that if the standard intended to allow that,
> it'd be pretty clear.  I looked through the 8601 spec just now, and
> I can't see any indication whatever that they intend to allow "-" before P.

Umm, did you see any indication that they intend to allow "-" /anywhere/
in a time interval (with the exception of between year and month, month
and day in the alternate form, as simple delimiters, not as minus)?

(Maybe you did; I'm looking at a publicly-accessible 2016 draft.)

It looks like the whole idea of minusness has to be shoehorned into ISO 8601
by anyone who misses it, and that's been done different ways. I guess that's
the "isn't quite clear" part.

> Which "lots of software" are you speaking of, exactly?  interval_in
> has never had such a capability, and I don't recall previous complaints
> about it.

Java durations allow both the PostgreSQL-style minus on individual
components, and a leading minus that negates the whole thing. [1]
That explicitly says "The leading plus/minus sign, and negative values
for other units are not part of the ISO-8601 standard."

XML Schema (and therefore XML Query, which uses XML Schema data types)
allows only the leading minus. [2]

The XML Schema folks say their concept is "drawn from those of ISO 8601,
specifically durations without fixed endpoints." That's why they can get
away with just the single leading sign: they don't admit something like
P1M-1D which you don't know to call 27, 28, 29, or 30 days until you're
given an endpoint to hang it on.
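
For what it's worth, PostgreSQL's existing per-field signs can already express
such mixed-sign values, which is exactly what a single leading sign cannot do;
a quick illustration (output shown for the default IntervalStyle):

SELECT 'P1M-1D'::interval;   -- 1 mon -1 days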

I had to deal with that in [3].

Regards,
-Chap




[1]
https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html#parse-java.lang.CharSequence-

[2] https://www.w3.org/TR/xmlschema11-2/#nt-durationRep

[3]
https://github.com/tada/pljava/blob/master/pljava-examples/src/main/java/org/postgresql/pljava/example/saxon/S9.java#L329




Re: libpq copy error handling busted

2020-06-03 Thread Thomas Munro
On Thu, Jun 4, 2020 at 3:36 PM Thomas Munro  wrote:
> Here's what I tested.

In passing, I noticed that this:

$ psql ...
psql: error: could not connect to server: private key file
"src/test/ssl/ssl/client-revoked.key" has group or world access;
permissions should be u=rw (0600) or less

... leads to this nonsensical error message on the server:

2020-06-04 16:03:11.547 NZST [7765] LOG:  could not accept SSL
connection: Success




Re: Incorrect comment in be-secure-openssl.c

2020-06-03 Thread Michael Paquier
On Wed, Jun 03, 2020 at 02:40:54PM +0200, Daniel Gustafsson wrote:
> Sounds good, thanks!

Okay, done then.
--
Michael




Re: [PATCH] Leading minus for negative time interval in ISO 8601

2020-06-03 Thread Tom Lane
Chapman Flack  writes:
> On 06/03/20 22:46, Tom Lane wrote:
>> "Isn't quite clear"?  ISTM that if the standard intended to allow that,
>> it'd be pretty clear.  I looked through the 8601 spec just now, and
>> I can't see any indication whatever that they intend to allow "-" before P.

> Umm, did you see any indication that they intend to allow "-" /anywhere/
> in a time interval (with the exception of between year and month, month
> and day in the alternate form, as simple delimiters, not as minus)?
> (Maybe you did; I'm looking at a publicly-accessible 2016 draft.)

I don't have an "official" copy either; I was looking at this draft:
https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0038_iso_wd_8601-1_2016-02-16.pdf

I see this bit:

[±] represents a plus sign [+] if in combination with the following
element a positive value or zero needs to be represented (in this
case, unless explicitly stated otherwise, the plus sign shall not be
omitted), or a minus sign [−] if in combination with the following
element a negative value needs to be represented.

but I agree that there's no clear application of that to intervals,
either overall or per-field.

> Java durations allow both the PostgreSQL-style minus on individual
> components, and a leading minus that negates the whole thing. [1]
> That explicitly says "The leading plus/minus sign, and negative values
> for other units are not part of the ISO-8601 standard."
> XML Schema (and therefore XML Query, which uses XML Schema data types)
> allows only the leading minus. [2]

Hm.  The slippery slope I *don't* want to be drawn down is somebody
arguing that we should change interval_out, because that would open
a whole Pandora's box of compatibility issues.  Maybe we should just
take the position that negative intervals aren't standardized, and
if you want to transport them using ISO format then you first need
to lobby ISO to fix that.

regards, tom lane




Re: [PATCH] Leading minus for negative time interval in ISO 8601

2020-06-03 Thread Mikhail Titov
On Wed, Jun  3, 2020 at  9:46 PM, Tom Lane  wrote:
> ...
> ISTM that if the standard intended to allow that, it'd be pretty
> clear.  I looked through the 8601 spec just now, and I can't see any
> indication whatever that they intend to allow "-" before P.

To be fair, I do not have access to the 2019 edition that seems to
address negative durations, but from the wording at
https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0039_iso_wd_8601-2_2016-02-16.pdf
it seems to have been written without any notion of negative durations at
all, not even the PT-1D style that PostgreSQL supports. Also, that PDF
mentions the comma as the preferred decimal sign, e.g. PT1,5D, which
PostgreSQL does not accept. I understand, though, that the PDF explicitly
states it is not a standard.

> It's hard to see why they'd bother with that introducer at all if data
> can appear before it.

I'm not sure I follow. Do you mean to strictly require the time span to
start with P and nothing before it? If so, can we think of the leading
minus as syntactic sugar, i.e. a unary minus plus a normal, positive
duration of your liking that we just negate in place?

>> However, lots of software use leading minus and expect/generate
>> intervals in such forms making those incompatible with current
>> PostgreSQL decoding code.
>
> Which "lots of software" are you speaking of, exactly?  interval_in
> has never had such a capability, and I don't recall previous complaints
> about it.

I was not talking about PG-centric software in particular. I had some
JavaScript libraries, Ruby on Rails, Java, Rust, Go in mind. Here is the
related issue for Rust https://github.com/rust-lang/rust/issues/18181
and some Go library
https://pkg.go.dev/github.com/rickb777/date/period?tab=doc#Parse (besides
the links I gave in the patch) to show examples of accepting minus prefix.

I presume no one complained much previously because offset can be (and
often is) stored as float in, e.g., seconds, and then offset * '@1
second'::interval. That looks a bit verbose and I'd prefer to keep
offset as interval and do no extra casting.

Take a look at w3c specs that refer to ISO 8601 as well. I understand,
that is not what PG is after, but here is an excerpt:

,[ https://www.w3.org/TR/xmlschema-2/#duration ]
| One could also indicate a duration of minus 120 days as: -P120D.
| ...
| P-1347M is not allowed although -P1347M is allowed
`

Note that the second example explicitly contradicts currently allowed PG
syntax. I presume if the standard was clear, there would be no such
ambiguity.

I'm not trying to introduce drastic changes, just to make PostgreSQL
somewhat more friendly about what it can accept directly, without
dancing around.
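
Concretely, the change is only about what interval_in will accept; a small
sketch of the difference (the second statement is what the patch would newly
allow):

SELECT 'PT-1H'::interval;     -- accepted today, per-field sign: -01:00:00
-- SELECT '-PT1H'::interval;  -- rejected today ("invalid input syntax"),
                              -- accepted with the proposed patch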

--
Mikhail




Re: [PATCH] Leading minus for negative time interval in ISO 8601

2020-06-03 Thread Mikhail Titov
> ...
>> Umm, did you see any indication that they intend to allow "-" /anywhere/
>> in a time interval (with the exception of between year and month, month
>> and day in the alternate form, as simple delimiters, not as minus)?
>> (Maybe you did; I'm looking at a publicly-accessible 2016 draft.)
>
> I don't have an "official" copy either; I was looking at this draft:
> https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0038_iso_wd_8601-1_2016-02-16.pdf

heh, no one has an up to date standard :-) Also that is the link I meant
to include in my first reply. From what I see at
https://www.iso.org/obp/ui/#iso:std:iso:8601:-2:ed-1:v1:en they (ISO) did
address negative values for components and also there is "3.1.1.7
negative duration" that would be nice to read somehow.

> I see this bit:
>
> [±] represents a plus sign [+] if in combination with the following
> element a positive value or zero needs to be represented (in this
> case, unless explicitly stated otherwise, the plus sign shall not be
> omitted), or a minus sign [−] if in combination with the following
> element a negative value needs to be represented.

But [±] is not used anywhere near the duration specification whatsoever.

> Hm.  The slippery slope I *don't* want to be drawn down is somebody
> arguing that we should change interval_out, because that would open
> a whole Pandora's box of compatibility issues.  Maybe we should just
> take the position that negative intervals aren't standardized, and
> if you want to transport them using ISO format then you first need
> to lobby ISO to fix that.

I explicitly do NOT want to change anything on the way out. First, that
is how things are and we do not want to break anything. And, second, in
many cases client software can read either format. That is why I thought
it would be a trivial change. No output changes.

-- 
Mikhail




Re: Parallel copy

2020-06-03 Thread Amit Kapila
On Thu, Jun 4, 2020 at 9:10 AM Andres Freund  wrote:
>
> Hi,
>
> On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> > On Thu, Jun 4, 2020 at 12:09 AM Andres Freund  wrote:
> > > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > > of parallelism". That seems like a completely unprincipled approach to
> > > > the problem. Either the command counter increment is important or it's
> > > > not. If it's not important, maybe we can arrange to skip it in the
> > > > first place. If it is important, then it's probably not OK for each
> > > > backend to be doing it separately.
> > >
> > > That scares me too. These command counter increments definitely aren't
> > > unnecessary in the general case.
> > >
> >
> > Yeah, this is what we want to understand?  Can you explain how they
> > are useful here?  AFAIU, heap_lock_tuple doesn't use commandid while
> > storing the transaction information of xact while locking the tuple.
>
> But the HeapTupleSatisfiesUpdate() call does use it?
>

It won't use 'cid' for the lockers or multi-lockers case (AFAICS, there
is special-case handling for lockers/multi-lockers).  I think it is
used for updates/deletes.

> And even if that weren't an issue, I don't see how it's defensible to
> just randomly break the commandid coherency for parallel copy.
>

At this stage, we are evaluating whether there is any need to
increment the command counter for foreign key checks, or whether it just
happens because we are using some common code to execute the
"Select ... For Key Share" statement during these checks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Expand the use of check_canonical_path() for more GUCs

2020-06-03 Thread Michael Paquier
On Wed, Jun 03, 2020 at 02:45:50PM -0400, Tom Lane wrote:
> In the abstract, I agree with Peter's point that we shouldn't alter
> user-given strings without need.  However, I think there's strong
> reason for canonicalizing the data directory and config file locations.
> We access those both before and after chdir'ing into the datadir, so
> we'd better have absolute paths to them --- and at least for the
> datadir, it's documented that you can initially give it as a path
> relative to wherever you started the postmaster from.  If the other
> files are only accessed after the chdir happens then we could likely
> do without canonicalizing them.  But ... do we know which directory
> the user (thought he) specified them with reference to?  Forced
> canonicalization does have the advantage that it's clear to all
> onlookers how we are interpreting the paths.

Even with the last point...  It looks like there is little love for
this patch.  So it seems to me that this brings the discussion down to
two points: shouldn't we document why canonicalization is not done for
data_directory, config_file, hba_file and ident_file with some
comments in guc.c?  Then, why do we apply it to external_pid_file,
Log_directory and stats_temp_directory knowing that we chdir to PGDATA
in the postmaster before they get used (as far as I can see)?
--
Michael




Re: Asynchronous Append on postgres_fdw nodes.

2020-06-03 Thread Kyotaro Horiguchi
Hello, Andrey.

At Wed, 3 Jun 2020 15:00:06 +0500, Andrey Lepikhov  
wrote in 
> This patch no longer applies cleanly.
> In addition, code comments contain spelling errors.

Sure. Thanks for noticing them, and sorry for the many typos.
An additional item in WaitEventIPC conflicted with this.


I found the following typos.

connection.c:
  s/Rerturns/Returns/
postgres-fdw.c:
  s/Retrive/Retrieve/
  s/ForeginScanState/ForeignScanState/
  s/manipuration/manipulation/
  s/asyncstate/async state/
  s/alrady/already/

nodeAppend.c:  
  s/Rery/Retry/

createplan.c:
  s/chidlren/children/

resowner.c:
  s/identier/identifier/ X 2

execnodes.h:
  s/sutff/stuff/

plannodes.h:
  s/asyncronous/asynchronous/
  
  
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From db231fa99da5954b52e195f6af800c0f9b991ed4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH v3 1/3] Allow wait event set to be registered to resource
 owner

WaitEventSet needs to be released using a resource owner for a certain
case. This change adds a resowner field to WaitEventSet and allows the
creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 -
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 67 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 +++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7717bb2719..16aefb03ee 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -218,7 +218,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 05df5017c4..a8b52cd381 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -56,6 +56,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -84,6 +85,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -560,12 +563,15 @@ ResetLatch(Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 37b6a4eecd..fcc92138fe 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	{
 		WaitEventSet *new_event_set;
 
-		new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+		new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
 		AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 		  MyLatch, NULL);
 		AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner

Re: what can go in root.crt ?

2020-06-03 Thread Laurenz Albe
On Wed, 2020-06-03 at 19:57 -0400, Chapman Flack wrote:
> Ok, so a person in the situation described here, who is not in a position
> to demand changes in an organizational policy (whether or not it seems
> ill-conceived to you or even to him/her), is facing this question:
> 
> What are the "safest" things I /can/ do, under the existing constraints,
> and /which of those will work in PostgreSQL/?

I feel bad about bending the basic idea of certificates and trust to suit
some misbegotten bureaucratic constraints on good security.

If you are working for a company that has a bad idea of security
and cannot be dissuaded from it, you point that out loudly and then
keep going.  Trying to subvert the principles of an architecture
very often leads to pain in my experience.

Yours,
Laurenz Albe





Re: libpq copy error handling busted

2020-06-03 Thread Oleksandr Shulgin
On Thu, Jun 4, 2020 at 5:37 AM Thomas Munro  wrote:

> On Thu, Jun 4, 2020 at 1:53 PM Thomas Munro 
> wrote:
> > On Thu, Jun 4, 2020 at 1:35 PM Tom Lane  wrote:
> > > Ah, it's better if I put the pqReadData call into *both* the paths
> > > where 1f39a1c06 made pqSendSome give up.  The attached patch seems
> > > to fix the issue for the "pgbench -i" scenario, with either fast-
> > > or immediate-mode server stop.  I tried it with and without SSL too,
> > > just to see.  Still, it's not clear to me whether this might worsen
> > > any of the situations we discussed in the lead-up to 1f39a1c06 [1].
> > > Thomas, are you in a position to redo any of that testing?
>
> It seems to behave correctly in that scenario.
>
> Here's what I tested.  First, I put this into pgdata/postgresql.conf:
>
> ssl=on
> ssl_ca_file='root+client_ca.crt'
> ssl_cert_file='server-cn-only.crt'
> ssl_key_file='server-cn-only.key'
> ssl_crl_file='root+client.crl'
> ssl_min_protocol_version='TLSv1.2'
> ssl_max_protocol_version='TLSv1.1'
> ssl_min_protocol_version='TLSv1.2'
> ssl_max_protocol_version=''
>
> I copied the named files from src/test/ssl/ssl/ into pgdata, and I ran
> chmod 600 on the .key file.
>
> I put this into pgdata/pg_hba.conf at the top:
>
> hostssl all all 127.0.0.1/32 cert clientcert=verify-full
>
> I made a copy of src/test/ssl/ssl/client-revoked.key and ran chmod 600 on
> it.
>

Would it be feasible to capture this in a sort of regression (TAP?) test?

--
Alex


Re: Getting ERROR with FOR UPDATE/SHARE for partitioned table.

2020-06-03 Thread Amit Langote
On Thu, May 28, 2020 at 11:08 PM Ashutosh Bapat
 wrote:
> On Wed, May 27, 2020 at 6:51 PM Amit Langote  wrote:
> > So in Rajkumar's example, the cursor is declared as:
> >
> > CURSOR IS SELECT * FROM tbl WHERE c1< 5 FOR UPDATE;
> >
> > and the WHERE CURRENT OF query is this:
> >
> >  UPDATE tbl SET c2='aa' WHERE CURRENT OF cur;
>
> Thanks for the clarification. So it looks like we expand an UPDATE on a
> partitioned table into an UPDATE on each partition (inheritance_planner
> for DML) and then execute each of those. If CURRENT OF were to save the
> table OID or something, we could run the UPDATE only on that partition.

Are you saying that the planner should take into account the state of
the cursor specified in WHERE CURRENT OF to determine which of the
tables to scan for the UPDATE?  Note that neither partition pruning
nor constraint exclusion know that CurrentOfExpr can possibly allow to
exclude children of the UPDATE target.

> I am possibly shooting in the dark, but this puzzles me. And it looks like
> we could cause the wrong rows to be updated in non-partition inheritance
> where the ctids match?

I don't think that hazard exists, because the table OID is matched
before the TID.  Consider this example:

drop table if exists p cascade;
create table p (a int);
create table c (check (a = 2)) inherits (p);
insert into p values (1);
insert into c values (2);
begin;
declare c cursor for select * from p;
fetch c;
update p set a = a where current of c;
 QUERY PLAN

 Update on p  (cost=0.00..8.02 rows=2 width=10)
   Update on p
   Update on c p_1
   ->  Tid Scan on p  (cost=0.00..4.01 rows=1 width=10)
 TID Cond: CURRENT OF c
   ->  Tid Scan on c p_1  (cost=0.00..4.01 rows=1 width=10)
 TID Cond: CURRENT OF c
(7 rows)

Whenever the TID scan evaluates the CURRENT OF qual, it passes the
table being scanned to execCurrentOf().  execCurrentOf() then fetches
the ExecRowMark or the ScanState for *that* table from the cursor's
("c") PlanState via its portal.  Only if it confirms that such a
ExecRowMark or a ScanState exists and is valid/active that it returns
the TID found therein as the cursor's current TID.

-- 
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: Internal key management system

2020-06-03 Thread Fabien COELHO


Hello Masahiko-san,


This key manager is aimed to manage cryptographic keys used for
transparent data encryption. As a result of the discussion, we
concluded it's safer to use multiple keys to encrypt database data
rather than using one key to encrypt the whole thing, for example, in
order to make sure different data is not encrypted with the same key
and IV. Therefore, in terms of TDE, the minimum requirement is that
PostgreSQL can use multiple keys.

Using multiple keys in PG, there are roughly two possible designs:

1. Store all keys for TDE into the external KMS, and PG gets them from
it as needed.


+1


2. PG manages all keys for TDE internally and protects these keys on disk
with a key (i.e. a KEK) stored in the external KMS.


-1, this is the one where you would need arguing.


There are pros and cons to each design. To take one con of #1 as an
example, the interaction between PG and the external KMS could be
complex: the operations could include creating, removing and rotating
keys, and so on.


ISTM that only create (delete?) are really needed. Rotating is the problem 
of the KMS itself, thus does not need to be managed by pg under #1.



We can implement these operations in an extension to interact
with different kinds of external KMS, and perhaps we can use KMIP.


I would even put that (KMIP protocol stuff) outside pg core.

Even under #2, if some KMS is implemented and managed by pg, I would put 
the stuff in a separate process which I would probably run with a 
different uid, so that the KEK is not accessible directly by pg, ever.


Once KMS interactions are managed with an outside process, then what this 
process does becomes an interface, and whether this process actually 
manages the keys or discuss with some external KMS with some KMIP or 
whatever is irrelevant to pg. Providing an interface means that anyone 
could implement their KMS fitting their requirements if they comply with 
the interface/protocol.


Note that I'd be fine with having the current implementation somehow 
wrapped up as an example KMS.


But the development cost could become high because we might need a
different extension for each key management solution/service.


Yes and no. What I suggest is, I think, pretty simple, and I think I can
implement it in a few lines of script, so the cost is not high, and having
a separate process looks, to me, like a security win and an extensibility
win (i.e. another implementation can be provided).



#2 is better at that point; the interaction between PG and KMS is only
GET.


I think that it could be the same with #1. I think that having a separate 
process is a reasonable security requirement, and if you do that #1 and #2 
are more or less the same.



Other databases that employ a similar approach are SQL Server and DB2.


Too bad for them:-) I'd still disagree with having the master key inside 
the database process, even if Microsoft, IBM and Oracle think it is a good 
idea.



In terms of the necessity of introducing the key manager into PG core,
I think at least TDE needs to be implemented in PG core. And as this
key manager is for managing keys for TDE, I think the key manager also
needs to be introduced into the core so that TDE functionality doesn't
depend on external modules.


Hmmm.

My point is that only interactions should be in core.

The implementation could be in core, but as a separate process.

I agree that pg needs to be able to manage the DEK, so it needs to store 
data keys.


I still do not understand why an extension, possibly distributed with pg, 
would not be ok. There may be good arguments for that, but I do not think 
you provided any yet.



Also, I'm not fully at ease with some of the underlying principles
behind this proposal. Are we re-inventing/re-implementing kerberos or
whatever? Are we re-implementing a brand new KMS inside pg? Why having
our own?


As I explained above, this key manager is for managing internal keys
used by TDE. It's not an alternative to existing key management
solutions/services.


Hmmm. This seems to suggest that interacting with something outside should 
be an option.



The requirements of this key manager are generating internal keys,
letting other PG components use them, protecting them by KEK when
persisting,


If you want that, I'd still argue that you should have a separate process.


and support KEK rotation. It doesn't have a feature like
allowing users to store arbitrary keys into this key manager, like
other key management solutions/services have.


Hmmm.


I agree that the key used to encrypt data must not be placed in the
same host. But it's true only when the key is not protected, right?


The DEK is needed when encrypting and decrypting, obviously, so it would
be there once obtained; that cannot be helped. My concern is about the KEK,
which AFAICS in your code is somewhere in memory accessible by the
postgres process, which is a no-go for me.


The definition of "protected" is fuzzy, it would depend on what the user 

Re: Read access for pg_monitor to pg_replication_origin_status view

2020-06-03 Thread Kyotaro Horiguchi
At Tue, 2 Jun 2020 13:13:18 -0300, Martín Marqués  wrote in 
wrote in 
> > > I have no problem adding it to this ROLE, but we'd have to amend the
> > > doc for default-roles to reflect that SELECT for this view is also
> > > granted to `pg_read_all_stats`.
> >
> > I agree in general that pg_monitor shouldn't have privileges granted
> > directly to it.  If this needs a new default role, that's an option, but
> > it seems like it'd make sense to be part of pg_read_all_stats to me, so
> > amending the docs looks reasonable from here.
> 
> Good, that's more or less what I had in mind.
> 
> Here goes v2 of the patch, now there are 4 files (I could have
> squashed the docs with the code changes, but hey, that'll be easy to
> merge if needed :-) )
> 
> I did some fiddling to Michaels doc proposal, but it says basically the same.
> 
> Not 100% happy with the change to user-manag.sgml, but ok enough to send.
> 
> I also added an entry to the commitfest so we can track this there as well.

0001:

Looks good to me. REVOKE is performed on all functions that called
replorigin_check_prerequisites.

0002:

It forgets to grant pg_read_all_stats the right to execute
pg_show_replication_origin_status.  As a result, pg_read_all_stats
gets an error when executing the function, not when selecting from the view.

Even if we also granted execution of the function to that specific
role, anyone who wants to grant a user access to the view would also
need to grant execution of the function.
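
In other words, with the patch as it stands both grants would be needed for a
monitoring role (a sketch of the point above, using the names from this
thread):

GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats;
GRANT EXECUTE ON FUNCTION pg_show_replication_origin_status()
  TO pg_read_all_stats;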

To avoid such an inconvenience, as I mentioned upthread, the view and
the function should be granted to public, and the function should just
mute the output of all the rows, or of some columns in each row. That can
be done the same way as for pg_stat_get_activity(), as Stephen said.


0003:

It seems to be modeled on adminpack's documentation, but superusers are
not the only relevant roles in PostgreSQL.  Something like the
description in 27.2.2 Viewing Statistics looks more suitable.

> Superusers and members of the built-in role pg_read_all_stats (see
> also Section 21.5) can see all the information about all sessions.

Section 21.5 is already saying as follows.

> pg_monitor
>   Read/execute various monitoring views and functions. This role is a
>   member of pg_read_all_settings, pg_read_all_stats and
>   pg_stat_scan_tables.


0004:

Looks fine by me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center