Re: Strange decreasing value of pg_last_wal_receive_lsn()
I got it: it should be LSN + MAXALIGN(xlogrecord length). Thanks a lot. > On 2 Jun 2020, at 19:11, Jehan-Guillaume de Rorthais wrote: > > Nope, just sum the xlogrecord length.
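To make that conclusion concrete, here is a minimal C sketch (the helper name EndOfRecord is hypothetical, not from the thread); it ignores the page/segment-header adjustments needed when a record ends near a page boundary.

#include "postgres.h"
#include "access/xlogrecord.h"

/*
 * Given the start LSN of a WAL record and its header, return the position
 * just past it: records are stored MAXALIGN'ed, so (within a page) the next
 * record begins at start + MAXALIGN(xl_tot_len).
 */
static XLogRecPtr
EndOfRecord(XLogRecPtr start, const XLogRecord *record)
{
	return start + MAXALIGN(record->xl_tot_len);
}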
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Tue, Jun 2, 2020 at 7:53 PM Dilip Kumar wrote: > > On Tue, Jun 2, 2020 at 4:56 PM Amit Kapila wrote: > > > > On Tue, Jun 2, 2020 at 3:59 PM Dilip Kumar wrote: > > > > > > I think for our use case BufFileCreateShared is more suitable. I think > > > we need to do some modifications so that we can use these APIs without > > > SharedFileSet. Otherwise, we unnecessarily need to create a SharedFileSet > > > for each transaction and also need to maintain it in an xid > > > array or xid hash until transaction commit/abort. So I suggest the > > > following modifications in the shared file set so that we can > > > conveniently use it. > > > 1. ChooseTablespace(const SharedFileSet *fileset, const char *name) > > > if fileset is NULL then select the DEFAULTTABLESPACE_OID > > > 2. SharedFileSetPath(char *path, SharedFileSet *fileset, Oid tablespace) > > > If fileset is NULL then in the directory path we can use MyProcPid or > > > something instead of fileset->creator_pid. > > > > > > > Hmm, I find these modifications a bit ad-hoc. So, not sure if it is > > better than having the patch maintain the sharedfileset information. > > I think we might do something better here, maybe by supplying a function > pointer or so, but maintaining a sharedfileset which contains a different > tablespace/mutex which we don't need at all for our purpose also > doesn't sound very appealing. > I think we can say something similar for the Relation (rel cache entry as well) maintained in LogicalRepRelMapEntry. I think we only need a pointer to that information. > Let me see if I can come up with > some clean way of avoiding the need for a shared fileset; if not, maybe we > can go with the shared fileset idea. > Fair enough. .. > > > > > > But, even if nsubxacts becomes 0 we want to write the file so that we > > > can overwrite the previous info. > > > > > > > Can't we just remove the file for such a case? > > But, as of now, we expect that if it is not a first-time stream start then > the file exists. > Isn't it primarily because we do subxact_info_write at stream stop, which will create such a file irrespective of whether we have any subxacts? If so, isn't that an unnecessary write? > Actually, currently, it's simply that if it is > not the first segment we always expect that the file must exist, > otherwise it's an error. > I think we can check: if the file doesn't exist then we can initialize nsubxacts as 0. > Now if it is not the first segment then we will > need to handle multiple cases. > > a) subxact_info_read needs to handle the error case, because the file > may not exist, either because there was no subxact in the last stream or because it was > deleted when nsubxacts became 0. > b) subxact_info_write will have multiple cases: if nsubxacts > was already 0 then we can avoid writing the file, but if it becomes 0 > now we need to remove the file. > > Let me think more on that. > I feel we should be able to deal with these cases, but if you find any difficulty then let us discuss. I understand there is some ease if we always have a subxacts file, but OTOH it sounds quite awkward that we need so many file operations to detect whether the transaction has any subtransactions. > > > > Here, we have performed a rollback to savepoint s1, which doesn't have > > any change of its own. I think this would have been handled, but I just > > wanted to confirm. > > But internally, that will send the abort for s2 first, and for that > we will find the xid and truncate; later we will send the abort for s1, but > that one we will not find and we will do nothing? Anyway, I will test it and let > you know. 
> It would be good if we can test and confirm this behavior once. If it is not very inconvenient then we can even try to include a test for the same in the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Asynchronous Append on postgres_fdw nodes.
On 3/30/20 1:15 PM, Etsuro Fujita wrote: Hi, On Wed, Mar 11, 2020 at 10:47 AM movead li wrote: I redid make installcheck-world as Kyotaro Horiguchi pointed out, and the result shows nothing wrong. I think the patch is good in features and performance; here is the test result thread I made before: https://www.postgresql.org/message-id/CA%2B9bhCK7chd0qx%2Bmny%2BU9xaOs2FDNJ7RaxG4%3D9rpgT6oAKBgWA%40mail.gmail.com The new status of this patch is: Ready for Committer As discussed upthread, this is material for PG14, so I moved this to the next commitfest, keeping the same status. I've not looked at the patch in any detail yet, so I'm not sure that that is the right status for the patch, though. I'd like to work on this for PG14 if I have time. Hi, This patch no longer applies cleanly. In addition, code comments contain spelling errors. -- Andrey Lepikhov Postgres Professional https://postgrespro.com The Russian Postgres Company
Re: Should we remove a fallback promotion? take 2
On 2020/06/03 12:06, Kyotaro Horiguchi wrote: At Wed, 3 Jun 2020 09:43:17 +0900, Fujii Masao wrote in I will change the status back to Needs Review. Thanks for the review! record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false); if (record != NULL) { - fast_promoted = true; + promoted = true; Even if we missed the last checkpoint record, we don't give up promotion and continue with fall-back promotion, but the variable "promoted" stays false. That is confusing. How about changing it to fallback_promotion, or some more behavior-specific name like immediate_checkpoint_needed? I like doEndOfRecoveryCkpt or something, but I have no strong opinion about that flag naming. So I'm OK with immediate_checkpoint_needed if others like that too. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Internal key management system
On Wed, 3 Jun 2020 at 16:16, Fabien COELHO wrote: > > > Hello Masahiko-san, > > > This key manager is aimed at managing cryptographic keys used for > > transparent data encryption. As a result of the discussion, we > > concluded it's safer to use multiple keys to encrypt database data > > rather than using one key to encrypt the whole thing, for example, in > > order to make sure different data is not encrypted with the same key > > and IV. Therefore, in terms of TDE, the minimum requirement is that > > PostgreSQL can use multiple keys. > > > > Using multiple keys in PG, there are roughly two possible designs: > > > > 1. Store all keys for TDE in the external KMS, and PG gets them from > > it as needed. > > +1 In this approach, encryption keys obtained from the external KMS are directly used to encrypt/decrypt data. What KEK and DEK are you referring to in this approach? > > > 2. PG manages all keys for TDE inside and protects these keys on disk > > with the key (i.e. KEK) stored in the external KMS. > > -1, this is the one where you would need arguing. > > > There are pros and cons to each design. If I take one downside of #1 as an > > example, the operation between PG and the external KMS could be > > complex. The operations could be creating, removing and rotating keys, and > > so on. > > ISTM that only create (delete?) are really needed. Rotating is the problem > of the KMS itself, thus does not need to be managed by pg under #1. With your idea, how is key rotation going to be performed? After invoking key rotation on the external KMS, do we need to re-encrypt all data encrypted with the old keys? Or do you assume that the external KMS employs something like a 2-tier key hierarchy? > > > We can implement these operations in an extension to interact > > with different kinds of external KMS, and perhaps we can use KMIP. > > I would even put that (KMIP protocol stuff) outside pg core. > > Even under #2, if some KMS is implemented and managed by pg, I would put > the stuff in a separate process which I would probably run with a > different uid, so that the KEK is not accessible directly by pg, ever. > > Once KMS interactions are managed with an outside process, then what this > process does becomes an interface, and whether this process actually > manages the keys or talks to some external KMS with some KMIP or > whatever is irrelevant to pg. Providing an interface means that anyone > could implement their KMS fitting their requirements if they comply with > the interface/protocol. Just to be clear, we keep the KEK neither in shared memory nor on disk. The postmaster and a backend that executes pg_rotate_cluster_passphrase() get the KEK and use it to (re-)encrypt internal keys, but after that they immediately free it. The encryption keys we need to store inside PostgreSQL are DEKs. > > Note that I'd be fine with having the current implementation somehow > wrapped up as an example KMS. > > > But the development cost could become high because we might need > > different extensions for each key management solution/service. > > Yes and no. What I suggest is, I think, pretty simple, and I think I can > implement it in a few lines of script, so the cost is not high, and having > a separate process looks, to me, like a security win and an extensibility > win (i.e. another implementation can be provided). How can we get multiple keys from the external KMS? 
I think we will need to save something like identifiers for each encryption key Postgres needs in the core and ask the external KMS for the key by the identifier via an extension. Is that right? > > > #2 is better at that point; the interaction between PG and KMS is only > > GET. > > I think that it could be the same with #1. I think that having a separate > process is a reasonable security requirement, and if you do that #1 and #2 > are more or less the same. > > > Other databases that employ a similar approach are SQL Server and DB2. > > Too bad for them:-) I'd still disagree with having the master key inside > the database process, even if Microsoft, IBM and Oracle think it is a good > idea. > > > In terms of the necessity of introducing the key manager into PG core, > > I think at least TDE needs to be implemented in PG core. And as this > > key manager is for managing keys for TDE, I think the key manager also > > needs to be introduced into the core so that TDE functionality doesn't > > depend on external modules. > > Hmmm. > > My point is that only interactions should be in core. > > The implementation could be in core, but as a separate process. > > I agree that pg needs to be able to manage the DEK, so it needs to store > data keys. > > I still do not understand why an extension, possibly distributed with pg, > would not be ok. There may be good arguments for that, but I do not think > you provided any yet. Hmm, I think I don't fully understand your idea yet. With the current patch, the KEK could be obtained by either the postmaster or a backend process that executes pg_rotate_clu
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, May 29, 2020 at 8:31 PM Dilip Kumar wrote: > The fixes in the latest patchset are correct. Few minor comments: v26-0005-Implement-streaming-mode-in-ReorderBuffer + /* + * Mark toplevel transaction as having catalog changes too if one of its + * children has so that the ReorderBufferBuildTupleCidHash can conveniently + * check just toplevel transaction and decide whethe we need to build the + * hash table or not. In non-streaming mode we mark the toplevel + * transaction in DecodeCommit as we only stream on commit. Typo, /whethe/whether missing comma, /In non-streaming mode we/In non-streaming mode, we v26-0008-Add-support-for-streaming-to-built-in-replicatio + /* + * This memory context used for per stream data when streaming mode is + * enabled. This context is reeset on each stream stop. + */ Can we slightly modify the above comment as "This is used in the streaming mode for the changes between the start and stop stream messages. We reset this context on the stream stop message."? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Wed, 3 Jun 2020 at 01:19, Michael Paquier wrote: > On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote: > > On 2020/06/02 13:24, Michael Paquier wrote: > >> Still unconvinced as this restriction stands for logical decoding > >> requiring a database connection but it is not necessarily true now as > >> physical replication has less restrictions than a logical one. > > > > Could you tell me what the benefit for supporting physical replication on > > logical rep connection is? If it's only for "undocumented" > > backward-compatibility, IMO it's better to reject such "tricky" set up. > > But if there are some use cases for that, I'm ok to support that. > > Well, I don't really think that we can just break a behavior that > exists since 9.4 as you could break applications relying on the > existing behavior, and that's also the point of Vladimir upthread. > I don't see this as a valid reason to keep doing something. If it is broken then fix it. Clients can deal with the change. Dave Cramer https://www.postgres.rocks
Re: what can go in root.crt ?
On Tue, 2 Jun 2020 at 20:14, Bruce Momjian wrote: > The server certificate should be issued by a certificate authority root > outside of your organization only if you want people outside of your > organization to trust your server certificate, but you are then asking > for the client to only trust an intermediate inside your organization. > The big question is why bother having the server certificate chain to a > root certificate you don't trust when you have no intention of having > clients outside of your organization trust the server certificate. > Postgres could be made to handle such cases, but is it really a valid > configuration we should support? > I think the "why" the org cert is not root was already made clear, that is the company policy. I don't think postgres should take a stance on whether the certificate designated as the root of trust is self-signed or claims to get its power from somewhere else. It's pretty easy to conceive of certificate management procedures that make use of this chain to implement certificate replacement securely. For example one might trust the global issuer to verify that a CSR is coming from the O= value that it's claiming to come from, to automate replacement of intermediate certificates, but not trust that every other sub-CA signed by the root and their sub-sub-CAs are completely honest and secure. Regards, Ants Aasma
how to create index concurrently on paritioned table
Hi hackers, Partitioning is necessary for very large tables. However, I found that postgresql does not support create index concurrently on partitioned tables. The documentation shows that we need to create an index on each partition individually and then finally create the partitioned index non-concurrently. This is undoubtedly a complex operation for a DBA, especially when there are many partitions. Therefore, I wonder why pg does not support concurrent index creation on partitioned tables. What are the difficulties of this feature? If I want to implement it, what should I pay attention to? I sincerely look forward to your reply. Regards & Thanks Adger
Re: Incorrect comment in be-secure-openssl.c
On Mon, Jun 01, 2020 at 10:39:45AM +0200, Daniel Gustafsson wrote: > I don't have a problem with the existing wording of the first sentence, and I > don't have a problem with your suggestion either (as long as it's parameters > in > plural). Thanks, that's why I kept the word plural in my own suggestion. I was just reading through the whole set again, and still kind of prefer the last flavor, so I think that I'll just fix it this way tomorrow and call it a day. -- Michael signature.asc Description: PGP signature
Re: Incorrect comment in be-secure-openssl.c
> On 3 Jun 2020, at 14:38, Michael Paquier wrote: > > On Mon, Jun 01, 2020 at 10:39:45AM +0200, Daniel Gustafsson wrote: >> I don't have a problem with the existing wording of the first sentence, and I >> don't have a problem with your suggestion either (as long as it's parameters >> in >> plural). > > Thanks, that's why I kept the word plural in my own suggestion. I was > just reading through the whole set again, and still kind of prefer the > last flavor, so I think that I'll just fix it this way tomorrow and > call it a day. Sounds good, thanks! cheers ./daniel
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020/06/03 20:33, Dave Cramer wrote: On Wed, 3 Jun 2020 at 01:19, Michael Paquier wrote: On Tue, Jun 02, 2020 at 02:23:50PM +0900, Fujii Masao wrote: > On 2020/06/02 13:24, Michael Paquier wrote: >> Still unconvinced as this restriction stands for logical decoding >> requiring a database connection but it is not necessarily true now as >> physical replication has less restrictions than a logical one. > > Could you tell me what the benefit for supporting physical replication on > logical rep connection is? If it's only for "undocumented" > backward-compatibility, IMO it's better to reject such "tricky" set up. > But if there are some use cases for that, I'm ok to support that. Well, I don't really think that we can just break a behavior that exists since 9.4 as you could break applications relying on the existing behavior, and that's also the point of Vladimir upthread. For the back branches, I agree with you. Even if it's undocumented behavior, basically we should not get rid of it from the back branches unless there is a very special reason. For v13, if it has no functional merit, I don't think it's so bad to get rid of that undocumented (and maybe not-fully tested) behavior. If there are applications depending on it, I think that they can be updated. I don't see this as a valid reason to keep doing something. If it is broken then fix it. Clients can deal with the change. +1 Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: Removal of currtid()/currtid2() and some table AM cleanup
Hi, On 2020/06/03 11:14, Michael Paquier wrote: Hi all, I have been looking at the ODBC driver and the need for currtid() as well as currtid2(), and as mentioned already in [1], matching with my lookup of things, these are actually not needed by the driver as long as we connect to a server newer than 8.2 able to support RETURNING. Though currtid2() is necessary even for servers which support RETURNING, I don't object to remove it. regards, Hiroshi Inoue I am adding in CC of this thread Saito-san and Inoue-san who are the two main maintainers of the driver for comments. It is worth noting that on its latest HEAD the ODBC driver requires libpq from at least 9.2. I would like to remove those two functions and the surrounding code for v14, leading to some cleanup: 6 files changed, 326 deletions(-) While on it, I have noticed that heap_get_latest_tid() is still located within heapam.c, but we can just move it within heapam_handler.c. Attached are two patches to address both points. Comments are welcome. Thanks, [1]: https://www.postgresql.org/message-id/20200529005559.jl2gsolomyro4...@alap3.anarazel.de -- Michael
Re: More tests with USING INDEX replident and dropped indexes
On Wed, 3 Jun 2020 at 03:14, Michael Paquier wrote: > On Tue, Jun 02, 2020 at 04:46:55PM +0900, Masahiko Sawada wrote: > > How about avoiding such an inconsistent situation? In that case, > > replica identity works as NOTHING, but pg_class.relreplident is still > > 'i', confusing users. It seems to me that dropping an index specified > > by REPLICA IDENTITY USING INDEX is not a valid operation. > > This looks first like complicating RemoveRelations() or the internal > object removal APIs with a dedicated lookup at this index's pg_index > tuple, but you could just put that in index_drop when REINDEX > CONCURRENTLY is not used. Still, I am not sure if it is worth > complicating those code paths. It would be better to get more > opinions about that first. > > Consistency is a good goal. Why don't we clear the relreplident from the relation while dropping the index? relation_mark_replica_identity() already does that but does other things too. Let's move the first code block from relation_mark_replica_identity to another function and call this new function while dropping the index. -- Euler Taveira http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
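As a concrete illustration of that suggestion, here is a hedged C sketch of a function that only clears pg_class.relreplident, which index_drop() could call when the dropped index was the replica identity. The helper name ResetRelationReplIdent is invented for this sketch, and whether the column should be reset to NOTHING or DEFAULT is an open design choice.

#include "postgres.h"
#include "access/htup_details.h"
#include "access/table.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "utils/syscache.h"

static void
ResetRelationReplIdent(Oid relid)
{
	Relation	pg_class;
	HeapTuple	tuple;

	pg_class = table_open(RelationRelationId, RowExclusiveLock);
	tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(relid));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for relation %u", relid);

	/* match the observed behavior: a dropped identity index acts as NOTHING */
	((Form_pg_class) GETSTRUCT(tuple))->relreplident = REPLICA_IDENTITY_NOTHING;
	CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);

	heap_freetuple(tuple);
	table_close(pg_class, RowExclusiveLock);
}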
Re: Why is pq_begintypsend so slow?
On Tue, Jun 2, 2020 at 9:56 PM Andres Freund wrote: > The biggest problem after that is that we waste a lot of time memcpying > stuff around repeatedly. There is: > 1) send function: datum -> per datum stringinfo > 2) printtup: per datum stringinfo -> per row stringinfo > 3) socket_putmessage: per row stringinfo -> PqSendBuffer > 4) send(): PqSendBuffer -> kernel buffer > > It's obviously hard to avoid 1) and 4) in the common case, but the > number of other copies seem pretty clearly excessive. I too have seen recent benchmarking data where this was a big problem. Basically, you need a workload where the server doesn't have much or any actual query processing to do, but is just returning a lot of stuff to a really fast client - e.g. a locally connected client. That's not necessarily the most common case but, if you have it, all this extra copying is really pretty expensive. My first thought was to wonder about changing all of our send/output functions to write into a buffer passed as an argument rather than returning something which we then have to copy into a different buffer, but that would be a somewhat painful change, so it is probably better to first pursue the idea of getting rid of some of the other copies that happen in more centralized places (e.g. printtup). I wonder if we could replace the whole pq_beginmessage...()/pq_send()/pq_endmessage...() system with something a bit better-designed. For instance, suppose we get rid of the idea that the caller supplies the buffer, and we move the responsibility for error recovery into the pqcomm layer. So you do something like: my_message = xyz_beginmessage('D'); xyz_sendint32(my_message, 42); xyz_endmessage(my_message); Maybe what happens here under the hood is we keep a pool of free message buffers sitting around, and you just grab one and put your data into it. When you end the message we add it to a list of used message buffers that are waiting to be sent, and once we send the data it goes back on the free list. If an error occurs after xyz_beginmessage() and before xyz_endmessage(), we put the buffer back on the free list. That would allow us to merge (2) and (3) into a single copy. To go further, we could allow send/output functions to opt in to receiving a message buffer rather than returning a value, and then we could get rid of (1) for types that participate. (4) seems unavoidable AFAIK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
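To make the proposed interface more concrete, here is a hedged, self-contained C sketch of the buffer-pool idea. The xyz_* names follow the hypothetical API in the mail and are not an existing PostgreSQL interface; network byte ordering, FIFO send ordering, and allocation-failure handling are glossed over, and error recovery (returning an unfinished buffer to the free list) would live in the pqcomm layer in a real version.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct MsgBuf
{
	struct MsgBuf *next;		/* links the free list or the send queue */
	char		msgtype;		/* protocol message type, e.g. 'D' */
	size_t		len;			/* bytes used */
	size_t		capacity;		/* bytes allocated */
	char	   *data;
} MsgBuf;

static MsgBuf *free_list;		/* buffers available for reuse */
static MsgBuf *send_queue;		/* finished messages awaiting send() */

static MsgBuf *
xyz_beginmessage(char msgtype)
{
	MsgBuf	   *buf = free_list;

	if (buf != NULL)
		free_list = buf->next;
	else
	{
		buf = malloc(sizeof(MsgBuf));
		buf->capacity = 1024;
		buf->data = malloc(buf->capacity);
	}
	buf->msgtype = msgtype;
	buf->len = 0;
	return buf;
}

static void
xyz_sendint32(MsgBuf *buf, uint32_t value)
{
	if (buf->len + sizeof(value) > buf->capacity)
	{
		buf->capacity *= 2;
		buf->data = realloc(buf->data, buf->capacity);
	}
	memcpy(buf->data + buf->len, &value, sizeof(value));	/* htonl omitted in this sketch */
	buf->len += sizeof(value);
}

static void
xyz_endmessage(MsgBuf *buf)
{
	/* queue for sending; after send() the buffer goes back on free_list */
	buf->next = send_queue;
	send_queue = buf;
}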
significant slowdown of HashAggregate between 9.6 and 10
Hi

One Czech Postgres user reported a performance issue related to the speed of HashAggregate in a nested loop.

The speed of 9.6:

HashAggregate (cost=27586.10..27728.66 rows=14256 width=24) (actual time=0.003..0.049 rows=39 loops=599203)

The speed of 10.7:

HashAggregate (cost=27336.78..27552.78 rows=21600 width=24) (actual time=0.011..0.156 rows=38 loops=597137)

So it looks like HashAgg is about 3x slower - with a brutal nested loop it is a problem.

I wrote a simple benchmark and it really looks like our hash aggregate is getting slower and slower.

create table foo(a int, b int, c int, d int, e int, f int);
insert into foo select random()*1000, random()*4, random()*4, random()*2, random()*100, random()*100 from generate_series(1,2000000);

analyze foo;

9.6.7

postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=807.485..3364.515 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.006..0.006 rows=0 loops=500000)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.015..139.736 rows=2000000 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.276 ms
 Execution time: 3365.758 ms
(14 rows)

10.9

postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=825.468..4919.063 rows=74 loops=1)
   Filter: (SubPlan 1)
   Rows Removed by Filter: 499926
   Buffers: shared hit=12739, temp read=856 written=855
   SubPlan 1
     ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.009..0.009 rows=0 loops=500000)
           Group Key: foo.b, foo.c, foo.d
           Filter: (count(*) = g.i)
           Rows Removed by Filter: 75
           Buffers: shared hit=12739
           ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.025..157.887 rows=2000000 loops=1)
                 Buffers: shared hit=12739
 Planning time: 0.829 ms
 Execution time: 4920.800 ms
(14 rows)

master

postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select c
Re: REINDEX CONCURRENTLY and indisreplident
On Wed, 3 Jun 2020 at 03:54, Michael Paquier wrote: > Hi all, > > I have bumped into $subject, causing a replica identity index to > be considered as dropped if running REINDEX CONCURRENTLY on it. This > means that the old tuple information would get lost in this case, as > a REPLICA IDENTITY USING INDEX without a dropped index is the same as > NOTHING. > > LGTM. I tested in both versions (12, master) and it works accordingly. -- Euler Taveira http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: significant slowdown of HashAggregate between 9.6 and 10
On Wed, 3 Jun 2020 at 17:32, Pavel Stehule wrote:
> Hi
>
> One Czech Postgres user reported a performance issue related to the speed of HashAggregate in a nested loop.
>
> The speed of 9.6:
>
> HashAggregate (cost=27586.10..27728.66 rows=14256 width=24) (actual time=0.003..0.049 rows=39 loops=599203)
>
> The speed of 10.7:
>
> HashAggregate (cost=27336.78..27552.78 rows=21600 width=24) (actual time=0.011..0.156 rows=38 loops=597137)
>
> So it looks like HashAgg is about 3x slower - with a brutal nested loop it is a problem.
>
> I wrote a simple benchmark and it really looks like our hash aggregate is getting slower and slower.
>
> create table foo(a int, b int, c int, d int, e int, f int);
> insert into foo select random()*1000, random()*4, random()*4, random()*2, random()*100, random()*100 from generate_series(1,2000000);
>
> analyze foo;
>
> 9.6.7
>
> postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
>                                                             QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=807.485..3364.515 rows=74 loops=1)
>    Filter: (SubPlan 1)
>    Rows Removed by Filter: 499926
>    Buffers: shared hit=12739, temp read=856 written=855
>    SubPlan 1
>      ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.006..0.006 rows=0 loops=500000)
>            Group Key: foo.b, foo.c, foo.d
>            Filter: (count(*) = g.i)
>            Rows Removed by Filter: 75
>            Buffers: shared hit=12739
>            ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.015..139.736 rows=2000000 loops=1)
>                  Buffers: shared hit=12739
>  Planning time: 0.276 ms
>  Execution time: 3365.758 ms
> (14 rows)
>
> 10.9
>
> postgres=# explain (analyze, buffers) select i from generate_series(1,500000) g(i) where exists (select count(*) cx from foo group by b, c, d having count(*) = i);
>                                                             QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Function Scan on generate_series g  (cost=0.00..57739020.00 rows=500 width=4) (actual time=825.468..4919.063 rows=74 loops=1)
>    Filter: (SubPlan 1)
>    Rows Removed by Filter: 499926
>    Buffers: shared hit=12739, temp read=856 written=855
>    SubPlan 1
>      ->  HashAggregate  (cost=57739.00..57739.75 rows=75 width=20) (actual time=0.009..0.009 rows=0 loops=500000)
>            Group Key: foo.b, foo.c, foo.d
>            Filter: (count(*) = g.i)
>            Rows Removed by Filter: 75
>            Buffers: shared hit=12739
>            ->  Seq Scan on foo  (cost=0.00..32739.00 rows=2000000 width=12) (actual time=0.025..157.887 rows=2000000 loops=1)
>                  Buffers: shared hit=12739
>  Planning time: 0.829 ms
>  Execution time: 4920.800 ms
Re: Parallel copy
On Mon, May 18, 2020 at 12:48 AM Amit Kapila wrote: > In the above case, even though we are executing a single command from > the user perspective, but the currentCommandId will be four after the > command. One increment will be for the copy command and the other > three increments are for locking tuple in PK table > (tab_fk_referenced_chk) for three tuples in FK table > (tab_fk_referencing_chk). Now, for parallel workers, it is > (theoretically) possible that the three tuples are processed by three > different workers which don't get synced as of now. The question was > do we see any kind of problem with this and if so can we just sync it > up at the end of parallelism. I strongly disagree with the idea of "just sync(ing) it up at the end of parallelism". That seems like a completely unprincipled approach to the problem. Either the command counter increment is important or it's not. If it's not important, maybe we can arrange to skip it in the first place. If it is important, then it's probably not OK for each backend to be doing it separately. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Expand the use of check_canonical_path() for more GUCs
On Tue, Jun 2, 2020 at 5:04 AM Peter Eisentraut wrote: > The archeology reveals that these calls were originally added to > canonicalize the data_directory and config_file settings (7b0f060d54), > but that was then moved out of guc.c to be done early during postmaster > startup (337ffcddba). The remaining calls of check_canonical_path() in > guc.c appear to be leftovers from a previous regime. Thanks for looking into it. Sounds like it can just be ripped out, then, unless someone knows of a reason to do otherwise. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Read access for pg_monitor to pg_replication_origin_status view
Hi Kyotaro-san, Thank you for taking the time to review my patches. Would you like to set yourself as a reviewer in the commit entry here? https://commitfest.postgresql.org/28/2577/ > 0002: > > It is forgetting to grant pg_read_all_stats to execute > pg_show_replication_origin_status. As a result pg_read_all_stats > gets an error on executing the function, not on doing a select on the view. Seems I was testing on a cluster where I had already been digging, so pg_read_all_stats had execute privileges on pg_show_replication_origin_status (I had manually granted that) and didn't notice because I forgot to drop the cluster and initialize again. Thanks for the pointer here! > 0003: > > It seems to be a take-after of adminpack's documentation, but a > superuser is not the only one on PostgreSQL. Something like the > description in 27.2.2 Viewing Statistics looks more suitable. > > > Superusers and members of the built-in role pg_read_all_stats (see > > also Section 21.5) can see all the information about all sessions. > > Section 21.5 already says the following. > > > pg_monitor > > Read/execute various monitoring views and functions. This role is a > > member of pg_read_all_settings, pg_read_all_stats and > > pg_stat_scan_tables. I'm not sure if I got this right, but I added some more text to point out that the pg_read_all_stats role can also access one specific function. I personally think it's a bit too detailed, and if we wanted to add details it should be formatted differently, which would require a more invasive patch (would prefer leaving that out, as it might even mean moving parts which are not part of this patch). In any case, I hope the change fits what you've kindly pointed out. > 0004: > > Looks fine by me. Here goes v3 of the patch. -- Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services From 87c00782c691f2c6c13768cd6d96b19de69cab16 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mart=C3=ADn=20Marqu=C3=A9s?= Date: Tue, 2 Jun 2020 10:44:42 -0300 Subject: [PATCH v3 1/4] Access to pg_replication_origin_status view has restricted access only for superusers due to it using pg_show_replication_origin_status(), and all replication origin functions requiring superuser rights. This patch removes the check for superuser priviledges when executing replication origin functions, which is hardcoded, and instead rely on ACL checks. 
This patch will also revoke execution of such functions from PUBLIC --- src/backend/catalog/system_views.sql | 16 src/backend/replication/logical/origin.c | 5 - 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 56420bbc9d6..6658f0c2eb2 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1469,6 +1469,22 @@ REVOKE EXECUTE ON FUNCTION pg_stat_file(text,boolean) FROM public; REVOKE EXECUTE ON FUNCTION pg_ls_dir(text) FROM public; REVOKE EXECUTE ON FUNCTION pg_ls_dir(text,boolean,boolean) FROM public; +-- +-- Permision to execute Replication Origin functions should be revoked from public +-- +REVOKE EXECUTE ON FUNCTION pg_replication_origin_advance(text, pg_lsn) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_create(text) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_drop(text) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_oid(text) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_progress(text, boolean) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_is_setup() FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_progress(boolean) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_reset() FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_session_setup(text) FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_reset() FROM public; +REVOKE EXECUTE ON FUNCTION pg_replication_origin_xact_setup(pg_lsn, timestamp with time zone) FROM public; +REVOKE EXECUTE ON FUNCTION pg_show_replication_origin_status() FROM public; + -- -- We also set up some things as accessible to standard roles. -- diff --git a/src/backend/replication/logical/origin.c b/src/backend/replication/logical/origin.c index 981d60f135d..1f223daf21f 100644 --- a/src/backend/replication/logical/origin.c +++ b/src/backend/replication/logical/origin.c @@ -182,11 +182,6 @@ static ReplicationState *session_replication_state = NULL; static void replorigin_check_prerequisites(bool check_slots, bool recoveryOK) { - if (!superuser()) - ereport(ERROR, -(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("only superusers can query or manipulate replication origins"))); - if (check_slots && max_replication_slots == 0) ereport(ERROR, (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), -- 2.21.3 From 0d0ef6aa7e563bd3b9
Towards easier AMs: Cleaning up inappropriate use of name "relkind"
Hackers, The name "relkind" normally refers to a field of type 'char' with values like 'r' for "table" and 'i' for "index". In AlterTableStmt and CreateTableAsStmt, this naming convention was abused for a field of type enum ObjectType. Often, such fields are named "objtype", though also "kind", "removeType", "renameType", etc. I don't care to debate those other names, though in passing I'll say that "kind" seems not great. The "relkind" name is particularly bad, though. It is confusing in functions that also operate on a RangeTblEntry object, which also has a field named "relkind", and is confusing in light of the function get_relkind_objtype() which maps from "relkind" to "objtype", implying quite correctly that those two things are distinct. The attached patch cleans this up. How many toes am I stepping on here? Changing the names was straightforward and doesn't seem to cause any issues with 'make check-world'. Any objection? For those interested in the larger context of this patch, I am trying to clean up any part of the code that makes it harder to write and test new access methods. When implementing a new AM, one currently needs to `grep -i relkind` to find a long list of files that need special treatment. One then needs to consider whether special logic for the new AM needs to be inserted into all these spots. As such, it is nice if these spots have as little confusing naming as possible. This patch makes that process a little easier. I have another patch (to be posted shortly) that cleans up the #define RELKIND_XXX stuff using a new RelKind enum and special macros while keeping the relkind fields as type 'char'. Along with converting code to use switch(relkind) rather than if (relkind...) statements, the compiler now warns on unhandled cases when you add a new RelKind to the list, making it easier to find all the places you need to update. I decided to keep that work independent of this patch, as the code is logically distinct. v1-0001-Renaming-relkind-as-objtype-where-appropriate.patch Description: Binary data β Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Why is pq_begintypsend so slow?
Hi, On 2020-06-03 11:30:42 -0400, Robert Haas wrote: > I too have seen recent benchmarking data where this was a big problem. > Basically, you need a workload where the server doesn't have much or > any actual query processing to do, but is just returning a lot of > stuff to a really fast client - e.g. a locally connected client. > That's not necessarily the most common case but, if you have it, all > this extra copying is really pretty expensive. Even when the query actually is doing something, it's still quite possible to get the memcpies to be measurable (say > 10% of cycles). Obviously not in a huge aggregating query. Even in something like pgbench -M prepared -S, which is obviously spending most of its cycles elsewhere, the patches upthread improve throughput by ~1.5% (and that's not eliding several unnecessary copies). > My first thought was to wonder about changing all of our send/output > functions to write into a buffer passed as an argument rather than > returning something which we then have to copy into a different > buffer, but that would be a somewhat painful change, so it is probably > better to first pursue the idea of getting rid of some of the other > copies that happen in more centralized places (e.g. printtup). For those I think the allocator overhead is a bigger issue than the memcpy itself. I wonder how much we could transparently hide in pq_begintypsend()/pq_endtypsend(). > I > wonder if we could replace the whole > pq_beginmessage...()/pq_send()/pq_endmessage...() system with > something a bit better-designed. For instance, suppose we get rid of > the idea that the caller supplies the buffer, and we move the > responsibility for error recovery into the pqcomm layer. So you do > something like: > > my_message = xyz_beginmessage('D'); > xyz_sendint32(my_message, 42); > xyz_endmessage(my_message); > > Maybe what happens here under the hood is we keep a pool of free > message buffers sitting around, and you just grab one and put your > data into it. Why do we need multiple buffers? ISTM we don't want to just send messages at endmsg() time, because that implies unnecessary syscall overhead. Nor do we want to incur the overhead of the copy from the message buffer to the network buffer. To me that seems to imply that the best approach would be to have PqSendBuffer be something stringbuffer like, and have pg_beginmessage() record the starting position of the current message somewhere (->cursor?). When an error is thrown, we reset the position to be where the in-progress message would have begun. I've previously outlined a slightly more complicated scheme, where we have "proxy" stringinfos that point into another stringinfo, instead of their own buffer. And know how to resize the "outer" buffer when needed. That'd have some advantages, but I'm not sure it's really needed. There are some disadvantages with what I describe above, in particular when dealing with send() sending only parts of our network buffer. We couldn't cheaply reuse the already sent memory in that case. I've before wondered / suggested that we should have StringInfos not insist on having one consecutive buffer (which obviously implies needing to copy contents when growing). Instead it should have a list of buffers containing chunks of the data, and never copy contents around while the string is being built. We'd only allocate a buffer big enough for all data when the caller actually wants to have all the resulting data in one string (rather than using an API that can iterate over chunks). 
For the network buffer case that'd allow us to reuse the earlier buffers even in the "partial send" case. And more generally it'd allow us to be less wasteful with buffer sizes, and perhaps even have a small "inline" buffer inside StringInfoData, avoiding unnecessary memory allocations in a lot of cases where the common case is only a small amount of data being used. And I think the overhead while appending data to such a stringinfo should be negligible, because it'd just require the exact same checks we already have to do for enlargeStringInfo(). > (4) seems unavoidable AFAIK. Not entirely. Linux can do zero-copy sends, but it does require somewhat complicated black magic rituals. Including more complex buffer management for the application, because the memory containing the to-be-sent data cannot be reused until the kernel notifies that it's done with the buffer. See https://www.kernel.org/doc/html/latest/networking/msg_zerocopy.html That might be something worth pursuing in the future (since it, I think, basically avoids spending any cpu cycles on copying data around in the happy path, relying on DMA instead), but I think for now there's much bigger fish to fry. I am hoping that somebody will write a nicer abstraction for zero-copy sends using io_uring, avoiding the need for a separate completion queue by simply only signalling completion for the sendmsg operation once the buffer isn't needed anymore. There's
Atomic operations within spinlocks
In connection with the nearby thread about spinlock coding rule violations, I noticed that we have several places that are doing things like this: SpinLockAcquire(&WalRcv->mutex); ... written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto); ... SpinLockRelease(&WalRcv->mutex); That's from pg_stat_get_wal_receiver(); there is similar code in freelist.c's ClockSweepTick() and StrategySyncStart(). This seems to me to be very bad code. In the first place, on machines without the appropriate type of atomic operation, atomics.c is going to be using a spinlock to emulate atomicity, which means this code tries to take a spinlock while holding another one. That's not okay, either from the standpoint of performance or error-safety. In the second place, this coding seems to me to indicate serious confusion about which lock is protecting what. In the above example, if writtenUpto is only accessed through atomic operations then it seems like we could just move the pg_atomic_read_u64 out of the spinlock section; or if the spinlock is adequate protection then we could just do a normal fetch. If we actually need both locks then this needs significant re-thinking, IMO. Comments? regards, tom lane
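For illustration, a hedged C sketch of the first option described above, using the walreceiver example quoted in the message: if writtenUpto is only ever accessed through atomics, the read can simply move outside the spinlock section. The helper name read_written_lsn is invented for this sketch; the spinlock would still be taken separately for the fields it actually protects.

#include "postgres.h"
#include "port/atomics.h"
#include "replication/walreceiver.h"

static XLogRecPtr
read_written_lsn(void)
{
	XLogRecPtr	written_lsn;

	/*
	 * The atomic read needs no spinlock, and per the discussion above it
	 * must not be nested inside one; take WalRcv->mutex only around the
	 * fields that the mutex actually protects.
	 */
	written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto);

	return written_lsn;
}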
Re: question regarding copyData containers
Jerome Wagner writes: > now my question is the following : > is it ok to consider that over the long term copyData is simply a transport > container that exists only to allow the multiplexing of events in the > protocol but that messages inside could be chunked over several copyData > events ? Yes, the expectation is that clients can send CopyData messages that are split up however they choose; the message boundaries needn't correspond to any semantic boundaries in the data stream. The rule in the other direction, that a message corresponds to one table row, is something that might not last forever either. As we get more people working with large data values, there's going to be pressure to set some smaller limit on message size. regards, tom lane
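A tiny libpq sketch of that point, assuming conn is an open connection on which a COPY ... FROM STDIN has already been issued and with error checking omitted: CopyData boundaries are transport-level only, so a single row may be split across several PQputCopyData() calls.

#include "libpq-fe.h"

static void
send_two_rows_in_odd_pieces(PGconn *conn)
{
	/* first row ("1<tab>foobar") deliberately split mid-value */
	PQputCopyData(conn, "1\tfoo", 5);
	/* rest of row 1, then all of row 2 ("2<tab>baz") */
	PQputCopyData(conn, "bar\n2\tbaz\n", 10);
	/* finish the COPY; NULL means no client-side error to report */
	PQputCopyEnd(conn, NULL);
}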
Re: elog(DEBUG2 in SpinLocked section.
Kyotaro Horiguchi writes: > I looked through 224 locations where SpinLockAcquire and found some. Yeah, I made a similar scan and arrived at about the same conclusions. I think that the memcpy and strlcpy calls are fine; at least, we've got to transport data somehow and it's not apparent why those aren't OK ways to do it. The one use of StrNCpy is annoying from a cosmetic standpoint (mainly because it's Not Like Anywhere Else) but I'm not sure it's worth changing. The condition-variable code has a boatload of spinlocked calls of the proclist functions in proclist.h. All of those are straight-line code so they're okay performance wise, but I wonder if we shouldn't add a comment to that header pointing out that its functions must not throw errors. The only other thing I remain concerned about is some instances of atomic operations inside spinlocks, which I started a separate thread about [1]. regards, tom lane [1] https://www.postgresql.org/message-id/1141819.1591208385%40sss.pgh.pa.us
Re: Parallel copy
Hi, On 2020-06-03 12:13:14 -0400, Robert Haas wrote: > On Mon, May 18, 2020 at 12:48 AM Amit Kapila wrote: > > In the above case, even though we are executing a single command from > > the user perspective, but the currentCommandId will be four after the > > command. One increment will be for the copy command and the other > > three increments are for locking tuple in PK table > > (tab_fk_referenced_chk) for three tuples in FK table > > (tab_fk_referencing_chk). Now, for parallel workers, it is > > (theoretically) possible that the three tuples are processed by three > > different workers which don't get synced as of now. The question was > > do we see any kind of problem with this and if so can we just sync it > > up at the end of parallelism. > I strongly disagree with the idea of "just sync(ing) it up at the end > of parallelism". That seems like a completely unprincipled approach to > the problem. Either the command counter increment is important or it's > not. If it's not important, maybe we can arrange to skip it in the > first place. If it is important, then it's probably not OK for each > backend to be doing it separately. That scares me too. These command counter increments definitely aren't unnecessary in the general case. Even in the example you share above, aren't we potentially going to actually lock rows multiple times from within the same transaction, instead of once? If the command counter increments from within ri_trigger.c aren't visible to other parallel workers/leader, we'll not correctly understand that a locked row is invisible to heap_lock_tuple, because we're not using a new enough snapshot (by dint of not having a new enough cid). I've not dug through everything that'd potentially cause, but it seems pretty clearly a no-go from here. Greetings, Andres Freund
Re: Expand the use of check_canonical_path() for more GUCs
Robert Haas writes: > On Tue, Jun 2, 2020 at 5:04 AM Peter Eisentraut > wrote: >> The archeology reveals that these calls were originally added to >> canonicalize the data_directory and config_file settings (7b0f060d54), >> but that was then moved out of guc.c to be done early during postmaster >> startup (337ffcddba). The remaining calls of check_canonical_path() in >> guc.c appear to be leftovers from a previous regime. > Thanks for looking into it. Sounds like it can just be ripped out, > then, unless someone knows of a reason to do otherwise. In the abstract, I agree with Peter's point that we shouldn't alter user-given strings without need. However, I think there's strong reason for canonicalizing the data directory and config file locations. We access those both before and after chdir'ing into the datadir, so we'd better have absolute paths to them --- and at least for the datadir, it's documented that you can initially give it as a path relative to wherever you started the postmaster from. If the other files are only accessed after the chdir happens then we could likely do without canonicalizing them. But ... do we know which directory the user (thought he) specified them with reference to? Forced canonicalization does have the advantage that it's clear to all onlookers how we are interpreting the paths. regards, tom lane
Re: Parallel copy
Hi, On 2020-06-03 15:53:24 +0530, vignesh C wrote:
> Exec time (seconds) for:
>   (a) copy from file,  2 indexes on integer columns, 1 index on text column
>   (b) copy from stdin, 2 indexes on integer columns, 1 index on text column
>   (c) copy from file,  1 gist index on text column
>   (d) copy from file,  3 indexes on integer columns
>   (e) copy from stdin, 3 indexes on integer columns
>
> Workers      (a)              (b)              (c)             (d)              (e)
> 0        1162.772(1X)     1176.035(1X)      827.669(1X)     216.171(1X)      217.376(1X)
> 1        1110.288(1.05X)  1120.556(1.05X)   747.384(1.11X)  174.242(1.24X)   163.492(1.33X)
> 2         635.249(1.83X)   668.18(1.76X)    435.673(1.9X)   133.829(1.61X)   126.516(1.72X)
> 4         336.835(3.45X)   346.768(3.39X)   236.406(3.5X)   105.767(2.04X)   107.382(2.02X)
> 8         188.577(6.17X)   194.491(6.04X)   148.962(5.56X)  100.708(2.15X)   107.72(2.01X)
> 16        126.819(9.17X)   146.402(8.03X)   119.923(6.9X)    97.996(2.2X)    106.531(2.04X)
> 20       *117.845(9.87X)*  149.203(7.88X)   138.741(5.96X)   97.94(2.21X)    107.5(2.02)
> 30        127.554(9.11X)   161.218(7.29X)   172.443(4.8X)    98.232(2.2X)    108.778(1.99X)

Hm. you don't explicitly mention that in your design, but given how small the benefits going from 0-1 workers is, I assume the leader doesn't do any "chunk processing" on its own?

> Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> submitted acts as leader with the responsibility of reading data from the
> file/stdin, launching at most n number of workers as specified with
> PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> common data required for the workers execution in the DSM and shares it
> with the workers. The leader then executes before statement triggers if
> there exists any. Leader populates DSM chunks which includes the start
> offset and chunk size, while populating the chunks it reads as many blocks
> as required into the DSM data blocks from the file. Each block is of 64K
> size. The leader parses the data to identify a chunk, the existing logic
> from CopyReadLineText which identifies the chunks with some changes was
> used for this. Leader checks if a free chunk is available to copy the
> information, if there is no free chunk it waits till the required chunk is
> freed up by the worker and then copies the identified chunks information
> (offset & chunk size) into the DSM chunks. This process is repeated till
> the complete file is processed. Simultaneously, the workers cache the
> chunks(50) locally into the local memory and release the chunks to the
> leader for further populating. Each worker processes the chunk that it
> cached and inserts it into the table. The leader waits till all the chunks
> populated are processed by the workers and exits.

Why do we need the local copy of 50 chunks? Copying memory around is far from free. I don't see why it'd be better to add per-process caching, rather than making the DSM bigger? I can see some benefit in marking multiple chunks as being processed with one lock acquisition, but I don't think adding a memory copy is a good idea.

This patch *desperately* needs to be split up. It imo is close to unreviewable, due to a large amount of changes that just move code around without other functional changes being mixed in with the actual new stuff.

> /* > + * State of the chunk. 
> + */ > +typedef enum ChunkState > +{ > + CHUNK_INIT, /* initial state of > chunk */ > + CHUNK_LEADER_POPULATING,/* leader processing chunk */ > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */ > + CHUNK_WORKER_PROCESSING,/* worker processing chunk */ > + CHUNK_WORKER_PROCESSED /* worker completed processing chunk */ > +}ChunkState; > + > +#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */ > + > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE > +#define RINGSIZE (10 * 1000) > +#define MAX_BLOCKS_COUNT 1000 > +#define WORKER_CHUNK_COUNT 50/* should be mod of RINGSIZE */ > + > +#define IsParallelCopy()(cstate->is_parallel) > +#define IsLeader() (cstate->pcdata->is_leader) > +#define IsHeaderLine() (cstate->header_line && > cstate->cur_lineno == 1) > + > +/* > + * Copy data block information. > + */ > +typedef struct CopyDataBlock > +{ > + /* The number of unprocessed chunks in the current block. */ > + pg_atomic_uint32 unprocessed_chunk_parts; > + > + /* > + * If the current chunk data is continued into another block, > + * following_block will have the position where the remaining data need > to > + * be read. > + */ > + uint32 following_block; > + > + /* > + * This flag will be set, when the leader finds out this block can be > read > + * safely by the worker. This helps the worker to start processing the > chunk > + * early where the chunk will be spread across many blocks and the > worker > + * need not wait for the complete chunk to be processed. > + */ > + bool curr_blk_completed; > +
Re: significant slowdown of HashAggregate between 9.6 and 10
Hi,

Not sure what's the root cause, but I can reproduce it. Timings for 9.6, 10 and master (all built from git with the same options) without explain analyze look like this:

9.6
---
Time: 1971.314 ms
Time: 1995.875 ms
Time: 1997.408 ms
Time: 2069.913 ms
Time: 2004.196 ms

10
--
Time: 2815.434 ms (00:02.815)
Time: 2862.589 ms (00:02.863)
Time: 2841.126 ms (00:02.841)
Time: 2803.040 ms (00:02.803)
Time: 2805.527 ms (00:02.806)

master
------
Time: 3479.233 ms (00:03.479)
Time: 3537.901 ms (00:03.538)
Time: 3459.314 ms (00:03.459)
Time: 3542.810 ms (00:03.543)
Time: 3482.141 ms (00:03.482)

So there seems to be +40% between 9.6 and 10, and further +25% between 10 and master. However, plain hashagg, measured e.g. like this:

select count(*) cx from foo group by b, c, d having count(*) = 1;

does not indicate any slowdown at all, so I think you're right it has something to do with the looping. Profiles from those versions look like this:

9.6
---
Samples  Overhead  Shared Object  Symbol
         14.19%    postgres       [.] ExecMakeFunctionResultNoSets
         13.65%    postgres       [.] finalize_aggregates
         12.54%    postgres       [.] hash_seq_search
          6.70%    postgres       [.] finalize_aggregate.isra.0
          5.71%    postgres       [.] ExecEvalParamExec
          5.54%    postgres       [.] ExecEvalAggref
          5.00%    postgres       [.] ExecStoreMinimalTuple
          4.34%    postgres       [.] ExecAgg
          4.08%    postgres       [.] ExecQual
          2.67%    postgres       [.] slot_deform_tuple
          2.24%    postgres       [.] pgstat_init_function_usage
          2.22%    postgres       [.] check_stack_depth
          2.14%    postgres       [.] MemoryContextReset
          1.89%    postgres       [.] hash_search_with_hash_value
          1.72%    postgres       [.] project_aggregates
          1.68%    postgres       [.] pgstat_end_function_usage
          1.59%    postgres       [.] slot_getattr

10
--
Samples  Overhead  Shared Object  Symbol
         15.18%    postgres       [.] slot_deform_tuple
         13.09%    postgres       [.] agg_retrieve_hash_table
         12.02%    postgres       [.] ExecInterpExpr
          7.47%    postgres       [.] finalize_aggregates
          7.38%    postgres       [.] tuplehash_iterate
          5.13%    postgres       [.] prepare_projection_slot
          4.86%    postgres       [.] finalize_aggregate.isra.0
          4.05%    postgres       [.] bms_is_member
          3.97%    postgres       [.] slot_getallattrs
          3.59%    postgres       [.] ExecStoreMinimalTuple
          2.85%    postgres       [.] project_aggregates
          1.95%    postgres       [.] ExecClearTuple
          1.71%    libc-2.30.so   [.] __memset_avx2_unaligned_erms
          1.69%    postgres       [.] ExecEvalParamExec
          1.58%    postgres       [.] MemoryContextReset
          1.17%    postgres       [.] slot_getattr
          1.03%    postgres       [.] slot_getsomeattrs

master
------
Samples  Overhead  Shared Object  Symbol
         17.07%    postgres       [.] agg_retrieve_hash_table
         15.46%    postgres       [.] tuplehash_iterate
         11.83%    postgres       [.] tts_minimal_getsomeattrs
          9.39%    postgres       [.] ExecInterpExpr
          6.94%    postgres       [.] prepare_projection_slot
          4.85%    postgres       [.] finalize_aggregates
          4.27%    postgres       [.] bms_is_member
          3.80%    postgres       [.] finalize_aggregate.isra.0
          3.80%    postgres       [.] tts_minimal_store_tuple
          2.22%    postgres       [.] project_aggregates
          2.07%    postgres       [.] tts_virtual_clear
          2.07%    postgres       [.] MemoryContextReset
          1.78%    postgres       [.] tts_minimal_clear
          1.61%    postgres       [.] ExecEvalParamExec
          1.46%    postgres       [.] slot_getsomeattrs_int
          1.34%    libc-2.30.so   [.] __memset_avx2_unaligned_erms

Not sure what to think about this. Seems slot_deform_tuple got way more expensive between 9.6 and 10, for some reason.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Internal key management system
On Wed, Jun 3, 2020 at 09:16:03AM +0200, Fabien COELHO wrote: > > > Also, I'm not fully at ease with some of the underlying principles > > > behind this proposal. Are we re-inventing/re-implementing kerberos or > > > whatever? Are we re-implementing a brand new KMS inside pg? Why having > > > our own? > > > > As I explained above, this key manager is for managing internal keys > > used by TDE. It's not an alternative to existing key management > > solutions/services. > > Hmmm. This seems to suggest that interacting with something outside should > be an option. Our goal is not to implement every possible security idea someone has, because we will never finish, and the final result would be too complex to be usable. You will need to explain exactly why having a separate process has value over coding/user complexity, and you will need to get agreement from a sufficient number of people to move that idea forward. > > The key rotation this key manager has is KEK rotation, which is very > > important. Without KEK rotation, when KEK is leaked an attacker can > > get database data by disk theft. Since KEK is responsible for > > encrypting all internal keys it's necessary to re-encrypt the internal > > keys when KEK is rotated. I think PG is the only role that can do that > > job. > > I'm not claiming that KEK rotation is a bad thing, I'm saying that it should > not be postgres problem. My issue is where you put the thing, not about the > thing itself. > > > I think this key manager satisfies the first point by > > cluster_passphrase_command. For the second point, the key manager > > stores local keys inside PG while protecting them by KEK managed > > outside of PG. > > I do not understand. From what I understood from the code, the KEK is loaded > into postgres process. That is what I'm disagreeing with, only needed DEK > should be there. One option would be to send the data needing to be encrypted to an external command, and get the decrypted data back. In that way, the KEK is never on the Postgres server. However, the API for doing such an interface seems very complex and could lead to failures. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
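The "external command" idea above could be prototyped roughly as in the sketch below: pipe the wrapped key to a command and read the unwrapped key back, so the KEK never enters the server process. The command, the function name, and the stdin/stdout calling convention are all assumptions for illustration; this is not part of any proposed patch.

/*
 * Hedged sketch of calling out to a hypothetical external key-unwrapping
 * command: wrapped key in on stdin, unwrapped key out on stdout.  A real
 * implementation would need robust error handling and a read loop.
 */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static ssize_t
run_unwrap_command(const char *cmd, const unsigned char *in, size_t inlen,
				   unsigned char *out, size_t outsize)
{
	int			tochild[2], fromchild[2];
	pid_t		pid;
	ssize_t		got;

	if (pipe(tochild) < 0 || pipe(fromchild) < 0)
		return -1;
	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
	{
		/* child: stdin <- tochild, stdout -> fromchild */
		dup2(tochild[0], STDIN_FILENO);
		dup2(fromchild[1], STDOUT_FILENO);
		close(tochild[1]);
		close(fromchild[0]);
		execl("/bin/sh", "sh", "-c", cmd, (char *) NULL);
		_exit(127);
	}
	/* parent: send the wrapped key, then read the unwrapped result */
	close(tochild[0]);
	close(fromchild[1]);
	if (write(tochild[1], in, inlen) != (ssize_t) inlen)
	{
		close(tochild[1]);
		close(fromchild[0]);
		waitpid(pid, NULL, 0);
		return -1;
	}
	close(tochild[1]);			/* EOF tells the command the input is complete */
	got = read(fromchild[0], out, outsize);
	close(fromchild[0]);
	waitpid(pid, NULL, 0);
	return got;
}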
libpq copy error handling busted
Hi, When libpq is used to COPY data to the server, it doesn't properly handle errors. An easy way to trigger the problem is to start pgbench -i with a sufficiently large scale, and then just shut the server down. pgbench will happily use 100% of the cpu trying to send data to the server, even though libpq knows that the connection is broken. It can't even be cancelled using ctrl-c anymore, because the cancel request cannot be sent: andres@awork3:~/src/postgresql$ pgbench -i -s 4000 -q dropping old tables... creating tables... generating data (client-side)... 80889300 of 400000000 tuples (20%) done (elapsed 85.00 s, remaining 335.33 s) ^CCould not send cancel request: PQcancel() -- connect() failed: No such file or directory This is partially an old problem, and partially got recently worse. Before the below commit we detected terminated connections, but we didn't handle copy failing. The failure to handle terminated connections originates in: commit 1f39a1c0641531e0462a4822f2dba904c5d4d699 Author: Tom Lane Date: 2019-03-19 16:20:20 -0400 Restructure libpq's handling of send failures. The problem is basically that pqSendSome() returns *success* in all failure cases. Both when a failure is already known: +/* + * If we already had a write failure, we will never again try to send data + * on that connection. Even if the kernel would let us, we've probably + * lost message boundary sync with the server. conn->write_failed + * therefore persists until the connection is reset, and we just discard + * all data presented to be written. + */ +if (conn->write_failed) +{ +/* conn->write_err_msg should be set up already */ +conn->outCount = 0; +return 0; +} + and when initially "diagnosing" the failure: /* Anything except EAGAIN/EWOULDBLOCK/EINTR is trouble */ switch (SOCK_ERRNO) ... /* Discard queued data; no chance it'll ever be sent */ conn->outCount = 0; return 0; The idea of the above commit was: Instead, let's fix this in a more general fashion by eliminating pqHandleSendFailure altogether, and instead arranging to postpone all reports of send failures in libpq until after we've made an attempt to read and process server messages. The send failure won't be reported at all if we find a server message or detect input EOF. but that doesn't work here, because we never process the error message. There's no code in pqParseInput3() to process server messages while doing copy. I'm honestly a bit baffled. How can we not have noticed that COPY FROM STDIN doesn't handle errors before the input is exhausted? It's not just pgbench, it's psql (and I assume pg_restore) etc as well. Greetings, Andres Freund
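For reference, the client-side pattern that breaks here looks roughly like the sketch below; the connection string and table name are hypothetical, but the libpq calls are the standard COPY FROM STDIN API. On the affected versions, PQputCopyData() keeps reporting success after the server goes away, so the loop spins until its input is exhausted instead of failing promptly.

/*
 * Minimal sketch of a client driving COPY FROM STDIN through libpq, showing
 * where the failure described above bites.  Connection string and table
 * name are hypothetical.
 */
#include <stdio.h>
#include <string.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}

	res = PQexec(conn, "COPY my_table FROM STDIN");
	if (PQresultStatus(res) != PGRES_COPY_IN)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}
	PQclear(res);

	for (i = 0; i < 1000000; i++)
	{
		char		line[64];

		snprintf(line, sizeof(line), "%d\tsome data\n", i);

		/*
		 * If the server dies here, PQputCopyData() keeps returning 1 on the
		 * affected versions, so this loop never notices the dead connection.
		 */
		if (PQputCopyData(conn, line, (int) strlen(line)) != 1)
		{
			fprintf(stderr, "%s", PQerrorMessage(conn));
			return 1;
		}
	}

	if (PQputCopyEnd(conn, NULL) != 1)
		fprintf(stderr, "%s", PQerrorMessage(conn));

	/* Errors raised by the server during COPY surface here. */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
			fprintf(stderr, "%s", PQerrorMessage(conn));
		PQclear(res);
	}
	PQfinish(conn);
	return 0;
}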
Re: significant slowdown of HashAggregate between 9.6 and 10
Hi, On 2020-06-03 21:31:01 +0200, Tomas Vondra wrote: > So there seems to be +40% between 9.6 and 10, and further +25% between > 10 and master. However, plain hashagg, measured e.g. like this: Ugh. Since I am a likely culprit, I'm taking a look. > Not sure what to think about this. Seems slot_deform_tuple got way more > expensive between 9.6 and 10, for some reason. Might indicate too many calls instead. Or it could just be the fact that we have expressions deform all columns once, instead of deforming columns one-by-one now. Greetings, Andres Freund
Re: what can go in root.crt ?
On Wed, Jun 3, 2020 at 03:07:30PM +0300, Ants Aasma wrote: > On Tue, 2 Jun 2020 at 20:14, Bruce Momjian wrote: > > The server certificate should be issued by a certificate authority root > outside of your organization only if you want people outside of your > organization to trust your server certificate, but you are then asking > for the client to only trust an intermediate inside your organization. > The big question is why bother having the server certificate chain to a > root certificate you don't trust when you have no intention of having > clients outside of your organization trust the server certificate. > Postgres could be made to handle such cases, but is it really a valid > configuration we should support? > > > I think the "why" the org cert is not root was already made clear, that is the > company policy. I don't think postgres should take a stance whether the > certificate designated as the root of trust is self-signed or claims to get its > power from somewhere else. Uh, we sure can. We disallow many configurations that we consider unsafe. openssl allowed a lot of things, and their flexibility makes them less secure. -- Bruce Momjian https://momjian.us EnterpriseDB https://enterprisedb.com The usefulness of a cup is in its emptiness, Bruce Lee
Re: Atomic operations within spinlocks
Hi, On 2020-06-03 14:19:45 -0400, Tom Lane wrote: > In connection with the nearby thread about spinlock coding rule > violations, I noticed that we have several places that are doing > things like this: > > SpinLockAcquire(&WalRcv->mutex); > ... > written_lsn = pg_atomic_read_u64(&WalRcv->writtenUpto); > ... > SpinLockRelease(&WalRcv->mutex); > > That's from pg_stat_get_wal_receiver(); there is similar code in > freelist.c's ClockSweepTick() and StrategySyncStart(). > > This seems to me to be very bad code. In the first place, on machines > without the appropriate type of atomic operation, atomics.c is going > to be using a spinlock to emulate atomicity, which means this code > tries to take a spinlock while holding another one. That's not okay, > either from the standpoint of performance or error-safety. I'm honestly not particularly concerned about either of those in general: - WRT performance: Which platforms where we care about performance can't do a tear-free read of a 64bit integer, and thus needs a spinlock for this? And while the cases in freelist.c aren't just reads, they are really cold paths (clock wrapping around). - WRT error safety: What could happen here? The atomics codepaths is no-fail code? And nothing should ever nest inside the atomic-emulation spinlocks. What am I missing? I think straight out prohibiting use of atomics inside a spinlock will lead to more complicated and slower code, rather than than improving anything. I'm to blame for the ClockSweepTick() case, and I don't really see how we could avoid the atomic-while-spinlocked without relying on 64bit atomics, which are emulated on more platforms, and without having a more complicated retry loop. > In the second place, this coding seems to me to indicate serious > confusion about which lock is protecting what. In the above example, > if writtenUpto is only accessed through atomic operations then it > seems like we could just move the pg_atomic_read_u64 out of the > spinlock section; or if the spinlock is adequate protection then we > could just do a normal fetch. If we actually need both locks then > this needs significant re-thinking, IMO. Yea, it doesn't seem necessary at all that writtenUpto is read with the spinlock held. That's very recent: commit 2c8dd05d6cbc86b7ad21cfd7010e041bb4c3950b Author: Michael Paquier Date: 2020-05-17 09:22:07 +0900 Make pg_stat_wal_receiver consistent with the WAL receiver's shmem info and I assume just was caused by mechanical replacement, rather than intentionally doing so while holding the spinlock. As far as I can tell none of the other writtenUpto accesses are under the spinlock. Greetings, Andres Freund
question regarding copyData containers
Hello, I have been working on a node.js streaming client for different COPY scenarios. usually, during CopyOut, clients tend to buffer network chunks until they have gathered a full copyData message and pass that to the user. In some cases, this can lead to very large copyData messages. when there are very long text fields or bytea fields it will require a lot of memory to be handled (up to 1GB I think in the worst case scenario) In COPY TO, I managed to relax that requirement, considering that copyData is simply a transparent container. For each network chunk, the relevent message content is forwarded which makes for 64KB chunks at most. If that makes things clearer, here is an example scenarios, with 4 network chunks received and the way they are forwarded to the client. in: CopyData Int32Len Byten1 in: Byten2 in: Byten3 in: CopyData Int32Len Byten4 out: Byten1 out: Byten2 out: Byten3 out: Byten4 We loose the semantics of the "row" that copyData has according to the documentation https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY >The backend sends a CopyOutResponse message to the frontend, followed by zero or more >CopyData messages (**always one per row**), followed by CopyDone but it is not a problem because the raw bytes are still parsable (rows + fields) in text mode (tsv) and in binary mode) Now I started working on copyBoth and logical decoding scenarios. In this case, the server send series of copyData. 1 copyData containing 1 message : at the network chunk level, in the case of large fields, we can observe in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1 in: Byten2 in: CopyData Int32 XLogData Int64 Int64 Int64 Byten3 in: CopyData Int32 XLogData Int64 Int64 Int64 Byten4 out: XLogData Int64 Int64 Int64 Byten1 out: Byten2 out: XLogData Int64 Int64 Int64 Byten3 out: XLogData Int64 Int64 Int64 Byten4 but at the XLogData level, the protocol is not self-describing its length, so there is no real way of knowing where the first XLogData ends apart from - knowing the length of the first copyData (4 + 1 + 3*8 + n1 + n2) - knowing the internals of the output plugin and benefit from a plugin that self-describe its span when a network chunks contains several copyDatas in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1 CopyData Int32 XLogData Int64 Int64 Int64 Byten2 we have out: XLogData Int64 Int64 Int64 Byten1 XLogData Int64 Int64 Int64 Byten2 and with test_decoding for example it is impossible to know where the test_decoding output ends without remembering the original length of the copyData. now my question is the following : is it ok to consider that over the long term copyData is simply a transport container that exists only to allow the multiplexing of events in the protocol but that messages inside could be chunked over several copyData events ? if we put test_decoding apart, do you consider that output plugins XLogData should be self-aware of their length ? I suppose (but did not fully verify yet) that this is the case for pgoutput ? I suppose that wal2json could also be parsed by balancing the brackets. I am wondering because when a client sends copyData to the server, the documentation says >The message boundaries are not required to have anything to do with row boundaries, >although that is often a reasonable choice. I hope that my message will ring a bell on the list. I tried the best I could to describe my very specific research. Thank you for your help, --- JΓ©rΓ΄me
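To make the length arithmetic above concrete, here is a rough C sketch of unwrapping an XLogData ('w') message from a CopyData ('d') message, assuming the whole CopyData message is already buffered in memory; the struct and helper names are invented for illustration.

/*
 * Sketch of the framing arithmetic discussed above: inside a CopyData ('d')
 * message carrying an XLogData ('w') payload, the payload length is not
 * self-described and must be derived from the CopyData length
 * (copydata_len = 4 + 1 + 3*8 + payload).
 */
#include <stddef.h>
#include <stdint.h>

typedef struct XLogDataFrame
{
	uint64_t	wal_start;		/* WAL position of this data */
	uint64_t	wal_end;		/* current end of WAL on the server */
	uint64_t	send_time;		/* server timestamp */
	const uint8_t *data;		/* output-plugin / WAL payload */
	size_t		data_len;		/* derived, not self-described */
} XLogDataFrame;

static uint32_t
get_u32(const uint8_t *p)
{
	return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
		   ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

static uint64_t
get_u64(const uint8_t *p)
{
	return ((uint64_t) get_u32(p) << 32) | get_u32(p + 4);
}

/* Returns 0 on success, -1 if this is not a CopyData-wrapped XLogData. */
static int
parse_xlogdata(const uint8_t *msg, XLogDataFrame *out)
{
	uint32_t	copydata_len;

	if (msg[0] != 'd')
		return -1;
	copydata_len = get_u32(msg + 1);	/* includes itself, not the 'd' byte */
	if (msg[5] != 'w' || copydata_len < 4 + 1 + 24)
		return -1;

	out->wal_start = get_u64(msg + 6);
	out->wal_end = get_u64(msg + 14);
	out->send_time = get_u64(msg + 22);
	out->data = msg + 30;
	/* payload = CopyData length minus its own 4 bytes, the 'w', and 3x8 */
	out->data_len = copydata_len - 4 - 1 - 24;
	return 0;
}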
[PATCH] Leading minus for negative time interval in ISO 8601
Hello! I'd like to propose a simple patch to allow for negative ISO 8601 intervals with leading minus, e.g. -PT1H besides PT-1H. It seems that standard isn't quite clear on negative duration. However, lots of software use leading minus and expect/generate intervals in such forms making those incompatible with current PostgreSQL decoding code. All patch is doing is making a note of a leading minus and negates pg_tm components along with fractional seconds. No other behavior change is introduced. -- Mikhail From c1d96252ebd727b205e1ee66f34f8b564eb05c00 Mon Sep 17 00:00:00 2001 From: Mikhail Titov Date: Wed, 3 Jun 2020 02:29:38 -0500 Subject: [PATCH] Support leading minus in front of ISO 8601 interval string It is a common extension for a standard to have negative intervals prefixed with a leading minus instead of individual parts. For example -PT1H as an alternative to PT-1H. The standard itself is not specific on negative intervals. This patch makes PostgreSQL a bit more compatible with 3-rd party software that use ISO 8601 style intervals that might be negative, e.g. * https://github.com/rails/rails/issues/39503 * https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html#parse-java.lang.CharSequence- --- src/backend/utils/adt/datetime.c | 20 1 file changed, 20 insertions(+) diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c index 1914138a24..f79df368a2 100644 --- a/src/backend/utils/adt/datetime.c +++ b/src/backend/utils/adt/datetime.c @@ -3492,10 +3492,17 @@ DecodeISO8601Interval(char *str, { bool datepart = true; bool havefield = false; + bool negative = false; *dtype = DTK_DELTA; ClearPgTm(tm, fsec); + if (str[0] == '-') + { + negative = true; + str++; + } + if (strlen(str) < 2 || str[0] != 'P') return DTERR_BAD_FORMAT; @@ -3681,6 +3688,19 @@ DecodeISO8601Interval(char *str, havefield = true; } + if (negative) + { + tm->tm_sec = -tm->tm_sec; + tm->tm_min = -tm->tm_min; + tm->tm_hour = -tm->tm_hour; + tm->tm_mday = -tm->tm_mday; + tm->tm_mon = -tm->tm_mon; + tm->tm_year = -tm->tm_year; + tm->tm_wday = -tm->tm_wday; + tm->tm_yday = -tm->tm_yday; + *fsec = -*fsec; + } + return 0; } -- 2.26.0.windows.1
Re: Atomic operations within spinlocks
On Thu, Jun 4, 2020 at 8:45 AM Andres Freund wrote: > On 2020-06-03 14:19:45 -0400, Tom Lane wrote: > > In the second place, this coding seems to me to indicate serious > > confusion about which lock is protecting what. In the above example, > > if writtenUpto is only accessed through atomic operations then it > > seems like we could just move the pg_atomic_read_u64 out of the > > spinlock section; or if the spinlock is adequate protection then we > > could just do a normal fetch. If we actually need both locks then > > this needs significant re-thinking, IMO. > > Yea, it doesn't seem necessary at all that writtenUpto is read with the > spinlock held. That's very recent: > > commit 2c8dd05d6cbc86b7ad21cfd7010e041bb4c3950b > Author: Michael Paquier > Date: 2020-05-17 09:22:07 +0900 > > Make pg_stat_wal_receiver consistent with the WAL receiver's shmem info > > and I assume just was caused by mechanical replacement, rather than > intentionally doing so while holding the spinlock. As far as I can tell > none of the other writtenUpto accesses are under the spinlock. Yeah. It'd be fine to move that after the spinlock release. Although it's really just for informational purposes only, not for any data integrity purpose, reading it before the spinlock acquisition would theoretically allow it to appear to be (reportedly) behind flushedUpto, which would be silly.
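The rearrangement being suggested would look roughly like the sketch below, which is illustrative only (the function is simplified from pg_stat_get_wal_receiver() and only a few fields are shown): plain shared-memory fields are copied under the spinlock, while the atomically-maintained writtenUpto is read after the lock is released.

/*
 * Illustrative sketch, not the actual patch: snapshot a few WAL receiver
 * fields, reading the atomic writtenUpto outside the spinlock section.
 */
#include "postgres.h"
#include "port/atomics.h"
#include "replication/walreceiver.h"
#include "storage/spin.h"

static void
snapshot_walrcv(WalRcvData *walrcv,
				XLogRecPtr *receive_start_lsn,
				XLogRecPtr *written_lsn,
				XLogRecPtr *flushed_lsn)
{
	SpinLockAcquire(&walrcv->mutex);
	/* plain shared fields: these do need the spinlock */
	*receive_start_lsn = walrcv->receiveStart;
	*flushed_lsn = walrcv->flushedUpto;
	SpinLockRelease(&walrcv->mutex);

	/* atomic field: no spinlock needed, read it after releasing the lock */
	*written_lsn = pg_atomic_read_u64(&walrcv->writtenUpto);
}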
Re: question regarding copyData containers
Hi, On 2020-06-03 19:28:12 +0200, Jerome Wagner wrote: > I have been working on a node.js streaming client for different COPY > scenarios. > usually, during CopyOut, clients tend to buffer network chunks until they > have gathered a full copyData message and pass that to the user. > > In some cases, this can lead to very large copyData messages. when there > are very long text fields or bytea fields it will require a lot of memory > to be handled (up to 1GB I think in the worst case scenario) > > In COPY TO, I managed to relax that requirement, considering that copyData > is simply a transparent container. For each network chunk, the relevent > message content is forwarded which makes for 64KB chunks at most. Uhm. > We loose the semantics of the "row" that copyData has according to the > documentation > https://www.postgresql.org/docs/10/protocol-flow.html#PROTOCOL-COPY > >The backend sends a CopyOutResponse message to the frontend, followed by > zero or more >CopyData messages (**always one per row**), followed by > CopyDone > > but it is not a problem because the raw bytes are still parsable (rows + > fields) in text mode (tsv) and in binary mode) This seems like an extremely bad idea to me. Are we really going to ask clients to incur the overhead (both in complexity and runtime) to parse incoming data just to detect row boundaries? Given the number of options there are for COPY, that's a seriously complicated task. I think that's a completely no-go. Leaving error handling aside (see para below), what does this actually get you? Either your client cares about getting a row in one sequential chunk, or it doesn't. If it doesn't care, then there's no need to allocate a buffer that can contain the whole 'd' message. You can just hand the clients the chunks incrementally. If it does, then you need to reassemble either way (or worse, you force to reimplement the client to reimplement that). I assume what you're trying to get at is being able to send CopyData messages before an entire row is assembled? And you want to send separate CopyData messages to allow for error handling? I think that's a quite worthwhile goal, but I don't think it can sensibly solved by just removing protocol level framing of row boundaries. And that will mean evolving the protocol in a non-compatible way. > Now I started working on copyBoth and logical decoding scenarios. In this > case, the server send series of copyData. 1 copyData containing 1 message : > > at the network chunk level, in the case of large fields, we can observe > > in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1 > in: Byten2 > in: CopyData Int32 XLogData Int64 Int64 Int64 Byten3 > in: CopyData Int32 XLogData Int64 Int64 Int64 Byten4 > > out: XLogData Int64 Int64 Int64 Byten1 > out: Byten2 > out: XLogData Int64 Int64 Int64 Byten3 > out: XLogData Int64 Int64 Int64 Byten4 > but at the XLogData level, the protocol is not self-describing its length, > so there is no real way of knowing where the first XLogData ends apart from > - knowing the length of the first copyData (4 + 1 + 3*8 + n1 + n2) > - knowing the internals of the output plugin and benefit from a plugin > that self-describe its span > when a network chunks contains several copyDatas > in: CopyData Int32 XLogData Int64 Int64 Int64 Byten1 CopyData Int32 > XLogData Int64 Int64 Int64 Byten2 > we have > out: XLogData Int64 Int64 Int64 Byten1 XLogData Int64 Int64 Int64 Byten2 Right now all 'w' messages should be contained in one CopyData/'d' that doesn't contain anything but the XLogData/'w'. 
Do you just mean that if we'd change the server side code to split 'w' messages across multiple 'd' messages, then we couldn't make much sense of the data anymore? If so, then I don't really see a problem. Unless you do a much larger change, what'd be the point in allowing to split 'w' across multiple 'd' chunks? The input data exists in a linear buffer already, so you're not going to reduce peak memory usage by sending smaller CopyData chunks. Sure, we could evolve the logical decoding interface to output to be able to send data in a much more incremental way than, typically, per-row basis. But I think that'd quite substantially increase complexity. And the message framing seems to be the easier part of such a change. Greetings, Andres Freund
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 2020-06-02 14:23:50 +0900, Fujii Masao wrote: > On 2020/06/02 13:24, Michael Paquier wrote: > > On Fri, May 29, 2020 at 06:09:06PM +0900, Masahiko Sawada wrote: > > > Yes. Conversely, if we start logical replication in a physical > > > replication connection (i.g. replication=true) we got an error before > > > staring replication: > > > > > > ERROR: logical decoding requires a database connection > > > > > > I think we can prevent that SEGV in a similar way. > > > > Still unconvinced as this restriction stands for logical decoding > > requiring a database connection but it is not necessarily true now as > > physical replication has less restrictions than a logical one. > > Could you tell me what the benefit for supporting physical replication on > logical rep connection is? If it's only for "undocumented" > backward-compatibility, IMO it's better to reject such "tricky" set up. > But if there are some use cases for that, I'm ok to support that. I don't think we should prohibit this. For one, it'd probably break some clients, without a meaningful need. But I think it's also actually quite useful to be able to access catalogs before streaming data. You e.g. can look up configuration of the primary before streaming WAL. With a second connection that's actually harder to do reliably in some cases, because you need to be sure that you actually reached the right server (consider a pooler, automatic failover etc). Greetings, Andres Freund
Re: Parallel Seq Scan vs kernel read ahead
> It doesn't look like it's using table_block_parallelscan_nextpage() as > a block allocator so it's not affected by the patch. It has its own > thing zs_parallelscan_nextrange(), which does > pg_atomic_fetch_add_u64(&pzscan->pzs_allocatedtids, > ZS_PARALLEL_CHUNK_SIZE), and that macro is 0x10. My apologies, I was too hasty. Indeed, you are correct. Zedstore's unit of work is chunks of the logical zstid space. There is a correlation between the zstid and blocks: zstids near each other are likely to lie in the same block or in neighboring blocks. It would be interesting to try something like this patch for Zedstore. Regards, Soumyadeep
Re: Parallel Seq Scan vs kernel read ahead
On Sat, May 23, 2020 at 12:00 AM Robert Haas wrote: > I think there's a significant difference. The idea I remember being > discussed at the time was to divide the relation into equal parts at > the very start and give one part to each worker. I think that carries > a lot of risk of some workers finishing much sooner than others. Was the idea of work-stealing considered? Here is what I have been thinking about: Each worker would be assigned a contiguous chunk of blocks at init time. Then if a worker is finished with its work, it can inspect other workers' remaining work and "steal" some of the blocks from the end of the victim worker's allocation. Considerations for such a scheme: 1. Victim selection: Who will be the victim worker? It can be selected at random if nothing else. 2. How many blocks to steal? Stealing half of the victim's remaining blocks seems to be fair. 3. Stealing threshold: We should disallow stealing if the amount of remaining work is not enough in the victim worker. 4. Additional parallel state: Representing the chunk of "work". I guess one variable for the current block and one for the last block in the chunk allocated. The latter would have to be protected with atomic fetches as it would be decremented by the stealing worker. 5. Point 4 implies that there might be more atomic fetch operations as compared to this patch. Idk if that is a lesser evil than the workers being idle..probably not? A way to salvage that a little would be to forego atomic fetches when the amount of work remaining is less than the threshold discussed in 3 as there is no possibility of work stealing then. Regards, Soumyadeep
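A rough C sketch of the scheme described in points 1-5 above, using a per-worker spinlock rather than the atomics discussed in point 5 (a lock-free variant needs extra care to avoid a block being scanned twice); all names here are invented for illustration and this is not a patch.

/*
 * Illustrative sketch of the work-stealing idea: each worker owns a
 * [next, end) block range protected by a spinlock, and an idle worker
 * steals the upper half of a victim's remaining range.  Initialization
 * with SpinLockInit() is omitted.
 */
#include "postgres.h"
#include "storage/block.h"
#include "storage/spin.h"

#define STEAL_THRESHOLD 64		/* don't bother stealing tiny remainders */

typedef struct WorkerRange
{
	slock_t		mutex;
	BlockNumber next;			/* next block this worker will scan */
	BlockNumber end;			/* one past the last block it owns */
} WorkerRange;

/* Claim the next block from our own range; InvalidBlockNumber when empty. */
static BlockNumber
range_next_block(WorkerRange *r)
{
	BlockNumber blk = InvalidBlockNumber;

	SpinLockAcquire(&r->mutex);
	if (r->next < r->end)
		blk = r->next++;
	SpinLockRelease(&r->mutex);
	return blk;
}

/*
 * Steal the upper half of a victim's remaining blocks.  Returns the number
 * of blocks stolen and sets *start to the first stolen block.  The stolen
 * range is then scanned privately by the thief, so no further
 * synchronization is needed for it.
 */
static BlockNumber
range_steal(WorkerRange *victim, BlockNumber *start)
{
	BlockNumber nstolen = 0;

	SpinLockAcquire(&victim->mutex);
	if (victim->end - victim->next >= STEAL_THRESHOLD)
	{
		nstolen = (victim->end - victim->next) / 2;
		victim->end -= nstolen;
		*start = victim->end;	/* stolen range is [end, end + nstolen) */
	}
	SpinLockRelease(&victim->mutex);
	return nstolen;
}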
Re: libpq copy error handling busted
Andres Freund writes: > When libpq is used to COPY data to the server, it doesn't properly > handle errors. > This is partially an old problem, and partially got recently > worse. Before the below commit we detected terminated connections, but > we didn't handle copy failing. Yeah. After poking at that for a little bit, there are at least three problems: * pqSendSome() is responsible not only for pushing out data, but for calling pqReadData in any situation where it can't get rid of the data promptly. 1f39a1c06 overlooked that requirement, and the upshot is that we don't necessarily notice that the connection is broken (it's pqReadData's job to detect that). Putting a pqReadData call into the early-exit path helps, but doesn't fix things completely. * The more longstanding problem is that the PQputCopyData code path doesn't have any mechanism for consuming an 'E' (error) message once pqReadData has collected it. AFAICS that's ancient. (It does not affect the behavior of this case if you use an immediate-mode shutdown, because then the backend never issues an 'E' message; but it does matter in a fast-mode shutdown.) I think that the idea was to let the client dump all its copy data and then report the error message when PQendCopy is called, but as you say, that's none too friendly when there's gigabytes of data involved. I'm not sure we can do anything about this without effectively changing the client API for copy errors, though. * As for control-C not getting out of it: there is if (CancelRequested) break; in pgbench's loop, but this does nothing in this scenario because fe-utils/cancel.c only sets that flag when it successfully sends a Cancel ... which it certainly cannot if the postmaster is gone. I suspect that this may be relatively recent breakage. It doesn't look too well thought out, in any case. The places that are testing this flag look like they'd rather not be bothered with the fine point of whether the cancel request actually went anywhere. (And aside from this issue, I see no mechanism for that flag to become unset once it's set. Current users of cancel.c probably don't care, but we'd have noticed if we tried to make psql use it.) regards, tom lane
Re: Parallel Seq Scan vs kernel read ahead
On Wed, Jun 3, 2020 at 3:18 PM Soumyadeep Chakraborty wrote: > Idk if that is a lesser evil than the workers > being idle..probably not? Apologies, I meant that the extra atomic fetches is probably a lesser evil than the workers being idle. Soumyadeep
Re: what can go in root.crt ?
On 06/03/20 08:07, Ants Aasma wrote: > I think the "why" the org cert is not root was already made clear, that is > the copmany policy. Thank you, yes, that was what I had intended to convey, and you have saved me finishing a weedsier follow-up message hoping to convey it better. > I don't think postgres should take a stance ... > whether the > certificate designated as the root of trust is self-signed or claims to get > its power from somewhere else. On 06/03/20 16:34, Bruce Momjian wrote: > Uh, we sure can. We disallow many configurations that we consider > unsafe. Ok, so a person in the situation described here, who is not in a position to demand changes in an organizational policy (whether or not it seems ill-conceived to you or even to him/her), is facing this question: What are the "safest" things I /can/ do, under the existing constraints, and /which of those will work in PostgreSQL/? For example, we might agree that it is safe to trust nothing but the end-entity cert of my server itself. I made a server, here is its cert, here is a root.crt file for libpq containing only this exact cert, I want libpq to connect only ever to this server with this cert and nothing else. It's a pain because I have to roll out new root.crt files to everybody whenever the cert changes, but it would be hard to call it unsafe. Great! Can I do that? I think the answer is no. I haven't found it documented, but I think libpq will fail such a connection, because the cert it has found in root.crt is not self-signed. Or, vary the scenario just enough that my organization, or even my department in my organization, now has its own CA, as the first intermediate, the issuer of the end-entity cert. It might be entirely reasonable to put that CA cert into root.crt, so libpq would only connect to things whose certs were issued in my department, or at least my org. I trust the person who would be issuing my department's certs (in all likelihood, me). I would more-or-less trust my org to issue certs for my org. Great! Can I do that? I think that answer is also no, for the same reason. My department's or org's CA cert isn't going to be self-signed, it's going to be vouched for by a chain of more certs leading to a globally recognized one. Why? Short answer, our org also has web sites. We like for people's browsers to be able to see those. We have one group that makes server certs and they follow one procedure, and the certs they make come out the same way. That shouldn't be a problem for PostgreSQL, so it's hard to argue they should have to use a different procedure just for my cert. > I don't think postgres should take a stance whether the > certificate designated as the root of trust is self-signed or claims > to get its power from somewhere else. I'm inclined to agree but I would change the wording a bit for clarity. *Any* certificate we are going to trust gets its power from somewhere else. Being self-signed is not an exception to that rule (if it were, every Snake Oil, Ltd. self-signed cert generated by every student in every network security class ever would be a root of trust). For us to trust a cert, it must be vouched for in some way. The most important vouching that happens, as far as libpq is concerned, is vouching by the administrator who controls the file system where root.crt is found and the contents of that file. If libpq is looking at a cert and finds it in that file, I have vouched for it. That's why it's there. If it is self-signed, then I'm the only person vouching for it, and that's ok. 
If it is not self-signed, that just means somebody else also has vouched for it. Maybe for the same use, maybe for some other use. In any event, the fact that somebody else has also vouched for it does not in any way negate that I vouched for it, by putting it there in that file I control. > It's pretty easy to conceive of certificate management procedures that make > use of this chain to implement certificate replacement securely. For > example one might trust the global issuer to verify that a CSR is coming > from the O= value that it's claiming to come from to automate replacement > of intermediate certificates, but not trust that every other sub-CA signed > by root and their sub-sub-CA-s are completely honest and secure. That's an example of the kind of policy design I think ought to be possible, but a first step to getting there would be to just better document what does and doesn't work in libpq now. There seem to be some possible configurations that aren't available, not because of principled arguments for disallowing them, but because they fail unstated assumptions. In an ideal world, I think libpq would be using this algorithm: I'm looking at the server's certificate, s. Is s unexpired and in the trust file? If so, SUCCEED. otherwise, loop: get issuer certificate i from s (if s is self-signed, FAIL). does i have CA:TRUE and Certificate Sign bits? If not, FAIL. does i's Domain Constrai
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-03, Andres Freund wrote: > I don't think we should prohibit this. For one, it'd probably break some > clients, without a meaningful need. There *is* a need, namely to keep complexity down. This is quite convoluted, it's got a lot of historical baggage because of the way it was implemented, and it's very difficult to understand. The greatest motive I see is to make this easier to understand, so that it is easier to modify and improve in the future. > But I think it's also actually quite useful to be able to access > catalogs before streaming data. You e.g. can look up configuration of > the primary before streaming WAL. With a second connection that's > actually harder to do reliably in some cases, because you need to be > sure that you actually reached the right server (consider a pooler, > automatic failover etc). I don't think having a physical replication connection access catalog data directly is a great idea. We already have gadgets like IDENTIFY_SYSTEM for physical replication that can do that, and if you need particular settings you can use SHOW (commit d1ecd539477). If there was a strong need for even more than that, we can add something to the grammar. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-02, Michael Paquier wrote: > I can note as well that StartLogicalReplication() moves in this sense > by setting xlogreader to be the one from logical_decoding_ctx once the > decoding context has been created. > > This results in the attached. The extra test from upthread to check > that logical decoding is not allowed in a non-database WAL sender is a > good idea, so I have kept it. I don't particularly disagree with your proposed patch -- in fact, it seems to make things cleaner. It is a little wasteful, but I don't really mind that. It's just some memory, and it's not a significant amount. That said, I would *also* apply Kyotaro's proposed patch to prohibit a physical standby running with a logical slot, if only because that reduces the number of combinations that we need to test and keep our collective heads straight about. Just reject the weird case and have one type of slot for each type of replication. I didn't even think this was at all possible. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Towards easier AMs: Cleaning up inappropriate use of name "relkind"
On 2020-Jun-03, Mark Dilger wrote: > The name "relkind" normally refers to a field of type 'char' with > values like 'r' for "table" and 'i' for "index". In AlterTableStmt > and CreateTableAsStmt, this naming convention was abused for a field > of type enum ObjectType. I agree that "relkind" here is a misnomer, and I bet that what happened here is that the original patch Gavin developed was using the relkind enum from pg_class and was later changed to the OBJECT_ defines after patch review, but the struct member name remained. I don't object to the proposed renaming. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
Hi, On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote: > On 2020-Jun-03, Andres Freund wrote: > > I don't think we should prohibit this. For one, it'd probably break some > > clients, without a meaningful need. > > There *is* a need, namely to keep complexity down. This is quite > convoluted, it's got a lot of historical baggage because of the way it > was implemented, and it's very difficult to understand. The greatest > motive I see is to make this easier to understand, so that it is easier > to modify and improve in the future. That seems like a possibly convincing argument for not introducing the capability, but doesn't seem strong enough to remove it. Especially not if it was just broken as part of effectively a refactoring, as far as I understand? > > But I think it's also actually quite useful to be able to access > > catalogs before streaming data. You e.g. can look up configuration of > > the primary before streaming WAL. With a second connection that's > > actually harder to do reliably in some cases, because you need to be > > sure that you actually reached the right server (consider a pooler, > > automatic failover etc). > > I don't think having a physical replication connection access catalog > data directly is a great idea. We already have gadgets like > IDENTIFY_SYSTEM for physical replication that can do that, and if you > need particular settings you can use SHOW (commit d1ecd539477). If > there was a strong need for even more than that, we can add something to > the grammar. Those special case things are a bad idea, and we shouldn't introduce more. It's unrealistic that we can ever make that support everything, and since we already have to support the database connected thing, I don't see the point. Greetings, Andres Freund
Re: libpq copy error handling busted
I wrote: > * pqSendSome() is responsible not only for pushing out data, but for > calling pqReadData in any situation where it can't get rid of the data > promptly. 1f39a1c06 overlooked that requirement, and the upshot is > that we don't necessarily notice that the connection is broken (it's > pqReadData's job to detect that). Putting a pqReadData call into > the early-exit path helps, but doesn't fix things completely. Ah, it's better if I put the pqReadData call into *both* the paths where 1f39a1c06 made pqSendSome give up. The attached patch seems to fix the issue for the "pgbench -i" scenario, with either fast- or immediate-mode server stop. I tried it with and without SSL too, just to see. Still, it's not clear to me whether this might worsen any of the situations we discussed in the lead-up to 1f39a1c06 [1]. Thomas, are you in a position to redo any of that testing? > * The more longstanding problem is that the PQputCopyData code path > doesn't have any mechanism for consuming an 'E' (error) message > once pqReadData has collected it. At least with pgbench's approach (die immediately on PQputline failure) this isn't very relevant once we apply the attached. Perhaps we should revisit this behavior anyway, but I'd be afraid to back-patch a change of that nature. > * As for control-C not getting out of it: there is > if (CancelRequested) > break; > in pgbench's loop, but this does nothing in this scenario because > fe-utils/cancel.c only sets that flag when it successfully sends a > Cancel ... which it certainly cannot if the postmaster is gone. I'll send a patch for this later. regards, tom lane [1] https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com diff --git a/src/interfaces/libpq/fe-misc.c b/src/interfaces/libpq/fe-misc.c index 9afa0533a6..9273984727 100644 --- a/src/interfaces/libpq/fe-misc.c +++ b/src/interfaces/libpq/fe-misc.c @@ -823,6 +823,10 @@ definitelyFailed: * Return 0 on success, -1 on failure and 1 when not all data could be sent * because the socket would block and the connection is non-blocking. * + * Note that this is also responsible for consuming data from the socket + * (putting it in conn->inBuffer) in any situation where we can't send + * all the specified data immediately. + * * Upon write failure, conn->write_failed is set and the error message is * saved in conn->write_err_msg, but we clear the output buffer and return * zero anyway; this is because callers should soldier on until it's possible @@ -842,12 +846,20 @@ pqSendSome(PGconn *conn, int len) * on that connection. Even if the kernel would let us, we've probably * lost message boundary sync with the server. conn->write_failed * therefore persists until the connection is reset, and we just discard - * all data presented to be written. + * all data presented to be written. However, as long as we still have a + * valid socket, we should continue to absorb data from the backend, so + * that we can collect any final error messages. 
*/ if (conn->write_failed) { /* conn->write_err_msg should be set up already */ conn->outCount = 0; + /* Absorb input data if any, and detect socket closure */ + if (conn->sock != PGINVALID_SOCKET) + { + if (pqReadData(conn) < 0) +return -1; + } return 0; } @@ -917,6 +929,13 @@ pqSendSome(PGconn *conn, int len) /* Discard queued data; no chance it'll ever be sent */ conn->outCount = 0; + + /* Absorb input data if any, and detect socket closure */ + if (conn->sock != PGINVALID_SOCKET) + { + if (pqReadData(conn) < 0) + return -1; + } return 0; } }
Re: elog(DEBUG2 in SpinLocked section.
On Wed, Jun 03, 2020 at 12:36:34AM -0400, Tom Lane wrote: > Should we think about adding automated detection of this type of > mistake? I don't like the attached as-is because of the #include > footprint expansion, but maybe we can find a better way. I think that this one first boils down to the FRONTEND dependency in those headers. Or in short, spin.h may get loaded by the frontend but we have a backend-only API, no? > It's actually worse in the back branches, because elog() did not have > a good short-circuit path like ereport() does. +1 for back-patch. Thanks, got that fixed down to 9.5. -- Michael signature.asc Description: PGP signature
Re: libpq copy error handling busted
On Thu, Jun 4, 2020 at 1:35 PM Tom Lane wrote: > I wrote: > > * pqSendSome() is responsible not only for pushing out data, but for > > calling pqReadData in any situation where it can't get rid of the data > > promptly. 1f39a1c06 overlooked that requirement, and the upshot is > > that we don't necessarily notice that the connection is broken (it's > > pqReadData's job to detect that). Putting a pqReadData call into > > the early-exit path helps, but doesn't fix things completely. > > Ah, it's better if I put the pqReadData call into *both* the paths > where 1f39a1c06 made pqSendSome give up. The attached patch seems > to fix the issue for the "pgbench -i" scenario, with either fast- > or immediate-mode server stop. I tried it with and without SSL too, > just to see. Still, it's not clear to me whether this might worsen > any of the situations we discussed in the lead-up to 1f39a1c06 [1]. > Thomas, are you in a position to redo any of that testing? Yes, sure. The testing consisted of running on a system with OpenSSL 1.1.1a (older versions didn't have the problem). It originally showed up on eelpout, a very underpowered build farm machine running Linux on ARM64, but then later we worked out we could make it happen on a Mac or any other Linux system if we had bad enough luck or if we added a sleep in a particular spot. We could do it with psql running in a loop using a bad certificate from the testing setup stuff, as shown here: https://www.postgresql.org/message-id/CA%2BhUKGJafyTgpsYBgQGt1EX0O8UnL4VGHSc7J0KZyMH4_jPGBw%40mail.gmail.com I don't have access to eelpout from where I am right now, but I'll try that test now on the Debian 10 amd64 system I have here. OpenSSL has since moved on to 1.1.1d-0+deb10u3, but that should be fine, the triggering change was the move to TLS1.3 so let me see what happens if I do that with your patch applied... > > * The more longstanding problem is that the PQputCopyData code path > > doesn't have any mechanism for consuming an 'E' (error) message > > once pqReadData has collected it. > > At least with pgbench's approach (die immediately on PQputline failure) > this isn't very relevant once we apply the attached. Perhaps we should > revisit this behavior anyway, but I'd be afraid to back-patch a change > of that nature. > > > * As for control-C not getting out of it: there is > > if (CancelRequested) > > break; > > in pgbench's loop, but this does nothing in this scenario because > > fe-utils/cancel.c only sets that flag when it successfully sends a > > Cancel ... which it certainly cannot if the postmaster is gone. > > I'll send a patch for this later. > > regards, tom lane > > [1] > https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com
Re: elog(DEBUG2 in SpinLocked section.
Michael Paquier writes: > On Wed, Jun 03, 2020 at 12:36:34AM -0400, Tom Lane wrote: >> Should we think about adding automated detection of this type of >> mistake? I don't like the attached as-is because of the #include >> footprint expansion, but maybe we can find a better way. > I think that this one first boils down to the FRONTEND dependency in > those headers. Or in short, spin.h may get loaded by the frontend but > we have a backend-only API, no? I think the #include bloat comes from wanting to declare the global state variable as "slock_t *". We could give up on that and write something like this in a central place like c.h: #if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND) extern void *held_spinlock; #define NotHoldingSpinLock() Assert(held_spinlock == NULL) #else #define NotHoldingSpinLock() ((void) 0) #endif Then throwing NotHoldingSpinLock() into relevant places costs nothing new include-wise. regards, tom lane
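Extending that sketch, the bookkeeping could be hung off the existing spinlock macros roughly as follows; this is only an illustration of the idea, not a worked-out patch, and the exact macro shape would need checking against how SpinLockAcquire/SpinLockRelease are defined in spin.h and s_lock.h today.

/*
 * Illustration only: one way the held_spinlock bookkeeping sketched above
 * could be wired into the spinlock macros in assert-enabled backend builds.
 * held_spinlock would need a single definition in some backend .c file.
 */
#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND)
extern void *held_spinlock;

#define SpinLockAcquire(lock) \
	do { \
		Assert(held_spinlock == NULL);	/* also catches nested spinlocks */ \
		S_LOCK(lock); \
		held_spinlock = (void *) (lock); \
	} while (0)

#define SpinLockRelease(lock) \
	do { \
		held_spinlock = NULL; \
		S_UNLOCK(lock); \
	} while (0)

#define NotHoldingSpinLock() Assert(held_spinlock == NULL)
#else
#define NotHoldingSpinLock() ((void) 0)
#endif

/* ereport()/elog()/palloc() could then start with NotHoldingSpinLock(); */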
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On Wed, Jun 03, 2020 at 06:33:11PM -0700, Andres Freund wrote: > On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote: >> There *is* a need, namely to keep complexity down. This is quite >> convoluted, it's got a lot of historical baggage because of the way it >> was implemented, and it's very difficult to understand. The greatest >> motive I see is to make this easier to understand, so that it is easier >> to modify and improve in the future. > > That seems like a possibly convincing argument for not introducing the > capability, but doesn't seem strong enough to remove it. Especially not > if it was just broken as part of effectively a refactoring, as far as I > understand? Are there any objections in fixing the issue first then? As far as I can see there is no objection to this part, like here: https://www.postgresql.org/message-id/20200603214448.GA901@alvherre.pgsql >> I don't think having a physical replication connection access catalog >> data directly is a great idea. We already have gadgets like >> IDENTIFY_SYSTEM for physical replication that can do that, and if you >> need particular settings you can use SHOW (commit d1ecd539477). If >> there was a strong need for even more than that, we can add something to >> the grammar. > > Those special case things are a bad idea, and we shouldn't introduce > more. It's unrealistic that we can ever make that support everything, > and since we already have to support the database connected thing, I > don't see the point. Let's continue discussing this part as well. -- Michael signature.asc Description: PGP signature
Re: SIGSEGV from START_REPLICATION 0/XXXXXXX in XLogSendPhysical () at walsender.c:2762
On 2020-Jun-03, Andres Freund wrote: > On 2020-06-03 18:27:12 -0400, Alvaro Herrera wrote: > > On 2020-Jun-03, Andres Freund wrote: > > > I don't think we should prohibit this. For one, it'd probably break some > > > clients, without a meaningful need. > > > > There *is* a need, namely to keep complexity down. This is quite > > convoluted, it's got a lot of historical baggage because of the way it > > was implemented, and it's very difficult to understand. The greatest > > motive I see is to make this easier to understand, so that it is easier > > to modify and improve in the future. > > That seems like a possibly convincing argument for not introducing the > capability, but doesn't seem strong enough to remove it. This "capability" has never been introduced. The fact that it's there is just an accident. In fact, it's not a capability, since the feature (physical replication) is invoked differently -- namely, using a physical replication connection. JDBC uses a logical replication connection for it only because they never realized that they were supposed to do differently, because we failed to throw the correct error message in the first place. > > I don't think having a physical replication connection access catalog > > data directly is a great idea. We already have gadgets like > > IDENTIFY_SYSTEM for physical replication that can do that, and if you > > need particular settings you can use SHOW (commit d1ecd539477). If > > there was a strong need for even more than that, we can add something to > > the grammar. > > Those special case things are a bad idea, and we shouldn't introduce > more. What special case things? The replication connection has never been supposed to run SQL. That's why we have SHOW in the replication grammar. > It's unrealistic that we can ever make that support everything, > and since we already have to support the database connected thing, I > don't see the point. A logical replication connection is not supposed to be used for physical replication. That's just going to make more bugs appear. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: REINDEX CONCURRENTLY and indisreplident
On Wed, Jun 03, 2020 at 12:40:38PM -0300, Euler Taveira wrote: > On Wed, 3 Jun 2020 at 03:54, Michael Paquier wrote: >> I have bumped into $subject, causing a replica identity index to >> be considered as dropped if running REINDEX CONCURRENTLY on it. This >> means that the old tuple information would get lost in this case, as >> a REPLICA IDENTITY USING INDEX without a dropped index is the same as >> NOTHING. > > LGTM. I tested in both versions (12, master) and it works accordingly. Thanks for the review. I'll try to get that fixed soon. By the way, your previous email was showing up as part of my own email with the indentation that was used so I missed it first. That's the case as well here: https://www.postgresql.org/message-id/cah503wdaejzhp7+wa-hhs6c7nze69owqe5zf_tyfu1epawp...@mail.gmail.com -- Michael signature.asc Description: PGP signature
Re: Parallel copy
On Thu, Jun 4, 2020 at 12:09 AM Andres Freund wrote: > > Hi, > > On 2020-06-03 12:13:14 -0400, Robert Haas wrote: > > On Mon, May 18, 2020 at 12:48 AM Amit Kapila > > wrote: > > > In the above case, even though we are executing a single command from > > > the user perspective, but the currentCommandId will be four after the > > > command. One increment will be for the copy command and the other > > > three increments are for locking tuple in PK table > > > (tab_fk_referenced_chk) for three tuples in FK table > > > (tab_fk_referencing_chk). Now, for parallel workers, it is > > > (theoretically) possible that the three tuples are processed by three > > > different workers which don't get synced as of now. The question was > > > do we see any kind of problem with this and if so can we just sync it > > > up at the end of parallelism. > > > I strongly disagree with the idea of "just sync(ing) it up at the end > > of parallelism". That seems like a completely unprincipled approach to > > the problem. Either the command counter increment is important or it's > > not. If it's not important, maybe we can arrange to skip it in the > > first place. If it is important, then it's probably not OK for each > > backend to be doing it separately. > > That scares me too. These command counter increments definitely aren't > unnecessary in the general case. > Yeah, this is what we want to understand? Can you explain how they are useful here? AFAIU, heap_lock_tuple doesn't use commandid while storing the transaction information of xact while locking the tuple. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] Leading minus for negative time interval in ISO 8601
Mikhail Titov writes: > I'd like to propose a simple patch to allow for negative ISO 8601 > intervals with leading minus, e.g. -PT1H besides PT-1H. It seems that > standard isn't quite clear on negative duration. "Isn't quite clear"? ISTM that if the standard intended to allow that, it'd be pretty clear. I looked through the 8601 spec just now, and I can't see any indication whatever that they intend to allow "-" before P. It's hard to see why they'd bother with that introducer at all if data can appear before it. > However, lots of > software use leading minus and expect/generate intervals in such forms > making those incompatible with current PostgreSQL decoding code. Which "lots of software" are you speaking of, exactly? interval_in has never had such a capability, and I don't recall previous complaints about it. The difference between a useful standard and a useless one is the extent to which people obey the standard rather than adding random extensions to it, so I'm not inclined to add such an extension without a very well-grounded argument for it. regards, tom lane
Re: libpq copy error handling busted
On Thu, Jun 4, 2020 at 1:53 PM Thomas Munro wrote: > On Thu, Jun 4, 2020 at 1:35 PM Tom Lane wrote: > > Ah, it's better if I put the pqReadData call into *both* the paths > > where 1f39a1c06 made pqSendSome give up. The attached patch seems > > to fix the issue for the "pgbench -i" scenario, with either fast- > > or immediate-mode server stop. I tried it with and without SSL too, > > just to see. Still, it's not clear to me whether this might worsen > > any of the situations we discussed in the lead-up to 1f39a1c06 [1]. > > Thomas, are you in a position to redo any of that testing? It seems to be behave correctly in that scenario. Here's what I tested. First, I put this into pgdata/postgresql.conf: ssl=on ssl_ca_file='root+client_ca.crt' ssl_cert_file='server-cn-only.crt' ssl_key_file='server-cn-only.key' ssl_crl_file='root+client.crl' ssl_min_protocol_version='TLSv1.2' ssl_max_protocol_version='TLSv1.1' ssl_min_protocol_version='TLSv1.2' ssl_max_protocol_version='' I copied the named files from src/test/ssl/ssl/ into pgdata, and I ran chmod 600 on the .key file. I put this into pgdata/pg_hba.conf at the top: hostssl all all 127.0.0.1/32 cert clientcert=verify-full I made a copy of src/test/ssl/ssl/client-revoked.key and ran chmod 600 on it. Now on unpatched master I get: $ psql "host=127.0.0.1 port=5432 dbname=postgres user=tmunro sslcert=src/test/ssl/ssl/client-revoked.crt sslkey=client-revoked.key sslmode=require" psql: error: could not connect to server: SSL error: sslv3 alert certificate revoked It's the same if I add in this sleep in fe-connect.c: +sleep(1); /* * Send the startup packet. * If I revert 1f39a1c0641531e0462a4822f2dba904c5d4d699 "Restructure libpq's handling of send failures.", I get the error that eelpout showed intermittently: psql: error: could not connect to server: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. could not send startup packet: Connection reset by peer I go back to master, and apply your patch. I get the expected error: psql: error: could not connect to server: SSL error: sslv3 alert certificate revoked
Re: Parallel copy
Hi, On 2020-06-04 08:10:07 +0530, Amit Kapila wrote: > On Thu, Jun 4, 2020 at 12:09 AM Andres Freund wrote: > > > I strongly disagree with the idea of "just sync(ing) it up at the end > > > of parallelism". That seems like a completely unprincipled approach to > > > the problem. Either the command counter increment is important or it's > > > not. If it's not important, maybe we can arrange to skip it in the > > > first place. If it is important, then it's probably not OK for each > > > backend to be doing it separately. > > > > That scares me too. These command counter increments definitely aren't > > unnecessary in the general case. > > > > Yeah, this is what we want to understand? Can you explain how they > are useful here? AFAIU, heap_lock_tuple doesn't use commandid while > storing the transaction information of xact while locking the tuple. But the HeapTupleSatisfiesUpdate() call does use it? And even if that weren't an issue, I don't see how it's defensible to just randomly break the the commandid coherency for parallel copy. Greetings, Andres Freund
Re: Transactions involving multiple postgres foreign servers, take 2
On Wed, Jun 3, 2020 at 12:02 PM Masahiko Sawada wrote: > > On Wed, 3 Jun 2020 at 14:50, Amit Kapila wrote: > > > > > > If the intention is to keep the first version simple, then why do we > > want to support any mode other than 'required'? I think it will limit > > its usage for the cases where 2PC can be used only when all FDWs > > involved support Prepare API but if that helps to keep the design and > > patch simpler then why not just do that for the first version and then > > extend it later. OTOH, if you think it will be really useful to keep > > other modes, then also we could try to keep those in separate patches > > to facilitate the review and discussion of the core feature. > > βdisabledβ is the fundamental mode. We also need 'disabled' mode, > otherwise existing FDW won't work. > IIUC, if foreign_twophase_commit is 'disabled', we don't use a two-phase protocol to commit distributed transactions, right? So, do we check this at the time of Prepare or Commit whether we need to use a two-phase protocol? I think this should be checked at prepare time. + + This parameter can be changed at any time; the behavior for any one + transaction is determined by the setting in effect when it commits. + This is written w.r.t foreign_twophase_commit. If one changes this between prepare and commit, will it have any impact? > I was concerned that many FDW > plugins don't implement FDW transaction APIs yet when users start > using this feature. But it seems to be a good idea to move 'prefer' > mode to a separate patch while leaving 'required'. I'll do that in the > next version patch. > Okay, thanks. Please, see if you can separate out the documentation for that as well. Few other comments on v21-0003-Documentation-update: 1. + + + Numeric transaction identifier with that this foreign transaction + associates + /with that this/with which this 2. + + The OID of the foreign server on that the foreign transaction is prepared + /on that the/on which the 3. + status + text + + + Status of foreign transaction. Possible values are: + + + + initial : Initial status. + What exactly "Initial status" means? 4. + in_doubt + boolean + + + If true this foreign transaction is in-doubt status and + needs to be resolved by calling pg_resolve_fdwxact + function. + It would be better if you can add an additional sentence to say when and or how can foreign transactions reach in-doubt state. 5. If N local transactions each + across K foreign server this value need to be set This part of the sentence can be improved by saying something like: "If a user expects N local transactions and each of those involves K foreign servers, this value..". -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [PATCH] Leading minus for negative time interval in ISO 8601
On 06/03/20 22:46, Tom Lane wrote: > "Isn't quite clear"? ISTM that if the standard intended to allow that, > it'd be pretty clear. I looked through the 8601 spec just now, and > I can't see any indication whatever that they intend to allow "-" before P. Umm, did you see any indication that they intend to allow "-" /anywhere/ in a time interval (with the exception of between year and month, month and day in the alternate form, as simple delimiters, not as minus? (Maybe you did; I'm looking at a publicly-accessible 2016 draft.) It looks like the whole idea of minusness has to be shoehorned into ISO 8601 by anyone who misses it, and that's been done different ways. I guess that's the "isn't quite clear" part. > Which "lots of software" are you speaking of, exactly? interval_in > has never had such a capability, and I don't recall previous complaints > about it. Java durations allow both the PostgreSQL-style minus on individual components, and a leading minus that negates the whole thing. [1] That explicitly says "The leading plus/minus sign, and negative values for other units are not part of the ISO-8601 standard." XML Schema (and therefore XML Query, which uses XML Schema data types) allows only the leading minus. [2] The XML Schema folks say their concept is "drawn from those of ISO 8601, specifically durations without fixed endpoints." That's why they can get away with just the single leading sign: they don't admit something like P1M-1D which you don't know to call 27, 28, 29, or 30 days until you're given an endpoint to hang it on. I had to deal with that in [3]. Regards, -Chap [1] https://docs.oracle.com/javase/8/docs/api/java/time/Duration.html#parse-java.lang.CharSequence- [2] https://www.w3.org/TR/xmlschema11-2/#nt-durationRep [3] https://github.com/tada/pljava/blob/master/pljava-examples/src/main/java/org/postgresql/pljava/example/saxon/S9.java#L329
Re: libpq copy error handling busted
On Thu, Jun 4, 2020 at 3:36 PM Thomas Munro wrote: > Here's what I tested. In passing, I noticed that this: $ psql ... psql: error: could not connect to server: private key file "src/test/ssl/ssl/client-revoked.key" has group or world access; permissions should be u=rw (0600) or less ... leads to this nonsensical error message on the server: 2020-06-04 16:03:11.547 NZST [7765] LOG: could not accept SSL connection: Success
Re: Incorrect comment in be-secure-openssl.c
On Wed, Jun 03, 2020 at 02:40:54PM +0200, Daniel Gustafsson wrote: > Sounds good, thanks! Okay, done then. -- Michael
Re: [PATCH] Leading minus for negative time interval in ISO 8601
Chapman Flack writes: > On 06/03/20 22:46, Tom Lane wrote: >> "Isn't quite clear"? ISTM that if the standard intended to allow that, >> it'd be pretty clear. I looked through the 8601 spec just now, and >> I can't see any indication whatever that they intend to allow "-" before P. > Umm, did you see any indication that they intend to allow "-" /anywhere/ > in a time interval (with the exception of between year and month, month > and day in the alternate form, as simple delimiters, not as minus? > (Maybe you did; I'm looking at a publicly-accessible 2016 draft.) I don't have an "official" copy either; I was looking at this draft: https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0038_iso_wd_8601-1_2016-02-16.pdf I see this bit: [±] represents a plus sign [+] if in combination with the following element a positive value or zero needs to be represented (in this case, unless explicitly stated otherwise, the plus sign shall not be omitted), or a minus sign [−] if in combination with the following element a negative value needs to be represented. but I agree that there's no clear application of that to intervals, either overall or per-field. > Java durations allow both the PostgreSQL-style minus on individual > components, and a leading minus that negates the whole thing. [1] > That explicitly says "The leading plus/minus sign, and negative values > for other units are not part of the ISO-8601 standard." > XML Schema (and therefore XML Query, which uses XML Schema data types) > allows only the leading minus. [2] Hm. The slippery slope I *don't* want to be drawn down is somebody arguing that we should change interval_out, because that would open a whole Pandora's box of compatibility issues. Maybe we should just take the position that negative intervals aren't standardized, and if you want to transport them using ISO format then you first need to lobby ISO to fix that. regards, tom lane
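For context, the output side Tom is worried about can be observed directly; these results should match the documented iso_8601 output style, where each field carries its own sign:

SET IntervalStyle = 'iso_8601';
SELECT interval '-1 hour';                              -- PT-1H
SELECT interval '-1 year -2 mons +3 days -04:05:06';    -- P-1Y-2M3DT-4H-5M-6S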
Re: [PATCH] Leading minus for negative time interval in ISO 8601
On Wed, Jun 3, 2020 at 9:46 PM, Tom Lane wrote: > ... > ISTM that if the standard intended to allow that, it'd be pretty > clear. I looked through the 8601 spec just now, and I can't see any > indication whatever that they intend to allow "-" before P. To be fair, I do not have access to the 2019 edition, which seems to address negative durations, but from what I can see of the wording at https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0039_iso_wd_8601-2_2016-02-16.pdf , it seems to be written without any notion of negative durations at all, even the PT-1D-style forms supported by PostgreSQL. Also, that PDF mentions the comma as the preferred decimal sign, as in PT1,5D, which PostgreSQL does not accept. I understand, though, that the PDF explicitly states it is not a standard. > It's hard to see why they'd bother with that introducer at all if data > can appear before it. I'm not sure I follow. Do you mean to hard-require that a time/span start with P and nothing before it? If so, can we think of it as syntactic sugar? I.e. unary minus AND a normal, positive duration of your liking that we just negate in-place. >> However, lots of software use leading minus and expect/generate >> intervals in such forms making those incompatible with current >> PostgreSQL decoding code. > > Which "lots of software" are you speaking of, exactly? interval_in > has never had such a capability, and I don't recall previous complaints > about it. I was not talking about PG-centric software in particular. I had some JavaScript libraries, Ruby on Rails, Java, Rust, Go in mind. Here is the related issue for Rust https://github.com/rust-lang/rust/issues/18181 and some Go library https://pkg.go.dev/github.com/rickb777/date/period?tab=doc#Parse (besides the links I gave in the patch) to show examples of accepting a minus prefix. I presume no one complained much previously because offset can be (and often is) stored as a float in, e.g., seconds, and then offset * '@1 second'::interval. That looks a bit verbose and I'd prefer to keep offset as an interval and do no extra casting. Take a look at w3c specs that refer to ISO 8601 as well. I understand, that is not what PG is after, but here is an excerpt: ,[ https://www.w3.org/TR/xmlschema-2/#duration ] | One could also indicate a duration of minus 120 days as: -P120D. | ... | P-1347M is not allowed although -P1347M is allowed ` Note that the second example explicitly contradicts currently allowed PG syntax. I presume if the standard were clear, there would be no such ambiguity. Not that I'm trying to introduce drastic changes; I just want to make PostgreSQL somewhat more friendly in what it can accept directly, without dancing around. -- Mikhail
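To make the input-side asymmetry concrete (behavior as described in this thread; the leading-minus form is what the patch would newly accept):

SELECT interval 'PT-1H';     -- accepted today: per-field minus, displayed as -01:00:00
SELECT interval 'P-1347M';   -- accepted today, though the W3C/XML Schema rules disallow it
SELECT interval '-PT1H';     -- rejected today with an "invalid input syntax" error;
                             -- the proposed patch would treat it as the negation of PT1H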
Re: [PATCH] Leading minus for negative time interval in ISO 8601
> ... >> Umm, did you see any indication that they intend to allow "-" /anywhere/ >> in a time interval (with the exception of between year and month, month >> and day in the alternate form, as simple delimiters, not as minus? >> (Maybe you did; I'm looking at a publicly-accessible 2016 draft.) > > I don't have an "official" copy either; I was looking at this draft: > https://www.loc.gov/standards/datetime/iso-tc154-wg5_n0038_iso_wd_8601-1_2016-02-16.pdf heh, no one has an up-to-date standard :-) Also that is the link I meant to include in my first reply. From what I see at https://www.iso.org/obp/ui/#iso:std:iso:8601:-2:ed-1:v1:en they (ISO) did address negative values for components and also there is "3.1.1.7 negative duration" that would be nice to read somehow. > I see this bit: > > [±] represents a plus sign [+] if in combination with the following > element a positive value or zero needs to be represented (in this > case, unless explicitly stated otherwise, the plus sign shall not be > omitted), or a minus sign [−] if in combination with the following > element a negative value needs to be represented. But [±] is not used anywhere near the duration specification. > Hm. The slippery slope I *don't* want to be drawn down is somebody > arguing that we should change interval_out, because that would open > a whole Pandora's box of compatibility issues. Maybe we should just > take the position that negative intervals aren't standardized, and > if you want to transport them using ISO format then you first need > to lobby ISO to fix that. I explicitly do NOT want to change anything on the way out. First, that is how things are and we do not want to break anything. And, second, in many cases client software can read either format. That is why I thought it would be a trivial change. No output changes. -- Mikhail
Re: Parallel copy
On Thu, Jun 4, 2020 at 9:10 AM Andres Freund wrote: > > Hi, > > On 2020-06-04 08:10:07 +0530, Amit Kapila wrote: > > On Thu, Jun 4, 2020 at 12:09 AM Andres Freund wrote: > > > > I strongly disagree with the idea of "just sync(ing) it up at the end > > > > of parallelism". That seems like a completely unprincipled approach to > > > > the problem. Either the command counter increment is important or it's > > > > not. If it's not important, maybe we can arrange to skip it in the > > > > first place. If it is important, then it's probably not OK for each > > > > backend to be doing it separately. > > > > > > That scares me too. These command counter increments definitely aren't > > > unnecessary in the general case. > > > > > > > Yeah, this is what we want to understand? Can you explain how they > > are useful here? AFAIU, heap_lock_tuple doesn't use commandid while > > storing the transaction information of xact while locking the tuple. > > But the HeapTupleSatisfiesUpdate() call does use it? > It won't use 'cid' for the lockers or multi-lockers case (AFAICS, there is special-case handling for lockers/multi-lockers). I think it is used for updates/deletes. > And even if that weren't an issue, I don't see how it's defensible to > just randomly break the commandid coherency for parallel copy. > At this stage, we are evaluating whether there is any need to increment the command counter for foreign key checks, or whether it is just happening because we are using some common code to execute the "Select ... For Key Share" statement during these checks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
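As background to why command counter increments are not "unnecessary" in general: command IDs drive intra-transaction visibility. A small SQL illustration of that underlying rule (not a reproduction of the parallel-copy case itself):

BEGIN;
CREATE TEMP TABLE t (a int);
INSERT INTO t VALUES (1);
DECLARE c CURSOR FOR SELECT a FROM t;   -- the cursor's snapshot records the current command id
INSERT INTO t VALUES (2);               -- runs under a later command id
FETCH ALL FROM c;                       -- returns only (1); the second row is invisible to the cursor
COMMIT;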
Re: Expand the use of check_canonical_path() for more GUCs
On Wed, Jun 03, 2020 at 02:45:50PM -0400, Tom Lane wrote: > In the abstract, I agree with Peter's point that we shouldn't alter > user-given strings without need. However, I think there's strong > reason for canonicalizing the data directory and config file locations. > We access those both before and after chdir'ing into the datadir, so > we'd better have absolute paths to them --- and at least for the > datadir, it's documented that you can initially give it as a path > relative to wherever you started the postmaster from. If the other > files are only accessed after the chdir happens then we could likely > do without canonicalizing them. But ... do we know which directory > the user (thought he) specified them with reference to? Forced > canonicalization does have the advantage that it's clear to all > onlookers how we are interpreting the paths. Even with the last point... It looks like there is little love for this patch. So it seems to me that this brings the discussion down to two points: shouldn't we document why canonicalization is not done for data_directory, config_file, hba_file and ident_file with some comments in guc.c? Then, why do we apply it to external_pid_file, Log_directory and stats_temp_directory knowing that we chdir to PGDATA in the postmaster before they get used (as far as I can see)? -- Michael
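The practical difference being debated is visible from any session; as I understand the current behavior, the first two GUCs below are canonicalized to absolute paths, while the directory GUCs may stay relative and are resolved against the data directory after the postmaster chdir()s there:

SHOW data_directory;        -- reported as an absolute, canonical path
SHOW config_file;           -- likewise absolute
SHOW log_directory;         -- may show a relative value such as 'log'
SHOW stats_temp_directory;  -- likewise may be relative, e.g. 'pg_stat_tmp'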
Re: Asynchronous Append on postgres_fdw nodes.
Hello, Andrey. At Wed, 3 Jun 2020 15:00:06 +0500, Andrey Lepikhov wrote in > This patch no longer applies cleanly. > In addition, code comments contain spelling errors. Sure. Thaks for noticing of them and sorry for the many typos. Additional item in WaitEventIPC conflicted with this. I found the following typos. connection.c: s/Rerturns/Returns/ postgres-fdw.c: s/Retrive/Retrieve/ s/ForeginScanState/ForeignScanState/ s/manipuration/manipulation/ s/asyncstate/async state/ s/alrady/already/ nodeAppend.c: s/Rery/Retry/ createplan.c: s/chidlren/children/ resowner.c: s/identier/identifier/ X 2 execnodes.h: s/sutff/stuff/ plannodes.h: s/asyncronous/asynchronous/ Removed a useless variable PgFdwScanState.result_ready. Removed duplicate code from remove_async_node() by using move_to_next_waiter(). Done some minor cleanups. regards. -- Kyotaro Horiguchi NTT Open Source Software Center >From db231fa99da5954b52e195f6af800c0f9b991ed4 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi Date: Mon, 22 May 2017 12:42:58 +0900 Subject: [PATCH v3 1/3] Allow wait event set to be registered to resource owner WaitEventSet needs to be released using resource owner for a certain case. This change adds WaitEventSet reowner and allow the creator of a WaitEventSet to specify a resource owner. --- src/backend/libpq/pqcomm.c| 2 +- src/backend/storage/ipc/latch.c | 18 - src/backend/storage/lmgr/condition_variable.c | 2 +- src/backend/utils/resowner/resowner.c | 67 +++ src/include/storage/latch.h | 4 +- src/include/utils/resowner_private.h | 8 +++ 6 files changed, 96 insertions(+), 5 deletions(-) diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 7717bb2719..16aefb03ee 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -218,7 +218,7 @@ pq_init(void) (errmsg("could not set socket to nonblocking mode: %m"))); #endif - FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3); + FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3); AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock, NULL, NULL); AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL); diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index 05df5017c4..a8b52cd381 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -56,6 +56,7 @@ #include "storage/latch.h" #include "storage/pmsignal.h" #include "storage/shmem.h" +#include "utils/resowner_private.h" /* * Select the fd readiness primitive to use. Normally the "most modern" @@ -84,6 +85,8 @@ struct WaitEventSet int nevents; /* number of registered events */ int nevents_space; /* maximum number of events in this set */ + ResourceOwner resowner; /* Resource owner */ + /* * Array, of nevents_space length, storing the definition of events this * set is waiting for. @@ -393,7 +396,7 @@ WaitLatchOrSocket(Latch *latch, int wakeEvents, pgsocket sock, int ret = 0; int rc; WaitEvent event; - WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3); + WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3); if (wakeEvents & WL_TIMEOUT) Assert(timeout >= 0); @@ -560,12 +563,15 @@ ResetLatch(Latch *latch) * WaitEventSetWait(). */ WaitEventSet * -CreateWaitEventSet(MemoryContext context, int nevents) +CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents) { WaitEventSet *set; char *data; Size sz = 0; + if (res) + ResourceOwnerEnlargeWESs(res); + /* * Use MAXALIGN size/alignment to guarantee that later uses of memory are * aligned correctly. E.g. 
epoll_event might need 8 byte alignment on some @@ -680,6 +686,11 @@ CreateWaitEventSet(MemoryContext context, int nevents) StaticAssertStmt(WSA_INVALID_EVENT == NULL, ""); #endif + /* Register this wait event set if requested */ + set->resowner = res; + if (res) + ResourceOwnerRememberWES(set->resowner, set); + return set; } @@ -725,6 +736,9 @@ FreeWaitEventSet(WaitEventSet *set) } #endif + if (set->resowner != NULL) + ResourceOwnerForgetWES(set->resowner, set); + pfree(set); } diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c index 37b6a4eecd..fcc92138fe 100644 --- a/src/backend/storage/lmgr/condition_variable.c +++ b/src/backend/storage/lmgr/condition_variable.c @@ -70,7 +70,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv) { WaitEventSet *new_event_set; - new_event_set = CreateWaitEventSet(TopMemoryContext, 2); + new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2); AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL); AddWaitEventToSet(new_event_set, WL_EXIT_ON_PM_DEATH, PGINVALID_SOCKET, diff --git a/src/backend/utils/resowner/resowner
Re: what can go in root.crt ?
On Wed, 2020-06-03 at 19:57 -0400, Chapman Flack wrote: > Ok, so a person in the situation described here, who is not in a position > to demand changes in an organizational policy (whether or not it seems > ill-conceived to you or even to him/her), is facing this question: > > What are the "safest" things I /can/ do, under the existing constraints, > and /which of those will work in PostgreSQL/? I feel bad about bending the basic idea of certificates and trust to suit some misbegotten bureaucratic constraints on good security. If you are working for a company that has a bad idea of security and cannot be dissuaded from it, you point that out loudly and then keep going. Trying to subvert the principles of an architecture very often leads to pain in my experience. Yours, Laurenz Albe
Re: libpq copy error handling busted
On Thu, Jun 4, 2020 at 5:37 AM Thomas Munro wrote: > On Thu, Jun 4, 2020 at 1:53 PM Thomas Munro > wrote: > > On Thu, Jun 4, 2020 at 1:35 PM Tom Lane wrote: > > > Ah, it's better if I put the pqReadData call into *both* the paths > > > where 1f39a1c06 made pqSendSome give up. The attached patch seems > > > to fix the issue for the "pgbench -i" scenario, with either fast- > > > or immediate-mode server stop. I tried it with and without SSL too, > > > just to see. Still, it's not clear to me whether this might worsen > > > any of the situations we discussed in the lead-up to 1f39a1c06 [1]. > > > Thomas, are you in a position to redo any of that testing? > > It seems to be behave correctly in that scenario. > > Here's what I tested. First, I put this into pgdata/postgresql.conf: > > ssl=on > ssl_ca_file='root+client_ca.crt' > ssl_cert_file='server-cn-only.crt' > ssl_key_file='server-cn-only.key' > ssl_crl_file='root+client.crl' > ssl_min_protocol_version='TLSv1.2' > ssl_max_protocol_version='TLSv1.1' > ssl_min_protocol_version='TLSv1.2' > ssl_max_protocol_version='' > > I copied the named files from src/test/ssl/ssl/ into pgdata, and I ran > chmod 600 on the .key file. > > I put this into pgdata/pg_hba.conf at the top: > > hostssl all all 127.0.0.1/32 cert clientcert=verify-full > > I made a copy of src/test/ssl/ssl/client-revoked.key and ran chmod 600 on > it. > Would it be feasible to capture this in a sort of a regression (TAP?) test? -- Alex
Re: Getting ERROR with FOR UPDATE/SHARE for partitioned table.
On Thu, May 28, 2020 at 11:08 PM Ashutosh Bapat wrote: > On Wed, May 27, 2020 at 6:51 PM Amit Langote wrote: > > So in Rajkumar's example, the cursor is declared as: > > > > CURSOR IS SELECT * FROM tbl WHERE c1< 5 FOR UPDATE; > > > > and the WHERE CURRENT OF query is this: > > > > UPDATE tbl SET c2='aa' WHERE CURRENT OF cur; > > Thanks for the clarification. So it looks like we expand UPDATE on > partitioned table to UPDATE on each partition (inheritance_planner for > DML) and then execute each of those. If CURRENT OF were to save the > table oid or something we could run the UPDATE only on that partition. Are you saying that the planner should take into account the state of the cursor specified in WHERE CURRENT OF to determine which of the tables to scan for the UPDATE? Note that neither partition pruning nor constraint exclusion know that CurrentOfExpr can possibly allow to exclude children of the UPDATE target. > I am possibly shooting in dark, but this puzzles me. And it looks like > we can cause wrong rows to be updated in non-partition inheritance > where the ctids match? I don't think that hazard exists, because the table OID is matched before the TID. Consider this example: drop table if exists p cascade; create table p (a int); create table c (check (a = 2)) inherits (p); insert into p values (1); insert into c values (2); begin; declare c cursor for select * from p; fetch c; update p set a = a where current of c; QUERY PLAN Update on p (cost=0.00..8.02 rows=2 width=10) Update on p Update on c p_1 -> Tid Scan on p (cost=0.00..4.01 rows=1 width=10) TID Cond: CURRENT OF c -> Tid Scan on c p_1 (cost=0.00..4.01 rows=1 width=10) TID Cond: CURRENT OF c (7 rows) Whenever the TID scan evaluates the CURRENT OF qual, it passes the table being scanned to execCurrentOf(). execCurrentOf() then fetches the ExecRowMark or the ScanState for *that* table from the cursor's ("c") PlanState via its portal. Only if it confirms that such a ExecRowMark or a ScanState exists and is valid/active that it returns the TID found therein as the cursor's current TID. -- Amit Langote EnterpriseDB: http://www.enterprisedb.com
Re: Internal key management system
Hello Masahiko-san, This key manager is aimed to manage cryptographic keys used for transparent data encryption. As a result of the discussion, we concluded it's safer to use multiple keys to encrypt database data rather than using one key to encrypt the whole thing, for example, in order to make sure different data is not encrypted with the same key and IV. Therefore, in terms of TDE, the minimum requirement is that PostgreSQL can use multiple keys. Using multiple keys in PG, there are roughly two possible designs: 1. Store all keys for TDE into the external KMS, and PG gets them from it as needed. +1 2. PG manages all keys for TDE inside and protect these keys on disk by the key (i.g. KEK) stored in the external KMS. -1, this is the one where you would need arguing. There are pros and cons to each design. If I take one cons of #1 as an example, the operation between PG and the external KMS could be complex. The operations could be creating, removing and rotate key and so on. ISTM that only create (delete?) are really needed. Rotating is the problem of the KMS itself, thus does not need to be managed by pg under #1. We can implement these operations in an extension to interact with different kinds of external KMS, and perhaps we can use KMIP. I would even put that (KMIP protocol stuff) outside pg core. Even under #2, if some KMS is implemented and managed by pg, I would put the stuff in a separate process which I would probably run with a different uid, so that the KEK is not accessible directly by pg, ever. Once KMS interactions are managed with an outside process, then what this process does becomes an interface, and whether this process actually manages the keys or discuss with some external KMS with some KMIP or whatever is irrelevant to pg. Providing an interface means that anyone could implement their KMS fitting their requirements if they comply with the interface/protocol. Note that I'd be fine with having the current implementation somehow wrapped up as an example KMS. But the development cost could become high because we might need different extensions for each key management solutions/services. Yes and no. What I suggest is, I think, pretty simple, and I think I can implement it in a few line of script, so the cost is not high, and having a separate process looks, to me, like a security win and an extensibility win (i.e. another implementation can be provided). #2 is better at that point; the interaction between PG and KMS is only GET. I think that it could be the same with #1. I think that having a separate process is a reasonable security requirement, and if you do that #1 and #2 are more or less the same. Other databases employes a similar approach are SQL Server and DB2. Too bad for them:-) I'd still disagree with having the master key inside the database process, even if Microsoft, IBM and Oracle think it is a good idea. In terms of the necessity of introducing the key manager into PG core, I think at least TDE needs to be implemented in PG core. And as this key manager is for managing keys for TDE, I think the key manager also needs to be introduced into the core so that TDE functionality doesn't depend on external modules. Hmmm. My point is that only interactions should be in core. The implementation could be in core, but as a separate process. I agree that pg needs to be able to manage the DEK, so it needs to store data keys. I still do not understand why an extension, possibly distributed with pg, would not be ok. 
There may be good arguments for that, but I do not think you provided any yet. Also, I'm not fully at ease with some of the underlying principles behind this proposal. Are we re-inventing/re-implementing Kerberos or whatever? Are we re-implementing a brand new KMS inside pg? Why have our own? As I explained above, this key manager is for managing internal keys used by TDE. It's not an alternative to existing key management solutions/services. Hmmm. This seems to suggest that interacting with something outside should be an option. The requirements of this key manager are generating internal keys, letting other PG components use them, protecting them by KEK when persisting, If you want that, I'd still argue that you should have a separate process. and support KEK rotation. It doesn't have a feature like allowing users to store arbitrary keys into this key manager, like other key management solutions/services have. Hmmm. I agree that the key used to encrypt data must not be placed in the same host. But it's true only when the key is not protected, right? The DEK is needed when encrypting and decrypting, obviously, so it would be there once obtained, it cannot be helped. My concern is about the KEK, which AFAICS in your code is somewhere in memory accessible by the postgres process, which is a no-go for me. The definition of "protected" is fuzzy, it would depend on what the user
Re: Read access for pg_monitor to pg_replication_origin_status view
At Tue, 2 Jun 2020 13:13:18 -0300, Martín Marqués wrote in > > > I have no problem adding it to this ROLE, but we'd have to amend the > > > doc for default-roles to reflect that SELECT for this view is also > > > granted to `pg_read_all_stats`. > > > > I agree in general that pg_monitor shouldn't have privileges granted > > directly to it. If this needs a new default role, that's an option, but > > it seems like it'd make sense to be part of pg_read_all_stats to me, so > > amending the docs looks reasonable from here. > > Good, that's more or less what I had in mind. > > Here goes v2 of the patch, now there are 4 files (I could have > squashed the docs with the code changes, but hey, that'll be easy to > merge if needed :-) ) > > I did some fiddling to Michaels doc proposal, but it says basically the same. > > Not 100% happy with the change to user-manag.sgml, but ok enough to send. > > I also added an entry to the commitfest so we can track this there as well. 0001: Looks good to me. REVOKE is performed on all functions that call replorigin_check_prerequisites. 0002: It forgets to grant pg_read_all_stats EXECUTE on pg_show_replication_origin_status. As a result, pg_read_all_stats gets an error when executing the function, not when selecting from the view. Even if we also granted execution of the function to that specific role, anyone who wants to grant a user access to the view would also need to grant execution of the function. To avoid such an inconvenience, as I mentioned upthread, the view and the function should be granted to public, and the function should just suppress the output of all the rows, or of some columns in each row. That can be done the same way as pg_stat_get_activity(), as Stephen said. 0003: It seems to take after adminpack's documentation, but superusers are not the only relevant role on PostgreSQL. Something like the description in 27.2.2, Viewing Statistics, looks more suitable. > Superusers and members of the built-in role pg_read_all_stats (see > also Section 21.5) can see all the information about all sessions. Section 21.5 already says the following. > pg_monitor > Read/execute various monitoring views and functions. This role is a > member of pg_read_all_settings, pg_read_all_stats and > pg_stat_scan_tables. 0004: Looks fine by me. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
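For reference, the grants implied by the comment on 0002 would look roughly like this; the alternative mentioned above is to grant both to PUBLIC and have the function itself filter what unprivileged callers see, as pg_stat_get_activity() does:

GRANT SELECT ON pg_replication_origin_status TO pg_read_all_stats;
-- SELECT on the view alone is not enough: the view calls this function,
-- so the role also needs EXECUTE on it
GRANT EXECUTE ON FUNCTION pg_show_replication_origin_status() TO pg_read_all_stats;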