RE: run pgindent on a regular basis / scripted manner
On Thu, Feb 9, 2023 6:10 AM Andrew Dunstan wrote:
> Thanks, I have committed this. Still looking at Robert's other request.

Hi,

Commit 068a243b7 allowed directories to be passed to pgindent as non-option arguments, but the help text doesn't mention that. Should we update it? Attached is a small patch that does so.

Regards,
Shi Yu

Attachment: v1-0001-Update-help-text-of-pgindent.patch
Re: Add index scan progress to pg_stat_progress_vacuum
On Wed, Feb 8, 2023 at 11:03 AM Imseih (AWS), Sami wrote:
> Hi,
>
> Thanks for your reply!
>
> I addressed the latest comments in v23.
>
> 1/ cleaned up the asserts as discussed.
> 2/ used pq_putmessage to send the message on index scan completion.

Thank you for updating the patch! Here are some comments on the v23 patch:

+ ParallelVacuumFinish
+ Waiting for parallel vacuum workers to finish index vacuum.

This change is out-of-date.

---

+ indexes_total bigint
+
+ Number of indexes that will be vacuumed or cleaned up. This value will be
+ 0 if the phase is not vacuuming indexes
+ or cleaning up indexes, INDEX_CLEANUP
+ is set to OFF, index vacuum is skipped due to very
+ few dead tuples in the table, or vacuum failsafe is triggered.
+ See for more on vacuum failsafe.

This explanation looks redundant: setting INDEX_CLEANUP to OFF essentially means the vacuum doesn't enter the vacuuming indexes phase. The same is true for the cases of skipping index vacuum and failsafe mode. How about the following?

    Total number of indexes that will be vacuumed or cleaned up. This
    number is reported as of the beginning of the vacuuming indexes
    phase or the cleaning up indexes phase.

---

+ indexes_completed bigint
+
+ Number of indexes vacuumed in the current vacuum cycle when the
+ phase is vacuuming indexes, or the number
+ of indexes cleaned up during the cleaning up indexes
+ phase.

How about simplifying the description to something like:

    Number of indexes processed. This counter only advances when the
    phase is vacuuming indexes or cleaning up indexes.

Also, index_processed sounds accurate to me. What do you think?

---

+ pcxt->parallel_progress_callback = NULL;
+ pcxt->parallel_progress_callback_arg = NULL;

I think these settings are not necessary since the pcxt is palloc0'ed.
---

+void
+parallel_vacuum_update_progress(void *arg)
+{
+    ParallelVacuumState *pvs = (ParallelVacuumState *) arg;
+
+    Assert(!IsParallelWorker());
+    Assert(pvs->pcxt->parallel_progress_callback_arg);
+
+    if (pvs)
+        pgstat_progress_update_param(PROGRESS_VACUUM_INDEX_COMPLETED,
+            pg_atomic_add_fetch_u32(&(pvs->shared->nindexes_completed), 1));
+}

Assert(pvs->pcxt->parallel_progress_callback_arg) looks wrong to me. If 'arg' is NULL, a SEGV happens before the if (pvs) check is reached. I think it's better to update pvs->shared->nindexes_completed by both the leader and worker processes that processed the index.

---

+/* progress callback definition */
+typedef void (*ParallelProgressCallback) (void *parallel_progress_callback_state);
+
 typedef void (*parallel_worker_main_type) (dsm_segment *seg, shm_toc *toc);

I think it's better to make the function type consistent with the existing parallel_worker_main_type. How about parallel_progress_callback_type?

I've attached a patch that incorporates the above comments and has some suggestions for updating comments etc.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment: fix_v23_masahiko.patch
Re: Support logical replication of global object commands
On Thu, Feb 16, 2023 at 12:02 PM Amit Kapila wrote:
>
> On Tue, Aug 30, 2022 at 8:09 AM Zheng Li wrote:
> >
> > > > I think a publication of ALL OBJECTS sounds intuitive. Does it mean
> > > > we'll publish all DDL commands, all commit and abort operations in
> > > > every database if there is such a publication of ALL OBJECTS?
> > >
> > > Actually, I intend something for global objects. But the main thing
> > > that is worrying me about this is that we don't have a clean way to
> > > untie global object replication from database-specific object
> > > replication.
> >
> > I think ultimately we need a clean and efficient way to publish (and
> > subscribe to) any changes in all databases, preferably in one logical
> > replication slot.
>
> Agreed. I was thinking that currently, for logical replication, both the
> walsender and the slot are database-specific. So we need a way to
> distinguish the WAL for global objects and then avoid filtering based
> on the slot's database during decoding. I also thought about whether
> we want to have a WALSender that is not connected to a database for
> the replication of global objects, but I couldn't come up with a reason
> for doing so. Do you have any thoughts on this matter?

Another thing about the patch proposed here is that it LOGs the DDL for global objects without any consideration of whether that is required for logical replication. This is quite unlike what we are planning to do for other DDLs, where it will be logged only when the publication has defined an event trigger for it.

--
With Regards,
Amit Kapila.
Re: Time delayed LR (WAS Re: logical replication restrictions)
At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" wrote in
> Dear Horiguchi-san,
>
> Thank you for responding! Before modifying patches, I want to confirm
> something you said.
>
> > As Amit-K mentioned, we may need to change the name of the option in
> > this version, since the delay mechanism in this version causes a delay
> > in sending from the publisher rather than delaying apply on the
> > subscriber side.
>
> Right, will be changed.
>
> > I'm not sure why the output plugin is involved in the delay mechanism.
> > It appears to me that it would be simpler if the delay occurred in the
> > reorder buffer or logical decoder instead.
>
> I'm planning to change it, but..

Yeah, I don't think we've made up our minds about which way to go yet, so it's a bit too early to work on that.

> > Perhaps what I understand correctly is that we could delay right before
> > only sending commit records in this case. If we delay at the publisher
> > end, all changes will be sent at once if !streaming, and otherwise, all
> > changes in a transaction will be spooled at the subscriber end. In any
> > case, the apply worker won't be holding an active transaction
> > unnecessarily.
>
> What about the parallel case? The latest patch does not reject the
> combination of parallel streaming mode and delay. If the delay is done at
> commit and the subscriber uses a parallel apply worker, it may hold locks
> for a long time.

I didn't look too closely, but my guess is that transactions are conveyed in spool files in parallel mode, with each file storing a complete transaction.

> > Of course we need to add the mechanism to process keep-alive and status
> > report messages.
>
> Could you share a good way to handle keep-alive and status messages if you
> have one? If we moved this to the decoding layer, it would be strange to
> call walsender functions directly.

I'm sorry, but I don't have a concrete idea at the moment. When I read through the last patch, I missed that WalSndDelay is actually a subset of WalSndLoop. Although it can handle keep-alives correctly, I'm not sure we can accept that structure..

> > Those setups work fine when no apply-delay is involved, but they won't
> > work with the patches we're talking about because the subscriber won't
> > respond to the keep-alive packets from the peer. So when I wrote
> > "practically works" in the last mail, this is what I meant.
>
> I'm not sure about that part. I think in the latest patch, the subscriber
> can respond to the keepalive packets from the peer. Also, the publisher can
> respond to the peer. Could you please tell me if you know a case where the
> publisher or subscriber cannot respond to the opposite side? Note that if
> we apply the publisher-side patch, we don't have to apply the
> subscriber-side patch.

Sorry about that again; I missed that part in the last patch, as mentioned earlier..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Time delayed LR (WAS Re: logical replication restrictions)
On Thu, Feb 16, 2023 at 2:25 PM Kyotaro Horiguchi wrote:
>
> At Thu, 16 Feb 2023 06:20:23 +, "Hayato Kuroda (Fujitsu)" wrote in
> > Dear Horiguchi-san,
> >
> > Thank you for responding! Before modifying patches, I want to confirm
> > something you said.
> >
> > > As Amit-K mentioned, we may need to change the name of the option in
> > > this version, since the delay mechanism in this version causes a delay
> > > in sending from publisher than delaying apply on the subscriber side.
> >
> > Right, will be changed.
> >
> > > I'm not sure why output plugin is involved in the delay mechanism. It
> > > appears to me that it would be simpler if the delay occurred in
> > > reorder buffer or logical decoder instead.
> >
> > I'm planning to change, but..
>
> Yeah, I don't think we've made up our minds about which way to go yet,
> so it's a bit too early to work on that.
>
> > > Perhaps what I understand correctly is that we could delay right before
> > > only sending commit records in this case. If we delay at publisher end,
> > > all changes will be sent at once if !streaming, and otherwise, all
> > > changes in a transaction will be spooled at subscriber end. In any
> > > case, apply worker won't be holding an active transaction
> > > unnecessarily.
> >
> > What about parallel case? Latest patch does not reject the combination of
> > parallel streaming mode and delay. If delay is done at commit and
> > subscriber uses an parallel apply worker, it may acquire lock for a long
> > time.
>
> I didn't looked too closely, but my guess is that transactions are
> conveyed in spool files in parallel mode, with each file storing a
> complete transaction.

No, we don't try to collect all the data in files for parallel mode. Having said that, it doesn't matter because we won't know the time of the commit (which is used to compute the delay) before we encounter the commit record in WAL. So, I feel for this approach, we can follow what you said.

> > > Of course we need add the mechanism to process keep-alive and status
> > > report messages.
> >
> > Could you share the good way to handle keep-alive and status messages if
> > you have? If we changed to the decoding layer, it is strange to call
> > walsender function directly.
>
> I'm sorry, but I don't have a concrete idea at the moment. When I read
> through the last patch, I missed that WalSndDelay is actually a subset
> of WalSndLoop. Although it can handle keep-alives correctly, I'm not
> sure we can accept that structure..

I think we can use update_progress_txn() for this purpose, but note that we are discussing changing the same in thread [1].

[1] - https://www.postgresql.org/message-id/20230210210423.r26ndnfmuifie4f6%40awork3.anarazel.de

--
With Regards,
Amit Kapila.
Re: Allow logical replication to copy tables in binary format
Hi,

Please see the attached patch for the following changes.

Bharath Rupireddy wrote on Mon, 30 Jan 2023 at 15:34:
> On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu wrote:
>
> It'd be better and clearer to have a separate TAP test file IMO since
> the focus of the feature here isn't to just test for data types. With
> separate tests, you can verify "ERROR: incorrect binary data format
> in logical replication column 1" cases.

Moved some tests from 002_types.pl to 014_binary.pl since this is where most binary features are tested. It now covers "incorrect binary data format" cases too. Also added some regression tests for the copy_format parameter.

> With the above said, do you think checking for publisher versions is
> needed? The user can decide to enable/disable binary COPY based on the
> publisher's version, no?
>
> +    /* If the publisher is v16 or later, specify the format to copy data. */
> +    if (walrcv_server_version(LogRepWorkerWalRcvConn) >= 16)
> +    {

If the user decides to enable it, then it might be nice not to allow it for previous versions. But I'm not sure. I'm okay with removing it if you all agree.

> Few more comments on v7:
> 1.
> +    Specifies the format in which pre-existing data on the publisher will
> +    be copied to the subscriber. Supported formats are
> +    text and binary. The default is text.
>
> It'll be good to call out in the documentation the cases where
> copy_format can be enabled and where it needs to be disabled.

Modified that description a bit. Can you check if it's okay now?

> 2.
> +    errmsg("%s value should be either \"text\" or \"binary\"",
>
> How about "value must be either "?

Done.

> 3.
> Why should the subscription's binary option and copy_format option be
> tied at all? Tying these two options hurts usability. Is there a
> fundamental reason? I think they both are/must be independent. One
> deals with data types and another deals with how initial table data is
> copied.

My initial purpose with this patch was just to make subscriptions with the binary option enabled fully binary, from initial copy to apply. Things changed a bit when we decided to move binary copy behind a parameter. I didn't actually think there would be any use case where a user wants the initial copy to be in binary format for a sub with binary = false. Do you think it would be useful to copy in binary even for a sub with binary disabled?

Thanks,
--
Melih Mutlu
Microsoft

Attachment: v8-0001-Allow-logical-replication-to-copy-table-in-binary.patch
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada wrote:
>
> On Tue, Feb 14, 2023 at 8:24 PM John Naylor wrote:
> > > I can think that something like traversing a HOT chain could visit
> > > offsets out of order. But fortunately we prune such collected TIDs
> > > before heap vacuum in heap case.
> >
> > Further, currently we *already* assume we populate the tid array in
> > order (for binary search), so we can just continue assuming that (with
> > an assert added since it's more public in this form). I'm not sure why
> > such basic common sense evaded me a few versions ago...
>
> Right. TidStore is implemented not only for heap, so loading
> out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.

> We have the following WIP comment in test_radixtree:
>
> // WIP: compiles with warnings because rt_attach is defined but not used
> // #define RT_SHMEM
>
> How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> and friends?

Sounds good to me, and the other fixes make sense as well.

> FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> seems to work fine.

That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)

> You removed the vacuum integration patch from v27, is there any reason
> for that?

Just an oversight.

Now for some general comments on the tid store...

+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?

It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.

The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".

I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?

maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/

It might be worth using typedefs for the key and value types. Actually, since the key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?

The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.

Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.

I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.

Some comments on vacuum:

I think we'd better get some real-world testing of this, fairly soon. I had an idea: if it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During the index scan, check both the new store and the array, and log a warning (we don't want to exit or crash; better to try to investigate while live if possible) if the results don't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.

Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about. We also want to verify that progress reporting works as designed and has no weird corner cases.

 * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.

This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.

- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (lo
Re: Allow logical replication to copy tables in binary format
Hi,

Thanks for reviewing. Please see v8 here [1].

Takamichi Osumi (Fujitsu) wrote on Wed, 1 Feb 2023 at 06:05:
> (1) general comment
>
> I wondered if the addition of the new option/parameter can introduce some
> confusion to the users.
>
> case 1. When binary = true and copy_format = text, the table sync is
> conducted by text.
> case 2. When binary = false and copy_format = binary, the table sync is
> conducted by binary.
> (Note that case 2 can be accepted after addressing the 3rd comment of
> Bharath-san in [1]. I agree with the 3rd comment by itself.)

I replied to Bharath's comment [1]; can you please check to see if that makes sense?

> The name of the new subscription parameter looks confusing.
> How about "init_sync_format" or something?

The option to enable initial sync is named "copy_data", so I named the new parameter "copy_format" to refer to the copy meant by "copy_data". Maybe "copy_data_format" would be better. I can change it if you think it's better.

> (2) The commit message doesn't get updated.

Done.

> (3) whitespace errors.

Done.

> (4) pg_dump.c

Done.

> (5) describe.c

Done.

> (6)
> + * Extract the copy format value from a DefElem.
> + */
> +char
> +defGetCopyFormat(DefElem *def)
>
> Shouldn't this function be static, removing the change to
> subscriptioncmds.h?

I wanted to make "defGetCopyFormat" consistent with "defGetStreamingMode" since they're basically doing the same work for different parameters. That function isn't static, so I didn't make "defGetCopyFormat" static either.

> (7) catalogs.sgml

Done.

> (8) create_subscription.sgml

Done.

> Also; the latest patch hasn't been updated for more than two weeks
> after some review comments. If you don't have time,
> I would like to help update the patch for you and other reviewers.
>
> Kindly let me know your status.

Sorry for the delay. This patch is currently one of my priorities. Hopefully I will share quicker updates from now on.
[1] https://www.postgresql.org/message-id/CAGPVpCQYi9AYQSS%3DRmGgVNjz5ZEnLB8mACwd9aioVhLmbgiAMA%40mail.gmail.com Thanks, -- Melih Mutlu Microsoft
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
FYI the last patch does not apply cleanly anymore. So a rebase is needed.
Re: Move defaults toward ICU in 16?
On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis wrote:
> It feels very wrong to me to explain that sort order is defined by the
> operating system on which Postgres happens to run. Saying that it's
> defined by ICU, which is part of the Unicode consortium, is much better.
> It doesn't eliminate versioning issues, of course, but I think it's a
> better explanation for users.

The fact that we can't use ICU on Windows, though, weakens this argument a lot. In my experience, we have a lot of Windows users, and they're not any happier with the operating system collations than Linux users. Possibly less so.

I feel like this is a very difficult kind of change to judge. If everyone else feels this is a win, we should go with it, and hopefully we'll end up better off. I do feel like there are things that could go wrong, though, between the imperfect documentation, the fact that a substantial chunk of our users won't be able to use it because they run Windows, and everybody having to adjust to the behavior change.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Weird failure with latches in curculio on v15
On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart wrote:
> I've been thinking about this, actually. I'm wondering if we could provide
> a list of files to the archiving callback (configurable via a variable in
> ArchiveModuleState), and then have the callback return a list of files that
> are archived. (Or maybe we just put the list of files that need archiving
> in ArchiveModuleState.) The returned list could include files that were
> sent to the callback previously. The archive module would be responsible
> for creating background worker(s) (if desired), dispatching files
> to-be-archived to its background worker(s), and gathering the list of
> archived files to return.

Hmm. So in this design, the archiver doesn't really do the archiving any more, because the interface makes that impossible. It has to use a separate background worker process for that, full stop. I don't think that's a good design. It's fine if some people want to implement it that way, but it shouldn't be forced by the interface.

--
Robert Haas
EDB: http://www.enterprisedb.com
Missing default value of createrole_self_grant in document
Hi hackers,

I noticed that the documentation of the GUC createrole_self_grant doesn't mention its default value. The attached patch adds that.

Regards,
Shi Yu

Attachment: v1-0001-Add-default-value-of-createrole_self_grant-in-doc.patch
Re: Weird failure with latches in curculio on v15
On Fri, Feb 10, 2023 at 12:59 AM Andres Freund wrote:
> I'm somewhat concerned about that too, but perhaps from a different
> angle. First, I think we don't do our users a service by defaulting the
> in-core implementation to something that doesn't scale to even a moderately
> busy server.

+1.

> Second, I doubt we'll get the API for any of this right without an
> actual user that does something more complicated than restoring one-by-one
> in a blocking manner.

Fair.

> I don't think it's that hard to imagine problems. To be reasonably fast, a
> decent restore implementation will have to 'restore ahead'. Which also
> provides ample things to go wrong. E.g.
>
> - WAL source is switched, restore module needs to react to that, but doesn't;
>   we end up with lots of wasted work, or worse, filename conflicts
> - recovery follows a timeline, restore module doesn't catch on quickly enough
> - end of recovery happens, restore just continues on

I don't see how you can prevent those things from happening. If the restore process is working in some way that requires an event loop, and I think that will be typical for any kind of remote archiving, then either it has control most of the time, so the event loop can be run inside the restore process, or, as Nathan proposes, we don't let the archiver have control and it needs to run that restore process in a separate background worker. The hazards that you mention here exist either way. If the event loop is running inside the restore process, it can decide not to call the functions that we provide in a timely fashion and thus fail to react as it should. If the event loop runs inside a separate background worker, then that process can fail to be responsive in precisely the same way. Fundamentally, if the author of a restore module writes code to have multiple I/Os in flight at the same time and does not write code to cancel those I/Os if something changes, then such cancellation will not occur. That remains true no matter which process is performing the I/O.

> > I don't quite see how you can make asynchronous and parallel archiving
> > work if the archiver process only calls into the archive module at
> > times that it chooses. That would mean that the module has to return
> > control to the archiver when it's in the middle of archiving one or
> > more files -- and then I don't see how it can get control back at the
> > appropriate time. Do you have a thought about that?
>
> I don't think the archiver is the hard part; that already has a dedicated
> process, and it also has something of a queuing system already. The startup
> process imo is the complicated one...
>
> If we had a 'restorer' process, and startup fed some sort of a queue with
> things to restore in the near future, it might be more realistic to do
> something you describe?

Some kind of queueing system might be a useful part of the interface, and a dedicated restorer process does sound like a good idea. But the archiver doesn't have this solved, precisely because you have to archive a single file, return control, and wait to be invoked again for the next file. That does not scale.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: wrong query result due to wang plan
On Thu, Feb 16, 2023 at 3:16 PM tender wang wrote:
> select ref_1.r_comment as c0, subq_0.c1 as c1 from public.region as
> sample_0 right join public.partsupp as sample_1 right join public.lineitem
> as sample_2 on (cast(null as path) = cast(null as path)) on (cast(null as
> "timestamp") < cast(null as "timestamp")) inner join public.lineitem as
> ref_0 on (true) left join (select sample_3.ps_availqty as c1,
> sample_3.ps_comment as c2 from public.partsupp as sample_3 where false
> order by c1, c2 ) as subq_0 on (sample_1.ps_supplycost = subq_0.c1 ) right
> join public.region as ref_1 on (sample_1.ps_availqty = ref_1.r_regionkey )
> where ref_1.r_comment is not NULL order by c0, c1;

The repro can be reduced to the query below.

create table t (a int, b int);

# explain (costs off) select * from t t1 left join
    (t t2 inner join t t3 on false left join t t4 on t2.b = t4.b)
    on t1.a = t2.a;
        QUERY PLAN
--------------------------
 Result
   One-Time Filter: false
(2 rows)

As we can see, the joinrel at the final level is marked as dummy, which is wrong. I traced this issue down to distribute_qual_to_rels() when we handle a variable-free clause. If such a clause is not an outer-join clause, and it contains no volatile functions either, we assign it the full relid set of the current JoinDomain. I doubt this is always correct. In the query above, the clause 'false' is assigned relids {t2, t3, t4, t2/t4}, and that makes it a pushed-down restriction to the second left join. This is all right if we plan this query in the user-given order. But if we've commuted the two left joins, which is legal, this pushed-down constant-false restriction would make the final joinrel dummy.

It seems we still need to check whether a variable-free qual comes from somewhere that is below the nullable side of an outer join before we decide that it can be evaluated at join domain level, just like we did before. So I wonder if we can add a 'below_outer_join' flag in JoinTreeItem, fill its value during deconstruct_recurse, and check it in distribute_qual_to_rels() like

       /* eval at join domain level if not below outer join */
-      relids = bms_copy(jtitem->jdomain->jd_relids);
+      relids = jtitem->below_outer_join ?
+          bms_copy(qualscope) : bms_copy(jtitem->jdomain->jd_relids);

Thanks
Richard
Re: Normalization of utility queries in pg_stat_statements
Hi,

On 2/16/23 1:34 AM, Michael Paquier wrote:
> While wondering about this stuff over the last few days and discussing with
> Bertrand, I have changed my mind on the point that there is no need to be
> that aggressive yet with the normalization of the A_Const nodes, because
> the query string normalization of pg_stat_statements is not prepared yet to
> handle cases where an A_Const value uses a non-quoted value with
> whitespace. The two cases where I saw an impact are the commands that can
> define an isolation level: SET TRANSACTION and BEGIN. For example, applying
> normalization to A_Const nodes does the following as of HEAD:
>
> 1) BEGIN TRANSACTION READ ONLY, READ WRITE, DEFERRABLE, NOT DEFERRABLE;
>    BEGIN TRANSACTION $1 ONLY, $2 WRITE, $3, $4 DEFERRABLE
> 2) SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
>    SET TRANSACTION ISOLATION LEVEL $1 COMMITTED
>
> On top of that, specifying a different isolation level may cause these
> commands to be grouped, which is not really cool. All that could be done
> incrementally later on, in 17~ or later depending on the adjustments that
> make sense.

Thanks for those patches! Yeah, agree about the proposed approach.

> 0002 has been completed with a couple of commands to track all the commands
> with A_Const, so as we never lose sight of what happens. 0004 is what I
> think could be done for PG16, where normalization affects only Const. At
> the end of the day, this reflects the following commands that use Const
> nodes because they use queries directly, so the same rules as SELECT and
> DMLs apply to them:
> - DECLARE
> - EXPLAIN
> - CREATE MATERIALIZED VIEW
> - CTAS, SELECT INTO

0001:

I like the idea of splitting the existing tests into dedicated files. What do you think about removing

    SET pg_stat_statements.track_utility = FALSE;
    SET pg_stat_statements.track_planning = TRUE;

in the new pg_stat_statements.sql? That way pg_stat_statements.sql would always behave with the default values for those (currently we are setting both of them to non-default values). Then, with the default values in place, if we feel that some tests are missing, we could add them in utility.sql or planning.sql accordingly.

0002:

Produces:

    v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:834: trailing whitespace.
    CREATE VIEW view_stats_1 AS
    v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:838: trailing whitespace.
    CREATE VIEW view_stats_1 AS
    warning: 2 lines add whitespace errors.

+-- SET TRANSACTION ISOLATION
+BEGIN;
+SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
+SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
+SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

What about adding things like "SET SESSION CHARACTERISTICS AS TRANSACTION ..." too?

0003 and 0004:

Thanks for keeping 0003; it's useful to see the impact of A_Const normalization. Looking at the diff they produce, I also think that 0004 is what could be done for PG16.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: wrong query result due to wang plan
On Thu, Feb 16, 2023 at 5:50 PM Richard Guo wrote:
> It seems we still need to check whether a variable-free qual comes from
> somewhere that is below the nullable side of an outer join before we
> decide that it can be evaluated at join domain level, just like we did
> before. So I wonder if we can add a 'below_outer_join' flag in
> JoinTreeItem, fill its value during deconstruct_recurse, and check it in
> distribute_qual_to_rels() like
>
>       /* eval at join domain level if not below outer join */
> -     relids = bms_copy(jtitem->jdomain->jd_relids);
> +     relids = jtitem->below_outer_join ?
> +         bms_copy(qualscope) : bms_copy(jtitem->jdomain->jd_relids);

To be concrete, I mean something like attached.

Thanks
Richard

v1-0001-Fix-variable-free-clause-distribution.patch
Description: Binary data
Re: Move defaults toward ICU in 16?
On Thu, 2023-02-16 at 15:05 +0530, Robert Haas wrote:
> On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis wrote:
> > It feels very wrong to me to explain that sort order is defined by the
> > operating system on which Postgres happens to run. Saying that it's
> > defined by ICU, which is part of the Unicode consortium, is much better.
> > It doesn't eliminate versioning issues, of course, but I think it's a
> > better explanation for users.
>
> The fact that we can't use ICU on Windows, though, weakens this
> argument a lot. In my experience, we have a lot of Windows users, and
> they're not any happier with the operating system collations than
> Linux users. Possibly less so.
>
> I feel like this is a very difficult kind of change to judge. If
> everyone else feels this is a win, we should go with it, and hopefully
> we'll end up better off. I do feel like there are things that could go
> wrong, though, between the imperfect documentation, the fact that a
> substantial chunk of our users won't be able to use it because they
> run Windows, and everybody having to adjust to the behavior change.

Unless I misunderstand, the lack of Windows support is not a matter of principle and can be added later on, right?

I am in favor of changing the default.

It might be good to add a section to the documentation in "Server setup and operation" recommending that if you go with the default choice of ICU, you should configure your package manager not to upgrade the ICU library.

Yours,
Laurenz Albe
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
On Wed, Feb 15, 2023 at 3:35 AM Nathan Bossart wrote:
>
> On Sat, Jan 21, 2023 at 06:42:08AM +0100, Drouvot, Bertrand wrote:
> > On 1/20/23 9:01 PM, Nathan Bossart wrote:
> >> Should we also change the related
> >> variables (e.g., ndeletable in _hash_vacuum_one_page()) to uint16?
> >
> > Yeah, I thought about it too. What I saw is that there is other places that
> > would be good candidates for the same kind of changes (see the int
> > ntodelete argument in gistXLogDelete() being assigned to
> > gistxlogDelete.ntodelete (uint16) for example).
> >
> > So, what do you think about:
> >
> > 1) keep this patch as it is (to "only" address the struct field and avoid
> > possible future "useless" padding size increase)
> > and
> > 2) create a new patch (once this one is committed) to align the types for
> > variables/arguments with the structs (related to XLOG records) fields when
> > they are not?
>
> Okay. I've marked this one as ready-for-committer, then.

LGTM. I think the padding space we are trying to save here can be used for the patch [1], right?

BTW, feel free to create the second patch (to align the types for variables/arguments) as that would be really helpful after we commit this one. I think this would require XLOG_PAGE_MAGIC as it changes the WAL record.

BTW, how about a commit message like:

    Change xl_hash_vacuum_one_page.ntuples from int to uint16.

    This will create two bytes of padding space in xl_hash_vacuum_one_page
    which can be used for future patches. This makes the datatype of
    xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which
    is advisable as both are used for the same purpose.

[1] - https://www.postgresql.org/message-id/2d62f212-fce6-d639-b9eb-2a5bc4bec3b4%40gmail.com

--
With Regards,
Amit Kapila.
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Thu, Feb 16, 2023 at 10:03 AM David Rowley wrote:
> I suspect it's slower because the final sort must sort the entire
> array still without knowledge that portions of it are pre-sorted. It
> would be very interesting to improve this and do some additional work
> and keep track of the "memtupsortedto" index by pushing them onto a
> List each time we cross the availMem boundary, then qsort just
> the final portion of the array in tuplesort_performsort() before doing
> a k-way merge on each segment rather than qsorting the entire thing
> again. I suspect this would be faster when work_mem exceeds L3 by some
> large amount.

Sounds like a reasonable thing to try. It seems like in-memory merge could still use abbreviation, unlike external merge.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PATCH] Align GSS and TLS error handling in PQconnectPoll()
Patch looks good to me. Definitely an improvement over the status quo.

Looking at the TLS error handling though I see these two lines:

    && conn->allow_ssl_try    /* redundant? */
    && !conn->wait_ssl_try)   /* redundant? */

Are they actually redundant like the comment suggests? If so, we should probably remove them (in another patch). If not (or if we don't know), should we have these same checks for GSS?
Re: Missing default value of createrole_self_grant in document
> On 16 Feb 2023, at 10:47, shiy.f...@fujitsu.com wrote: > I noticed that the document of GUC createrole_self_grant doesn't mention its > default value. The attached patch adds that. Agreed, showing the default value in the documentation is a good pattern IMO. Unless objected to I'll go apply this in a bit. -- Daniel Gustafsson
Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication
Hi Shveta and Shi,

Thanks for your investigations.

shveta malik wrote on Wed, Feb 8, 2023 at 16:49:
> On Tue, Feb 7, 2023 at 8:18 AM shiy.f...@fujitsu.com wrote:
> > I reproduced the problem I reported before with latest patch (v7-0001,
> > v10-0002), and looked into this problem. It is caused by a similar reason.
> > Here is some analysis for the problem I reported [1].#6.
> >
> > First, a tablesync worker (worker-1) started for "tbl1", its originname is
> > "pg_16398_1". And it exited because of unique constraint. In
> > LogicalRepSyncTableStart(), originname in pg_subscription_rel is updated
> > when updating table state to DATASYNC, and the origin is created when
> > updating table state to FINISHEDCOPY. So when it exited with state
> > DATASYNC, the origin is not created but the originname has been updated
> > in pg_subscription_rel.
> >
> > Then a tablesync worker (worker-2) started for "tbl2", its originname is
> > "pg_16398_2". After tablesync of "tbl2" finished, this worker moved to
> > sync table "tbl1". In LogicalRepSyncTableStart(), it got the originname of
> > "tbl1" - "pg_16398_1", by calling ReplicationOriginNameForLogicalRep(),
> > and tried to drop the origin (although it is not actually created before).
> > After that, it called replorigin_by_name to get the originid whose name is
> > "pg_16398_1" and the result is InvalidOid. Origin won't be created in this
> > case because the sync worker has created a replication slot (when it
> > synced tbl2), so the originid was still invalid and it caused an assertion
> > failure when calling replorigin_advance().
> >
> > It seems we don't need to drop previous origin in worker-2 because the
> > previous origin was not created in worker-1. I think one way to fix it is
> > to not update originname of pg_subscription_rel when setting state to
> > DATASYNC, and only do that when setting state to FINISHEDCOPY.
> > If so, the originname in pg_subscription_rel will be set at the same time
> > the origin is created.
>
> +1. Update of system-catalog needs to be done carefully and only when
> origin is created.

I see that setting originname in the catalog before actually creating it causes issues. My concern with setting originname when setting the state to FINISHEDCOPY is that if the worker waits until FINISHEDCOPY to update the slot/origin name but fails somewhere after creating the slot/origin and before reaching FINISHEDCOPY, then the newly created slot/origin will be left behind. It wouldn't be possible to find and drop them since their names are not stored in the catalog. Eventually, this might also cause hitting the max_replication_slots limit in case of such failures between origin creation and FINISHEDCOPY.

One fix I can think of is to update the catalog right after creating a new origin. But this would also require committing the current transaction to actually persist the originname. I guess this action of committing the transaction in the middle of initial sync could hurt the copy process. What do you think?

Also; working on an updated patch to address your other comments. Thanks again.

Best,
--
Melih Mutlu
Microsoft
Re: Considering additional sort specialisation functions for PG16
I wrote: > Have two memtuple arrays, one for first sortkey null and one for first sortkey non-null: Hacking on this has gotten only as far as the "compiles but segfaults" stage, but I wanted to note an idea that occurred to me: Internal qsort doesn't need the srctape member, and removing both that and isnull1 would allow 16-byte "init-tuples" for qsort, which would save a bit of work_mem space, binary space for qsort specializations, and work done during swaps. During heap sort, we already copy one entry into a stack variable to keep from clobbering it, so it's not a big step to read a member from the init array and form a regular sorttuple from it. -- John Naylor EDB: http://www.enterprisedb.com
Re: proposal: psql: psql variable BACKEND_PID
On Thu, 16 Feb 2023 at 12:44, Pavel Stehule wrote: > To find and use pg_backend_pid is not rocket science. But use :BACKEND_PID is > simpler. I wanted to call out that if there's a connection pooler (e.g. PgBouncer) in the middle, then BACKEND_PID (and %p) are incorrect, but pg_backend_pid() would work for the query. This might be an edge case, but if BACKEND_PID is added it might be worth listing this edge case in the docs somewhere.
Re: Some revises in adding sorting path
Hi Richard,

On Tue, Jan 10, 2023 at 8:06 PM Richard Guo wrote:
> In add_paths_with_pathkeys_for_rel, we do not try incremental sort atop
> of the epq_path, which I think we can do. I'm not sure how useful this
> is in real world since the epq_path is used only for EPQ checks, but it
> seems doing that doesn't cost too much.

I'm not sure this is a good idea, because the epq_path will return at most one tuple in an EPQ recheck. The reason an extra Sort node is injected into the epq_path is only to label it with the correct sort order so that it can be used as an input for the EPQ merge-join path of a higher-level foreign join, so shouldn't we keep this step as simple as possible and save cycles, even if only a little?

Sorry for being late to the party.

Best regards,
Etsuro Fujita
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi,

On 2/16/23 12:00 PM, Amit Kapila wrote:
> On Wed, Feb 15, 2023 at 3:35 AM Nathan Bossart wrote:
> > On Sat, Jan 21, 2023 at 06:42:08AM +0100, Drouvot, Bertrand wrote:
> > > On 1/20/23 9:01 PM, Nathan Bossart wrote:
> > > > Should we also change the related variables (e.g., ndeletable in
> > > > _hash_vacuum_one_page()) to uint16?
> > >
> > > Yeah, I thought about it too. What I saw is that there is other places
> > > that would be good candidates for the same kind of changes (see the int
> > > ntodelete argument in gistXLogDelete() being assigned to
> > > gistxlogDelete.ntodelete (uint16) for example).
> > >
> > > So, what do you think about:
> > > 1) keep this patch as it is (to "only" address the struct field and avoid
> > > possible future "useless" padding size increase)
> > > and
> > > 2) create a new patch (once this one is committed) to align the types for
> > > variables/arguments with the structs (related to XLOG records) fields
> > > when they are not?
> >
> > Okay. I've marked this one as ready-for-committer, then.
>
> LGTM.

Thanks for looking at it!

> I think the padding space we are trying to save here can be used for the
> patch [1], right?

Yes exactly, without the current patch and adding isCatalogRel (from the patch [1] you mentioned) we would get:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset    |  size */  type = struct xl_hash_vacuum_one_page {
/*      0    |     4 */    TransactionId snapshotConflictHorizon;
/*      4    |     4 */    int ntuples;
/*      8    |     1 */    _Bool isCatalogRel;
                           /* XXX 3-byte padding */

                           /* total size (bytes):   12 */
                         }

While with the proposed patch:

(gdb) ptype /o struct xl_hash_vacuum_one_page
/* offset    |  size */  type = struct xl_hash_vacuum_one_page {
/*      0    |     4 */    TransactionId snapshotConflictHorizon;
/*      4    |     2 */    uint16 ntuples;
/*      6    |     1 */    _Bool isCatalogRel;
                           /* XXX 1-byte padding */

                           /* total size (bytes):    8 */
                         }

> BTW, feel free to create the second patch (to align the types for
> variables/arguments) as that would be really helpful after we commit
> this one.

Yes, will do.

> I think this would require XLOG_PAGE_MAGIC as it changes the WAL record.
Oh, I was not aware of that, thanks! Will do in V2 (and in the logical decoding on standby patch, as it adds the new field mentioned above).

> BTW, how about a commit message like:
>
> Change xl_hash_vacuum_one_page.ntuples from int to uint16.
>
> This will create two bytes of padding space in xl_hash_vacuum_one_page
> which can be used for future patches. This makes the datatype of
> xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which
> is advisable as both are used for the same purpose.

LGTM, will add it to V2!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Re: pg_walinspect memory leaks
On Tue, Feb 14, 2023 at 4:07 PM Bharath Rupireddy wrote: > > On Tue, Feb 14, 2023 at 6:25 AM Andres Freund wrote: > > > > Hi, > > > > On 2023-02-13 15:22:02 -0800, Peter Geoghegan wrote: > > > More concretely, it looks like GetWALRecordInfo() calls > > > CStringGetTextDatum/cstring_to_text in a way that accumulates way too > > > much memory in ExprContext. > > > > Additionally, we leak two stringinfos for each record. > > > > > > > This could be avoided by using a separate memory context that is reset > > > periodically, or something else along the same lines. > > > > Everything other than a per-row memory context that's reset each time seems > > hard to manage in this case. > > > > Somehwat funnily, GetWALRecordsInfo() then ends up being unnecessarily > > dilligent about cleaning up O(1) memory, after not caring about O(N) > > memory... > > Thanks for reporting. I'll get back to you on this soon. The memory usage goes up with many WAL records in GetWALRecordsInfo(). The affected functions are pg_get_wal_records_info() and pg_get_wal_records_info_till_end_of_wal(). I think the best way to fix this is to use a temporary memory context (like the jsonfuncs.c), reset it after every tuple is put into the tuple store. This fix keeps the memory under limits. I'm attaching the patches here. For HEAD, I'd want to be a bit defensive and use the temporary memory context for pg_get_wal_fpi_info() too. And, the fix also needs to be back-patched to PG15. 
[1] HEAD: PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 1105979 ubuntu20 0 28.5g 28.4g 150492 R 80.7 93.0 1:47.12 postgres PATCHED: PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ COMMAND 13149 ubuntu20 0 173244 156872 150688 R 79.0 0.5 1:25.09 postgres postgres=# select count(*) from pg_get_wal_records_info_till_end_of_wal('0/100'); count -- 35285649 (1 row) postgres=# select pg_backend_pid(); pg_backend_pid 13149 (1 row) -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From 56ce258a6c4b4118a4cbe6612fbafb6cb172f7e3 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 15 Feb 2023 06:26:30 + Subject: [PATCH v1] Limit memory usage of pg_walinspect functions Some of the pg_walinspect functions can overuse and leak memory across WAL records iterations and can cause OOM if there are many WAL records to output are present. Fix this by using a temporary memory context that's reset for each WAL record iteraion. Reported-by: Peter Geoghegan Author: Bharath Rupireddy Discussion: https://www.postgresql.org/message-id/CAH2-WznLEJjn7ghmKOABOEZYuJvkTk%3DGKU3m0%2B-XBAH%2BerPiJQ%40mail.gmail.com Backpatch-through: 15 --- contrib/pg_walinspect/pg_walinspect.c | 28 +++ 1 file changed, 28 insertions(+) diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c index 91b740ed27..9d429c6d73 100644 --- a/contrib/pg_walinspect/pg_walinspect.c +++ b/contrib/pg_walinspect/pg_walinspect.c @@ -304,6 +304,8 @@ pg_get_wal_fpi_info(PG_FUNCTION_ARGS) XLogRecPtr start_lsn; XLogRecPtr end_lsn; XLogReaderState *xlogreader; + MemoryContext old_cxt; + MemoryContext tmp_cxt; start_lsn = PG_GETARG_LSN(0); end_lsn = PG_GETARG_LSN(1); @@ -314,14 +316,26 @@ pg_get_wal_fpi_info(PG_FUNCTION_ARGS) xlogreader = InitXLogReaderState(start_lsn); + tmp_cxt = AllocSetContextCreate(CurrentMemoryContext, + "pg_get_wal_fpi_info temporary cxt", + ALLOCSET_DEFAULT_SIZES); + while (ReadNextXLogRecord(xlogreader) 
&& xlogreader->EndRecPtr <= end_lsn) { + /* Use the tmp context so we can clean up after each tuple is done */ + old_cxt = MemoryContextSwitchTo(tmp_cxt); + GetWALFPIInfo(fcinfo, xlogreader); + /* clean up and switch back */ + MemoryContextSwitchTo(old_cxt); + MemoryContextReset(tmp_cxt); + CHECK_FOR_INTERRUPTS(); } + MemoryContextDelete(tmp_cxt); pfree(xlogreader->private_data); XLogReaderFree(xlogreader); @@ -440,23 +454,37 @@ GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn, ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo; Datum values[PG_GET_WAL_RECORDS_INFO_COLS] = {0}; bool nulls[PG_GET_WAL_RECORDS_INFO_COLS] = {0}; + MemoryContext old_cxt; + MemoryContext tmp_cxt; InitMaterializedSRF(fcinfo, 0); xlogreader = InitXLogReaderState(start_lsn); + tmp_cxt = AllocSetContextCreate(CurrentMemoryContext, + "GetWALRecordsInfo temporary cxt", + ALLOCSET_DEFAULT_SIZES); + while (ReadNextXLogRecord(xlogreader) && xlogreader->EndRecPtr <= end_lsn) { + /* Use the tmp context so we can clean up after each tuple is done */ + old_cxt = MemoryContextSwitchTo(tmp_cxt); + GetWALRecordInfo(xlogreader, values, nulls, PG_GET_WAL_RECORDS_INFO_COLS); tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
Re: Allow logical replication to copy tables in binary format
On Mon, Jan 30, 2023 at 4:19 PM Melih Mutlu wrote:
>
> Hi Bharath,
>
> Thanks for reviewing.
>
> Bharath Rupireddy wrote on Wed, Jan 18, 2023 at 10:17:
>>
>> On Thu, Jan 12, 2023 at 1:53 PM Melih Mutlu wrote:
>> 1. The performance numbers posted upthread [1] look impressive for the
>> use-case tried, that's a 2.25X improvement or 55.6% reduction in
>> execution times. However, it'll be great to run a few more varied
>> tests to confirm the benefit.
>
> Sure, do you have any specific test case or suggestion in mind?
>
>> 2. It'll be great to capture the perf report to see if the time spent
>> in copy_table() is reduced with the patch.
>
> Will provide that in another email soon.
>
>> 3. I think blending initial table sync's binary copy option with
>> data-type level binary send/receive is not good. Moreover, data-type
>> level binary send/receive has its own restrictions per 9de77b5453.
>> IMO, it'll be good to have a new option, say copy_data_format synonyms
>> with COPY command's FORMAT option but with only text (default) and
>> binary values.
>
> Added a "copy_format" option for subscriptions with text as default value.
> So it would be possible to create a binary subscription but copy tables in
> text format to avoid restrictions that you're concerned about.
>
>> 4. Why to add tests to existing 002_types.pl? Can we add a new file
>> with all the data types covered?
>
> Since 002_types.pl is where replication of data types is covered, I thought
> it would be okay to test replication with the binary option in that file.
> Sure, we can add a new file with different data types for testing
> subscriptions with the binary option. But then I feel like it would have
> too many duplicated lines with 002_types.pl.
> If you think that 002_types.pl lacks some data types that need to be
> tested, then we should add those into 002_types.pl too, whether we test
> subscription with the binary option in that file or some other place,
> right?
>
>> 5.
It's not clear to me as to why these rows were removed in the patch? >> -(1, '{1, 2, 3}'), >> -(2, '{2, 3, 1}'), >> (3, '{3, 2, 1}'), >> (4, '{4, 3, 2}'), >> (5, '{5, NULL, 3}'); >> >> -- test_tbl_arrays >> INSERT INTO tst_arrays (a, b, c, d) VALUES >> -('{1, 2, 3}', '{"a", "b", "c"}', '{1.1, 2.2, 3.3}', '{"1 >> day", "2 days", "3 days"}'), >> -('{2, 3, 1}', '{"b", "c", "a"}', '{2.2, 3.3, 1.1}', '{"2 >> minutes", "3 minutes", "1 minute"}'), > > > Previously, it wasn't actually testing the initial table sync since all > tables were empty when subscription was created. > I just simply split the data initially inserted to test initial table sync. > > With this patch, it inserts the first two rows for all data types before > subscriptions get created. > You can see these lines: >> >> +# Insert initial test data >> +$node_publisher->safe_psql( >> + 'postgres', qq( >> + -- test_tbl_one_array_col >> + INSERT INTO tst_one_array (a, b) VALUES >> + (1, '{1, 2, 3}'), >> + (2, '{2, 3, 1}'); >> + >> + -- test_tbl_arrays >> + INSERT INTO tst_arrays (a, b, c, d) VALUES > > > >> >> 6. BTW, the subbinary description is missing in pg_subscription docs >> https://www.postgresql.org/docs/devel/catalog-pg-subscription.html? >> - Specifies whether the subscription will request the publisher to >> - send the data in binary format (as opposed to text). >> + Specifies whether the subscription will copy the initial data to >> + synchronize relations in binary format and also request the >> publisher >> + to send the data in binary format too (as opposed to text). > > > Done. > >> >> 7. A nitpick - space is needed after >= before 16. >> +if (walrcv_server_version(LogRepWorkerWalRcvConn) >=16 && > > > Done. > >> >> 8. Note that the COPY binary format isn't portable across platforms >> (Windows to Linux for instance) or major versions >> https://www.postgresql.org/docs/devel/sql-copy.html, whereas, logical >> replication is >> https://www.postgresql.org/docs/devel/logical-replication.html. 
>> I don't see any handling of such cases in copy_table but only a check >> for the publisher version. I think we need to account for all the >> cases - allow binary COPY only when publisher and subscriber are of >> same versions, architectures, platforms. The trick here is how we >> identify if the subscriber is of the same type and taste >> (architectures and platforms) as the publisher. Looking for >> PG_VERSION_STR of publisher and subscriber might be naive, but I'm not >> sure if there's a better way to do it. > > > I think having the "copy_format" option with text as default, like I replied > to your 3rd review above, will keep things as they are now. > The patch now will only allow users to choose binary copy as well, if they > want it and acknowledge the limitations that come with binary
Re: Silent overflow of interval type
On Thu, Feb 16, 2023 at 1:12 AM Tom Lane wrote: > Yeah, I don't think this would create a performance problem, at least not > if you're using a compiler that implements pg_sub_s64_overflow reasonably. > (And if you're not, and this bugs you, the answer is to get a better Please find attached the v2 of the said patch with the tests added. I tested and it applies with all tests passing both on REL_14_STABLE, REL_15_STABLE and master. I don't know how the decision on backpatching is made and whether it makes sense here or not. If any additional work is required, please let me know. > By chance did you look at all other nearby cases, is it the only place > with overflow? Not really, no. The other place where it could overflow was in the interval justification function and it was fixed about a year ago. That wasn't backpatched afaict. See https://postgr.es/m/caavxfhenqsj2xyfbpuf_8nnquijqkag04nw6abqq0dbzsxf...@mail.gmail.com Regards, Nick From 52d49e90b73d13c9acfd2b85f1ae38dfb0f64f9d Mon Sep 17 00:00:00 2001 From: Nick Babadzhanian Date: Thu, 16 Feb 2023 13:38:34 +0100 Subject: [PATCH] Address interval overflow and add corresponding tests --- src/backend/utils/adt/timestamp.c | 6 +- src/test/regress/expected/interval.out | 9 + src/test/regress/sql/interval.sql | 4 3 files changed, 18 insertions(+), 1 deletion(-) diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c index f70f829d83..3ff51102a8 100644 --- a/src/backend/utils/adt/timestamp.c +++ b/src/backend/utils/adt/timestamp.c @@ -2688,7 +2688,11 @@ timestamp_mi(PG_FUNCTION_ARGS) (errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), errmsg("cannot subtract infinite timestamps"))); - result->time = dt1 - dt2; + /* Subtract dt1 and dt2 with overflow detection */ + if (unlikely(pg_sub_s64_overflow(dt1, dt2, &result->time))) + ereport(ERROR, +(errcode(ERRCODE_INTERVAL_FIELD_OVERFLOW), + errmsg("interval field out of range"))); result->month = 0; result->day = 0; diff --git 
a/src/test/regress/expected/interval.out b/src/test/regress/expected/interval.out index 78c1fab2b6..280f25c218 100644 --- a/src/test/regress/expected/interval.out +++ b/src/test/regress/expected/interval.out @@ -235,6 +235,15 @@ LINE 1: INSERT INTO INTERVAL_TBL_OF (f1) VALUES ('-2147483648 years'... -- Test edge-case overflow detection in interval multiplication select extract(epoch from '256 microseconds'::interval * (2^55)::float8); ERROR: interval out of range +-- Test edge-case overflow in timestamp[tz] subtraction +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224193 UTC' AS doesnt_overflow; +doesnt_overflow + + 106751991 days 04:00:54.775807 +(1 row) + +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224192 UTC' AS overflows; +ERROR: interval field out of range SELECT r1.*, r2.* FROM INTERVAL_TBL_OF r1, INTERVAL_TBL_OF r2 WHERE r1.f1 > r2.f1 diff --git a/src/test/regress/sql/interval.sql b/src/test/regress/sql/interval.sql index 55a449b617..a228cf1445 100644 --- a/src/test/regress/sql/interval.sql +++ b/src/test/regress/sql/interval.sql @@ -76,6 +76,10 @@ INSERT INTO INTERVAL_TBL_OF (f1) VALUES ('-2147483648 years'); -- Test edge-case overflow detection in interval multiplication select extract(epoch from '256 microseconds'::interval * (2^55)::float8); +-- Test edge-case overflow in timestamp[tz] subtraction +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224193 UTC' AS doesnt_overflow; +SELECT timestamptz'294276-12-31 23:59:59 UTC' - timestamptz'1999-12-23 19:59:04.224192 UTC' AS overflows; + SELECT r1.*, r2.* FROM INTERVAL_TBL_OF r1, INTERVAL_TBL_OF r2 WHERE r1.f1 > r2.f1 -- 2.39.1
RE: Support logical replication of DDLs
On Wednesday, February 15, 2023 5:51 PM Amit Kapila wrote:
>
> On Wed, Feb 15, 2023 at 2:02 PM Alvaro Herrera wrote:
> >
> > On 2023-Feb-15, Peter Smith wrote:
> >
> > > On Thu, Feb 9, 2023 at 8:55 PM Ajin Cherian wrote:
> > > >
> > > > On Fri, Feb 3, 2023 at 11:41 AM Peter Smith wrote:
> > > > > 3. ExecuteGrantStmt
> > > > >
> > > > > + /* Copy the grantor id needed for DDL deparsing of Grant */
> > > > > + istmt.grantor_uid = grantor;
> > > > >
> > > > > SUGGESTION (comment)
> > > > > Copy the grantor id to the parsetree, needed for DDL deparsing
> > > > > of Grant
> > > >
> > > > didn't change this, as Alvaro said this was not a parsetree.
> > >
> > > Perhaps there is more to do here? Sorry, I did not understand the
> > > details of Alvaro's post [1], but I did not recognize the difference
> > > between ExecuteGrantStmt and ExecSecLabelStmt so it was my
> > > impression either one or both of these places are either wrongly
> > > commented, or maybe are doing something that should not be done.
> >
> > These two cases are different. In ExecGrantStmt we're adding the
> > grantor OID to the InternalGrant struct, which is not a parse node, so
> > there's no strong reason not to modify it (and also the suggested
> > comment change is factually wrong). I don't know if the idea is
> > great, but at least I see no strong objection.
> >
> > In the other case, as I said in [1], the patch proposes to edit the
> > parse node to add the grantor, but I think a better idea might be to
> > change the signature to ExecSecLabelStmt(SecLabelStmt *stmt,
> > ObjectAddress *provider) so that the function can set the provider
> > there; and caller passes &secondaryObject, which is the method we
> > adopted for this kind of thing.
>
> +1, that is a better approach to make the required change in
> ExecSecLabelStmt().

I did some research on this. The provider does not seem to be a database object; a provider has to be registered from C code via register_label_provider.
And ObjectAddress only has three fields (classId, objectId, objectSubId), so it seems hard to set a provider name in an ObjectAddress. We cannot get the correct provider name from the catalog either, because there could be multiple providers for the same database object.

So, if we don't want to modify the parsetree, maybe we can introduce a function GetDefaultProvider() which can be used if the user doesn't specify the provider in the DDL command. GetDefaultProvider would return the first provider in the providers list, as is done in ExecSecLabelStmt(). What do you think?

Best Regards,
Hou zj
Refactor calculations to use instr_time
Hi, In 'instr_time.h' it is stated that: * When summing multiple measurements, it's recommended to leave the * running sum in instr_time form (ie, use INSTR_TIME_ADD or * INSTR_TIME_ACCUM_DIFF) and convert to a result format only at the end. So, I refactored 'PendingWalStats' to use 'instr_time' instead of 'PgStat_Counter' while accumulating 'wal_write_time' and 'wal_sync_time'. Also, I refactored some calculations to use 'INSTR_TIME_ACCUM_DIFF' instead of using 'INSTR_TIME_SUBTRACT' and 'INSTR_TIME_ADD'. What do you think? Regards, Nazir Bilal Yavuz Microsoft From 5216c70da53ff32b8e9898595445ad6f4880b06d Mon Sep 17 00:00:00 2001 From: Nazir Bilal Yavuz Date: Thu, 2 Feb 2023 15:06:48 +0300 Subject: [PATCH v1] Refactor instr_time calculations --- src/backend/access/transam/xlog.c | 6 ++ src/backend/storage/file/buffile.c | 6 ++ src/backend/utils/activity/pgstat_wal.c | 27 - src/include/pgstat.h| 17 +++- 4 files changed, 33 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f9f0f6db8d1..3c35dc1ca23 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2206,8 +2206,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_write_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, duration, start); } PendingWalStats.wal_write++; @@ -8201,8 +8200,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) instr_time duration; INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - PendingWalStats.wal_sync_time += INSTR_TIME_GET_MICROSEC(duration); + INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, duration, start); } PendingWalStats.wal_sync++; diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index 0a51624df3b..e55f86b675e 100644 --- 
a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -469,8 +469,7 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start); } /* we choose not to advance curOffset here */ @@ -544,8 +543,7 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) { INSTR_TIME_SET_CURRENT(io_time); - INSTR_TIME_SUBTRACT(io_time, io_start); - INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time); + INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start); } file->curOffset += bytestowrite; diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c index e8598b2f4e0..56211fbc61a 100644 --- a/src/backend/utils/activity/pgstat_wal.c +++ b/src/backend/utils/activity/pgstat_wal.c @@ -21,7 +21,7 @@ #include "executor/instrument.h" -PgStat_WalStats PendingWalStats = {0}; +PgStat_PendingWalUsage PendingWalStats = {0}; /* * WAL usage counters saved from pgWALUsage at the previous call to @@ -89,26 +89,25 @@ pgstat_flush_wal(bool nowait) * previous counters from the current ones. 
*/ WalUsageAccumDiff(&diff, &pgWalUsage, &prevWalUsage); - PendingWalStats.wal_records = diff.wal_records; - PendingWalStats.wal_fpi = diff.wal_fpi; - PendingWalStats.wal_bytes = diff.wal_bytes; if (!nowait) LWLockAcquire(&stats_shmem->lock, LW_EXCLUSIVE); else if (!LWLockConditionalAcquire(&stats_shmem->lock, LW_EXCLUSIVE)) return true; -#define WALSTAT_ACC(fld) stats_shmem->stats.fld += PendingWalStats.fld - WALSTAT_ACC(wal_records); - WALSTAT_ACC(wal_fpi); - WALSTAT_ACC(wal_bytes); - WALSTAT_ACC(wal_buffers_full); - WALSTAT_ACC(wal_write); - WALSTAT_ACC(wal_sync); - WALSTAT_ACC(wal_write_time); - WALSTAT_ACC(wal_sync_time); +#define WALSTAT_ACC(fld, var_to_add) \ + (stats_shmem->stats.fld += var_to_add.fld) +#define WALLSTAT_ACC_INSTR_TIME_TYPE(fld) \ + (stats_shmem->stats.fld += INSTR_TIME_GET_MICROSEC(PendingWalStats.fld)) + WALSTAT_ACC(wal_records, diff); + WALSTAT_ACC(wal_fpi, diff); + WALSTAT_ACC(wal_bytes, diff); + WALSTAT_ACC(wal_buffers_full, PendingWalStats); + WALSTAT_ACC(wal_write, PendingWalStats); + WALSTAT_ACC(wal_sync, PendingWalStats); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_write_time); + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_sync_time); #undef WALSTAT_ACC - LWLockRelease(&stats_shmem->lock); /* diff --git a/src/include/pgstat.h b/src/include/pgstat.h index db9675884f3..295c5eabf38 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -445,6 +445,21 @@ typedef struct PgStat_WalStats TimestampTz stat_reset_timestamp; } PgStat_WalStats; +/* Created f
Re: Dead code in ps_status.c
Thomas Munro writes: > On Thu, Feb 16, 2023 at 6:34 PM Tom Lane wrote: >> Hm, is "defined(sun)" true on any live systems at all? > My GCC compile farm account seems to have expired, or something, so I > couldn't check on wrasse's host (though whether wrasse is "live" is > debatable: Solaris 11.3 has reached EOL, it's just that the CPU is too > old to be upgraded, so it's not testing a real OS that anyone would > actually run PostgreSQL on). But from some googling[1], I think > __sun, __sun__ and sun should all be defined. My account still works, and what I see on wrasse's host is tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i svr #define __SVR4 1 #define __svr4__ 1 tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i sun #define __sun 1 #define sun 1 #define __sun__ 1 I don't know a way to get the list of predefined macros out of the compiler wrasse is actually using (/opt/developerstudio12.6/bin/cc), but doing some experiments with #ifdef confirmed that it defines __sun, __sun__, and __svr4__, but not __svr5__. regards, tom lane
Re: Move defaults toward ICU in 16?
On 2/16/23 4:35 AM, Robert Haas wrote: On Thu, Feb 16, 2023 at 1:01 AM Jeff Davis wrote: It feels very wrong to me to explain that sort order is defined by the operating system on which Postgres happens to run. Saying that it's defined by ICU, which is part of the Unicode consortium, is much better. It doesn't eliminate versioning issues, of course, but I think it's a better explanation for users. The fact that we can't use ICU on Windows, though, weakens this argument a lot. In my experience, we have a lot of Windows users, and they're not any happier with the operating system collations than Linux users. Possibly less so. This is one reason why we're discussing ICU as the "preferred default" vs. "the default." While it may not completely eliminate platform-dependent behavior for collations, it takes a step forward. And AIUI, it does sound like ICU is available on newer versions of Windows[1]. I feel like this is a very difficult kind of change to judge. If everyone else feels this is a win, we should go with it, and hopefully we'll end up better off. I do feel like there are things that could go wrong, though, between the imperfect documentation, the fact that a substantial chunk of our users won't be able to use it because they run Windows, and everybody having to adjust to the behavior change. We should continue to improve our documentation. Personally, I found the biggest challenge was understanding how to set ICU locales / rules, particularly for nondeterministic collations, as it was challenging to find where these were listed. I was able to overcome this with the examples in our docs + blogs, but I agree it's an area we can continue to improve upon. Thanks, Jonathan [1] https://learn.microsoft.com/en-us/dotnet/core/extensions/globalization-icu OpenPGP_signature Description: OpenPGP digital signature
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi, On 2/16/23 1:26 PM, Drouvot, Bertrand wrote: Hi, On 2/16/23 12:00 PM, Amit Kapila wrote: I think this would require XLOG_PAGE_MAGIC as it changes the WAL record. Oh, I was not aware of it, thanks! Will do in V2 (and in the logical decoding on standby patch as it adds the new field mentioned above). BTW, how about a commit message like: Change xl_hash_vacuum_one_page.ntuples from int to uint16. This will create two bytes of padding space in xl_hash_vacuum_one_page which can be used for future patches. This makes the datatype of xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which is advisable as both are used for the same purpose. LGTM, will add it to V2! Please find V2 attached. The commit message also mentions the XLOG_PAGE_MAGIC bump. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
From a1f627a6e8c347bdb3112033ae08ace57fbf0a10 Mon Sep 17 00:00:00 2001 From: Bertrand Drouvot Date: Thu, 16 Feb 2023 15:01:29 + Subject: [PATCH v2] Change xl_hash_vacuum_one_page.ntuples from int to uint16. This will create two bytes of padding space in xl_hash_vacuum_one_page which can be used for future patches. This makes the datatype of xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which is advisable as both are used for the same purpose. Bump XLOG_PAGE_MAGIC.
--- src/include/access/hash_xlog.h | 4 ++-- src/include/access/xlog_internal.h | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) 100.0% src/include/access/ diff --git a/src/include/access/hash_xlog.h b/src/include/access/hash_xlog.h index a2f0f39213..9894ab9afe 100644 --- a/src/include/access/hash_xlog.h +++ b/src/include/access/hash_xlog.h @@ -251,13 +251,13 @@ typedef struct xl_hash_init_bitmap_page typedef struct xl_hash_vacuum_one_page { TransactionId snapshotConflictHorizon; - int ntuples; + uint16 ntuples; /* TARGET OFFSET NUMBERS FOLLOW AT THE END */ } xl_hash_vacuum_one_page; #define SizeOfHashVacuumOnePage \ - (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(int)) + (offsetof(xl_hash_vacuum_one_page, ntuples) + sizeof(uint16)) extern void hash_redo(XLogReaderState *record); extern void hash_desc(StringInfo buf, XLogReaderState *record); diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 59fc7bc105..8edae7bb07 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -31,7 +31,7 @@ /* * Each page of XLOG file has a header like this: */ -#define XLOG_PAGE_MAGIC 0xD111 /* can be used as WAL version indicator */ +#define XLOG_PAGE_MAGIC 0xD112 /* can be used as WAL version indicator */ typedef struct XLogPageHeaderData { -- 2.34.1
Re: PATCH: Using BRIN indexes for sorted output
On Thu, Feb 16, 2023 at 03:07:59PM +0100, Tomas Vondra wrote: > Rebased version of the patches, fixing only minor conflicts. Per cfbot, the patch fails on 32 bit builds. +ERROR: count mismatch: 0 != 1000 And causes warnings in mingw cross-compile. On Sun, Oct 23, 2022 at 11:32:37PM -0500, Justin Pryzby wrote: > I think new GUCs should be enabled during patch development. > Maybe in a separate 0002 patch "for CI only not for commit". > That way "make check" at least has a chance to hit that new code > paths. > > Also, note that indxpath.c had the var initialized to true. In your patch, the amstats guc is still being set to false during startup by the guc machinery. And the tests crash everywhere if it's set to on: TRAP: failed Assert("(nmatches_unique >= 1) && (nmatches_unique <= unique[nvalues-1])"), File: "../src/backend/access/brin/brin_minmax.c", Line: 644, PID: 25519 > . Some typos in your other patches: "heuristics heuristics". ste. >lest (least). These are still present. -- Justin
Re: Refactor calculations to use instr_time
Hi, On 2023-02-16 16:19:02 +0300, Nazir Bilal Yavuz wrote: > What do you think? Here's a small review: > +#define WALSTAT_ACC(fld, var_to_add) \ > + (stats_shmem->stats.fld += var_to_add.fld) > +#define WALLSTAT_ACC_INSTR_TIME_TYPE(fld) \ > + (stats_shmem->stats.fld += INSTR_TIME_GET_MICROSEC(PendingWalStats.fld)) > + WALSTAT_ACC(wal_records, diff); > + WALSTAT_ACC(wal_fpi, diff); > + WALSTAT_ACC(wal_bytes, diff); > + WALSTAT_ACC(wal_buffers_full, PendingWalStats); > + WALSTAT_ACC(wal_write, PendingWalStats); > + WALSTAT_ACC(wal_sync, PendingWalStats); > + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_write_time); > + WALLSTAT_ACC_INSTR_TIME_TYPE(wal_sync_time); > #undef WALSTAT_ACC > - > LWLockRelease(&stats_shmem->lock); WALSTAT_ACC is undefined, but WALLSTAT_ACC_INSTR_TIME_TYPE isn't. I'd not remove the newline before LWLockRelease(). > /* > diff --git a/src/include/pgstat.h b/src/include/pgstat.h > index db9675884f3..295c5eabf38 100644 > --- a/src/include/pgstat.h > +++ b/src/include/pgstat.h > @@ -445,6 +445,21 @@ typedef struct PgStat_WalStats > TimestampTz stat_reset_timestamp; > } PgStat_WalStats; > > +/* Created for accumulating wal_write_time and wal_sync_time as a > instr_time Minor code-formatting point: In postgres we don't put code in the same line as a multi-line comment starting with the /*. So either /* single line comment */ or /* * multi line * comment */ > + * but instr_time can't be used as a type where it ends up on-disk > + * because its units may change. PgStat_WalStats type is used for > + * in-memory/on-disk data. So, PgStat_PendingWalUsage is created for > + * accumulating intervals as a instr_time. > + */ > +typedef struct PgStat_PendingWalUsage > +{ > + PgStat_Counter wal_buffers_full; > + PgStat_Counter wal_write; > + PgStat_Counter wal_sync; > + instr_time wal_write_time; > + instr_time wal_sync_time; > +} PgStat_PendingWalUsage; > + I wonder if we should try to put pgWalUsage in here. But that's probably better done as a separate patch. 
Greetings, Andres Freund
Re: Move defaults toward ICU in 16?
Hi, On 2023-02-16 15:05:10 +0530, Robert Haas wrote: > The fact that we can't use ICU on Windows, though, weakens this > argument a lot. In my experience, we have a lot of Windows users, and > they're not any happier with the operating system collations than > Linux users. Possibly less so. Why can't you use ICU on windows? It works today, afaict? Greetings, Andres Freund
Re: Weird failure with latches in curculio on v15
Hi, On 2023-02-16 15:18:57 +0530, Robert Haas wrote: > On Fri, Feb 10, 2023 at 12:59 AM Andres Freund wrote: > > I don't think it's that hard to imagine problems. To be reasonably fast, a > > decent restore implementation will have to 'restore ahead'. Which also > > provides ample things to go wrong. E.g. > > > > - WAL source is switched, restore module needs to react to that, but > > doesn't, > > we end up lots of wasted work, or worse, filename conflicts > > - recovery follows a timeline, restore module doesn't catch on quickly > > enough > > - end of recovery happens, restore just continues on > > I don't see how you can prevent those things from happening. If the > restore process is working in some way that requires an event loop, > and I think that will be typical for any kind of remote archiving, > then either it has control most of the time, so the event loop can be > run inside the restore process, or, as Nathan proposes, we don't let > the archiver have control and it needs to run that restore process in > a separate background worker. The hazards that you mention here exist > either way. If the event loop is running inside the restore process, > it can decide not to call the functions that we provide in a timely > fashion and thus fail to react as it should. If the event loop runs > inside a separate background worker, then that process can fail to be > responsive in precisely the same way. Fundamentally, if the author of > a restore module writes code to have multiple I/Os in flight at the > same time and does not write code to cancel those I/Os if something > changes, then such cancellation will not occur. That remains true no > matter which process is performing the I/O. IDK. I think we can make that easier or harder. Right now the proposed API doesn't provide anything to allow to address this. 
> > > I don't quite see how you can make asynchronous and parallel archiving > > > work if the archiver process only calls into the archive module at > > > times that it chooses. That would mean that the module has to return > > > control to the archiver when it's in the middle of archiving one or > > > more files -- and then I don't see how it can get control back at the > > > appropriate time. Do you have a thought about that? > > > > I don't think archiver is the hard part, that already has a dedicated > > process, and it also has something of a queuing system already. The startup > > process imo is the complicated one... > > > > If we had a 'restorer' process, startup fed some sort of a queue with things > > to restore in the near future, it might be more realistic to do something > > you > > describe? > > Some kind of queueing system might be a useful part of the interface, > and a dedicated restorer process does sound like a good idea. But the > archiver doesn't have this solved, precisely because you have to > archive a single file, return control, and wait to be invoked again > for the next file. That does not scale. But there's nothing inherent in that. We know for certain which files we're going to archive. And we don't need to work one-by-one. The archiver could just start multiple subprocesses at the same time. All the blocking it does right now is artificially imposed by the use of system(). We could instead just use something popen()-like and have a configurable number of processes running at the same time. What I was trying to point out was that the work a "restorer" process has to do is more speculative, because we don't know when we'll promote, whether we'll follow a timeline increase, whether the to-be-restored WAL already exists. That's solvable, but a bunch of the relevant work ought to be solved in core code, instead of just in archive modules. Greetings, Andres Freund
Re: run pgindent on a regular basis / scripted manner
On 2023-02-16 Th 03:26, shiy.f...@fujitsu.com wrote: On Thu, Feb 9, 2023 6:10 AM Andrew Dunstan wrote: Thanks, I have committed this. Still looking at Robert's other request. Hi, Commit #068a243b7 supported directories to be non-option arguments of pgindent. But the help text doesn't mention that. Should we update it? Attach a small patch which did that. Thanks, pushed. cheers andrew -- Andrew Dunstan EDB:https://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2023-02-16 16:22:56 +0700, John Naylor wrote: > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada > > Right. TidStore is implemented not only for heap, so loading > > out-of-order TIDs might be important in the future. > > That's what I was probably thinking about some weeks ago, but I'm having a > hard time imagining how it would come up, even for something like the > conveyor-belt concept. We really ought to replace the tid bitmap used for bitmap heap scans. The hashtable we use is a pretty awful data structure for it. And that's not filled in-order, for example. Greetings, Andres Freund
Re: Support logical replication of DDLs
Hi, On 2/14/23 10:01 PM, houzj.f...@fujitsu.com wrote: Here is the new version patch which addressed the above comments. I also fixed a bug in the deparsing of CREATE RULE, where it didn't add parentheses for the rule action list. I started testing this change set from this patch. I'm doing a mix of happy path, "making mistakes" path, and "real world" testing, and testing this both with unidirectional and "origin=none" replication. I wanted to report an issue I came up with using one of my real world cases. I had previously built a demo scheduling app to demonstrate several features of PostgreSQL to help with various kinds of data synchronization[1]. The first example uses a series of functions and triggers[2] to keep a calendar table up-to-date. I set up an experiment as such: 1. Create two different clusters. In each cluster, create a DB. 2. On Cluster 1, run: CREATE PUBLICATION ddl FOR ALL TABLES WITH (ddl='all'); 3. On Cluster 2, run: CREATE SUBSCRIPTION ddl CONNECTION '' PUBLICATION ddl; 4. On Cluster 1, run the commands in [2]. Note that I reproduced the error both by running the commands individually and as part of a single transaction. 5. The transactions (or the single transaction) complete successfully on Cluster 1. 6.
Cluster 2 reports the following error: 2023-02-16 16:11:10.537 UTC [25207] LOG: logical replication apply worker for subscription "ddl" has started 2023-02-16 16:11:10.570 UTC [25207] ERROR: relation "availability" does not exist at character 279 2023-02-16 16:11:10.570 UTC [25207] CONTEXT: processing remote data for replication origin "pg_16733" during message type "DDL" in transaction 890, finished at 0/BF298CC0 2023-02-16 16:11:10.570 UTC [25207] STATEMENT: CREATE OR REPLACE FUNCTION public.availability_rule_bulk_insert ( IN availability_rule public.availability_rule, IN day_of_week pg_catalog.int4 ) RETURNS pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 AS $_$ INSERT INTO availability ( room_id, availability_rule_id, available_date, available_range ) SELECT $1.room_id, $1.id, available_date::date + $2 - 1, tstzrange( /** start of range */ (available_date::date + $2 - 1) + $1.start_time, /** end of range */ /** check if there is a time wraparound, if so, increment by a day */ CASE $1.end_time <= $1.start_time WHEN TRUE THEN (available_date::date + $2) + $1.end_time ELSE (available_date::date + $2 - 1) + $1.end_time END ) FROM generate_series( date_trunc('week', CURRENT_DATE), date_trunc('week', CURRENT_DATE) + ($1.generate_weeks_into_future::text || ' weeks')::interval, '1 week'::interval ) available_date; $_$ 2023-02-16 16:11:10.573 UTC [15348] LOG: background worker "logical replication worker" (PID 25207) exited with exit code 1 I attempted this with both async and sync logical replication. In sync mode, the publisher hangs and is unable to accept any more writes. When I went in and explicitly schema qualified the tables in the functions[3], the example executed successfully. My high level guess without looking at the code is that the apply worker is not aware of the search_path to use when processing functions during creation. 
Provided that the publisher/subscriber environments are similar (if not identical), I would expect that if the function creation succeeds on the publisher, it should also succeed on the subscriber. Thanks, Jonathan [1] https://github.com/CrunchyData/postgres-realtime-demo [2] https://github.com/CrunchyData/postgres-realtime-demo/blob/main/examples/demo/demo1.sql [3] https://gist.github.com/jkatz/5655c10da1a4c8691094e951ea07b036 OpenPGP_signature Description: OpenPGP digital signature
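For reference, the schema-qualification workaround in [3] boils down to the following pattern; the table and function names here are invented simplifications of the demo's objects, not the actual fix:

```sql
-- Before: "availability" is resolved via search_path when the function
-- body is parsed/executed, which may differ in the subscriber's apply worker.
CREATE FUNCTION touch_availability(p_room int) RETURNS void
LANGUAGE sql AS
$$ INSERT INTO availability (room_id) VALUES (p_room); $$;

-- After: explicit qualification removes the search_path dependency,
-- so the replicated CREATE FUNCTION behaves the same on both nodes.
CREATE FUNCTION touch_availability(p_room int) RETURNS void
LANGUAGE sql AS
$$ INSERT INTO public.availability (room_id) VALUES (p_room); $$;
```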
Re: Weird failure with latches in curculio on v15
On Thu, Feb 16, 2023 at 03:08:14PM +0530, Robert Haas wrote: > On Thu, Feb 9, 2023 at 10:53 PM Nathan Bossart > wrote: >> I've been thinking about this, actually. I'm wondering if we could provide >> a list of files to the archiving callback (configurable via a variable in >> ArchiveModuleState), and then have the callback return a list of files that >> are archived. (Or maybe we just put the list of files that need archiving >> in ArchiveModuleState.) The returned list could include files that were >> sent to the callback previously. The archive module would be responsible >> for creating background worker(s) (if desired), dispatching files >> to-be-archived to its background worker(s), and gathering the list of >> archived files to return. > > Hmm. So in this design, the archiver doesn't really do the archiving > any more, because the interface makes that impossible. It has to use a > separate background worker process for that, full stop. > > I don't think that's a good design. It's fine if some people want to > implement it that way, but it shouldn't be forced by the interface. I don't think it would force you to use a background worker, but if you wanted to, the tools would be available. At least, that is the intent. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Support logical replication of DDLs
On 2023-Feb-16, Jonathan S. Katz wrote: [replication tries to execute this command] > 2023-02-16 16:11:10.570 UTC [25207] STATEMENT: CREATE OR REPLACE FUNCTION > public.availability_rule_bulk_insert ( IN availability_rule > public.availability_rule, IN day_of_week pg_catalog.int4 ) RETURNS > pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT > SECURITY INVOKER COST 100 AS $_$ > INSERT INTO availability ( > room_id, > availability_rule_id, > available_date, > available_range > ) [which results in:] > 2023-02-16 16:11:10.570 UTC [25207] ERROR: relation "availability" does not > exist at character 279 I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. (The fact that this is a security-invoker function prevents you from attaching a SET search_path clause to the function, I believe? Which means it is extra dangerous to have an unqualified table reference there.) > My high level guess without looking at the code is that the apply worker is > not aware of the search_path to use when processing functions during > creation. Provided that the publisher/subscriber environments are similar > (if not identical), I would expect that if the function create succeeds on > the publisher, it should also succeed on the subscriber. If we're going to force search_path and all other settings to be identical, then we might as well give up the whole deparsing design and transmit the original string for execution in the replica; it is much simpler. But this idea was rejected outright when this stuff was first proposed years ago. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Crear es tan difícil como ser libre" (Elsa Triolet)
Re: ATTACH PARTITION seems to ignore column generation status
Hello, 11.01.2023 23:58, Tom Lane wrote: Amit Langote writes: I've updated your disallow-generated-child-columns-2.patch to do this, and have also merged the delta post that I had attached with my last email, whose contents you sound to agree with. Pushed with some further work to improve the handling of multiple- inheritance cases. We still need to insist that all or none of the parent columns are generated, but we don't have to require their generation expressions to be alike: that can be resolved by letting the child table override the expression, much as we've long done for plain default expressions. (This did need some work in pg_dump after all.) I'm pretty happy with where this turned out. I've encountered a query that triggers an assert added in that commit: CREATE TABLE t(a int, b int GENERATED ALWAYS AS (a) STORED) PARTITION BY RANGE (a); CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1); Core was generated by `postgres: law regression [local] CREATE TABLE '. Program terminated with signal SIGABRT, Aborted. warning: Section `.reg-xstate/3152655' in core file too small. #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140460372887360) at ./nptl/pthread_kill.c:44 44 ./nptl/pthread_kill.c: No such file or directory. 
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140460372887360) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140460372887360) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140460372887360, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x7fbf79f0e476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x7fbf79ef47f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x55e76b35b322 in ExceptionalCondition ( conditionName=conditionName@entry=0x55e76b4a2240 "!(coldef->generated && !restdef->generated)", fileName=fileName@entry=0x55e76b49ec71 "tablecmds.c", lineNumber=lineNumber@entry=3028) at assert.c:66
#6 0x55e76afef8c3 in MergeAttributes (schema=0x55e76d480318, supers=supers@entry=0x55e76d474c18, relpersistence=112 'p', is_partition=true, supconstr=supconstr@entry=0x7ffe945a3768) at tablecmds.c:3028
#7 0x55e76aff0ef2 in DefineRelation (stmt=stmt@entry=0x55e76d44b2d8, relkind=relkind@entry=114 'r', ownerId=10, ownerId@entry=0, typaddress=typaddress@entry=0x0, queryString=queryString@entry=0x55e76d44a408 "CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1);") at tablecmds.c:861
...
Without asserts enabled, the partition is created successfully, and INSERT INTO t VALUES(0); SELECT * FROM t; yields:
 a | b
---+---
 0 | 1
(1 row)
Is this behavior expected? Best regards, Alexander
Re: [PATCH] Align GSS and TLS error handling in PQconnectPoll()
On Thu, Feb 16, 2023 at 3:31 AM Jelte Fennema wrote: > > Patch looks good to me. Definitely an improvement over the status quo. Thanks for the review! > Looking at the TLS error handling though I see these two lines: > > && conn->allow_ssl_try /* redundant? */ > && !conn->wait_ssl_try) /* redundant? */ > > Are they actually redundant like the comment suggests? If so, we > should probably remove them (in another patch). If not (or if we don't > know), should we have these same checks for GSS? It's a fair point. GSS doesn't have an "allow" encryption mode, so they can't be the exact same checks. And we're already not checking the probably-redundant information, so I'd vote against speculatively adding it back. (try_gss is already based on gssencmode, which we're using here. So I think rechecking try_gss would only help if we wanted to clear it manually while in the middle of processing a GSS exchange. From a quick inspection, I don't think that's happening today -- and I'm not really sure that it could be useful in the future, because I'd think prefer-mode is supposed to guarantee a retry on failure.) I suspect this is a much deeper rabbit hole; I think it's work that needs to be done, but I can't sign myself up for it at the moment. The complexity of this function is off the charts (for instance, why do we recheck conn->try_gss above, if the only apparent way to get to CONNECTION_GSS_STARTUP is by having try_gss = true to begin with? is there a goto/retry path I'm missing?). I think it either needs heavy assistance from a committer who already has intimate knowledge of this state machine and all of its possible branches, or from a static analysis tool that can help with a step-by-step simplification. Thanks, --Jacob
Re: Support logical replication of DDLs
On 2023-Feb-16, houzj.f...@fujitsu.com wrote: > I did some research for this. > > The provider does not seem to be a database object; a user needs to register a provider from C code via register_label_provider. And ObjectAddress only has three fields (classId, objectId, objectSubId), so it seems hard to attach the provider name to an ObjectAddress. We cannot get the correct provider name from the catalog either, because there could be multiple providers for the same db object. Oh, I didn't realize that the provider wasn't an ObjectAddress. You're right, that idea won't fly. > So, if we don't want to modify the parsetree, maybe we can introduce a function GetDefaultProvider() which can be used if the user doesn't specify the provider in the DDL command. GetDefaultProvider will return the first provider in the providers list, as is done in ExecSecLabelStmt(). What do you think? One of the design principles is that the DDL execution must resolve the object just once, and thereafter all references must use the same resolution. Resolving twice via this new GetDefaultProvider would violate that (consider if you made it a GUC and the user sent the SIGHUP that changed it just in between.) I think another workable possibility is to create a new value in CollectedCommandType, a separate struct in the union of CollectedCommand, and stash the info about the whole command there, including the provider; then tell ProcessUtilitySlow that the command was already stashed. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "La conclusión que podemos sacar de esos estudios es que no podemos sacar ninguna conclusión de ellos" (Tanenbaum)
Re: psql: Add role's membership options to the \du+ command
On 16.02.2023 00:37, David G. Johnston wrote: I mean, either you accept the change in how this meta-command presents its information or you don't. Yes, that's the main issue of this patch. On the one hand, it would be nice to see the membership options with the psql command. On the other hand, I don't have an exact understanding of how best to do it. That's why I proposed a variant for discussion. It is quite possible that if there is no consensus, it would be better to leave it as is, and get information by queries to the system catalog. - Pavel Luzanov
Re: ATTACH PARTITION seems to ignore column generation status
On 2023-Feb-16, Alexander Lakhin wrote: > I've encountered a query that triggers an assert added in that commit: > CREATE TABLE t(a int, b int GENERATED ALWAYS AS (a) STORED) PARTITION BY > RANGE (a); > CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1); It seems wrong that this command is accepted. It should have given an error, because the partition is not allowed to override the generation of the value that is specified in the parent table. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Pido que me den el Nobel por razones humanitarias" (Nicanor Parra)
Add WAL read stats to pg_stat_wal
Hi, While working on [1], I needed to know WAL read stats (number of times and amount of WAL data read from disk, time taken to read) to measure the benefit. I had to write a developer patch to capture WAL read stats, as pg_stat_wal currently emits only WAL write stats. With recent work on pg_stat_io, which emits data read IO stats too, I think it's better to not miss WAL read stats. It might help others who keep an eye on IOPS of the production servers for various reasons. The WAL read stats I'm thinking useful are wal_read_bytes - total amount of WAL read, wal_read - total number of times WAL is read from disk, wal_read_time - total amount of time spent reading WAL (tracked only when the existing GUC track_wal_io_timing is on). I came up with a patch and attached it here. The WAL readers that add to WAL read stats are WAL senders, the startup process, and other backends using xlogreader for logical replication or pg_walinspect SQL functions. They all report stats to shared memory by calling pgstat_report_wal() in appropriate locations. In standby mode, calling pgstat_report_wal() for every record seems to be costly. Therefore, I chose to report stats every 1024 WAL records (a random number; suggestions for a better way are welcome here). Note that the patch needs a bit more work, per [2]. With the patch, the WAL senders (processes exiting after checkpointer) will generate stats, and we need to either let all, or only one, WAL sender write stats to disk. Allowing one WAL sender to write might be tricky. Allowing all WAL senders to write might make too many writes to the stats file. And, we need a lock to let only one process write. I can't think of a good way here at the moment. Thoughts? [1] https://www.postgresql.org/message-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=q...@mail.gmail.com [2] /* * Write out stats after shutdown. This needs to be called by exactly one * process during a normal shutdown, and since checkpointer is shut down * very late... 
* * Walsenders are shut down after the checkpointer, but currently don't * report stats. If that changes, we need a more complicated solution. */ before_shmem_exit(pgstat_before_server_shutdown, 0); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com From c123e4f5e94cbb4881ce4d65c9dec3c4d5d72778 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 15 Feb 2023 13:29:22 + Subject: [PATCH v1] Add WAL read stats to pg_stat_wal --- doc/src/sgml/monitoring.sgml | 32 + src/backend/access/transam/xlogreader.c | 32 + src/backend/access/transam/xlogrecovery.c | 34 +++ src/backend/catalog/system_views.sql | 3 ++ src/backend/replication/walsender.c | 12 src/backend/utils/activity/pgstat_wal.c | 6 +++- src/backend/utils/adt/pgstatfuncs.c | 24 ++-- src/include/catalog/pg_proc.dat | 6 ++-- src/include/pgstat.h | 9 ++ src/test/regress/expected/rules.out | 5 +++- 10 files changed, 155 insertions(+), 8 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index dca50707ad..20b16405b5 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -4088,6 +4088,38 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i + + + wal_read_bytes numeric + + + Total amount of WAL read from disk in bytes, either by recovery process + for replay, or by WAL senders for replication, or by backend processes. + + + + + + wal_read bigint + + + Number of times WAL is read from disk, either by recovery process for + replay, or by WAL senders for replication, or by backend processes. + + + + + + wal_read_time double precision + + + Total amount of time spent reading WAL from disk, either by recovery + process for replay, or by WAL senders for replication, or by backend + processes, in milliseconds (if + is enabled, otherwise zero). 
+ + + stats_reset timestamp with time zone diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index aa6c929477..29a0938dba 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -180,6 +180,11 @@ XLogReaderFree(XLogReaderState *state) pfree(state->readRecordBuf); pfree(state->readBuf); pfree(state); + +#ifndef FRONTEND + /* Report pending statistics to the cumulative stats system */ + pgstat_report_wal(true); +#endif } /* @@ -1506,6 +1511,9 @@ WALRead(XLogReaderState *state, uint32 startoff; int segbytes; int readbytes; +#ifndef FRONTEND + instr_time start; +#endif
Re: [PATCH] Add `verify-system` sslmode to use system CA pool for server cert
On Thu, Feb 16, 2023 at 1:35 AM Jelte Fennema wrote: > > FYI the last patch does not apply cleanly anymore. So a rebase is needed. Thanks for the nudge, v7 rebases over the configure conflict from 9244c11afe. --Jacob 1: e7e2d43b18 ! 1: b07af1c564 libpq: add sslrootcert=system to use default CAs @@ configure: fi ## configure.ac ## @@ configure.ac: if test "$with_ssl" = openssl ; then - # thread-safety. In 1.1.0, it's no longer required, and CRYPTO_lock() - # function was removed. AC_CHECK_FUNCS([CRYPTO_lock]) + # Function introduced in OpenSSL 1.1.1. + AC_CHECK_FUNCS([X509_get_signature_info]) + # Let tests differentiate between vanilla OpenSSL and LibreSSL. + AC_CHECK_DECLS([LIBRESSL_VERSION_NUMBER], [], [], [#include ]) AC_DEFINE([USE_OPENSSL], 1, [Define to 1 to build with OpenSSL support. (--with-ssl=openssl)]) 2: e4d9731e1e = 2: 5de1c458b1 libpq: force sslmode=verify-full for system CAs From 5de1c458b19dc73608b1518d87f153b733ce1921 Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Mon, 24 Oct 2022 15:24:11 -0700 Subject: [PATCH v7 2/2] libpq: force sslmode=verify-full for system CAs Weaker verification modes don't make much sense for public CAs. --- doc/src/sgml/libpq.sgml | 15 + doc/src/sgml/runtime.sgml | 6 ++-- src/interfaces/libpq/fe-connect.c | 56 +++ src/interfaces/libpq/t/001_uri.pl | 30 +++-- src/test/ssl/t/001_ssltests.pl| 12 +++ 5 files changed, 107 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml index a293431020..364a82adfd 100644 --- a/doc/src/sgml/libpq.sgml +++ b/doc/src/sgml/libpq.sgml @@ -1733,15 +1733,16 @@ postgresql://%2Fvar%2Flib%2Fpostgresql/dbname locations may be further modified by the SSL_CERT_DIR and SSL_CERT_FILE environment variables. - + - When using sslrootcert=system, it is critical to - also use the strongest certificate verification method, - sslmode=verify-full. 
In most cases it is trivial for - anyone to obtain a certificate trusted by the system for a hostname - they control, rendering the verify-ca mode useless. + When using sslrootcert=system, the default + sslmode is changed to verify-full, + and any weaker setting will result in an error. In most cases it is + trivial for anyone to obtain a certificate trusted by the system for a + hostname they control, rendering verify-ca and all + weaker modes useless. - + diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index b93184537a..dbe23db54f 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -2009,9 +2009,9 @@ pg_dumpall -p 5432 | psql -d postgres -p 5433 verify-full and have the appropriate root certificate file installed (). Alternatively the system CA pool can be used using sslrootcert=system; in - this case, sslmode=verify-full must be used for safety, - since it is generally trivial to obtain certificates which are signed by a - public CA. + this case, sslmode=verify-full is forced for safety, since + it is generally trivial to obtain certificates which are signed by a public + CA. diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c index 50b5df3490..69826eaadf 100644 --- a/src/interfaces/libpq/fe-connect.c +++ b/src/interfaces/libpq/fe-connect.c @@ -1259,6 +1259,22 @@ connectOptions2(PGconn *conn) goto oom_error; } +#ifndef USE_SSL + /* + * sslrootcert=system is not supported. Since setting this changes the + * default sslmode, check this _before_ we validate sslmode, to avoid + * confusing the user with errors for an option they may not have set. 
+ */ + if (conn->sslrootcert + && strcmp(conn->sslrootcert, "system") == 0) + { + conn->status = CONNECTION_BAD; + libpq_append_conn_error(conn, "sslrootcert value \"%s\" invalid when SSL support is not compiled in", +conn->sslrootcert); + return false; + } +#endif + /* * validate sslmode option */ @@ -1305,6 +1321,21 @@ connectOptions2(PGconn *conn) goto oom_error; } +#ifdef USE_SSL + /* + * If sslrootcert=system, make sure our chosen sslmode is compatible. + */ + if (conn->sslrootcert + && strcmp(conn->sslrootcert, "system") == 0 + && strcmp(conn->sslmode, "verify-full") != 0) + { + conn->status = CONNECTION_BAD; + libpq_append_conn_error(conn, "weak sslmode \"%s\" may not be used with sslrootcert=system", +conn->sslmode); + return false; + } +#endif + /* * Validate TLS protocol versions for ssl_min_protocol_version and * ssl_max_protocol_version. @@ -5806,6 +5837,7 @@ static bool conninfo_add_defaults(PQconninfoOption *options, PQExpBuffer errorMessage) { PQconninfoOption *option; + PQconninfoOption *sslmode_de
Re: [PoC] Let libpq reject unexpected authentication requests
v14 rebases over the test and solution conflicts from 9244c11afe2. Thanks, --Jacob 1: 542d330310 ! 1: eec891c519 libpq: let client reject unexpected auth methods @@ src/test/ssl/t/002_scram.pl: $node->connect_ok( +qr/channel binding is required, but server did not offer an authentication method that supports channel binding/ + ); +} ++ + # Now test with a server certificate that uses the RSA-PSS algorithm. + # This checks that the certificate can be loaded and that channel binding + # works. (see bug #17760) +@@ src/test/ssl/t/002_scram.pl: if ($supports_rsapss_certs) + qr/connection authenticated: identity="ssltestuser" method=scram-sha-256/ + ]); + } + done_testing(); 2: 815fdc4393 ! 2: dc7f94f24c Add sslcertmode option for client certificates @@ src/tools/msvc/Solution.pm: sub GenerateFiles HAVE_STDINT_H=> 1, HAVE_STDLIB_H=> 1, @@ src/tools/msvc/Solution.pm: sub GenerateFiles - - my ($digit1, $digit2, $digit3) = $self->GetOpenSSLVersion(); - + $define{HAVE_HMAC_CTX_NEW} = 1; + $define{HAVE_OPENSSL_INIT_SSL} = 1; + } ++ ++ # Symbols needed with OpenSSL 1.0.2 and above. + if ( ($digit1 >= '3' && $digit2 >= '0' && $digit3 >= '0') + || ($digit1 >= '1' && $digit2 >= '1' && $digit3 >= '0') + || ($digit1 >= '1' && $digit2 >= '0' && $digit3 >= '2')) + { + $define{HAVE_SSL_CTX_SET_CERT_CB} = 1; + } -+ - # More symbols are needed with OpenSSL 1.1.0 and above. 
- if ( ($digit1 >= '3' && $digit2 >= '0' && $digit3 >= '0') - || ($digit1 >= '1' && $digit2 >= '1' && $digit3 >= '0')) + } + + $self->GenerateConfigHeader('src/include/pg_config.h', \%define, 1); 3: 6c3b1f28bc = 3: 9a84af5936 require_auth: decouple SASL and SCRAM From 9a84af59361e09446d087cf0c79f1a9a3b61e39d Mon Sep 17 00:00:00 2001 From: Jacob Champion Date: Tue, 18 Oct 2022 16:55:36 -0700 Subject: [PATCH v14 3/3] require_auth: decouple SASL and SCRAM Rather than assume that an AUTH_REQ_SASL* code refers to SCRAM-SHA-256, future-proof by separating the single allowlist into a list of allowed authentication request codes and a list of allowed SASL mechanisms. The require_auth check is now separated into two tiers. The AUTH_REQ_SASL* codes are always allowed. If the server sends one, responsibility for the check then falls to pg_SASL_init(), which compares the selected mechanism against the list of allowed mechanisms. (Other SCRAM code is already responsible for rejecting unexpected or out-of-order AUTH_REQ_SASL_* codes, so that's not explicitly handled with this addition.) Since there's only one recognized SASL mechanism, conn->sasl_mechs currently only points at static hardcoded lists. Whenever a second mechanism is added, the list will need to be managed dynamically. --- src/interfaces/libpq/fe-auth.c| 34 ++ src/interfaces/libpq/fe-connect.c | 42 +++ src/interfaces/libpq/libpq-int.h | 2 ++ src/test/authentication/t/001_password.pl | 10 +++--- src/test/ssl/t/002_scram.pl | 6 5 files changed, 83 insertions(+), 11 deletions(-) diff --git a/src/interfaces/libpq/fe-auth.c b/src/interfaces/libpq/fe-auth.c index a474e17fb2..34b0573a98 100644 --- a/src/interfaces/libpq/fe-auth.c +++ b/src/interfaces/libpq/fe-auth.c @@ -522,6 +522,40 @@ pg_SASL_init(PGconn *conn, int payloadlen) goto error; } + /* + * Before going ahead with any SASL exchange, ensure that the user has + * allowed (or, alternatively, has not forbidden) this particular mechanism. 
+ * + * In a hypothetical future where a server responds with multiple SASL + * mechanism families, we would need to instead consult this list up above, + * during mechanism negotiation. We don't live in that world yet. The server + * presents one auth method and we decide whether that's acceptable or not. + */ + if (conn->sasl_mechs) + { + const char **mech; + bool found = false; + + Assert(conn->require_auth); + + for (mech = conn->sasl_mechs; *mech; mech++) + { + if (strcmp(*mech, selected_mechanism) == 0) + { +found = true; +break; + } + } + + if ((conn->sasl_mechs_denied && found) + || (!conn->sasl_mechs_denied && !found)) + { + libpq_append_conn_error(conn, "auth method \"%s\" requirement failed: server requested unacceptable SASL mechanism \"%s\"", + conn->require_auth, selected_mechanism); + goto error; + } + } + if (conn->channel_binding[0] == 'r' && /* require */ strcmp(selected_mechanism, SCRAM_SHA_256_PLUS_NAME) != 0) { diff --git a/src/interfaces/libpq/fe-con
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
Hi, On 2023-02-16 16:58:23 +0900, Michael Paquier wrote: > On Wed, Feb 15, 2023 at 01:00:00PM +0530, Bharath Rupireddy wrote: > > The v3 patch reduces initialization of iovec array elements which is a > > clear win when pg_pwrite_zeros is called for sizes less than BLCKSZ > > many times (I assume this is what is needed for the relation extension > > lock improvements feature). However, it increases the number of iovec > > initialization with zerobuf for the cases when pg_pwrite_zeros is > > called for sizes far greater than BLCKSZ (for instance, WAL file > > initialization). In those cases the cost of initializing the IOV doesn't matter, relative to the other costs. The important point is to not initialize a lot of elements if they're not even needed. Because we need to overwrite the trailing iov element, it doesn't seem worth trying to "pre-initialize" the iov. Referencing a static variable is more expensive than accessing an on-stack variable. Having a first-call check is more expensive than not having it. Thus making the iov and zbuf_sz static isn't helpful. Particularly the latter seems like a bad idea, because it's a compile-time constant. > It seems to me that v3 would do extra initializations only if > pg_pwritev_with_retry() does *not* retry its writes, but that's not > the case as it retries on a partial write as per its name. The number > of iov buffers is strictly capped by remaining_size. I don't really understand this bit? Greetings, Andres Freund
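To illustrate the point being argued here, below is a minimal, self-contained sketch of a pg_pwrite_zeros-style helper. It is not PostgreSQL's actual implementation: the function name, the 8-slot iovec array, and ZBUF_SZ are all illustrative. The idea it demonstrates is the one from the discussion: fill only as many iovec slots as the remaining size actually needs, let the trailing element simply be shorter, and handle partial writes by looping rather than by pre-initializing anything statically.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define ZBUF_SZ 1024			/* illustrative zero-buffer size */

/* Write 'count' zero bytes at 'offset'; returns count, or -1 on error. */
static ssize_t
write_zeros(int fd, size_t count, off_t offset)
{
	static const char zbuf[ZBUF_SZ];	/* static storage: all zeroes */
	size_t		remaining = count;

	while (remaining > 0)
	{
		struct iovec iov[8];	/* illustrative cap on slots per call */
		int			niov = 0;
		size_t		batch = 0;
		ssize_t		written;

		/* Initialize only the iovec slots the remaining size needs. */
		while (niov < 8 && batch < remaining)
		{
			size_t		chunk = remaining - batch;

			if (chunk > ZBUF_SZ)
				chunk = ZBUF_SZ;	/* trailing element is just shorter */
			iov[niov].iov_base = (void *) zbuf;
			iov[niov].iov_len = chunk;
			batch += chunk;
			niov++;
		}

		written = pwritev(fd, iov, niov, offset);
		if (written < 0)
			return -1;
		/* On a partial write, simply loop; the data is all zeroes anyway. */
		offset += written;
		remaining -= (size_t) written;
	}
	return (ssize_t) count;
}
```

The real pg_pwritev_with_retry() additionally retries on EINTR and caps the per-call size differently, but the shape of the loop, rebuilding an on-stack iovec each iteration rather than keeping a static one, is the point under discussion.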
Re: Support logical replication of DDLs
On 2/16/23 12:53 PM, Alvaro Herrera wrote: On 2023-Feb-16, Jonathan S. Katz wrote: [replication tries to execute this command] 2023-02-16 16:11:10.570 UTC [25207] STATEMENT: CREATE OR REPLACE FUNCTION public.availability_rule_bulk_insert ( IN availability_rule public.availability_rule, IN day_of_week pg_catalog.int4 ) RETURNS pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 AS $_$ INSERT INTO availability ( room_id, availability_rule_id, available_date, available_range ) [which results in:] 2023-02-16 16:11:10.570 UTC [25207] ERROR: relation "availability" does not exist at character 279 I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. Sure, that's fair. That said, the example above would fall under a "typical use case", i.e. I'm replicating functions that call tables without schema qualification. This is pretty common, and as logical replication becomes used for more types of workloads (e.g. high availability), we'll definitely see this. (The fact that this is a security-invoker function prevents you from attaching a SET search_path clause to the function, I believe? Which means it is extra dangerous to have an unqualified table reference there.) Yes, but the level of danger would depend on how the schema is actually used. And while the above pattern is not great, it is still widely common. My high level guess without looking at the code is that the apply worker is not aware of the search_path to use when processing functions during creation. Provided that the publisher/subscriber environments are similar (if not identical), I would expect that if the function create succeeds on the publisher, it should also succeed on the subscriber. 
If we're going to force search_path and all other settings to be identical, then we might as well give up the whole deparsing design and transmit the original string for execution in the replica; it is much simpler. But this idea was rejected outright when this stuff was first proposed years ago. Hm, maybe we go the other way in terms of execution of function bodies, i.e. we don't try to run/parse it on the subscriber? If the function body is just passed in as a string, can we just insert it without doing any evaluation on the source code? I'd have to think a little bit more about the SQL standard bodies (BEGIN ATOMIC)...though AIUI it would possibly be a similar flow (execute on publisher, just copy w/o execution into subscriber)? If I'm using DDL replication, I'm trying to keep my publisher/subscribers synchronized to a reasonable level of consistency, so it is highly likely the function should work when it's called. I know things can go wrong and break, particularly if I've made independent changes to the schema on the subscriber, but that can happen anyway today with functions on a single instance. Thanks, Jonathan
Re: Add WAL read stats to pg_stat_wal
Hi, On 2023-02-16 23:39:00 +0530, Bharath Rupireddy wrote: > While working on [1], I was in need to know WAL read stats (number of > times and amount of WAL data read from disk, time taken to read) to > measure the benefit. I had to write a developer patch to capture WAL > read stats as pg_stat_wal currently emits WAL write stats. With recent > works on pg_stat_io which emit data read IO stats too, I think it's > better to not miss WAL read stats. It might help others who keep an > eye on IOPS of the production servers for various reasons. The WAL > read stats I'm thinking useful are wal_read_bytes - total amount of > WAL read, wal_read - total number of times WAL is read from disk, > wal_read_time - total amount of time spent reading WAL (tracked only > when an existing GUC track_wal_io_timing is on). It doesn't really seem useful to have this in pg_stat_wal, because you can't really figure out where those reads are coming from. Are they crash recovery? Walsender? ...? I think this'd better be handled by adding WAL support for pg_stat_io. Then the WAL reads would be attributed to the relevant backend type, making it easier to answer such questions. Going forward I want to add support for seeing pg_stat_io for individual connections, which'd then automatically support this feature for the WAL reads as well. Eventually I think pg_stat_wal should only track wal_records, wal_fpi, wal_buffers_full and fill the other columns from pg_stat_io. However, this doesn't "solve" the following issue: > Note that the patch needs a bit more work, per [2]. With the patch, > the WAL senders (processes exiting after checkpointer) will generate > stats and we need to either let all or only one WAL sender to write > stats to disk. Allowing one WAL sender to write might be tricky. > Allowing all WAL senders to write might make too many writes to the > stats file. And, we need a lock to let only one process write. I can't > think of a best way here at the moment. > > Thoughts? 
> > [1] > https://www.postgresql.org/message-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=q...@mail.gmail.com > [2] > /* > * Write out stats after shutdown. This needs to be called by exactly one > * process during a normal shutdown, and since checkpointer is shut down > * very late... > * > * Walsenders are shut down after the checkpointer, but currently don't > * report stats. If that changes, we need a more complicated solution. > */ > before_shmem_exit(pgstat_before_server_shutdown, 0); I wonder if we should keep the checkpointer around for longer. If we have checkpointer signal postmaster after it wrote the shutdown checkpoint, postmaster could signal walsenders to shut down, and checkpointer could do some final work, like writing out the stats. I suspect this could be useful for other things as well. It's awkward that we don't have a place to put "just before shutting down" type tasks. And checkpointer seems well suited for that. Greetings, Andres Freund
Re: Time delayed LR (WAS Re: logical replication restrictions)
Hi, On 2023-02-16 14:21:01 +0900, Kyotaro Horiguchi wrote: > I'm not sure why output plugin is involved in the delay mechanism. +many The output plugin absolutely never should be involved in something like this. It was a grave mistake that OutputPluginUpdateProgress() calls were added to the commit callback and then proliferated. > It appears to me that it would be simpler if the delay occurred in reorder > buffer or logical decoder instead. This is a feature specific to walsender. So the triggering logic should either directly live in the walsender, or in a callback set in LogicalDecodingContext. That could be called from decode.c or such. Greetings, Andres Freund
Re: recovery modules
Hi, On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: > On Wed, Feb 15, 2023 at 03:38:21PM +0900, Michael Paquier wrote: > > I may tweak a bit the comments, but nothing more. And I don't think I > > have more to add. Andres, do you have anything you would like to > > mention? Just some minor comments below. None of them needs to be addressed. > @@ -144,10 +170,12 @@ basic_archive_configured(void) > * Archives one file. > */ > static bool > -basic_archive_file(const char *file, const char *path) > +basic_archive_file(ArchiveModuleState *state, const char *file, const char > *path) > { > sigjmp_buf local_sigjmp_buf; Not related to the things changed here, but this should never have been pushed down into individual archive modules. There's absolutely no way that we're going to keep this up2date and working correctly in random archive modules. And it would break if archive modules are ever called outside of pgarch.c. > +static void > +basic_archive_shutdown(ArchiveModuleState *state) > +{ > + BasicArchiveData *data = (BasicArchiveData *) (state->private_data); The parens around (state->private_data) are imo odd. > + basic_archive_context = data->context; > + Assert(CurrentMemoryContext != basic_archive_context); > + > + if (MemoryContextIsValid(basic_archive_context)) > + MemoryContextDelete(basic_archive_context); I guess I'd personally be paranoid and clean data->context after this. Obviously doesn't matter right now, but at some later date it could be that we'd error out after this point, and re-entered the shutdown callback. > + > +/* > + * Archive module callbacks > + * > + * These callback functions should be defined by archive libraries and > returned > + * via _PG_archive_module_init(). ArchiveFileCB is the only required > callback. > + * For more information about the purpose of each callback, refer to the > + * archive modules documentation. 
> + */ > +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state); > +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state); > +typedef bool (*ArchiveFileCB) (ArchiveModuleState *state, const char *file, > const char *path); > +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state); > + > +typedef struct ArchiveModuleCallbacks > +{ > + ArchiveStartupCB startup_cb; > + ArchiveCheckConfiguredCB check_configured_cb; > + ArchiveFileCB archive_file_cb; > + ArchiveShutdownCB shutdown_cb; > +} ArchiveModuleCallbacks; If you wanted you could just define the callback types in the struct now, as we don't need asserts for the types. Greetings, Andres Freund
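As a quick illustration of the state-passing pattern under review, here is a toy, self-contained mock, not server code: ArchiveModuleState is pared down to just private_data, and the demo_* module is hypothetical. What it shows is why each callback receives the state pointer: the module keeps its bookkeeping there (allocated in startup, cleared in shutdown, the defensive reset the review suggests) instead of in globals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Pared-down stand-ins for the server-side types quoted above. */
typedef struct ArchiveModuleState
{
	void	   *private_data;
} ArchiveModuleState;

typedef void (*ArchiveStartupCB) (ArchiveModuleState *state);
typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state);
typedef bool (*ArchiveFileCB) (ArchiveModuleState *state, const char *file, const char *path);
typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state);

typedef struct ArchiveModuleCallbacks
{
	ArchiveStartupCB startup_cb;
	ArchiveCheckConfiguredCB check_configured_cb;
	ArchiveFileCB archive_file_cb;
	ArchiveShutdownCB shutdown_cb;
} ArchiveModuleCallbacks;

/* Toy module: counts archived files in per-state private data. */
typedef struct DemoArchiveData
{
	int			files_archived;
} DemoArchiveData;

static void
demo_startup(ArchiveModuleState *state)
{
	state->private_data = calloc(1, sizeof(DemoArchiveData));
}

static bool
demo_check_configured(ArchiveModuleState *state)
{
	(void) state;
	return true;
}

static bool
demo_archive_file(ArchiveModuleState *state, const char *file, const char *path)
{
	(void) file;
	(void) path;
	((DemoArchiveData *) state->private_data)->files_archived++;
	return true;
}

static void
demo_shutdown(ArchiveModuleState *state)
{
	free(state->private_data);
	state->private_data = NULL;	/* clear it, in case shutdown is re-entered */
}

/* Drive the callbacks the way an archiver loop might; returns files archived. */
static int
demo_run(int nfiles)
{
	ArchiveModuleCallbacks cb = {
		demo_startup, demo_check_configured, demo_archive_file, demo_shutdown
	};
	ArchiveModuleState state = {NULL};
	int			total;

	cb.startup_cb(&state);
	for (int i = 0; i < nfiles; i++)
		if (cb.check_configured_cb(&state))
			(void) cb.archive_file_cb(&state, "segment", "path/segment");
	total = ((DemoArchiveData *) state.private_data)->files_archived;
	cb.shutdown_cb(&state);
	return total;
}
```

Because all per-module data hangs off the state object, two modules (or two instances of one module) could run in the same process without clobbering each other, which is the motivation for threading ArchiveModuleState through every callback.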
Re: Make ON_ERROR_STOP stop on shell script failure
On 23/11/2022 00:16, Matheus Alcantara wrote: --- Original Message --- On Tuesday, November 22nd, 2022 at 20:10, bt22nakamorit wrote: There was a mistake in the error message for \! so I updated the patch. Best, Tatsuhiro Nakamori Hi I was checking your patch and seems that it failed to be applied into the master cleanly. Could you please rebase it? Was also looking into this patch. Tatsuhiro: can you please send a rebased version? Thanks -- Andreas 'ads' Scherbaum German PostgreSQL User Group European PostgreSQL User Group - Board of Directors Volunteer Regional Contact, Germany - PostgreSQL Project
Re: Make ON_ERROR_STOP stop on shell script failure
On 16/02/2023 20:33, Andreas 'ads' Scherbaum wrote: On 23/11/2022 00:16, Matheus Alcantara wrote: --- Original Message --- On Tuesday, November 22nd, 2022 at 20:10, bt22nakamorit wrote: There was a mistake in the error message for \! so I updated the patch. Best, Tatsuhiro Nakamori Hi I was checking your patch and seems that it failed to be applied into the master cleanly. Could you please rebase it? Was also looking into this patch. Tatsuhiro: can you please send a rebased version? The email address is bouncing. That might be why ... -- Andreas 'ads' Scherbaum German PostgreSQL User Group European PostgreSQL User Group - Board of Directors Volunteer Regional Contact, Germany - PostgreSQL Project
Re: Support logical replication of DDLs
On 2023-Feb-16, Jonathan S. Katz wrote: > On 2/16/23 12:53 PM, Alvaro Herrera wrote: > > I don't think this is the fault of logical replication. Consider that > > for the backend server, the function source code is just an opaque > > string that is given to the plpgsql engine to interpret. So there's no > > way for the logical DDL replication engine to turn this into runnable > > code if the table name is not qualified. > > Sure, that's fair. That said, the example above would fall under a "typical > use case", i.e. I'm replicating functions that call tables without schema > qualification. This is pretty common, and as logical replication becomes > used for more types of workloads (e.g. high availability), we'll definitely > see this. Hmm, I think you're saying that replay should turn check_function_bodies off, and I think I agree with that. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Re: Support logical replication of DDLs
On 2/16/23 2:38 PM, Alvaro Herrera wrote: On 2023-Feb-16, Jonathan S. Katz wrote: On 2/16/23 12:53 PM, Alvaro Herrera wrote: I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. Sure, that's fair. That said, the example above would fall under a "typical use case", i.e. I'm replicating functions that call tables without schema qualification. This is pretty common, and as logical replication becomes used for more types of workloads (e.g. high availability), we'll definitely see this. Hmm, I think you're saying that replay should turn check_function_bodies off, and I think I agree with that. Yes, exactly. +1 The docs seem to think that is the correct approach too: "Set this parameter to off before loading functions on behalf of other users"[1]. [1] https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-CHECK-FUNCTION-BODIES
Re: recovery modules
Thanks for reviewing. On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: > On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: >> @@ -144,10 +170,12 @@ basic_archive_configured(void) >> * Archives one file. >> */ >> static bool >> -basic_archive_file(const char *file, const char *path) >> +basic_archive_file(ArchiveModuleState *state, const char *file, const char >> *path) >> { >> sigjmp_buf local_sigjmp_buf; > > Not related the things changed here, but this should never have been pushed > down into individual archive modules. There's absolutely no way that we're > going to keep this up2date and working correctly in random archive > modules. And it would break if archive modules are ever called outside of > pgarch.c. Yeah. IIRC I did briefly try to avoid this, but the difficulty was that each module will have its own custom cleanup logic. There's no requirement that a module creates an exception handler, but I imagine that any sufficiently complex one will. In any case, I agree that it's worth trying to pull this out of the individual modules. >> +static void >> +basic_archive_shutdown(ArchiveModuleState *state) >> +{ >> +BasicArchiveData *data = (BasicArchiveData *) (state->private_data); > > The parens around (state->private_data) are imo odd. Oops, removed. >> +basic_archive_context = data->context; >> +Assert(CurrentMemoryContext != basic_archive_context); >> + >> +if (MemoryContextIsValid(basic_archive_context)) >> +MemoryContextDelete(basic_archive_context); > > I guess I'd personally be paranoid and clean data->context after > this. Obviously doesn't matter right now, but at some later date it could be > that we'd error out after this point, and re-entered the shutdown callback. Done. >> +/* >> + * Archive module callbacks >> + * >> + * These callback functions should be defined by archive libraries and >> returned >> + * via _PG_archive_module_init(). ArchiveFileCB is the only required >> callback. 
>> + * For more information about the purpose of each callback, refer to the >> + * archive modules documentation. >> + */ >> +typedef void (*ArchiveStartupCB) (ArchiveModuleState *state); >> +typedef bool (*ArchiveCheckConfiguredCB) (ArchiveModuleState *state); >> +typedef bool (*ArchiveFileCB) (ArchiveModuleState *state, const char *file, >> const char *path); >> +typedef void (*ArchiveShutdownCB) (ArchiveModuleState *state); >> + >> +typedef struct ArchiveModuleCallbacks >> +{ >> +ArchiveStartupCB startup_cb; >> +ArchiveCheckConfiguredCB check_configured_cb; >> +ArchiveFileCB archive_file_cb; >> +ArchiveShutdownCB shutdown_cb; >> +} ArchiveModuleCallbacks; > > If you wanted you could just define the callback types in the struct now, as > we don't need asserts for the types. This crossed my mind. I thought it was nice to have a declaration for each callback that we can copy into the docs, but I'm sure we could do without it, too. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 5969db5576f9d475169f1b0e25fcb4d9d331c9f4 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Thu, 9 Feb 2023 13:49:46 -0800 Subject: [PATCH v15 1/1] restructure archive modules API --- contrib/basic_archive/basic_archive.c | 86 --- doc/src/sgml/archive-modules.sgml | 35 ++-- src/backend/Makefile | 2 +- src/backend/archive/Makefile | 18 src/backend/archive/meson.build | 5 ++ .../{postmaster => archive}/shell_archive.c | 32 --- src/backend/meson.build | 1 + src/backend/postmaster/Makefile | 1 - src/backend/postmaster/meson.build| 1 - src/backend/postmaster/pgarch.c | 27 +++--- src/backend/utils/misc/guc_tables.c | 1 + src/include/archive/archive_module.h | 58 + src/include/archive/shell_archive.h | 24 ++ src/include/postmaster/pgarch.h | 39 - 14 files changed, 243 insertions(+), 87 deletions(-) create mode 100644 src/backend/archive/Makefile create mode 100644 src/backend/archive/meson.build rename src/backend/{postmaster => archive}/shell_archive.c (77%) create mode 
100644 src/include/archive/archive_module.h create mode 100644 src/include/archive/shell_archive.h diff --git a/contrib/basic_archive/basic_archive.c b/contrib/basic_archive/basic_archive.c index 36b7a4814a..cd852888ce 100644 --- a/contrib/basic_archive/basic_archive.c +++ b/contrib/basic_archive/basic_archive.c @@ -30,9 +30,9 @@ #include #include +#include "archive/archive_module.h" #include "common/int.h" #include "miscadmin.h" -#include "postmaster/pgarch.h" #include "storage/copydir.h" #include "storage/fd.h" #include "utils/guc.h" @@ -40,14 +40,27 @@ PG_MODULE_MAGIC; +typedef struct BasicArchiveData +{ + MemoryContext context; +} BasicArchiveData; + static char *archive_dire
Re: Support logical replication of DDLs
On 2/16/23 2:43 PM, Jonathan S. Katz wrote: On 2/16/23 2:38 PM, Alvaro Herrera wrote: On 2023-Feb-16, Jonathan S. Katz wrote: On 2/16/23 12:53 PM, Alvaro Herrera wrote: I don't think this is the fault of logical replication. Consider that for the backend server, the function source code is just an opaque string that is given to the plpgsql engine to interpret. So there's no way for the logical DDL replication engine to turn this into runnable code if the table name is not qualified. Sure, that's fair. That said, the example above would fall under a "typical use case", i.e. I'm replicating functions that call tables without schema qualification. This is pretty common, and as logical replication becomes used for more types of workloads (e.g. high availability), we'll definitely see this. Hmm, I think you're saying that replay should turn check_function_bodies off, and I think I agree with that. Yes, exactly. +1 I drilled into this a bit more using the SQL standard bodies (BEGIN ATOMIC) to see if there were any other behaviors we needed to account for. Overall, it worked well but I ran into one issue. First, functions with "BEGIN ATOMIC" ignores "check_function_bodies" which is by design based on how this feature works. We should still turn "check_function_bodies" to "off" though, per above discussion. In the context of DDL replication, "BEGIN ATOMIC" does support schema-unqualified functions, presumably because it includes the parsed content? I created an updated example[1] where I converted the SQL functions to use the standard syntax and I returned the table names to be schema unqualified. 
This seemed to work, but I ran into a weird case with this function: CREATE OR REPLACE FUNCTION public.calendar_manage(room_id int, calendar_date date) RETURNS void LANGUAGE SQL BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM calendar WHERE room_id = $1 AND calendar_date = $2 ) INSERT INTO calendar (room_id, status, calendar_date, calendar_range) SELECT $1, c.status, $2, c.calendar_range FROM calendar_generate_calendar($1, tstzrange($2, $2 + 1)) c; END; This produced an error on the subscriber, with the following message: 2023-02-16 20:58:24.096 UTC [26864] ERROR: missing FROM-clause entry for table "calendar_1" at character 322 2023-02-16 20:58:24.096 UTC [26864] CONTEXT: processing remote data for replication origin "pg_18658" during message type "DDL" in transaction 980, finished at 0/C099A7D8 2023-02-16 20:58:24.096 UTC [26864] STATEMENT: CREATE OR REPLACE FUNCTION public.calendar_manage ( IN room_id pg_catalog.int4, IN calendar_date pg_catalog.date ) RETURNS pg_catalog.void LANGUAGE sql VOLATILE PARALLEL UNSAFE CALLED ON NULL INPUT SECURITY INVOKER COST 100 BEGIN ATOMIC WITH delete_calendar AS ( DELETE FROM public.calendar WHERE ((calendar_1.room_id OPERATOR(pg_catalog.=) calendar_manage.room_id) AND (calendar_1.calendar_date OPERATOR(pg_catalog.=) calendar_manage.calendar_date)) ) INSERT INTO public.calendar (room_id, status, calendar_date, calendar_range) SELECT calendar_manage.room_id, c.status, calendar_manage.calendar_date, c.calendar_range FROM public.calendar_generate_calendar(calendar_manage.room_id, pg_catalog.tstzrange((calendar_manage.calendar_date)::timestamp with time zone, ((calendar_manage.calendar_date OPERATOR(pg_catalog.+) 1))::timestamp with time zone)) c(status, calendar_range); END This seemed to add an additional, incorrect reference to the origin table for the "room_id" and "calendar_date" attributes within the CTE of this function. 
I don't know if this is directly related to the DDL replication patch, but reporting it as I triggered the behavior through it. Thanks, Jonathan [1] https://gist.github.com/jkatz/fe29006b724fd6f32ee849a96dc01608 OpenPGP_signature Description: OpenPGP digital signature
Re: recovery modules
Hi, On 2023-02-16 12:15:12 -0800, Nathan Bossart wrote: > Thanks for reviewing. > > On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: > > On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: > >> @@ -144,10 +170,12 @@ basic_archive_configured(void) > >> * Archives one file. > >> */ > >> static bool > >> -basic_archive_file(const char *file, const char *path) > >> +basic_archive_file(ArchiveModuleState *state, const char *file, const > >> char *path) > >> { > >>sigjmp_buf local_sigjmp_buf; > > > > Not related the things changed here, but this should never have been pushed > > down into individual archive modules. There's absolutely no way that we're > > going to keep this up2date and working correctly in random archive > > modules. And it would break if archive modules are ever called outside of > > pgarch.c. > > Yeah. IIRC I did briefly try to avoid this, but the difficulty was that > each module will have its own custom cleanup logic. It can use PG_TRY/CATCH for that, if the top-level sigsetjmp is in pgarch.c. Or you could add a cleanup callback to the API, to be called after the top-level cleanup in pgarch.c. I'm quite baffled by: /* Close any files left open by copy_file() or compare_files() */ AtEOSubXact_Files(false, InvalidSubTransactionId, InvalidSubTransactionId); in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() completely outside the context of a transaction environment. And it only does the thing you want because you pass parameters that aren't actually valid in the normal use in AtEOSubXact_Files(). I really don't understand how that's supposed to be ok. Greetings, Andres Freund
Re: Reconcile stats in find_tabstat_entry() and get rid of PgStat_BackendFunctionEntry
Hi, On 2023-02-15 09:21:48 +0100, Drouvot, Bertrand wrote: > diff --git a/src/backend/utils/activity/pgstat_relation.c > b/src/backend/utils/activity/pgstat_relation.c > index f793ac1516..b26e2a5a7a 100644 > --- a/src/backend/utils/activity/pgstat_relation.c > +++ b/src/backend/utils/activity/pgstat_relation.c > @@ -471,20 +471,46 @@ pgstat_fetch_stat_tabentry_ext(bool shared, Oid reloid) > * Find any existing PgStat_TableStatus entry for rel_id in the current > * database. If not found, try finding from shared tables. > * > + * If an entry is found, copy it and increment the copy's counters with their > + * subtransactions counterparts. Then return the copy. There is no need for > the > + * caller to pfree the copy as the MemoryContext will be reset soon after. > + * The "There is no need" bit seems a bit off. Yes, that's true for the current callers, but who says that it has to stay that way? Otherwise this looks ready, on a casual scan. Greetings, Andres Freund
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On 2023-02-16 12:37:57 +0530, Robert Haas wrote: > The patch creates 100_bugs.pl What's the story behind 100_bugs.pl? This name clearly is copied from src/test/subscription/t/100_bugs.pl - but I've never understood why that is outside of the normal numbering space.
Re: Missing free_var() at end of accum_sum_final()?
Hi, On 2023-02-16 15:26:26 +0900, Michael Paquier wrote: > On Thu, Feb 16, 2023 at 06:59:13AM +0100, Joel Jacobson wrote: > > I noticed the NumericVar's pos_var and neg_var are not free_var()'d > > at the end of accum_sum_final(). > > > > The potential memory leak seems small, since the function is called > > only once per sum() per worker (and from a few more places), but > > maybe they should be free'd anyways for correctness? > > Indeed, it is true that any code path of numeric.c that relies on a > NumericVar with an allocation done in its buffer is careful enough to > free it, except for generate_series's SRF where one step of the > computation is done. I don't see directly why you could not do the > following: > @@ -11973,6 +11973,9 @@ accum_sum_final(NumericSumAccum *accum, NumericVar > *result) > /* And add them together */ > add_var(&pos_var, &neg_var, result); > > + free_var(&pos_var); > + free_var(&neg_var); > + > /* Remove leading/trailing zeroes */ > strip_var(result); But why do we need it? Most SQL callable functions don't need to be careful about not leaking O(1) memory, the exception being functions backing btree opclasses. In fact, the detailed memory management often is *more* expensive than just relying on the calling memory context being reset. Of course, numeric.c doesn't really seem to have gotten that message, so there's a consistency argument here. Greetings, Andres Freund
Re: Add connection active, idle time to pg_stat_activity
On Wed, Feb 1, 2023 at 12:46 PM Sergey Dudoladov wrote: > > I've sketched the first version of a patch to add pg_stat_session. > Please review this early version. Hi Sergey! I've taken a look at the patch and have some notes. 1. It is hard to understand what the fastpath backend state is. What do fastpath metrics mean for a user? 2. Anyway, the path "if (PGSTAT_IS_FASTPATH(beentry))" seems unreachable to me. I'm a bit surprised that compilers do not produce warnings about it. Maybe I'm just wrong. 3. Tests do not check any incrementation logic. I think we can have a test that verifies the delta of select some_counter from pg_stat_session where pid = pg_backend_pid(); 4. Macros like PGSTAT_IS_RUNNING do not look like a net win in code readability, and the PGSTAT prefix carries no semantic load. That's all I've found so far. Thank you! Best regards, Andrey Borodin. PS. We did an on-air review session [0]; I hope Nik will chime in with the "usability part of a review". [0] https://youtu.be/vTV8XhWf3mo?t=2404
Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi everyone, I've been working on a federated database project that heavily relies on foreign data wrappers. During benchmarking, we noticed high system CPU usage in OLTP-related cases, which we traced back to multiple brk calls resulting from block frees in AllocSetReset upon ExecutorEnd's FreeExecutorState. This happens because FDWs allocate their own derived execution states and required data structures within this context; these exceed the initial 8K allocation and must be cleaned up. Increasing the default query context allocation from ALLOCSET_DEFAULT_SIZES to a larger initial "appropriate size" solved the issue and almost doubled the throughput. However, the "appropriate size" is workload- and implementation-dependent, so making it configurable may be better than increasing the defaults, which would negatively impact users (memory-wise) who aren't encountering this scenario. I have a patch to make it configurable, but before submitting it, I wanted to hear your thoughts and feedback on this and any other alternative ideas you may have. -- Jonah H. Harris
Re: recovery modules
On Thu, Feb 16, 2023 at 01:17:54PM -0800, Andres Freund wrote: > On 2023-02-16 12:15:12 -0800, Nathan Bossart wrote: >> On Thu, Feb 16, 2023 at 11:29:56AM -0800, Andres Freund wrote: >> > On 2023-02-15 10:44:07 -0800, Nathan Bossart wrote: >> >> @@ -144,10 +170,12 @@ basic_archive_configured(void) >> >> * Archives one file. >> >> */ >> >> static bool >> >> -basic_archive_file(const char *file, const char *path) >> >> +basic_archive_file(ArchiveModuleState *state, const char *file, const >> >> char *path) >> >> { >> >> sigjmp_buf local_sigjmp_buf; >> > >> > Not related the things changed here, but this should never have been pushed >> > down into individual archive modules. There's absolutely no way that we're >> > going to keep this up2date and working correctly in random archive >> > modules. And it would break if archive modules are ever called outside of >> > pgarch.c. >> >> Yeah. IIRC I did briefly try to avoid this, but the difficulty was that >> each module will have its own custom cleanup logic. > > It can use PG_TRY/CATCH for that, if the top-level sigsetjmp is in pgarch.c. > Or you could add a cleanup callback to the API, to be called after the > top-level cleanup in pgarch.c. Yeah, that seems workable. > I'm quite baffled by: > /* Close any files left open by copy_file() or compare_files() > */ > AtEOSubXact_Files(false, InvalidSubTransactionId, > InvalidSubTransactionId); > > in basic_archive_file(). It seems *really* off to call AtEOSubXact_Files() > completely outside the context of a transaction environment. And it only does > the thing you want because you pass parameters that aren't actually valid in > the normal use in AtEOSubXact_Files(). I really don't understand how that's > supposed to be ok. Hm. Should copy_file() and compare_files() have PG_FINALLY blocks that attempt to close the files instead? What would you recommend? -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: [PATCH] Add pretty-printed XML output option
On 16.02.23 00:13, Peter Smith wrote: Today I fetched and tried the latest v11. It is failing too, but only just. - see attached file pretty-v11-results It looks only due to a whitespace EOF issue in the xml_2.out @@ -1679,4 +1679,4 @@ -- XML format: empty string SELECT xmlformat(''); ERROR: invalid XML document -\set VERBOSITY default \ No newline at end of file +\set VERBOSITY default -- The attached patch update (v12-0002) fixes the xml_2.out for me. I'm squashing v12-0001 and v12-0002 (v13 attached). There is still an open discussion on renaming the function to xmlserialize,[1] but it shouldn't be too difficult to change it later in case we reach a consensus :) Thanks! Jim 1- https://www.postgresql.org/message-id/CANNMO%2BKwb4_87G8qDeN%2BVk1B1vX3HvgoGW%2B13fJ-b6rj7qbAww%40mail.gmail.com From e28e9da7890d07e10f412ad61318d7a9ce4d058c Mon Sep 17 00:00:00 2001 From: Jim Jones Date: Thu, 16 Feb 2023 22:36:19 +0100 Subject: [PATCH v13] Add pretty-printed XML output option This small patch introduces a XML pretty print function. It basically takes advantage of the indentation feature of xmlDocDumpFormatMemory from libxml2 to format XML strings. --- doc/src/sgml/func.sgml | 34 ++ src/backend/utils/adt/xml.c | 45 + src/include/catalog/pg_proc.dat | 3 + src/test/regress/expected/xml.out | 101 src/test/regress/expected/xml_1.out | 53 +++ src/test/regress/expected/xml_2.out | 101 src/test/regress/sql/xml.sql| 33 + 7 files changed, 370 insertions(+) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index e09e289a43..a621192425 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -14861,6 +14861,40 @@ SELECT xmltable.* ]]> + + +xmlformat + + + xmlformat + + + +xmlformat ( xml ) xml + + + + Converts the given XML value to pretty-printed, indented text. 
+ + + + Example: + + + + diff --git a/src/backend/utils/adt/xml.c b/src/backend/utils/adt/xml.c index 079bcb1208..ec12707b5c 100644 --- a/src/backend/utils/adt/xml.c +++ b/src/backend/utils/adt/xml.c @@ -473,6 +473,51 @@ xmlBuffer_to_xmltype(xmlBufferPtr buf) } #endif +Datum +xmlformat(PG_FUNCTION_ARGS) +{ +#ifdef USE_LIBXML + + xmlDocPtr doc; + xmlChar*xmlbuf = NULL; + text *arg = PG_GETARG_TEXT_PP(0); + StringInfoData buf; + int nbytes; + + doc = xml_parse(arg, XMLOPTION_DOCUMENT, false, GetDatabaseEncoding(), NULL); + + if(!doc) + elog(ERROR, "could not parse the given XML document"); + + /** + * xmlDocDumpFormatMemory ( + * xmlDocPtr doc, # the XML document + * xmlChar **xmlbuf, # the memory pointer + * int *nbytes, # the memory length + * int format # 1 = node indenting + *) + */ + + xmlDocDumpFormatMemory(doc, &xmlbuf, &nbytes, 1); + + xmlFreeDoc(doc); + + if(!nbytes) + elog(ERROR, "could not indent the given XML document"); + + initStringInfo(&buf); + appendStringInfoString(&buf, (const char *)xmlbuf); + + xmlFree(xmlbuf); + + PG_RETURN_XML_P(stringinfo_to_xmltype(&buf)); + +#else + NO_XML_SUPPORT(); +return 0; +#endif +} + Datum xmlcomment(PG_FUNCTION_ARGS) diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index c0f2a8a77c..54e8a6262a 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -8842,6 +8842,9 @@ { oid => '3053', descr => 'determine if a string is well formed XML content', proname => 'xml_is_well_formed_content', prorettype => 'bool', proargtypes => 'text', prosrc => 'xml_is_well_formed_content' }, +{ oid => '4642', descr => 'Indented text from xml', + proname => 'xmlformat', prorettype => 'xml', + proargtypes => 'xml', prosrc => 'xmlformat' }, # json { oid => '321', descr => 'I/O', diff --git a/src/test/regress/expected/xml.out b/src/test/regress/expected/xml.out index 3c357a9c7e..e45116aaa7 100644 --- a/src/test/regress/expected/xml.out +++ b/src/test/regress/expected/xml.out @@ 
-1599,3 +1599,104 @@ SELECT * FROM XMLTABLE('.' PASSING XMLELEMENT(NAME a) columns a varchar(20) PATH |(1 row) +-- XML format: single line XML string +SELECT xmlformat('Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650'); +xmlformat +-- + + + + + Belgian Waffles+ + $5.95+ + Two of our famous Belgian Waffles with plenty of real maple syrup+ + 650
Introduce list_reverse() to make lcons() usage less inefficient
While working on [1] to improve the query planner's speed at finding EquivalenceMembers in an EquivalenceClass, performance tests with that patch (which has a large performance impact of its own) started to highlight some other places that bottleneck the planner's performance. One of those places is in generate_orderedappend_paths(), when we find that the required sort order is the same as the reverse of the partition order. In this case, we build a list of paths for each partition using the lcons() function. Since Lists are now arrays in PostgreSQL, lcons() isn't as efficient as it once was: it must memmove the entire existing contents of the list up one element to make room to prepend the new element. This is effectively quadratic and becomes noticeable with a large number of partitions. One way we could solve that is to just lappend() the new item and reverse the order of the list only when we need to. This has the added advantage of removing a bunch of semi-duplicated code from generate_orderedappend_paths(). It also has a noticeable impact on the planner's performance. 
I did a quick test with: create table lp (a int, b int) partition by list(a); select 'create table lp'||x::text||' partition of lp for values in('||x::text||');' from generate_Series(1,1)x; \gexec create index on lp(a); Using: psql -c "explain (analyze, timing off) select * from lp order by a desc" postgres | grep "Planning Time" master: Planning Time: 6034.765 ms Planning Time: 5919.914 ms Planning Time: 5720.529 ms master + eclass idx (from [1]) (yes, it really is this much faster) Planning Time: 549.262 ms Planning Time: 489.023 ms Planning Time: 497.803 ms master + eclass idx + list_reverse (attached) Planning Time: 517.067 ms Planning Time: 463.613 ms Planning Time: 463.036 ms I suspect there won't be much controversy here and there's certainly not much complexity, so in absence of anyone voicing an opinion here, I'm inclined to not waste too much time on this one and just get it done. I'll leave it for a few days. David [1] https://postgr.es/m/flat/caj2pmkzncgoukse+_5lthd+kbxkvq6h2hqn8esxpxd+cxmg...@mail.gmail.com diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c index a709d23ef1..5f1f675e44 100644 --- a/src/backend/nodes/list.c +++ b/src/backend/nodes/list.c @@ -1654,6 +1654,37 @@ list_copy_deep(const List *oldlist) return newlist; } +/* + * Reverses the order of 'list' elements in place and returns the input list + */ +List * +list_reverse(List *list) +{ + ListCell *head; + ListCell *tail; + + if (list == NIL) + return NIL; + + head = &list->elements[0]; + tail = &list->elements[list->length - 1]; + + while (head < tail) + { + ListCell tmp; + + /* Swap data at the head and tail position */ + memcpy(&tmp, head, sizeof(ListCell)); + memcpy(head, tail, sizeof(ListCell)); + memcpy(tail, &tmp, sizeof(ListCell)); + + head++; + tail--; + } + + return list; +} + /* * Sort a list according to the specified comparator function. 
* diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c index ae0f9bdc8a..4356102399 100644 --- a/src/backend/optimizer/path/allpaths.c +++ b/src/backend/optimizer/path/allpaths.c @@ -1801,7 +1801,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel, * Collect the appropriate child paths. The required logic varies * for the Append and MergeAppend cases. */ - if (match_partition_order) + if (match_partition_order || match_partition_order_desc) { /* * We're going to make a plain Append path. We don't need @@ -1822,25 +1822,6 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel, fractional_subpaths = lappend(fractional_subpaths, cheapest_fractional); } } - else if (match_partition_order_desc) - { - /* -* As above, but we need to reverse the order of the children, -* because nodeAppend.c doesn't know anything about reverse -* ordering and will scan the children in the order presented. -*/ - cheapest_startup = get_singleton_append_subpath(cheapest_startup); - cheapest_total = get_singleton_append_subpath(cheapest_total); - - startup_subpaths = lcons(cheapest_st
Re: ATTACH PARTITION seems to ignore column generation status
Alvaro Herrera writes: > On 2023-Feb-16, Alexander Lakhin wrote: >> I've encountered a query that triggers an assert added in that commit: >> CREATE TABLE t(a int, b int GENERATED ALWAYS AS (a) STORED) PARTITION BY >> RANGE (a); >> CREATE TABLE tp PARTITION OF t(b DEFAULT 1) FOR VALUES FROM (0) to (1); > It seems wrong that this command is accepted. It should have given an > error, because the partition is not allowed to override the generation > of the value that is specified in the parent table. Agreed. We missed a check somewhere, will fix. regards, tom lane
psql \watch 2nd argument: iteration count
Hi hackers! From time to time I want to collect some stats from locks, activity and other stat views into one table at different points in time. In this case the \watch psql command is very handy. However, it's not currently possible to specify the number of times a query is performed. Also, if we do not provide a timespan, 2 seconds are selected; but if we provide an incorrect argument, 1 second is selected. PFA a patch that adds an iteration count argument and makes the timespan argument more consistent. What do you think? Best regards, Andrey Borodin. 0001-Add-iteration-count-argument-to-psql-watch-command.patch Description: Binary data
Re: [PATCH] Add pretty-printed XML output option
On Thu, Feb 16, 2023 at 2:12 PM Jim Jones wrote: > > I'm squashing v12-0001 and v12-0002 (v13 attached). I've looked into the patch. The code looks to conform to usual expectations. One nit: this comment should have just one asterisk. + /** And I have a dumb question: is this function protected from using external XML namespaces? What if the user passes some xmlns that will force it to read namespace data from the server filesystem? Or is it not possible? I see there are a lot of calls to xml_parse() anyway, but still... Best regards, Andrey Borodin.
Re: Introduce list_reverse() to make lcons() usage less inefficient
Hi, On 2023-02-17 11:36:40 +1300, David Rowley wrote: > While working on [1] to make improvements in the query planner around > the speed to find EquivalenceMembers in an EquivalenceClass, because > that patch does have a large impact in terms of performance > improvements, some performance tests with that patch started to > highlight some other places that bottleneck the planner's performance. > > One of those places is in generate_orderedappend_paths() when we find > that the required sort order is the same as the reverse of the > partition order. In this case, we build a list of paths for each > partition using the lcons() function. Since Lists are now arrays in > PostgreSQL, lcons() isn't as efficient as it once was and it must > memmove the entire existing contents of the list up one element to > make way to prepend the new element. This is effectively quadratic and > becomes noticeable with a large number of partitions. I have wondered before if we eventually ought to switch to embedded lists for some planner structures, including paths. add_path() inserts/deletes at points in the middle of the list, which isn't great. > One way we could solve that is to just lappend() the new item and then > just reverse the order of the list only when we need to. That's not generally the same as lcons() ing, but I guess it's fine here, because we build the lists from scratch, so the reversing actually yields the correct result. But wouldn't an even cheaper way here be to iterate over the children in reverse order when match_partition_order_desc? We can do that efficiently now. Looks like we don't have a readymade helper for it, but it'd be easy enough to add or open code. Greetings, Andres Freund
Re: Progress report of CREATE INDEX for nested partitioned tables
On Wed, Feb 08, 2023 at 04:40:49PM -0600, Justin Pryzby wrote: > This squishes together 001/2 as the main patch. > I believe it's ready. Update to address a compiler warning in the supplementary patches adding assertions. >From 71427bf7cd9927af04513ba3fe99e481a8ba1f61 Mon Sep 17 00:00:00 2001 From: Ilya Gladyshev Date: Tue, 31 Jan 2023 19:13:07 +0400 Subject: [PATCH 1/3] fix CREATE INDEX progress report with nested partitions The progress reporting was added in v12 (ab0dfc961) but the original patch didn't seem to consider the possibility of nested partitioning. When called recursively, DefineIndex() would clobber the number of completed partitions, and it was possible to end up with the TOTAL counter greater than the DONE counter. This clarifies/re-defines that the progress reporting counts both direct and indirect children, but doesn't count intermediate partitioned tables: - The TOTAL counter is set once at the start of the command. - For indexes which are newly-built, the recursively-called DefineIndex() increments the DONE counter. - For pre-existing indexes which are ATTACHed rather than built, DefineIndex() increments the DONE counter, and if the attached index is partitioned, the counter is incremented to account for each of its leaf partitions. 
Author: Ilya Gladyshev Reviewed-By: Justin Pryzby, Tomas Vondra, Dean Rasheed, Alvaro Herrera, Matthias van de Meent Discussion: https://www.postgresql.org/message-id/flat/a15f904a70924ffa4ca25c3c744cff31e0e6e143.camel%40gmail.com --- doc/src/sgml/monitoring.sgml | 10 ++- src/backend/commands/indexcmds.c | 70 +-- src/backend/utils/activity/backend_progress.c | 28 src/include/utils/backend_progress.h | 1 + 4 files changed, 102 insertions(+), 7 deletions(-) diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index dca50707ad4..945d64ed2fa 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -6896,7 +6896,10 @@ FROM pg_stat_get_backend_idset() AS backendid; When creating an index on a partitioned table, this column is set to - the total number of partitions on which the index is to be created. + the total number of partitions on which the index is to be created or attached. + In the case of intermediate partitioned tables, this includes both + direct and indirect partitions, but excludes the intermediate + partitioned tables themselves. This field is 0 during a REINDEX. @@ -6907,7 +6910,10 @@ FROM pg_stat_get_backend_idset() AS backendid; When creating an index on a partitioned table, this column is set to - the number of partitions on which the index has been created. + the number of partitions on which the index has been created or attached. + In the case of intermediate partitioned tables, this includes both + direct and indirect partitions, but excludes the intermediate + partitioned tables themselves. This field is 0 during a REINDEX. diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 16ec0b114e6..147265ddcca 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -130,6 +130,30 @@ typedef struct ReindexErrorInfo char relkind; } ReindexErrorInfo; + +/* + * Count the number of direct and indirect leaf partitions. 
Note that this + * also excludes foreign tables. + */ +static int +count_leaf_partitions(Oid relid) +{ + int nleaves = 0; + List *childs = find_all_inheritors(relid, NoLock, NULL); + ListCell *lc; + + foreach(lc, childs) + { + Oid partrelid = lfirst_oid(lc); + + if (RELKIND_HAS_STORAGE(get_rel_relkind(partrelid))) + nleaves++; + } + + list_free(childs); + return nleaves; +} + /* * CheckIndexCompatible * Determine whether an existing index definition is compatible with a @@ -1219,8 +1243,18 @@ DefineIndex(Oid relationId, Relation parentIndex; TupleDesc parentDesc; - pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, - nparts); + /* + * Set the total number of partitions at the start of the command; + * don't update it when being called recursively. + */ + if (!OidIsValid(parentIndexId)) + { +int total_parts; + +total_parts = count_leaf_partitions(relationId); +pgstat_progress_update_param(PROGRESS_CREATEIDX_PARTITIONS_TOTAL, + total_parts); + } /* Make a local copy of partdesc->oids[], just for safety */ memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts); @@ -1250,6 +1284,7 @@ DefineIndex(Oid relationId, { Oid childRelid = part_oids[i]; Relation childrel; +char child_relkind; Oid child_save_userid; int child_save_sec_context; int child_save_nestlevel; @@ -1259,6 +1294,7 @@ DefineIndex(Oid relationId, bool found = false; childrel = tabl
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi, On 2023-02-16 16:49:07 -0500, Jonah H. Harris wrote: > I've been working on a federated database project that heavily relies on > foreign data wrappers. During benchmarking, we noticed high system CPU > usage in OLTP-related cases, which we traced back to multiple brk calls > resulting from block frees in AllocSetReset upon ExecutorEnd's > FreeExecutorState. This is because FDWs allocate their own derived > execution states and required data structures within this context, > exceeding the initial 8K allocation, that need to be cleaned-up. What PG version? Do you have a way to reproduce this with core code, e.g. postgres_fdw/file_fdw? What is all that memory used for? Is it possible that the real issue is too many tiny allocations, due to some allocation growing slowly? > Increasing the default query context allocation from ALLOCSET_DEFAULT_SIZES > to a larger initial "appropriate size" solved the issue and almost doubled > the throughput. However, the "appropriate size" is workload and > implementation dependent, so making it configurable may be better than > increasing the defaults, which would negatively impact users (memory-wise) > who aren't encountering this scenario. > > I have a patch to make it configurable, but before submitting it, I wanted > to hear your thoughts and feedback on this and any other alternative ideas > you may have. This seems way too magic to expose to users. How would they ever know how to set it? And it will depend heavily on the specific queries, so a global config won't work well. If the issue is a specific FDW needing to make a lot of allocations, I can see adding an API to tell a memory context that it ought to be ready to allocate a certain amount of memory efficiently (e.g. by increasing the block size of the next allocation by more than 2x). Greetings, Andres Freund
Re: Make set_ps_display faster and easier to use
Hi, On 2023-02-16 14:19:24 +1300, David Rowley wrote: > After fixing up the set_ps_display()s to use set_ps_display_with_len() > where possible, I discovered some not so nice code which appends " > waiting" onto the process title. Basically, there's a bunch of code > that looks like this: > > const char *old_status; > int len; > > old_status = get_ps_display(&len); > new_status = (char *) palloc(len + 8 + 1); > memcpy(new_status, old_status, len); > strcpy(new_status + len, " waiting"); > set_ps_display(new_status); > new_status[len] = '\0'; /* truncate off " waiting" */ Yea, that code is atrocious... It took me a while to figure out that no, LockBufferForCleanup() isn't leaking memory, because it'll always reach the cleanup path *further up* in the function. Avoiding the allocation across loop iterations seems like a completely pointless optimization in these paths - we add the " waiting", precisely because it's a slow path. But of course not allocating memory would be even better... > Seeing that made me wonder if we shouldn't just have something more > generic for setting a suffix on the process title. I came up with > set_ps_display_suffix() and set_ps_display_remove_suffix(). The above > code can just become: > > set_ps_display_suffix("waiting"); > > then to remove the "waiting" suffix, just: > > set_ps_display_remove_suffix(); That'd definitely be better. It's not really a topic for this patch, but somehow the fact that we have these set_ps_display() calls all over feels wrong, particularly because most of them are paired with a pgstat_report_activity() call. It's not entirely obvious how it should be instead, but it doesn't feel right. > +/* > + * set_ps_display_suffix > + * Adjust the process title to append 'suffix' onto the end with a > space > + * between it and the current process title. > + */ > +void > +set_ps_display_suffix(const char *suffix) > +{ > + size_t len; Think this will give you an unused-variable warning in the PS_USE_NONE case. 
> +#ifndef PS_USE_NONE > + /* update_process_title=off disables updates */ > + if (!update_process_title) > + return; > + > + /* no ps display for stand-alone backend */ > + if (!IsUnderPostmaster) > + return; > + > +#ifdef PS_USE_CLOBBER_ARGV > + /* If ps_buffer is a pointer, it might still be null */ > + if (!ps_buffer) > + return; > +#endif This bit is now repeated three times. How about putting it into a helper? > +#ifndef PS_USE_NONE > +static void > +set_ps_display_internal(void) Very very minor nit: Perhaps this should be update_ps_display() or flush_ps_display() instead? Greetings, Andres Freund
Re: Introduce list_reverse() to make lcons() usage less inefficient
Andres Freund writes: > On 2023-02-17 11:36:40 +1300, David Rowley wrote: >> One of those places is in generate_orderedappend_paths() when we find >> that the required sort order is the same as the reverse of the >> partition order. In this case, we build a list of paths for each >> partition using the lcons() function. Since Lists are now arrays in >> PostgreSQL, lcons() isn't as efficient as it once was and it must >> memmove the entire existing contents of the list up one element to >> make way to prepend the new element. This is effectively quadratic and >> becomes noticeable with a large number of partitions. > I have wondered before if we eventually ought to switch to embedded lists for > some planner structures, including paths. add_path() inserts/deletes at points > in the middle of the list, which isn't great. I'm not hugely excited about that, because it presumes that paths appear in only one list, which isn't true. We could perhaps privilege RelOptInfo.pathlist over other cases, but that'd be asymmetrical and probably bug-inducing. regards, tom lane
Re: Dead code in ps_status.c
On Fri, Feb 17, 2023 at 3:38 AM Tom Lane wrote: > My account still works, and what I see on wrasse's host is > > tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i svr > #define __SVR4 1 > #define __svr4__ 1 > tgl@gcc-solaris11:~$ gcc -x c /dev/null -dM -E | grep -i sun > #define __sun 1 > #define sun 1 > #define __sun__ 1 > > I don't know a way to get the list of predefined macros out of the > compiler wrasse is actually using (/opt/developerstudio12.6/bin/cc), > but doing some experiments with #ifdef confirmed that it defines > __sun, __sun__, and __svr4__, but not __svr5__. Thanks. I went with __sun, because a random man page google found me for Sun "cc" mentioned that but not __sun__. Pushed. http://www.polarhome.com/service/man/?qf=cc&tf=2&of=Solaris&sf=1
Re: REASSIGN OWNED vs ALTER TABLE OWNER TO permission inconsistencies
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Wed, Feb 15, 2023 at 9:01 AM Stephen Frost wrote: > > I'm not really a fan of just dropping the CREATE check. If we go with > > "recipient needs CREATE rights" then at least without superuser > > intervention and excluding cases where REVOKE's or such are happening, > > we should be able to see that only objects where the owners of those > > objects have CREATE rights exist in the system. If we drop the CREATE > > check entirely then clearly any user who happens to have access to > > multiple roles can arrange to have objects owned by any of their roles > > in any schema or database they please without any consideration for > > what the owner of the parent object wishes. > > That's true, and it is a downside of dropping the CREATE check, but > it's also a bit hard to believe that anyone's really getting a lot of > value out of the current inconsistent checks. I agree that we should be consistent about these checks. I'm just more inclined to have that consistent result include the CREATE check than have it be dropped. Not sure that it's a huge thing, but being able to control what set of owner roles are allowed to have objects in a given schema seems useful and was certainly the intent, as I recall anyhow. Thanks, Stephen
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Thu, Feb 16, 2023 at 7:32 PM Andres Freund wrote: > What PG version? > Hey, Andres. Thanks for the reply. Given not much changed regarding that allocation context IIRC, I’d think all recents. It was observed in 13, 14, and 15. Do you have a way to reproduce this with core code, > e.g. postgres_fdw/file_fdw? I’ll have to create one - it was most evident on a TPC-C or sysbench test using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible with pgbench as well. What is all that memory used for? Is it possible that the real issue is too > many tiny allocations, due to some allocation growing slowly? The FDW state management allocations and whatever each FDW needs to accomplish its goals. Different FDWs do different things. This seems way too magic to expose to users. How would they ever know how to > set it? And it will depend heavily on the specific queries, so a global config > won't > work well. Agreed on the nastiness of exposing it directly. Not that we don’t give users control of memory anyway, but that one is easier to mess up without at least putting some custom set bounds on it. If the issue is a specific FDW needing to make a lot of allocations, I can > see > adding an API to tell a memory context that it ought to be ready to > allocate a > certain amount of memory efficiently (e.g. by increasing the block size of > the > next allocation by more than 2x). While I’m happy to be wrong, it seems it is an inherent problem not really specific to FDW implementations themselves but the general expectation that all FDWs are using more of that context than non-FDW cases for similar types of operations, which wasn’t really a consideration in that allocation over time. If we come up with some sort of alternate allocation strategy, I don’t know how it would be very clean API-wise, but it’s definitely an idea. -- Jonah H. Harris
Re: Missing free_var() at end of accum_sum_final()?
On Thu, Feb 16, 2023 at 01:35:54PM -0800, Andres Freund wrote: > But why do we need it? Most SQL callable functions don't need to be careful > about not leaking O(1) memory, the exception being functions backing btree > opclasses. > > In fact, the detailed memory management often is *more* expensive than just > relying on the calling memory context being reset. > > Of course, numeric.c doesn't really seem to have gotten that message, so > there's a consistency argument here. I don't know which final result is better. The arguments go two ways: 1) Should numeric.c be simplified by removing the extra pfree()s, making it more consistent with the global assumptions rather than just this file's own conventions? This has the disadvantage of creating more noise in backpatching, while increasing the risk of leaks if some of the removed parts are allocated in a tight loop within the same context. This makes memory management less complicated. That's how I am understanding your point. 2) Should the style within numeric.c be made more consistent? This is how I am understanding this proposal. As you say, this makes memory management more complicated (though I am not convinced about that for the internals of numerics?), while making the file more consistent. At the end, perhaps that's not worth bothering with, but 2) prevails when it comes to the rule of making some code consistent with its surroundings, and 1) has more risks seeing how old this code is. -- Michael
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
On Thu, Feb 16, 2023 at 8:39 PM Drouvot, Bertrand wrote: > > On 2/16/23 1:26 PM, Drouvot, Bertrand wrote: > > Hi, > > > > On 2/16/23 12:00 PM, Amit Kapila wrote: > >> I think this would require XLOG_PAGE_MAGIC as it changes the WAL record. > >> > > > > Oh, I Was not aware about it, thanks! Will do in V2 (and in the logical > > decoding on standby patch as it adds the new field mentioned above). > > > >> BTW, how about a commit message like: > >> Change xl_hash_vacuum_one_page.ntuples from int to uint16. > >> > >> This will create two bytes of padding space in xl_hash_vacuum_one_page > >> which can be used for future patches. This makes the datatype of > >> xl_hash_vacuum_one_page.ntuples same as gistxlogDelete.ntodelete which > >> is advisable as both are used for the same purpose. > >> > > > > LGTM, will add it to V2! > > > Please find V2 attached. > The commit message also mention the XLOG_PAGE_MAGIC bump. > Thanks, I was not completely sure about whether we need to bump XLOG_PAGE_MAGIC for this patch as this makes the additional space just by changing the datatype of one of the members of the existing WAL record. We normally change it for the addition/removal of new fields aka change in WAL record format, or addition of a new type of WAL record. Does anyone else have an opinion/suggestion on this matter? -- With Regards, Amit Kapila.
Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov wrote: > First of all it seemed to me that is not a problem at all since msdn > guarantees sector-by-sector atomicity. > "Physical Sector: The unit for which read and write operations to the device > are completed in a single operation. This is the unit of atomic write..." > https://learn.microsoft.com/en-us/windows/win32/fileio/file-buffering > https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile > (Of course, only if the 512 bytes lays from the beginning of the file with a > zero > offset, but this is our case. The current size of ControlFileData is > 296 bytes at offset = 0.) There are two kinds of atomicity that we rely on for the control file today: * atomicity on power loss (= device property, in case of overwrite filesystems) * atomicity of concurrent reads and writes (= VFS or kernel buffer pool interlocking policy) I assume that documentation is talking about the first thing (BTW, I suspect that documentation is also now wrong in one special case: NTFS filesystems mounted on DAX devices don't have sectors or sector-based atomicity unless you turn on BTT and slow them down[1]; that might eventually be something for PostgreSQL to think about, and it applies to other OSes too). With this patch we would stop relying on the second thing. Supposedly POSIX requires read/write atomicity, and many file systems offer it in a strong form (internal range locking) or maybe a weak accidental form (page-level locking). Since some extremely popular implementations just don't care, and Windows isn't even a POSIX, we just have to do it ourselves. 
BTW there are at least two other places where PostgreSQL already knows that concurrent reads and writes are possibly non-atomic (and we also don't even try to get the alignment right, making the question moot): pg_basebackup, which enables full_page_writes implicitly if you didn't have the GUC on already, and pg_rewind, which refuses to run if you haven't enabled full_page_writes explicitly (as recently discussed on another thread; that's an annoying difference, and also an annoying behaviour if you know your storage doesn't really need it!) > I tried to verify this fact experimentally and was very surprised. > Unfortunately it crashed in an hour during torture test: > 2023-02-13 15:10:52.675 MSK [5704] LOG: starting PostgreSQL 16devel, > compiled by Visual C++ build 1929, 64-bit > 2023-02-13 15:10:52.768 MSK [5704] LOG: database system is ready to accept > connections > @@ sizeof(ControlFileData) = 296 > . > 2023-02-13 16:10:41.997 MSK [9380] ERROR: calculated CRC checksum does not > match value stored in file Thanks. I'd seen reports of this in discussions on the 'net, but those had no authoritative references or supporting test results. The fact that fsync made it take longer (in your following email) makes sense and matches Linux. It inserts a massive pause in the middle of the experiment loop, affecting the probabilities. Therefore, we need a solution for Windows too. I tried to write the equivalent code, in the attached. I learned a few things: Windows locks are mandatory (that is, if you lock a file, reads/writes can fail, unlike Unix). Combined with async release, that means we probably also need to lock the file when reading it in xlog.c:ReadControlFile() (see comments). Before I added that, on a CI run, I found that the read in there would sometimes fail, and adding the locking fixed that.
I am a bit confused, though, because I expected that to be necessary only if someone managed to crash while holding the write lock, which the CI tests shouldn't do. Do you have any ideas? While contemplating what else a mandatory file lock might break, I remembered that basebackup.c also reads the control file. Hrmph. Not addressed yet; I guess it might need to acquire/release around sendFile(sink, XLOG_CONTROL_FILE, ...)? > > I would > > only want to consider this if we also stop choosing "open_datasync" as > > a default on at least Windows. > > I didn't quite understand this point. Could you clarify it for me, please? If > the performance > of UpdateMinRecoveryPoint() wasn't a problem we could just use fsync in all > platforms. The level of durability would be weakened on Windows. On all systems except Linux and FreeBSD, we default to wal_sync_method=open_datasync, so then we would start using FILE_FLAG_WRITE_THROUGH for the control file too. We know from reading and experimentation that FILE_FLAG_WRITE_THROUGH doesn't flush the drive cache on consumer Windows hardware, but its fdatasync-like thing does[2]. I have not thought too hard about the consequences of using different durability levels for different categories of file, but messing with write ordering can't be good for crash recovery, so I'd rather increase WAL durability than decrease control file durability. If a Windows user complains that it makes their fancy non-volati
Re: DDL result is lost by CREATE DATABASE with WAL_LOG strategy
On Fri, Feb 17, 2023 at 2:59 AM Andres Freund wrote: > > On 2023-02-16 12:37:57 +0530, Robert Haas wrote: > > The patch creates 100_bugs.pl > > What's the story behind 100_bugs.pl? This name clearly is copied from > src/test/subscription/t/100_bugs.pl - but I've never understood why that is > outside of the normal numbering space. > Yeah, I have also previously wondered about this name for src/test/subscription/t/100_bugs.pl. My guess is that it has been kept to distinguish it from the other feature tests which have numbering starting from 001. -- With Regards, Amit Kapila.
Re: Introduce list_reverse() to make lcons() usage less inefficient
On Fri, 17 Feb 2023 at 13:23, Andres Freund wrote: > But wouldn't an even cheaper way here be to iterate over the children in > reverse order when match_partition_order_desc? We can do that efficiently > now. Looks like we don't have a readymade helper for it, but it'd be easy > enough to add or open code. That seems fair. I think open coding is a better option. I had a go at foreach_reverse recently and decided to keep clear of it due to behavioural differences with foreach_delete_current(). I've attached a patch for this. It seems to have similar performance to the list_reverse() version:

$ psql -c "explain (analyze, timing off) select * from lp order by a desc" postgres | grep "Planning Time"
Planning Time: 522.554 ms <- cold relcache
Planning Time: 467.776 ms
Planning Time: 466.424 ms

David

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ae0f9bdc8a..f4d4018960 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1703,6 +1703,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 	ListCell   *lcr;
 	bool		match_partition_order;
 	bool		match_partition_order_desc;
+	int			end_element;
+	int			first_element;
+	int			direction;
 
 	/*
 	 * Determine if this sort ordering matches any partition pathkeys we
@@ -1723,10 +1726,31 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 		 (!partition_pathkeys_desc_partial &&
 		  pathkeys_contained_in(partition_pathkeys_desc, pathkeys)));
 
+	/*
+	 * When the required pathkeys match the reverse of the partition
+	 * order, we must build the list of paths in reverse starting with the
+	 * last matching partition first.  We can get away without making any
+	 * special cases for this in the loop by just looping backward over
+	 * the child relations in this case.
+	 */
+	if (match_partition_order_desc)
+	{
+		first_element = list_length(live_childrels) - 1;
+		end_element = -1;
+		direction = -1;
+		match_partition_order = true;
+	}
+	else
+	{
+		first_element = 0;
+		end_element = list_length(live_childrels);
+		direction = 1;
+	}
+
 	/* Select the child paths for this ordering... */
-	foreach(lcr, live_childrels)
+	for (int i = first_element; i != end_element; i += direction)
 	{
-		RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
+		RelOptInfo *childrel = list_nth_node(RelOptInfo, live_childrels, i);
 		Path	   *cheapest_startup,
 				   *cheapest_total,
 				   *cheapest_fractional = NULL;
@@ -1822,25 +1846,6 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 				fractional_subpaths = lappend(fractional_subpaths, cheapest_fractional);
 			}
 		}
-		else if (match_partition_order_desc)
-		{
-			/*
-			 * As above, but we need to reverse the order of the children,
-			 * because nodeAppend.c doesn't know anything about reverse
-			 * ordering and will scan the children in the order presented.
-			 */
-			cheapest_startup = get_singleton_append_subpath(cheapest_startup);
-			cheapest_total = get_singleton_append_subpath(cheapest_total);
-
-			startup_subpaths = lcons(cheapest_startup, startup_subpaths);
-			total_subpaths = lcons(cheapest_total, total_subpaths);
-
-			if (cheapest_fractional)
-			{
-				cheapest_fractional = get_singleton_append_subpath(cheapest_fractional);
-				fractional_subpaths = lcons(cheapest_fractional, fractional_subpaths);
-			}
-		}
 		else
 		{
 			/*
@@ -1859,7 +1864,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 	}
 
 	/* ... and build the Append or MergeAppend paths */
-
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
Hi, On 2023-02-16 21:34:18 -0500, Jonah H. Harris wrote: > On Thu, Feb 16, 2023 at 7:32 PM Andres Freund wrote: > Given not much changed regarding that allocation context IIRC, I’d think > all recents. It was observed in 13, 14, and 15. We did have a fair bit of changes in related code in the last few years, including some in 16. I wouldn't expect them to help *hugely*, but also wouldn't be surprised if it showed. > I’ll have to create one - it was most evident on a TPC-C or sysbench test > using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible > with pgbench as well. I'd like a workload that hits a perf issue with this, because I think there likely are some general performance improvements that we could make, without changing the initial size or the "growth rate". Perhaps, as a starting point, you could get MemoryContextStats(queryDesc->estate->es_query_cxt) both at the end of standard_ExecutorStart() and at the beginning of standard_ExecutorFinish(), for one of the queries triggering the performance issues? > > If the issue is a specific FDW needing to make a lot of allocations, I can > > see adding an API to tell a memory context that it ought to be ready to > > allocate a certain amount of memory efficiently (e.g. by increasing the > > block size of the next allocation by more than 2x). > > While I’m happy to be wrong, it seems it is an inherent problem not really > specific to FDW implementations themselves but the general expectation that > all FDWs are using more of that context than non-FDW cases for similar > types of operations, which wasn’t really a consideration in that allocation > over time. Lots of things can end up in the query context, it's really not FDW specific for it to be of nontrivial size. E.g. most tuples passed around end up in it. Similar performance issues also exist for plenty of other memory contexts. Which for me is a reason *not* to make it configurable just for CreateExecutorState. 
Or were you proposing to make ALLOCSET_DEFAULT_INITSIZE configurable? That would end up with a lot of waste, I think. The executor context case might actually be a comparatively easy case to address. There's really two "phases" of use for es_query_ctx. First, we create the entire executor tree in it, during standard_ExecutorStart(). Second, during query execution, we allocate things with query lifetime (be that because they need to live till the end, or because they are otherwise managed, like tuples). Even very simple queries end up with multiple blocks at the end. E.g. SELECT relname FROM pg_class WHERE relkind = 'r' AND relname = 'frak'; yields:

ExecutorState: 43784 total in 3 blocks; 8960 free (5 chunks); 34824 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Grand total: 51976 bytes in 4 blocks; 16888 free (5 chunks); 35088 used

So quite justifiably we could just increase the hardcoded size in CreateExecutorState. I'd expect that starting a few size classes up would help noticeably. But I think we likely could do better here. The amount of memory that ends up in es_query_cxt during "phase 1" strongly correlates with the complexity of the statement, as the whole executor tree ends up in it. Using information about the complexity of the planned statement to influence es_query_cxt's block sizes would make sense to me. I suspect it's a decent enough proxy for "phase 2" as well. Medium-long term I really want to allocate at least all the executor nodes themselves in a single allocation. But that's a bit further out than what we're talking about here. Greetings, Andres Freund
Should CreateExprContext() be using ALLOCSET_DEFAULT_SIZES?
Hi, The thread around https://postgr.es/m/caduqk8uqw5qauqldd-0sbcvzvncre3jmjb9+ydwo_umv_ht...@mail.gmail.com reminded me of the following: ISTM that we really shouldn't use ALLOCSET_DEFAULT_SIZES for expression contexts, as they most commonly see only a few small, or no, allocations. That's true for OLTPish queries, but it is very often true even for analytics queries. Just because I had it loaded, here's the executor state for TPCH-Q01, which is pretty expression heavy:

ExecutorState: 65536 total in 4 blocks; 42512 free (11 chunks); 23024 used
  TupleSort main: 32832 total in 2 blocks; 7320 free (7 chunks); 25512 used
    TupleSort sort: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
      Caller tuples: 8192 total in 1 blocks (9 chunks); 6488 free (0 chunks); 1704 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
Grand total: 139328 bytes in 11 blocks; 88032 free (18 chunks); 51296 used

As you can see very little was allocated in the ExprContexts. ISTM that we could save a reasonable amount of memory by using a smaller initial size. Not so sure if a smaller max size should be used though. Greetings, Andres Freund
Re: proposal: psql: psql variable BACKEND_PID
čt 16. 2. 2023 v 12:49 odesílatel Jelte Fennema napsal: > On Thu, 16 Feb 2023 at 12:44, Pavel Stehule > wrote: > > To find and use pg_backend_pid is not rocket science. But use > :BACKEND_PID is simpler. > > I wanted to call out that if there's a connection pooler (e.g. > PgBouncer) in the middle, then BACKEND_PID (and %p) are incorrect, but > pg_backend_pid() would work for the query. This might be an edge case, > but if BACKEND_PID is added it might be worth listing this edge case > in the docs somewhere. > good note Regards Pavel
Re: Change xl_hash_vacuum_one_page.ntuples from int to uint16
Hi On 2023-02-17 08:30:09 +0530, Amit Kapila wrote: > Thanks, I was not completely sure about whether we need to bump > XLOG_PAGE_MAGIC for this patch as this makes the additional space just > by changing the datatype of one of the members of the existing WAL > record. We normally change it for the addition/removal of new fields > aka change in WAL record format, or addition of a new type of WAL > record. Does anyone else have an opinion/suggestion on this matter? I'd definitely change it - a change in the width of a field still means you can't really parse the old WAL sensibly anymore. Greetings, Andres Freund
Re: Support logical replication of DDLs
> > I've implemented a prototype to allow replicated objects to have the > > same owner from the publisher in > > v69-0008-Allow-replicated-objects-to-have-the-same-owner-from.patch. > > I also think it would be a helpful addition for users. Thanks for supporting this addition. > A few points > that come to my mind are: (a) Shouldn't the role have the same > privileges (for ex. rolbypassrls or rolsuper) on both sides before we > allow this? (b) Isn't it better to first have a replication of roles? > I think if we have (b) then it would be probably a bit easier because > if the subscription has allowed replicating roles and we can confirm > that the role is replicated then we don't need to worry about the > differences. Yes, having role replication will help further reduce the manual effort. But even if we don't end up doing role replication soon, I think we can still provide this subscription option (match_ddl_owner, off by default) and document that the same roles need to exist on both sides for it to work. > Now, coming to implementation, won't it be better if we avoid sending > the owner to the subscriber unless it is changed for the replicated > command? Consider the current case of tables where we send schema only > if it is changed. This is not a direct mapping but it would be better > to avoid sending additional information and then process it on the > subscriber for each command. Right, we can do some optimization here: only send the owner for commands that create objects (CREATE TABLE/FUNCTION/INDEX etc.) Note that ALTER TABLE/OBJECT OWNER TO is replicated, so we don't need to worry about owner changes. Regards, Zane
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Fri, 17 Feb 2023 at 16:40, Andres Freund wrote: > I'd like a workload that hits a perf issue with this, because I think there > likely are some general performance improvements that we could make, without > changing the initial size or the "growth rate". I didn't hear it mentioned explicitly here, but I suspect it's faster when increasing the initial size due to the memory context caching code that reuses aset MemoryContexts (see context_freelists[] in aset.c). Since we reset the context before caching it, then it'll remain fast when we can reuse a context, provided we don't need to do a malloc for an additional block beyond the initial block that's kept in the cache. Maybe we should think of a more general-purpose way of doing this caching which just keeps a global-to-the-process dclist of blocks laying around. We could see if we have any free blocks both when creating the context and also when we need to allocate another block. I see no reason why this couldn't be shared among the other context types rather than keeping this cache stuff specific to aset.c. slab.c might need to be pickier if the size isn't exactly what it needs, but generation.c should be able to make use of it the same as aset.c could. I'm unsure what'd we'd need in the way of size classing for this, but I suspect we'd need to pay attention to that rather than do things like hand over 16MBs of memory to some context that only wants a 1KB initial block. David
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Thu, Feb 16, 2023 at 11:26 PM David Rowley wrote: > I didn't hear it mentioned explicitly here, but I suspect it's faster > when increasing the initial size due to the memory context caching > code that reuses aset MemoryContexts (see context_freelists[] in > aset.c). Since we reset the context before caching it, then it'll > remain fast when we can reuse a context, provided we don't need to do > a malloc for an additional block beyond the initial block that's kept > in the cache. This is what we were seeing. The larger initial size reduces/eliminates the multiple smaller blocks that are malloced and freed in each per-query execution. Maybe we should think of a more general-purpose way of doing this > caching which just keeps a global-to-the-process dclist of blocks > laying around. We could see if we have any free blocks both when > creating the context and also when we need to allocate another block. > I see no reason why this couldn't be shared among the other context > types rather than keeping this cache stuff specific to aset.c. slab.c > might need to be pickier if the size isn't exactly what it needs, but > generation.c should be able to make use of it the same as aset.c > could. I'm unsure what'd we'd need in the way of size classing for > this, but I suspect we'd need to pay attention to that rather than do > things like hand over 16MBs of memory to some context that only wants > a 1KB initial block. Yeah. There’s definitely a smarter and more reusable approach than I was proposing. A lot of that code is fairly mature and I figured more people wouldn’t want to alter it in such ways - but I’m up for it if an approach like this is the direction we’d want to go in. -- Jonah H. Harris
Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations
On Fri, 17 Feb 2023 at 17:40, Jonah H. Harris wrote: > Yeah. There’s definitely a smarter and more reusable approach than I was > proposing. A lot of that code is fairly mature and I figured more people > wouldn’t want to alter it in such ways - but I’m up for it if an approach > like this is the direction we’d want to go in. I've spent quite a bit of time in this area recently and I think that context_freelists[] is showing its age now. It does seem that slab and generation were added before context_freelists[] (9fa6f00b), but not by much, and those new contexts had fewer users back then. It feels a little unfair that aset should get to cache but the other context types don't. I don't think each context type should have some separate cache either as that probably means more memory wasted. Having something agnostic to if it's allocating a new context or adding a block to an existing one seems like a good idea to me. I think the tricky part will be the discussion around which size classes to keep around and in which cases can we use a larger allocation without worrying too much that it'll be wasted. We also don't really want to make the minimum memory that a backend can keep around too bad. Patches such as [1] are trying to reduce that. Maybe we can just keep a handful of blocks of 1KB, 8KB and 16KB around, or more accurately put, ALLOCSET_SMALL_INITSIZE, ALLOCSET_DEFAULT_INITSIZE and ALLOCSET_DEFAULT_INITSIZE * 2, so that it works correctly if someone adjusts those definitions. I think you'll want to look at what the maximum memory a backend can keep around in context_freelists[] and not make the worst-case memory consumption worse than it is today. I imagine this would be some new .c file in src/backend/utils/mmgr which aset.c, generation.c and slab.c each call a function from to see if we have any cached blocks of that size. 
You'd want to call that in all places we call malloc() from those files apart from when aset.c and generation.c malloc() for a dedicated block. You can probably get away with replacing all of the free() calls with a call to another function where you pass the pointer and the size of the block to have it decide if it's going to free() it or cache it. I doubt you need to care too much if the block is from a dedicated allocation or a normal block. We'd just always free() if it's not in the size classes that we care about. David [1] https://commitfest.postgresql.org/42/3867/
Re: Move defaults toward ICU in 16?
On Thu, Feb 16, 2023 at 9:45 PM Andres Freund wrote: > On 2023-02-16 15:05:10 +0530, Robert Haas wrote: > > The fact that we can't use ICU on Windows, though, weakens this > > argument a lot. In my experience, we have a lot of Windows users, and > > they're not any happier with the operating system collations than > > Linux users. Possibly less so. > > Why can't you use ICU on windows? It works today, afaict? Uh, I had the contrary impression from the discussion upthread, but it sounds like I might be misunderstanding the situation? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Support logical replication of global object commands
> > > Actually, I intend something for global objects. But the main thing
> > > that is worrying me about this is that we don't have a clean way to
> > > untie global object replication from database-specific object
> > > replication.
> >
> > I think ultimately we need a clean and efficient way to publish (and
> > subscribe to) any changes in all databases, preferably in one logical
> > replication slot.
>
> Agreed. I was thinking currently for logical replication both
> walsender and slot are database-specific. So we need a way to
> distinguish the WAL for global objects and then avoid filtering based
> on the slot's database during decoding.

But which walsender should handle the WAL for global objects if we don't filter by database? Is there any specific problem you see with decoding global object commands in a database-specific walsender?

> I also thought about whether
> we want to have a WALSender that is not connected to a database for
> the replication of global objects but I couldn't come up with a reason
> for doing so.

Do you have any thoughts on this matter?

Regards,
Zane
Re: Move defaults toward ICU in 16?
Hi,

On February 16, 2023 9:14:05 PM PST, Robert Haas wrote:
> On Thu, Feb 16, 2023 at 9:45 PM Andres Freund wrote:
>> On 2023-02-16 15:05:10 +0530, Robert Haas wrote:
>> > The fact that we can't use ICU on Windows, though, weakens this
>> > argument a lot. In my experience, we have a lot of Windows users, and
>> > they're not any happier with the operating system collations than
>> > Linux users. Possibly less so.
>>
>> Why can't you use ICU on windows? It works today, afaict?
>
> Uh, I had the contrary impression from the discussion upthread, but it
> sounds like I might be misunderstanding the situation?

That was about the build environment in CI / cfbot, I think. Jeff was making ICU a hard requirement by default, but ICU wasn't installed in a usable way, so the build failed. The patch he referred to was just building ICU during the CI run.

I do remember encountering issues with the mkvcbuild.pl build not building against a downloaded modern ICU build, but that was just about library naming or directory structure, or some such.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Move defaults toward ICU in 16?
On Fri, Feb 17, 2023 at 10:44:05AM +0530, Robert Haas wrote:
> Uh, I had the contrary impression from the discussion upthread, but it
> sounds like I might be misunderstanding the situation?

IMO, it would be nice to have the automatic detection of meson work in the CFbot to see how this patch goes. Perhaps that's not reason enough to hold back this patch, though..

Separate question: what's the state of the Windows installers provided by the community regarding libicu? Is it bundled in the MSI?

--
Michael

signature.asc
Description: PGP signature
should get_extension_schema be visible?
Hi,

More than once I have needed to get an extension's assigned namespace. There is already a suitable function, get_extension_schema, but it is static. I need to find a function with a known name, but a possibly unknown schema, belonging to a known extension.

Regards

Pavel