Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
On Fri, Aug 28, 2020 at 9:49 PM Robert Haas wrote: > > On Tue, Jul 21, 2020 at 9:21 AM Dilip Kumar wrote: > > In the previous version, the feature was enabled for cluster/vacuum > > full command as well. In the attached patch I have enabled it only > > if we are running the vacuum command. It will not be enabled during a > > table rewrite. If we think that it should be enabled for the 'vacuum > > full' then we might need to pass a flag from the cluster_rel, all the > > way down to the heap_freeze_tuple. > > I think this is a somewhat clunky way of accomplishing this. The > caller passes down a flag to heap_prepare_freeze_tuple() which decides > whether or not an error is forced, and then that function and > FreezeMultiXactId use vacuum_damage_elevel() to combine the results of > that flag with the value of the vacuum_tolerate_damage GUC. But that > means that a decision that could be made in one place is instead made > in many places. If we just had heap_freeze_tuple() and > FreezeMultiXactId() take an argument int vacuum_damage_elevel, then > heap_freeze_tuple() could pass ERROR and lazy_scan_heap() could > arrange to pass WARNING or ERROR based on the value of > vacuum_tolerate_damage. I think that would likely end up cleaner. What > do you think? I agree that this way it is much cleaner. I have changed it in the attached patch. > I also suggest renaming is_corrupted_xid to found_corruption. With the > current name, it's not very clear which XID we're saying is corrupted; > in fact, the problem might be a MultiXactId rather than an XID, and > the real issue might be with the table's pg_class entry or something. Okay, changed to found_corruption. > The new arguments to heap_prepare_freeze_tuple() need to be documented > in its header comment. Done. I have also made a few more cosmetic changes to the patch. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com v4-0001-Provide-a-GUC-to-allow-vacuum-to-continue-on-corr.patch Description: Binary data
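For illustration, a minimal sketch of the calling convention under discussion (the call shapes follow the thread and current heapam.c; the exact signature in the posted patch may differ):

/* Sketch only: the caller decides the elevel once, from the GUC. */
int         elevel = vacuum_tolerate_damage ? WARNING : ERROR;

/* lazy_scan_heap() passes it down ... */
heap_prepare_freeze_tuple(tuple.t_data, relfrozenxid, relminmxid,
                          FreezeLimit, MultiXactCutoff,
                          &frozen[nfrozen], &tuple_totally_frozen,
                          elevel);

/* ... and heap_prepare_freeze_tuple()/FreezeMultiXactId() simply report
 * corruption at whatever level they were handed: */
if (TransactionIdPrecedes(xid, relfrozenxid))
    ereport(elevel,
            (errcode(ERRCODE_DATA_CORRUPTED),
             errmsg_internal("found xmin %u from before relfrozenxid %u",
                             xid, relfrozenxid)));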
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
On Sat, Aug 29, 2020 at 1:46 AM Robert Haas wrote: > > On Fri, Aug 28, 2020 at 1:29 PM Andres Freund wrote: > > It can break HOT chains, plain ctid chains etc, for example. Which, if > > earlier / follower tuples are removed can't be detected anymore at a > > later time. > > I think I need a more specific example here to understand the problem. > If the xmax of one tuple matches the xmin of the next, then either > both values precede relfrozenxid or both follow it. In the former > case, neither tuple should be frozen and the chain should not get > broken; in the latter case, everything's normal anyway. If the xmax > and xmin don't match, then the chain was already broken. Perhaps we > are removing important evidence, though it seems like that might've > happened anyway prior to reaching the damaged page, but we're not > making whatever corruption may exist any worse. At least, not as far > as I can see. One example: suppose that during vacuum there are 2 tuples in the HOT chain, and the xmin of the first tuple is corrupted (i.e. smaller than relfrozenxid), while the xmax of this tuple (which is the same as the xmin of the second tuple) is smaller than the cutoff_xid when we try to freeze the tuple. As a result, we will freeze the second tuple but leave the first tuple untouched. Now, if we come to heap_hot_search_buffer, the xmax of the first tuple will not match the xmin of the second tuple, as we have frozen the second tuple. But I feel this is easily fixable, right? Instead of not doing anything to the corrupted tuple, we can partially freeze it: just leave the corrupted xid alone but mark the other xid as frozen if it is smaller than cutoff_xid. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
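To make the suggestion concrete, a hypothetical sketch of the partial-freeze idea (not taken from the posted patch; the infomask handling roughly mirrors what heap_prepare_freeze_tuple() already does for the non-corrupt case):

/* Skip only the corrupted xid; keep freezing the other one. */
xid = HeapTupleHeaderGetXmin(tuple);
if (TransactionIdIsNormal(xid) && TransactionIdPrecedes(xid, relfrozenxid))
    found_corruption = true;            /* report it, but leave xmin alone */
else if (TransactionIdIsNormal(xid) && TransactionIdPrecedes(xid, cutoff_xid))
    frz->t_infomask |= HEAP_XMIN_FROZEN;

xid = HeapTupleHeaderGetRawXmax(tuple);
if (TransactionIdIsNormal(xid) && TransactionIdPrecedes(xid, cutoff_xid))
{
    /* still freeze xmax, so it stays consistent with the follower's xmin */
    frz->xmax = InvalidTransactionId;
    frz->t_infomask &= ~HEAP_XMAX_BITS;
    frz->t_infomask |= HEAP_XMAX_INVALID;
}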
Re: new heapcheck contrib module
> On 29 Aug 2020, at 00:56, Robert Haas wrote: > > On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin > wrote: >> I don't think so. ISTM it's the same problem of xmax < relfrozenxid, >> just hidden behind detoasting. >> Our regular heap_check was checking xmin/xmax invariants for tables, but >> failed to recognise the problem in toast (while toast was accessible until >> CLOG truncation). > > The code can (and should, and I think does) refrain from looking up > XIDs that are out of the range thought to be valid -- but how do you > propose that it avoid looking up XIDs that ought to have clog data > associated with them despite being >= relfrozenxid and < nextxid? > TransactionIdDidCommit() does not have a suppress-errors flag, adding > one would be quite invasive, yet we cannot safely perform a > significant number of checks without knowing whether the inserting > transaction committed. What you write seems completely correct to me. I agree that looking up CLOG thresholds seems unnecessary. But I have real corruption at hand (on a testing site). I have the heapcheck proposed here, and I have pg_surgery from the thread nearby, yet I cannot fix the problem, because I cannot list the affected tuples. These tools do not solve a problem that has been neglected for long enough. It would be supercool if they could. This corruption, like caries, had 3 stages: 1. an incorrect VM flag saying the page does not need vacuum 2. xmin and xmax < relfrozenxid 3. CLOG truncated Stage 2 is curable with the proposed toolset, stage 3 is not. But they are not that different. Thanks! Best regards, Andrey Borodin.
Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions
On Fri, Aug 28, 2020 at 2:18 PM Dilip Kumar wrote: > > As discussed, I have added another test case for covering the out of > order subtransaction rollback scenario. > +# large (streamed) transaction with out of order subtransaction ROLLBACKs +$node_publisher->safe_psql('postgres', q{ How about writing the comment as: "large (streamed) transaction with subscriber receiving out of order subtransaction ROLLBACKs"? I have reviewed and modified a number of things in the attached patch: 1. In apply_handle_origin, improved the check for streamed xacts. 2. In apply_handle_stream_commit(), while applying changes in the loop, added CHECK_FOR_INTERRUPTS. 3. In DEBUG messages, print the path with double-quotes as we are doing in all other places. 4. + /* + * Exit if streaming option is changed. The launcher will start new + * worker. + */ + if (newsub->stream != MySubscription->stream) + { + ereport(LOG, + (errmsg("logical replication apply worker for subscription \"%s\" will " + "restart because subscription's streaming option were changed", + MySubscription->name))); + + proc_exit(0); + } + We don't need a separate check like this. I have merged this into one of the existing checks. 5. subxact_info_write() { .. + if (subxact_data.nsubxacts == 0) + { + if (ent->subxact_fileset) + { + cleanup_subxact_info(); + BufFileDeleteShared(ent->subxact_fileset, path); + pfree(ent->subxact_fileset); + ent->subxact_fileset = NULL; + } I don't think it is right to use the BufFileDeleteShared interface here, because it won't perform SharedFileSetUnregister, which means that if the server exits after the above code has executed, it will crash in SharedFileSetDeleteOnProcExit, which will try to access the already-deleted fileset entry. Fixed this by calling SharedFileSetDeleteAll() instead. Another related problem is that in the function SharedFileSetDeleteOnProcExit, it tries to delete the list element while traversing the list with the 'foreach' construct, which makes the behavior of list traversal unpredictable. I have fixed this in a separate patch v54-0001-Fix-the-SharedFileSetUnregister-API; if you are fine with this, I would like to commit it, as it fixes a problem in the existing commit 808e13b282. 6. Function stream_cleanup_files() contains a missing_ok argument which is not used, so removed it. 7. In pgoutput.c, changed the ordering of functions to make them consistent with their declarations. 8. typedef struct RelationSyncEntry { Oid relid; /* relation oid */ + TransactionId xid; /* transaction that created the record */ Removed the above parameter as it doesn't seem to be required per the new design in the patch. Apart from the above, I have added/changed quite a few comments and made a few other cosmetic changes. Kindly review and let me know what you think about the changes. One more comment, for which I haven't done anything yet: +static void +set_schema_sent_in_streamed_txn(RelationSyncEntry *entry, TransactionId xid) +{ + MemoryContext oldctx; + + oldctx = MemoryContextSwitchTo(CacheMemoryContext); + + entry->streamed_txns = lappend_int(entry->streamed_txns, xid); Is it a good idea to append the xid with lappend_int? Won't we need something equivalent for uint32? If so, I think we have a couple of options: (a) use the lcons method and accordingly append the pointer to the xid (I think we need to allocate memory for the xid if we want to use this idea), or (b) use an array instead. What do you think? -- With Regards, Amit Kapila.
v54-0001-Fix-the-SharedFileSetUnregister-API.patch Description: Binary data v54-0002-Add-support-for-streaming-to-built-in-logical-re.patch Description: Binary data
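As an illustration of option (a) in the question above, a sketch (not from the posted patches) that keeps the full uint32 xid behind a palloc'd pointer instead of narrowing it through lappend_int(); the variable names here are invented:

oldctx = MemoryContextSwitchTo(CacheMemoryContext);
{
    TransactionId *streamed_xid = (TransactionId *) palloc(sizeof(TransactionId));

    *streamed_xid = xid;
    entry->streamed_txns = lappend(entry->streamed_txns, streamed_xid);
}
MemoryContextSwitchTo(oldctx);

/* ... and the membership test walks the list: */
foreach(lc, entry->streamed_txns)
{
    if (*(TransactionId *) lfirst(lc) == xid)
        return true;
}
return false;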
Re: More aggressive vacuuming of temporary tables
Michael Paquier writes: > I got to wonder lately if we should not have a static state like what > we do for MyXactFlags to track down if a utility query is a top-level > one or not as most of the time we just want to check the flag as > passed down by standard_ProcessUtility(). I have faced this problem > as well lately when pushing down a PreventInTransactionBlock() for > some stuff with REINDEX for example. Not sure how reliable this would > be though.. Passing isTopLevel down the road is a no-brainer, and > perhaps having a static value would create more problems than it > solves in practice. Hm. I suppose you'd need a PG_TRY block to ensure that the setting got restored correctly after an error, so maintaining it that way would be rather expensive. Also it just doesn't seem like transaction-wide state, so having a static for it feels like the wrong thing. [thinks for awhile...] Would it make any sense to mark Portals as being top-level or not, and then code that needs this info could look to "ActivePortal->isTopLevel"? We are already paying the PG_TRY overhead to maintain the ActivePortal variable safely. I'm not sure though whether the Portal is granular enough. Do we set up a separate Portal for subcommands? In the big scheme of things, though, we don't need this info in so many places that passing it down as a parameter is an undue burden. regards, tom lane
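For concreteness, a minimal sketch of the Portal-based idea; PortalData has no such field today, so isTopLevel below is the hypothetical flag being proposed, set wherever the portal is created:

static bool
UtilityStmtIsTopLevel(void)
{
    return ActivePortal != NULL && ActivePortal->isTopLevel;
}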
Re: Ideas about a better API for postgres_fdw remote estimates
Greetings, * Andrey Lepikhov (a.lepik...@postgrespro.ru) wrote: > During the implementation of sharding-related improvements I noticed that if > we use a lot of foreign partitions, we have bad plans because vacuum > doesn't update statistics of foreign tables. This is done by the ANALYZE > command, but it is a very expensive operation for a foreign table. > The problem with statistics is demonstrated by the TAP test in the first patch in > the attachment. Yes, the way we handle ANALYZE today for FDWs is pretty terrible, since we stream the entire table across to do it. > I implemented some FDW + pg core machinery to reduce the weight of the problem. > The ANALYZE command on a foreign table executes a query on the foreign server that > extracts the statistics tuple, serializes it into a json-formatted string and > returns it to the caller. The caller deserializes this string, generates > statistics for this foreign table and updates it. The second patch is a > proof-of-concept. Isn't this going to create a version dependency that we'll need to deal with..? What if a newer major version has some kind of improved ANALYZE command, in terms of what it looks at or stores, and it's talking to an older server? When I was considering the issue with ANALYZE and FDWs, I had been thinking it'd make sense to just change the query that's built in deparseAnalyzeSql() to have a TABLESAMPLE clause, but otherwise run in more-or-less the same manner as today. If we don't like the available TABLESAMPLE methods then we could add a new one that's explicitly the 'right' sample for an ANALYZE call and use that when it's available on the remote side. Not sure if it'd make any sense for ANALYZE itself to start using that same TABLESAMPLE code, but maybe? Not that I think it'd be much of an issue if it's independent either, with appropriate comments to note that we should probably try to make them match up for the sake of FDWs. > This patch speeds up the analyze command and keeps statistics on a > foreign table relevant after an autovacuum operation. Its effectiveness depends on > the relevance of statistics on the remote server, but still. If we do decide to go down this route, wouldn't it mean we'd have to solve the problem of what to do when it's a 9.6 foreign server being queried from a v12 server and dealing with any difference in the statistics structures of the two? Seems like we would... in which case I would say that we should pull that bit out and make it general, and use it for pg_upgrade too, which would benefit a great deal from having the ability to upgrade stats between major versions also. That's a much bigger piece to take on, of course, but seems to be what's implied with this approach for the FDW. Thanks, Stephen signature.asc Description: PGP signature
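A rough sketch of the TABLESAMPLE idea, assuming the shape of postgres_fdw's existing deparse helpers (the function name and the way the sample fraction is chosen are invented for illustration):

/* Build "SELECT * FROM remote_table TABLESAMPLE SYSTEM (pct)" instead of
 * streaming the whole table for ANALYZE. */
static void
deparseAnalyzeSampleSql(StringInfo buf, Relation rel, double sample_pct)
{
    appendStringInfoString(buf, "SELECT * FROM ");
    deparseRelation(buf, rel);          /* existing postgres_fdw helper */
    appendStringInfo(buf, " TABLESAMPLE SYSTEM (%g)", sample_pct);
}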
Re: Clang UndefinedBehaviorSanitize (Postgres14) Detected undefined-behavior
Peter Geoghegan writes: > I wonder if we should start using -fno-delete-null-pointer-checks: > https://lkml.org/lkml/2018/4/4/601 > This may not be strictly relevant to the discussion, but I was > reminded of it just now and thought I'd mention it. Hmm. gcc 8.3 defines this as: Assume that programs cannot safely dereference null pointers, and that no code or data element resides at address zero. This option enables simple constant folding optimizations at all optimization levels. In addition, other optimization passes in GCC use this flag to control global dataflow analyses that eliminate useless checks for null pointers; these assume that a memory access to address zero always results in a trap, so that if a pointer is checked after it has already been dereferenced, it cannot be null. AFAICS, that's a perfectly valid assumption for our usage. I can see why the kernel might not want it, but we set things up whenever possible to ensure that dereferencing NULL would crash. However, while grepping the manual for that I also found '-Wnull-dereference' Warn if the compiler detects paths that trigger erroneous or undefined behavior due to dereferencing a null pointer. This option is only active when '-fdelete-null-pointer-checks' is active, which is enabled by optimizations in most targets. The precision of the warnings depends on the optimization options used. I wonder whether turning that on would find anything interesting. regards, tom lane
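To illustrate the pattern those options are about, a small constructed example (not from the thread):

#include <stddef.h>

int
first_element(const int *arr)
{
    int     v = arr[0];     /* dereference happens first */

    /*
     * With -fdelete-null-pointer-checks (enabled by optimization on most
     * targets), GCC may treat this test as dead code, since a NULL 'arr'
     * would already have trapped above; -Wnull-dereference aims to warn
     * about such paths.
     */
    if (arr == NULL)
        return -1;

    return v;
}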
Re: Ideas about a better API for postgres_fdw remote estimates
Stephen Frost writes: > Isn't this going to create a version dependency that we'll need to deal > with..? What if a newer major version has some kind of improved ANALYZE > command, in terms of what it looks at or stores, and it's talking to an > older server? Yeah, this proposal is a nonstarter unless it can deal with the remote server being a different PG version with different stats. Years ago (when I was still at Salesforce, IIRC, so ~5 years) we had some discussions about making it possible for pg_dump and/or pg_upgrade to propagate stats data forward to the new database. There is at least one POC patch in the archives for doing that by dumping the stats data wrapped in a function call, where the target database's version of the function would be responsible for adapting the data if necessary, or maybe just discarding it if it couldn't adapt. We seem to have lost interest but it still seems like something worth pursuing. I'd guess that if such infrastructure existed it could be helpful for this. > When I was considering the issue with ANALYZE and FDWs, I had been > thinking it'd make sense to just change the query that's built in > deparseAnalyzeSql() to have a TABLESAMPLE clause, but otherwise run in > more-or-less the same manner as today. +1, that seems like something worth doing in any case, since even if we do get somewhere with the present idea it would only work for new remote servers. TABLESAMPLE would work pretty far back (9.5, looks like). regards, tom lane
Re: Ideas about a better API for postgres_fdw remote estimates
On 8/29/20 9:22 PM, Stephen Frost wrote: I implemented some FDW + pg core machinery to reduce the weight of the problem. The ANALYZE command on a foreign table executes a query on the foreign server that extracts the statistics tuple, serializes it into a json-formatted string and returns it to the caller. The caller deserializes this string, generates statistics for this foreign table and updates it. The second patch is a proof-of-concept. Isn't this going to create a version dependency that we'll need to deal with..? What if a newer major version has some kind of improved ANALYZE command, in terms of what it looks at or stores, and it's talking to an older server? When I was considering the issue with ANALYZE and FDWs, I had been thinking it'd make sense to just change the query that's built in deparseAnalyzeSql() to have a TABLESAMPLE clause, but otherwise run in more-or-less the same manner as today. If we don't like the available TABLESAMPLE methods then we could add a new one that's explicitly the 'right' sample for an ANALYZE call and use that when it's available on the remote side. Not sure if it'd make any sense for ANALYZE itself to start using that same TABLESAMPLE code, but maybe? Not that I think it'd be much of an issue if it's independent either, with appropriate comments to note that we should probably try to make them match up for the sake of FDWs. This approach does not contradict your idea. It is a lightweight opportunity to reduce the cost of analysis if we have a set of servers with up-to-date versions of the system catalog and FDW. This patch speeds up the analyze command and keeps statistics on a foreign table relevant after an autovacuum operation. Its effectiveness depends on the relevance of statistics on the remote server, but still. If we do decide to go down this route, wouldn't it mean we'd have to solve the problem of what to do when it's a 9.6 foreign server being queried from a v12 server and dealing with any difference in the statistics structures of the two? Seems like we would... in which case I would say that we should pull that bit out and make it general, and use it for pg_upgrade too, which would benefit a great deal from having the ability to upgrade stats between major versions also. That's a much bigger piece to take on, of course, but seems to be what's implied with this approach for the FDW. Thank you for this use case. We can add a "version" field to the statistics string (btree uses versioning too). As you can see, in this patch we are only trying to get statistics. If for some reason this does not work out, then we take the difficult route. Moreover, I believe this strategy should only work if we analyze a relation implicitly. If the user executes the analysis explicitly with the ANALYZE command, we need to perform a fair analysis of the table. -- regards, Andrey Lepikhov Postgres Professional
Re: new heapcheck contrib module
> On Aug 29, 2020, at 3:27 AM, Andrey M. Borodin wrote: > > > >> On 29 Aug 2020, at 00:56, Robert Haas wrote: >> >> On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin >> wrote: >>> I don't think so. ISTM it's the same problem of xmax < relfrozenxid, >>> just hidden behind detoasting. >>> Our regular heap_check was checking xmin/xmax invariants for tables, but >>> failed to recognise the problem in toast (while toast was accessible until >>> CLOG truncation). >> >> The code can (and should, and I think does) refrain from looking up >> XIDs that are out of the range thought to be valid -- but how do you >> propose that it avoid looking up XIDs that ought to have clog data >> associated with them despite being >= relfrozenxid and < nextxid? >> TransactionIdDidCommit() does not have a suppress-errors flag, adding >> one would be quite invasive, yet we cannot safely perform a >> significant number of checks without knowing whether the inserting >> transaction committed. > > What you write seems completely correct to me. I agree that looking up CLOG > thresholds seems unnecessary. > > But I have real corruption at hand (on a testing site). I have the heapcheck proposed > here, and I have pg_surgery from the thread nearby, yet I cannot > fix the problem, because I cannot list the affected tuples. These tools do not > solve a problem that has been neglected for long enough. It would be supercool if they > could. > > This corruption, like caries, had 3 stages: > 1. an incorrect VM flag saying the page does not need vacuum > 2. xmin and xmax < relfrozenxid > 3. CLOG truncated > > Stage 2 is curable with the proposed toolset, stage 3 is not. But they are not > that different. I had an earlier version of the verify_heapam patch that included a non-throwing interface to clog. Ultimately, I ripped that out. My reasoning was that a simpler patch submission was more likely to be acceptable to the community. If you want to submit a separate patch that creates a non-throwing version of the clog interface, and get the community to accept and commit it, I would seriously consider using that from verify_heapam. If it gets committed in time, I might even do so for this release cycle. But I don't want to make this patch dependent on that hypothetical patch getting written and accepted. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
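For the record, one hedged sketch of what such a non-throwing clog lookup could look like; SimpleLruDoesPhysicalPageExist() and TransactionIdDidCommit() exist today, but the wrapper itself is hypothetical, would have to live in clog.c (where XactCtl and TransactionIdToPage() are visible), and would still need to handle the race against concurrent truncation:

/* Returns false if the commit status cannot be obtained because the
 * backing clog segment has been truncated away. */
static bool
TransactionIdDidCommitNoError(TransactionId xid, bool *committed)
{
    int     pageno = TransactionIdToPage(xid);

    if (!SimpleLruDoesPhysicalPageExist(XactCtl, pageno))
        return false;           /* caller reports the tuple as corrupt */

    *committed = TransactionIdDidCommit(xid);
    return true;
}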
Re: list of extended statistics on psql
On Thu, Aug 27, 2020 at 07:53:23PM -0400, Alvaro Herrera wrote: +1 for the general idea, and +1 for \dX being the syntax to use IMO the per-type columns should show both the type being enabled as well as it being built. (How many more stat types do we expect -- Tomas? I wonder if having one column per type is going to scale in the long run.) I wouldn't expect a huge number of types. I can imagine maybe twice the current number of types, but not much more. But I'm not sure the output is easy to read even now ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: list of extended statistics on psql
On Thu, Aug 27, 2020 at 11:26:17PM -0400, Alvaro Herrera wrote: On 2020-Aug-28, Tatsuro Yamada wrote: > IMO the per-type columns should show both the type being enabled as > well as it being built. Hmm. I'm not sure how to get the status (enabled or disabled) of extended stats. :( Could you explain it more? pg_statistic_ext_data.stxdndistinct is not null if the stats have been built. (I'm not sure whether there's an easier way to determine this.) It's the only way, I think. Which types were requested is stored in pg_statistic_ext.stxkind and what was built is in pg_statistic_ext_data. But if we want the output to show both what was requested and which types were actually built, that'll effectively double the number of columns needed :-( Also, it might be useful to show the size of the statistics built, just like we show for \d+ etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
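A sketch of how a \dX query could show both, in the style of psql's describe.c (the column layout here is illustrative, not a settled design):

/* "enabled" comes from pg_statistic_ext.stxkind, "built" from whether the
 * corresponding pg_statistic_ext_data column is non-null. */
printfPQExpBuffer(&buf,
                  "SELECT es.stxnamespace::pg_catalog.regnamespace AS \"Schema\",\n"
                  "       es.stxname AS \"Name\",\n"
                  "       'd' = any(es.stxkind) AS \"Ndistinct enabled\",\n"
                  "       sd.stxdndistinct IS NOT NULL AS \"Ndistinct built\",\n"
                  "       'f' = any(es.stxkind) AS \"Dependencies enabled\",\n"
                  "       sd.stxddependencies IS NOT NULL AS \"Dependencies built\"\n"
                  "  FROM pg_catalog.pg_statistic_ext es\n"
                  "  JOIN pg_catalog.pg_statistic_ext_data sd ON sd.stxoid = es.oid\n"
                  " ORDER BY 1, 2;");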
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote: Greetings, * Robert Haas (robertmh...@gmail.com) wrote: On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost wrote: > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same or if not in s_b anymore. > > We don't actually read the page when we're replaying an FPW though..? > If we don't read it, and we entirely write the page from the FPW, how is > pre-fetching helping..? Suppose there is a checkpoint. Then we replay a record with an FPW, pre-fetching nothing. Then the buffer gets evicted from shared_buffers, and maybe the OS cache too. Then, before the next checkpoint, we again replay a record for the same page. At this point, pre-fetching should be helpful. Sure- but if we're talking about 25GB of WAL, on a server that's got 32GB, then why would those pages end up getting evicted from memory entirely? Particularly, enough of them to end up with such a huge difference in replay time.. I do agree that if we've got more outstanding WAL between checkpoints than the system's got memory then that certainly changes things, but that wasn't what I understood the case to be here. I don't think it's very clear how much WAL there actually was in each case - the message only said there was more than 25GB, but who knows how many checkpoints that covers? In the cases with FPW=on this may easily be much less than one checkpoint (because with scale 45GB an update to every page will log 45GB of full-page images). It'd be interesting to see some stats from pg_waldump etc. Admittedly, I don't quite understand whether that is what is happening in this test case, or why SDD vs. HDD should make any difference. But there doesn't seem to be any reason why it doesn't make sense in theory. I agree that this could be a reason, but it doesn't seem to quite fit in this particular case given the amount of memory and WAL. I'm suspecting that it's something else and I'd very much like to know if it's a general "this applies to all (most? a lot of?) SSDs because the hardware has a larger than 8KB page size and therefore the kernel has to read it", or if it's something odd about this particular system and doesn't apply generally. Not sure. I doubt it has anything to do with the hardware page size, that's mostly transparent to the kernel anyway. But it might be that the prefetching on a particular SSD has more overhead than what it saves. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: list of extended statistics on psql
On 2020-Aug-29, Tomas Vondra wrote: > But if we want the > output to show both what was requested and which types were actually > built, that'll effectively double the number of columns needed :-( I was thinking it would be one column per type showing either disabled or enabled or built. But another idea is to show one type per line that's at least enabled. > Also, it might be useful to show the size of the statistics built, just > like we show for \d+ etc. \dX+ I suppose? -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: list of extended statistics on psql
On Sat, Aug 29, 2020 at 06:43:47PM -0400, Alvaro Herrera wrote: On 2020-Aug-29, Tomas Vondra wrote: But if we want the output to show both what was requested and which types were actually built, that'll effectively double the number of columns needed :-( I was thinking it would be one column per type showing either disabled or enabled or built. But another idea is to show one type per line that's at least enabled. Also, it might be useful to show the size of the statistics built, just like we show for \d+ etc. \dX+ I suppose? Right. I've only used \d+ as an example of an existing command showing sizes of the objects. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Get rid of runtime handling of AlternativeSubPlan?
I wrote: > Back in bd3daddaf232d95b0c9ba6f99b0170a0147dd8af, which introduced > AlternativeSubPlans, I wrote: > There is a lot more that could be done based on this infrastructure: in > particular it's interesting to consider switching to the hash plan if we > start > out using the non-hashed plan but find a lot more upper rows going by than > we > expected. I have therefore left some minor inefficiencies in place, such as > initializing both subplans even though we will currently only use one. > > That commit will be twelve years old come August, and nobody has either > built anything else atop it or shown any interest in making the plan choice > switchable mid-run. So it seems like kind of a failed experiment. > > Therefore, I'm considering the idea of ripping out all executor support > for AlternativeSubPlan and instead having the planner replace an > AlternativeSubPlan with the desired specific SubPlan somewhere late in > planning, possibly setrefs.c. Here's a proposed patchset for that. This runs with the idea I'd had that setrefs.c could be smarter than the executor about which plan node subexpressions will be executed how many times. I did not take it very far, for fear of adding an undue number of planning cycles, but it's still better than what we have now. For ease of review, 0001 adds the new planner logic, while 0002 removes the now-dead executor support. There's one bit of dead code that I left in place for the moment, which is ruleutils.c's support for printing AlternativeSubPlans. I'm not sure if that's worth keeping or not --- it's dead code for normal use, but if someone tried to use ruleutils.c to print partially-planned expression trees, maybe there'd be a use for it? (It's also arguable that readfuncs.c's support is now dead code, but I have little interest in stripping that out.) regards, tom lane diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c index e2f177515d..f0386480ab 100644 --- a/src/backend/nodes/outfuncs.c +++ b/src/backend/nodes/outfuncs.c @@ -2254,6 +2254,7 @@ _outPlannerInfo(StringInfo str, const PlannerInfo *node) WRITE_BOOL_FIELD(hasLateralRTEs); WRITE_BOOL_FIELD(hasHavingQual); WRITE_BOOL_FIELD(hasPseudoConstantQuals); + WRITE_BOOL_FIELD(hasAlternativeSubPlans); WRITE_BOOL_FIELD(hasRecursion); WRITE_INT_FIELD(wt_param_id); WRITE_BITMAPSET_FIELD(curOuterRels); diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index b40a112c25..27006c59e0 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -629,6 +629,8 @@ subquery_planner(PlannerGlobal *glob, Query *parse, root->minmax_aggs = NIL; root->qual_security_level = 0; root->inhTargetKind = INHKIND_NONE; + root->hasPseudoConstantQuals = false; + root->hasAlternativeSubPlans = false; root->hasRecursion = hasRecursion; if (hasRecursion) root->wt_param_id = assign_special_exec_param(root); @@ -759,9 +761,6 @@ subquery_planner(PlannerGlobal *glob, Query *parse, */ root->hasHavingQual = (parse->havingQual != NULL); - /* Clear this flag; might get set in distribute_qual_to_rels */ - root->hasPseudoConstantQuals = false; - /* * Do expression preprocessing on targetlist and quals, as well as other * random expressions in the querytree. 
Note that we do not need to diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c index baefe0e946..b5a29ecfc4 100644 --- a/src/backend/optimizer/plan/setrefs.c +++ b/src/backend/optimizer/plan/setrefs.c @@ -49,6 +49,7 @@ typedef struct { PlannerInfo *root; int rtoffset; + double num_exec; } fix_scan_expr_context; typedef struct @@ -58,6 +59,7 @@ typedef struct indexed_tlist *inner_itlist; Index acceptable_rel; int rtoffset; + double num_exec; } fix_join_expr_context; typedef struct @@ -66,8 +68,28 @@ typedef struct indexed_tlist *subplan_itlist; Index newvarno; int rtoffset; + double num_exec; } fix_upper_expr_context; +/* + * Selecting the best alternative in an AlternativeSubPlan expression requires + * estimating how many times that expression will be evaluated. For an + * expression in a plan node's targetlist, the plan's estimated number of + * output rows is clearly what to use, but for an expression in a qual it's + * far less clear. Since AlternativeSubPlans aren't heavily used, we don't + * want to expend a lot of cycles making such estimates. What we use is twice + * the number of output rows. That's not entirely unfounded: we know that + * clause_selectivity() would fall back to a default selectivity estimate + * of 0.5 for any SubPlan, so if the qual containing the SubPlan is the last + * to be applied (which it likely would be, thanks to order_qual_clauses()), + * this matches what we could have estimated in a far more laborious fashion. + * Obviously there are many other scenarios, but it's probably not worth the + * troubl
Re: PROC_IN_ANALYZE stillborn 13 years ago
Alvaro Herrera writes: > I pushed despite the objection because it seemed that downstream > discussion was largely favorable to the change, and there's a different > proposal to solve the bloat problem for analyze; and also: Note that this quasi-related patch has pretty thoroughly hijacked the CF entry for James' original docs patch proposal. The cfbot thinks that that's the latest patch in the original thread, and unsurprisingly is failing to apply it. Since the discussion was all over the place, I'm not sure whether there's still a live docs patch proposal or not; but if so, somebody should repost that patch (and go back to the original thread title). regards, tom lane
Background writer and checkpointer in crash recovery
(Forking this thread from the SLRU fsync one[1] to allow for a separate CF entry; it's unrelated, except for being another case of moving work off the recovery process's plate.) Hello hackers, Currently we don't run the bgwriter process during crash recovery. I've CCed Simon and Heikki who established this in commit cdd46c76. Based on that commit message, I think the bar to clear to change the policy is to show that it's useful, and that it doesn't make crash recovery less robust. See the other thread for some initial evidence of usefulness from Jakub Wartak. I think it also just makes intuitive sense that it's got to help bigger-than-buffer-pool crash recovery if you can shift buffer eviction out of the recovery loop. As for robustness, I suppose we could provide the option to turn it off just in case that turns out to be useful in some scenarios, but I'm wondering why we would expect something that we routinely run in archive/streaming recovery to reduce robustness in only slightly different circumstances. Here's an experiment-grade patch, comments welcome, though at this stage it's primarily thoughts about the concept that I'm hoping to solicit. One question raised by Jakub that I don't have a great answer to right now is whether you'd want different bgwriter settings in this scenario for best results. I don't know, but I suspect that the answer is to make bgwriter more adaptive rather than more configurable, and that's an orthogonal project. Once we had the checkpointer running, we could also consider making the end-of-recovery checkpoint optional, or at least the wait for it to complete. I haven't shown that in this patch, but it's just different checkpoint request flags and possibly an end-of-recovery record. What problems do you foresee with that? Why should we have "fast" promotion but not "fast" crash recovery? [1] https://www.postgresql.org/message-id/flat/CA+hUKGLJ=84yt+nvhkeedauutvhmfq9i-n7k_o50jmq6rpj...@mail.gmail.com From 21d7d459a6076f85c743b739f1c4ba16451b7046 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Wed, 26 Aug 2020 16:34:33 +1200 Subject: [PATCH 1/2] Run checkpointer and bgworker in crash recovery. Relieve the startup process of some writeback and checkpointing work during crash recovery, making it more like replication recovery. This wasn't done back in commit cdd46c76 because it wasn't thought to be useful at the time. The theory of this patch is that there may be workloads where we can profitably move a bunch of buffer eviction work out of the recovery process. In order to have a bgwriter running, you also need a checkpointer. While you could give startup and bgwriter their own backend private pending ops table, it's nicer to merger their requests in one place. XXX This is just an experimental prototype patch and may contain all kinds of bugs! --- src/backend/access/transam/xlog.c | 33 ++- src/backend/postmaster/bgwriter.c | 3 --- src/backend/postmaster/checkpointer.c | 3 --- src/backend/postmaster/postmaster.c | 17 ++ src/backend/storage/sync/sync.c | 30 +++- src/include/storage/sync.h| 1 - 6 files changed, 22 insertions(+), 65 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 09c01ed4ae..d8080ed4f5 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -870,9 +870,6 @@ bool reachedConsistency = false; static bool InRedo = false; -/* Have we launched bgwriter during recovery? 
*/ -static bool bgwriterLaunched = false; - /* For WALInsertLockAcquire/Release functions */ static int MyLockNo = 0; static bool holdingAllLocks = false; @@ -7123,25 +7120,14 @@ StartupXLOG(void) /* Also ensure XLogReceiptTime has a sane value */ XLogReceiptTime = GetCurrentTimestamp(); + PublishStartupProcessInformation(); + /* * Let postmaster know we've started redo now, so that it can launch - * checkpointer to perform restartpoints. We don't bother during - * crash recovery as restartpoints can only be performed during - * archive recovery. And we'd like to keep crash recovery simple, to - * avoid introducing bugs that could affect you when recovering after - * crash. - * - * After this point, we can no longer assume that we're the only - * process in addition to postmaster! Also, fsync requests are - * subsequently to be handled by the checkpointer, not locally. + * the archiver if necessary. */ - if (ArchiveRecoveryRequested && IsUnderPostmaster) - { - PublishStartupProcessInformation(); - EnableSyncRequestForwarding(); + if (IsUnderPostmaster) SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED); - bgwriterLaunched = true; - } /* * Allow read-only connections immediately if we're consistent @@ -7729,7 +7715,7 @@ StartupXLOG(void) * after we're fully out of recovery mode and already accepting * queries. */ - if (b
Re: Disk-based hash aggregate's cost model
On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote: On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote: We have a Postgres 13 open item for Disk-based hash aggregate, which is the only non-trivial open item. There is a general concern about how often we get disk-based hash aggregation when work_mem is particularly low, and recursion seems unavoidable. This is generally thought to be a problem in the costing. We discussed two approaches to tweaking the cost model: 1. Penalize HashAgg disk costs by a constant amount. It seems to be chosen a little too often, and we can reduce the number of plan changes. 2. Treat recursive HashAgg spilling skeptically, and heavily penalize recursive spilling. The problem with approach #2 is that we have a default hash mem of 4MB, and real systems have a lot more than that. In this scenario, recursive spilling can beat Sort by a lot. I think the main issue is that we're mostly speculating what's wrong. I've shared some measurements and symptoms, and we've discussed what might be causing it, but I'm not really sure we know for sure. I really dislike (1) because it seems more like "We don't know what's wrong so we'll penalize hashagg," kind of solution. A much more principled solution would be to tweak the costing accordingly, not just by adding some constant. For (2) it really depends if recursive spilling is really the problem here. In the examples I shared, the number of partitions/batches was very different, but the query duration was mostly independent (almost constant). FWIW I still haven't seen any explanation why the current code spills more data than the CP_SMALL_TLIST patch (which was reverted). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Background writer and checkpointer in crash recovery
Thomas Munro writes: > Once we had the checkpointer running, we could also consider making > the end-of-recovery checkpoint optional, or at least the wait for it > to complete. I haven't shown that in this patch, but it's just > different checkpoint request flags and possibly an end-of-recovery > record. What problems do you foresee with that? Why should we have > "fast" promotion but not "fast" crash recovery? I think that the rationale for that had something to do with trying to reduce the costs of repeated crashes. If you've had one crash, you might well have another one in your near future, due to the same problem recurring. Therefore, avoiding multiple replays of the same WAL is attractive; and to do that we have to complete a checkpoint before giving users a chance to crash us again. I'm not sure what I think about your primary proposal here. On the one hand, optimizing crash recovery ought to be pretty darn far down our priority list; if it happens often enough for performance to be a consideration then we have worse problems to fix. Also, at least in theory, not running the bgwriter/checkpointer makes for fewer moving parts and thus better odds of completing crash recovery successfully. On the other hand, it could be argued that running without the bgwriter/checkpointer actually makes recovery less reliable, since that's a far less thoroughly-tested operating regime than when they're active. tl;dr: I think this should be much less about performance than about whether we successfully recover or not. Maybe there's still a case for changing in that framework, though. regards, tom lane
Re: Windows buildfarm members vs. new async-notify isolation test
On Thu, Dec 12, 2019 at 9:11 AM Tom Lane wrote: > Amit Kapila writes: > > I am convinced by your points. So +1 for your proposed patch. I have > > already reviewed it yesterday and it appears fine to me. > > OK, pushed. Thanks for reviewing! I made a thing to watch out for low probability BF failures and it told me that a similar failure in async-notify might have reappeared on brolga: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-07-15%2008:30:11 | REL_10_STABLE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-05-21%2009:17:13 | REL9_6_STABLE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-04-22%2009:13:38 | REL9_6_STABLE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-04-05%2009:38:13 | REL9_6_STABLE https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-04-03%2021:17:39 | REL9_6_STABLE
Re: Windows buildfarm members vs. new async-notify isolation test
Thomas Munro writes: > I made a thing to watch out for low probability BF failures and it > told me that a similar failure in async-notify might have reappeared > on brolga: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=brolga&dt=2020-07-15%2008:30:11 > | REL_10_STABLE > [ etc ] Hm, interesting. None of these examples show an actual *failure* to receive a notification, unlike the example that began this thread. So it seems unlikely that back-patching 16114f2ea would help. What we are seeing here, instead, is delayed timing of notify receipt(s). I suspect that this is a variant of the issue described over here: https://www.postgresql.org/message-id/flat/2527507.1598237598%40sss.pgh.pa.us I didn't have a great idea about how to fix it reliably in insert-conflict-specconflict, and I lack one here too :-(. It's interesting though that your examples are all in v10 or older. Could we have done something that indirectly fixes the problem since then? Or is that just chance? regards, tom lane
Rare link canary failure in dblink test
Hi, A couple of recent cases where an error "libpq is incorrectly linked to backend functions" broke the dblink test: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-08-22%2002:55:22 | REL9_6_STABLE | Windows https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2020-04-27%2022:46:16 | HEAD | Windows Before that, lorikeet last reported that message before commit a33245a8, which apparently fixed something relevant, but curiously the message also appeared a couple of times on two Unixen. Perhaps we can write those off as lost in the mists of time. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-09-29%2010:56:25 | HEAD | Windows https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=damselfly&dt=2018-09-10%2011:09:29 | HEAD | Illumos https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jaguarundi&dt=2018-09-09%2018:11:15 | HEAD | FreeBSD
Re: Spurious "apparent wraparound" via SimpleLruTruncate() rounding
On Mon, May 25, 2020 at 12:00:33AM -0700, Noah Misch wrote: > On Mon, Apr 06, 2020 at 09:18:47PM -0700, Noah Misch wrote: > > On Mon, Apr 06, 2020 at 02:46:09PM -0400, Tom Lane wrote: > > > Noah Misch writes: > > > > On Wed, Mar 25, 2020 at 04:42:31PM -0400, Tom Lane wrote: > > > >> So I think what we're actually trying to accomplish here is to > > > >> ensure that instead of deleting up to half of the SLRU space > > > >> before the cutoff, we delete up to half-less-one-segment. > > > >> Maybe it should be half-less-two-segments, just to provide some > > > >> cushion against edge cases. Reading the first comment in > > > >> SetTransactionIdLimit makes one not want to trust too much in > > > >> arguments based on the exact value of xidWrapLimit, while for > > > >> the other SLRUs it was already unclear whether the edge cases > > > >> were exactly right. > > > > > > > That could be interesting insurance. While it would be sad for us to > > > > miss an > > > > edge case and print "must be vacuumed within 2 transactions" when wrap > > > > has > > > > already happened, reaching that message implies the DBA burned ~1M > > > > XIDs, all > > > > in single-user mode. More plausible is FreezeMultiXactId() overrunning > > > > the > > > > limit by tens of segments. Hence, if we do buy this insurance, let's > > > > skip far > > > > more segments. For example, instead of unlinking segments representing > > > > up to > > > > 2^31 past XIDs, we could divide that into an upper half that we unlink > > > > and a > > > > lower half. The lower half will stay in place; eventually, XID > > > > consumption > > > > will overwrite it. Truncation behavior won't change until the region > > > > of CLOG > > > > for pre-oldestXact XIDs exceeds 256 MiB. Beyond that threshold, > > > > vac_truncate_clog() will unlink the upper 256 MiB and leave the rest. > > > > CLOG > > > > maximum would rise from 512 MiB to 768 MiB. Would that be worthwhile? > > > > Temporarily wasting some disk > > > space is a lot more palatable than corrupting data, and these code > > > paths are necessarily not terribly well tested. So +1 for more > > > insurance. > > > > Okay, I'll give that a try. I expect this will replace the PagePrecedes > > callback with a PageDiff callback such that PageDiff(a, b) < 0 iff > > PagePrecedes(a, b). PageDiff callbacks shall distribute return values > > uniformly in [INT_MIN,INT_MAX]. SimpleLruTruncate() will unlink segments > > where INT_MIN/2 < PageDiff(candidate, cutoff) < 0. > > While doing so, I found that slru-truncate-modulo-v2.patch did get edge cases > wrong, as you feared. In particular, if the newest XID reached xidStopLimit > and was in the first page of a segment, TruncateCLOG() would delete its > segment. Attached slru-truncate-modulo-v3.patch fixes that; as restitution, I > added unit tests covering that and other scenarios. Reaching the bug via XIDs > was hard, requiring one to burn 1000k-CLOG_XACTS_PER_PAGE=967k XIDs in > single-user mode. I expect the bug was easier to reach via pg_multixact. > > The insurance patch stacks on top of the bug fix patch. It does have a > negative effect on TruncateMultiXact(), which uses SlruScanDirCbFindEarliest > to skip truncation in corrupted clusters. SlruScanDirCbFindEarliest() gives > nonsense answers if "future" segments exist. That can happen today, but the > patch creates new ways to make it happen. The symptom is wasting yet more > space in pg_multixact. I am okay with this, since it arises only after one > fills pg_multixact 50% full. 
There are alternatives. We could weaken the > corruption defense in TruncateMultiXact() or look for another implementation > of equivalent defense. We could unlink, say, 75% or 95% of the "past" instead > of 50% (this patch) or >99.99% (today's behavior). Rebased the second patch. The first patch did not need a rebase. Author: Noah Misch Commit: Noah Misch Prevent excess SimpleLruTruncate() deletion. Every core SLRU wraps around. With the exception of pg_notify, the wrap point can fall in the middle of a page. Account for this in the PagePrecedes callback specification and in SimpleLruTruncate()'s use of said callback. Update each callback implementation to fit the new specification. This changes SerialPagePrecedesLogically() from the style of asyncQueuePagePrecedes() to the style of CLOGPagePrecedes(). (Whereas pg_clog and pg_serial share a key space, pg_serial is nothing like pg_notify.) This closes a rare opportunity for data loss, which manifested as "apparent wraparound" or "could not access status of transaction" errors. This is more likely to affect pg_multixact, due to the thin space between multiStopLimit and multiWrapLimit. If a user's physical replication primary logged ": apparent wraparound" messages, the user should rebuild that standbys of that primary regardless of symptoms. At less risk is