Re: Perform streaming logical transactions by background workers and parallel apply
On Sat, Jan 7, 2023 at 11:13 AM houzj.f...@fujitsu.com wrote:
>
> On Saturday, January 7, 2023 12:50 PM Dilip Kumar wrote:
> >
> > On Fri, Jan 6, 2023 at 3:38 PM houzj.f...@fujitsu.com wrote:
> > >
> > Looks good, but I feel in pa_process_spooled_messages_if_required()
> > function, after getting the filestate, the first check should be
> > if (filestate == FS_EMPTY) return false. I mean, why process through
> > all the states if it is empty when we can directly exit. It is not a
> > big deal, so if you prefer the way it is then I have no objection to it.
>
> I think your suggestion looks good, I have adjusted the code.
> I also rebased the patch set due to the recent commit c6e1f6.
> And here is the new version patch set.
>

LGTM

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
How to define template types in PostgreSQL
Dear all

MobilityDB (https://github.com/MobilityDB/MobilityDB) defines at the C level
four template types: Set, Span, SpanSet, and Temporal. The type Set is akin
to PostgreSQL's ArrayType restricted to one dimension, but enforces the
constraint that sets do not have duplicates; the types Span and SpanSet are
akin to PostgreSQL's RangeType and MultirangeType but enforce the constraints
that span types are of fixed length and that empty spans and infinite bounds
are not allowed; and the type Temporal is used to manipulate time-varying
values.

These template types need to be instantiated at the SQL level with base types
(int, bigint, float, timestamptz, text, ...) and because of this, MobilityDB
needs to define numerous SQL functions that all call the same function in C.
Taking as example the Set type, we need to define, e.g.,

CREATE FUNCTION intset_eq(intset, intset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_eq' ...
CREATE FUNCTION bigintset_eq(bigintset, bigintset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_eq' ...
CREATE FUNCTION floatset_eq(floatset, floatset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_eq' ...
CREATE FUNCTION textset_eq(textset, textset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_eq' ...
...

CREATE FUNCTION intset_ne(intset, intset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_ne' ...
CREATE FUNCTION bigintset_ne(bigintset, bigintset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_ne' ...
CREATE FUNCTION floatset_ne(floatset, floatset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_ne' ...
CREATE FUNCTION textset_ne(textset, textset) RETURNS bool AS 'MODULE_PATHNAME', 'Set_ne' ...
...

In the case of arrays, ranges, and multiranges, PostgreSQL avoids this
redundancy using pseudo-types such as anyarray, anyrange, anymultirange, ...

Is there a possibility that we can also define pseudo-types such as anyset,
anyspan, anyspanset, anytemporal, ...?

This would considerably reduce the number of SQL functions to define.
Currently, given the high number of functions in MobilityDB, creating the
extension takes a long time.

Regards

Esteban
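For comparison, this is roughly how a single SQL-level declaration over an
existing polymorphic pseudo-type covers every element type today (a sketch
only; the function and C symbol names are made up):

CREATE FUNCTION set_like_eq(anyarray, anyarray) RETURNS bool
    AS 'MODULE_PATHNAME', 'Set_like_eq'
    LANGUAGE C IMMUTABLE STRICT PARALLEL SAFE;

-- one declaration, usable for int[], bigint[], float8[], text[], ...

Something analogous (anyset, anyspan, anyspanset, anytemporal) is what we
would need for the MobilityDB template types.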
Allow DISTINCT to use Incremental Sort
While working on the regression tests added in a14a58329, I noticed that DISTINCT does not make use of Incremental Sort. It'll only ever do full sorts on the cheapest input path or make use of a path that's already got the required pathkeys. Also, I see that create_final_distinct_paths() is a little quirky and if the cheapest input path happens to be sorted, it'll add_path() the same path twice, which seems like a bit of a waste of effort. That could happen if say enable_seqscan is off or if a Merge Join is the cheapest join method. Additionally, the parallel DISTINCT code looks like it should also get the same treatment. I see that I'd coded this to only add a unique path atop of a presorted path and it never considers sorting the cheapest partial path. I've adjusted that in the attached and also made it consider incremental sorting any path with presorted keys. Please see the attached patch. David diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 000d757bdd..c908944071 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -4654,22 +4654,63 @@ create_partial_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel, cheapest_partial_path->rows, NULL, NULL); - /* first try adding unique paths atop of sorted paths */ + /* +* First try sorting the cheapest path and incrementally sorting any paths +* with presorted keys and put a unique paths atop of those. +*/ if (grouping_is_sortable(parse->distinctClause)) { foreach(lc, input_rel->partial_pathlist) { - Path *path = (Path *) lfirst(lc); + Path *input_path = (Path *) lfirst(lc); + Path *sorted_path; + boolis_sorted; + int presorted_keys; - if (pathkeys_contained_in(root->distinct_pathkeys, path->pathkeys)) + is_sorted = pathkeys_count_contained_in(root->distinct_pathkeys, + input_path->pathkeys, + &presorted_keys); + + if (is_sorted) + sorted_path = input_path; + else { - add_partial_path(partial_distinct_rel, (Path *) - create_upper_unique_path(root, - partial_distinct_rel, - path, - list_length(root->distinct_pathkeys), - numDistinctRows)); + /* +* Try at least sorting the cheapest path and also try +* incrementally sorting any path which is partially sorted +* already (no need to deal with paths which have presorted keys +* when incremental sort is disabled unless it's the cheapest +* partial path). +*/ + if (input_path != cheapest_partial_path && + (presorted_keys == 0 || !enable_incremental_sort)) + continue; + + /* +* We've no need to consider both a sort and incremental sort. +* We'll just do a sort if there are no presorted keys and an +* incremental sort when there are presorted keys. +*/ + if (presorted_keys == 0 || !enable_incremental_sort) + sorted_path = (Path *) create_sort_path(root, + partial_distinct_rel, + input_path, +
Re: Notify downstream to discard the streamed transaction which was aborted due to crash.
On Fri, Jan 6, 2023 at 11:18 AM Dilip Kumar wrote:
>
> > To fix it, one idea is to send a stream abort message when we are
> > cleaning up a crashed transaction on the publisher (e.g. in
> > ReorderBufferAbortOld()). And here is a tiny patch which changes the
> > same. I have confirmed that the bug is fixed and all regression tests
> > pass. I didn't add a testcase because we need to make sure the crash
> > happens before all the WAL-logged transaction data is decoded, which
> > doesn't seem easy to write a stable test for.
> >
> > Thoughts?
>
> The fix looks good to me. Thanks for working on this.

Pushed!

--
With Regards,
Amit Kapila.
Re: Generating code for query jumbling through gen_node_support.pl
On Sat, Jan 07, 2023 at 07:37:49AM +0100, Peter Eisentraut wrote:
> The generation script already has a way to identify location fields, by ($t
> eq 'int' && $f =~ 'location$'), so you could use that as well.

I recall that some of the nodes may need renames to map with this choice.
That could be just one patch on top of the actual feature.

> I'm concerned about the large number of additional field annotations this
> adds. We have been careful so far to document the use of each attribute,
> e.g., *why* does a field not need to be copied etc. This patch adds dozens
> and dozens of annotations without any explanation at all. Now, the code
> this replaces also has no documentation, but maybe this is the time to add
> some.

In most cases, the addition of the node marker would be enough to
self-explain why they are included, but there is a trend for a lot of the
nodes when it comes to collations and typmods, where we don't want to see
these in the jumbling calculations. I'll look at providing more info for all
that. (Note: I'm out for now.)
--
Michael
Re: [PATCH] Add function to_oct
On Thu, 22 Dec 2022 at 23:11, Eric Radman wrote:
>
> On Thu, Dec 22, 2022 at 10:08:17AM +, Dag Lem wrote:
> >
> > The calculation of quotient and remainder can be replaced by less costly
> > masking and shifting.
> >
> > Defining
> >
> > #define OCT_DIGIT_BITS 3
> > #define OCT_DIGIT_BITMASK 0x7
> >
> > the content of the loop can be replaced by
> >
> > *--ptr = digits[value & OCT_DIGIT_BITMASK];
> > value >>= OCT_DIGIT_BITS;
> >
> > Also, the check for ptr > buf in the while loop can be removed. The
> > check is superfluous, since buf cannot possibly be exhausted by a 32
> > bit integer (the maximum octal number being 37777777777).
>
> I integrated these suggestions in the attached -v2 patch and tested the
> range of values manually.
>
> Also picked an OID > 8000 as suggested by unused_oids.

A few suggestions:

1) We could use to_oct instead of to_oct32, as we don't have multiple
implementations for to_oct:

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 98d90d9338..fde0b24563 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -3687,6 +3687,9 @@
 { oid => '2090', descr => 'convert int8 number to hex',
   proname => 'to_hex', prorettype => 'text', proargtypes => 'int8',
   prosrc => 'to_hex64' },
+{ oid => '8335', descr => 'convert int4 number to oct',
+  proname => 'to_oct', prorettype => 'text', proargtypes => 'int4',
+  prosrc => 'to_oct32' },

2) Similarly we could change this to "to_oct":

+/*
+ * Convert an int32 to a string containing a base 8 (oct) representation of
+ * the number.
+ */
+Datum
+to_oct32(PG_FUNCTION_ARGS)
+{
+	uint32		value = (uint32) PG_GETARG_INT32(0);

3) I'm not sure if AS "" is required, but I also noticed it is done
similarly in to_hex too:

+--
+-- test to_oct
+--
+select to_oct(256*256*256 - 1) AS "";
+
+--
+
+(1 row)

4) You could add a commit message for the patch.

Regards,
Vignesh
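For what it's worth, a quick sanity check of the 32-bit boundaries could look
like this (the expected outputs are just the octal math written out, not
actual run output):

select to_oct(0);           -- 0
select to_oct(2147483647);  -- 17777777777
select to_oct(-1);          -- 37777777777 (cast to uint32, matching to_hex(-1) = ffffffff)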
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
Thanks for looking into this. On 07/01/23 09:58, David Rowley wrote: On Sat, 7 Jan 2023 at 02:11, Ankit Kumar Pandey wrote: On 05/01/23 12:53, David Rowley wrote: We *can* reuse Sorts where a more strict or equivalent sort order is available. The question is how do we get the final WindowClause to do something slightly more strict to save having to do anything for the ORDER BY. One way you might think would be to adjust the WindowClause's orderClause to add the additional clauses, but that cannot be done because that would cause are_peers() in nodeWindowAgg.c to not count some rows as peers when they maybe should be given a less strict orderClause in the WindowClause. I attempted this in attached patch. I had a quick look at this and it's going to need some additional code to ensure there are no WindowFuncs in the ORDER BY clause. We can't sort on those before we evaluate them. Okay I will add this check. I also don't think there's any point in adding the additional pathkeys when the input path is already presorted. It would be better to leave this case alone and just do the incremental sort afterwards. So this will be no operation case well. You also don't seem to be considering the fact that the query might have a DISTINCT clause. Major reason for this was that I am not exactly aware of what distinct clause means (especially in context of window functions) and how it is different from other sortClauses (partition, order by, group). Comments in parsenodes.h didn't help. That's evaluated between window functions and the order by. It would be fairly useless to do a more strict sort when the sort order is going to be obliterated by a Hash Aggregate. Perhaps we can just not do this when the query has a DISTINCT clause. On the other hand, there are also a few reasons why we shouldn't do this. I mentioned the WindowClause runConditions earlier here. The patched version produces: postgres=# explain (analyze, costs off) select * from (select oid,relname,row_number() over (order by oid) rn1 from pg_class order by oid,relname) where rn1 < 10; QUERY PLAN -- WindowAgg (actual time=0.488..0.497 rows=9 loops=1) Run Condition: (row_number() OVER (?) < 10) -> Sort (actual time=0.466..0.468 rows=10 loops=1) Sort Key: pg_class.oid, pg_class.relname Sort Method: quicksort Memory: 67kB -> Seq Scan on pg_class (actual time=0.028..0.170 rows=420 loops=1) Planning Time: 0.214 ms Execution Time: 0.581 ms (8 rows) Whereas master produces: postgres=# explain (analyze, costs off) select * from (select oid,relname,row_number() over (order by oid) rn1 from pg_class order by oid,relname) where rn1 < 10; QUERY PLAN Incremental Sort (actual time=0.506..0.508 rows=9 loops=1) Sort Key: pg_class.oid, pg_class.relname Presorted Key: pg_class.oid Full-sort Groups: 1 Sort Method: quicksort Average Memory: 26kB Peak Memory: 26kB -> WindowAgg (actual time=0.475..0.483 rows=9 loops=1) Run Condition: (row_number() OVER (?) < 10) -> Sort (actual time=0.461..0.461 rows=10 loops=1) Sort Key: pg_class.oid Sort Method: quicksort Memory: 67kB -> Seq Scan on pg_class (actual time=0.022..0.178 rows=420 loops=1) Planning Time: 0.245 ms Execution Time: 0.594 ms (12 rows) (slightly bad example since oid is unique but...) It's not too clear to me that the patched version is a better plan. The bottom level sort, which sorts 420 rows has a more complex comparison to do. Imagine the 2nd column is a long text string. That would make the sort much more expensive. 
The top-level sort has far fewer rows to sort due to the runCondition filtering out anything that does not match rn1 < 10. The same can be said for a query with a LIMIT clause. Yes, this is a fair point. Multiple sort is actually beneficial in cases like this, perhaps limits clause and runCondition should be no op too? I think the patch should also be using pathkeys_contained_in() and Lists of pathkeys rather than concatenating lists of SortGroupClauses together. That should allow things to work correctly when a given pathkey has become redundant due to either duplication or a Const in the Eclass. Make sense, I actually duplicated that logic from make_pathkeys_for_window. We should make this changes there as well because if we have SELECT rank() OVER (PARTITION BY a ORDER BY a) (weird example but you get the idea), it leads to duplicates in window_sortclauses. Also, since I seem to be only be able to think of these cases properly by actually trying them, I ended up with the attached patch. I opted to not do the optimisation when there are runConditions or a LIMIT clause. Do
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 07/01/23 07:59, David Rowley wrote:
> On Thu, 5 Jan 2023 at 04:11, Ankit Kumar Pandey wrote:
> >
> > Attaching test cases for this (+ small change in doc). Tested this in
> > one of my WIP branches where I had modified select_active_windows and
> > it failed as expected. Please let me know if something can be improved
> > in this.
>
> Thanks for writing that. I had a look over the patch and ended up making
> some adjustments to the tests. Looking back at 728202b63, I think any
> tests we add here should be kept alongside the tests added by that commit
> rather than tacked on to the end of the test file. It also makes sense to
> me just to use the same table as the original tests.
>
> I also thought the comment in select_active_windows should be in the sort
> comparator function instead. I think that's a more likely place to
> capture the attention of anyone making modifications.

Thanks, I will look it through.

> I've now pushed the adjusted patch.

I can't seem to find the updated patch in the attachment, can you please
forward the patch again. Thanks.

--
Regards,
Ankit Kumar Pandey
Re: add \dpS to psql
On Sat, 7 Jan 2023 at 00:36, Nathan Bossart wrote:
>
> On Fri, Jan 06, 2023 at 06:52:33PM +, Dean Rasheed wrote:
> >
> > So I think we should use the same SQL clauses as every other psql
> > command that supports "S", namely:
> >
> >     if (!showSystem && !pattern)
> >         appendPQExpBufferStr(&buf, " AND n.nspname <> 'pg_catalog'\n"
> >                              " AND n.nspname <> 'information_schema'\n");
>
> Good catch. I should have noticed this. The deleted comment mentions that
> the system/temp tables normally aren't very interesting from a permissions
> perspective, so perhaps there is an argument for always excluding temp
> tables without a pattern. After all, \dp always excludes indexes and TOAST
> tables. However, it looks like \dt includes temp tables, and I didn't see
> any other meta-commands that excluded them.

It might be true that temp tables aren't usually interesting from a
permissions point of view, but it's not hard to imagine situations where
interesting things do happen. It's also probably the case that most users
won't have many temp tables, so I don't think including them by default will
be particularly intrusive.

Also, from a user perspective, I think it would be something of a POLA
violation for \dp[S] and \dt[S] to include different sets of tables, though
I appreciate that we do that now. There's nothing in the docs to indicate
that that's the case.

Anyway, I've pushed the v2 patch as-is. If anyone feels strongly enough that
we should change its behaviour for temp tables, then we can still discuss
that.

Regards,
Dean
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
Your email client seems to be adding additional vertical space to your emails. I've removed the additional newlines in the quotes. Are you able to fix the client so it does not do that? On Sun, 8 Jan 2023 at 00:10, Ankit Kumar Pandey wrote: > > On 07/01/23 09:58, David Rowley wrote: > > You also don't seem to be considering the fact that the query might > > have a DISTINCT clause. > > Major reason for this was that I am not exactly aware of what distinct > clause means (especially in > > context of window functions) and how it is different from other > sortClauses (partition, order by, group). I'm talking about the query's DISTINCT clause. i.e SELECT DISTINCT. If you look in the grouping_planner() function, you'll see that create_distinct_paths() is called between create_window_paths() and create_ordered_paths(). > Yes, this is a fair point. Multiple sort is actually beneficial in cases > like this, perhaps limits clause and runCondition should be no op too? I'm not sure what you mean by "no op". Do you mean not apply the optimization? > > I think the patch should also be using pathkeys_contained_in() and > > Lists of pathkeys rather than concatenating lists of SortGroupClauses > > together. That should allow things to work correctly when a given > > pathkey has become redundant due to either duplication or a Const in > > the Eclass. > > Make sense, I actually duplicated that logic from > make_pathkeys_for_window. We should make this changes there as well because > if we have SELECT rank() OVER (PARTITION BY a ORDER BY a) > (weird example but you get the idea), it leads to duplicates in > window_sortclauses. It won't lead to duplicate pathkeys. Look in make_pathkeys_for_sortclauses() and what pathkey_is_redundant() does. Notice that it checks if the pathkey already exists in the list before appending. > Agree with runConditions part but for limit clause, row reduction happens > at the last, so whether we use patched version or master version, > none of sorts would benefit/degrade from that, right? Maybe you're right. Just be aware that a sort done for a query with an ORDER BY LIMIT will perform a top-n sort. top-n sorts only need to store the top-n tuples and that can significantly reduce the memory required for sorting perhaps resulting in the sort fitting in memory rather than spilling out to disk. You might want to test this by having the leading sort column as an INT, and then the 2nd one as a long text column of maybe around two kilobytes. Make all the leading column values the same so that the comparison for the text column is always performed. Make the LIMIT small compared to the total number of rows, that should test the worse case. Check the performance with and without the limitCount != NULL part of the patch that disables the optimization for LIMIT. > Is it okay > to add comments in test cases? I don't see it much on existing cases > so kind of reluctant to add but it makes intentions much more clear. I think tests should always have a comment to state what they're testing. Not many people seem to do that, unfortunately. The problem with not stating what the test is testing is that if, for example, the test is checking that the EXPLAIN output is showing a Sort, what if at some point in the future someone adjusts some costing code and the plan changes to an Index Scan. If there's no comment to state that we're looking for a Sort plan, then the author of the patch that's adjusting the costs might just think it's ok to change the expected plan to an Index Scan. 
I've seen this problem occur even when the comments *do* exist. There's just about no hope of such a test continuing to do what it's meant to if the comments don't exist. David
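(For reference, a concrete sketch of the LIMIT worst case described above;
the table and column names are made up and the sizes are only illustrative:)

CREATE TABLE sort_test AS
    SELECT 1 AS a, repeat('x', 2000) AS b
    FROM generate_series(1, 1000000);

-- With the LIMIT, the final sort can be a top-N heapsort that keeps only a
-- handful of tuples in memory; if the full (a, b) sort is pushed below the
-- WindowAgg instead, that cheap top-N sort is no longer possible, which is
-- why the optimisation is skipped when a LIMIT is present.
SELECT count(*) OVER (ORDER BY a) FROM sort_test ORDER BY a, b LIMIT 10;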
Re: Improve tab completion for ALTER FUNCTION/PROCEDURE/ROUTINE
On Fri, 6 Jan 2023 at 15:33, Dean Rasheed wrote:
>
> On Fri, 6 Jan 2023 at 02:38, vignesh C wrote:
> >
> > On Thu, 5 Jan 2023 at 18:22, Dean Rasheed wrote:
> > >
> > > That leads to the attached, which barring objections, I'll push shortly.
> >
> > The changes look good to me.
>
> Pushed.

Thanks for pushing this.

Regards,
Vignesh
Re: [BUG] Logical replica crash if there was an error in a function.
On Sun, 11 Dec 2022 at 09:21, Anton A. Melnikov wrote:
>
> On 07.12.2022 21:03, Andres Freund wrote:
> >
> > This CF entry causes tests to fail on all platforms:
> > https://cirrus-ci.com/build/5755408111894528
> >
> > E.g.
> > https://api.cirrus-ci.com/v1/artifact/task/5298457144459264/testrun/build/testrun/subscription/100_bugs/log/regress_log_100_bugs
> >
> > Begin standard error
> > psql::1: NOTICE: dropped replication slot "sub1" on publisher
> > End standard error
> > timed out waiting for match: ERROR: relation "error_name" does not exist
> > at character at /tmp/cirrus-ci-build/src/test/subscription/t/100_bugs.pl
> > line 115.
> >
> > Greetings,
> >
> > Andres Freund
>
> Thank you for reminding!
>
> There was a conflict when applying v2 on current master.
> Rebased v3 is attached.

A few suggestions:

1) There is a warning:

+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+ "ERROR: relation \"error_name\" does not exist at character"
+);

"my" variable $result masks earlier declaration in same scope at
t/100_bugs.pl line 400.

You can change:
my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");
to
$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1");

2) Now that the crash is fixed, you could change it to a better message:

+# This would crash on the subscriber if not fixed
+$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)");
+
+my $result = $node_subscriber->wait_for_log(
+ "ERROR: relation \"error_name\" does not exist at character"
+);

Regards,
Vignesh
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 07/01/23 17:28, David Rowley wrote: Your email client seems to be adding additional vertical space to your emails. I've removed the additional newlines in the quotes. Are you able to fix the client so it does not do that? I have adjusted my mail client, hope it is better now? On Sun, 8 Jan 2023 at 00:10, Ankit Kumar Pandey wrote: > > On 07/01/23 09:58, David Rowley wrote: > > You also don't seem to be considering the fact that the query might > > have a DISTINCT clause. > > Major reason for this was that I am not exactly aware of what distinct > clause means (especially in > > context of window functions) and how it is different from other > sortClauses (partition, order by, group). I'm talking about the query's DISTINCT clause. i.e SELECT DISTINCT. If you look in the grouping_planner() function, you'll see that create_distinct_paths() is called between create_window_paths() and create_ordered_paths(). Yes just saw this and got what you meant. > Yes, this is a fair point. Multiple sort is actually beneficial in cases > like this, perhaps limits clause and runCondition should be no op too? I'm not sure what you mean by "no op". Do you mean not apply the optimization? Yes, no op = no optimization. Sorry I didn't mention it before. > > I think the patch should also be using pathkeys_contained_in() and > > Lists of pathkeys rather than concatenating lists of SortGroupClauses > > together. That should allow things to work correctly when a given > > pathkey has become redundant due to either duplication or a Const in > > the Eclass. > > Make sense, I actually duplicated that logic from > make_pathkeys_for_window. We should make this changes there as well because > if we have SELECT rank() OVER (PARTITION BY a ORDER BY a) > (weird example but you get the idea), it leads to duplicates in > window_sortclauses. It won't lead to duplicate pathkeys. Look in make_pathkeys_for_sortclauses() and what pathkey_is_redundant() does. Notice that it checks if the pathkey already exists in the list before appending. Okay I see this, pathkey_is_redundant is much more smarter as well. Replacing list_concat_copy with list_concat_unique in make_pathkeys_for_window won't be of much benefit. > Agree with runConditions part but for limit clause, row reduction happens > at the last, so whether we use patched version or master version, > none of sorts would benefit/degrade from that, right? Maybe you're right. Just be aware that a sort done for a query with an ORDER BY LIMIT will perform a top-n sort. top-n sorts only need to store the top-n tuples and that can significantly reduce the memory required for sorting perhaps resulting in the sort fitting in memory rather than spilling out to disk. You might want to test this by having the leading sort column as an INT, and then the 2nd one as a long text column of maybe around two kilobytes. Make all the leading column values the same so that the comparison for the text column is always performed. Make the LIMIT small compared to the total number of rows, that should test the worse case. Check the performance with and without the limitCount != NULL part of the patch that disables the optimization for LIMIT. I checked this. For limit <<< total number of rows, top-n sort was performed but when I changed limit to higher value (or no limit), quick sort was performed. Top-n sort was twice as fast. Also, tested (first) patch version vs master, top-n sort was twice as fast there as well (outputs mentioned below). 
Current patch version (with limit excluded for optimizations) explain (analyze ,costs off) select count(*) over (order by id) from tt order by id, name limit 1; QUERY PLAN --- Limit (actual time=1.718..1.719 rows=1 loops=1) -> Incremental Sort (actual time=1.717..1.717 rows=1 loops=1) Sort Key: id, name Presorted Key: id Full-sort Groups: 1 Sort Method: top-N heapsort Average Memory: 25kB Peak Memory: 25kB -> WindowAgg (actual time=0.028..0.036 rows=6 loops=1) -> Sort (actual time=0.017..0.018 rows=6 loops=1) Sort Key: id Sort Method: quicksort Memory: 25kB -> Seq Scan on tt (actual time=0.011..0.012 rows=6 loops=1) Planning Time: 0.069 ms Execution Time: 1.799 ms Earlier patch(which included limit clause for optimizations) explain (analyze ,costs off) select count(*) over (order by id) from tt order by id, name limit 1; QUERY PLAN Limit (actual time=3.766..3.767 rows=1 loops=1) -> WindowAgg (actual time=3.764..3.765 rows=1 loops=1) -> Sort (actual time=3.749..3.750 rows=6 loops=1) Sort Key: id, name Sort Method: quicksort
Re: MERGE ... WHEN NOT MATCHED BY SOURCE
On Thu, 5 Jan 2023 at 13:21, Dean Rasheed wrote: > > On Thu, 5 Jan 2023 at 11:03, Alvaro Herrera wrote: > > > > > + /* Join type required */ > > > + if (left_join && right_join) > > > + qry->mergeJoinType = JOIN_FULL; > > > + else if (left_join) > > > + qry->mergeJoinType = JOIN_LEFT; > > > + else if (right_join) > > > + qry->mergeJoinType = JOIN_RIGHT; > > > + else > > > + qry->mergeJoinType = JOIN_INNER; > > > > One of the review comments that MERGE got initially was that parse > > analysis was not a place to "do query optimization", in the sense that > > the original code was making a decision whether to make an outer or > > inner join based on the set of WHEN clauses that appear in the command. > > That's how we ended up with transform_MERGE_to_join and > > mergeUseOuterJoin instead. This new code is certainly not the same, but > > it makes me a bit unconfortable. Maybe it's OK, though. > > > > Yeah I agree, it's a bit ugly. Perhaps a better solution would be to > do away with that field entirely and just make the decision in > transform_MERGE_to_join() by examining the action list again. > Attached is an updated patch taking that approach, allowing mergeUseOuterJoin to be removed from the Query node, which I think is probably a good thing. Aside from that, it includes a few additional comment updates in the executor that I'd missed, and psql tab completion support. Regards, Dean diff --git a/doc/src/sgml/mvcc.sgml b/doc/src/sgml/mvcc.sgml new file mode 100644 index b87ad5c..1482ede --- a/doc/src/sgml/mvcc.sgml +++ b/doc/src/sgml/mvcc.sgml @@ -396,8 +396,8 @@ originally matched appears later in the list of actions. On the other hand, if the row is concurrently updated or deleted so that the join condition fails, then MERGE will -evaluate the condition's NOT MATCHED actions next, -and execute the first one that succeeds. +evaluate the condition's NOT MATCHED [BY TARGET] +actions next, and execute the first one that succeeds. If MERGE attempts an INSERT and a unique index is present and a duplicate row is concurrently inserted, then a uniqueness violation error is raised; diff --git a/doc/src/sgml/ref/merge.sgml b/doc/src/sgml/ref/merge.sgml new file mode 100644 index 0995fe0..8ef121a --- a/doc/src/sgml/ref/merge.sgml +++ b/doc/src/sgml/ref/merge.sgml @@ -33,7 +33,8 @@ USING dat and when_clause is: { WHEN MATCHED [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } | - WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING } } + WHEN NOT MATCHED BY SOURCE [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } | + WHEN NOT MATCHED [BY TARGET] [ AND condition ] THEN { merge_insert | DO NOTHING } } and merge_insert is: @@ -70,7 +71,9 @@ DELETE from data_source to target_table_name producing zero or more candidate change rows. For each candidate change - row, the status of MATCHED or NOT MATCHED + row, the status of MATCHED, + NOT MATCHED BY SOURCE, + or NOT MATCHED [BY TARGET] is set just once, after which WHEN clauses are evaluated in the order specified. For each candidate change row, the first clause to evaluate as true is executed. No more than one WHEN @@ -226,16 +229,37 @@ DELETE At least one WHEN clause is required. + The WHEN clause may specify WHEN MATCHED, + WHEN NOT MATCHED BY SOURCE, or + WHEN NOT MATCHED [BY TARGET]. + Note that the SQL standard only defines + WHEN MATCHED and WHEN NOT MATCHED + (which is defined to mean no matching target row). 
+ WHEN NOT MATCHED BY SOURCE is an extension to the + SQL standard, as is the option to append + BY TARGET to WHEN NOT MATCHED, to + make its meaning more explicit. + + If the WHEN clause specifies WHEN MATCHED - and the candidate change row matches a row in the + and the candidate change row matches a row in the source to a row in the target_table_name, the WHEN clause is executed if the condition is absent or it evaluates to true. - Conversely, if the WHEN clause specifies - WHEN NOT MATCHED + If the WHEN clause specifies + WHEN NOT MATCHED BY SOURCE and the candidate change + row represents a row in the + target_table_name that does + not match a source row, the WHEN clause is executed + if the condition is + absent or it evaluates to true. + + + If the WHEN clause specifies + WHEN NOT MATCHED [BY TARGET] and the candidate change row does not match a row in the target_table_name, the WHEN clause is executed if the @@ -257,7 +281,10 @@ DELETE A condition on a WHEN MATCHED clause can refer to columns in both the source and the target relations. A condition on a - WHEN NOT MATCHED clause can only refe
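(For reference, a complete command using the new clauses might look like
this; a sketch with made-up table and column names, following the syntax
documented above:)

MERGE INTO inventory AS t
USING stock_updates AS s ON t.item_id = s.item_id
WHEN MATCHED AND s.qty > 0 THEN
    UPDATE SET qty = s.qty
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
WHEN NOT MATCHED BY TARGET THEN
    INSERT (item_id, qty) VALUES (s.item_id, s.qty);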
Re: postgres_fdw: using TABLESAMPLE to collect remote sample
On 1/6/23 23:41, Tomas Vondra wrote:
> On 1/6/23 17:58, Tom Lane wrote:
>> Tomas Vondra writes:
>>> The one difference is that I realized the relkind check does not
>>> actually say we can't do sampling - it just means we can't use
>>> TABLESAMPLE to do it. We could still use "random()" ...
>>
>>> Furthermore, I don't think we should silently disable sampling when the
>>> user explicitly requests TABLESAMPLE by specifying bernoulli/system for
>>> the table - IMHO it's less surprising to just fail in that case.
>>
>> Agreed on both points. This patch looks good to me.
>
> Good, I'll get this committed. The "ORDER BY random()" idea is a possible
> improvement, can be discussed on its own.

Pushed.

>>> Of course, all relkinds that don't support TABLESAMPLE currently have
>>> reltuples value that will disable sampling anyway (e.g. views have -1).
>>> So we won't actually fallback to random() anyway, because we can't
>>> calculate the sample fraction.
>>> That's a bit annoying for foreign tables pointing at a view, which is a
>>> more likely use case than a table pointing at a sequence.
>>
>> Right, that's a case worth being concerned about.
>>
>>> But I realized we could actually still do "random()" sampling:
>>> SELECT * FROM t ORDER BY random() LIMIT $X;
>>
>> Hmm, interesting idea, but it would totally bollix our correlation
>> estimates. Not sure that those are worth anything for remote views,
>> but still...
>
> But isn't that an issue that we already have? I mean, if we do ANALYZE
> on a foreign table pointing to a view, we fetch all the results. But if
> the view does not have a well-defined ORDER BY, a trivial plan change
> may change the order of results and thus the correlation.
>
> Actually, how is a correlation even defined for a view?
>
> It's true this "ORDER BY random()" thing would make it less stable, as
> it would change the correlation on every run. Although - the calculated
> correlation is actually quite stable, because it's guaranteed to be
> pretty close to 0 because we make the order random.
>
> Maybe that's actually better than the current state where it depends on
> the plan? Why not look at the relkind and just set the correlation to 0.0
> in such cases?
>
> But if we want to restore the current behavior (where it depends on the
> actual query plan), we could do something like this:
>
>    SELECT * FROM the_remote_view ORDER BY row_number() over ();
>
> But yeah, this makes the remote sampling more expensive. Probably still
> a win because of network costs, but not great.

I've been thinking about this a bit more, and I'll consider submitting a
patch for the next CF. IMHO it's probably better to accept the correlation
being 0 in these cases - it's more representative of what we know about the
view / plan output (or rather the lack of knowledge).

However, maybe views are not the best / most common example to think about.
I'd imagine it's much more common to reference a regular table, but the
table gets truncated / populated quickly, and/or the autovacuum workers are
busy so it takes time to update reltuples. But in this case it's also quite
simple to fix the correlation by just ordering by ctid (which I guess we
might do based on the relkind).

There's a minor issue with partitioned tables, with foreign partitions
pointing to remote views. This is kinda broken, because the sample size for
individual relations is determined based on relpages. But that's 0 for
views, so these partitions get ignored when building the sample. But that's
a pre-existing issue.
regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: postgres_fdw: using TABLESAMPLE to collect remote sample
Tomas Vondra writes:
> However, maybe views are not the best / most common example to think
> about. I'd imagine it's much more common to reference a regular table,
> but the table gets truncated / populated quickly, and/or the autovacuum
> workers are busy so it takes time to update reltuples. But in this case
> it's also quite simple to fix the correlation by just ordering by ctid
> (which I guess we might do based on the relkind).

> There's a minor issue with partitioned tables, with foreign partitions
> pointing to remote views. This is kinda broken, because the sample size
> for individual relations is determined based on relpages. But that's 0
> for views, so these partitions get ignored when building the sample. But
> that's a pre-existing issue.

I wonder if we should stop consulting reltuples directly at all, and
instead do "EXPLAIN SELECT * FROM remote_table" to get the remote
planner's estimate of the number of rows involved. Even for a plain
table, that's likely to be a better number.

			regards, tom lane
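(Concretely, the probe would look something like this; the relation name and
numbers are made up, and postgres_fdw already parses EXPLAIN output in a
similar way when use_remote_estimate is on:)

EXPLAIN SELECT * FROM public.remote_tbl;
--  Seq Scan on remote_tbl  (cost=0.00..445.00 rows=10000 width=244)
-- the "rows" estimate here would stand in for the reltuples value when
-- computing the sample fraction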
Re: [PATCH] Add function to_oct
vignesh C writes:
> Few suggestions
> 1) We could use to_oct instead of to_oct32 as we don't have multiple
> implementations for to_oct

That seems (a) shortsighted and (b) inconsistent with the naming pattern
used for to_hex, so I doubt it'd be an improvement.

			regards, tom lane
Re: [RFC] Add jit deform_counter
> On Fri, Jan 06, 2023 at 09:42:09AM +0100, Pavel Stehule wrote: > The explain part is working, the part of pg_stat_statements doesn't > > set jit_above_cost to 10; > set jit_optimize_above_cost to 10; > set jit_inline_above_cost to 10; > > (2023-01-06 09:08:59) postgres=# explain analyze select > count(length(prosrc) > 0) from pg_proc; > ┌┐ > │ QUERY PLAN > │ > ╞╡ > │ Aggregate (cost=154.10..154.11 rows=1 width=8) (actual > time=132.320..132.321 rows=1 loops=1) │ > │ -> Seq Scan on pg_proc (cost=0.00..129.63 rows=3263 width=16) (actual > time=0.013..0.301 rows=3266 loops=1) │ > │ Planning Time: 0.070 ms > │ > │ JIT: > │ > │ Functions: 3 > │ > │ Options: Inlining true, Optimization true, Expressions true, Deforming > true │ > │ Timing: Generation 0.597 ms, Deforming 0.407 ms, Inlining 8.943 ms, > Optimization 79.403 ms, Emission 43.091 ms, Total 132.034 ms │ > │ Execution Time: 132.986 ms > │ > └┘ > (8 rows) > > I see the result of deforming in explain analyze, but related values in > pg_stat_statements are 0. I'm not sure why, but pgss jit metrics are always nulls for explain analyze queries. I have noticed this with surprise myself, when recently was reviewing the lazy jit patch, but haven't yet figure out what is the reason. Anyway, without "explain analyze" you'll get correct deforming numbers in pgss. > Minimally, the values are assigned in wrong order > > + if (api_version >= PGSS_V1_11) > + { > + values[i++] = Float8GetDatumFast(tmp.jit_deform_time); > + values[i++] = Int64GetDatumFast(tmp.jit_deform_count); > + } (facepalm) Yep, will fix the order. > After reading the doc, I am confused what this metric means > > + > + > + jit_deform_count bigint > + > + > + Number of times tuples have been deformed > + > + > + > + > + > + jit_deform_time double > precision > + > + > + Total time spent by the statement on deforming tuples, in > milliseconds > + > + > > It is not clean so these times and these numbers are related just to the > compilation of the deforming process, not by own deforming. Good point, I need to formulate this more clearly.
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 07/01/23 09:58, David Rowley wrote: The attached patch has no tests added. It's going to need some of those. While writing test cases, I found that optimization do not happen for case #1 (which is prime candidate for such operation) like EXPLAIN (COSTS OFF) SELECT empno, depname, min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary, sum(salary) OVER (PARTITION BY depname) depsalary FROM empsalary ORDER BY depname, empno, enroll_date This happens because mutual exclusiveness of two operands (when number of window functions > 1) viz is_sorted and last activeWindow in the condition: ( !is_sorted && lnext(activeWindows, l) == NULL) For 2nd last window function, is_sorted is false and path keys get added. In next run (for last window function), is_sorted becomes true and whole optimization part is skipped. Note: Major issue that if I remove is_sorted from condition, even though path keys are added, it still do not perform optimization and works same as in master/unoptimized case. Perhaps adding path keys at last window function is not doing trick? Maybe we need to add pathkeys to all window functions which are subset of query's order by irrespective of being last or not? Case #2: For presorted columns, eg CREATE INDEX depname_idx ON empsalary(depname); SET enable_seqscan=0; EXPLAIN (COSTS OFF) SELECT empno, min(salary) OVER (PARTITION BY depname) depminsalary FROM empsalary ORDER BY depname, empno; Is this correct plan: a) QUERY PLAN --- Incremental Sort Sort Key: depname, empno Presorted Key: depname -> WindowAgg -> Index Scan using depname_idx on empsalary (5 rows) or this: b) (Akin to Optimized version) QUERY PLAN --- WindowAgg -> Incremental Sort Sort Key: depname, empno Presorted Key: depname -> Index Scan using depname_idx on empsalary (5 rows) Patched version does (a) because of is_sorted condition. If we remove both is_sorted and lnext(activeWindows, l) == NULL conditions, we get correct results in these two cases. -- Regards, Ankit Kumar Pandey
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 07/01/23 21:57, Ankit Kumar Pandey wrote: On 07/01/23 09:58, David Rowley wrote: > > The attached patch has no tests added. It's going to need some of > those. While writing test cases, I found that optimization do not happen for case #1 (which is prime candidate for such operation) like EXPLAIN (COSTS OFF) SELECT empno, depname, min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary, sum(salary) OVER (PARTITION BY depname) depsalary FROM empsalary ORDER BY depname, empno, enroll_date This happens because mutual exclusiveness of two operands (when number of window functions > 1) viz is_sorted and last activeWindow in the condition: ( !is_sorted && lnext(activeWindows, l) == NULL) For 2nd last window function, is_sorted is false and path keys get added. In next run (for last window function), is_sorted becomes true and whole optimization part is skipped. Note: Major issue that if I remove is_sorted from condition, even though path keys are added, it still do not perform optimization and works same as in master/unoptimized case. Perhaps adding path keys at last window function is not doing trick? Maybe we need to add pathkeys to all window functions which are subset of query's order by irrespective of being last or not? Case #2: For presorted columns, eg CREATE INDEX depname_idx ON empsalary(depname); SET enable_seqscan=0; EXPLAIN (COSTS OFF) SELECT empno, min(salary) OVER (PARTITION BY depname) depminsalary FROM empsalary ORDER BY depname, empno; Is this correct plan: a) QUERY PLAN --- Incremental Sort Sort Key: depname, empno Presorted Key: depname -> WindowAgg -> Index Scan using depname_idx on empsalary (5 rows) or this: b) (Akin to Optimized version) QUERY PLAN --- WindowAgg -> Incremental Sort Sort Key: depname, empno Presorted Key: depname -> Index Scan using depname_idx on empsalary (5 rows) Patched version does (a) because of is_sorted condition. If we remove both is_sorted and lnext(activeWindows, l) == NULL conditions, we get correct results in these two cases. Attached patch with test cases. For case #2, test cases still uses (a) as expected output which I don't think is right and we should revisit. Other than that, only failing case is due to issue mentioned in case #1. Thanks -- Regards, Ankit Kumar Pandey diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 000d757bdd..781916f219 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -4423,6 +4423,7 @@ create_one_window_path(PlannerInfo *root, PathTarget *window_target; ListCell *l; List *topqual = NIL; + bool has_runcondition = false; /* * Since each window clause could require a different sort order, we stack @@ -4457,6 +4458,42 @@ create_one_window_path(PlannerInfo *root, path->pathkeys, &presorted_keys); + has_runcondition |= (wc->runCondition != NIL); + + /* + * If not sorted already and the query has an ORDER BY clause, then we + * try to push the entire ORDER BY below the final WindowAgg. We + * don't do this when the query has a DISTINCT clause as the planner + * might opt to do a Hash Aggregate and spoil our efforts here. We + * also can't do this when the ORDER BY contains any WindowFuncs as we + * obviously can't sort something that's yet to be evaluated. 
We also + * don't bother with this when there is a WindowClauses with a + * runCondition as those can reduce the number of rows to sort in the + * ORDER BY and we don't want add complexity to the WindowAgg sort if + * the final ORDER BY sort is going to have far fewer rows to sort. + * For the same reason, don't bother if the query has a LIMIT clause. + */ + if (!is_sorted && lnext(activeWindows, l) == NULL && + root->parse->distinctClause == NIL && + root->parse->sortClause != NIL && + !has_runcondition && root->parse->limitCount == NULL && + !contain_window_function((Node *) get_sortgrouplist_exprs(root->parse->sortClause, + root->processed_tlist))) + { + List *orderby_pathkeys; + + orderby_pathkeys = make_pathkeys_for_sortclauses(root, + root->parse->sortClause, + root->processed_tlist); + + if (pathkeys_contained_in(window_pathkeys, orderby_pathkeys)) +window_pathkeys = orderby_pathkeys; + + is_sorted = pathkeys_count_contained_in(window_pathkeys, + path->pathkeys, + &presorted_keys); + } + /* Sort if necessary */ if (!is_sorted) { diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out index 90e89fb5b6..2b8b5c4533 100644 --- a/src/test/regress/expected/window.out +++ b/
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Hello.

A few things I have got so far:

1) It is not required to order by random() to reproduce the issue - it could
be done using queries like:

BEGIN;
SELECT omg.*
	FROM something_is_wrong_here AS omg
	ORDER BY value -- change is here
	LIMIT 1
	FOR UPDATE \gset

UPDATE something_is_wrong_here
	SET value = :value + 1
	WHERE id = :id;
COMMIT;

But for some reason it is harder to reproduce without random() in my case
(typically I need to wait for about a minute with 100 connections).

2) It is not an issue at table creation time. The issue is reproducible if
vacuum_defer_cleanup_age is set after table preparation.

3) To reproduce the issue, vacuum_defer_cleanup_age should flip the xid over
zero (be >= txid_current()). And it is stable: for example, I am unable to
reproduce with the value 733, but 734 gives the error each time.

Just a single additional txid_current() (after the data is filled) fixes the
crash... It looks like the first SELECT FOR UPDATE + UPDATE silently poisons
everything somehow.

You could use a psql script like this:

DROP TABLE IF EXISTS something_is_wrong_here;

CREATE TABLE something_is_wrong_here (id bigserial PRIMARY KEY, value numeric(15,4) DEFAULT 0 NOT NULL);

INSERT INTO something_is_wrong_here (value) (SELECT 1 from generate_series(0, 100));

SELECT txid_current() \gset
SELECT :txid_current + 1 as txid \gset

ALTER SYSTEM SET vacuum_defer_cleanup_age to :txid;
SELECT pg_reload_conf();

I have attached some scripts in case someone wants to reproduce this.

Best regards,
Michail.
Re: [RFC] Add jit deform_counter
so 7. 1. 2023 v 16:48 odesílatel Dmitry Dolgov <9erthali...@gmail.com> napsal: > > On Fri, Jan 06, 2023 at 09:42:09AM +0100, Pavel Stehule wrote: > > The explain part is working, the part of pg_stat_statements doesn't > > > > set jit_above_cost to 10; > > set jit_optimize_above_cost to 10; > > set jit_inline_above_cost to 10; > > > > (2023-01-06 09:08:59) postgres=# explain analyze select > > count(length(prosrc) > 0) from pg_proc; > > > ┌┐ > > │ QUERY PLAN > > │ > > > ╞╡ > > │ Aggregate (cost=154.10..154.11 rows=1 width=8) (actual > > time=132.320..132.321 rows=1 loops=1) > │ > > │ -> Seq Scan on pg_proc (cost=0.00..129.63 rows=3263 width=16) > (actual > > time=0.013..0.301 rows=3266 loops=1) │ > > │ Planning Time: 0.070 ms > > │ > > │ JIT: > > │ > > │ Functions: 3 > > │ > > │ Options: Inlining true, Optimization true, Expressions true, > Deforming > > true │ > > │ Timing: Generation 0.597 ms, Deforming 0.407 ms, Inlining 8.943 ms, > > Optimization 79.403 ms, Emission 43.091 ms, Total 132.034 ms │ > > │ Execution Time: 132.986 ms > > │ > > > └┘ > > (8 rows) > > > > I see the result of deforming in explain analyze, but related values in > > pg_stat_statements are 0. > > I'm not sure why, but pgss jit metrics are always nulls for explain > analyze queries. I have noticed this with surprise myself, when recently > was reviewing the lazy jit patch, but haven't yet figure out what is the > reason. Anyway, without "explain analyze" you'll get correct deforming > numbers in pgss. > It was really strange, because I tested the queries without EXPLAIN ANALYZE too, and new columns were always zero on my comp. Other jit columns were filled. But I didn't do a deeper investigation. > > Minimally, the values are assigned in wrong order > > > > + if (api_version >= PGSS_V1_11) > > + { > > + values[i++] = Float8GetDatumFast(tmp.jit_deform_time); > > + values[i++] = Int64GetDatumFast(tmp.jit_deform_count); > > + } > > (facepalm) Yep, will fix the order. > > > After reading the doc, I am confused what this metric means > > > > + > > + > > + jit_deform_count bigint > > + > > + > > + Number of times tuples have been deformed > > + > > + > > + > > + > > + > > + jit_deform_time double > > precision > > + > > + > > + Total time spent by the statement on deforming tuples, in > > milliseconds > > + > > + > > > > It is not clean so these times and these numbers are related just to the > > compilation of the deforming process, not by own deforming. > > Good point, I need to formulate this more clearly. >
Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Önder Kalacı writes:
> Attached v22.

I took a very brief look through this. I'm not too pleased with this whole
line of development TBH. It seems to me that the core design of
execReplication.c and related code is "let's build our own half-baked
executor and much-less-than-half-baked planner, because XXX". (I'm not too
sure what XXX was, really, but apparently somebody managed to convince
people that that is a sane and maintainable design.) Now this patch has
decided that it *will* use the real planner, or at least portions of it in
some cases. If we're going to do that ISTM we ought to replace all the
existing not-really-a-planner logic, but this has not done that; instead we
have a large net addition to the already very duplicative replication code,
with weird restrictions because it doesn't want to make changes to the
half-baked executor.

I think we should either live within the constraints set by this
overarching design, or else nuke execReplication.c from orbit and start
using the real planner and executor. Perhaps the foreign key enforcement
mechanisms could be a model --- although if you don't want to buy into
using SPI as well, you probably should look at Amit L's work at [1].

Also ... maybe I am missing something, but is REPLICA IDENTITY FULL sanely
defined in the first place? It looks to me that RelationFindReplTupleSeq
assumes without proof that there is a unique full-tuple match, but that is
only reasonable to assume if there is at least one unique index (and maybe
not even then, if nulls are involved). If there is a unique index, why
can't that be chosen as replica identity? If there isn't, what semantics
are we actually providing?

What I'm thinking about is that maybe REPLICA IDENTITY FULL should be
defined as "the subscriber can pick any unique index to match on, and is
allowed to fail if the table has none". Or if "fail" is a bridge too far
for you, we could fall back to the existing seqscan logic. But thumbing
through the existing indexes to find a non-expression unique index wouldn't
require invoking the full planner. Any candidate index would result in a
plan estimated to fetch just one row, so there aren't likely to be serious
cost differences.

			regards, tom lane

[1] https://www.postgresql.org/message-id/flat/CA+HiwqG5e8pk8s7+7zhr1Nc_PGyhEdM5f=phkmodk1rywxf...@mail.gmail.com
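(To make the semantic question concrete, a small sketch with a made-up
table:)

-- no unique index, so FULL is the only identity available
CREATE TABLE t (a int, b int);
ALTER TABLE t REPLICA IDENTITY FULL;
INSERT INTO t VALUES (1, NULL), (1, NULL);
-- an UPDATE of one of these rows on the publisher has no unique
-- full-tuple match on the subscriber, which is exactly the case where
-- the "unique full-tuple match" assumption falls apart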
Re: Infinite Interval
On Thu, Jan 5, 2023 at 11:30 PM jian he wrote:
>
> On Fri, Jan 6, 2023 at 6:54 AM Joseph Koshakow wrote:
>>
>> Looks like some of the error messages have changed and we
>> have some issues with parsing "+infinity" after rebasing.
>
> There is a commit 2ceea5adb02603ef52579b568ca2c5aebed87358
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ceea5adb02603ef52579b568ca2c5aebed87358
> if you pull this commit then you can do select interval '+infinity', even
> though I don't know why.

It turns out that I was just misreading the error. The test was expecting
us to fail on "+infinity" but we succeeded. I just removed that test case.

>> pgindent. Looks like some of the error messages have changed

The conditions for checking valid addition/subtraction between infinite
values were missing some cases, which explains the change in error
messages. I've updated the logic and removed duplicate checks.

I removed the extract/date_part tests since they were duplicated in a test
above. I also converted the DO command tests to using SQL with joins so it
more closely matches the existing tests.

I've updated the extract/date_part logic for infinite intervals. Fields
that are monotonically increasing should return +/-infinity and all others
should return NULL. For intervals, the fields are the same as for
timestamps plus the hour and day fields, since those don't overflow into
the next highest field.

I think this patch is just about ready for review, except for the following
two questions:
1. Should finite checks on intervals only look at months or all three fields?
2. Should we make the error messages for adding/subtracting infinite values
more generic or leave them as is?

My opinions are
1. We should only look at months.
2. We should make the errors more generic.

Anyone else have any thoughts?

- Joe
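(To illustrate the proposed extract behavior, a sketch of the intended
results under the rule above; these are not outputs from a committed build:)

SELECT extract(year  FROM interval 'infinity');    -- Infinity (monotonically increasing)
SELECT extract(hour  FROM interval 'infinity');    -- Infinity (does not overflow into days)
SELECT extract(month FROM interval '-infinity');   -- NULL     (cyclic field)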
Re: Infinite Interval
On Sat, Jan 7, 2023 at 3:04 PM Joseph Koshakow wrote: > > On Thu, Jan 5, 2023 at 11:30 PM jian he wrote: > > > > > > > > On Fri, Jan 6, 2023 at 6:54 AM Joseph Koshakow wrote: > >> > >> Looks like some of the error messages have changed and we > >> have some issues with parsing "+infinity" after rebasing. > > > > > > There is a commit 2ceea5adb02603ef52579b568ca2c5aebed87358 > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ceea5adb02603ef52579b568ca2c5aebed87358 > > if you pull this commit then you can do select interval '+infinity', even > > though I don't know why. > > It turns out that I was just misreading the error. The test was > expecting us to fail on "+infinity" but we succeeded. I just removed > that test case. > > >> pgindent. Looks like some of the error messages have changed > > The conditions for checking valid addition/subtraction between infinite > values were missing some cases which explains the change in error > messages. I've updated the logic and removed duplicate checks. > > I removed the extract/date_part tests since they were duplicated in a > test above. I also converted the DO command tests to using SQL with > joins so it more closely matches the existing tests. > > I've updated the extract/date_part logic for infinite intervals. Fields > that are monotonically increasing should return +/-infinity and all > others should return NULL. For Intervals, the fields are the same as > timestamps plus the hour and day fields since those don't overflow into > the next highest field. > > I think this patch is just about ready for review, except for the > following two questions: > 1. Should finite checks on intervals only look at months or all three > fields? > 2. Should we make the error messages for adding/subtracting infinite > values more generic or leave them as is? > > My opinions are > 1. We should only look at months. > 2. We should make the errors more generic. > > Anyone else have any thoughts? > > - Joe Oops I forgot the actual patch. Please see attached. From 4ea7c98d47dcbff1313a5013572cc79839e4417e Mon Sep 17 00:00:00 2001 From: Joseph Koshakow Date: Sat, 17 Dec 2022 14:21:26 -0500 Subject: [PATCH] This is WIP. TODOs 1. Should we just use the months field to test for infinity? 2. Should the error messages for adding different sign infinties be "interval out of range"? 
Ashutosh Bapat and Joe Koshakow and Jian He --- doc/src/sgml/datatype.sgml | 2 +- doc/src/sgml/func.sgml | 5 +- src/backend/utils/adt/date.c | 20 ++ src/backend/utils/adt/datetime.c | 14 +- src/backend/utils/adt/timestamp.c | 448 src/include/datatype/timestamp.h | 21 ++ src/test/regress/expected/horology.out | 6 +- src/test/regress/expected/interval.out | 466 +++-- src/test/regress/sql/horology.sql | 6 +- src/test/regress/sql/interval.sql | 130 ++- 10 files changed, 1002 insertions(+), 116 deletions(-) diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index fdffba4442..2bcf959f70 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -2316,7 +2316,7 @@ TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02' infinity - date, timestamp + date, timestamp, interval later than all other time stamps diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 3bf8d021c3..7ddf76da4a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9369,7 +9369,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); boolean - Test for finite interval (currently always true) + Test for finite interval (not +/-infinity) isfinite(interval '4 hours') @@ -10256,7 +10256,8 @@ SELECT EXTRACT(YEAR FROM TIMESTAMP '2001-02-16 20:38:40'); When the input value is +/-Infinity, extract returns +/-Infinity for monotonically-increasing fields (epoch, julian, year, isoyear, - decade, century, and millennium). + decade, century, and millennium + for all types and hour and day just for interval). For other fields, NULL is returned. PostgreSQL versions before 9.6 returned zero for all cases of infinite input. diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c index 99171d9c92..8334b9053f 100644 --- a/src/backend/utils/adt/date.c +++ b/src/backend/utils/adt/date.c @@ -2067,6 +2067,11 @@ time_pl_interval(PG_FUNCTION_ARGS) Interval *span = PG_GETARG_INTERVAL_P(1); TimeADT result; + if (INTERVAL_NOT_FINITE(span)) + ereport(ERROR, +(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), + errmsg("cannot add infinite interval to time"))); + result = time + span->time; result -= result / USECS_PER_DAY * USECS_PER_DAY; if (result < INT64CONST(0)) @@ -2085,6 +2090,11 @@ time_mi_interv
Re: How to define template types in PostgreSQL
Hi! I'd suggest creating an API that defines a general function set with variable input, and calling implementation defined on the input type? On Sat, Jan 7, 2023 at 12:32 PM Esteban Zimanyi wrote: > Dear all > > MobilityDB (https://github.com/MobilityDB/MobilityDB) defines at the C > level four template types: Set, Span, SpanSet, and Temporal. The type Set > is akin to PostgreSQL's ArrayType restricted to one dimension, but enforces > the constraint that sets do not have duplicates, the types Span and SpanSet > are akin to PostgreSQL's RangeType and MultirangeType but enforce the > constraints that span types are of fixed length and that empty spans and > infinite bounds are not allowed, and the typeTemporal is used to > manipulate time-varying values. > > These template types need to be instantiated at the SQL level with base > types (int, bigint, float, timestamptz, text, ...) and because of this, > MobilityDB needs to define numerous SQL functions that all call the same > function in C. Taking as example the Set type, we need to define, e.g., > > CREATE FUNCTION intset_eq(intset, intset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_eq' ... > CREATE FUNCTION bigintset_eq(bigintset, bigintset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_eq' ... > CREATE FUNCTION floatset_eq(floatset, floatset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_eq' ... > CREATE FUNCTION textset_eq(textset, textset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_eq' ... > ... > > CREATE FUNCTION intset_ne(intset, intset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_ne' ... > CREATE FUNCTION bigintset_ne(bigintset, bigintset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_ne' ... > CREATE FUNCTION floatset_ne(floatset, floatset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_ne' ... > CREATE FUNCTION textset_ne(textset, textset) RETURNS bool AS > 'MODULE_PATHNAME', 'Set_ne' ... > ... > > In the case of arrays, ranges, and multiranges, PostgreSQL avoids this > redundancy using pseudo-types such as anyarray, anyrange, anymultirange, ... > > Is there a possibility that we can also define pseudo types such as > anyset, anyspan, anyspanset, anytemporal, ? > > This will considerably reduce the number of SQL functions to define. > Currently, given the high number of functions in MobilityDB, creating the > extension takes a lng time > > Regards > > Esteban > > -- Regards, Nikita Malakhov Postgres Professional https://postgrespro.ru/
delayed initialization in worktable scan
Hi, while looking at fixing [1], I again came across the fact that we don't initialize the projection etc. during ExecInitWorkTableScan(), but do so during the first call to ExecWorkTableScan(). This is explained with the following comment: /* * On the first call, find the ancestor RecursiveUnion's state via the * Param slot reserved for it. (We can't do this during node init because * there are corner cases where we'll get the init call before the * RecursiveUnion does.) */ I remember being confused about this before. I think I dug out the relevant commit back then, 0a7abcd4c983 [2], but didn't end up finding the relevant thread. This time I did: [3]. Basically the issue is that in queries with two CTEs we can, at least currently, end up with a WorkTable scan on a CTE we've not yet initialized, due to the WorkTable scan of one CTE appearing in the other. Thus ExecInitRecursiveUnion() hasn't yet set up the param we use in nodeWorktablescan.c to find the tuplestore and the type of the tuples it contains. I don't think this is a huge issue, but it has surprised me multiple times, so I'd like to expand the comment. At least for me it's hard to get from "corner cases" to one worktable scan appearing in another CTE and to mutually recursive CTEs. I did go down the rabbit hole of trying to avoid this issue because it "feels wrong" to not know the return type of an executor node during initialization. The easiest approach I could see was to add the "scan type" to WorkTableScan (vs the target list, which includes the projection). Most of the other scan nodes get the scan type from the underlying relation etc., and thus don't need it in the plan node ([4]). That way the projection can be built normally during ExecInitWorkTableScan(), but we still need to defer the lookup of node->rustate. But that bothers me a lot less. I'm not sure it's worth changing this, or whether that'd be the right approach. I'm also wondering if Tom's first instinct from back then, making this an error, would have been the right call. But that ship has sailed. To be clear, this "issue" is largely independent of [1] / not a blocker whatsoever. Partially I wrote this to have an email to find the next time I encounter this. Greetings, Andres Freund [1] https://postgr.es/m/17737-55a063e3e6c41b4f%40postgresql.org [2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0a7abcd4c983 [3] https://postgr.es/m/87zllbxngg.fsf%40oxford.xeocode.com [4] I don't particularly like that: we spend a lot of time converting memory-inefficient target lists into tupledescs during executor initialization, even though we rely on the tuple types not to be materially different anyway. But that's a separate issue.
Re: pgsql: Delay commit status checks until freezing executes.
Hi, On 2023-01-06 11:16:00 -0800, Peter Geoghegan wrote: > On Tue, Jan 3, 2023 at 10:52 PM Peter Geoghegan wrote: > > > And it'd make sense to have > > > the explanation of why TransactionIdDidAbort() isn't the same as > > > !TransactionIdDidCommit(), even for !TransactionIdIsInProgress() xacts, > > > near > > > the explanation for doing TransactionIdIsInProgress() first. > > > > I think that we should definitely have a comment directly over > > TransactionIdDidAbort(). Though I wouldn't mind reorganizing these > > other comments, or making the comment over TransactionIdDidAbort() > > mostly just point to the other comments. > > What do you think of the attached patch, which revises comments over > TransactionIdDidAbort, and adds something about it to the top of > heapam_visibility.c? Mostly looks good to me. I think it'd be good to add a reference to the heapam_visibility.c comment to the top of transam.c (or move it). > * Assumes transaction identifier is valid and exists in clog. > + * > + * Not all transactions that must be treated as aborted will be > + * explicitly marked as such in clog. Transactions that were in > + * progress during a crash are never reported as aborted by us. > */ > bool /* true if given > transaction aborted */ > TransactionIdDidAbort(TransactionId transactionId) I think it's currently very likely to be true, but I'd weaken the "never" a bit nonetheless. I think it'd also be good to point to what to do instead. How about: Note that TransactionIdDidAbort() returns true only for explicitly aborted transactions, as transactions implicitly aborted due to a crash will commonly still appear to be in-progress in the clog. Most of the time TransactionIdDidCommit(), with a preceding TransactionIdIsInProgress() check, should be used instead of TransactionIdDidAbort(). Greetings, Andres Freund
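A minimal sketch of the calling pattern recommended above, using the existing TransactionIdIsInProgress()/TransactionIdDidCommit() interfaces; the helper name is invented for illustration.

#include "postgres.h"
#include "access/transam.h"
#include "storage/procarray.h"

/*
 * Sketch of the suggested pattern: check in-progress first, then committed,
 * and treat everything else as aborted.  This also covers transactions
 * implicitly aborted by a crash, which TransactionIdDidAbort() alone would
 * not report as aborted.
 */
static bool
XidShouldBeTreatedAsAborted(TransactionId xid)
{
	if (TransactionIdIsInProgress(xid))
		return false;			/* still running, can't decide yet */

	if (TransactionIdDidCommit(xid))
		return false;			/* definitely committed */

	/* neither in progress nor committed: treat as aborted */
	return true;
}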
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Sun, 8 Jan 2023 at 01:52, Ankit Kumar Pandey wrote: > I am just wondering though, why can we not do top-N sort > in optimized version if we include limit clause? Is top-N sort is > limited to non strict sorting or > cases last operation before limit is sort? Maybe the sort bound can be pushed down. You'd need to adjust ExecSetTupleBound() so that it pushes the bound through WindowAggState. David
Re: delayed initialization in worktable scan
Andres Freund writes: > Basically the issue is that in queries with two CTEs we can, at least > currently, end up with a WorkTable scans on a CTE we've not yet initialized, > due to the WorkTable scan of one CTE appearing in the other. Thus > ExecInitRecursiveUnion() hasn't yet set up the param we use in > nodeWorktablescan.c to find the tuplestore and the type of the tuples it > contains. > I don't think this is a huge issue, but it surprised multiple times, so I'd > like to expand the comment. At least for me it's hard to get from "corner > cases" to one worktable scan appearing in another CTE and to mutally recursive > CTEs. Sure. I think I wrote the comment the way I did because I hadn't done enough analysis to be sure that mutually recursive CTEs was the only way to trigger it. But as long as you say "for example" or words to that effect, I don't have a problem with giving an example case here. > I did go down the rabbit hole of trying to avoid this issue because it "feels > wrong" to not know the return type of an executor node during initialization. > ... > I'm not sure it's worth changing this. Or whether that'd be the right > approach. I wouldn't bother unless we find a compelling reason to need the info earlier. regards, tom lane
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
(your email client still seems broken) On Sun, 8 Jan 2023 at 05:27, Ankit Kumar Pandey wrote: > > > While writing test cases, I found that optimization do not happen for > case #1 > > (which is prime candidate for such operation) like > > EXPLAIN (COSTS OFF) > SELECT empno, > depname, > min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary, > sum(salary) OVER (PARTITION BY depname) depsalary > FROM empsalary > ORDER BY depname, empno, enroll_date > > This happens because mutual exclusiveness of two operands (when number > of window functions > 1) viz > > is_sorted and last activeWindow in the condition: > > ( !is_sorted && lnext(activeWindows, l) == NULL) > > For 2nd last window function, is_sorted is false and path keys get added. > > In next run (for last window function), is_sorted becomes true and whole > optimization > > part is skipped. > > Note: Major issue that if I remove is_sorted from condition, even though > > path keys are added, it still do not perform optimization and works same > as in master/unoptimized case. > > Perhaps adding path keys at last window function is not doing trick? > Maybe we need to add pathkeys > > to all window functions which are subset of query's order by > irrespective of being last or not? You might need to have another loop before the foreach loop that loops backwards through the WindowClauses and remembers the index of the WindowClause which has pathkeys contained in the query's ORDER BY pathkeys then apply the optimisation from that point in the main foreach loop. Also, if the condition within the foreach loop which checks when we want to apply this optimisation is going to be run > 1 time, then you should probably have boolean variable that's set before the loop which saves if we're going to try to apply the optimisation. That'll save from having to check things like if the query has a LIMIT clause multiple times. > Case #2: > > For presorted columns, eg > > CREATE INDEX depname_idx ON empsalary(depname); > SET enable_seqscan=0; > EXPLAIN (COSTS OFF) > SELECT empno, > min(salary) OVER (PARTITION BY depname) depminsalary > FROM empsalary > ORDER BY depname, empno; > > Is this correct plan: > > a) > >QUERY PLAN > --- > Incremental Sort > Sort Key: depname, empno > Presorted Key: depname > -> WindowAgg > -> Index Scan using depname_idx on empsalary > (5 rows) > > or this: > > b) (Akin to Optimized version) > >QUERY PLAN > --- > WindowAgg > -> Incremental Sort > Sort Key: depname, empno > Presorted Key: depname > -> Index Scan using depname_idx on empsalary > (5 rows) > > Patched version does (a) because of is_sorted condition. a) looks like the best plan to me. What's the point of pushing the sort below the WindowAgg in this case? The point of this optimisation is to reduce the number of sorts not to push them as deep into the plan as possible. We should only be pushing them down when it can reduce the number of sorts. There's no reduction in the number of sorts in the above plan. David
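A rough sketch of the pre-pass described above, written as a standalone helper for readability; in practice this logic would live in planner.c next to the existing foreach loop. make_pathkeys_for_window(), pathkeys_contained_in() and the WindowClause runCondition field are the existing ones; the LIMIT check and the exact containment test are assumptions about how the optimisation would be gated, not settled design.

#include "postgres.h"
#include "nodes/pathnodes.h"
#include "optimizer/paths.h"

/*
 * Hypothetical pre-pass: find the index of the WindowClause from which the
 * query's ORDER BY sort could be pushed down.  Returns -1 if the
 * optimisation should not be attempted at all.
 */
static int
find_sort_pushdown_start(PlannerInfo *root, List *activeWindows)
{
	Query	   *parse = root->parse;
	int			start_index = -1;
	ListCell   *lc;

	/* assumed gating: nothing to push down, or a LIMIT is present */
	if (root->sort_pathkeys == NIL || parse->limitCount != NULL)
		return -1;

	/* a runCondition anywhere may filter rows, so don't push the sort down */
	foreach(lc, activeWindows)
	{
		WindowClause *wc = lfirst_node(WindowClause, lc);

		if (wc->runCondition != NIL)
			return -1;
	}

	/* walk backwards while the window's pathkeys are contained in ORDER BY */
	for (int i = list_length(activeWindows) - 1; i >= 0; i--)
	{
		WindowClause *wc = list_nth_node(WindowClause, activeWindows, i);
		List	   *window_pathkeys;

		/* make_pathkeys_for_window() is the existing static helper in planner.c */
		window_pathkeys = make_pathkeys_for_window(root, wc,
												   root->processed_tlist);

		if (!pathkeys_contained_in(window_pathkeys, root->sort_pathkeys))
			break;

		start_index = i;
	}

	return start_index;
}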
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On Sun, 8 Jan 2023 at 05:45, Ankit Kumar Pandey wrote: > Attached patch with test cases. I can look at this in a bit more detail if you find a way to fix the case you mentioned earlier. i.e, push the sort down to the deepest WindowAgg that has pathkeys contained in the query's ORDER BY pathkeys. EXPLAIN (COSTS OFF) SELECT empno, depname, min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary, sum(salary) OVER (PARTITION BY depname) depsalary FROM empsalary ORDER BY depname, empno, enroll_date; QUERY PLAN --- Incremental Sort Sort Key: depname, empno, enroll_date Presorted Key: depname, empno -> WindowAgg -> WindowAgg -> Sort Sort Key: depname, empno -> Seq Scan on empsalary (8 rows) You'll also need to pay attention to how the has_runcondition is set. If you start pushing before looking at all the WindowClauses then you won't know if some later WindowClause has a runCondition. Adding an additional backwards foreach loop should allow you to do all the required prechecks and find the index of the WindowClause which you should start pushing from. David
Re: Infinite Interval
On Sat, Jan 7, 2023 at 3:05 PM Joseph Koshakow wrote: > > On Sat, Jan 7, 2023 at 3:04 PM Joseph Koshakow wrote: > > > > I think this patch is just about ready for review, except for the > > following two questions: > > 1. Should finite checks on intervals only look at months or all three > > fields? > > 2. Should we make the error messages for adding/subtracting infinite > > values more generic or leave them as is? > > > > My opinions are > > 1. We should only look at months. > > 2. We should make the errors more generic. > > > > Anyone else have any thoughts? Here's a patch with the more generic error messages. - Joe From 6ed93bc20db57cea2d692e9288d97b66f4a526dc Mon Sep 17 00:00:00 2001 From: Joseph Koshakow Date: Sat, 17 Dec 2022 14:21:26 -0500 Subject: [PATCH] This is WIP. TODOs 1. Should we just use the months field to test for infinity? Ashutosh Bapat and Joe Koshakow and Jian He --- doc/src/sgml/datatype.sgml | 2 +- doc/src/sgml/func.sgml | 5 +- src/backend/utils/adt/date.c | 20 ++ src/backend/utils/adt/datetime.c | 14 +- src/backend/utils/adt/timestamp.c | 448 src/include/datatype/timestamp.h | 21 ++ src/test/regress/expected/horology.out | 6 +- src/test/regress/expected/interval.out | 466 +++-- src/test/regress/sql/horology.sql | 6 +- src/test/regress/sql/interval.sql | 130 ++- 10 files changed, 1002 insertions(+), 116 deletions(-) diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml index fdffba4442..2bcf959f70 100644 --- a/doc/src/sgml/datatype.sgml +++ b/doc/src/sgml/datatype.sgml @@ -2316,7 +2316,7 @@ TIMESTAMP WITH TIME ZONE '2004-10-19 10:23:54+02' infinity - date, timestamp + date, timestamp, interval later than all other time stamps diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 3bf8d021c3..7ddf76da4a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -9369,7 +9369,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}'); boolean - Test for finite interval (currently always true) + Test for finite interval (not +/-infinity) isfinite(interval '4 hours') @@ -10256,7 +10256,8 @@ SELECT EXTRACT(YEAR FROM TIMESTAMP '2001-02-16 20:38:40'); When the input value is +/-Infinity, extract returns +/-Infinity for monotonically-increasing fields (epoch, julian, year, isoyear, - decade, century, and millennium). + decade, century, and millennium + for all types and hour and day just for interval). For other fields, NULL is returned. PostgreSQL versions before 9.6 returned zero for all cases of infinite input. 
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c index 99171d9c92..8334b9053f 100644 --- a/src/backend/utils/adt/date.c +++ b/src/backend/utils/adt/date.c @@ -2067,6 +2067,11 @@ time_pl_interval(PG_FUNCTION_ARGS) Interval *span = PG_GETARG_INTERVAL_P(1); TimeADT result; + if (INTERVAL_NOT_FINITE(span)) + ereport(ERROR, +(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), + errmsg("cannot add infinite interval to time"))); + result = time + span->time; result -= result / USECS_PER_DAY * USECS_PER_DAY; if (result < INT64CONST(0)) @@ -2085,6 +2090,11 @@ time_mi_interval(PG_FUNCTION_ARGS) Interval *span = PG_GETARG_INTERVAL_P(1); TimeADT result; + if (INTERVAL_NOT_FINITE(span)) + ereport(ERROR, +(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), + errmsg("cannot subtract infinite interval from time"))); + result = time - span->time; result -= result / USECS_PER_DAY * USECS_PER_DAY; if (result < INT64CONST(0)) @@ -2599,6 +2609,11 @@ timetz_pl_interval(PG_FUNCTION_ARGS) Interval *span = PG_GETARG_INTERVAL_P(1); TimeTzADT *result; + if (INTERVAL_NOT_FINITE(span)) + ereport(ERROR, +(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), + errmsg("cannot add infinite interval to time"))); + result = (TimeTzADT *) palloc(sizeof(TimeTzADT)); result->time = time->time + span->time; @@ -2621,6 +2636,11 @@ timetz_mi_interval(PG_FUNCTION_ARGS) Interval *span = PG_GETARG_INTERVAL_P(1); TimeTzADT *result; + if (INTERVAL_NOT_FINITE(span)) + ereport(ERROR, +(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE), + errmsg("cannot subtract infinite interval from time"))); + result = (TimeTzADT *) palloc(sizeof(TimeTzADT)); result->time = time->time - span->time; diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c index d166613895..4192e7a74b 100644 --- a/src/backend/utils/adt/datetime.c +++ b/src/backend/utils/adt/datetime.c @@ -70,7 +70,7 @@ static bool DetermineTimeZoneAbbrevOffsetInternal(pg_time_t t, const char *abbr, pg_tz *tzp, int *offset, int
Re: Using WaitEventSet in the postmaster
Hi, On 2023-01-07 18:08:11 +1300, Thomas Munro wrote: > On Sat, Jan 7, 2023 at 12:25 PM Andres Freund wrote: > > On 2023-01-07 11:08:36 +1300, Thomas Munro wrote: > > > 3. Is it OK to clobber the shared pending flag for SIGQUIT, SIGTERM, > > > SIGINT? If you send all of these extremely rapidly, it's > > > indeterminate which one will be seen by handle_shutdown_request(). > > > > That doesn't seem optimal. I'm mostly worried that we can end up > > downgrading a > > shutdown request. > > I was contemplating whether I needed to do some more push-ups to > prefer the first delivered signal (instead of the last), but you're > saying that it would be enough to prefer the fastest shutdown type, in > cases where more than one signal was handled between server loops. > WFM. I don't see any need for such an order requirement - in case of receiving a "less severe" shutdown request first, we'd process the more severe one soon after. There's nothing to be gained by trying to follow the order of the incoming signals. Afaict that's also the behaviour today. pmdie() has blocks like this: case SIGTERM: /* * Smart Shutdown: * * Wait for children to end their work, then shut down. */ if (Shutdown >= SmartShutdown) break; I briefly wondered about deduplicating that code, now that we know the requested mode ahead of the switch. So it could be something like: /* don't interrupt an already in-progress shutdown in a more "severe" mode */ if (mode < Shutdown) return; but that's again probably something for later. > > > 5. Is the signal mask being correctly handled during forking? I > > > think so: I decided to push the masking logic directly into the > > > routine that forks, to make it easy to verify that all paths set the > > > mask the way we want. > > > > Hm. If I understand correctly, you used sigprocmask() directly (vs > > PG_SETMASK()) in fork_process() because you want the old mask? But why do we > > restore the prior mask, instead of just using PG_SETMASK(&UnBlockSig); as we > > still do in a bunch of places in the postmaster? > > It makes zero difference in practice but I think it's a nicer way to > code it because it doesn't make an unnecessary assumption about what > the signal mask was on entry. Heh, to me doing something slightly different than in other places actually seems to make it a bit less nice :). There's also some benefit in the certainty of knowing what the mask will be. But it doesn't matter much. > > > 6. Is the naming and style OK? Better ideas welcome, but basically I > > > tried to avoid all unnecessary refactoring and changes, so no real > > > logic moves around, and the changes are pretty close to "mechanical". > > > One bikeshed decision was what to call the {handle,process}_XXX > > > functions and associated flags. > > > > I wonder if it'd be good to have a _pm_ in the name. > > I dunno about this one, it's all static stuff in a file called > postmaster.c and one (now) already has pm in it (see below). I guess stuff like signal handlers and their state somehow seems more global to me than their C linkage type suggests. Hence the desire to be a bit more "namespaced" in their naming. I do find it somewhat annoying when reasonably important global variables aren't uniquely named when using a debugger... But again, this isn't anything that should hold up the patch. > > Is there any reason to use MAXLISTEN here? We know how many sockets we're > listening on by this point, I think? No idea if the overhead matters > anywhere, > but ... > > Fixed.
I was thinking of determining the number once, in PostmasterMain(). But that's perhaps better done as a separate change... WFM. > > Hm. This is preexisting behaviour, but now it seems somewhat odd that we > > might > > end up happily forking a backend for each socket without checking signals > > inbetween. Forking might take a while, and if a signal arrived since the > > WaitEventSetWait() we'll not react to it. > > Right, so if you have 64 server sockets and they all have clients > waiting on their listen queue, we'll service one connection from each > before getting back to checking for pmsignals or shutdown, and that's > unchanged in this patch. (FWIW I also noticed that problem while > experimenting with the idea that you could handle multiple clients in > one go on OSes that report the listen queue depth size along with > WL_SOCKET_ACCEPT events, but didn't pursue it...) > > I guess we could check every time through the nevents loop. I may > look into that in a later patch, but I prefer to keep the same policy > in this patch. Makes sense. This was mainly me trying to make sure we're not changing subtle stuff accidentally (and trying to understand how things work currently, as a prerequisite). > > > static void
Re: drop postmaster symlink
Hello, This is a review of Peter's 2 patches. I see only 1 small problem. +++ Looking at the documentation, a "postmaster" in the glossary is defined as the controlling process. This works; it needs to be called something. There is still a postmaster.pid (etc.) in the data directory. The word "postmaster" (case insensitive) shows up 84 times in the documentation. I looked at all of these. I see a possible problem at line 1,412 of runtime.sgml, the "Linux Memory Overcommit" section. It talks about the postmaster's startup script invoking the postmaster. It might, possibly, be better to say that "postgres" is invoked, that being the name of the invoked program. There's a similar usage at line 1,416. On the other hand, the existing text makes it quite clear what is going on and there's a lot to be said for consistently using the word "postmaster". I mention this only in case somebody deems it significant. Perhaps there's a way to use markup, identifying "postgres" as a program with a name in the file system, to make things more clear. Most likely, nobody will care. In doc/src/sgml/ref/allfiles.sgml at line 222 there is an ENTITY defined which references the deleted postmaster.sgml file. Since I did a maintainer-clean, and the patch deletes the postmaster.sgml file, and I don't see any references to the entity in the docs, I believe that this line should be removed. (In other words, I don't think this file is automatically maintained.) After applying the patches the documentation seems to build just fine. I have not run contrib/start-scripts/{freebsd,linux}, but the patch looks fine and the result of applying the patch looks fine, and the patch is a one-line simple replacement of "postmaster" with "postgres" so I expect no problems. The modification to src/port/path.c is in a comment, so it will surely not affect anything at runtime. And the changed comment looks right to me. There's thousands of lines of comments in the code where "postmaster" (case insensitive) shows up after applying the patches. I've not reviewed any of these. Or looked for odd variable names, or references in the code to the postmaster binary - by name, etc. I'm not sure where it would make sense to look for such problems. Aside from the "allfiles.sgml" problem, I see no reason why these 2 patches should not be applied. As mentioned by others, I don't have strong feelings about getting rid of the postmaster symlink. But I did pick this patch to review because I remember, once upon a time, being slightly confused by a "postmaster" in the process list. Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
FYI: 2022-10 thorntail failures from coreutils FICLONE
thorntail failed some recovery tests in 2022-10: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-11-02%2004%3A25%3A43 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-31%2013%3A32%3A42 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-29%2017%3A48%3A15 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-24%2013%3A48%3A16 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-24%2010%3A08%3A30 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-21%2000%3A58%3A14 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-16%2000%3A08%3A17 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-15%2020%3A48%3A18 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-14%2020%3A13%3A35 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2022-10-14%2006%3A58%3A15 thorntail has long seen fsync failures, due to a driver bug[1]. On 2022-09-28, its OS updated coreutils from 8.32-4.1 to 9.1-1. That brought in "cp" use of the FICLONE ioctl. FICLONE internally syncs its source file, reporting EIO if that fails. A bug[2] in "cp" allowed it to silently make a defective copy instead of reporting that EIO. Since the recovery suite archive_command uses "cp", these test failures emerged. The kernel may change[3] to make such userspace bugs harder to add. For thorntail, my workaround was to replace "cp" with a wrapper doing 'exec /usr/bin/cp --reflink=never "$@"'. I might eventually propose the ability to disable FICLONE calls in PostgreSQL code. So far, those calls (in pg_upgrade) have not caused thorntail failures. [1] https://postgr.es/m/flat/20210508001418.ga3076...@rfd.leadboat.com [2] https://github.com/coreutils/coreutils/commit/f6c93f334ef5dbc5c68c299785565ec7b9ba5180 [3] https://lore.kernel.org/linux-xfs/20221108172436.ga3613...@rfd.leadboat.com
Re: drop postmaster symlink
"Karl O. Pinc" writes: > This is a review of Peter's 2 patches. I see only 1 small problem. > Looking at the documentation, a "postmaster" in the glossary is > defined as the controlling process. This works; it needs to be called > something. There is still a postmaster.pid (etc.) in the data > directory. > The word "postmaster" (case insensitive) shows up 84 times in the > documentation. I looked at all of these. Hmm ... I thought this patch was about getting rid of the admittedly-obsolete installed symlink. If it's trying to get rid of the "postmaster" terminology for our parent process, I'm very strongly against that, either as regards to code or documentation. regards, tom lane
Re: pgsql: Delay commit status checks until freezing executes.
On Sat, Jan 7, 2023 at 1:47 PM Andres Freund wrote: > > What do you think of the attached patch, which revises comments over > > TransactionIdDidAbort, and adds something about it to the top of > > heapam_visbility.c? > > Mostly looks good to me. I think it'd be good to add a reference to the > heapam_visbility.c? comment to the top of transam.c (or move it). Makes sense. > I think it's currently very likely to be true, but I'd weaken the "never" a > bit nonetheless. I think it'd also be good to point to what to do instead. How > about: > Note that TransactionIdDidAbort() returns true only for explicitly aborted > transactions, as transactions implicitly aborted due to a crash will > commonly still appear to be in-progress in the clog. Most of the time > TransactionIdDidCommit(), with a preceding TransactionIdIsInProgress() > check, should be used instead of TransactionIdDidAbort(). That does seem better. Do we need to do anything about this to the "pg_xact and pg_subtrans" section of the transam README? Also, does amcheck's get_xid_status() need a reference to these rules? FWIW, I found an existing comment about this rule in the call to TransactionIdAbortTree() from RecordTransactionAbort() -- which took me quite a while to find. So you might have been remembering that comment before. -- Peter Geoghegan
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Hi, Thomas, CCing you because of the 64bit xid representation aspect. On 2023-01-06 00:39:44 +0300, Michail Nikolaev wrote: > I apologize for the direct ping, but I think your snapshot scalability > work in PG14 could be related to the issue. Good call! > The TransactionIdRetreatedBy implementation looks correct... But with > txid_current=212195 I see errors like "could not access status of > transaction 58643736"... > So, maybe vacuum_defer_cleanup_age just highlights some special case > (something with "previous" xids on the left side of zero?) I think the bug is close to TransactionIdRetreatedBy(). Arguably in FullXidRelativeTo(). Or a more fundamental data representation issue with 64bit xids. To explain, here's a trace to the bottom of GetSnapshotData() leading to the problem: In the case I'm looking at here we end up with 720: oldestfxid = FullXidRelativeTo(latest_completed, oldestxid); and xmin is 255271, both correct. Then in TransactionIdRetreatedBy: /* apply vacuum_defer_cleanup_age */ def_vis_xid_data = TransactionIdRetreatedBy(xmin, vacuum_defer_cleanup_age); things start to be iffy. Because we retreat by vacuum_defer_cleanup_age, which was set to txid_current() in scheme.sql, and the xmin above is that xid, TransactionIdRetreatedBy() first ends up with 0. It then backtracks further to the highest 32bit xid (i.e. 4294967295). So far so good. We could obviously end up with values further in the past as well, if vacuum_defer_cleanup_age were larger. Things start to seriously go off the rails when we convert that 32bit xid to 64 bit with: def_vis_fxid = FullXidRelativeTo(latest_completed, def_vis_xid); which returns {value = 18446744073709551615}, which is 0 - 1 in 64bit arithmetic. However, as 64bit xids are not supposed to wrap around, we're in trouble - it's an xid *very* far into the future. That allows things to be pruned that shouldn't be, because everything is below that. I don't quite know how to best fix this. 4294967295 is the correct result for TransactionIdRetreatedBy() in this case, and it'd not cause problems for FullXidRelativeTo() if we actually had wrapped around the xid counter before (since then we'd just represent it as an fxid "of the proper epoch"). Basically, in contrast to 32bit xids, 64bit xids cannot represent xids from before the start of the universe, whereas with the modulo arithmetic of 32bit that's not a real problem. It's probably not too hard to fix specifically in this one place - we could just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but it strikes me as a somewhat larger issue for the 64bit xid infrastructure. I suspect this might not be the only place running into problems with such "before the universe" xids. For a bit I was thinking that we should introduce the notion that a FullTransactionId can be from the past. Specifically that when the upper 32bit are all set, we treat the lower 32bit as a value from before xid 0 using the normal 32bit xid arithmetic. But it sucks to add overhead for that everywhere. It might be a bit more palatable to designate one individual value, e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far before the start of the universe an xid points to... FullXidRelativeTo() did assert that the input 32bit xid is in a reasonable range, but unfortunately didn't do a similar check for the 64bit case. Greetings, Andres Freund
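To see the arithmetic concretely, here is a tiny standalone C model of the conversion described above (a simplification for illustration, not the real FullXidRelativeTo() code) that reproduces the {value = 18446744073709551615} result from these numbers.

#include <stdint.h>
#include <stdio.h>

/*
 * Toy model of FullXidRelativeTo(): map a 32bit xid onto a 64bit xid by
 * choosing an epoch relative to a 64bit reference xid.
 */
static uint64_t
full_xid_relative_to(uint64_t rel, uint32_t xid)
{
	uint32_t	rel_epoch = (uint32_t) (rel >> 32);
	uint32_t	rel_xid = (uint32_t) rel;
	uint32_t	epoch = rel_epoch;

	/* xid looks "newer" than rel's low half, so assume the previous epoch */
	if (xid > rel_xid)
		epoch--;				/* underflows to 0xFFFFFFFF when rel_epoch is 0 */

	return ((uint64_t) epoch << 32) | xid;
}

int
main(void)
{
	uint64_t	latest_completed = 255271;	/* epoch 0, xid 255271 */
	uint32_t	retreated = (uint32_t) 255271 - 255272;	/* wraps to 4294967295 */

	printf("retreated 32bit xid: %u\n", retreated);
	printf("converted to 64bit:  %llu\n",
		   (unsigned long long) full_xid_relative_to(latest_completed, retreated));
	return 0;
}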
Re: drop postmaster symlink
On 1/7/23 18:38, Tom Lane wrote: "Karl O. Pinc" writes: This is a review of Peter's 2 patches. I see only 1 small problem. Looking at the documentation, a "postmaster" in the glossary is defined as the controlling process. This works; it needs to be called something. There is still a postmaster.pid (etc.) in the data directory. The word "postmaster" (case insensitive) shows up 84 times in the documentation. I looked at all of these. Hmm ... I thought this patch was about getting rid of the admittedly-obsolete installed symlink. That was my understanding too. If it's trying to get rid of the "postmaster" terminology for our parent process, I'm very strongly against that, either as regards to code or documentation. +1 -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Hi, On 2023-01-07 21:06:06 +0300, Michail Nikolaev wrote: > 2) It is not an issue at table creation time. Issue is reproducible if > vacuum_defer_cleanup_age set after table preparation. > > 3) To reproduce the issue, vacuum_defer_cleanup_age should flip xid > over zero (be >= txid_current()). > And it is stable. So, for example - unable to reproduce with 733 > value, but 734 gives error each time. > Just a single additional txid_current() (after data is filled) fixes a > crash... It looks like the first SELECT FOR UPDATE + UPDATE silently > poisons everything somehow. > You could use such PSQL script: FWIW, the concrete value for vacuum_defer_cleanup_age is not crucial to encounter the problem. It needs to be a value that, when compared to the xid that did the "INSERT INTO something_is_wrong_here", results in a value <= 0. Setting vacuum_defer_cleanup_age to a value much larger than the xid allows the crash to be encountered repeatedly. Greetings, Andres Freund
Re: drop postmaster symlink
On Sat, 07 Jan 2023 18:38:25 -0500 Tom Lane wrote: > "Karl O. Pinc" writes: > > This is a review of Peter's 2 patches. I see only 1 small problem. > > > > > Looking at the documentation, a "postmaster" in the glossary is > > defined as the controlling process. This works; it needs to be > > called something. There is still a postmaster.pid (etc.) in the > > data directory. > > > The word "postmaster" (case insensitive) shows up 84 times in the > > documentation. I looked at all of these. > > Hmm ... I thought this patch was about getting rid of the > admittedly-obsolete installed symlink. If it's trying to get rid of > the "postmaster" terminology for our parent process, I'm very strongly > against that, either as regards to code or documentation. No. It's about getting rid of the symlink. I was grepping around to look for use of the symlink, started with the glossary just to be sure, and saw that "postmaster" is consistently (I think) used to refer to the controlling, parent, process. And wrote down what I was doing and what I found as I went along. And then sent out my notes. Sorry for the confusion. Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: drop postmaster symlink
On Sat, 7 Jan 2023 19:33:38 -0500 Joe Conway wrote: > On 1/7/23 18:38, Tom Lane wrote: > > "Karl O. Pinc" writes: > >> This is a review of Peter's 2 patches. I see only 1 small > >> problem. The small problem is a reference to a deleted file. Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: Perform streaming logical transactions by background workers and parallel apply
On Sat, Jan 7, 2023 at 2:25 PM Dilip Kumar wrote: > Today, I was analyzing this patch w.r.t recent commit c6e1f62e2c and found that pa_set_xact_state() should set the latch (wake up) for the leader worker as the leader could be waiting in pa_wait_for_xact_state(). What do you think? But otherwise, it should be okay w.r.t DDLs because this patch allows the leader worker to restart logical replication for subscription parameter change which will in turn stop/restart parallel workers if required. -- With Regards, Amit Kapila.
Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Hi, On 2023-01-07 16:29:23 -0800, Andres Freund wrote: > It's probably not too hard to fix specifically in this one place - we could > just clamp vacuum_defer_cleanup_age to be smaller than latest_completed, but > it strikes me as a somewhat larger issue for the 64bit xid infrastructure. I > suspect this might not be the only place running into problems with such > "before the universe" xids. I haven't found other problematic places in HEAD, but did end up finding a less serious version of this bug in < 14: GetFullRecentGlobalXmin(). I did verify that with vacuum_defer_cleanup_age set, GetFullRecentGlobalXmin() returns values that look likely to cause problems. It's "just" used in gist, luckily. It's hard to find places that do this kind of arithmetic; we traditionally haven't had a helper for it. So it's open-coded in various ways. xidStopLimit = xidWrapLimit - 300; if (xidStopLimit < FirstNormalTransactionId) xidStopLimit -= FirstNormalTransactionId; and oddly: xidVacLimit = oldest_datfrozenxid + autovacuum_freeze_max_age; if (xidVacLimit < FirstNormalTransactionId) xidVacLimit += FirstNormalTransactionId; or (in < 14): RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age; if (!TransactionIdIsNormal(RecentGlobalXmin)) RecentGlobalXmin = FirstNormalTransactionId; The currently existing places I found, other than the ones in procarray.c, luckily don't seem to convert the xids to 64bit xids. > For a bit I was thinking that we should introduce the notion that a > FullTransactionId can be from the past. Specifically that when the upper 32bit > are all set, we treat the lower 32bit as a value from before xid 0 using the > normal 32bit xid arithmetic. But it sucks to add overhead for that > everywhere. > > It might be a bit more palatable to designate one individual value, > e.g. 2^32-1<<32, as a "before xid 0" indicator - it doesn't matter how far > before the start of the universe an xid points to... On IM Thomas suggested we could reserve the 2^32-1 epoch for invalid values. I hacked up a patch that converts various fxid functions to inline functions with such assertions, and it indeed quickly catches the problem this thread reported, close to the source of the use. One issue with that is that it'd reduce what can be input for the xid8 type. But it's hard to believe that'd be a real issue? It's quite unfortunate that we don't have a test for vacuum_defer_cleanup_age yet :(. Greetings, Andres Freund
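A minimal sketch of what such an assertion could look like, using the existing EpochFromFullTransactionId() and PG_UINT32_MAX macros; the helper name is invented and this is not the actual patch.

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch of the idea above: treat the 2^32-1 epoch as reserved, so fxid
 * helpers can catch "before the universe" values close to where they are
 * produced.
 */
static inline void
AssertFullTransactionIdIsWithinUniverse(FullTransactionId fxid)
{
	Assert(EpochFromFullTransactionId(fxid) != PG_UINT32_MAX);
}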
Re: pgsql: Delay commit status checks until freezing executes.
Hi, On 2023-01-07 15:41:29 -0800, Peter Geoghegan wrote: > Do we need to do anything about this to the "pg_xact and pg_subtrans" > section of the transam README? Probably a good idea, although it doesn't neatly fit right now. > Also, does amcheck's get_xid_status() need a reference to these rules? Don't think so? What made you ask? > FWIW, I found an existing comment about this rule in the call to > TransactionIdAbortTree() from RecordTransactionAbort() -- which took > me quite a while to find. So you might have been remembering that > comment before. Possible, my memory is vague enough that it's hard to be sure... Greetings, Andres Freund
Re: drop postmaster symlink
On Sat, 7 Jan 2023 19:56:08 -0600 "Karl O. Pinc" wrote: > On Sat, 07 Jan 2023 18:38:25 -0500 > Tom Lane wrote: > > > "Karl O. Pinc" writes: > > > This is a review of Peter's 2 patches. I see only 1 small > > > problem. ... > > Hmm ... I thought this patch was about getting rid of the > > admittedly-obsolete installed symlink. ... > No. It's about getting rid of the symlink. The only way I could think of to review a patch that removes something is to report all the places I looked where a reference to the symlink might be. Then report what I found each place I looked and report if there's a problem, or might be. That way the committer knows where I didn't look if there's more that needs removing. Apologies that this was not clear. Regards, Karl Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein
Re: [BUG] Logical replica crash if there was an error in a function.
Thanks for your remarks. On 07.01.2023 15:27, vignesh C wrote: Few suggestions: 1) There is a warning: +# This would crash on the subscriber if not fixed +$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)"); + +my $result = $node_subscriber->wait_for_log( + "ERROR: relation \"error_name\" does not exist at character" +); "my" variable $result masks earlier declaration in same scope at t/100_bugs.pl line 400. You can change: my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); to $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); The reason is that the patch fell behind the master. Fixed in v4 together with rebasing on current master. 2) Now that the crash is fixed, you could change it to a better message: +# This would crash on the subscriber if not fixed +$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)"); + +my $result = $node_subscriber->wait_for_log( + "ERROR: relation \"error_name\" does not exist at character" +); Tried to make this comment more clear. Best wishes for the new year! -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Companycommit 16beb574708e0e2e993eb2fa1b093be97b8e53cf Author: Anton A. Melnikov Date: Sun Dec 11 06:26:03 2022 +0300 Add test for syntax error in the function in a a logical replication worker. See dea834938. diff --git a/src/test/subscription/t/100_bugs.pl b/src/test/subscription/t/100_bugs.pl index 036576acab..b241a7d498 100644 --- a/src/test/subscription/t/100_bugs.pl +++ b/src/test/subscription/t/100_bugs.pl @@ -69,6 +69,51 @@ $node_publisher->wait_for_catchup('sub1'); pass('index predicates do not cause crash'); +# https://www.postgresql.org/message-id/flat/adf0452f-8c6b-7def-d35e-ab516c80088e%40inbox.ru + +# The bug was that when a syntax error occurred in a SQL-language or PL/pgSQL-language +# CREATE FUNCTION or DO command executed in a logical replication worker, +# we'd suffer a null pointer dereference or assertion failure. + +$node_subscriber->safe_psql('postgres', q{ + create or replace procedure rebuild_test( + ) as + $body$ + declare + l_code text:= E'create or replace function public.test_selector( + ) returns setof public.tab1 as + \$body\$ + select * from error_name + \$body\$ language sql;'; + begin + execute l_code; + perform public.test_selector(); + end + $body$ language plpgsql; + create or replace function test_trg() + returns trigger as + $body$ + declare + begin + call public.rebuild_test(); + return NULL; + end + $body$ language plpgsql; + create trigger test_trigger after insert on tab1 for each row +execute function test_trg(); + alter table tab1 enable replica trigger test_trigger; +}); + +# This command will cause a crash on the replica if that bug hasn't been fixed +$node_publisher->safe_psql('postgres', "INSERT INTO tab1 VALUES (3, 4)"); + +my $result = $node_subscriber->wait_for_log( + "ERROR: relation \"error_name\" does not exist at character" +); + +ok($result, + "ERROR: Logical decoding doesn't fail on function error"); + # We'll re-use these nodes below, so drop their replication state. $node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION sub1"); $node_publisher->safe_psql('postgres', "DROP PUBLICATION pub1"); @@ -352,7 +397,7 @@ $node_subscriber->wait_for_subscription_sync($node_publisher, 'tap_sub_sch'); # Subscriber's sch1.t1 should receive the row inserted into the new sch1.t1, # but not the row inserted into the old sch1.t1 post-rename. 
-my $result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); +$result = $node_subscriber->safe_psql('postgres', "SELECT * FROM sch1.t1"); is( $result, qq(1 2), 'check data in subscriber sch1.t1 after schema rename');
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 08/01/23 03:56, David Rowley wrote: > (your email client still seems broken) I am looking at this again, will be changing clients from here onward. You might need to have another loop before the foreach loop that loops backwards through the WindowClauses and remembers the index of the WindowClause which has pathkeys contained in the query's ORDER BY pathkeys then apply the optimisation from that point in the main foreach loop. Also, if the condition within the foreach loop which checks when we want to apply this optimisation is going to be run > 1 time, then you should probably have boolean variable that's set before the loop which saves if we're going to try to apply the optimisation. That'll save from having to check things like if the query has a LIMIT clause multiple times. Thanks, this should do the trick. a) looks like the best plan to me. What's the point of pushing the sort below the WindowAgg in this case? The point of this optimisation is to reduce the number of sorts not to push them as deep into the plan as possible. We should only be pushing them down when it can reduce the number of sorts. There's no reduction in the number of sorts in the above plan. Yes, you are right, not in this case. I actually mentioned the wrong case here; the real problematic case is: EXPLAIN (COSTS OFF) SELECT empno, depname, min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary, sum(salary) OVER (PARTITION BY depname) depsalary FROM empsalary ORDER BY depname, empno, enroll_date; QUERY PLAN --- Incremental Sort Sort Key: depname, empno, enroll_date Presorted Key: depname, empno -> WindowAgg -> WindowAgg -> Incremental Sort Sort Key: depname, empno Presorted Key: depname -> Index Scan using depname_idx on empsalary (9 rows) Here, it could have sorted on depname, empno, enroll_date. Again, as I mentioned before, this is an implementation issue. We shouldn't be skipping the optimization when pre-sorted keys are present. -- Regards, Ankit Kumar Pandey
Re: Todo: Teach planner to evaluate multiple windows in the optimal order
On 08/01/23 04:06, David Rowley wrote: On Sun, 8 Jan 2023 at 05:45, Ankit Kumar Pandey wrote: Attached patch with test cases. I can look at this in a bit more detail if you find a way to fix the case you mentioned earlier. i.e, push the sort down to the deepest WindowAgg that has pathkeys contained in the query's ORDER BY pathkeys. EXPLAIN (COSTS OFF) SELECT empno, depname, min(salary) OVER (PARTITION BY depname ORDER BY empno) depminsalary, sum(salary) OVER (PARTITION BY depname) depsalary FROM empsalary ORDER BY depname, empno, enroll_date; QUERY PLAN --- Incremental Sort Sort Key: depname, empno, enroll_date Presorted Key: depname, empno -> WindowAgg -> WindowAgg -> Sort Sort Key: depname, empno -> Seq Scan on empsalary (8 rows) You'll also need to pay attention to how the has_runcondition is set. If you start pushing before looking at all the WindowClauses then you won't know if some later WindowClause has a runCondition. Yes, this should be the main culprit. Adding an additional backwards foreach loop should allow you to do all the required prechecks and find the index of the WindowClause which you should start pushing from. This should do the trick. Thanks for the heads-up; I will update the patch with the suggested changes and required fixes. Regards, Ankit