Hi Richard, On Wed, Jul 30, 2025 at 9:08 PM Richard Guo <guofengli...@gmail.com> wrote:
> On Thu, Jul 31, 2025 at 10:33 AM Alexandra Wang > <alexandra.wang....@gmail.com> wrote: > > Thanks for the patch! I applied your patch and played around with it. > > Thanks for trying out this patch. > > > I have a question about the following test case you added in > > subselect.sql: > > > I was under the impression that we wanted Unique on top of a sorted > > node for the inner of the SIMI join. However, the expected output uses > > a HashAgg instead. Is this expected? > > Hmm, I don't think we have such expectation that "Sort+Unique" should > be used for the unique-ification step in this query. This patch > considers both hash-based and sort-based implementations, and lets the > cost comparison determine which one is preferable. So I wouldn't be > surprised if the planner ends up choosing the hash-based > implementation in the final plan. > > > While looking at the code, I also had a question about the following > > changes in costsize.c: > > > > --- a/src/backend/optimizer/path/costsize.c > > +++ b/src/backend/optimizer/path/costsize.c > > @@ -3963,7 +3963,9 @@ final_cost_mergejoin(PlannerInfo *root, MergePath > *path, > > * The whole issue is moot if we are working from a unique-ified outer > > * input, or if we know we don't need to mark/restore at all. > > */ > > - if (IsA(outer_path, UniquePath) || path->skip_mark_restore) > > + if (IsA(outer_path, UniquePath) || > > + IsA(outer_path, AggPath) || > > + path->skip_mark_restore) > > > > and > > > > @@ -4358,7 +4360,7 @@ final_cost_hashjoin(PlannerInfo *root, HashPath > *path, > > * because we avoid contaminating the cache with a value that's wrong > for > > * non-unique-ified paths. > > */ > > - if (IsA(inner_path, UniquePath)) > > + if (IsA(inner_path, UniquePath) || IsA(inner_path, AggPath)) > > > > I'm curious why AggPath was added in these two cases. > > Well, in final_cost_hashjoin() and final_cost_mergejoin(), we have > some special cases when the inner or outer relation has been > unique-ified. Previously, it was sufficient to check whether the path > was a UniquePath, since both hash-based and sort-based implementations > were represented that way. However, with this patch, UniquePath now > only represents the sort-based implementation, so we also need to > check for AggPath to account for the hash-based case. > > > The second diff looks fine to me. However, the first diff in the > > subselect test happened to be the test that I asked about. > > > > Here's the comparison of the two EXPLAIN ANALYZE results: > > > While both runs are fast (28ms vs. 17ms), the plan generated by the v5 > > patch is slower in this case. > > This is a very interesting observation. In fact, with the original v5 > patch, you can produce both plans by setting enable_hashagg on and > off. > > set enable_hashagg to on; > Incremental Sort (cost=91.95..277.59 rows=2500 width=16) > (actual time=31.960..147.040 rows=90000.00 loops=1) > > set enable_hashagg to off; > Merge Join (cost=70.14..294.34 rows=2500 width=16) > (actual time=4.303..71.891 rows=90000.00 loops=1) > > This seems to be another case where the planner chooses a suboptimal > plan due to inaccurate cost estimates. > Thanks for explaining! It makes sense now! While reviewing the code, the following diff concerns me: /* * If the joinrel is parallel-safe, we may be able to consider a partial - * merge join. However, we can't handle JOIN_UNIQUE_OUTER, because the - * outer path will be partial, and therefore we won't be able to properly - * guarantee uniqueness. Similarly, we can't handle JOIN_FULL, JOIN_RIGHT - * and JOIN_RIGHT_ANTI, because they can produce false null extended rows. + * merge join. However, we can't handle JOIN_FULL, JOIN_RIGHT and + * JOIN_RIGHT_ANTI, because they can produce false null extended rows. * Also, the resulting path must not be parameterized. */ if (joinrel->consider_parallel && - save_jointype != JOIN_UNIQUE_OUTER && - save_jointype != JOIN_FULL && - save_jointype != JOIN_RIGHT && - save_jointype != JOIN_RIGHT_ANTI && + jointype != JOIN_FULL && + jointype != JOIN_RIGHT && + jointype != JOIN_RIGHT_ANTI && outerrel->partial_pathlist != NIL && bms_is_empty(joinrel->lateral_relids)) What has changed that enabled JOIN_UNIQUE_OUTER to handle partial paths? Or is it no longer possible for the outer path to be partial? Best, Alex