Re: [HACKERS] Performance improvement for joins where outer side is unique

Tomas Vondra Wed, 16 Dec 2015 16:18:07 -0800

Hi,

On 12/16/2015 11:40 PM, David Rowley wrote:

On 17 December 2015 at 05:02, Tomas Vondra <tomas.von...@2ndquadrant.com
<mailto:tomas.von...@2ndquadrant.com>> wrote:


    0) I know the patch does not tweak costing - any plans in this

        direction? Would it be possible to simply use the costing used by
        semijoin?


Many thanks for looking at this.

The patch does tweak the costings so that unique joins are costed in the
same way as semi joins. It's a very small change, so easy to miss.


Thanks. I missed that bit somehow.


For example, see the change in initial_cost_nestloop()

-       if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+       if (jointype == JOIN_SEMI ||
+               jointype == JOIN_ANTI ||
+               unique_inner)
         {

Also see the changes in final_cost_nestloop() and final_cost_hashjoin()


    1) nodeHashjoin.c (and other join nodes)

    I've noticed we have this in the ExecHashJoin() method:

      /*
       * When the inner side is unique or we're performing a
       * semijoin, we'll consider returning the first match, but
       * after that we're done with this outer tuple.
       */
      if (node->js.unique_inner)
          node->hj_JoinState = HJ_NEED_NEW_OUTER;

    That seems a bit awkward because the comment speaks about unique
    joins *OR* semijoins, but the check only references unique_inner.
    That of course works because we do this in ExecInitHashJoin():

             switch (node->join.jointype)
             {
                     case JOIN_SEMI:
                             hjstate->js.unique_inner = true;
                             /* fall through */

    Either some semijoins may not be unique - in that case the naming is
    misleading at best, and we either need to find a better name or
    simply check two conditions like this:

      if (node->js.unique_inner || node->join.type == JOIN_SEMI)
         node->hj_JoinState = HJ_NEED_NEW_OUTER;

    Or all semijoins are unique joins, and in that case the comment may
    need rephrasing. But more importantly, it begs the question why
    we're detecting this in the executor and not in the planner? Because
    if we detect it in executor, we can't use this bit in costing, for
    example.


The reason not to detect that in the planner is simply that unique_join
is meant to mean that the relation on the inner side of the join will at
most contain a single Tuple which matches each outer relation tuple,
based on the join condition. I've added no detection for this in semi
joins, and I'd rather not go blindly setting the flag to true in the
planner as it simply may not be true for the semi join. At the moment
that might not matter as we're only using the unique_join flag as an
optimization in the join nodes, but I'd prefer not to do this as its
likely we'll want to do more with this flag later, and I'd rather we
keep the meaning well defined. You might argue that this is also true
for semi joins, but if down the road somewhere we want to perform some
checks on the inner relation before the join takes place, and in that
case the Tuples of the relation might not have the same properties we
claim they do.

But you're right that reusing the flag in the join nodes is not ideal,
and the comment is not that great either. I'd really rather go down the
path of either renaming the variable, or explaining this better in the
comment. It seems unnecessary to check both for each tuple being joined.
I'd imagine that might add a little overhead to joins which are not unique.


I'd be very surprised it that had any measurable impact.

How about changing the comment to:

/*
* In the event that the inner side has been detected to be
* unique, as an optimization we can skip searching for any
* subsequent matching inner tuples, as none will exist.
* For semijoins unique_inner will always be true, although
* in this case we don't match another inner tuple as this
* is the required semi join behavior.
*/

Alternatively or additionally we can rename the variable in the executor
state, although I've not thought of a name yet that I don't find overly
verbose: unique_inner_or_semi_join,  match_first_tuple_only.


I'd go with match_first_tuple_only.



    2) analyzejoins.c

    I see that this code in join_is_removable()

         innerrel = find_base_rel(root, innerrelid);

         if (innerrel->reloptkind != RELOPT_BASEREL)
             return false;

    was rewritten like this:

         innerrel = find_base_rel(root, innerrelid);

         Assert(innerrel->reloptkind == RELOPT_BASEREL);

    That suggests that previously the function expected cases where
    reloptkind may not be RELOPT_BASEREL, but now it'll error out int
    such cases. I haven't noticed any changes in the surrounding code
    that'd guarantee this won't happen, but I also haven't been able to
    come up with an example triggering the assert (haven't been trying
    too hard). How do we know the assert() makes sense?


I'd have changed this as this should be covered by the if
(!sjinfo->is_unique_join || a few lines up. mark_unique_joins() only
sets is_unique_join to true if specialjoin_is_unique_join() returns true
and that function tests reloptkind to ensure its set to RELOPT_BASEREL
and return false if the relation is not. Perhaps what is missing is a
comment to explain the function should only be used on RELOPT_BASEREL
type relations.

Yeah, I think it'd be good to document this contract somewhere. Eitherin the comment before the function, or perhaps right above the Assert().


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Performance improvement for joins where outer side is unique

Reply via email to