Re: [HACKERS] [BUGS] New hashed IN code ignores distinctiveness of subquery

Bradley Baetz Sun, 26 Jan 2003 23:29:33 -0800

[Moving -> -hackers]

On Mon, Jan 27, 2003 at 12:00:08AM -0500, Tom Lane wrote:
> Bradley Baetz <[EMAIL PROTECTED]> writes:


> > However, its much faster (although not as fast as sticking the DISTINCT
> > in there myself), but the actual rows coming from the sort is really odd
> > - where is that number coming from? How can sorting 9 rows take 44476
> > anythings?
> 
> We're back full circle to my original comment about rescans in
> mergejoin.  The EXPLAIN ANALYZE instrumentation counts a re-fetch
> as another returned row.

Hmm. OK, I poked through the code a bit more, and I think I now realise
why we were talking across each other. I've attached a 'patch' which
gets the mergejoin counts down to something reasonable. (The patch also
includes a bit to fix the estimated row count for JOIN_IN, as discussed
on -bugs)

When calculating the cost for a join, if a path is a T_UniquePath, then
the reduction in the number of rows to be examined isn't taken into
account. This is why the mergejoin code was calulating the cost of
merging two 50000 tuple paths - the overestimation that you menioned
earlier isn't as important here.

For reference, bugs is a 50000 row table, and there are 9 distinct
values for product_id.

Before(with the num_rows part of the patch, though)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                                   QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3494816.98..3494816.98 rows=1 width=8) (actual
time=579.71..579.71 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..3494691.43 rows=50218 width=8) (actual
time=111.41..530.16 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..249.57
rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual
time=111.25..143.42 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4)
(actual time=111.17..111.18 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.41 rows=50000 loops=1)
 Total runtime: 579.84 msec
(9 rows)

After:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                                   QUERY
PLAN

-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=8007.21..8007.21 rows=1 width=8) (actual
time=578.16..578.16 rows=1 loops=1)
   ->  Merge Join  (cost=5169.41..7881.67 rows=50218 width=8) (actual
time=110.94..527.79 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..250.74
rows=50000 loops=1)
         ->  Sort  (cost=920.14..920.17 rows=9 width=4) (actual
time=110.78..142.80 rows=44476 loops=1)
               Sort Key: public.bugs.product_id
               ->  HashAggregate  (cost=920.00..920.00 rows=9 width=4)
(actual time=110.70..110.71 rows=9 loops=1)
                     ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.00..67.14 rows=50000 loops=1)
 Total runtime: 578.30 msec
(9 rows)

The patch isn't correct as-is, because it only covers merge joins:

bbaetz=# set enable_mergejoin=false;
SET
bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281712.05..4281712.05 rows=1 width=8) (actual
time=410.14..410.14 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=50218 width=8) (actual
time=126.32..362.30 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..66.81 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual
time=126.08..126.08 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.02..68.23 rows=50000 loops=1)
 Total runtime: 410.25 msec
(7 rows)

I don't think that propogating my hack to everywhere which wants to know
how many rows are returned is a good idea though - is there a more
generic way to get the number of rows really returned by a path?

> 
>                       regards, tom lane

Bradley

Index: src/backend/optimizer/path/costsize.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/optimizer/path/costsize.c,v
retrieving revision 1.102
diff -u -r1.102 costsize.c
--- src/backend/optimizer/path/costsize.c       2003/01/22 20:16:40     1.102
+++ src/backend/optimizer/path/costsize.c       2003/01/27 07:26:15
@@ -808,6 +808,15 @@
        outerscansel = outer_rows / outer_path->parent->rows;
        innerscansel = inner_rows / inner_path->parent->rows;
 
+    /* If there is a UniquePath, then the actual number of rows to be
+     * merged will be less
+     */
+    if (outer_path->type == T_UniquePath)
+        outer_rows = ((UniquePath*)outer_path)->rows;
+
+    if (inner_path->type == T_UniquePath)
+        inner_rows = ((UniquePath*)inner_path)->rows;
+
        /* cost of source data */
 
        /*
@@ -873,7 +882,7 @@
         * counts here, not the scaled-down counts we obtained above.
         */
        ntuples = approx_selectivity(root, mergeclauses) *
-               outer_path->parent->rows * inner_path->parent->rows;
+               outer_rows * inner_rows;
 
        /* CPU costs */
        cost_qual_eval(&restrict_qual_cost, restrictlist);
@@ -1556,18 +1565,14 @@
                                temp = outer_rel->rows;
                        if (temp < inner_rel->rows)
                                temp = inner_rel->rows;
-                       break;
-               case JOIN_IN:
-                       temp = outer_rel->rows * selec;
-                       break;
-               case JOIN_REVERSE_IN:
-                       temp = inner_rel->rows * selec;
                        break;
+        case JOIN_REVERSE_IN:
                case JOIN_UNIQUE_OUTER:
                        upath = create_unique_path(root, outer_rel,
                                                                           
outer_rel->cheapest_total_path);
                        temp = upath->rows * inner_rel->rows * selec;
                        break;
+        case JOIN_IN:
                case JOIN_UNIQUE_INNER:
                        upath = create_unique_path(root, inner_rel,
                                                                           
inner_rel->cheapest_total_path);

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

Re: [HACKERS] [BUGS] New hashed IN code ignores distinctiveness of subquery

Reply via email to