Re: Query with group by

Michael Stassen Fri, 17 Sep 2004 08:36:04 -0700


Jose Miguel Pérez wrote:

Hi Michael!

    Yes, you're right, thanks for extending and clarifying my message.
However, I'm not confident about your comments on the inefficiency for the
following query:


I'll explain my reasoning below.

       SELECT c1.date, c1.location, c1.version
       FROM cities c1
       LEFT JOIN cities c2
           ON c1.location=c2.location AND c1.content=c2.content
               AND c2.date>c1.date
       WHERE c2.id IS NULL AND c1.content = 'ALPHA'

This will work, but it is also somewhat inefficient.  The LEFT JOIN is
creating numerous extra, unwanted rows, only to throw them away with the
WHERE c2.id IS NULL.  Assuming n rows for a particular location value, you


    [...] and then [...]

The most efficient way is probably to use a temporary table.

  CREATE TEMPORARY TABLE max_dates
  SELECT location, MAX(date) AS max_date
  FROM temp
  WHERE content = 'ALPHA'
  GROUP BY location;

  SELECT t.*
  FROM temp t, max_dates m
  WHERE t.location = m.location
  AND t.date = m.max_date;

I notice I missed something here. We need to add "AND content = 'ALPHA', in case there is a row in t with the wrong content which matches the location and max_date from m. (There isn't one in the sample data.)

DROP TABLE max_dates;


    I don't think the temporary table is such an efficient way of doing
this. Pardon me, I'm provably wrong, but let me explain to see if I think
correctly. First, I assume as true this table have an index on "location",
"content" and "date", apart from the PK on ID. Given that, on my query we
are using the keys at full, I mean, although you say "the left join is
creating numerous extra, unwanted rows", this is not true. We could apply
the standard algebra here, but the real world query optimizers are smart
enough to not retrieve unwanted data. (What about joining four or more
tables! Multiply then).

You are certainly right that indexes, or their lack, will make a difference. I assumed proper indexing in my answer, but I probably should have made that explicit.

I do not claim that joins are not optimized. Mysql is certainly smart enough, in general, to only fetch matching rows for joins, and to use indexes to do so when they are available. But there are limits. The problem in this case is the condition on the right side of the LEFT JOIN.

From the manual, section "7.2.8 How MySQL Optimizes LEFT JOIN and RIGHT JOIN" <http://dev.mysql.com/doc/mysql/en/LEFT_JOIN_optimization.html>:

  A LEFT JOIN B join_condition is implemented in MySQL as follows:
  ...
  * The LEFT JOIN condition is used to decide how to retrieve rows from
    table B. (In other words, any condition in the WHERE clause is not
    used.)
  ...

So, the "WHERE c2.id IS NULL" cannot be applied until after the rows which match the ON clause (and the NULL rows) have been fetched. To illustrate, here's the LEFT JOIN query without the right side condition (I added a few columns to make it clear what's happening):

  SELECT c1.id id1, c1.date, c1.location, c1.version, c2.id id2, c2.date
  FROM temp c1
  LEFT JOIN temp c2
    ON c1.location=c2.location AND c1.content=c2.content
       AND c2.date>c1.date
  WHERE c1.content = 'ALPHA';
+------+------------+----------+---------+------+------------+
| id1  | date       | location | version | id2  | date       |
+------+------------+----------+---------+------+------------+
|    1 | 2004-09-14 | PARIS    |      10 |    2 | 2004-09-15 |
|    1 | 2004-09-14 | PARIS    |      10 |    3 | 2004-09-16 |
|    2 | 2004-09-15 | PARIS    |      11 |    3 | 2004-09-16 |
|    3 | 2004-09-16 | PARIS    |      10 | NULL | NULL       |
|    4 | 2004-09-14 | NEW-YORK |      11 |    5 | 2004-09-15 |
|    4 | 2004-09-14 | NEW-YORK |      11 |    6 | 2004-09-16 |
|    5 | 2004-09-15 | NEW-YORK |      11 |    6 | 2004-09-16 |
|    6 | 2004-09-16 | NEW-YORK |      10 | NULL | NULL       |
|    7 | 2004-09-14 | TOKYO    |      10 |    8 | 2004-09-15 |
|    8 | 2004-09-15 | TOKYO    |      11 | NULL | NULL       |
+------+------------+----------+---------+------+------------+
10 rows in set (0.07 sec)

There are actually 2 more rows than the original table has with content='ALPHA'! For each group (location value), we get 1 + (n(n-1)/2) results, where n is the number of rows in the group. In this case, 10 rows to get the 3 we want.

Now, there is a special case where a condition on the right side can help the optimizer (same manual page as above):

  * If you use LEFT JOIN to find rows that don't exist in some table and
    you have the following test: col_name IS NULL in the WHERE part, where
    col_name is a column that is declared as NOT NULL, MySQL stops
    searching for more rows (for a particular key combination) after it has
    found one row that matches the LEFT JOIN condition.

So, if id has been declared NOT NULL, which seems likely, mysql should stop looking at a given date-location combination as soon as a right side match is found. That reduces the rows to be filtered to:

+------+------------+----------+---------+------+------------+
| id1  | date       | location | version | id2  | date       |
+------+------------+----------+---------+------+------------+
|    1 | 2004-09-14 | PARIS    |      10 |    2 | 2004-09-15 |
|    2 | 2004-09-15 | PARIS    |      11 |    3 | 2004-09-16 |
|    3 | 2004-09-16 | PARIS    |      10 | NULL | NULL       |
|    4 | 2004-09-14 | NEW-YORK |      11 |    5 | 2004-09-15 |
|    5 | 2004-09-15 | NEW-YORK |      11 |    6 | 2004-09-16 |
|    6 | 2004-09-16 | NEW-YORK |      10 | NULL | NULL       |
|    7 | 2004-09-14 | TOKYO    |      10 |    8 | 2004-09-15 |
|    8 | 2004-09-15 | TOKYO    |      11 | NULL | NULL       |
+------+------------+----------+---------+------+------------+

That's a little better, 8 rows to get 3. In general, you'll get one row for each location-date combination in this case. That is, one row for every row in the table on the LEFT which matches the WHERE location = 'ALPHA'.

    Your query is creating a temporary table, doing a full scan of it
(thanks to the MAX(date) function), etc. If you do a EXPLAIN SELECT for your
query, you'll notice there is an Extra of: "Using where; Using temporary;
Using FILESORT". Reading the MySQL documentation, one can see "If you want
to make your queries as fast as possible, you should look out for Extra
values of Using filesort and Using temporary.". (Chapter 7.2.1 EXPLAIN
Syntax).


In reverse order:

You are certainly right that "temporary" and "filesort" are to be avoided. And they will be, if the table is properly indexed. Single column indexing won't help much here, because the WHERE condition, the GROUP BY column, and the MAX column are all different. A multi-column index on (content, location, date), however, will allow mysql to use the index to find the matching rows, find the groups, and calculate the MAX date. In that case, EXPLAIN shows "Using where; Using index".

Yes, a full table scan of the temporary table is done in step 2, but that's OK, because the temporary table has precisely one row per desired result.

  SELECT * FROM max_dates;
  +----------+------------+
  | location | max_date   |
  +----------+------------+
  | NEW-YORK | 2004-09-16 |
  | PARIS    | 2004-09-16 |
  | TOKYO    | 2004-09-15 |
  +----------+------------+
  3 rows in set (0.00 sec)

    If I'm not wrong, maybe the first LEFT JOIN is worse from a mathematical
point of view, but the temporary one maybe is the worst from a practical
perspective.

I meant my comments to be practical, rather than theoretical. In general, I am wary of solutions which select extra rows then filter for the desired results. Sometimes that's necessary, but my expectation is that a solution which selects only what you need should be faster most of the time. For this small sample table, the difference is irrelevant. For a large, busy, production db, a difference in speed may be a big deal. In that case, one could certainly try it both ways to see which is faster.

    And you'll see I'm very cautious because I'm not such a SQL guru, but
I'd like to know other opinions.

    Anyway, I don't know if one can program an agregate UDF called something
like EXTERNAL_MAX(...) or something, so that we could do like:

        SELECT EXTERNAL_MAX(date, version)  ---> i.e: Returns the "version"
value for the row with MAX(date).

This, for sure, will be the best solution. ;-)


That would have to do the same thing behind the scenes.

    Cheers,
    Jose Miguel.


Michael

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Query with group by

Reply via email to