Jose Miguel Pérez wrote:
Hi Michael!
Yes, you're right, thanks for extending and clarifying my message. However, I'm not confident about your comments on the inefficiency for the following query:
I'll explain my reasoning below.
SELECT c1.date, c1.location, c1.version FROM cities c1 LEFT JOIN cities c2 ON c1.location=c2.location AND c1.content=c2.content AND c2.date>c1.date WHERE c2.id IS NULL AND c1.content = 'ALPHA'
This will work, but it is also somewhat inefficient. The LEFT JOIN is creating numerous extra, unwanted rows, only to throw them away with the WHERE c2.id IS NULL. Assuming n rows for a particular location value, you
[...] and then [...]
The most efficient way is probably to use a temporary table.
CREATE TEMPORARY TABLE max_dates SELECT location, MAX(date) AS max_date FROM temp WHERE content = 'ALPHA' GROUP BY location;
SELECT t.* FROM temp t, max_dates m WHERE t.location = m.location AND t.date = m.max_date;
I notice I missed something here. We need to add "AND content = 'ALPHA', in case there is a row in t with the wrong content which matches the location and max_date from m. (There isn't one in the sample data.)
DROP TABLE max_dates;
I don't think the temporary table is such an efficient way of doing this. Pardon me, I'm provably wrong, but let me explain to see if I think correctly. First, I assume as true this table have an index on "location", "content" and "date", apart from the PK on ID. Given that, on my query we are using the keys at full, I mean, although you say "the left join is creating numerous extra, unwanted rows", this is not true. We could apply the standard algebra here, but the real world query optimizers are smart enough to not retrieve unwanted data. (What about joining four or more tables! Multiply then).
You are certainly right that indexes, or their lack, will make a difference. I assumed proper indexing in my answer, but I probably should have made that explicit.
I do not claim that joins are not optimized. Mysql is certainly smart enough, in general, to only fetch matching rows for joins, and to use indexes to do so when they are available. But there are limits. The problem in this case is the condition on the right side of the LEFT JOIN.
From the manual, section "7.2.8 How MySQL Optimizes LEFT JOIN and RIGHT JOIN" <http://dev.mysql.com/doc/mysql/en/LEFT_JOIN_optimization.html>:
A LEFT JOIN B join_condition is implemented in MySQL as follows: ... * The LEFT JOIN condition is used to decide how to retrieve rows from table B. (In other words, any condition in the WHERE clause is not used.) ...
So, the "WHERE c2.id IS NULL" cannot be applied until after the rows which match the ON clause (and the NULL rows) have been fetched. To illustrate, here's the LEFT JOIN query without the right side condition (I added a few columns to make it clear what's happening):
SELECT c1.id id1, c1.date, c1.location, c1.version, c2.id id2, c2.date FROM temp c1 LEFT JOIN temp c2 ON c1.location=c2.location AND c1.content=c2.content AND c2.date>c1.date WHERE c1.content = 'ALPHA'; +------+------------+----------+---------+------+------------+ | id1 | date | location | version | id2 | date | +------+------------+----------+---------+------+------------+ | 1 | 2004-09-14 | PARIS | 10 | 2 | 2004-09-15 | | 1 | 2004-09-14 | PARIS | 10 | 3 | 2004-09-16 | | 2 | 2004-09-15 | PARIS | 11 | 3 | 2004-09-16 | | 3 | 2004-09-16 | PARIS | 10 | NULL | NULL | | 4 | 2004-09-14 | NEW-YORK | 11 | 5 | 2004-09-15 | | 4 | 2004-09-14 | NEW-YORK | 11 | 6 | 2004-09-16 | | 5 | 2004-09-15 | NEW-YORK | 11 | 6 | 2004-09-16 | | 6 | 2004-09-16 | NEW-YORK | 10 | NULL | NULL | | 7 | 2004-09-14 | TOKYO | 10 | 8 | 2004-09-15 | | 8 | 2004-09-15 | TOKYO | 11 | NULL | NULL | +------+------------+----------+---------+------+------------+ 10 rows in set (0.07 sec)
There are actually 2 more rows than the original table has with content='ALPHA'! For each group (location value), we get 1 + (n(n-1)/2) results, where n is the number of rows in the group. In this case, 10 rows to get the 3 we want.
Now, there is a special case where a condition on the right side can help the optimizer (same manual page as above):
* If you use LEFT JOIN to find rows that don't exist in some table and you have the following test: col_name IS NULL in the WHERE part, where col_name is a column that is declared as NOT NULL, MySQL stops searching for more rows (for a particular key combination) after it has found one row that matches the LEFT JOIN condition.
So, if id has been declared NOT NULL, which seems likely, mysql should stop looking at a given date-location combination as soon as a right side match is found. That reduces the rows to be filtered to:
+------+------------+----------+---------+------+------------+ | id1 | date | location | version | id2 | date | +------+------------+----------+---------+------+------------+ | 1 | 2004-09-14 | PARIS | 10 | 2 | 2004-09-15 | | 2 | 2004-09-15 | PARIS | 11 | 3 | 2004-09-16 | | 3 | 2004-09-16 | PARIS | 10 | NULL | NULL | | 4 | 2004-09-14 | NEW-YORK | 11 | 5 | 2004-09-15 | | 5 | 2004-09-15 | NEW-YORK | 11 | 6 | 2004-09-16 | | 6 | 2004-09-16 | NEW-YORK | 10 | NULL | NULL | | 7 | 2004-09-14 | TOKYO | 10 | 8 | 2004-09-15 | | 8 | 2004-09-15 | TOKYO | 11 | NULL | NULL | +------+------------+----------+---------+------+------------+
That's a little better, 8 rows to get 3. In general, you'll get one row for each location-date combination in this case. That is, one row for every row in the table on the LEFT which matches the WHERE location = 'ALPHA'.
Your query is creating a temporary table, doing a full scan of it (thanks to the MAX(date) function), etc. If you do a EXPLAIN SELECT for your query, you'll notice there is an Extra of: "Using where; Using temporary; Using FILESORT". Reading the MySQL documentation, one can see "If you want to make your queries as fast as possible, you should look out for Extra values of Using filesort and Using temporary.". (Chapter 7.2.1 EXPLAIN Syntax).
In reverse order:
You are certainly right that "temporary" and "filesort" are to be avoided. And they will be, if the table is properly indexed. Single column indexing won't help much here, because the WHERE condition, the GROUP BY column, and the MAX column are all different. A multi-column index on (content, location, date), however, will allow mysql to use the index to find the matching rows, find the groups, and calculate the MAX date. In that case, EXPLAIN shows "Using where; Using index".
Yes, a full table scan of the temporary table is done in step 2, but that's OK, because the temporary table has precisely one row per desired result.
SELECT * FROM max_dates; +----------+------------+ | location | max_date | +----------+------------+ | NEW-YORK | 2004-09-16 | | PARIS | 2004-09-16 | | TOKYO | 2004-09-15 | +----------+------------+ 3 rows in set (0.00 sec)
If I'm not wrong, maybe the first LEFT JOIN is worse from a mathematical point of view, but the temporary one maybe is the worst from a practical perspective.
I meant my comments to be practical, rather than theoretical. In general, I am wary of solutions which select extra rows then filter for the desired results. Sometimes that's necessary, but my expectation is that a solution which selects only what you need should be faster most of the time. For this small sample table, the difference is irrelevant. For a large, busy, production db, a difference in speed may be a big deal. In that case, one could certainly try it both ways to see which is faster.
And you'll see I'm very cautious because I'm not such a SQL guru, but I'd like to know other opinions.
Anyway, I don't know if one can program an agregate UDF called something like EXTERNAL_MAX(...) or something, so that we could do like:
SELECT EXTERNAL_MAX(date, version) ---> i.e: Returns the "version" value for the row with MAX(date).
This, for sure, will be the best solution. ;-)
That would have to do the same thing behind the scenes.
Cheers, Jose Miguel.
Michael
-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]