Matthias Pigulla wrote:
Hi all,
can someone explain me why this query gets executed the way it is? I
simplified it as much as possible and think it's a conceptual/logical
thing, so I'll omit - at least for now - the table definitions and
sample data for brevity.
-- "Superquery"
EXPLAIN SELECT t1.id, t1.name
FROM document AS t1
WHERE t1.id
IN (
-- "Subquery"
SELECT DISTINCT data_id
FROM wfd_reference
WHERE wfd_field_id =699
AND ref_data_id
IN ( 171 )
)
This gives:
*************************** 1. row ***************************
id: 1
select_type: PRIMARY
table: t1
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 277
Extra: Using where
*************************** 2. row ***************************
id: 2
select_type: DEPENDENT SUBQUERY
table: wfd_reference
type: ref
possible_keys: field_data,test
key: field_data
key_len: 4
ref: const,func
rows: 4
Extra: Using where; Using temporary
However - why do we need the *dependent* subquery at all? The subquery
can be executed on its own, as it does not depend on any information of
the "superquery".
EXPLAIN SELECT DISTINCT data_id
FROM wfd_reference
WHERE wfd_field_id =699
AND ref_data_id
IN ( 171 )
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: wfd_reference
type: ref
possible_keys: field_data,test
key: test
key_len: 4
ref: const,const
rows: 9
Extra: Using where; Using temporary
Now if I just take the result of this query, concat the data_ids on the
application level and build the superquery as follows:
-- "two-staged superquery variant"
EXPLAIN SELECT t1.id, t1.name
FROM document AS t1
WHERE t1.id
IN (
32, 31, 30, 53, 56, 57, 58, 59, 60, 111
)
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
type: range
possible_keys: PRIMARY
key: PRIMARY
key_len: 2
ref: NULL
rows: 10
Extra: Using where
the results seem to be much better.
I was afraid of the dependent subquery for the "ALL" scan of t1, as t1
will become huge. OTOH, the subquery will be very restrictive: The
number of data_ids will always be very small, at least compared to the
number of rows in t1.
Are there any non-obvious reasons for the behaviour described above? Is
that something that cannot be optimized right now? Am I too
short-sighted with my "optimization" approach?
Thanks a lot,
Matthias
As others have already pointed out, mysql often doesn't optimize subqueries
properly, and fixing that doesn't seem to be a high priority right now.
Subqueries are relatively new in mysql, so it probably shouldn't be
surprising that they aren't optimized as well as other things. Fortunately,
most subqueries can be rewritten as joins
<http://dev.mysql.com/doc/mysql/en/rewriting-subqueries.html>, which mysql
does a good job of optimizing.
I believe your query is equivalent to
SELECT DISTINCT t1.id, t1.name
FROM document AS t1
JOIN wfd_reference AS wfd ON t1.id = wfd.data_id
WHERE wfd.wfd_field_id = 699
AND wfd.ref_data_id IN (171);
which should perform better than the (incorrectly) dependent subquery,
because it will be properly optimized. Mysql will use the 'test' index in
wfd_reference to pick the rows which match the conditions, then use t1's
primary key to get the corresponding rows from t1.
I notice the DISTINCT in your subquery. I take it that there may be
multiple rows in wfd_reference with the same data_id, wfd_field_id = 699,
and ref_data_id IN (171)? (If not, just leave out the DISTINCT in the
above.) In that case, it may be possible that your 2-step approach will be
even faster than the JOIN with DISTINCT. You can do this directly in mysql,
instead of in your app, by using a temporary table to store the inner query
result. Something like
CREATE TEMPORARY TABLE matches
SELECT DISTINCT data_id
FROM wfd_reference
WHERE wfd_field_id = 699
AND ref_data_id IN (171);
SELECT t1.id, t1.name
FROM document AS t1
JOIN matches ON t1.id = matches.data_id;
DROP TABLE matches;
Michael
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]