Re: Help! Nasty big table efficiency issues with GROUP BY

SGreen Mon, 19 Jul 2004 12:11:28 -0700

Have you tried moving the "GroupID not IN ()" into your JOIN clause?


SELECT * 
FROM News
JOIN FileGroup 
        ON News.FileID = FileGroup.FileID
        AND FileGroup.GroupID NOT IN (638)
WHERE News.Category = 7
GROUP BY News.FileID 
ORDER BY News.Date DESC 
LIMIT 100

I think, though, that because you are excluding only a small part of your 
FileGroup table, the optimizer might think that more than the "magic 
threshold" (somewhere around 30%) number of records will be returned and 
may still revert to a table scan. Another option could be to select all of 
the rows from FileGroup where the rows are IN the group to a temp table. 
Then LEFT JOIN the temp table to news and check for null values in the 
temporary table.

CREATE TEMPORARY TABLE tmpFG
SELECT FileID
FROM FileGroup
WHERE Group IN (638)

ALTER TABLE tmpFG ADD KEY(FileID)

SELECT * 
FROM News n
JOIN FileGroup fg
        ON n.FileID = fg.FileID
LEFT JOIN tmpFG tfg
        on fg.FileID = n.FileID
WHERE n.Category = 7
        AND tfg.fileID is null
GROUP BY n.FileID 
ORDER BY n.Date DESC 
LIMIT 100

DROP TABLE tmpFG

I suggest this as it is sometimes faster to find something then exclude 
what you find than to "not find" it the first time.

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine


Chris Elsworth <[EMAIL PROTECTED]> wrote on 07/19/2004 11:59:43 
AM:

> Hello,
> 
> I wonder if someone could shed some light on a problem that's been 
bothering
> me for months. Please bear with me as I explain it..
> 
> I have two tables in question:
> 
> CREATE TABLE `News` (
>   `FileID` int(8) unsigned NOT NULL auto_increment,
>   `Subject` char(200) NOT NULL default '',
>   `Category` tinyint(3) unsigned NOT NULL default '0',
>   `SubCategory` smallint(5) unsigned NOT NULL default '0',
>   `Date` int(10) unsigned NOT NULL default '0',
>   PRIMARY KEY  (`FileID`),
>   KEY `S_D` (`SubCategory`,`Date`),
>   KEY `C_D_P` (`Category`,`Date`,`PostID`),
>   KEY `C_P_D` (`Category`,`PostID`,`Date`,
>   KEY `Subject` (`Subject`(10)),
>   KEY `C_D` (`Category`,`Date`),
>   FULLTEXT KEY `ft_Subject` (`Subject`)
> ) TYPE=MyISAM
> 
> CREATE TABLE `FileGroup` (
>   `FileID` int(8) unsigned NOT NULL default '0',
>   `GroupID` smallint(5) unsigned NOT NULL default '0',
>   PRIMARY KEY  (`GroupID`,`FileID`),
>   KEY `F_G` (`FileID`,`GroupID`),
> ) TYPE=MyISAM
> 
> 
> News has about 2.5m rows and FileGroup has 3.1m rows.
> For each row in News, there are one or more corresponding rows in
> FileGroup, determining in which newsgroups each News article appears.
> 
> So, typically:
> mysql> select * from News where FileID = 26222004\G
> *************************** 1. row ***************************
>      FileID: 26222004
>     Subject: some_usenet_post
>    Category: 7
> SubCategory: 706
>        Date: 1090239675
> 
> mysql> select * from FileGroup where FileID = 26222004\G
> *************************** 1. row ***************************
>  FileID: 26222004
> GroupID: 638
> *************************** 2. row ***************************
>  FileID: 26222004
> GroupID: 799
> 
> 
> The problem occurs when I want to get News rows that do not appear
> in a specific group or set of groups. I also only want one row per News
> article, not one row per group, so I have a GROUP BY (I could maybe use
> DISTINCT too but they'd do pretty much the same) in there.
> 
> I end up with something like this:
> 
> SELECT * FROM News
>   JOIN FileGroup ON (News.FileID = FileGroup.FileID)
>   WHERE GroupID NOT IN (638) AND Category = 7
>   GROUP BY News.FileID ORDER BY Date DESC 
>   LIMIT 100
> 
> The resulting explain:
> +-----------+------+-----------------------------+-------+---------
> +-------------+--------+----------------------------------------------+
> | table     | type | possible_keys               | key   | key_len |
> ref         | rows   | Extra                                        |
> +-----------+------+-----------------------------+-------+---------
> +-------------+--------+----------------------------------------------+
> | News      | ref  | PRIMARY,C_D_P,C_P_D,C_D     | C_P_D |       1 |
> const       | 595494 | Using where; Using temporary; Using filesort |
> | FileGroup | ref  | F_G                         | F_G   |       4 |
> News.FileID |      1 | Using where                                  |
> +-----------+------+-----------------------------+-------+---------
> +-------------+--------+----------------------------------------------+
> 
> MySQL is being forced to create a temporary table because of the GROUP 
BY,
> and it pretty much seems to end up scanning the entire table - queries 
are
> taking upwards of 30 seconds. In the queries that I can remove the JOIN 
(I
> don't always need it, because I don't always need to exclude items in
> certain groups) it flies, because the temporary table and filesort 
pretty
> much always go; I can fiddle with indexes to make that always the case; 
the
> GROUP BY can go too, since the rows will always be unique.
> 
> I've typed myself out now so I'll keep the actual question short :)
> What can I do about this? Is there a more efficient way to store this 
data
> to avoid these horrific queries?
> 
> If I can supply any more relevant information I'll be only too pleased 
to.
> 
> Thanks for any hints in advance.
> 
> -- 
> Chris
> 
> -- 
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
>

Re: Help! Nasty big table efficiency issues with GROUP BY

Reply via email to