Hi Johnny, I just gave that a try but, as I suspected, it didn't help.
I still believe the problem is that MySQL can't handle set subtraction.
Instead, it has to do extra work to produce the "no match" rows with NULL
values in place so they can then be filtered out by the WHERE clause. This
type of query seems to be a corner case in MySQL one should be aware of
when working with large datasets.

Cheers,
Leonardo Borges
www.leonardoborges.com


On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers <joh...@pixelated.net> wrote:

> Leonardo,
>
> I think a new compound key on email_id and activity in the activities
> table may help.
>
> I'm not sure if this will help or not; it's hard to test w/o having a
> large data set to test against.
>
>
> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <leonardoborges...@gmail.com> wrote:
>
>> Sure can:
>>
>> show create table activities;
>>
>> CREATE TABLE `activities` (
>>   `id` int(11) NOT NULL AUTO_INCREMENT,
>>   `user_id` int(11) DEFAULT NULL,
>>   `email` varchar(100) DEFAULT NULL,
>>   `country_iso` varchar(2) DEFAULT NULL,
>>   `tags` varchar(255) DEFAULT NULL,
>>   `postcode` int(11) DEFAULT NULL,
>>   `activity` varchar(100) DEFAULT NULL,
>>   `page_id` int(11) DEFAULT NULL,
>>   `donation_frequency` varchar(100) DEFAULT NULL,
>>   `email_id` int(11) DEFAULT NULL,
>>   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
>>   PRIMARY KEY (`id`),
>>   KEY `user_idx` (`user_id`),
>>   KEY `email_idx` (`email_id`),
>>   KEY `activity_idx` (`activity`)
>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>
>>
>> And the explain:
>>
>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>> | id | select_type | table | type  | possible_keys                   | key      | key_len | ref        | rows   | Extra       |
>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>> |  1 | SIMPLE      | u     | index | NULL                            | id_idx   | 5       | NULL       | 972064 | Using index |
>> |  1 | SIMPLE      | a     | ref   | user_idx,email_idx,activity_idx | user_idx | 5       | getup.u.id |     20 | Using where |
>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers <joh...@pixelated.net> wrote:
>>
>>> Can you post show create table for activity and explain output of the
>>> problem query?
>>>
>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges" <leonardoborges...@gmail.com>
>>> wrote:
>>>
>>> Hello everyone,
>>>
>>> I have an increasingly popular web application running on top of MySQL
>>> and, due to its popularity, I'm running into performance issues. After
>>> carefully examining database indexes and tuning queries I was able to
>>> pin down the slowest part of the system.
>>>
>>> The app's got a user segmentation tool that allows you to filter users
>>> based on a range of criteria, of which the slowest is: "Select all
>>> users that did not receive the email of id 100."
>>>
>>> To answer this question we turn to the activities table, which is
>>> basically a denormalized log of actions taken by the user, in this
>>> format:
>>>
>>> user_id | activity   | email_id | ...
>>> 10      | email_sent | 100      | ...
>>> 10      | subscribed | NULL     | ...
>>> 10      | email_open | 100      | ...
>>>
>>> Given this table and the question above, the usual way of finding all
>>> users who did not receive this email is through a left outer join,
>>> such as:
>>>
>>> select u.id
>>> from users u
>>> left outer join activities a
>>>   on u.id = a.user_id
>>>   and a.activity = 'email_sent'
>>>   and a.email_id = 100
>>> where a.user_id is null
>>>
>>> That's all fine for medium-ish tables. However, our current activities
>>> table has over 13 million rows, slowing the hell out of this left
>>> outer join: it takes about 52 seconds on my machine.
>>>
>>> What this query is trying to do is get the relative complement of set
>>> A (users) in B (activities). As far as I know, MySQL doesn't support
>>> set subtraction, which I believe is why these queries are slow.
>>>
>>> Based on that I've set up a test database on PostgreSQL, which supports
>>> this very set operation, and rewrote the query to look like this:
>>>
>>> select u.id
>>> from users u
>>> except
>>> select a.user_id
>>> from activities a
>>> where a.activity = 'email_sent'
>>>   and a.email_id = 100;
>>>
>>> The fact that PostgreSQL knows how to subtract sets brought this query
>>> down to only 4 seconds.
>>>
>>> My question then is: since this is a somewhat common query in our
>>> system, are there any workarounds I could use in MySQL to improve
>>> things?
>>>
>>> I did find one myself, but it's a bit convoluted and might not perform
>>> well under load. Still, the following SQL script gives me similar
>>> performance in MySQL:
>>>
>>> create temporary table email_sent_100
>>> select a.user_id
>>> from user_activity_events a
>>> where a.activity = 'email_sent'
>>>   and a.email_id = 100;
>>>
>>> create index user_id_idx on email_sent_100(user_id); -- this could
>>> potentially bring the runtime down in the case of a large temp table.
>>>
>>> select count(u.id)
>>> from users u
>>> left outer join email_sent_100 s
>>>   on u.id = s.user_id
>>> where s.user_id is null;
>>>
>>> It's a lot more lines and a lot more complex, but it does the job in
>>> this example.
>>>
>>> I'd appreciate your thoughts.
>>>
>>> Cheers,
>>> Leonardo Borges
>>> www.leonardoborges.com
>>>
>>
>
>
> --
> -----------------------------
> Johnny Withers
> 601.209.4985
> joh...@pixelated.net
>
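
As a rough sketch of the two ideas discussed above, assuming the schema
from the SHOW CREATE TABLE output, something like the following could be
worth testing. The extra user_id column on the compound key and the NOT
EXISTS rewrite are assumptions to try out, not results from this thread:

-- Johnny's suggested compound key on (email_id, activity), extended with
-- user_id (an assumption) so that both the temp-table step and the
-- subquery below can be answered from the index alone:
alter table activities
  add key email_activity_user_idx (email_id, activity, user_id);

-- The same "users who never received email 100" question written as an
-- anti-join with NOT EXISTS. It is NULL-safe, unlike NOT IN, since
-- activities.user_id is nullable; whether the optimizer plans it any
-- better than LEFT JOIN ... IS NULL needs testing against real data:
select u.id
from users u
where not exists (
  select 1
  from activities a
  where a.user_id  = u.id
    and a.activity = 'email_sent'
    and a.email_id = 100
);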