Hi Johnny, I just gave that a try but, as I suspected, it didn't help.
I still believe the problem is that MySQL can't handle set subtraction.
Instead, it has to do extra work to produce the "no match" rows with NULL
values in place so they can then be filtered out by the WHERE clause. This
type of query seems to be a corner case in MySQL one should be aware of
when working with large datasets.

Cheers,
Leonardo Borges
www.leonardoborges.com


On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers <joh...@pixelated.net> wrote:

> Leonardo,
>
> I think a new compound key on email_id and activity in the activities
> table may help.
>
> I'm not sure if this will help or not; it's hard to test w/o having a
> large data set to test against.
>
>
> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <leonardoborges...@gmail.com> wrote:
>
>> Sure can:
>>
>> show create table activities;
>>
>> CREATE TABLE `activities` (
>>   `id` int(11) NOT NULL AUTO_INCREMENT,
>>   `user_id` int(11) DEFAULT NULL,
>>   `email` varchar(100) DEFAULT NULL,
>>   `country_iso` varchar(2) DEFAULT NULL,
>>   `tags` varchar(255) DEFAULT NULL,
>>   `postcode` int(11) DEFAULT NULL,
>>   `activity` varchar(100) DEFAULT NULL,
>>   `page_id` int(11) DEFAULT NULL,
>>   `donation_frequency` varchar(100) DEFAULT NULL,
>>   `email_id` int(11) DEFAULT NULL,
>>   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
>>   PRIMARY KEY (`id`),
>>   KEY `user_idx` (`user_id`),
>>   KEY `email_idx` (`email_id`),
>>   KEY `activity_idx` (`activity`)
>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>
>>
>> And the explain:
>>
>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>> | id | select_type | table | type  | possible_keys                   | key      | key_len | ref        | rows   | Extra       |
>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>> |  1 | SIMPLE      | u     | index | NULL                            | id_idx   | 5       | NULL       | 972064 | Using index |
>> |  1 | SIMPLE      | a     | ref   | user_idx,email_idx,activity_idx | user_idx | 5       | getup.u.id |     20 | Using where |
>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers <joh...@pixelated.net> wrote:
>>
>>> Can you post show create table for activity and explain output of the
>>> problem query?
>>>
>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges" <leonardoborges...@gmail.com>
>>> wrote:
>>>
>>> Hello everyone,
>>>
>>> I have an increasingly popular web application running on top of MySQL
>>> and, due to its popularity, I'm running into performance issues. After
>>> carefully examining database indexes and tuning queries I was able to
>>> pin down the slowest part of the system.
>>>
>>> The app's got a user segmentation tool that allows you to filter users
>>> based on a range of criteria, of which the slowest is: "Select all
>>> users that did not receive the email of id 100."
>>>
>>> To answer this question we turn to the activities table, which is
>>> basically a denormalized log of actions taken by the user, in this
>>> format:
>>>
>>> user_id | activity   | email_id | ...
>>> 10      | email_sent | 100      | ...
>>> 10      | subscribed | NULL     | ...
>>> 10      | email_open | 100      | ...
>>>
>>> Given this table and the question above, the usual way of finding all
>>> users who did not receive this email is through a left outer join,
>>> such as:
>>>
>>> select u.id
>>> from users u
>>> left outer join activities a
>>>   on u.id = a.user_id
>>>   and a.activity = 'email_sent'
>>>   and a.email_id = 100
>>> where a.user_id is null
>>>
>>> That's all fine for medium-ish tables. However, our current activities
>>> table has over 13 million rows, slowing the hell out of this left
>>> outer join: it takes about 52 seconds on my machine.
>>>
>>> What this query is trying to do is get the relative complement of set
>>> A (users) in B (activities). As far as I know, MySQL doesn't support
>>> set subtraction, which I believe is why these queries are slow.
>>>
>>> Based on that I've set up a test database on PostgreSQL, which supports
>>> this very set operation, and rewrote the query to look like this:
>>>
>>> select u.id
>>> from users u
>>> except
>>> select a.user_id
>>> from activities a
>>> where a.activity = 'email_sent'
>>>   and a.email_id = 100;
>>>
>>> The fact that PostgreSQL knows how to subtract sets brought this query
>>> down to only 4 seconds.
>>>
>>> My question then is: since this is a somewhat common query in our
>>> system, are there any workarounds I could use in MySQL to improve
>>> things?
>>>
>>> I did find one myself, but it's a bit convoluted and might not perform
>>> well under load. Still, the following SQL script gives me similar
>>> performance in MySQL:
>>>
>>> create temporary table email_sent_100
>>> select a.user_id
>>> from user_activity_events a
>>> where a.activity = 'email_sent'
>>>   and a.email_id = 100;
>>>
>>> create index user_id_idx on email_sent_100(user_id); -- this could
>>> potentially bring the runtime down in the case of a large temp table.
>>>
>>> select count(u.id)
>>> from users u
>>> left outer join email_sent_100 s
>>>   on u.id = s.user_id
>>> where s.user_id is null;
>>>
>>> It's a lot more lines and a lot more complex, but it does the job in
>>> this example.
>>>
>>> I'd appreciate your thoughts.
>>>
>>> Cheers,
>>> Leonardo Borges
>>> www.leonardoborges.com
>>>
>>
>
>
> --
> -----------------------------
> Johnny Withers
> 601.209.4985
> joh...@pixelated.net
>
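
As a rough sketch of the two ideas discussed above, assuming the schema
from the SHOW CREATE TABLE output, something like the following could be
worth testing. The extra user_id column on the compound key and the NOT
EXISTS rewrite are assumptions to try out, not results from this thread:

-- Johnny's suggested compound key on (email_id, activity), extended with
-- user_id (an assumption) so that both the temp-table step and the
-- subquery below can be answered from the index alone:
alter table activities
  add key email_activity_user_idx (email_id, activity, user_id);

-- The same "users who never received email 100" question written as an
-- anti-join with NOT EXISTS. It is NULL-safe, unlike NOT IN, since
-- activities.user_id is nullable; whether the optimizer plans it any
-- better than LEFT JOIN ... IS NULL needs testing against real data:
select u.id
from users u
where not exists (
  select 1
  from activities a
  where a.user_id  = u.id
    and a.activity = 'email_sent'
    and a.email_id = 100
);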