Re: getting certain rows from a group by

2012-09-19 Thread Larry Martell
On Tue, Sep 18, 2012 at 7:41 PM, Rick James  wrote:
> SELECT  ((the appropriate id))  -- <--
> FROM  data_cst, rollup, data_target
> WHERE  data_target.name = rollup.Target
>   AND  data_cst.ep = rollup.EP
>   AND  data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 
> 23:59:59'
>   AND  data_target.id = data_cst.target_name_id
> GROUP BY  rollup.Target, rollup.EP
> ORDER BY  ABS(data_cst.bottom - rollup.`Avg Bottom`)
> LIMIT  1   -- <--

rollup is the name of the temp table that I was testing with. It
contains the output of the original query. I don't have that in the
real world. That's my issue - I can't figure out how to join with the
result set and apply the order by to the join condition. What would I
use in place of rollup to get the 'Avg Bottom'?

Here's what my query looks like now:

SELECT data_target.name as Target,
   q1.ep as EP,
   COUNT(*) as Wafers,
   Lots,
   SUM(numonep)/(COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns
ELSE 0 END)) as 'Sites/Wafer',
   MAX(LastRun) as "Last Run",
   SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Rerun,
   COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Runs,
   avgbottom as "Avg Bottom",
   3*stdbottom as "3 Sig",
   maxbottom as Max,
   minbottom as Min,
   SUM(numonep) as Count, SUM(numonep) - SUM(numbottoms) as NAs,
   100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100) as "% Success",
   3*stdbottom/avgbottom as "3Sig/Avg",
   AVG(avgbottom) as 'Wafer Avg',
   AVG(Wafer3Sigma) as 'Wafer 3 Sigma',
   AVG(Ranges) as 'Avg Range',
   3*STD(Ranges) as '3Sig of Ranges',
   MAX(Ranges) as 'Max Range',
   MIN(Ranges) as 'Min Range', (SUM(numonep) -
SUM(numbottoms))/COUNT(*) as 'NAs/Wafer',
   mx.id as maxID,
   mn.id as minID,
   lr.id as lrID
FROM (SELECT target_name_id,
 ep,
 wafer_id,
 COUNT(bottom) as numbottoms,
 AVG(bottom) as avgbottom,
 STD(bottom) as stdbottom,
 MAX(bottom) as maxbottom,
 MIN(bottom) as minbottom,
 MAX(date_time) as "LastRun",
 COUNT(*) as numonep,
 COUNT(DISTINCT target_name_id, ep, lot_id,
data_file_id)-1 as reruns,
 COUNT(DISTINCT(lot_id)) as Lots,
 3*STD(bottom) as Wafer3Sigma,
 MAX(bottom) - MIN(bottom) as Ranges
  FROM data_cst
  WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44,
116, 117, 118, 119, 120, 121)
  AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
  GROUP BY target_name_id, ep, wafer_id
  HAVING count(*) < 999) q1
join data_cst mn ON mn.bottom = q1.minbottom
join data_cst mx on mx.bottom = q1.maxbottom
join data_cst lr on lr.date_time = q1.LastRun,
data_target
WHERE data_target.id = q1.target_name_id
GROUP BY q1.target_name_id, q1.ep;
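
One possible shape for the missing fourth case, sketched under the
assumption that q1's avgbottom is the average to compare against: add one
more join in the same style as mn/mx/lr, and let a correlated subquery
pick, per group, the row whose bottom is nearest that average (ties are
still possible, just as with the other joins):

```sql
-- Sketch only, untested against the real schema: a fourth join using the
-- avgbottom column that q1 already computes.
join data_cst cm
  ON cm.target_name_id = q1.target_name_id
 AND cm.ep = q1.ep
 AND cm.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 23:59:59'
 AND ABS(cm.bottom - q1.avgbottom) =
     (SELECT MIN(ABS(d.bottom - q1.avgbottom))
        FROM data_cst d
       WHERE d.target_name_id = q1.target_name_id
         AND d.ep = q1.ep
         AND d.date_time BETWEEN '2010-03-04 00:00:00'
                             AND '2010-03-04 23:59:59')
```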

>
>> -Original Message-
>> From: Larry Martell [mailto:larry.mart...@gmail.com]
>> Sent: Tuesday, September 18, 2012 2:57 PM
>> To: Rick James
>> Cc: peter.braw...@earthlink.net; mysql@lists.mysql.com
>> Subject: Re: getting certain rows from a group by
>>
>> On Tue, Sep 18, 2012 at 4:01 PM, Rick James 
>> wrote:
>> > SELECT mn.*
>> >mx.*
>> >> >FROM ( SELECT @min = MIN(bottom),
>> >> >  @max = MAX(bottom), ... )
>> >> >JOIN data_cst mn  ON bottom = @min
>> >> >JOIN data_cst mx  ON bottom = @max
>> >
>> > That is, the FROM finds the value for the desired row.
>> > The JOIN then gets to the rest of the fields.
>> > Caveat:  If two rows have bottom = @min, it returns both.
>>
>>
>> Thanks a lot Rick! This was super helpful. I've got it working for 3 of
>> the 4 cases - min(bottom), max(bottom), and max(date_time). But I can't
>> figure out how to work in the last case - where bottom is closest to
>> avg(bottom). In an individual query I can get it with an order by, like
>> this:
>>
>> SELECT rollup.Target, rollup.EP, rollup.`Avg Bottom`, data_cst.id,
>> data_cst.bottom FROM data_cst, rollup, data_target WHERE
>> data_target.name = rollup.Target AND data_cst.ep = rollup.EP AND
>> data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04
>> 23:59:59'
>> AND data_target.id = data_cst.target_name_id GROUP BY rollup.Target,
>> rollup.EP ORDER BY ABS(data_cst.bottom - rollup.`Avg Bottom`);
>>
>> Any way to work that into another join?
>>
>> Thanks!
>>
>> >
>> >> -Original Message-
>> >> From: Larry Martell [mailto:larry.mart...@gmail.com]
>> >> Sent: Tuesday, September 18, 2012 12:54 PM
>> >> To: Rick James
>> >> Cc: peter.braw...@earthlink.net; mysql@lists.mysql.com
>> >> Subject: Re: getting certain rows from a group by
>> >>
>> >> On Tue, Sep 18, 2012 at 2:05 PM, Rick James 
>> >> wrote:
>> >> > For single query...
>> >> >
>> >> > Plan A:
>> >> > Repeat the SELECT as once for each (MIN, AVG, etc) as a subquery
>> in
>> >> >WHERE bottom = ( 

Re: getting certain rows from a group by

2012-09-19 Thread Larry Martell
On Wed, Sep 19, 2012 at 12:04 AM,   wrote:
> 2012/09/18 06:53 -0400, Larry Martell 
> This works fine. But now I need to get a certain column
> (image_measurer_id) with each row returned that corresponds to the row
> from the group that has bottom = Min(bottom), bottom = Max(bottom),
> bottom closest to Avg(bottom), and bottom from the row where date_time
> = Max(date_time).
> 
> There is a standard and ugly way of getting such a thing, but it entails 
> repeating the table reference (supposing you want any, not every, to match):
>
> SELECT data_cst.target_name_id, image_measurer_id, nb, xb, vb, xt, bottom
> FROM (SELECT Min(bottom) AS nb, Max(bottom) AS xb, MIN(ABS(Avg(bottom) - 
> bottom)) AS vb, Max(date_time) AS xt
> FROM data_cst
>   WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 
> 117, 118, 119, 120, 121)
>   AND DATE(data_cst.date_time) = '2010-03-04'
>   GROUP BY target_name_id, ep, wafer_id
>   HAVING count(*) < 999) AS st
> JOIN data_cst ON st.target_name_id = data_cst.target_name_id AND (nb = bottom 
> OR xb = bottom OR vb = bottom OR date_time = xt)
>
> One record will be returned for every row that holds a relevant extremum, not 
> guaranteed to be unique.
>
> This query pertains only to your original subquery, not the whole query. To 
> get the result to which you refer, join this to your original query, to which 
> you have added something like vb:
>
> SELECT *
> FROM (original query) JOIN (this query) ON nb = minbottom AND xb = maxbottom 
> AND xt = "Last Run" AND vb =  AND (original query).target_name_id = 
> (this_query).target_name_id

Thanks. I will look into doing it like this.

I got the min, max, and date conditions to work by adding the joins
that Rick suggested:

join data_cst mn ON mn.bottom = q1.minbottom
join data_cst mx ON mx.bottom = q1.maxbottom
join data_cst lr ON lr.date_time = q1.LastRun

But now you have me wondering - if this is just matching on that one
field, can it return a row from data_cst that is not part
of the original result set? Do I need to add all the same select
criteria to each join?

> --but I am not confident in the result. There are problems in the original 
> query, the biggest that of avgbottom, stdbottom, maxbottom, minbottom none 
> are aggregated over wafer_id.

The inner query aggregates over target_name_id, ep, wafer_id, and the
outer query aggregates that result set over target_name_id, ep.
The avg and std are what the client wants - average of averages and
standard deviations. As are the min and max - the outer query returns
the min and max of all the wafers for each target, ep.

> Therefore, it is not certain from which record from q1 they are returned. 
> MySQL tends to pick the first that fits, but not only is nothing guaranteed, 
> it is explicitly written that if such not aggregated fields appear, the 
> output is not determinate unless all pertinent are equal, the possibility 
> whereof being the reason for allowing it.

Yes, the case with dups probably had not been thought of by them, and
will have to be addressed at some point.

>
> When that has been handled, it is needful to change the foregoing query to 
> match that one in two levels, because averaging is not associative (there is 
> also the problem of equality-testing on generated floating-point numbers). If 
> it were only MAX and MIN, one level of not GROUPing BY wafer_id would be all 
> right.
>
> By the way, in the original query, I suggest instead of
> data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
> this,
> DATE(data_cst.date_time) = '2010-03-04'

The date_time was just an example. Depending on user input the query
could have times in it.

> Instead of
>
>  ... q1,
>  data_target
> WHERE data_target.id = target_name_id
>
> this would be better:
>
>  ... q1 JOIN data_target ON data_target.id = target_name_id

Why is one better than the other?
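
(For what it's worth: for inner joins the two spellings are equivalent
and the optimizer treats them the same; the explicit JOIN ... ON form
just keeps the join predicate next to the join rather than in the WHERE
clause, which reads better and matters once outer joins enter the
picture. A sketch of the two forms, using the names from the query above:)

```sql
-- Comma join: the join condition lives in the WHERE clause
SELECT d.name FROM q1, data_target d WHERE d.id = q1.target_name_id;

-- Explicit join: the condition travels with the join itself
SELECT d.name FROM q1 JOIN data_target d ON d.id = q1.target_name_id;
```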

> I believe that
>  100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100)
>= 100*SUM(numbottoms)/SUM(numonep)
> and
>  SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END)
>= SUM(GREATEST(reruns, 0))

Yes, thanks - yours is clearer and I've made those changes.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql



InnoDB vs. other storage engines

2012-09-19 Thread Mark Haney
I hope this doesn't end in some kind of flame war.  I'm looking to 
optimize my tables (and performance in general) of the DB my web app is 
using.  I'm tweaking things a little at a time, but I'm curious as to 
what the rest of the MySQL list thinks about changing my storage engine 
from InnoDB to something else so I can optimize the tables on a regular 
basis.


Is it worth the effort?  Any caveats?

I've never really encountered this situation before and I'm curious to 
see what others have to say on it.


Thanks in advance.

--

Mark Haney
Software Developer/Consultant
AB Emblem
ma...@abemblem.com
Linux marius.homelinux.org 3.5.1-1.fc17.x86_64 GNU/Linux




Re: getting certain rows from a group by

2012-09-19 Thread Larry Martell
On Wed, Sep 19, 2012 at 12:04 AM,   wrote:
> 2012/09/18 06:53 -0400, Larry Martell 
> This works fine. But now I need to get a certain column
> (image_measurer_id) with each row returned that corresponds to the row
> from the group that has bottom = Min(bottom), bottom = Max(bottom),
> bottom closest to Avg(bottom), and bottom from the row where date_time
> = Max(date_time).
> 
> There is a standard and ugly way of getting such a thing, but it entails 
> repeating the table reference (supposing you want any, not every, to match):
>
> SELECT data_cst.target_name_id, image_measurer_id, nb, xb, vb, xt, bottom
> FROM (SELECT Min(bottom) AS nb, Max(bottom) AS xb, MIN(ABS(Avg(bottom) - 
> bottom)) AS vb, Max(date_time) AS xt
> FROM data_cst
>   WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 
> 117, 118, 119, 120, 121)
>   AND DATE(data_cst.date_time) = '2010-03-04'
>   GROUP BY target_name_id, ep, wafer_id
>   HAVING count(*) < 999) AS st
> JOIN data_cst ON st.target_name_id = data_cst.target_name_id AND (nb = bottom 
> OR xb = bottom OR vb = bottom OR date_time = xt)

This:

MIN(ABS(Avg(bottom) - bottom))

Is not valid. It gives:

ERROR  (HY000): Invalid use of group function

Which is why I was doing it with an order by. But I can't figure out how
to work that into a join.
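
(MySQL rejects aggregates nested inside aggregates; the usual workaround
is two levels: compute AVG(bottom) per group in a derived table, then
aggregate ABS(...) against it one level up. A sketch reusing the column
names from the query above:)

```sql
-- Sketch: AVG at the inner level, MIN(ABS(...)) at the outer level.
SELECT t.target_name_id, t.ep, t.wafer_id,
       MIN(ABS(d.bottom - t.avgb)) AS vb
FROM (SELECT target_name_id, ep, wafer_id, AVG(bottom) AS avgb
        FROM data_cst
       GROUP BY target_name_id, ep, wafer_id) AS t
JOIN data_cst d
  ON d.target_name_id = t.target_name_id
 AND d.ep = t.ep
 AND d.wafer_id = t.wafer_id
GROUP BY t.target_name_id, t.ep, t.wafer_id;
```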

>
> One record will be returned for every row that holds a relevant extremum, not 
> guaranteed to be unique.
>
> This query pertains only to your original subquery, not the whole query. To 
> get the result to which you refer, join this to your original query, to which 
> you have added something like vb:
>
> SELECT *
> FROM (original query) JOIN (this query) ON nb = minbottom AND xb = maxbottom 
> AND xt = "Last Run" AND vb =  AND (original query).target_name_id = 
> (this_query).target_name_id
>
> --but I am not confident in the result. There are problems in the original 
> query, the biggest that of avgbottom, stdbottom, maxbottom, minbottom none 
> are aggregated over wafer_id. Therefore, it is not certain from which record 
> from q1 they are returned. MySQL tends to pick the first that fits, but not 
> only is nothing guaranteed, it is explicitly written that if such not 
> aggregated fields appear, the output is not determinate unless all pertinent 
> are equal, the possibility whereof being the reason for allowing it.
>
> When that has been handled, it is needful to change the foregoing query to 
> match that one in two levels, because averaging is not associative (there is 
> also the problem of equality-testing on generated floating-point numbers). If 
> it were only MAX and MIN, one level of not GROUPing BY wafer_id would be all 
> right.
>
> By the way, in the original query, I suggest instead of
> data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
> this,
> DATE(data_cst.date_time) = '2010-03-04'
>
> Instead of
>
>  ... q1,
>  data_target
> WHERE data_target.id = target_name_id
>
> this would be better:
>
>  ... q1 JOIN data_target ON data_target.id = target_name_id
>
> I believe that
>  100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100)
>= 100*SUM(numbottoms)/SUM(numonep)
> and
>  SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END)
>= SUM(GREATEST(reruns, 0))
>
>
>




Re: getting certain rows from a group by

2012-09-19 Thread Larry Martell
On Wed, Sep 19, 2012 at 10:26 AM, Larry Martell  wrote:
> On Wed, Sep 19, 2012 at 12:04 AM,   wrote:
>> 2012/09/18 06:53 -0400, Larry Martell 
>> This works fine. But now I need to get a certain column
>> (image_measurer_id) with each row returned that corresponds to the row
>> from the group that has bottom = Min(bottom), bottom = Max(bottom),
>> bottom closest to Avg(bottom), and bottom from the row where date_time
>> = Max(date_time).
>> 
>> There is a standard and ugly way of getting such a thing, but it entails 
>> repeating the table reference (supposing you want any, not every, to match):
>>
>> SELECT data_cst.target_name_id, image_measurer_id, nb, xb, vb, xt, bottom
>> FROM (SELECT Min(bottom) AS nb, Max(bottom) AS xb, MIN(ABS(Avg(bottom) - 
>> bottom)) AS vb, Max(date_time) AS xt
>> FROM data_cst
>>   WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 
>> 117, 118, 119, 120, 121)
>>   AND DATE(data_cst.date_time) = '2010-03-04'
>>   GROUP BY target_name_id, ep, wafer_id
>>   HAVING count(*) < 999) AS st
>> JOIN data_cst ON st.target_name_id = data_cst.target_name_id AND (nb = 
>> bottom OR xb = bottom OR vb = bottom OR date_time = xt)
>>
>> One record will be returned for every row that holds a relevant extremum, 
>> not guaranteed to be unique.
>>
>> This query pertains only to your original subquery, not the whole query. To 
>> get the result to which you refer, join this to your original query, to 
>> which you have added something like vb:
>>
>> SELECT *
>> FROM (original query) JOIN (this query) ON nb = minbottom AND xb = maxbottom 
>> AND xt = "Last Run" AND vb =  AND (original query).target_name_id = 
>> (this_query).target_name_id
>
> Thanks. I will look into doing it like this.
>
> I got the min, max, and date conditions to work by adding the joins
> that Rick suggested:
>
> join data_cst mn ON mn.bottom = q1.minbottom
> join data_cst mx ON mx.bottom = q1.maxbottom
> join data_cst lr ON lr.date_time = q1.LastRun
>
> But now you have me wondering - if this is just matching on that one
> field, can it return a row from data_cst that is not part
> of the original result set? Do I need to add all the same select
> criteria to each join?

I verified that this is the case. I was not getting the correct data
back. I had to add:

and mn.target_name_id = q1.target_name_id and mn.ep = q1.ep and
mn.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 23:59:59'

to each join so that I got the correct rows back. So now my only
remaining issue is how to work the bottom-closest-to-avg(bottom) join
into this.

>
>> --but I am not confident in the result. There are problems in the original 
>> query, the biggest that of avgbottom, stdbottom, maxbottom, minbottom none 
>> are aggregated over wafer_id.
>
> The inner query aggregates over target_name_id, ep, wafer_id, and the
> outer query aggregates that result set over target_name_id, ep.
> The avg and std are what the client wants - average of averages and
> standard deviations. As are the min and max - the outer query returns
> the min and max of all the wafers for each target, ep.
>
>> Therefore, it is not certain from which record from q1 they are returned. 
>> MySQL tends to pick the first that fits, but not only is nothing guaranteed, 
>> it is explicitly written that if such not aggregated fields appear, the 
>> output is not determinate unless all pertinent are equal, the possibility 
>> whereof being the reason for allowing it.
>
> Yes, the case with dups probably had not been thought of by them, and
> will have to be addressed at some point.
>
>>
>> When that has been handled, it is needful to change the foregoing query to 
>> match that one in two levels, because averaging is not associative (there is 
>> also the problem of equality-testing on generated floating-point numbers). 
>> If it were only MAX and MIN, one level of not GROUPing BY wafer_id would be 
>> all right.
>>
>> By the way, in the original query, I suggest instead of
>> data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
>> '2010-03-04 23:59:59'
>> this,
>> DATE(data_cst.date_time) = '2010-03-04'
>
> The date_time was just an example. Depending on user input the query
> could have times in it.
>
>> Instead of
>>
>>  ... q1,
>>  data_target
>> WHERE data_target.id = target_name_id
>>
>> this would be better:
>>
>>  ... q1 JOIN data_target ON data_target.id = target_name_id
>
> Why is one better than the other?
>
>> I believe that
>>  100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100)
>>= 100*SUM(numbottoms)/SUM(numonep)
>> and
>>  SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END)
>>= SUM(GREATEST(reruns, 0))
>
> Yes, thanks - yours is clearer and I've made those changes.




Re: getting certain rows from a group by

2012-09-19 Thread Larry Martell
OK, I think I have this working. The last join was this:

JOIN (select id, target_name_id, ep,date_time from data_cst WHERE
target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 117,
118, 119, 120, 121) AND data_cst.date_time BETWEEN '2010-03-04
00:00:00' AND '2010-03-04 23:59:59' GROUP BY  target_name_id, ep
HAVING count(*) < 999 ORDER BY  ABS(data_cst.bottom - AVG(bottom))) cm
on cm.target_name_id = q1.target_name_id and cm.ep = q1.ep and
cm.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 23:59:59'

I have to do more testing to ensure I'm getting the correct rows back.

Thanks for all the help!

On Wed, Sep 19, 2012 at 10:07 AM, Larry Martell  wrote:
> On Tue, Sep 18, 2012 at 7:41 PM, Rick James  wrote:
>> SELECT  ((the appropriate id))  -- <--
>> FROM  data_cst, rollup, data_target
>> WHERE  data_target.name = rollup.Target
>>   AND  data_cst.ep = rollup.EP
>>   AND  data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 
>> 23:59:59'
>>   AND  data_target.id = data_cst.target_name_id
>> GROUP BY  rollup.Target, rollup.EP
>> ORDER BY  ABS(data_cst.bottom - rollup.`Avg Bottom`)
>> LIMIT  1   -- <--
>
> rollup is the name of the temp table that I was testing with. It
> contains the output of the original query. I don't have that in the
> real world. That's my issue - I can't figure out how to join with the
> result set and apply the order by to the join condition. What would I
> use in place of rollup to get the 'Avg Bottom'?
>
> Here's what my query looks like now:
>
> SELECT data_target.name as Target,
>q1.ep as EP,
>COUNT(*) as Wafers,
>Lots,
>SUM(numonep)/(COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns
> ELSE 0 END)) as 'Sites/Wafer',
>MAX(LastRun) as "Last Run",
>SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Rerun,
>COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Runs,
>avgbottom as "Avg Bottom",
>3*stdbottom as "3 Sig",
>maxbottom as Max,
>minbottom as Min,
>SUM(numonep) as Count, SUM(numonep) - SUM(numbottoms) as NAs,
>100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100) as "% Success",
>3*stdbottom/avgbottom as "3Sig/Avg",
>AVG(avgbottom) as 'Wafer Avg',
>AVG(Wafer3Sigma) as 'Wafer 3 Sigma',
>AVG(Ranges) as 'Avg Range',
>3*STD(Ranges) as '3Sig of Ranges',
>MAX(Ranges) as 'Max Range',
>MIN(Ranges) as 'Min Range', (SUM(numonep) -
> SUM(numbottoms))/COUNT(*) as 'NAs/Wafer',
>mx.id as maxID,
>mn.id as minID,
>lr.id as lrID
> FROM (SELECT target_name_id,
>  ep,
>  wafer_id,
>  COUNT(bottom) as numbottoms,
>  AVG(bottom) as avgbottom,
>  STD(bottom) as stdbottom,
>  MAX(bottom) as maxbottom,
>  MIN(bottom) as minbottom,
>  MAX(date_time) as "LastRun",
>  COUNT(*) as numonep,
>  COUNT(DISTINCT target_name_id, ep, lot_id,
> data_file_id)-1 as reruns,
>  COUNT(DISTINCT(lot_id)) as Lots,
>  3*STD(bottom) as Wafer3Sigma,
>  MAX(bottom) - MIN(bottom) as Ranges
>   FROM data_cst
>   WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44,
> 116, 117, 118, 119, 120, 121)
>   AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
>   GROUP BY target_name_id, ep, wafer_id
>   HAVING count(*) < 999) q1
> join data_cst mn ON mn.bottom = q1.minbottom
> join data_cst mx on mx.bottom = q1.maxbottom
> join data_cst lr on lr.date_time = q1.LastRun,
> data_target
> WHERE data_target.id = q1.target_name_id
> GROUP BY q1.target_name_id, q1.ep;
>
>>
>>> -Original Message-
>>> From: Larry Martell [mailto:larry.mart...@gmail.com]
>>> Sent: Tuesday, September 18, 2012 2:57 PM
>>> To: Rick James
>>> Cc: peter.braw...@earthlink.net; mysql@lists.mysql.com
>>> Subject: Re: getting certain rows from a group by
>>>
>>> On Tue, Sep 18, 2012 at 4:01 PM, Rick James 
>>> wrote:
>>> > SELECT mn.*
>>> >mx.*
>>> >> >FROM ( SELECT @min = MIN(bottom),
>>> >> >  @max = MAX(bottom), ... )
>>> >> >JOIN data_cst mn  ON bottom = @min
>>> >> >JOIN data_cst mx  ON bottom = @max
>>> >
>>> > That is, the FROM finds the value for the desired row.
>>> > The JOIN then gets to the rest of the fields.
>>> > Caveat:  If two rows have bottom = @min, it returns both.
>>>
>>>
>>> Thanks a lot Rick! This was super helpful. I've got it working for 3 of
>>> the 4 cases - min(bottom), max(bottom), and max(date_time). But I can't
>>> figure out how to work in the last case - where bottom is closest to
>>> avg(bottom). In an individual query I can get it with an order by, like
>>> this:
>>>
>>> SELECT rollup.Target, rollup.EP, rollup.`Avg Bottom`, data_cst.id,
>>> data_cst.bottom FROM data_cst, rollup, data_target WHERE
>>> data_target.name

Re: InnoDB vs. other storage engines

2012-09-19 Thread Manuel Arostegui
2012/9/19 Mark Haney 

> I hope this doesn't end in some kind of flame war.  I'm looking to
> optimize my tables (and performance in general) of the DB my web app is
> using.  I'm tweaking things a little at a time, but I'm curious as to what
> the rest of the MySQL list thinks about changing my storage engine from
> InnoDB to something else so I can optimize the tables on a regular basis.
>
> Is it worth the effort?  Any caveats?


Hi Mark,

It would depend on what your workload will be: mostly writes, mostly reads,
how many writes/reads you expect, etc.
The best approach, from my point of view, would be to first tune your
MySQL server (if you've not done it yet) before getting into engine/table
optimizations, which can be more complicated.

 Manuel.


RE: InnoDB vs. other storage engines

2012-09-19 Thread Rick James
No flames from me; I stay out of that religious war.  However, the general 
consensus is to move to InnoDB.  So, here are the gotchas.  Most are 
non-issues; a few might bite you, but can probably be dealt with:

http://mysql.rjweb.org/doc.php/myisam2innodb



> -Original Message-
> From: Manuel Arostegui [mailto:man...@tuenti.com]
> Sent: Wednesday, September 19, 2012 12:51 PM
> To: Mark Haney
> Cc: mysql mailing list
> Subject: Re: InnoDB vs. other storage engines
> 
> 2012/9/19 Mark Haney 
> 
> > I hope this doesn't end in some kind of flame war.  I'm looking to
> > optimize my tables (and performance in general) of the DB my web app
> > is using.  I'm tweaking things a little at a time, but I'm curious as
> > to what the rest of the MySQL list thinks about changing my storage
> > engine from InnoDB to something else so I can optimize the tables on
> a regular basis.
> >
> > Is it worth the effort?  Any caveats?
> 
> 
> Hi Mark,
> 
> It would depend on what your workload will be: mostly writes, mostly
> reads, how many writes/reads you expect, etc.
> The best approach, from my point of view, would be to first tune your
> MySQL server (if you've not done it yet) before getting into
> engine/table optimizations, which can be more complicated.
> 
>  Manuel.




RE: Making myself crazy...

2012-09-19 Thread Rick James
Other comments:

s_product_sales_log has no PRIMARY KEY.  All InnoDB tables 'should' have an 
explicit PK.

INT(5) is not what you think.  INT is always a 32-bit, 4-byte quantity, 
regardless of the display width in parentheses.

Use TINYINT UNSIGNED, SMALLINT UNSIGNED, etc. wherever reasonable.
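
A sketch with hypothetical columns, illustrating the point about sizes:

```sql
CREATE TABLE t_example (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 4 bytes, 0 .. 4294967295
  flag TINYINT UNSIGNED NOT NULL,             -- 1 byte,  0 .. 255
  qty  SMALLINT UNSIGNED NOT NULL,            -- 2 bytes, 0 .. 65535
  PRIMARY KEY (id)
) ENGINE=InnoDB;
-- INT(5) would store exactly the same 4 bytes as INT; the 5 is only a
-- display width.
```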

>  KEY `is_value_added` (`is_value_added`),
A single-column INDEX on a flag, ENUM, or other low-cardinality field is 
almost always useless.

Slightly more readable:

SELECT  pc AS `Profit Centre`, CONCAT('$', FORMAT(prp,
2)) AS Paid, CONCAT('$', FORMAT(prt, 2)) AS Trade,
CONCAT('$', FORMAT(prtot, 2)) AS `Both`
FROM
( SELECT  tt.`Profit Centre` AS pc, mt.Paid / tt.Hours AS prp,
mt.Trade / tt.Hours AS prt, mt.tot / tt.Hours AS prtot
FROM

( SELECT  SUM(TIMESTAMPDIFF(MINUTE, tl.`in`, tl.`out`)/60) AS 
Hours,
pc.ID AS pcid, pc.name AS `Profit Centre`
FROM  s_timelog tl
LEFT JOIN  s_project proj ON tl.project_id = proj.id
LEFT JOIN  s_profit_centre pc ON proj.profit_centre = pc.ID
WHERE  proj.private = 'no'
  AND  YEAR(tl.`in`) = '{{{1}}}'
GROUP BY  `Profit Centre`) tt
LEFT JOIN
( SELECT  SUM(case when sales.Type NOT IN('barter', 'work 
trade') then `Total` else 0 end) AS Paid,
SUM(case when sales.Type IN('barter', 'work trade') 
then `Total` else 0 end) AS Trade,
SUM(`Total`) AS tot, pc.ID AS pcid, pc.name AS `Profit 
Centre`
FROM  s_product_sales_log sales
LEFT JOIN  s_product prod ON sales.Product = prod.ID
LEFT JOIN  s_profit_centre pc ON prod.profit_centre = pc.ID
WHERE  YEAR(sales.`Date`) = '{{{1}}}'
GROUP BY  `Profit Centre`
) mt ON mt.pcid = tt.pcid
WHERE  mt.tot != 0
) xx


Performance issues...
> WHERE  YEAR(sales.`Date`) = '{{{1}}}'
won't use
>  KEY `Date` (`Date`),
because the column (Date) is hidden in a function.  A workaround:
WHERE `Date` >= '{{{1}}}-01-01'
  AND `Date` <  '{{{1}}}-01-01' + INTERVAL 1 YEAR

JOINing two subqueries -- There is no way to index either of them, so the JOIN 
will have to do a table scan of one temp table for every row of the other temp 
table.
(The alternative is to CREATE TEMPORARY TABLE... with an index, for one of the 
subqueries.)
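
That alternative might look like the following (column types are guesses;
the INSERT body is the mt subquery shown earlier):

```sql
CREATE TEMPORARY TABLE tmp_mt (
  pcid  INT NOT NULL,
  Paid  DECIMAL(12,2),
  Trade DECIMAL(12,2),
  tot   DECIMAL(12,2),
  `Profit Centre` VARCHAR(64),
  INDEX (pcid)          -- lets the join probe instead of scanning
) ENGINE=InnoDB;

INSERT INTO tmp_mt
  SELECT ...;           -- the body of the mt subquery from above

-- then:  ... LEFT JOIN tmp_mt mt ON mt.pcid = tt.pcid ...
```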

It would probably be better to move the mt.tot!=0 test inside:

GROUP BY  `Profit Centre`
) mt ON mt.pcid = tt.pcid
WHERE  mt.tot != 0
) xx
-->
GROUP BY  `Profit Centre`
HAVING tot != 0   -- added
) mt ON mt.pcid = tt.pcid
-- removed:   WHERE  mt.tot != 0
) xx

That would make mt have fewer rows, hence that unindexed JOIN could run faster.

> -Original Message-
> From: Jan Steinman [mailto:j...@bytesmiths.com]
> Sent: Tuesday, September 18, 2012 9:26 PM
> To: Rick James
> Cc: mysql@lists.mysql.com
> Subject: Re: Making myself crazy...
>
> Thanks for your help, Rick.
>
> The solution that seems to be working was to have both the FROM and
> LEFT JOIN be grouped subqueries. I realize this is probably not the
> best way of doing things, but it seems fast enough at present.
>
> Here is the query, followed by the table definitions, resulting in the
> following data:
>   http://www.EcoReality.org/wiki/2012_gross_productivity
>
> -- (The following query is from a template, in which "{{{1}}}" is
> replaced by a year, such as "2012".) SELECT
>   pc AS `Profit Centre`,
>   CONCAT('$', FORMAT(prp, 2)) AS Paid,
>   CONCAT('$', FORMAT(prt, 2)) AS Trade,
>   CONCAT('$', FORMAT(prtot, 2)) AS `Both` FROM (SELECT
>   tt.`Profit Centre` AS pc,
>   mt.Paid / tt.Hours AS prp,
>   mt.Trade / tt.Hours AS prt,
>   mt.tot / tt.Hours AS prtot
> FROM
> (SELECT
>   SUM(TIMESTAMPDIFF(MINUTE, tl.`in`, tl.`out`)/60) AS Hours,
>   pc.ID AS pcid,
>   pc.name AS `Profit Centre`
> FROM s_timelog tl
> LEFT JOIN
>   s_project proj ON tl.project_id = proj.id LEFT JOIN
>   s_profit_centre pc ON proj.profit_centre = pc.ID WHERE proj.private =
> 'no' AND YEAR(tl.`in`) = '{{{1}}}'
> GROUP BY `Profit Centre`) tt
> LEFT JOIN
>   (SELECT
> SUM(case when sales.Type NOT IN('barter', 'work trade') then
> `Total` else 0 end) AS Paid,
> SUM(case when sales.Type IN('barter', 'work trade') then `Total`
> else 0 end) AS Trade,
> SUM(`Total`) AS tot,
> pc.ID AS pcid,
> pc.name AS `Profit Centre`
>   FROM s_product_sales_log sales
>   LEFT JOIN
> s_product prod ON sales.Product = prod.ID
>   LEFT JOIN
> s_profit_centre pc ON prod.profit_centre = pc.ID
>   WHERE YEAR(sales.`Date`) = '{{{1}}}'
>   GROUP BY `Profit Centre`) mt
> ON mt.pcid = tt.pcid
> WHERE mt.tot != 0) xx
>
> --"s_timelog" and "s_product_sales_log" are the two

Re: getting certain rows from a group by

2012-09-19 Thread Larry Martell
Well, I'm getting the proper rows from the 4 joins I added, but now
I'm seeing a weird side effect - my counts are 3 times too much.

The query is really huge and nasty now, but I'm going to paste it below.

In the outer most select, Wafers, Rerun, Runs, and Count are 3 times
what they should be. If I take off the joins, I get the correct
counts. I can't see why that's happening.


SELECT data_target.name as Target,
   q1.ep as EP,
   COUNT(*) as Wafers,
   Lots,
   FORMAT(SUM(numonep)/(COUNT(*)+SUM(GREATEST(q1.reruns,
0))), 1) as 'Sites/Wafer',
   DATE_FORMAT(MAX(LastRun), '%m/%d/%y') as "Last Run",
   SUM(GREATEST(q1.reruns, 0)) as Rerun,
   COUNT(*)+SUM(GREATEST(q1.reruns, 0)) as Runs,
   FORMAT(avgbottom, 1) as "Avg Bottom",
   FORMAT(3*stdbottom, 2) as "3 Sig",
   FORMAT(maxbottom, 1) as Max,
   FORMAT(minbottom, 1) as Min,
   SUM(numonep) as Count,
   SUM(numonep) - SUM(numbottoms) as NAs,
   CONCAT(FORMAT(100*SUM(numbottoms)/SUM(numonep), 1), '
%') as "% Success",
   FORMAT(3*stdbottom/avgbottom, 2) as "3Sig/Avg",
   FORMAT(AVG(avgbottom), 1) as 'Wafer Avg',
   FORMAT(AVG(Wafer3Sigma), 2) as 'Wafer 3 Sigma',
   FORMAT(AVG(Ranges), 1) as 'Avg Range',
   FORMAT(3*STD(Ranges), 2) as '3Sig of Ranges',
   FORMAT(MAX(Ranges), 1) as 'Max Range',
   FORMAT(MIN(Ranges), 1) as 'Min Range',
   FORMAT((SUM(numonep) - SUM(numbottoms))/COUNT(*), 1) as
'NAs/Wafer',
   mx.maxImage as maxImage,
   mn.minImage as minImage,
   lr.lrImage as lrlrImage,
   cm.cmImage as cmcmImage
   FROM (SELECT target_name_id,
ep,
wafer_id,
COUNT(bottom) as numbottoms,
AVG(bottom) as avgbottom,
STD(bottom) as stdbottom,
MAX(bottom) as maxbottom,
MIN(bottom) as minbottom,
MAX(date_time) as "LastRun",
COUNT(*) as numonep,
COUNT(DISTINCT target_name_id, ep, lot_id,
data_file_id)-1 as reruns,
COUNT(DISTINCT(lot_id)) as Lots,
3*STD(bottom) as Wafer3Sigma,
MAX(bottom) - MIN(bottom) as Ranges
   FROM data_cst
   WHERE TRUE
 AND data_cst.target_name_id IN (775, 776, 777, 778, 779, 780,
45, 44, 116, 117, 118, 119, 120, 121)
 AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
 GROUP BY target_name_id, ep, wafer_id
 HAVING count(*) < 999) q1
   JOIN (SELECT data_cstimage.name as minImage,
data_cst.bottom,
data_cst.target_name_id,
data_cst.date_time,
data_cst.lot_id,
data_cst.tool_id,
data_cst.roiname,
data_cst.recipe_id,
data_cst.ep
 FROM data_cstimage, data_cst
 WHERE data_cstimage.id = data_cst.image_measurer_id) mn
 ON mn.bottom = q1.minbottom
AND mn.target_name_id IN (775, 776, 777, 778, 779,
780, 45, 44, 116, 117, 118, 119, 120, 121)
AND mn.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
   JOIN (SELECT data_cstimage.name as maxImage,
data_cst.bottom,
data_cst.target_name_id,
data_cst.date_time,
data_cst.lot_id,
data_cst.tool_id,
data_cst.roiname,
data_cst.recipe_id,
data_cst.ep
 FROM data_cstimage, data_cst
 WHERE data_cstimage.id = data_cst.image_measurer_id) mx
 ON mx.bottom = q1.maxbottom
AND mx.target_name_id IN (775, 776, 777, 778, 779,
780, 45, 44, 116, 117, 118, 119, 120, 121)
AND mx.date_time BETWEEN '2010-03-04 00:00:00' AND
'2010-03-04 23:59:59'
   JOIN (SELECT data_cstimage.name as lrImage,
data_cst.bottom,
data_cst.target_name_id,
data_cst.date_time,
data_cst.lot_id,
data_cst.tool_id,
data_cst.roiname,
data_cst.recipe_id,
data_cst.ep
 FROM data_cstimage, data_cst
 WHERE data_cstimage.id = data_

RE: getting certain rows from a group by

2012-09-19 Thread Rick James
> my counts are 3 times too much.
Without studying the code, I would guess that there is a JOIN between the data 
that needs COUNTing and the GROUP BY for the COUNT.  That is, it collects more 
'joined' rows before counting.  Fixing it will probably make the query even 
messier.
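The fan-out Rick describes is easy to reproduce. Here is a minimal sketch (hypothetical tables, SQLite standing in for MySQL — not the poster's schema): each base row matches three rows in the joined table, so COUNT(*) comes out exactly 3x too high.

```python
import sqlite3

# Minimal sketch of JOIN fan-out inflating COUNT(*). Table and column
# names are illustrative only; SQLite stands in for MySQL here.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE wafers (grp TEXT, val INT);
INSERT INTO wafers VALUES ('a', 1), ('a', 2), ('b', 3);
CREATE TABLE images (grp TEXT, name TEXT);
INSERT INTO images VALUES ('a', 'i1'), ('a', 'i2'), ('a', 'i3'),
                          ('b', 'j1'), ('b', 'j2'), ('b', 'j3');
""")

# Correct counts: aggregate the base table alone.
plain = con.execute(
    "SELECT grp, COUNT(*) FROM wafers GROUP BY grp ORDER BY grp").fetchall()
print(plain)    # [('a', 2), ('b', 1)]

# Inflated counts: each wafer row matches 3 image rows, and those
# 'joined' rows are all collected before the GROUP BY counts them.
joined = con.execute(
    "SELECT w.grp, COUNT(*) FROM wafers w JOIN images i ON i.grp = w.grp "
    "GROUP BY w.grp ORDER BY w.grp").fetchall()
print(joined)   # [('a', 6), ('b', 3)]
```

A common way out is to finish the aggregation in a derived table and join afterwards, or to count DISTINCT row identifiers instead of raw joined rows.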

> -Original Message-
> From: Larry Martell [mailto:larry.mart...@gmail.com]
> Sent: Wednesday, September 19, 2012 3:22 PM
> To: Rick James
> Cc: peter.braw...@earthlink.net; mysql@lists.mysql.com
> Subject: Re: getting certain rows from a group by
>
> Well, I'm getting the proper rows from the 4 joins I added, but now I'm
> seeing a weird side effect - my counts are 3 times too much.
>
> The query is really huge and nasty now, but I'm going to paste it
> below.
>
> In the outermost select, Wafers, Rerun, Runs, and Count are 3 times
> what they should be. If I take off the joins, I get the correct counts.
> I can't see why that's happening.
>
> [clip]

Re: getting certain rows from a group by

2012-09-19 Thread hsv
 2012/09/19 14:36 -0400, Larry Martell 
> MIN(ABS(Avg(bottom) - bottom))
>
> is not valid. It gives:
>
> ERROR  (HY000): Invalid use of group function

Yes, I had my doubts of that, for all that I suggested it.
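For what it's worth, the usual workaround for the nested-aggregate restriction is a two-pass query: compute the per-group average in a derived table, then compare each row against it. A sketch under an assumed, simplified schema (SQLite syntax; names are illustrative, not the poster's real tables):

```python
import sqlite3

# Workaround for the forbidden MIN(ABS(AVG(bottom) - bottom)): compute the
# per-group average in a derived table first, then pick the row whose
# bottom is closest to it. Simplified, made-up schema.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE data_cst (ep TEXT, id INT, bottom REAL);
INSERT INTO data_cst VALUES ('e1', 1, 10.0), ('e1', 2, 14.0),
                            ('e1', 3, 11.0), ('e2', 4, 5.0),
                            ('e2', 5, 8.0), ('e2', 6, 6.0);
""")

# For each ep: the id whose bottom is closest to that ep's average bottom.
rows = con.execute("""
    SELECT d.ep, d.id
    FROM data_cst d
    JOIN (SELECT ep, AVG(bottom) AS avgbottom
          FROM data_cst GROUP BY ep) g ON g.ep = d.ep
    WHERE ABS(d.bottom - g.avgbottom) =
          (SELECT MIN(ABS(d2.bottom - g.avgbottom))
           FROM data_cst d2 WHERE d2.ep = d.ep)
    ORDER BY d.ep
""").fetchall()
print(rows)     # [('e1', 3), ('e2', 6)]
```

Note that ties would return more than one row per group; a LIMIT per group (or a unique tiebreaker column in the ORDER BY) handles that if it matters.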


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql



Re: Making Myself Crazy

2012-09-19 Thread Jan Steinman
Thanks for your help, Rick!

Interspersed are some questions and rationales for you to shoot down... :-)

> From: Rick James 
> 
> s_product_sales_log has no PRIMARY KEY.  All InnoDB tables 'should' have an 
> explicit PK.

This table really has no identifying information. There could be two identical, 
valid rows, if the same person sold the same amount of the same product to the 
same other person on the same day.

All the foreign keys were indexed. Is there something I don't understand about 
something a PK field does? If an individual record cannot be uniquely 
identified by its information, is there really any need for a primary key?

Nonetheless, I added field "ID" as an unsigned autoincrement INT and made it 
PK.

> INT(5) is not what you think.  INT is always a 32-bit, 4-byte quantity, 
> regardless of the number.

> 
> Use TINYINT UNSIGNED, SMALLINT UNSIGNED, etc. wherever reasonable.

Understood. I make all my keys UINT even when they could be smaller, because 
I've gotten into some gnarly consistency problems with foreign keys.

>> KEY `is_value_added` (`is_value_added`),
> A single-column INDEX on a flag, ENUM, and other low-cardinality field, is 
> almost always useless.

Why is that? Surely, even a flag separates the record space into two?

> Performance issues...
>> WHERE  YEAR(sales.`Date`) = '{{{1}}}'
> won't use
>> KEY `Date` (`Date`),
> because the column (Date) is hidden in a function.  A workaround:
>   WHERE `Date` >= '{{{1}}}-01-01'
> AND `Date` <  '{{{1}}}-01-01' + INTERVAL 1 YEAR

Thanks! Good catch.
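The effect behind that catch can be seen in miniature with SQLite's EXPLAIN QUERY PLAN, with strftime() playing the role of MySQL's YEAR(); the table and index names below are made up:

```python
import sqlite3

# Sketch of the sargability point: wrapping an indexed column in a
# function hides it from the index, while the equivalent range predicate
# can use the index. SQLite here; MySQL's EXPLAIN behaves analogously.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (Date TEXT, amount REAL);
CREATE INDEX idx_date ON sales (Date);
""")

def plan(sql):
    # Join the EXPLAIN QUERY PLAN detail rows into one string.
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

# Function on the column: the planner falls back to a table scan.
p1 = plan("SELECT * FROM sales WHERE strftime('%Y', Date) = '2012'")

# Range form of the same filter: the index on Date is usable.
p2 = plan("SELECT * FROM sales WHERE Date >= '2012-01-01'"
          " AND Date < '2013-01-01'")

print(p1)   # plan mentions a SCAN of sales
print(p2)   # plan mentions SEARCH ... USING INDEX idx_date
```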

> JOINing two subqueries -- There is no way to index either of them, so the 
> JOIN will have to do a table scan of one temp table for every row of the 
> other temp table.
> (The alternative is to CREATE TEMPORARY TABLE... with an index, for one of 
> the subqueries.)

But I made sure the subqueries were the smallest possible sets -- essentially, 
the domain of s_profit_centre, which only has 12 records.

I had the entire thing coded up into one massive JOIN of everything, and it 
took 30 minutes to run! By LEFT JOINing down to a dozen or fewer records, it 
seems to run in reasonable time, even though it's two subqueries that are not 
indexed.

> It would probably be better to move the mt.tot!=0 test inside:
> 
>   GROUP BY  `Profit Centre`
>   ) mt ON mt.pcid = tt.pcid
>   WHERE  mt.tot != 0
>   ) xx
> -->
>   GROUP BY  `Profit Centre`
>   HAVING tot != 0   -- added
>   ) mt ON mt.pcid = tt.pcid
>   -- removed:   WHERE  mt.tot != 0
>   ) xx
> 
> That would make mt have fewer rows, hence that unindexed JOIN could run 
> faster.

My first attempt to do that produced an error. And again, both the subqueries 
will have 12 or fewer records, so I'm wondering if this really helps anything.
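For the record, the two forms Rick compares return the same rows; the HAVING version just hands fewer of them to the join. A tiny sketch with made-up data (SQLite; `pcid`/`tot` follow the names in the quoted query, the rest is invented):

```python
import sqlite3

# Sketch of pushing the tot != 0 filter from an outer WHERE into the
# subquery's HAVING. Same result set; the derived table emits fewer rows.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (pcid INT, amount REAL);
INSERT INTO sales VALUES (1, 10.0), (1, -10.0), (2, 5.0), (3, 7.0);
""")

# Filter after aggregation, outside the derived table:
outer = con.execute("""
    SELECT * FROM (SELECT pcid, SUM(amount) AS tot
                   FROM sales GROUP BY pcid) mt
    WHERE mt.tot != 0 ORDER BY pcid
""").fetchall()

# Equivalent filter pushed into the subquery with HAVING:
inner = con.execute("""
    SELECT * FROM (SELECT pcid, SUM(amount) AS tot
                   FROM sales GROUP BY pcid
                   HAVING tot != 0) mt
    ORDER BY pcid
""").fetchall()

print(outer == inner)   # pcid 1 nets to zero and is dropped either way
```

With only a dozen groups the saving is indeed negligible; it matters when the derived table feeds an unindexed join over many rows.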

Thanks for your help!

[clip]


 People see what they have been conditioned to see; they refuse to see what 
they don't expect to see. -- Merle P. Martin
 Jan Steinman, EcoReality Co-op 




