Hi Mark, all!

Mark Goodge wrote:
> I'd appreciate some advice on how best to handle a biggish dataset
> consisting of around 5 million lines. At the moment, I have a single
> table consisting of four fields and one primary key:
> 
> partcode varchar(20)
> region varchar(10)
> location varchar(50)
> qty int(11)
> PRIMARY KEY (partcode, region, location)
> 
> The biggest variable is partcode, with around 80,000 distinct values.
> For statistical purposes, I need to be able to select a sum(qty) based
> on the other three fields (eg, "select sum(qty) from mytable where
> partcode ='x' and region = 'y' and location = 'z'") as well as
> generating a list of partcodes and total quantities in each region and
> location (eg, "select sum(qty), partcode from mytable where region = 'y'
> and location = 'z' group by partcode").

Sorry, I don't get it - I think there is a contradiction in your mail:

Your table has four fields, three of which form the primary key.
This means that for any combination of values for those three fields
(partcode, region, location) there will be at most one row, and so only
one qty value. Why do you want to sum over one value?

And in your second query there is also no need for "sum(qty)", a plain
"qty" will do because for each group there will be only one row (region
and location are set to fixed values in your statement).
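To illustrate (using your table and made-up values): with the three-column primary key, a query fixing all three key columns can match at most one row, so these two statements return the same result:

```sql
-- With PRIMARY KEY (partcode, region, location), at most one row
-- can match when all three key columns are fixed:
SELECT sum(qty) FROM mytable
 WHERE partcode = 'x' AND region = 'y' AND location = 'z';

-- ... so the plain column value is enough:
SELECT qty FROM mytable
 WHERE partcode = 'x' AND region = 'y' AND location = 'z';
```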

IMO, the main question is whether all your statements use fixed values
for region and location (like your second statement), while partcode is
fixed only in some of them (your first statement fixes it, your second
groups by it).
If that holds true for your statements, then your primary key is defined
in the wrong order: it should have partcode as the last field (= the
least significant one). Then all your statements could use the primary
key, and you would need no separate index.
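A sketch of what I mean (column names taken from your mail; untested, so
treat it as an outline rather than a recipe):

```sql
-- Reorder the key: region and location first, partcode last.
ALTER TABLE mytable
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (region, location, partcode);

-- Both of your query shapes can now use a leftmost prefix of this key:
SELECT qty FROM mytable
 WHERE region = 'y' AND location = 'z'
   AND partcode = 'x';              -- uses the full key

SELECT qty, partcode FROM mytable
 WHERE region = 'y' AND location = 'z';  -- uses the prefix (region, location)
```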

> 
> [[...]]
> 
> Does anyone have any suggestions? My initial thought is to replace the
> region and location varchar fields with int fields keyed to a separate
> list of region and location names. Would that help, or is there a better
> way?

Well, yes: if you map those strings to integer keys (with separate
lookup tables holding the names), that should speed up your operations
considerably:
Operations (including comparisons) on integers are faster than on
character strings, and the reduced row size means shorter (= faster)
transfers and more rows fitting into a cache of the same size.
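One possible layout along those lines (all table and column names here
are just examples, and I have not tested this):

```sql
CREATE TABLE regions (
  region_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(10) NOT NULL UNIQUE
);

CREATE TABLE locations (
  location_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name        VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE mytable (
  partcode    VARCHAR(20) NOT NULL,
  region_id   SMALLINT UNSIGNED NOT NULL,
  location_id INT UNSIGNED NOT NULL,
  qty         INT NOT NULL,
  PRIMARY KEY (region_id, location_id, partcode)
);

-- Queries then join back to the name tables when needed:
SELECT m.qty, m.partcode
  FROM mytable m
  JOIN regions   r ON r.region_id   = m.region_id
  JOIN locations l ON l.location_id = m.location_id
 WHERE r.name = 'y' AND l.name = 'z';
```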


HTH,
Jörg

-- 
Joerg Bruehe,  MySQL Build Team,
               [EMAIL PROTECTED]
Sun Microsystems GmbH,   Sonnenallee 1,   D-85551 Kirchheim-Heimstetten
Geschaeftsfuehrer: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering     Muenchen: HRB161028


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]