I appreciate everyone's suggestions. I wasn't aware of the MERGE table variation (it wasn't in my large MySQL book), nor of some of the file I/O optimizing aspects of InnoDB.
Re InnoDB's ability to return just the fields needed ... ??? I looked via Google and also dug around in the docs at the http://www.innodb.com site and couldn't find any references to this. Could someone please give me a URL to a page that documents this, and how the page/disk structure is rigged to support it? Thanks.

Re normalizing: that's probably not an option, and most likely wouldn't help if it were. Not an option because we don't control how the data source chooses to present the fact file. In the case of detailed natality, both data providers and data consumers expect a certain file layout. From our website at the CDC, http://wonder.cdc.gov/wonder/sci_data/natal/detail/type_txt/Natal98.asp documents the 1998 detailed natality record layout. This is what researchers expect to dig into. On the "most likely wouldn't help" end of things, most of the record elements are already coded down into single-character encodings, so instead of "MALE/FEMALE" we have "0/1" or "1/0" or whatever. So there's no disk space, and hence no disk I/O, to be gained by swapping in reference ids/pointers.

Otherwise I believe I'm still back at one file per field to open up the disk bottleneck. Between our branch and a group at Census that has also worked on this, we've invested maybe 3 man-years trying to figure out ways to optimize indexes, the SELECT, disk layout, etc. One of the Census folks used to be an IBM DB2 consultant. WHICH ISN'T TO SAY THAT WE KNOW EVERYTHING. We are wonderfully capable of making mistakes and being quite ignorant -- happens every day.

The data mining literature falls back on pre-aggregated or pre-subsetted tables, and that's what we (CDC ITB) have done in the past. Our compressed mortality query page http://wonder.cdc.gov/mortICD9J.shtml accesses an MS SQL Server stored procedure that selects one of a hundred or so tables to throw the query against. But we want to provide access to detail-level data, and for natality that means 200 fields in a record 500 bytes wide, and 40 million records. Often with no WHERE clause. Certainly with no pattern to the queries researchers make. Yuck.

A WHERE on YEAR, with 10 years of data, decreases file I/O by a factor of 10. A WHERE on STATE, with 50 states, decreases file I/O on average by a factor of 50. But given that it is exceedingly rare for even 10 of the 200 fields to be involved in a SELECT, using one file per field gives us a file I/O improvement factor of 20+. Add another factor of 2 for a lame-brain, low-CPU-hit compression of the field files and we get a net file I/O improvement factor of 40. And this is the technique that most current OTS high-end enterprise data mining tools seem to be using (per some 3 man-months spent researching OTS distributed data mining tools by several members of our branch recently).

Numbers crunch: a 320 MB/sec LVD 64/66 SCSI RAID yielding a data throughput of, say, 300 MB/sec, times 10 seconds (the OLAP ideal max for data turnaround), times an improvement factor of 40 == 120 GB of effective raw table. (A quick sketch of this arithmetic follows below.)

I'm leaving our branch in a week, so I probably won't be involved with this effort (adding a file-per-field backend to MySQL), but I believe that for a particular type of data query that _we_ fight every day it is the best non-mega-$ solution.
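To make that numbers crunch concrete, here is a minimal back-of-the-envelope sketch in Python. The constants (200 fields, 500-byte records, 40 million rows, ~300 MB/sec sustained throughput, a 10-second turnaround budget, and a 2x compression win) are the figures quoted above; the function names and the assumption of a purely sequential scan with no index help are mine, purely for illustration.

# Back-of-the-envelope I/O estimate for the one-file-per-field idea above.
# Constants come from the figures in this message; everything else is a
# simplifying assumption (sequential scan, no index help, no seek overhead).

RECORD_BYTES = 500           # detailed natality record width
NUM_FIELDS = 200             # fields per record
NUM_ROWS = 40_000_000        # rows in the fact table
THROUGHPUT_MB_S = 300        # sustained read from the SCSI RAID, MB/sec
TIME_BUDGET_S = 10           # OLAP "ideal max" turnaround, seconds
COMPRESSION_FACTOR = 2       # cheap per-field-file compression

def io_improvement(fields_touched: int) -> float:
    """Rough I/O gain of per-field files vs. scanning whole records."""
    field_fraction = fields_touched / NUM_FIELDS      # share of bytes a query really needs
    return COMPRESSION_FACTOR / field_fraction        # 10 fields -> 2 / 0.05 = 40x

def effective_table_gb(fields_touched: int) -> float:
    """Raw table size we can effectively scan inside the time budget."""
    raw_scan_gb = THROUGHPUT_MB_S * TIME_BUDGET_S / 1000.0   # ~3 GB actually read
    return raw_scan_gb * io_improvement(fields_touched)

if __name__ == "__main__":
    full_table_gb = RECORD_BYTES * NUM_ROWS / 1e9
    print(f"full fact table:       ~{full_table_gb:.0f} GB")            # ~20 GB
    print(f"I/O factor, 10 fields:  {io_improvement(10):.0f}x")         # 40x
    print(f"effective raw table:   ~{effective_table_gb(10):.0f} GB")   # ~120 GB

The whole win hinges on how few of the 200 field files a query actually touches, which is why even the no-WHERE-clause case comes out ahead.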
=============

On Mon, 2003-02-03 at 03:27, Steven Roussey wrote:
> First of all, I'd try optimizing your app before writing a whole new
> back-end. As such, I'd keep to the normal mysql list.
>
> For example, even if the indexes are big, try to index all the columns
> that might be searched. Heck, start by indexing all of them. If the data
> is read-only, try myisampack.
>
> Or, do the index thing above and use InnoDB for this app and be sure to
> select only those columns that you need. InnoDB does not read the whole
> record if it does not need to, even in a table scan (which is the worst
> case scenario you are calculating).
>
> All your calculations assume a full table scan which can be avoided by
> good choice of indexes and by using InnoDB to avoid whole-record
> retrieval. Am I missing something? Pulling data from a small 14GB table
> should not be a problem. My machine (<$10K) deals with 100GB of data and
> does 5000 to 10000 queries per second.
>
> Also, your reference to denormalization didn't make any sense to me. What
> level of normal form are you expecting?
>
> Sincerely,
> Steven Roussey
> http://Network54.com/

-- 
Heitzso <[EMAIL PROTECTED]>
MetaMedia, Inc.