I appreciate everyone's suggestions. I wasn't aware of the MERGE table variation (it wasn't in my large MySQL book), nor of some of the file I/O optimizing aspects of InnoDB.
Re InnoDB's ability to return just the fields needed ... ??? I looked via Google and also dug around in the docs at the http://www.innodb.com site and couldn't find any references to this. Could someone please give me a URL to a page that documents this, and how the page/disk structure is rigged to support it? Thanks.

Re normalizing: that's probably not an option, and most likely wouldn't help if it were. Not an option because we don't control how the data source chooses to present the fact file. In the case of detailed natality, both data providers and data consumers expect a certain file layout. From our website at the CDC, http://wonder.cdc.gov/wonder/sci_data/natal/detail/type_txt/Natal98.asp documents the 1998 detailed natality record layout. This is what researchers expect to dig into. On the "most likely wouldn't help" end of things, most of the record elements are already coded down into single-character encodings, so instead of "MALE/FEMALE" we have "0/1" or "1/0" or whatever. So there's no disk space, and hence no disk I/O, to be gained by swapping in reference ids/pointers.

Otherwise I believe I'm still back at one file per field to open up the disk bottleneck. Between our branch and a group at Census that has also worked on this, we've invested maybe 3 man-years trying to figure out ways to optimize indexes, the SELECT, disk layout, etc. One of the Census folks used to be an IBM DB2 consultant. WHICH ISN'T TO SAY THAT WE KNOW EVERYTHING. We are wonderfully capable of making mistakes and being quite ignorant -- happens every day.

The data mining literature falls back on pre-aggregated or pre-subsetted tables, and that's what we (CDC ITB) have done in the past. Our compressed mortality query page http://wonder.cdc.gov/mortICD9J.shtml accesses an MS SQL Server stored procedure that selects one of a hundred or so tables to throw the query against. But we want to provide access to detail-level data, and for natality that means 200 fields in a record 500 bytes wide, and 40 million records. Often with no WHERE clause. Certainly with no pattern to the queries researchers make. Yuck.

A WHERE on YEAR, with 10 years of data, decreases file I/O by a factor of 10. A WHERE on STATE, with 50 states, decreases file I/O on average by a factor of 50. But given that it is exceedingly rare for even 10 of the 200 fields to be involved in a SELECT, using one file per field gives us a file I/O improvement factor of 20+. Add another factor of 2 for a lame-brain, low-CPU-hit compression of the field files and we get a net file I/O improvement factor of 40. And this is the technique that most current OTS high-end enterprise data mining tools seem to be using (per some 3 man-months spent researching OTS distributed data mining tools by several members of our branch recently).

Numbers crunch: a 320 MB/sec LVD 64/66 SCSI RAID yielding a data throughput of, say, 300 MB/sec, times 10 seconds (the OLAP ideal max for data turnaround), times an improvement factor of 40 == 120 GB of effective raw table. (A quick sketch of this arithmetic follows below.)

I'm leaving our branch in a week, so I probably won't be involved with this effort (adding a file-per-field backend to MySQL), but I believe that for a particular type of data query that _we_ fight every day it is the best non-mega-$ solution.
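To make that numbers crunch concrete, here is a minimal back-of-the-envelope sketch in Python. The constants (200 fields, 500-byte records, 40 million rows, ~300 MB/sec sustained throughput, a 10-second turnaround budget, and a 2x compression win) are the figures quoted above; the function names and the assumption of a purely sequential scan with no index help are mine, purely for illustration.

# Back-of-the-envelope I/O estimate for the one-file-per-field idea above.
# Constants come from the figures in this message; everything else is a
# simplifying assumption (sequential scan, no index help, no seek overhead).

RECORD_BYTES = 500           # detailed natality record width
NUM_FIELDS = 200             # fields per record
NUM_ROWS = 40_000_000        # rows in the fact table
THROUGHPUT_MB_S = 300        # sustained read from the SCSI RAID, MB/sec
TIME_BUDGET_S = 10           # OLAP "ideal max" turnaround, seconds
COMPRESSION_FACTOR = 2       # cheap per-field-file compression

def io_improvement(fields_touched: int) -> float:
    """Rough I/O gain of per-field files vs. scanning whole records."""
    field_fraction = fields_touched / NUM_FIELDS      # share of bytes a query really needs
    return COMPRESSION_FACTOR / field_fraction        # 10 fields -> 2 / 0.05 = 40x

def effective_table_gb(fields_touched: int) -> float:
    """Raw table size we can effectively scan inside the time budget."""
    raw_scan_gb = THROUGHPUT_MB_S * TIME_BUDGET_S / 1000.0   # ~3 GB actually read
    return raw_scan_gb * io_improvement(fields_touched)

if __name__ == "__main__":
    full_table_gb = RECORD_BYTES * NUM_ROWS / 1e9
    print(f"full fact table:       ~{full_table_gb:.0f} GB")            # ~20 GB
    print(f"I/O factor, 10 fields:  {io_improvement(10):.0f}x")         # 40x
    print(f"effective raw table:   ~{effective_table_gb(10):.0f} GB")   # ~120 GB

The whole win hinges on how few of the 200 field files a query actually touches, which is why even the no-WHERE-clause case comes out ahead.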
=============

On Mon, 2003-02-03 at 03:27, Steven Roussey wrote:
> First of all, I'd try optimizing your app before writing a whole new
> back-end. As such, I'd keep to the normal mysql list.
>
> For example, even if the indexes are big, try to index all the columns
> that might be searched. Heck, start by indexing all of them. If the data
> is read-only, try myisampack.
>
> Or, do the index thing above and use InnoDB for this app and be sure to
> select only those columns that you need. InnoDB does not read the whole
> record if it does not need to, even in a table scan (which is the worst
> case scenario you are calculating).
>
> All your calculations assume a full table scan which can be avoided by
> good choice of indexes and by using InnoDB to avoid whole-record
> retrieval. Am I missing something? Pulling data from a small 14GB table
> should not be a problem. My machine (<$10K) deals with 100GB of data and
> does 5000 to 10000 queries per second.
>
> Also, your reference to denormalization didn't make any sense to me. What
> level of normal form are you expecting?
>
> Sincerely,
> Steven Roussey
> http://Network54.com/

-- 
Heitzso <[EMAIL PROTECTED]>
MetaMedia, Inc.