Hi,

On Thursday, October 21, 2004, at 04:40 PM, [EMAIL PROTECTED] wrote:

I don't think he is worried about table scanning; he is worried about
ROW scanning. Each of his rows is so large (2500*(size of float) +
3*(size of tinyint) + some other stuff) that just moving that much data
through his machine is consuming too much time.


If you have a query that does something like this:

SELECT onefield
FROM hugetable
WHERE indexed_column = 12
        AND non_indexed = 6

Odds are that you will be doing an indexed search. Say that for his
indexed_column he gets 24M rows that match its condition out of the 16B
rows in his table (way less than 30%). The next thing the engine
has to do is LOAD each of those 24M rows (every single column) into
memory so that the value of the non_indexed column can be compared to 6.
In order to perform that second comparison, the memory bus, the hard
drives, and everything else involved in reading records will have to transfer
275.4 GB of data (at least once), just so that he can get the value of
the one column he specified in his SELECT statement out of each row that
matches his two WHERE conditions.
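
For a sense of scale, here is a rough back-of-envelope calculation of that figure (the 4-byte FLOAT and 1-byte TINYINT sizes are assumptions on my part; per-row storage overhead and the "other stuff" presumably account for the gap up to the 275.4 GB quoted above):

# Rough estimate of data moved just to test the non-indexed column.
# Assumes 4-byte FLOATs and 1-byte TINYINTs; ignores row-format overhead.
matching_rows = 24_000_000            # rows matched by the indexed column
row_bytes = 2500 * 4 + 3 * 1          # 2500 floats + 3 tinyints = 10,003 bytes
data_gb = matching_rows * row_bytes / 1e9
print("%.0f GB of column data alone" % data_gb)   # ~240 GB before overhead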


My idea is to create a way to ask the engine to check the value of the second
field directly from the DISK copy of each table (without moving each row
into local memory). If it matches, a second seek is performed to pull in
the field(s) specified in his SELECT. Alternatively, we could create some
way to ask the engine to pull through memory only those columns that
participate in the SELECT clause or in one of the ON or WHERE conditions.
Either way we minimize how much data must be moved through memory to
resolve one of these rather bulky queries.
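
To make the two-seek idea concrete, here is a minimal sketch for a hypothetical fixed-width row file (the layout, offsets, and function are invented for illustration; this is not how any MySQL storage engine actually stores rows):

import struct

ROW_SIZE = 10_003          # hypothetical fixed-width row: 2500 floats + 3 tinyints
NON_INDEXED_OFF = 10_000   # hypothetical byte offset of the non-indexed TINYINT
ONEFIELD_OFF = 0           # hypothetical byte offset of the selected FLOAT column

def probe(datafile, row_numbers, wanted_value):
    """For each candidate row found via the index, read only the one-byte
    non-indexed column; fetch the selected field only when it matches."""
    results = []
    with open(datafile, "rb") as f:
        for rn in row_numbers:
            base = rn * ROW_SIZE
            f.seek(base + NON_INDEXED_OFF)      # first seek: 1-byte probe
            if f.read(1)[0] != wanted_value:
                continue
            f.seek(base + ONEFIELD_OFF)         # second seek: the SELECTed field
            results.append(struct.unpack("<f", f.read(4))[0])
    return results

This trades one large read per candidate row for two small reads, so its value hinges entirely on how cheap those extra seeks are.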


We could potentially create a new threshold value; say, if a query uses less
than 10% of the fields of a table, that would cause this optimization to
kick in.
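
A sketch of how such a threshold might drive the choice of access strategy (the 10% figure and the names are placeholders, not anything that exists in the optimizer):

COLUMN_FRACTION_THRESHOLD = 0.10   # hypothetical tuning knob

def choose_strategy(columns_used, columns_in_table):
    """Use column probes only when the query touches a small fraction of a
    very wide table; otherwise read whole rows as usual."""
    if columns_used / columns_in_table < COLUMN_FRACTION_THRESHOLD:
        return "column-probe"      # the optimization proposed above
    return "full-row"

# e.g. 3 columns out of 2503 is well under 10%, so the probe path is chosen
print(choose_strategy(3, 2503))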


If anyone else has any ideas (and I am sure there are many) on how to
minimize disk traffic and memory throughput in order to handle his rather
large dataset, this would be a great time to speak up. I know that this is
a rather extreme case, but solving it may make MySQL just that much
faster for the rest of us.

The problem with the above is that you would increase disk seeks by one per row per column, and a disk seek is much more expensive than a continuous read. The only case where I could see this optimization being useful is a single column being read from a fixed-width table (then you could compute the initial seek offset without adding any extra seeks). With 2 columns (which is your example above), it would be faster to do 1 disk seek and read the entire row (10 KB in this case, with 2500 float fields) than to do 2 disk seeks and read 8 bytes (2 columns). Moving data through the CPU and memory is trivial in cost compared to doing an extra disk seek.
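
To put rough numbers on that trade-off (the ~9 ms seek time and ~50 MB/s sequential throughput are ballpark figures I am assuming for a current disk, not measurements):

SEEK_MS = 9.0                 # assumed average seek + rotational latency
SEQ_MB_PER_S = 50.0           # assumed sequential read throughput
ROW_BYTES = 10_003            # the ~10 KB fixed-width row from the example

def read_ms(nbytes):
    return nbytes / (SEQ_MB_PER_S * 1e6) * 1000

one_seek_full_row = SEEK_MS + read_ms(ROW_BYTES)   # ~9.2 ms per candidate row
two_seeks_two_cols = 2 * SEEK_MS + read_ms(8)      # ~18.0 ms per candidate row

print("full row: %.1f ms, column probe: %.1f ms"
      % (one_seek_full_row, two_seeks_two_cols))

Under those assumptions the extra seek roughly doubles the per-row cost, which is why reading the whole 10 KB row in one pass wins.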


Regards,

Harrison

