<snip>

The main purpose of the FuzzyOcr's db was of course to avoid computing the
OCR passes needed to decode the image text for known images. The problem is
that the cache content is not searched for an exact match of the key values
(which are image type, width, height, number of colors and color
frequencies): it looks for the best match of these values within a given
range. This has a number of drawbacks:

 a) range search defeats look-up indexing in the db,
    thereby resulting in browsing the whole db for a match;

</snip>

Range searching in a database is (can be?) vastly faster than a full table
scan. You have to USE the indexes, not just assume an FTS will be required.
The database optimizer _SHOULD_ figure this out, but only if it has
reasonable statistics and is passed a reasonable query.

I got some SQL from MapQuest once, that had a WHERE clause containing the
arithmetic to compute the distance from a given location (a "radius"
search). As coded, of course, a Full Table Scan was required and the
distance function was evaluated to determine a row's presence in the result
set. It was very slow, even for just a few hundred thousand location
records.

I indexed the Latitude and Longitude columns, and expressed the query
without the radius in the WHERE, but rather the ranges of possible Latitude
and Longitude values (In effect, the rectangle that just contained the
circle the user desired). The unwanted "corners" of the result set were
discarded, rather than every single row outside the desired radius. The
performance gains (for normal radiuses, 10, 20, 50, 100 miles) were
enormous. The average gain was 100:1 - two orders of magnitude.

MapQuest, of course, wasn't interested. The SQL ran on their clients'
machines, not theirs.

If fuzzyOCR caching method has any merit at all, tuning the SQL and/or the
database will provide decent performance. 

"Explain Execution Plan" is your friend!

Dan Barker

Reply via email to