> -----Messaggio originale----- > Da: Dan Barker [mailto:[EMAIL PROTECTED] > > <snip> > > The main purpose of the FuzzyOcr's db was of course to avoid computing > the > OCR passes needed to decode the image text for known images. The > problem is > that the cache content is not searched for an exact match of the key > values > (which are image type, width, height, number of colors and color > frequencies): it looks for the best match of these values within a > given > range. This has a number of drawbacks: > > a) range search defeats look-up indexing in the db, > thereby resulting in browsing the whole db for a match; > > </snip> > > Range searching in a database is (can be?) vastly faster than a full > table > scan. You have to USE the indexes, not just assume an FTS will be > required. > The database optimizer _SHOULD_ figure this out, but only if it has > reasonable statistics and is passed a reasonable query. > > I got some SQL from MapQuest once, that had a WHERE clause containing > the > arithmetic to compute the distance from a given location (a "radius" > search). As coded, of course, a Full Table Scan was required and the > distance function was evaluated to determine a row's presence in the > result > set. It was very slow, even for just a few hundred thousand location > records. > > I indexed the Latitude and Longitude columns, and expressed the query > without the radius in the WHERE, but rather the ranges of possible > Latitude > and Longitude values (In effect, the rectangle that just contained the > circle the user desired). The unwanted "corners" of the result set were > discarded, rather than every single row outside the desired radius. The > performance gains (for normal radiuses, 10, 20, 50, 100 miles) were > enormous. The average gain was 100:1 - two orders of magnitude. > > MapQuest, of course, wasn't interested. The SQL ran on their clients' > machines, not theirs. > > If fuzzyOCR caching method has any merit at all, tuning the SQL and/or > the > database will provide decent performance. > > "Explain Execution Plan" is your friend!
I totally agree with you, Dan. I recall such a discussion when FuzzyOcr was still under development and there were some of us contributing ideas about this. Nevertheless, when the final code was out, it didn't implement anything in order to at least allow the SQL server to reduce the number of rows to retrieve (in example, the specific select didn't even establish the allowed, rough ranges on image width and height attributes in its where expression). The result is point a) of the list of drawbacks in my post. However, even fixing it, points b) and c) still hold. This is a problem bound to the range checking itself: it is not always correct to assert that two images are the same if their "distance" is less than a given epsilon. This is true regardless of how you compute your "distance" or how low is your epsilon, but in the specific case maybe the FuzzyOcr's distance function is not the best we could get... Thereby, to my opinion drawback a) may eventually get fixed by not allowing any range search at all, but instead computing a true hash (md5 or whatever good) of the image file and then using it as a primary key in the db. Spammers often use the very same image in only few messages, thereby the performance gain would be low but, nevertheless, it would be non zero. c) would still hold, anyway, so maybe event this "solution" wouldn't help that much when you're trying to tune FO to defenestrate that damn spammer. As someone else pointed out in this list, the whole caching code was due to concerns about execution times needed by computing OCR code on a lot of images. These concerns seems much relaxed now, so the best option we actually have is to disable caching at all, which is like discarding any caching code in FuzzyOcr. Giampaolo > Dan Barker