Yup - if you are looking for "within 10 miles" you can perform a raw
comparison on the lat-lon degree numbers to discard anything more than
two degrees apart. That cuts the search down by a factor of roughly 180
in each direction, better than 30,000:1 savings right there. If you
store all the data as degrees and fractional degrees you can discard
everything more than a small fraction of a degree apart.
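To make that concrete, here is a minimal sketch of the raw pre-filter
(Python, purely illustrative; the two-degree window and the
(id, lat, lon) tuple layout are assumptions, not anyone's real data
layout):

# Rough pre-filter: throw away anything more than ~2 degrees away on
# either axis before doing any real distance math.
def rough_candidates(points, center_lat, center_lon, window_deg=2.0):
    """points is an iterable of (id, lat, lon) tuples in decimal degrees."""
    return [
        (pid, lat, lon)
        for pid, lat, lon in points
        if abs(lat - center_lat) <= window_deg
        and abs(lon - center_lon) <= window_deg
    ]

# Only the survivors of this cheap comparison get the expensive
# great-circle distance test.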
But for the first cut, storing everything in the grid square 117 to 118
longitude and 34 to 35 latitude in its own part of the tree structure
allows almost instant selection of "likely" candidates. You could also
use links to store 117 to 118, 34 to 35 in one box and 117.5 to 118.5,
34 to 35 in another box - note the deliberate overlap. That way a site
sitting right on the corner or edge of one box isn't lost. Anything like
this that reduces the amount of data that has to be tested, even at the
expense of cross-linked trees, is a huge savings. You enter an item into
the database once, and at insert time you work out the crude region
linkages. Then the searches, the "many" operation, run faster because
the excess candidates have already been filtered out.
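A rough sketch of the overlapping-box idea (illustrative only; the
one-degree cell size and the half-degree offset are arbitrary choices
for the example, and a real version would also look at neighbouring
cells for larger radii):

import math
from collections import defaultdict

# Two overlapping one-degree grids, the second offset by half a degree,
# so a site sitting on the edge of a cell in one grid is comfortably
# inside a cell of the other. Each site is filed under both keys once,
# at insert time.
def grid_keys(lat, lon, cell=1.0):
    k1 = (math.floor(lat / cell), math.floor(lon / cell))
    k2 = (math.floor((lat + cell / 2) / cell),
          math.floor((lon + cell / 2) / cell))
    return k1, k2

boxes = defaultdict(list)          # plain grid
boxes_offset = defaultdict(list)   # half-degree-shifted grid

def insert(site_id, lat, lon):
    k1, k2 = grid_keys(lat, lon)
    boxes[k1].append(site_id)
    boxes_offset[k2].append(site_id)

def likely_candidates(lat, lon):
    # At query time only the two boxes the query point falls in are
    # touched; everything else is never even looked at.
    k1, k2 = grid_keys(lat, lon)
    return set(boxes[k1]) | set(boxes_offset[k2])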
{^_^}
----- Original Message -----
From: "Dan Barker" <[EMAIL PROTECTED]>
Giampaolo: I hope you succeed.
I've given up hope on convincing folks (Mapquest in particular) that
radius searches can be indexed. You needn't pull the lat/long of every
single entry to run the distance function, and then discard the ones
too far away. You can index on LAT and LONG and structure the query so
that only the "possible" lat/long values need the distance function
evaluated (and the rest of the record fetched).
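Such a query might be sketched like this (untested illustration using
SQLite from Python; the table name, columns and the miles-per-degree
figures are assumptions, not anyone's actual schema):

import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (id INTEGER PRIMARY KEY, lat REAL, lon REAL)")
# Indexes on lat and lon let the planner do a narrow range scan before
# any distance math happens.
conn.execute("CREATE INDEX idx_places_lat ON places (lat)")
conn.execute("CREATE INDEX idx_places_lon ON places (lon)")

def haversine_miles(lat1, lon1, lat2, lon2):
    # Exact great-circle distance, only run on the few survivors.
    r = 3959.0  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(center_lat, center_lon, radius_miles=10.0):
    # ~69 miles per degree of latitude; widen the longitude window by
    # the cosine of the latitude. Rough, but fine for a pre-filter.
    dlat = radius_miles / 69.0
    dlon = radius_miles / (69.0 * max(math.cos(math.radians(center_lat)), 0.01))
    rows = conn.execute(
        "SELECT id, lat, lon FROM places"
        " WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
        (center_lat - dlat, center_lat + dlat,
         center_lon - dlon, center_lon + dlon)).fetchall()
    return [r for r in rows
            if haversine_miles(r[1], r[2], center_lat, center_lon) <= radius_miles]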
Just because it's two orders of magnitude more efficient doesn't make
anybody listen.
Same conversation, different universe!
Dan
-----Original Message-----
From: Giampaolo Tomassoni [mailto:[EMAIL PROTECTED]
From: Andy Dills [mailto:[EMAIL PROTECTED]
...omitted...
> I understand that the "order" keyword in select is potentially
> expensive, but necessary because matches occur generally towards the
> most recent entries, thus increasing the possibility of a match
> earlier on. When your hash count is in the thousands, earlier matches
> mean fewer queries to the database, and potentially faster results.
It's not just the order directive, it's the iteration throughout the
entire database.
Consider when the database grows to >50k records. For a new image whose
hash isn't already stored, that's 50k records that must be sorted and
sent from the DB server to the mail server, and then all 50k records
must be checked against the hash before we decide that we haven't seen
this image before.
That just isn't a workable algorithm. If iteration throughout the entire
database is a requirement, hashing is a performance hit rather than a
performance gain.
A better solution might be a separate daemon that holds the hashes in
memory, to which you submit the hash being considered.
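A minimal sketch of such a daemon (Python, purely illustrative; the
port, the one-hash-per-line protocol and the "SEEN"/"NEW" replies are
invented for the example):

import socketserver

# All hashes live in one in-memory set; the mail server asks about one
# hash at a time instead of pulling every row out of the database.
seen_hashes = set()

class HashHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One hex hash per line; reply SEEN or NEW and remember it.
        h = self.rfile.readline().strip()
        if h in seen_hashes:
            self.wfile.write(b"SEEN\n")
        else:
            seen_hashes.add(h)
            self.wfile.write(b"NEW\n")

if __name__ == "__main__":
    with socketserver.TCPServer(("127.0.0.1", 9999), HashHandler) as srv:
        srv.serve_forever()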
Other ways could be the ones depicted in my recent post (Message-ID:
<[EMAIL PROTECTED]>), in which close images
are basically clustered together thanks to a surrogate index.
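Without that post at hand, the following is only one possible reading
of "clustering close images through a surrogate index" - a coarse
average-brightness grid key, offered purely as an illustration and not
as what the referenced post actually proposes:

# Derive a short, coarse "surrogate" key so that near-identical images
# tend to produce the same key, then put an ordinary index on that key.
# Exact-match lookups on the surrogate replace scanning every stored hash.
def surrogate_key(pixels, width, height, grid=4):
    """pixels: flat row-major list of grayscale values (0-255), with
    width and height both at least `grid`. Returns a 16-bit key with one
    bit per grid cell, set when the cell is brighter than average."""
    overall = sum(pixels) / len(pixels)
    key = 0
    for gy in range(grid):
        for gx in range(grid):
            cell = [
                pixels[y * width + x]
                for y in range(gy * height // grid, (gy + 1) * height // grid)
                for x in range(gx * width // grid, (gx + 1) * width // grid)
            ]
            key = (key << 1) | (1 if sum(cell) / len(cell) > overall else 0)
    return key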
giampaolo
Honestly, I have been extremely impressed with having hashing turned
completely off.
Andy
---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---