Hi,

In our company's address-search framework, we need to deal with queries that may contain spelling errors. Even after applying some external address-space constraints (e.g. matching first letters, word length, etc.), we still end up with a huge data set to filter with Levenshtein-like distance metrics.
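For concreteness, this is the kind of metric involved: a plain dynamic-programming Levenshtein distance. This is just an illustrative standalone sketch (fuzzystrmatch already ships its own levenshtein()), rolling two rows so memory stays O(len(t)):

#include <stdlib.h>
#include <string.h>

/* Dynamic-programming Levenshtein distance; illustration only,
 * fuzzystrmatch provides its own implementation. */
static int
levenshtein(const char *s, const char *t)
{
    int     m = strlen(s);
    int     n = strlen(t);
    int    *prev = malloc((n + 1) * sizeof(int));
    int    *curr = malloc((n + 1) * sizeof(int));
    int     i, j, result;

    for (j = 0; j <= n; j++)
        prev[j] = j;                    /* distance from "" to t[0..j] */

    for (i = 1; i <= m; i++)
    {
        int    *tmp;

        curr[0] = i;                    /* distance from s[0..i] to "" */
        for (j = 1; j <= n; j++)
        {
            int     del = prev[j] + 1;
            int     ins = curr[j - 1] + 1;
            int     sub = prev[j - 1] + (s[i - 1] != t[j - 1]);

            curr[j] = del < ins ? del : ins;
            if (sub < curr[j])
                curr[j] = sub;
        }
        tmp = prev; prev = curr; curr = tmp;    /* roll the rows */
    }

    result = prev[n];
    free(prev);
    free(curr);
    return result;
}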
Sequentially scanning a record set of roughly 100,000 entries with such a distance metric is not something we'd want in production. For this purpose, I plan to implement BK-trees [1] on top of GiST, which will (technically) reduce the complexity from O(n) to O(log n). As far as I'm concerned, such work will be worth the time it takes compared to the overhead reduction it will bring.

[1] W. A. Burkhard and R. M. Keller, "Some approaches to best-match file searching", http://portal.acm.org/citation.cfm?id=362003.362025

Anyway, I have some experience with the source code of the intarray module. Does anybody have suggestions/warnings/comments about such a project? Would the PostgreSQL team welcome such a patch for integration into the fuzzystrmatch module, or should I create my own project at pgFoundry?

BTW, as you'd imagine, the implementation won't be specific to Levenshtein. Any distance metric (in the mathematical sense, i.e. one satisfying the triangle inequality) on any kind of data will be able to benefit from BK-trees.

Regards.
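To make the pruning idea concrete, here is a minimal in-memory BK-tree sketch in C, reusing the levenshtein() helper above. The names (BKNode, bk_insert, bk_search, MAXDIST, emit) are all hypothetical; the actual patch would express this through the GiST interface rather than explicit child pointers. Each child edge is labeled with its distance to the parent, and by the triangle inequality a search within radius k only needs to descend into edges labeled in [d - k, d + k]:

#include <stdlib.h>

#define MAXDIST 64      /* assumed upper bound on a single distance */

typedef struct BKNode
{
    const char     *term;
    struct BKNode  *child[MAXDIST + 1]; /* child[d]: subtree at distance d */
} BKNode;

static BKNode *
bk_new(const char *term)
{
    BKNode *n = calloc(1, sizeof(BKNode));

    n->term = term;
    return n;
}

static void
bk_insert(BKNode *root, const char *term)
{
    for (;;)
    {
        int     d = levenshtein(root->term, term);

        if (d == 0 || d > MAXDIST)
            return;                     /* duplicate, or outside the cap */
        if (root->child[d] == NULL)
        {
            root->child[d] = bk_new(term);
            return;
        }
        root = root->child[d];          /* walk down the matching edge */
    }
}

/* Report every term within distance k of the query; subtrees whose
 * edge label falls outside [d - k, d + k] cannot contain a match,
 * so they are pruned without computing any further distances. */
static void
bk_search(const BKNode *node, const char *query, int k,
          void (*emit)(const char *term, int dist))
{
    int     d = levenshtein(node->term, query);
    int     lo = (d - k < 0) ? 0 : d - k;
    int     hi = (d + k > MAXDIST) ? MAXDIST : d + k;
    int     i;

    if (d <= k)
        emit(node->term, d);

    for (i = lo; i <= hi; i++)
        if (node->child[i])
            bk_search(node->child[i], query, k, emit);
}

Loading the 100,000 entries is then just repeated bk_insert() calls, and a misspelling lookup is bk_search(root, query, k, emit) with k around 1 or 2; the smaller k is, the larger the fraction of the tree that gets pruned on each step.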