hello -
a fuzzy query related question:
have there been any implementations of "fuzzy" queries other than
edit-distance? and/or modifications of edit-distance that penalize
common alternate spellings less? - e.g. "couldn't" vs. "couldnt" -- here the
missing apostrophe would get a smaller penalty than a character mismatch.
i'm thinking specifically of the algorithms in the SecondString open
source package:
http://secondstring.sourceforge.net/
how difficult do you think it would be to wrap an alternate algorithm
that provides a:
float score(String1, String2)
function?
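
to make the idea concrete, here is a rough sketch of what i mean by a weighted edit distance behind a float score(String, String) method. this is plain Java, not SecondString or Lucene code; the class name, the 0.2 punctuation cost and the normalization by the longer string's length are all assumptions picked for illustration.

```java
/** Hypothetical sketch: edit distance where punctuation edits cost less. */
final class WeightedEditDistance {
    // Assumed weights, chosen only for illustration.
    static final float PUNCT_COST = 0.2f;   // insert/delete of an apostrophe or hyphen
    static final float DEFAULT_COST = 1.0f; // any other insert, delete or substitution

    /** Cost of inserting or deleting a single character. */
    static float indelCost(char c) {
        return (c == '\'' || c == '-') ? PUNCT_COST : DEFAULT_COST;
    }

    /** Cost of replacing character a with character b. */
    static float substCost(char a, char b) {
        return a == b ? 0f : DEFAULT_COST;
    }

    /** Weighted edit distance computed with two rolling rows. */
    static float distance(String s, String t) {
        float[] prev = new float[t.length() + 1];
        float[] curr = new float[t.length() + 1];
        // First row: build each prefix of t from the empty string.
        for (int j = 1; j <= t.length(); j++) {
            prev[j] = prev[j - 1] + indelCost(t.charAt(j - 1));
        }
        for (int i = 1; i <= s.length(); i++) {
            char sc = s.charAt(i - 1);
            curr[0] = prev[0] + indelCost(sc);
            for (int j = 1; j <= t.length(); j++) {
                char tc = t.charAt(j - 1);
                float del = prev[j] + indelCost(sc);
                float ins = curr[j - 1] + indelCost(tc);
                float sub = prev[j - 1] + substCost(sc, tc);
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            float[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    /** Similarity in [0, 1]; 1 means the strings are identical. */
    static float score(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1f;
        }
        return 1f - distance(s, t) / maxLen;
    }
}
```

with these weights, "couldn't" vs. "couldnt" differs only by one cheap apostrophe deletion, so it scores higher than a pair like "couldnt" vs. "couldna" that needs a full-cost substitution.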
---marc
mark harwood wrote:
One thing I was thinking of doing was checking the
character frequency
An alternative idea is index-time fuzzification rather
than query-time. This is documented in one of the case
studies in LIA (Lucene in Action) - the principle is you don't
index/search for whole words but use an NGram Analyzer
to break them up at index time:
Kylie becomes multiple words:
[ k]
[ ky]
[ kyl]
[ky]
[kyl]
[kyli]
[yl]
[yli]
[ylie]
[ kylie ]
Obviously you use the same analyzer to process
queries.
Lucene will automatically look after relevancy of
partial matches for you but your indexes are bigger
and your queries will generate many more Boolean
clauses.
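
The n-gram splitting described above can be sketched in plain Java as follows. This is only an illustration of the idea, not the analyzer from the LIA case study and not a Lucene API: the class and method names are hypothetical, and unlike the Kylie example it does no padding at the word boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a term into character n-grams. */
final class NGramSketch {
    /**
     * Returns every substring of term whose length is between minLen and
     * maxLen (inclusive), ordered by start position and then by length.
     * The term is lowercased first, as an analyzer typically would.
     */
    static List<String> ngrams(String term, int minLen, int maxLen) {
        String t = term.toLowerCase();
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < t.length(); start++) {
            for (int len = minLen; len <= maxLen; len++) {
                if (start + len > t.length()) {
                    break; // longer grams would run past the end of the term
                }
                grams.add(t.substring(start, start + len));
            }
        }
        return grams;
    }
}
```

Calling NGramSketch.ngrams("Kylie", 2, 4) gives ky, kyl, kyli, yl, yli, ylie, li, lie, ie. Each gram would be indexed as its own term, and the query text would be split the same way, which is where the larger index and the extra Boolean clauses come from.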