Re: Funny results with Fuzzy

Marc Hadfield Tue, 25 Oct 2005 10:44:17 -0700


hello -

a fuzzy query related question:

have there been any implementations of "fuzzy" queries other than edit-distance? and/or modifications of edit-distance that penalize common alternate spellings less? e.g. "couldn't" vs. "couldnt" -- here the apostrophe would get a smaller penalty than a character mismatch.

i'm thinking specifically of the algorithms in the SecondString open source package:

http://secondstring.sourceforge.net/

how difficult do you think it would be to wrap an alternate algorithm that provides a:

float score(String1, String2)
function?
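
to make the idea concrete, here is a minimal sketch of an edit distance where inserting or deleting punctuation costs less than a letter mismatch, exposed through a float score(s, t) method. the class name, the cost constants (0.2 and 1.0) and the normalization into [0, 1] are assumptions made for illustration; this is not SecondString's or Lucene's API.

```java
public class WeightedEditDistance {
    // Hypothetical costs, chosen only for illustration.
    static final float PUNCT_COST = 0.2f;   // insert/delete of punctuation
    static final float DEFAULT_COST = 1.0f; // any other edit

    // Cost of inserting or deleting a single character.
    static float gapCost(char c) {
        return Character.isLetterOrDigit(c) ? DEFAULT_COST : PUNCT_COST;
    }

    // Cost of replacing character a with character b.
    static float substitutionCost(char a, char b) {
        return a == b ? 0f : DEFAULT_COST;
    }

    // Weighted Levenshtein distance using two rolling rows.
    static float distance(String s, String t) {
        float[] prev = new float[t.length() + 1];
        float[] curr = new float[t.length() + 1];
        for (int j = 1; j <= t.length(); j++) {
            prev[j] = prev[j - 1] + gapCost(t.charAt(j - 1));
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = prev[0] + gapCost(s.charAt(i - 1));
            for (int j = 1; j <= t.length(); j++) {
                float del = prev[j] + gapCost(s.charAt(i - 1));
                float ins = curr[j - 1] + gapCost(t.charAt(j - 1));
                float sub = prev[j - 1]
                        + substitutionCost(s.charAt(i - 1), t.charAt(j - 1));
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            float[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Similarity in [0, 1]; 1 means the strings are identical.
    public static float score(String s, String t) {
        int maxLen = Math.max(s.length(), t.length());
        if (maxLen == 0) {
            return 1f;
        }
        return Math.max(0f, 1f - distance(s, t) / maxLen);
    }
}
```

with these costs, score("couldn't", "couldnt") comes out higher than score("couldn't", "couldnxt"), because dropping the apostrophe is cheaper than swapping it for a letter.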


---marc

mark harwood wrote:

One thing I was thinking of doing was checking the
character frequency

An alternative idea is index-time fuzzification rather
than query-time. This is documented in one of the case
studies in LIA (Lucene in Action) - the principle is
that you don't index/search for whole words but use an
NGram Analyzer to break them up at index time:

Kylie becomes multiple words:
[ k]
[ ky]
[ kyl]
[ky]
[kyl]
[kyli]
[yl]
[yli]
[ylie]
[ kylie ]

Obviously you use the same analyzer to process
queries.
Lucene will automatically look after relevancy of
partial matches for you but your indexes are bigger
and your queries will generate many more Boolean
clauses.
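
A rough sketch of the n-gram splitting described above, in plain Java. The class and method names, the padding with a single space on each side, and the 2 to 4 character gram sizes are assumptions for illustration; this is not the analyzer from the LIA case study, and a real Lucene analyzer would emit these as tokens rather than return a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NGramSketch {
    // Returns every character n-gram of length minSize..maxSize taken from
    // the lowercased word, padded with one space at each end so that grams
    // at the start and end of the word stay distinguishable.
    public static List<String> ngrams(String word, int minSize, int maxSize) {
        String padded = " " + word.toLowerCase(Locale.ROOT) + " ";
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < padded.length(); start++) {
            for (int size = minSize;
                 size <= maxSize && start + size <= padded.length();
                 size++) {
                grams.add(padded.substring(start, start + size));
            }
        }
        return grams;
    }
}
```

Calling NGramSketch.ngrams("Kylie", 2, 4) yields grams such as " k", " ky", "kyl" and "ylie". Running the same method over the query term gives the terms to OR together, which is where the extra Boolean clauses come from.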


