On Mon, Mar 29, 2010 at 10:57 AM, Benjamin Patrick Jung <bpj...@terreon.de> wrote:
> [Examples] Search term --> Subset of expected result
> Cinamo~0.5 --> Cinema, Cinnamon [works]
> Strawbarr~0.8 --> Strawberry [doesn't work]
>
> As far as I understand, the "edit distance" (aka "Levenshtein distance")
> between "Strawbarr" and "Strawberry" is 2 (one replacement and one
> insertion to transform "Strawbarr" into "Strawberry")

Yes, you are correct. The scaling is a bit strange, in my opinion. You can see it in FuzzyTermsEnum's javadocs (if you look at the code):

    Similarity returns a number that is 1.0f or less (including negative
    numbers) based on how similar the Term is compared to a target term.
    It returns exactly 0.0f when editDistance > maximumEditDistance.
    Otherwise it returns:

        1 - (editDistance / length)

    where length is the length of the shortest term (text or target)
    including a prefix that are identical and editDistance is the
    Levenshtein distance for the two words.

In your example the shortest term is "Strawbarr" (9 characters), so the similarity is 1 - (2 / 9), roughly 0.78, which falls just below your 0.8 threshold.

I think other implementations instead tend to use 1 - (editDistance / length) for scaling, where length is the length of the longest term. With that scaling the same pair would score 1 - (2 / 10) = 0.8.

--
Robert Muir
rcm...@gmail.com
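
To make the difference between the two scalings concrete, here is a minimal sketch in Python. It is not Lucene's code: `levenshtein` and `similarity` are hypothetical helper names, the `use_longest` flag is an assumption added for comparison, and the maximumEditDistance cutoff described in the javadocs is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(text: str, target: str, use_longest: bool = False) -> float:
    """1 - (editDistance / length).

    length is the shorter term by default (the scaling quoted from the
    javadocs) or the longer term when use_longest is True (the alternative
    scaling). Illustrative only: no maximumEditDistance cutoff is applied.
    """
    dist = levenshtein(text, target)
    lengths = (len(text), len(target))
    length = max(lengths) if use_longest else min(lengths)
    if length == 0:
        # Avoid dividing by zero when a term is empty (assumed behaviour).
        return 1.0 if dist == 0 else 0.0
    return 1.0 - dist / length


if __name__ == "__main__":
    for term, target in [("Cinamo", "Cinema"),
                         ("Cinamo", "Cinnamon"),
                         ("Strawbarr", "Strawberry")]:
        print(term, target,
              round(similarity(term, target), 3),
              round(similarity(term, target, use_longest=True), 3))
```

For "Strawbarr" against "Strawberry" the shortest-term scaling gives about 0.78 and the longest-term scaling gives 0.8, which is why a 0.8 threshold rejects the match under the first and would accept it under the second.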