There is a mismatch between the score for a wildcard match and an exact match

Paul Taylor Fri, 09 Mar 2012 02:40:47 -0800

There is a mismatch between the score for a wildcard match and an exact match.


I search for


|recording:luve OR recording:luve*
|

And here is the Explain Output from Search

|DocNo:0:1.4196585:11111111-1cf0-4d1f-aca7-2a6f89e34b36
1.4196585 = (MATCH) max plus 0.1 times others of:
  0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
    1.0 = boost
    0.3763506 = queryNorm
  1.3820235 = (MATCH) weight(recording:luve in 0), product of:
    0.7211972 = queryWeight(recording:luve), product of:
      1.9162908 = idf(docFreq=1, maxDocs=5)
      0.3763506 = queryNorm
    1.9162908 = (MATCH) fieldWeight(recording:luve in 0), product of:
      1.0 = tf(termFreq(recording:luve)=1)
      1.9162908 = idf(docFreq=1, maxDocs=5)
      1.0 = fieldNorm(field=recording, doc=0)

DocNo:1:0.3763506:22222222-1cf0-4d1f-aca7-2a6f89e34b36
0.3763506 = (MATCH) max plus 0.1 times others of:
  0.3763506 = (MATCH) ConstantScore(recording:luve*), product of:
    1.0 = boost
    0.3763506 = queryNorm
|
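
The arithmetic in the explain output above can be re-derived by hand in a few lines of plain Java. This is only a sketch of the formulas the output implies, not Lucene code: the class and method names are made up for illustration, and queryNorm and idf are simply copied from the output rather than derived.

```java
// Hypothetical helper (not part of Lucene): re-derives the explain numbers by hand.
class ExplainArithmetic {
    // Term match: queryWeight (idf * queryNorm) times fieldWeight (tf * idf * fieldNorm).
    static double termScore(double tf, double idf, double fieldNorm, double queryNorm) {
        double queryWeight = idf * queryNorm;
        double fieldWeight = tf * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // Constant-score match: boost times queryNorm; tf, idf and fieldNorm play no part.
    static double constantScore(double boost, double queryNorm) {
        return boost * queryNorm;
    }

    // "max plus 0.1 times others of": best clause plus tieBreaker times the remaining clauses.
    static double disMax(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double queryNorm = 0.3763506; // copied from the explain output
        double idf = 1.9162908;       // copied from the explain output
        double wildcard = constantScore(1.0, queryNorm);
        double exact = termScore(1.0, idf, 1.0, queryNorm);
        System.out.println("doc 0: " + disMax(0.1, wildcard, exact)); // exact + wildcard clauses
        System.out.println("doc 1: " + disMax(0.1, wildcard));        // wildcard clause only
    }
}
```

The point of writing it out is that idf enters the exact-match clause twice (once in queryWeight, once in fieldWeight) and the wildcard clause not at all.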

In my test I have 5 documents: one contains an exact match, another a wildcard match, and the other three do not match at all. The score of the exact match is *1.4* compared to *0.37* for the wildcard match; that's nearly a factor of *4*. With a much larger index, the gap between an exact match on a rare term and a wildcard match would be even greater.

The whole difference is due to the different scoring mechanism used for wildcard matches versus exact matches: wildcards don't take tf/idf or lengthnorm into account, you just get a constant score for each match. Now, I'm not bothered about tf or lengthnorm; in my data domain they don't make much difference. But the *idf* score is a real killer. Because the matching term is found in one of 5 documents, its idf contribution is idf squared, i.e. roughly *3.61* (taking idf as 1.9).
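
To see how quickly that idf-squared factor grows with index size, here is a small sketch. It assumes the classic DefaultSimilarity formula, idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduces the idf(docFreq=1, maxDocs=5) = 1.9162908 line in the explain output; the class name and the larger index sizes are invented for illustration.

```java
// Hypothetical helper (not part of Lucene): idf and the exact-vs-wildcard score ratio.
class IdfGrowth {
    // Classic idf, assumed here: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log(maxDocs / (double) (docFreq + 1));
    }

    // With tf = 1, fieldNorm = 1 and the same queryNorm on both sides,
    // (exact term score) / (constant wildcard score) reduces to idf squared.
    static double exactToWildcardRatio(int docFreq, int maxDocs) {
        double v = idf(docFreq, maxDocs);
        return v * v;
    }

    public static void main(String[] args) {
        // Index sizes other than 5 are made-up examples.
        for (int maxDocs : new int[] {5, 1_000, 1_000_000}) {
            System.out.printf("maxDocs=%d idf=%.4f ratio=%.2f%n",
                    maxDocs, idf(1, maxDocs), exactToWildcardRatio(1, maxDocs));
        }
    }
}
```

For the 5-document test index the unrounded factor comes out at about 3.67, slightly above the 3.61 obtained from idf rounded to 1.9, and it keeps growing with the logarithm of the index size.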

I know this constant score is quicker than calculating tf*idf*lengthnorm for each wildcard match, but it doesn't make sense to me for the idf to contribute so much to the score. I also know I can change the rewrite method, but there are two problems with this.


1. Scoring rewrite methods perform less well because they are
   calculating idf, tf and lengthnorm. idf is the only value I need.

2. Ones that do calculate the score don't make much sense either, as they
   would calculate the idf of the matching term even though this isn't
   what was actually searched for, and this term could be rarer than what
   I was actually searching for, possibly boosting it higher than the
   exact match.
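
The second problem can be illustrated with made-up numbers. The sketch below assumes the classic idf formula, 1 + ln(maxDocs / (docFreq + 1)), and that a scoring rewrite weights each expanded term by its own idf squared (tf = 1, fieldNorm = 1, queryNorm ignored). The document frequencies and the index size are hypothetical, not taken from the test index.

```java
// Hypothetical illustration (not Lucene code) of a rarer expanded term outscoring the exact term.
class ScoringRewriteProblem {
    // Classic idf, assumed here: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log(maxDocs / (double) (docFreq + 1));
    }

    // Per-term weight before queryNorm, with tf = 1 and fieldNorm = 1: idf squared.
    static double termWeight(int docFreq, int maxDocs) {
        double v = idf(docFreq, maxDocs);
        return v * v;
    }

    public static void main(String[] args) {
        int maxDocs = 1_000;      // made-up index size
        int luveDocFreq = 20;     // made-up: the term actually searched for
        int luvelyDocFreq = 1;    // made-up: a rarer term matched only by luve*
        System.out.println("exact 'luve':      " + termWeight(luveDocFreq, maxDocs));
        System.out.println("wildcard 'luvely': " + termWeight(luvelyDocFreq, maxDocs));
    }
}
```

With these numbers the wildcard-only hit on 'luvely' gets the larger weight, which is the opposite of the ordering wanted.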

(I could also change the similarity class to override the idf calculation so it always returns 1, but that doesn't make sense because the idf is very useful for comparing exact matches to different words,


i.e. recording:luve OR recording:luve* OR recording:the OR recording:the*

I would want matches to *luve* to score higher than matches to the common word *the*.)

So does a rewrite method already exist, or is it possible for one to just calculate the idf of the term it was trying to match? For example, in this case I search for 'luve' and the wildcard matches on 'luvely'; it would multiply the luvely match by the idf contribution of luve (3.61). This way my wildcard match would be comparable to the exact match, and I can just change my query to boost the exact match slightly, so that an exact match would always score higher than a wildcard match but not too much higher,


i.e.

|recording:luve^1.2 OR recording:luve*
|

and with this mythical rewrite method this would give (depending on queryNorm):


 * Doc 0:0:1.692
 * Doc 1:0:1.419
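
As a rough sketch of what such a rewrite would compute: every wildcard hit is weighted by the idf squared of the term that was typed ('luve') rather than of the term that matched, and the exact clause carries the 1.2 boost. Everything here is hypothetical, including the class and method names. It assumes the classic idf formula, keeps the "max plus 0.1 times others" combination from the explain output, and holds queryNorm fixed at 0.3763506, although a real boost would change queryNorm.

```java
// Hypothetical sketch of the proposed rewrite; no such method is claimed to exist in Lucene.
class QueryTermIdfRewrite {
    // Classic idf, assumed here: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log(maxDocs / (double) (docFreq + 1));
    }

    // Wildcard hit scored with the idf of the *query* term, not the matched term.
    static double wildcardScore(int queryTermDocFreq, int maxDocs, double queryNorm) {
        double v = idf(queryTermDocFreq, maxDocs);
        return v * v * queryNorm;
    }

    // Exact hit: same idf-squared weight, multiplied by the clause boost (tf = 1, fieldNorm = 1).
    static double exactScore(double boost, int docFreq, int maxDocs, double queryNorm) {
        double v = idf(docFreq, maxDocs);
        return boost * v * v * queryNorm;
    }

    // Best clause plus tieBreaker times the remaining clauses.
    static double disMax(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double queryNorm = 0.3763506; // held fixed: an assumption
        double wildcard = wildcardScore(1, 5, queryNorm);
        double exact = exactScore(1.2, 1, 5, queryNorm);
        System.out.println("doc 0 (exact + wildcard): " + disMax(0.1, exact, wildcard));
        System.out.println("doc 1 (wildcard only):    " + disMax(0.1, wildcard));
    }
}
```

The absolute values depend on queryNorm and on how the clauses are combined, so they need not equal the rough figures above; what the sketch shows is that the exact match stays ahead by a modest, boost-controlled margin instead of by a factor of nearly 4.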
