Hi, I have a question about the current Lucene scoring algorithm. In this algorithm, the term frequency component is calculated as the square root of the term's number of occurrences in the document, as described in
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html#formula_tf

Having read a number of IR papers and IR books, I am quite familiar with using log to normalise term frequency, in order to prevent very high term frequencies from having too much of an effect on the score. However, what exactly is the advantage of using square root instead of log? Is there a scientific reason behind this choice? Does anybody know of a paper on this issue, or any source of empirical evidence that square root works better than log? Or is there perhaps another discussion thread here which I have not seen?

Thank you in advance,
Karl
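P.S. To make the comparison concrete, here is a minimal sketch, assuming the Lucene 2.x-era API that the Javadoc above describes, where DefaultSimilarity.tf(freq) returns sqrt(freq). The subclass below swaps in the textbook 1 + log(freq) damping; the class name LogTfSimilarity is just my own for illustration:

    import org.apache.lucene.search.DefaultSimilarity;

    // DefaultSimilarity.tf(freq) returns (float) Math.sqrt(freq).
    // This subclass uses the common IR-textbook alternative 1 + ln(freq),
    // which dampens high frequencies harder: for freq = 100,
    // sqrt gives 10.0 while 1 + ln gives roughly 5.6.
    public class LogTfSimilarity extends DefaultSimilarity {
        @Override
        public float tf(float freq) {
            // freq can be 0 on some code paths; avoid log(0) = -infinity.
            return freq > 0 ? (float) (1.0 + Math.log(freq)) : 0.0f;
        }
    }

If I understand the API correctly, plugging this in is just a matter of calling searcher.setSimilarity(new LogTfSimilarity()) before searching, so the two damping functions could be compared empirically on the same index.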