Hi Rich, SeetSpotSimilarity looks promising. Does it not favor shorter docs by not > normalizing or does it make some attempt to standardized. > > > - using e.g. SeetSpotSimilarity which do not favor shorter documents. >
SweetSpotSimilarity (I misspelled it previously) defines a range of lengths which is not normalized (this is the sweet spot) and a slop by which longer or shorter documents are "punished" relative to the gap from the defined range. It takes parameters for defining the range and the slop. See here: http://lucene.apache.org/java/3_1_0/api/contrib-misc/org/apache/lucene/misc/SweetSpotSimilarity.html I stumbled upon the 'Explain' function yesterday though it returns a crowded > message using debug in SOLR admin. Is there another method or interface > which returns more or cleaner info? > I am not familiar with the use of Solr for this, I hope someone else will answer this... > >>>> (Number of times the term appears in a document) > >>>> / (Total Number of terms in that document) > >>>> * Log10(Number of total documents > >>>> / Number of times search term appears in all > documents) > If I read this correctly it seems the part about dividing by (Total Number of terms in that document) does not describe the default scoring in Lucene, where this value, aka the doc length, only takes part in the length normalization. This is also explained here: http://lucene.apache.org/java/3_1_0/api/core/org/apache/lucene/search/Similarity.html Once you are able to actually compare the score computation details (Explain) for such two documents I think you'll be in better state for identifying what really hurts the search quality. Regards, Doron