On 04/04/2013 23:26, Chris Hostetter wrote:
: At index time I boost the alias field of a small set of documents, setting the
: boost to 2.0f, which I thought meant equivalent to doubling the score this doc
: would get over another doc, everything else being equal.
1) You haven't shown us enough details to be certain, but based on the
code you've provided it looks like you are adding a boost for *each* field
instance named "alias" if the value of artistGuid is in your
artistGuIdSet...
: if(artistGuIdSet.contains(artistGuid)) {
: for(IndexableField indexablefield:doc.getFields())
: {
: if(indexablefield.name().equals(ArtistIndexField.ALIAS.getName()))
: {
: Field field = (Field)indexablefield;
: field.setBoost(ARTIST_DOC_BOOST);
...so a doc with N values in the "alias" field is going to get a field
boost of 2^N, since the boosts of same-named field instances are multiplied
together.
I was converting a document boost from Lucene 3 code. For a particular
document I only call setBoost() once; however, the problem artists do
have a number of aliases. I thought that when you add multiple values
independently to one field it's still treated as one field. Is Lucene 4
now treating them as separate fields, so that I end up calling
field.setBoost() for each alias I have added to the alias field?
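To make the question concrete, here is a minimal standalone sketch of how per-instance boosts would compound. It assumes that Lucene multiplies together the boosts of all field instances sharing a name within one document, which is my reading of the Field.setBoost() documentation rather than something verified against this index. No Lucene classes are used; AliasBoostSketch and its methods are hypothetical names for illustration only.

```java
import java.util.Arrays;

// Standalone illustration only: no Lucene classes are involved.
public class AliasBoostSketch {

    // Assumption: boosts of same-named field instances are multiplied
    // together into a single field boost for the document.
    public static float combinedFieldBoost(float[] instanceBoosts) {
        float combined = 1.0f;
        for (float boost : instanceBoosts) {
            combined *= boost;
        }
        return combined;
    }

    // Boost every alias instance, as the loop over doc.getFields() does.
    public static float[] boostEveryInstance(int aliasCount, float boost) {
        float[] boosts = new float[aliasCount];
        Arrays.fill(boosts, boost);
        return boosts;
    }

    // Boost only the first alias instance; the rest keep the default 1.0f.
    public static float[] boostFirstInstanceOnly(int aliasCount, float boost) {
        float[] boosts = new float[aliasCount];
        Arrays.fill(boosts, 1.0f);
        if (aliasCount > 0) {
            boosts[0] = boost;
        }
        return boosts;
    }

    public static void main(String[] args) {
        int aliasCount = 10; // hypothetical artist with ten aliases
        // Every instance boosted: 2.0 multiplied ten times.
        System.out.println(combinedFieldBoost(boostEveryInstance(aliasCount, 2.0f)));
        // Only one instance boosted: the combined boost stays at 2.0.
        System.out.println(combinedFieldBoost(boostFirstInstanceOnly(aliasCount, 2.0f)));
    }
}
```

If that assumption holds, calling setBoost() on just one of the alias field instances would be one way to keep the overall field boost at 2.0 regardless of how many aliases the artist has.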
2) Looking at the URL you mentioned
: http://search.musicbrainz.org/?type=artist&query=Jean&explain=true
...the debug explanation currently produced by that URL says...
6.4894321E10 = (MATCH) weight(alias:jean in 7610) [MusicbrainzSimilarity], result of:
...
7.5161928E9 = fieldNorm(doc=7610)
You need to look at your "MusicbrainzSimilarity" class and its fieldNorm
method to determine for certain why it's producing such large values. We
have no idea how that's implemented.
The MusicbrainzSimilarity class aims to solve another issue with
aliases: a field with many aliases is at a disadvantage in scoring
compared with one with few aliases. I don't think I'm doing anything silly
regarding the boost here, am I?
package org.musicbrainz.search.analysis;
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.DefaultSimilarity;
/**
 * Calculates a score for a match, overridden to deal with problems with alias
 * fields in artist and label indexes.
 */
//TODO in Lucene 4.1 we can now use PerFieldSimilarityWrapper so that we only
//perform this on fields that need it; with the current code tf() is performed
//on every field because we are not passed the field name
public class MusicbrainzSimilarity extends DefaultSimilarity
{
    /**
     * Calculates a value which is inversely proportional to the number of
     * terms in the field. When multiple aliases are added to an artist (or
     * label) it is seen as one field, so artists with many aliases can be
     * disadvantaged when the matching alias is radically different to
     * other aliases.
     *
     * @param state the state of the field being indexed (name, length, boost)
     * @return the length norm for the field
     */
    @Override
    public float lengthNorm(FieldInvertState state) {
        if (state.getName().equals("alias"))
        {
            if (state.getLength() >= 3) {
                //Same result as the normal calc if the field had three terms,
                //the most common scenario
                return state.getBoost() * 0.578f;
            }
            else
            {
                return super.lengthNorm(state);
            }
        }
        else
        {
            return super.lengthNorm(state);
        }
    }
    /**
     * This method calculates a value based on how many times the search term
     * was found in the field. Because we have only short fields, the only real
     * case (apart from rare exceptions like Duran Duran Duran) whereby the
     * search term is found more than twice would be when a search term matches
     * multiple aliases. To remove the bias this gives towards artists/labels
     * with many aliases we limit the value to what would be returned for a
     * two term match.
     *
     * Note: would prefer to do this just for alias field, but the field is not
     * passed as a parameter.
     *
     * @param freq the number of times the term occurs in the field
     * @return score component
     */
    @Override
    public float tf(float freq) {
        if (freq > 2.0f) {
            return 1.41f; //Same result as if matched term twice
        } else {
            return super.tf(freq);
        }
    }
}
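
For reference, the two constants in the class can be checked with a small standalone sketch. It assumes the DefaultSimilarity formulas are lengthNorm = boost * 1/sqrt(numTerms) and tf = sqrt(freq), and it ignores the overlap discounting that the real class can apply. SimilarityConstantsSketch is a hypothetical name and no Lucene classes are used; the 1024.0f boost is just an illustrative large value.

```java
// Standalone illustration only: simplified versions of the default formulas.
public class SimilarityConstantsSketch {

    // Length norm in the style of DefaultSimilarity: boost * 1/sqrt(numTerms).
    public static float lengthNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    // Term frequency factor in the style of DefaultSimilarity: sqrt(freq).
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // A three-term field with no boost: roughly the 0.578f constant.
        System.out.println(lengthNorm(1.0f, 3));
        // A term matched twice: roughly the 1.41f constant.
        System.out.println(tf(2.0f));
        // The norm scales linearly with the field boost, so an inflated
        // boost (1024.0f here, purely illustrative) inflates the norm too.
        System.out.println(lengthNorm(1024.0f, 3));
    }
}
```

Since lengthNorm() multiplies by state.getBoost() in both branches, the overridden method passes any field boost straight through to the norm; the constants themselves only replace the 1/sqrt(numTerms) and sqrt(freq) parts.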
-Hoss