On 04/04/2013 23:26, Chris Hostetter wrote:
: At index time I boost the alias field of a small set of documents, setting the
: boost to 2.0f, which I thought meant equivalent to doubling the score this doc
: would get over another doc, everything else being equal.

1) you haven't shown us enough details to be certain, but based on the
code you've provided it looks like you are adding a boost for *each* field
instance named "alias" if the value of artistGuid is in your
artistGuIdSet...

:         if(artistGuIdSet.contains(artistGuid)) {
:             for(IndexableField indexablefield:doc.getFields())
:             {
: if(indexablefield.name().equals(ArtistIndexField.ALIAS.getName()))
:                 {
:                     Field field = (Field)indexablefield;
:                     field.setBoost(ARTIST_DOC_BOOST);

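...so a doc with N values in the "alias" field is going to get a combined
field boost of 2^N rather than 2, because boosts on field instances that
share a name are multiplied together.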

I was converting a document boost from Lucene 3 code. For a particular
document I only call setBoost() once; however, the problem artists do have a
number of aliases. I thought that when you add multiple values independently
to one field it is still treated as one field, but is Lucene 4 now treating
them as separate fields, so that I end up calling field.setBoost() once for
each alias I have added to the alias field?

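To make the per-instance boost question concrete, here is a minimal sketch in
plain Java with no Lucene dependency. It assumes, as the Lucene javadoc for
field boosts describes, that the boosts of all field instances sharing a name
are multiplied together at index time. The class and method names are
hypothetical and only model the arithmetic.

```java
// Hypothetical model, not Lucene code: how boosts set on several
// same-named field instances combine if they are multiplied together.
public class BoostCompounding {

    // Combined boost when every one of `instances` field instances
    // is given `perInstanceBoost`.
    public static float combinedBoost(int instances, float perInstanceBoost) {
        float boost = 1.0f;
        for (int i = 0; i < instances; i++) {
            boost *= perInstanceBoost;
        }
        return boost;
    }

    // Combined boost when only the first instance carries the boost
    // and the remaining instances keep the default boost of 1.0.
    public static float boostFirstOnly(int instances, float perInstanceBoost) {
        return instances > 0 ? perInstanceBoost : 1.0f;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 10}) {
            System.out.println(n + " aliases: every instance boosted = "
                    + combinedBoost(n, 2.0f)
                    + ", first instance only = " + boostFirstOnly(n, 2.0f));
        }
    }
}
```

Under that assumption, calling setBoost() on only one "alias" instance per
document would keep the effective field boost at 2.0 no matter how many
aliases an artist has.
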
2) Looking at the URL you mentioned

: http://search.musicbrainz.org/?type=artist&query=Jean&explain=true

...the debug explanation currently produced by that URL says...

6.4894321E10 = (MATCH) weight(alias:jean in 7610) [MusicbrainzSimilarity], result of:
    ...
    7.5161928E9 = fieldNorm(doc=7610)

You need to look at your "MusicbrainzSimilarity" class and its fieldNorm
method to determine for certain why it's producing such large values.  we
have no idea how that's implemented.

The MusicbrainzSimilarity class aims to solve another issue with aliases: a
field with many aliases is at a disadvantage in scoring against one with few
aliases. I don't think I'm doing anything silly regarding the boost here, am I?

package org.musicbrainz.search.analysis;

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.index.Norm;
import org.apache.lucene.search.similarities.DefaultSimilarity;

/**
 * Calculates a score for a match, overridden to deal with problems with alias
 * fields in artist and label indexes
 */
//TODO in Lucene 4.1 we can now use PerFieldSimilarityWrapper so that we only
//perform this on fields that need it; with the current code tf() is performed
//on every field because we are not passed the field name


public class MusicbrainzSimilarity extends DefaultSimilarity
{
   /**
     * Calculates a value which is inversely proportional to the number of
     * terms in the field. When multiple aliases are added to an artist (or
     * label) it is seen as one field, so artists with many aliases can be
     * disadvantaged when the matching alias is radically different to
     * other aliases.
     *
     * @param state
     * @return
     */
    @Override
    public float lengthNorm(FieldInvertState state) {

        if (state.getName().equals("alias"))
        {
            if(state.getLength()>=3) {
                // Same result as the normal calculation if the field had three
                // terms, the most common scenario
                return state.getBoost() * 0.578f;
            }
            else
            {
                return super.lengthNorm(state);
            }
        }
        else
        {
            return super.lengthNorm(state);
        }
    }

    /**
     * This method calculates a value based on how many times the search term
     * was found in the field. Because we have only short fields, the only real
     * case (apart from rare exceptions like Duran Duran Duran) whereby the
     * term is found more than twice would be when a search term matches
     * multiple aliases. To remove the bias this gives towards artists/labels
     * with many aliases we limit the value to what would be returned for a
     * two term match.
     *
     * Note: would prefer to do this just for alias field, but the field is not
     * passed as a parameter.
     * @param freq
     * @return score component
     */
    @Override
    public float tf(float freq) {
        if (freq > 2.0f) {
            return 1.41f; //Same result as if matched term twice

        } else {
            return super.tf(freq);
        }
    }
}
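
The two constants in the class above can be checked with a small plain-Java
sketch. It does not use Lucene; it assumes DefaultSimilarity's usual
definitions, tf = sqrt(freq) and lengthNorm = boost / sqrt(numTerms), and the
class and method names are hypothetical stand-ins for the overrides.

```java
// Plain-Java sketch (no Lucene dependency) of the arithmetic behind
// the MusicbrainzSimilarity overrides. Assumed default formulas:
// tf = sqrt(freq), lengthNorm = boost / sqrt(numTerms).
public class SimilarityConstants {

    public static float defaultTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static float defaultLengthNorm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    // Mirrors the capped tf(): frequencies above 2 score like a frequency of 2.
    public static float cappedTf(float freq) {
        return freq > 2.0f ? 1.41f : defaultTf(freq);
    }

    // Mirrors the alias lengthNorm(): fields with three or more terms
    // score as if they had exactly three terms.
    public static float aliasLengthNorm(float boost, int numTerms) {
        return numTerms >= 3 ? boost * 0.578f : defaultLengthNorm(boost, numTerms);
    }

    public static void main(String[] args) {
        System.out.println("1/sqrt(3) = " + defaultLengthNorm(1.0f, 3));
        System.out.println("sqrt(2)   = " + defaultTf(2.0f));
        System.out.println("alias norm, boost 2, 10 terms = " + aliasLengthNorm(2.0f, 10));
        System.out.println("alias norm, boost 1024, 10 terms = " + aliasLengthNorm(1024.0f, 10));
    }
}
```

Note that the length norm still multiplies in the field boost unchanged, so
whatever boost reaches lengthNorm(), including one compounded over many alias
instances, passes straight through into the stored norm.
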


-Hoss

