On 28/01/2012 11:22, Uwe Schindler wrote:
-----Original Message-----
From: Paul Taylor [mailto:paul_t...@fastmail.fm]
Sent: Saturday, January 28, 2012 10:33 AM
To: 'java-user@lucene.apache.org'
Subject: Does Fuzzy Search scores the same as Exact Match

All things being equal, does a fuzzy match give the same score as an
exact match?
I.e. if I do a search for farmin and it matches two docs, one on the term
farmin and the other on the term farming, will it score farming higher or
score both the same?

YES, depends on the Fuzzy configuration (rewrite method,...), but
the default does so!

Uwe


So how do I change it? It seems like a funny default to have.
Maybe I was not clear, it should score "farming" higher than "farmin" by
default, but the default rewrite mode also takes TF/IDF into account (in
addition).
Maybe there was some confusion in your original question, so to make it
clear: if you search for "farming", "farming" (exact match) should score
higher than "farmin" (distance 1). With the default rewrite mode this holds
for the boost, but if the typo is rarer in the corpus, TF-IDF can still
change the final score. You can prevent that by using a rewrite mode that
takes *only* the Levenshtein distance as an inverse boost and does not use
TF-IDF =>  http://goo.gl/0eJ47

You can change that by a different rewrite method:

The default is: http://goo.gl/JhHOA (which combines the standard vector
model with additional boosting of exact matches - we have that for
backwards compatibility only, it's not what most users expect)

The better one is: http://goo.gl/0eJ47, which does not take TF/IDF into
account and only boosts by Levenshtein distance.

You can disable fuzzy boosting altogether:
Additionally http://goo.gl/VWlkW provides two other scoring models (TF/IDF
only, no boosting - or constant score at all)
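
To make "boosts by Levenshtein distance" concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class and method names are hypothetical, and the formula (similarity = 1 - distance / length of the shorter term, then rescaled so that the minimum similarity maps to 0) is an assumption rather than a quote from the Lucene source. It was chosen because, with a minimum similarity of 0.5, it yields 0.75 for "republic" and about 0.778 for "republice" against "republica", which matches the ratios 0.6/0.8 and 0.62222224/0.8 in the explain output quoted later in this thread.

```java
/** Sketch: turn an edit distance into a fuzzy-match boost (assumed formula, not Lucene code). */
final class FuzzyBoostSketch {

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity: 1 - distance / length of the shorter term (assumption). */
    static float similarity(String queryTerm, String indexTerm) {
        int shorter = Math.min(queryTerm.length(), indexTerm.length());
        if (shorter == 0) {
            return queryTerm.equals(indexTerm) ? 1.0f : 0.0f;
        }
        return 1.0f - (float) editDistance(queryTerm, indexTerm) / shorter;
    }

    /** Rescale so that minSimilarity maps to 0 and an exact match maps to 1 (assumption). */
    static float boost(String queryTerm, String indexTerm, float minSimilarity) {
        return (similarity(queryTerm, indexTerm) - minSimilarity) / (1.0f - minSimilarity);
    }

    public static void main(String[] args) {
        for (String term : new String[] {"republica", "republic", "republice"}) {
            System.out.println(term + " -> " + boost("republica", term, 0.5f));
        }
    }
}
```

Multiplying each of these boosts by the per-field boost (0.8 for sortname, 1.6 for artist in the later example) gives the clause boosts that the boost-only rewrite attaches to each expanded term.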

Uwe


Hi

Using the rewrite method you suggested for the fuzzy query, new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100), the query idf is not considered, which makes sense so that rare query terms aren't boosted. But neither is the idf or field norm of the matching document, and this seems wrong because these still seem relevant. The end result is that I get a lot of identical scores when I normalize the scores, and a match on one term in a two-term field scores no better than a match on one term in a three-term field, which doesn't seem right.

In contrast when I don't change the rewrite I get a better spread of scores, but unfortunately what clearly seems to be the best document doesn't always match because of the query idf problem.

Isn't there a way to get something in between these two extremes, to keep the field weight part of the calculation that you get with the default, multiplied by ConstantScore instead of queryWeight?
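
As a rough illustration of what such an in-between score would look like, here is a plain-Java sketch with hypothetical names. It is not an existing Lucene rewrite method. It assumes the document-side field weight is tf * idf * fieldNorm with tf = sqrt(termFreq) and fieldNorm = 1/sqrt(number of terms in the field), and it ignores the lossy encoding Lucene applies to stored norms.

```java
/** Sketch comparing boost-only scoring with a hypothetical in-between variant. */
final class HybridScoreSketch {

    /** Length norm, assumed to be 1/sqrt(terms in field); ignores lossy norm encoding. */
    static double fieldNorm(int termsInField) {
        return 1.0 / Math.sqrt(termsInField);
    }

    /** Boost-only clause score: the matching document's field plays no part. */
    static double constantScore(double boost, double queryNorm) {
        return boost * queryNorm;
    }

    /** Hypothetical in-between: boost-only query side times the document-side field weight. */
    static double hybridScore(double boost, double queryNorm,
                              int termFreq, double idf, int termsInField) {
        double fieldWeight = Math.sqrt(termFreq) * idf * fieldNorm(termsInField);
        return constantScore(boost, queryNorm) * fieldWeight;
    }

    public static void main(String[] args) {
        // Illustrative values borrowed from the explain output in this message.
        double boost = 1.6;
        double queryNorm = 0.42322022;
        double idf = 1.6931472;
        System.out.println("constant score, any field length: " + constantScore(boost, queryNorm));
        System.out.println("hybrid, two-term field:   " + hybridScore(boost, queryNorm, 1, idf, 2));
        System.out.println("hybrid, three-term field: " + hybridScore(boost, queryNorm, 1, idf, 3));
    }
}
```

With the boost-only score both documents tie, whereas the hybrid variant ranks the two-term field above the three-term field, and the rarity of the query term never enters the query side.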

I have some example explain output below.
The original search is for 'República'. From that I construct a disjunction query over two fields (artist and sortname), and for each field I create a fuzzy and a wildcard query (the wildcard is not relevant to this question).

With New rewrite method:
DocNo:1:0.87149507:22222222-1cf0-4d1f-aca7-2a6f89e34b36:0.7922682 = (MATCH) custom((() | () | (ConstantScore(sortname:republic)^0.6 ConstantScore(sortname:republica)^0.8 ConstantScore(sortname:republice)^0.62222224) | ConstantScore(sortname:republica*^0.64000005)^0.64000005 | (ConstantScore(artist:republic)^1.2 ConstantScore(artist:republica)^1.6 ConstantScore(artist:republice)^1.2444445) | ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
  0.7922682 = (MATCH) max plus 0.1 times others of:
    0.33857617 = (MATCH) sum of:
      0.33857617 = (MATCH) ConstantScore(sortname:republica)^0.8, product of:
        0.8 = boost
        0.42322022 = queryNorm
    0.27086097 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
      0.64000005 = boost
      0.42322022 = queryNorm
    0.67715234 = (MATCH) sum of:
      0.67715234 = (MATCH) ConstantScore(artist:republica)^1.6, product of:
        1.6 = boost
        0.42322022 = queryNorm
    0.54172194 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
      1.2800001 = boost
      0.42322022 = queryNorm
  1.0 = queryBoost
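
The arithmetic in the explain output above can be reproduced with a few lines of plain Java. This is a sketch with hypothetical class and method names, not Lucene code: each ConstantScore clause is boost * queryNorm, and the disjunction combines its clauses as "max plus tie times others" with the tie value 0.1 shown in the output.

```java
/** Sketch: reproduce the "max plus 0.1 times others" arithmetic from the explain output. */
final class DisMaxScoreSketch {

    /** Score of a constant-score clause as printed in the explain output. */
    static float constantScore(float boost, float queryNorm) {
        return boost * queryNorm;
    }

    /** Best clause score plus tie times the sum of all the other clause scores. */
    static float disMax(float tie, float... clauseScores) {
        if (clauseScores.length == 0) {
            return 0f;
        }
        float max = Float.NEGATIVE_INFINITY;
        float sum = 0f;
        for (float s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        float queryNorm = 0.42322022f;
        float score = disMax(0.1f,
                constantScore(0.8f, queryNorm),         // sortname fuzzy clause
                constantScore(0.64000005f, queryNorm),  // sortname wildcard clause
                constantScore(1.6f, queryNorm),         // artist fuzzy clause
                constantScore(1.2800001f, queryNorm));  // artist wildcard clause
        System.out.println(score);
    }
}
```

Nothing in this calculation depends on the matching document: every document matching the same expanded terms gets the same value, which is where the identical scores come from.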

With Default Rewrite Method:
DocNo:1:1.2145596:22222222-1cf0-4d1f-aca7-2a6f89e34b36:1.104145 = (MATCH) custom((() | () | (sortname:republic^0.6 sortname:republica^0.8 sortname:republice^0.62222224) | ConstantScore(sortname:republica*^0.64000005)^0.64000005 | (artist:republic^1.2 artist:republica^1.6 artist:republice^1.2444445) | ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
  1.104145 = (MATCH) max plus 0.1 times others of:
    0.5056261 = (MATCH) sum of:
      0.5056261 = (MATCH) weight(sortname:republica^0.8 in 1), product of:
        0.29863092 = queryWeight(sortname:republica^0.8), product of:
          0.8 = boost
          1.6931472 = idf(docFreq=2, maxDocs=6)
          0.22047028 = queryNorm
        1.6931472 = (MATCH) fieldWeight(sortname:republica in 1), product of:
          1.0 = tf(termFreq(sortname:republica)=1)
          1.6931472 = idf(docFreq=2, maxDocs=6)
          1.0 = fieldNorm(field=sortname, doc=1)
    0.14110099 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
      0.64000005 = boost
      0.22047028 = queryNorm
    1.0112522 = (MATCH) sum of:
      1.0112522 = (MATCH) weight(artist:republica^1.6 in 1), product of:
        0.59726185 = queryWeight(artist:republica^1.6), product of:
          1.6 = boost
          1.6931472 = idf(docFreq=2, maxDocs=6)
          0.22047028 = queryNorm
        1.6931472 = (MATCH) fieldWeight(artist:republica in 1), product of:
          1.0 = tf(termFreq(artist:republica)=1)
          1.6931472 = idf(docFreq=2, maxDocs=6)
          1.0 = fieldNorm(field=artist, doc=1)
    0.28220198 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
      1.2800001 = boost
      0.22047028 = queryNorm
  1.0 = queryBoost
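
For comparison, the term clauses in the default-rewrite output above follow the usual TF-IDF product. The sketch below (plain Java, hypothetical names) reproduces the sortname:republica numbers, assuming idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(termFreq), which is consistent with the values printed in the output.

```java
/** Sketch: reproduce the weight(...) arithmetic of one term clause under the default rewrite. */
final class TfIdfClauseSketch {

    /** Inverse document frequency, assumed to be 1 + ln(maxDocs / (docFreq + 1)). */
    static double idf(int docFreq, int maxDocs) {
        return Math.log((double) maxDocs / (docFreq + 1)) + 1.0;
    }

    /** Query side: boost * idf * queryNorm. */
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    /** Document side: tf * idf * fieldNorm, with tf assumed to be sqrt(termFreq). */
    static double fieldWeight(int termFreq, double idf, double fieldNorm) {
        return Math.sqrt(termFreq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double idf = idf(2, 6);                          // docFreq=2, maxDocs=6
        double qw = queryWeight(0.8, idf, 0.22047028);   // sortname:republica^0.8
        double fw = fieldWeight(1, idf, 1.0);            // termFreq=1, fieldNorm=1.0
        System.out.println("idf = " + idf);
        System.out.println("queryWeight = " + qw);
        System.out.println("fieldWeight = " + fw);
        System.out.println("clause score = " + qw * fw);
    }
}
```

Note that idf enters the product twice, once on the query side and once on the document side, so a rare expanded term is rewarded quadratically. The query-side factor is the one the boost-only rewrite removes.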

This is my queryParser Code

package org.musicbrainz.search.servlet;

import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.musicbrainz.search.LuceneVersion;

import java.util.HashMap;
import java.util.Map;

public class DismaxQueryParser {

    public static String IMPOSSIBLE_FIELD_NAME = "\uFFFC\uFFFC\uFFFC";
    private DisjunctionQueryParser dqp;

    public DismaxQueryParser(org.apache.lucene.analysis.Analyzer analyzer) {
        dqp = new DisjunctionQueryParser(IMPOSSIBLE_FIELD_NAME, analyzer);
    }

    public Query parse(String query) throws org.apache.lucene.queryParser.ParseException {

        Query q0 = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":(" + query + ")");
        Query phrase = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":\"" + query + "\"");
        if (phrase instanceof DisjunctionMaxQuery) {
            BooleanQuery bq = new BooleanQuery(true);
            bq.add(q0, BooleanClause.Occur.MUST);
            bq.add(phrase, BooleanClause.Occur.SHOULD);
            return bq;
        }
        else {
            return q0;
        }

    }

    public void addAlias(String field, DismaxAlias dismaxAlias) {
        dqp.addAlias(field, dismaxAlias);
    }

    static class DisjunctionQueryParser extends QueryParser {

        //Only make terms that are this length fuzzy
        private static final int MIN_FIELD_LENGTH_TO_MAKE_FUZZY = 4;
        private static final float FUZZY_SIMILARITY = 0.5f;

        //Reduce boost of wildcard matches compared to fuzzy /exact matches
        private static final float WILDCARD_BOOST_REDUCER = 0.8f;

        public DisjunctionQueryParser(String defaultField, org.apache.lucene.analysis.Analyzer analyzer) {
            super(LuceneVersion.LUCENE_VERSION, defaultField, analyzer);

        }


        protected Map<String, DismaxAlias> aliases = new HashMap<String, DismaxAlias>(3);

        //Field to DismaxAlias
        public void addAlias(String field, DismaxAlias dismaxAlias) {
            aliases.put(field, dismaxAlias);
        }

        protected org.apache.lucene.search.Query getFuzzyQuery(java.lang.String field, java.lang.String termStr, float minSimilarity)
                throws org.apache.lucene.queryParser.ParseException {
            FuzzyQuery fq = (FuzzyQuery) super.getFuzzyQuery(field, termStr, minSimilarity);
            //so that fuzzy query terms do not get an advantage over exact matches just because the query term is rarer
            //fq.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100));
            return fq;
        }

        protected Query getFieldQuery(String field, String queryText, boolean quoted)
                throws org.apache.lucene.queryParser.ParseException {
            //If field is an alias
            if (aliases.containsKey(field)) {
                DismaxAlias a = aliases.get(field);
                DisjunctionMaxQuery q = new DisjunctionMaxQuery(a.getTie());
                boolean ok = false;

                for (String f : a.getFields().keySet()) {

                    //if query can be created for this field and text
                    Query querySub;
                    Query queryWildcard = null;

                    querySub = getFieldQuery(f, queryText, quoted);
                    //Only a single-term query can be expanded into fuzzy and wildcard queries;
                    //the instanceof check avoids a ClassCastException and also covers null
                    if (!quoted && queryText.length() >= MIN_FIELD_LENGTH_TO_MAKE_FUZZY
                            && querySub instanceof TermQuery) {
                        org.apache.lucene.index.Term term = ((TermQuery) querySub).getTerm();
                        queryWildcard = getWildcardQuery(term.field(), term.text() + '*');
                        querySub = getFuzzyQuery(term.field(), term.text(), FUZZY_SIMILARITY);
                    }

                    if (querySub != null) {
                        //if query was quoted but doesn't generate a phrase query we reject it
                        if (
                                (quoted == false) ||
                                        (querySub instanceof PhraseQuery)
                                ) {
                            //Reduce phrase because will have matched both parts giving far too much score differential
                            if(quoted == true) {
                                querySub.setBoost(0.1f);
                            }
                            //Boost as specified
                            else if (a.getFields().get(f) != null) {
                                querySub.setBoost(a.getFields().get(f));
                            }
                            q.add(querySub);
                            ok = true;
                        }
                    }

                    if (queryWildcard != null) {
                        if (a.getFields().get(f) != null) {
                            queryWildcard.setBoost(a.getFields().get(f) * WILDCARD_BOOST_REDUCER);
                        }
                        q.add(queryWildcard);
                    }
                }
                //Something has been added to disjunction query
                return ok ? q : null;

            } else {
                //usual Field
                try {
                    return super.getFieldQuery(field, queryText, quoted);
                } catch (Exception e) {
                    return null;
                }
            }
        }
    }

    static class DismaxAlias {
        public DismaxAlias() {

        }

        private float tie;
        //Field Boosts
        private Map<String, Float> fields;

        public float getTie() {
            return tie;
        }

        public void setTie(float tie) {
            this.tie = tie;
        }

        public Map<String, Float> getFields() {
            return fields;
        }

        public void setFields(Map<String, Float> fields) {
            this.fields = fields;
        }
    }
}

Thanks for any help

Paul
