Re: Does Fuzzy Search scores the same as Exact Match

Paul Taylor Wed, 01 Feb 2012 04:47:04 -0800

On 28/01/2012 11:22, Uwe Schindler wrote:

-----Original Message-----
From: Paul Taylor [mailto:paul_t...@fastmail.fm]
Sent: Saturday, January 28, 2012 10:33 AM
To: 'java-user@lucene.apache.org'
Subject: Does Fuzzy Search scores the same as Exact Match


All things being equal does a fuzzy match give the same score as an
exact match.
i.e. if I do a search for farmin and it matches two docs, one on term
farmin, the other on term farming, will it score farming higher or
score both the same?

YES, depends on the Fuzzy configuration (rewrite method,...), but
the default does so!

Uwe

So how do I change it? It seems like a funny default to have.

Maybe I was not clear, it should score "farming" higher than "farmin" by
default, but the default rewrite mode also takes TF/IDF into account (in
addition).

Maybe there was some confusion in your original question, so to make it clear:
If you search for "farming", "farming" (exact match) should score higher
than "farmin" (distance 1). With the default rewrite mode this is correct for
boosting, but if the typo is rarer in the corpus than the exact term, TF-IDF
can still shift the resulting scores. You can prevent that by using the
rewrite mode that takes *only* the Levenshtein distance as an inverse boost and
does not use TF-IDF => http://goo.gl/0eJ47

You can change that by a different rewrite method:

The default is: http://goo.gl/JhHOA (which combines the standard vector
model with additionally boosting exact matches - we have that for backwards
compatibility only, it's not what most users expect)

The better one is: http://goo.gl/0eJ47, which does not take TF/IDF into
account and only boosts by Levenshtein distance.

You can disable fuzzy boosting altogether:
Additionally http://goo.gl/VWlkW provides two other scoring models (TF/IDF
only, no boosting - or constant score at all)

Uwe
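
To make "boosts by Levenshtein distance" concrete, here is a minimal plain-Java sketch (no Lucene dependency) of how such a boost factor can be derived. The formula is an assumption based on how Lucene 3.x fuzzy matching appears to behave: similarity = 1 - distance / min(term lengths), rescaled from [minSimilarity, 1] to [0, 1]. It is consistent with the explain output later in this thread, where the artist boost of 1.6 becomes 1.2 for "republic" and 1.2444445 for "republice" at a minimum similarity of 0.5. The class and method names are hypothetical.

```java
/** Sketch only: an assumed Lucene 3.x-style fuzzy boost, not the library's code. */
class FuzzyBoostSketch {

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed formula: 1 - distance / length of the shorter term (terms must be non-empty). */
    static double similarity(String queryTerm, String indexedTerm) {
        int shorter = Math.min(queryTerm.length(), indexedTerm.length());
        return 1.0 - (double) editDistance(queryTerm, indexedTerm) / shorter;
    }

    /** Rescales the similarity from [minSimilarity, 1] to a boost factor in [0, 1]. */
    static double boostFactor(String queryTerm, String indexedTerm, double minSimilarity) {
        return (similarity(queryTerm, indexedTerm) - minSimilarity) / (1.0 - minSimilarity);
    }

    public static void main(String[] args) {
        double fieldBoost = 1.6; // artist field boost taken from the explain output
        for (String term : new String[] {"republica", "republic", "republice"}) {
            System.out.println(term + " -> " + fieldBoost * boostFactor("republica", term, 0.5));
        }
    }
}
```

Under these assumptions an exact match keeps the full field boost, and each edit away from the query term shrinks it, independently of how rare the matched term is in the index.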

Hi

Using the rewrite method you suggested for the fuzzy query, new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100), it doesn't consider the query idf, which makes sense so that rare query terms aren't boosted. But neither does it consider the idf or field norm of the matching document, which seems wrong because these still seem relevant. The end result is that I get a lot of identical scores when I normalize the scores, and a match on one term in a two-term field scores no better than a match on one term in a three-term field, which doesn't seem right.

In contrast, when I don't change the rewrite I get a better spread of scores, but unfortunately what clearly seems to be the best document doesn't always score highest because of the query idf problem.

Isn't there a way to get something in between these two extremes: keep the fieldWeight part of the calculation that you get with the default, but multiply it by the ConstantScore value instead of queryWeight?
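
To pin down what is being asked for, here is a plain-Java sketch of the arithmetic only. It is not a Lucene API and does not show how to wire this into a rewrite method; the class and method names are hypothetical. It assumes the default similarity's tf = sqrt(termFreq) and a length norm of 1/sqrt(number of terms in the field), ignoring the lossy one-byte encoding Lucene applies to stored norms.

```java
/** Sketch only: arithmetic of the proposed "in between" score, not a Lucene API. */
class InBetweenScoreSketch {

    /** Boost-only rewrite: every document matching the term gets boost * queryNorm. */
    static double constantScore(double boost, double queryNorm) {
        return boost * queryNorm;
    }

    /** Assumed default length normalisation, before Lucene's lossy norm encoding. */
    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    /** Document side of the default score: tf * idf * fieldNorm. */
    static double fieldWeight(int termFreq, double idf, int numTermsInField) {
        return Math.sqrt(termFreq) * idf * lengthNorm(numTermsInField);
    }

    /** Proposed combination: constant query side multiplied by the document's fieldWeight. */
    static double inBetweenScore(double boost, double queryNorm,
                                 int termFreq, double idf, int numTermsInField) {
        return constantScore(boost, queryNorm) * fieldWeight(termFreq, idf, numTermsInField);
    }

    public static void main(String[] args) {
        // Same term, same boost: one hit in a two-term field vs. one hit in a three-term field.
        System.out.println(inBetweenScore(0.8, 0.42322022, 1, 1.6931472, 2));
        System.out.println(inBetweenScore(0.8, 0.42322022, 1, 1.6931472, 3));
    }
}
```

With the boost-only rewrite both documents would receive the same constantScore; in this sketch the shorter field scores higher, while the query side stays free of idf.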


I have some example explain output below.

The original search is for 'República'. From that I construct a disjunction query over two fields (artist and sortname), and for each field we create a fuzzy and a wildcard query (the wildcard is not relevant to this question).


With New rewrite method:

DocNo:1:0.87149507:22222222-1cf0-4d1f-aca7-2a6f89e34b36:0.7922682 = (MATCH) custom((() | () | (ConstantScore(sortname:republic)^0.6 ConstantScore(sortname:republica)^0.8 ConstantScore(sortname:republice)^0.62222224) | ConstantScore(sortname:republica*^0.64000005)^0.64000005 | (ConstantScore(artist:republic)^1.2 ConstantScore(artist:republica)^1.6 ConstantScore(artist:republice)^1.2444445) | ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
  0.7922682 = (MATCH) max plus 0.1 times others of:
    0.33857617 = (MATCH) sum of:
      0.33857617 = (MATCH) ConstantScore(sortname:republica)^0.8, product of:
        0.8 = boost
        0.42322022 = queryNorm
    0.27086097 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
      0.64000005 = boost
      0.42322022 = queryNorm
    0.67715234 = (MATCH) sum of:
      0.67715234 = (MATCH) ConstantScore(artist:republica)^1.6, product of:
        1.6 = boost
        0.42322022 = queryNorm
    0.54172194 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
      1.2800001 = boost
      0.42322022 = queryNorm
  1.0 = queryBoost
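
The numbers in this explain can be reproduced by hand. The sketch below is plain Java with hypothetical names and no Lucene dependency. It recomputes each ConstantScore clause as boost * queryNorm and combines the four clauses the way the "max plus 0.1 times others" line describes, using only values copied from the explain above.

```java
/** Sketch only: re-derives the boost-only explain from its own numbers. */
class DisMaxExplainSketch {

    /** A ConstantScore clause scores boost * queryNorm for every matching document. */
    static double constantScore(double boost, double queryNorm) {
        return boost * queryNorm;
    }

    /** Disjunction max: the best clause plus tie * (sum of all the other clauses). */
    static double disMax(double tie, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double score : clauseScores) {
            max = Math.max(max, score);
            sum += score;
        }
        return max + tie * (sum - max);
    }

    /** Combines the four matching clauses from the explain with tie = 0.1. */
    static double explainedScore() {
        double queryNorm = 0.42322022;
        return disMax(0.1,
                constantScore(0.8, queryNorm),          // sortname fuzzy, exact term
                constantScore(0.64000005, queryNorm),   // sortname wildcard
                constantScore(1.6, queryNorm),          // artist fuzzy, exact term
                constantScore(1.2800001, queryNorm));   // artist wildcard
    }

    public static void main(String[] args) {
        System.out.println(explainedScore()); // compare with 0.7922682 in the explain
    }
}
```

Nothing about the matching document enters this calculation, which is why documents that hit the same terms end up with identical scores under this rewrite.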

With Default Rewrite Method:

DocNo:1:1.2145596:22222222-1cf0-4d1f-aca7-2a6f89e34b36:1.104145 = (MATCH) custom((() | () | (sortname:republic^0.6 sortname:republica^0.8 sortname:republice^0.62222224) | ConstantScore(sortname:republica*^0.64000005)^0.64000005 | (artist:republic^1.2 artist:republica^1.6 artist:republice^1.2444445) | ConstantScore(artist:republica*^1.2800001)^1.2800001)~0.1), product of:
  1.104145 = (MATCH) max plus 0.1 times others of:
    0.5056261 = (MATCH) sum of:
      0.5056261 = (MATCH) weight(sortname:republica^0.8 in 1), product of:
        0.29863092 = queryWeight(sortname:republica^0.8), product of:
          0.8 = boost
          1.6931472 = idf(docFreq=2, maxDocs=6)
          0.22047028 = queryNorm
        1.6931472 = (MATCH) fieldWeight(sortname:republica in 1), product of:
          1.0 = tf(termFreq(sortname:republica)=1)
          1.6931472 = idf(docFreq=2, maxDocs=6)
          1.0 = fieldNorm(field=sortname, doc=1)
    0.14110099 = (MATCH) ConstantScore(sortname:republica*^0.64000005)^0.64000005, product of:
      0.64000005 = boost
      0.22047028 = queryNorm
    1.0112522 = (MATCH) sum of:
      1.0112522 = (MATCH) weight(artist:republica^1.6 in 1), product of:
        0.59726185 = queryWeight(artist:republica^1.6), product of:
          1.6 = boost
          1.6931472 = idf(docFreq=2, maxDocs=6)
          0.22047028 = queryNorm
        1.6931472 = (MATCH) fieldWeight(artist:republica in 1), product of:
          1.0 = tf(termFreq(artist:republica)=1)
          1.6931472 = idf(docFreq=2, maxDocs=6)
          1.0 = fieldNorm(field=artist, doc=1)
    0.28220198 = (MATCH) ConstantScore(artist:republica*^1.2800001)^1.2800001, product of:
      1.2800001 = boost
      0.22047028 = queryNorm
  1.0 = queryBoost
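
For comparison, the default-rewrite numbers can be re-derived the same way. The plain-Java sketch below uses hypothetical names and assumes the classic default similarity formulas idf = 1 + ln(maxDocs / (docFreq + 1)) and tf = sqrt(termFreq), which agree with the idf(docFreq=2, maxDocs=6) and tf lines in the explain above.

```java
/** Sketch only: re-derives the default-rewrite term weights from the explain. */
class TfIdfExplainSketch {

    /** Assumed classic idf: 1 + ln(maxDocs / (docFreq + 1)). */
    static double idf(int docFreq, int maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    /** Query side of the score: boost * idf * queryNorm. */
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    /** Document side of the score: tf * idf * fieldNorm, with tf = sqrt(termFreq). */
    static double fieldWeight(int termFreq, double idf, double fieldNorm) {
        return Math.sqrt(termFreq) * idf * fieldNorm;
    }

    /** Full term weight; note that idf enters twice, once on each side. */
    static double weight(double boost, int docFreq, int maxDocs,
                         double queryNorm, int termFreq, double fieldNorm) {
        double idf = idf(docFreq, maxDocs);
        return queryWeight(boost, idf, queryNorm) * fieldWeight(termFreq, idf, fieldNorm);
    }

    public static void main(String[] args) {
        double queryNorm = 0.22047028;
        System.out.println(weight(0.8, 2, 6, queryNorm, 1, 1.0)); // compare with 0.5056261 (sortname)
        System.out.println(weight(1.6, 2, 6, queryNorm, 1, 1.0)); // compare with 1.0112522 (artist)
    }
}
```

Because idf is multiplied in on both the query side and the document side, a rarer expanded term gains quadratically from its rarity, which is the "query idf problem" described above.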

This is my queryParser Code

package org.musicbrainz.search.servlet;

import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;
import org.musicbrainz.search.LuceneVersion;

import java.util.HashMap;
import java.util.Map;

public class DismaxQueryParser {

    public static String IMPOSSIBLE_FIELD_NAME = "\uFFFC\uFFFC\uFFFC";
    private DisjunctionQueryParser dqp;

    public DismaxQueryParser(org.apache.lucene.analysis.Analyzer analyzer) {
        dqp = new DisjunctionQueryParser(IMPOSSIBLE_FIELD_NAME, analyzer);
    }

    public Query parse(String query) throws org.apache.lucene.queryParser.ParseException {

        Query q0 = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":(" + query + ")");
        Query phrase = dqp.parse(DismaxQueryParser.IMPOSSIBLE_FIELD_NAME + ":\"" + query + "\"");

        if (phrase instanceof DisjunctionMaxQuery) {
            BooleanQuery bq = new BooleanQuery(true);
            bq.add(q0, BooleanClause.Occur.MUST);
            bq.add(phrase, BooleanClause.Occur.SHOULD);
            return bq;
        }
        else {
            return q0;
        }

    }

    public void addAlias(String field, DismaxAlias dismaxAlias) {
        dqp.addAlias(field, dismaxAlias);
    }

    static class DisjunctionQueryParser extends QueryParser {

        //Only make terms that are this length fuzzy
        private static final int MIN_FIELD_LENGTH_TO_MAKE_FUZZY = 4;
        private static final float FUZZY_SIMILARITY = 0.5f;

        //Reduce boost of wildcard matches compared to fuzzy /exact matches
        private static final float WILDCARD_BOOST_REDUCER = 0.8f;

        public DisjunctionQueryParser(String defaultField, org.apache.lucene.analysis.Analyzer analyzer) {
            super(LuceneVersion.LUCENE_VERSION, defaultField, analyzer);

        }

        protected Map<String, DismaxAlias> aliases = new HashMap<String, DismaxAlias>(3);


        //Field to DismaxAlias
        public void addAlias(String field, DismaxAlias dismaxAlias) {
            aliases.put(field, dismaxAlias);
        }

        protected org.apache.lucene.search.Query getFuzzyQuery(java.lang.String field, java.lang.String termStr, float minSimilarity)
                throws org.apache.lucene.queryParser.ParseException {

            FuzzyQuery fq = (FuzzyQuery) super.getFuzzyQuery(field, termStr, minSimilarity);
            //so that fuzzy query terms do not get an advantage over exact matches just because the query term is rarer
            //fq.setRewriteMethod(new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100));

            return fq;
        }

        protected Query getFieldQuery(String field, String queryText, boolean quoted)
                throws org.apache.lucene.queryParser.ParseException {
            //If field is an alias
            if (aliases.containsKey(field)) {
                DismaxAlias a = aliases.get(field);

                DisjunctionMaxQuery q = new DisjunctionMaxQuery(a.getTie());

                boolean ok = false;

                for (String f : a.getFields().keySet()) {

                    //if query can be created for this field and text
                    Query querySub;
                    Query queryWildcard = null;

                    if (!quoted && queryText.length() >= MIN_FIELD_LENGTH_TO_MAKE_FUZZY) {
                        querySub = getFieldQuery(f, queryText, quoted);
                        queryWildcard = getWildcardQuery(((TermQuery) querySub).getTerm().field(), ((TermQuery) querySub).getTerm().text() + '*');
                        querySub = getFuzzyQuery(((TermQuery) querySub).getTerm().field(), ((TermQuery) querySub).getTerm().text(), FUZZY_SIMILARITY);

                    } else {
                        querySub = getFieldQuery(f, queryText, quoted);
                    }

                    if (querySub != null) {

                        //if query was quoted but doesn't generate a phrase query we reject it
                        if (
                                (quoted == false) ||
                                        (querySub instanceof PhraseQuery)
                                ) {

                            //Reduce phrase boost because it will have matched both parts, giving far too much score differential
                            if(quoted == true) {
                                querySub.setBoost(0.1f);
                            }
                            //Boost as specified
                            else if (a.getFields().get(f) != null) {
                                querySub.setBoost(a.getFields().get(f));
                            }
                            q.add(querySub);
                            ok = true;
                        }
                    }

                    if (queryWildcard != null) {
                        if (a.getFields().get(f) != null) {

                            queryWildcard.setBoost(a.getFields().get(f) * WILDCARD_BOOST_REDUCER);
                        }
                        q.add(queryWildcard);
                    }
                }
                //Something has been added to disjunction query
                return ok ? q : null;

            } else {
                //usual Field
                try {
                    return super.getFieldQuery(field, queryText, quoted);
                } catch (Exception e) {
                    return null;
                }
            }
        }
    }

    static class DismaxAlias {
        public DismaxAlias() {

        }

        private float tie;
        //Field Boosts
        private Map<String, Float> fields;

        public float getTie() {
            return tie;
        }

        public void setTie(float tie) {
            this.tie = tie;
        }

        public Map<String, Float> getFields() {
            return fields;
        }

        public void setFields(Map<String, Float> fields) {
            this.fields = fields;
        }
    }
}

Thanks for any help,
Paul

