Re: Getting irrelevant results using fuzzy query

markharw00d Wed, 18 Jun 2008 13:10:58 -0700

This looks like it is related to an issue I first raised here:
   http://markmail.org/message/37ywsemfudpos6uh

At the time I identified 2 issues with FuzzyQuery - that the usual"coord" and "idf" scoring factors shouldn't be applied to fuzzy queries.

The coord factor got fixed but idf remains an issue in FuzzyQuery I beleive.

There is a class I added to contrib/queries - "FuzzyLikeThisQuery",which can be used as a replacement to FuzzyQuery and fixes the idfissue. You can subclass the QueryParser to create an instance of thatrather than FuzzyQuery if required.


Cheers,
Mark

László Monda wrote:

Hi List,

I've been redirected from [EMAIL PROTECTED] to here to discuss
my issue.


---------- My original email ----------


I try to provide relevant results for the users of a lyrics site, even
in the case of misspellings by indexing artist and songs with Lucene.

The problem is that Lucene provides irrelevant search results.  For
example searching for "Coldplay" returns "Longplay" as the most relevant
result.

This is how I create individual documents:

Document document = new Document();
document.add(new Field("artist", artist, Field.Store.YES,
Field.Index.UN_TOKENIZED));
document.add(new Field("song", song, Field.Store.YES,
Field.Index.UN_TOKENIZED));
document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(document);

And this is how I compose the actual query:

BooleanQuery query = new BooleanQuery();
if (artist.length() > 0) {
    FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
artist));
    query.add(artist_query, BooleanClause.Occur.MUST);
}
if (song.length() > 0) {
    FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
    query.add(song_query, BooleanClause.Occur.MUST);
}

Please let me know what's wrong, I'd like to make this work right.

Thanks in advance!


---------- My reply to an answer ----------


On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:

On Dienstag, 17. Juni 2008, László Monda wrote:

    FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
artist));

You should try the FuzzyQuery constructor that takes a minimum

similarity

and a prefix length. The general problem is however, that the degree

of

similarity is only one factor. The other factors are the same as for

other

searches, e.g. the number of occurences of the term in the document

and in

the whole index.

You could try to write your own similarity implementation that

disables all

these factors, see

http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html

I understand some essential concepts related to Lucene such as the
Levenshtein distance and tokenization, but I really don't want to go
this deep if it's not necessary.

Since fuzzy searching is based on the Levenshtein distance, the distance
between "coldplay" and "coldplay" is 0 and the distance between
"coldplay" and "downplay" is 3 so how on earth is possible that when
searching for "coldplay", Lucene returns "longplay"?  This shouldn't
happen regardless of the minimum similarity and prefix length factors.

Additional info: Lucene seems to do the right thing when only few
documents are present, but goes crazy when there is about 1.5 million
documents in the index.


---------------------------------------------------------------------


I hope that some of you can help me because I don't have any ideas what
can be wrong here.

Thanks in advance!




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Getting irrelevant results using fuzzy query

Reply via email to