Getting irrelevant results using fuzzy query

László Monda Wed, 18 Jun 2008 07:05:57 -0700

Hi List,

I've been redirected from [EMAIL PROTECTED] to here to discuss
my issue.

---------- My original email ----------

I'm trying to provide relevant results for the users of a lyrics site,
even when a query is misspelled, by indexing artists and songs with
Lucene.

The problem is that Lucene returns irrelevant search results.  For
example, searching for "Coldplay" returns "Longplay" as the most
relevant result.

This is how I create individual documents:

Document document = new Document();
document.add(new Field("artist", artist, Field.Store.YES,
        Field.Index.UN_TOKENIZED));
document.add(new Field("song", song, Field.Store.YES,
        Field.Index.UN_TOKENIZED));
document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(document);

And this is how I compose the actual query:

BooleanQuery query = new BooleanQuery();
if (artist.length() > 0) {
    FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist));
    query.add(artist_query, BooleanClause.Occur.MUST);
}
if (song.length() > 0) {
    FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
    query.add(song_query, BooleanClause.Occur.MUST);
}

Please let me know what's wrong; I'd like to make this work right.

Thanks in advance!

---------- My reply to an answer ----------

On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:
> On Tuesday, 17 June 2008, László Monda wrote:
>
> >     FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
> > artist));
>
> You should try the FuzzyQuery constructor that takes a minimum
> similarity and a prefix length. The general problem is however, that
> the degree of similarity is only one factor. The other factors are the
> same as for other searches, e.g. the number of occurrences of the term
> in the document and in the whole index.
>
> You could try to write your own similarity implementation that
> disables all these factors, see
> http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html

I understand some essential concepts related to Lucene such as the
Levenshtein distance and tokenization, but I really don't want to go
this deep if it's not necessary.

Since fuzzy searching is based on the Levenshtein distance, the distance
between "coldplay" and "coldplay" is 0 and the distance between
"coldplay" and "longplay" is 3, so how on earth is it possible that when
searching for "coldplay", Lucene returns "longplay" first?  This
shouldn't happen regardless of the minimum similarity and prefix length
factors.
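
For reference, here is a minimal, self-contained sketch of the distance
computation mentioned above.  It does not use Lucene at all: the class
and method names are made up for illustration, and the similarity
scaling (1 - distance / length of the shorter string, with a prefix
length of 0) is only my understanding of how FuzzyQuery in Lucene 2.x
normalizes the edit distance before comparing it with the minimum
similarity.  Treat it as an assumption, not as the library's actual code.

```java
// Hypothetical helper, not part of Lucene.
final class LevenshteinSketch {

    /** Classic dynamic-programming edit distance (insert, delete, substitute). */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * ASSUMPTION: distance scaled by the shorter string's length,
     * which is how I believe Lucene 2.x scores fuzzy matches.
     */
    static float similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0f : 0.0f;
        }
        return 1.0f - (float) distance(a, b) / shorter;
    }

    public static void main(String[] args) {
        System.out.println(distance("coldplay", "longplay"));   // prints 3
        System.out.println(similarity("coldplay", "longplay")); // prints 0.625
    }
}
```

If that scaling is right, "longplay" scores 0.625 against "coldplay",
which is above a minimum similarity of 0.5, so it is allowed to match.
That would explain why it shows up in the results at all, but not why it
would be ranked above an exact match.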

Additional info: Lucene seems to do the right thing when only a few
documents are present, but goes crazy when there are about 1.5 million
documents in the index.

---------------------------------------------------------------------

I hope that some of you can help me, because I have no idea what could
be wrong here.

Thanks in advance!

-- 
Laci  <http://monda.hu>
