Hi List, I've been redirected from [EMAIL PROTECTED] to here to discuss my issue.
---------- My original email ---------- I try to provide relevant results for the users of a lyrics site, even in the case of misspellings by indexing artist and songs with Lucene. The problem is that Lucene provides irrelevant search results. For example searching for "Coldplay" returns "Longplay" as the most relevant result. This is how I create individual documents: Document document = new Document(); document.add(new Field("artist", artist, Field.Store.YES, Field.Index.UN_TOKENIZED)); document.add(new Field("song", song, Field.Store.YES, Field.Index.UN_TOKENIZED)); document.add(new Field("path", path, Field.Store.YES, Field.Index.NO)); indexWriter.addDocument(document); And this is how I compose the actual query: BooleanQuery query = new BooleanQuery(); if (artist.length() > 0) { FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", artist)); query.add(artist_query, BooleanClause.Occur.MUST); } if (song.length() > 0) { FuzzyQuery song_query = new FuzzyQuery(new Term("song", song)); query.add(song_query, BooleanClause.Occur.MUST); } Please let me know what's wrong, I'd like to make this work right. Thanks in advance! ---------- My reply to an answer ---------- On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote: > On Dienstag, 17. Juni 2008, László Monda wrote: > > > FuzzyQuery artist_query = new FuzzyQuery(new Term("artist", > > artist)); > > You should try the FuzzyQuery constructor that takes a minimum similarity > and a prefix length. The general problem is however, that the degree of > similarity is only one factor. The other factors are the same as for other > searches, e.g. the number of occurences of the term in the document and in > the whole index. > > You could try to write your own similarity implementation that disables all > these factors, see > http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html I understand some essential concepts related to Lucene such as the Levenshtein distance and tokenization, but I really don't want to go this deep if it's not necessary. Since fuzzy searching is based on the Levenshtein distance, the distance between "coldplay" and "coldplay" is 0 and the distance between "coldplay" and "downplay" is 3 so how on earth is possible that when searching for "coldplay", Lucene returns "longplay"? This shouldn't happen regardless of the minimum similarity and prefix length factors. Additional info: Lucene seems to do the right thing when only few documents are present, but goes crazy when there is about 1.5 million documents in the index. --------------------------------------------------------------------- I hope that some of you can help me because I don't have any ideas what can be wrong here. Thanks in advance! -- Laci <http://monda.hu>
signature.asc
Description: This is a digitally signed message part