This looks like it is related to an issue I first raised here:
http://markmail.org/message/37ywsemfudpos6uh
At the time I identified 2 issues with FuzzyQuery - that the usual
"coord" and "idf" scoring factors shouldn't be applied to fuzzy queries.
The coord factor got fixed but idf remains an issue in FuzzyQuery I beleive.
There is a class I added to contrib/queries - "FuzzyLikeThisQuery",
which can be used as a replacement to FuzzyQuery and fixes the idf
issue. You can subclass the QueryParser to create an instance of that
rather than FuzzyQuery if required.
Cheers,
Mark
László Monda wrote:
Hi List,
I've been redirected from [EMAIL PROTECTED] to here to discuss
my issue.
---------- My original email ----------
I try to provide relevant results for the users of a lyrics site, even
in the case of misspellings by indexing artist and songs with Lucene.
The problem is that Lucene provides irrelevant search results. For
example searching for "Coldplay" returns "Longplay" as the most relevant
result.
This is how I create individual documents:
Document document = new Document();
document.add(new Field("artist", artist, Field.Store.YES,
Field.Index.UN_TOKENIZED));
document.add(new Field("song", song, Field.Store.YES,
Field.Index.UN_TOKENIZED));
document.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
indexWriter.addDocument(document);
And this is how I compose the actual query:
BooleanQuery query = new BooleanQuery();
if (artist.length() > 0) {
FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
artist));
query.add(artist_query, BooleanClause.Occur.MUST);
}
if (song.length() > 0) {
FuzzyQuery song_query = new FuzzyQuery(new Term("song", song));
query.add(song_query, BooleanClause.Occur.MUST);
}
Please let me know what's wrong, I'd like to make this work right.
Thanks in advance!
---------- My reply to an answer ----------
On Tue, 2008-06-17 at 20:38 +0200, Daniel Naber wrote:
On Dienstag, 17. Juni 2008, László Monda wrote:
FuzzyQuery artist_query = new FuzzyQuery(new Term("artist",
artist));
You should try the FuzzyQuery constructor that takes a minimum
similarity
and a prefix length. The general problem is however, that the degree
of
similarity is only one factor. The other factors are the same as for
other
searches, e.g. the number of occurences of the term in the document
and in
the whole index.
You could try to write your own similarity implementation that
disables all
these factors, see
http://lucene.apache.org/java/2_3_1/api/org/apache/lucene/search/Similarity.html
I understand some essential concepts related to Lucene such as the
Levenshtein distance and tokenization, but I really don't want to go
this deep if it's not necessary.
Since fuzzy searching is based on the Levenshtein distance, the distance
between "coldplay" and "coldplay" is 0 and the distance between
"coldplay" and "downplay" is 3 so how on earth is possible that when
searching for "coldplay", Lucene returns "longplay"? This shouldn't
happen regardless of the minimum similarity and prefix length factors.
Additional info: Lucene seems to do the right thing when only few
documents are present, but goes crazy when there is about 1.5 million
documents in the index.
---------------------------------------------------------------------
I hope that some of you can help me because I don't have any ideas what
can be wrong here.
Thanks in advance!
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]