"Maxym Mykhalchuk" <[EMAIL PROTECTED]> wrote on 10/04/2006 09:46:16 PM:

> Here's the issue: All my "documents" will be having a few (2-3:
> title, short description) short fields. You see, it's rare that the
> same word is repeated several times in a title, so will Lucene be
> able to give me a decent ranking, or will it be able to tell me "oh,
> yes, this term is in the following 300 titles".
>
> On what I've read on the topic so far, it seems that inverted
> indexes do work good on big texts, as they are able to exploit the
> repetition of words to do ranking.
Lucene is no psychic. If you're looking for "dog", and the index
contains two short documents, actually titles:

    "Sparky the Fire Dog"
    "Dog Hause Home Page"

(just two silly titles from Google's top 10 results for "dog"...),
then there's hardly any way for Lucene to determine which document
should be ranked higher.

For single-word queries in a situation like this, you might want to
help Lucene learn the "good" ranking. One way is to use
Document.setBoost() (or Field.setBoost()) to pre-determine which
document is more "important" regardless of its text (e.g., using some
sort of link analysis, or whatever trick is applicable in your
situation). Another way is to override Lucene's relevance ranking with
some other type of sorting (see the Sort class), for example sorting
all the matching results by date, to get the newer results first. In
many applications, you might want to let your users control this sort
order; for example, in a shopping site (where product names are the
very short "documents"), you might want to let the user sort the
results by price, by popularity, by release date, by users' ranking,
and so on.

For multi-word queries, it is actually possible to improve on Lucene's
standard ranking. For example, let's say you have the two titles

    "Hot Dog on a Stick"
    "Your Dog in Hot Weather"

and get the query "hot dog" (without quotation marks). Using
QueryParser, Lucene will normally rank the two titles more or less the
same. However, the first one is probably much better, because the words
"hot" and "dog" don't just appear there, they appear very close
together, and in this case even in order.

This sort of proximity-influenced scoring is missing from Lucene's
QueryParser, and I've been wondering recently how best to add it, and
whether it can easily be done with existing Lucene machinery, like the
SpanQuery class. Has anyone tried to do something like this before,
and can tell us about their experience?

Good Luck,
Nadav.
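
P.S. To make concrete what I mean by proximity-influenced scoring,
here is a minimal sketch in plain Java. It does not use any Lucene
API: the class ProximitySketch and its methods tokenize, minWindow and
score are names made up for this illustration, and the formula (number
of distinct query terms divided by the length of the shortest window
of the title that contains them all) is just one simple assumption
about how closeness could be turned into a number. It also ignores
word order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; none of this is part of Lucene.
final class ProximitySketch {

    // Lowercase the text and split it on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Length (in tokens) of the shortest stretch of the title that contains
    // every distinct query term, or -1 if a term is missing or the query is empty.
    static int minWindow(List<String> title, List<String> query) {
        Map<String, Integer> inWindow = new HashMap<>();
        for (String q : query) {
            inWindow.put(q, 0); // occurrences of each query term inside the window
        }
        if (inWindow.isEmpty()) {
            return -1;
        }
        int missing = inWindow.size();
        int best = -1;
        int left = 0;
        for (int right = 0; right < title.size(); right++) {
            String in = title.get(right);
            if (inWindow.containsKey(in) && inWindow.merge(in, 1, Integer::sum) == 1) {
                missing--;
            }
            // While the window covers all terms, record it and shrink from the left.
            while (missing == 0) {
                int len = right - left + 1;
                if (best < 0 || len < best) {
                    best = len;
                }
                String out = title.get(left++);
                if (inWindow.containsKey(out) && inWindow.merge(out, -1, Integer::sum) == 0) {
                    missing++;
                }
            }
        }
        return best;
    }

    // Assumed formula: 1.0 when the query terms are adjacent, smaller as they
    // spread apart, 0.0 when the title does not contain all of them.
    static double score(String title, String query) {
        List<String> queryTerms = tokenize(query);
        int window = minWindow(tokenize(title), queryTerms);
        if (window < 0) {
            return 0.0;
        }
        int distinctTerms = new HashSet<>(queryTerms).size();
        return (double) distinctTerms / window;
    }

    public static void main(String[] args) {
        String query = "hot dog";
        for (String title : new String[] {"Hot Dog on a Stick", "Your Dog in Hot Weather"}) {
            System.out.println(title + " -> " + score(title, query));
        }
    }
}
```

With this formula "Hot Dog on a Stick" scores 1.0 for the query
"hot dog" (the two words are adjacent), while "Your Dog in Hot Weather"
scores about 0.67 (the shortest window holding both words is three
tokens long). A real solution would still have to combine such a
factor with Lucene's normal score somehow.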
-- 
Nadav Har'El