On Dec 17, 2008, at 9:26 AM, Rajiv2 wrote:


Because the search term is provided by a user, and that user would explicitly have to put quotes around "marietta ga", when I believe the search text as it is (fleming roofing inc., marietta ga) should score higher for "marietta ga".


Just because the user doesn't do it doesn't mean you can't. You're stating that there is an implied ordering in their query, yet you don't want to take advantage of that. You can often achieve better results by generating phrase queries implicitly based on 2- or 3-grams. You might even try generating the whole thing as a phrase query with a really large slop value (like 100 or more). That way, scoring rewards terms that are closer together, but you still get the flexibility of an AND-like query. The downside is, possibly, a small performance hit, but you could test it first. Or, you could add the phrase queries as an optional OR clause on the original query, something like: fleming OR roofing OR marietta OR ga OR ("fleming roofing" OR "roofing marietta" OR "marietta ga").
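
As a rough sketch of the implicit phrase idea, here is one way to turn the raw user text into a query string of that shape before handing it to the query parser. The names tokenize, shingles and expand_query are made up for illustration, the regex tokenizer is only a stand-in for whatever Analyzer you actually use, and the "..."~N proximity form assumes the classic Lucene query parser syntax.

```python
import re


def tokenize(text):
    # Lowercase and keep runs of letters/digits; a stand-in for a real analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def shingles(tokens, n=2):
    # Adjacent n-word windows, in the order the user typed them.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def expand_query(text, n=2, slop=None):
    # Build: term OR term ... OR ("a b" OR "b c" ...), optionally with one
    # sloppy phrase over the whole query, e.g. "a b c"~100.
    tokens = tokenize(text)
    if not tokens:
        return ""
    terms = " OR ".join(tokens)
    phrases = ['"%s"' % s for s in shingles(tokens, n)]
    if slop is not None and len(tokens) > 1:
        phrases.append('"%s"~%d' % (" ".join(tokens), slop))
    if not phrases:
        return terms
    return "%s OR (%s)" % (terms, " OR ".join(phrases))


print(expand_query("fleming roofing marietta ga"))
print(expand_query("fleming roofing marietta ga", slop=100))
```

Since every phrase clause is optional, documents that match only the single terms are still returned; the phrases just add score when the words appear together.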

You could also try using a more intelligent Query Parser that is tuned to your domain. You could also try to factor in click-through stats into your results. Probably not the answer you want to hear, but it is doable and useful.

Do you have any a priori knowledge about Marietta GA over Fleming, GA to begin with? Have you done any broader-scale relevance assessment? It is often the case that "fixing" one query breaks a whole bunch of others. What I typically recommend is that you take the top 50 queries plus 10-30 random queries from your logs and assess the top 5 or 10 results for each as: relevant, somewhat relevant, not relevant, or embarrassing. The goal is to maximize relevant while minimizing embarrassing and not relevant.
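
The bookkeeping for that kind of assessment can be sketched roughly as follows. It assumes the query log is just a list of query strings and that a person has already graded each top result by hand; pick_queries and summarize are hypothetical helper names, not part of Lucene.

```python
import random
from collections import Counter

GRADES = ("relevant", "somewhat relevant", "not relevant", "embarrassing")


def pick_queries(log, top_n=50, random_n=20, seed=0):
    # Most frequent queries first, then a random sample of the remaining ones.
    counts = Counter(log)
    top = [q for q, _ in counts.most_common(top_n)]
    rest = sorted(set(counts) - set(top))
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return top + rng.sample(rest, min(random_n, len(rest)))


def summarize(judgments):
    # judgments: {query: [grade, ...]} for the top 5 or 10 results of each query.
    totals = Counter()
    for grades in judgments.values():
        for grade in grades:
            if grade not in GRADES:
                raise ValueError("unknown grade: %r" % grade)
            totals[grade] += 1
    n = sum(totals.values())
    # Fraction of judged results falling into each grade.
    return {g: (totals[g] / n if n else 0.0) for g in GRADES}
```

Running summarize before and after a ranking change on the same set of queries gives a quick check on whether a fix for one query made the others worse.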

Is this particular example an isolated case or do you feel this is systemic to your application? I've said it before, but it bears repeating: Just because someone typed search terms into your search box does not mean you have to actually do a search in order to present them results. If you KNOW the Marietta result is a better result for this query, then make it the top result. Solr has this feature via the "QueryElevationComponent" (horrible name, I know), but I call it Editorial Placement. It's not that hard to implement.
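
To show how little is involved, here is a minimal sketch of editorial placement done in application code, outside the search engine. This is not how Solr's QueryElevationComponent is implemented; the placements table, the document ids and the function names are all invented for the example.

```python
def normalize(query):
    # Case-fold and collapse whitespace so trivially different inputs match.
    return " ".join(query.lower().split())


def apply_placements(query, ranked_ids, placements):
    # Put the editorially chosen ids first, then the engine's ranking
    # with those ids removed so nothing appears twice.
    pinned = placements.get(normalize(query), [])
    pinned_set = set(pinned)
    return list(pinned) + [d for d in ranked_ids if d not in pinned_set]


# Hypothetical table, maintained by hand for known problem queries.
placements = {"fleming roofing inc., marietta ga": ["doc-marietta"]}
engine_ranking = ["doc-fleming", "doc-marietta", "doc-other"]
print(apply_placements("Fleming Roofing Inc., Marietta GA", engine_ranking, placements))
```

Queries with no entry in the table come back in the engine's own order, so the override only touches the cases you have explicitly decided on.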

Finally, I'd say I wouldn't split hairs over position too much if the Marietta result is #2 and the Fleming result is #1. Now, if you're telling me the Marietta result is something like #100 and Fleming is #1, that's a different story. The fact is, because your user didn't put quotes, you don't actually know for a fact that the Fleming result is what they wanted (but I agree, it is highly likely). The point is, I wouldn't quibble over anything that is in the top ten. Lucene is doing what you told it to do, that is, rank the results according to TF/IDF, etc. If you have other pertinent information about Marietta or the query, then you should tell Lucene that via phrases, boosts, payloads, or by altering the Similarity. But, like I said, be careful that you aren't breaking other queries.

HTH,
Grant

