Re: IDF scoring issue

Grant Ingersoll Wed, 17 Dec 2008 07:32:00 -0800


On Dec 17, 2008, at 9:26 AM, Rajiv2 wrote:

Because the search term is provided by a user, and that user would explicitly have to put quotes around "marietta ga", when I believe the search text as it is: fleming roofing inc., marietta ga -- should score higher for "marietta ga"

Just because the user doesn't do it doesn't mean you can't. You're stating that there is an implied ordering in their query, yet you don't want to take advantage of that. You can often achieve better results by generating phrase queries implicitly based on 2- or 3-grams. You might also even try generating the whole thing as a phrase query with a really large slop value (like 100 or more). Thus, scoring will reward things when they are closer together, but you still get the flexibility of an AND-like query. The downside is, possibly, a small performance hit, but you could test it first. Or, you could add in the phrase queries as an optional OR clause to the original query, something like: fleming OR roofing OR marietta OR ga OR ("fleming roofing" OR "roofing marietta" OR "marietta ga").
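
To make the two options above concrete, here is a minimal sketch that only builds query strings in Lucene's classic query parser syntax. It is written in Python for brevity and does not call Lucene itself. The helper names (tokenize, ngram_phrases, expand_query, sloppy_phrase) are hypothetical, and the regex tokenizer is just a stand-in for whatever Analyzer you actually index with.

```python
import re


def tokenize(text):
    # Stand-in for a real analyzer (an assumption): lowercase the input
    # and keep only runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_phrases(tokens, n=2):
    # Build quoted phrases from each run of n adjacent tokens.
    return ['"%s"' % " ".join(tokens[i:i + n])
            for i in range(len(tokens) - n + 1)]


def expand_query(text, n=2):
    # The original terms OR'ed together, plus an optional group of
    # n-gram phrases that rewards documents where terms are adjacent.
    tokens = tokenize(text)
    query = " OR ".join(tokens)
    phrases = ngram_phrases(tokens, n)
    if phrases:
        query += " OR (" + " OR ".join(phrases) + ")"
    return query


def sloppy_phrase(text, slop=100):
    # The whole input as a single phrase with a large slop value.
    return '"%s"~%d' % (" ".join(tokenize(text)), slop)


if __name__ == "__main__":
    print(expand_query("fleming roofing marietta ga"))
    print(sloppy_phrase("fleming roofing inc., marietta ga"))
```

The resulting strings would then go through your normal query parser. Whether the bigram clauses or the sloppy phrase works better is something to measure on your own index.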

You could also try using a more intelligent Query Parser that is tuned to your domain. You could also try to factor click-through stats into your results. Probably not the answer you want to hear, but it is doable and useful.

Do you have any a priori knowledge about Marietta GA over Fleming, GA to begin with? Have you done any broader scale relevance assessment? It is often the problem that "fixing" one query results in breaking a whole bunch of others. What I typically recommend is that you take the top 50 queries plus 10-30 random queries from your logs and do an assessment of the top 5/10 results for: relevant, somewhat relevant, not relevant and embarrassing. The goal is to maximize relevant while minimizing embarrassing and not relevant.
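
One way to set up that kind of assessment is sketched below in Python. Everything here is an assumption for illustration: the query log is taken to be a plain list of query strings, the judgments are entered by hand per query, and the function names (pick_queries, summarize) are hypothetical.

```python
import random
from collections import Counter

GRADES = ("relevant", "somewhat relevant", "not relevant", "embarrassing")


def pick_queries(log, top=50, extra=20, seed=0):
    # The `top` most frequent queries, plus `extra` queries sampled at
    # random from the remaining distinct queries in the log.
    counts = Counter(log)
    head = [q for q, _ in counts.most_common(top)]
    tail = sorted(set(counts) - set(head))
    rng = random.Random(seed)
    return head + rng.sample(tail, min(extra, len(tail)))


def summarize(judgments):
    # judgments maps each query to the grades a human assigned to its
    # top 5 or 10 results. Returns the fraction of results per grade.
    counts = Counter()
    for grades in judgments.values():
        for grade in grades:
            if grade not in GRADES:
                raise ValueError("unknown grade: %r" % grade)
            counts[grade] += 1
    total = sum(counts.values())
    return {g: (counts[g] / total if total else 0.0) for g in GRADES}
```

Running summarize before and after a scoring change gives a rough check on whether fixing one query has pushed results for other queries into the "not relevant" or "embarrassing" buckets.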

Is this particular example an isolated case or do you feel this is systemic to your application? I've said it before, but it bears repeating: Just because someone typed search terms into your search box does not mean you have to actually do a search in order to present them results. If you KNOW the Marietta result is a better result for this query, then make it the top result. Solr has this feature via the "QueryElevationComponent" (horrible name, I know), but I call it Editorial Placement. It's not that hard to implement.
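
If you are on plain Lucene rather than Solr, a bare-bones version of editorial placement might look like the Python sketch below. This is not how QueryElevationComponent is implemented; it only illustrates the idea. The function name, the pinned-results table and the document ids are all made up for the example.

```python
def apply_editorial_placement(query, results, pinned):
    # `results` is the ranked list of document ids from the engine.
    # `pinned` maps a normalized query string to the ids an editor
    # wants shown first (a hypothetical hand-maintained table).
    key = " ".join(query.lower().split())
    top = list(pinned.get(key, []))
    # Pinned ids go first, even if the engine did not return them;
    # the remaining results keep their original order.
    rest = [doc for doc in results if doc not in top]
    return top + rest


if __name__ == "__main__":
    pinned = {"fleming roofing marietta ga": ["doc_marietta"]}
    ranked = ["doc_fleming", "doc_marietta", "doc_other"]
    print(apply_editorial_placement("Fleming Roofing  Marietta GA", ranked, pinned))
```

Queries with no entry in the table pass through untouched, so the override only affects the cases you have explicitly decided on.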

Finally, I'd say I wouldn't split hairs over position too much, if the Marietta result is #2 and the Fleming result is #1. Now, if you're telling me the Marietta result is something like #100 and Fleming is #1, that's a different story. The fact is, b/c your user didn't put quotes, you don't actually know for a fact that the Fleming result is what they wanted (but I agree, it is highly likely). The point is, I wouldn't quibble over anything that is in the top ten. Lucene is doing what you told it to do, that is, rank the results according to TF/IDF, etc. If you have other pertinent information about Marietta or the query, then you should tell Lucene that via phrases, boosts or payloads or altering the Similarity. But, like I said, be careful that you aren't breaking other queries.


HTH,
Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
