On 10/01/2012 10:18, Ian Lea wrote:
If a term has an accent, add both accented and unaccented versions at
index and search time.

So in your example your default field would contain

República Republica

and a search for "República" would expand to "República Republica" and
match both and score higher than a search for "Republica" which would
just match the unaccented version.
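As a rough illustration of the expansion described above, here is a minimal sketch in plain Java rather than a Lucene analyzer. The class and method names (`AccentExpander`, `stripAccents`, `expand`) are made up for this example, and it assumes that stripping Unicode combining marks after NFD decomposition is an acceptable definition of "unaccented".

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class AccentExpander {

    // Decompose to NFD, then drop combining marks (Unicode category M).
    static String stripAccents(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Returns the term itself, plus its unaccented form when the two differ.
    static List<String> expand(String term) {
        List<String> variants = new ArrayList<>();
        variants.add(term);
        String plain = stripAccents(term);
        if (!plain.equals(term)) {
            variants.add(plain);
        }
        return variants;
    }

    public static void main(String[] args) {
        // "Rep\u00fablica" is "República" written with a Unicode escape.
        System.out.println(expand("Rep\u00fablica")); // two variants
        System.out.println(expand("Republica"));      // one variant
    }
}
```

The same `expand` call would be applied to each token at index time and at search time, so an accented query term can match both forms while an unaccented one matches only the plain form.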

Thanks, that is a solution, but it has a side effect: when searching for Republica, a document containing 'Banana Republica' would score as well as one containing 'República' (because República expands to "Republica República"). In both cases the search term would match one of two terms, whereas I would want República to score higher.

I don't really want to mess with the matching. I'm happy with what it matches and with the order the results are returned in. The trouble is that we are only searching short amounts of text, not large chunks, so we typically end up with many matches having the same score. I would like to improve just the scoring, so that matches that look better to the user appear higher up in the results.

República

It's not quite synonyms but you could borrow synonym code from
somewhere.  There's stuff in the lucene contrib area and in LIA and
maybe elsewhere.  I've used the LIA code to do something similar.


An alternative would be to store accented versions in a separate field
and add a query for that field to the mix if you have accented terms.
You could boost that part of the query.
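To make the separate-field idea concrete, here is a toy model in plain Java. It is not Lucene's scoring formula: it only assumes that each document is matched once against an accent-folded form and once against the exact form, and that the exact match carries an extra boost when the query contains an accent. `TwoFieldScorer`, `EXACT_BOOST` and the value 2.0 are all invented for the illustration.

```java
import java.text.Normalizer;
import java.util.Locale;

// Toy scorer, purely illustrative; real Lucene scores are computed differently.
final class TwoFieldScorer {

    static final double EXACT_BOOST = 2.0; // assumed boost for the accented field

    // "Folded field" form: lower-cased with combining marks removed.
    static String fold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")
                .toLowerCase(Locale.ROOT);
    }

    // "Exact field" form: lower-cased, accents kept (normalized to NFC).
    static String exact(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
    }

    static double score(String docText, String queryTerm) {
        boolean queryHasAccent = !fold(queryTerm).equals(exact(queryTerm));
        double score = 0.0;
        for (String token : docText.trim().split("\\s+")) {
            if (fold(token).equals(fold(queryTerm))) {
                score += 1.0;          // match on the folded field
            }
            if (queryHasAccent && exact(token).equals(exact(queryTerm))) {
                score += EXACT_BOOST;  // extra clause on the accented field
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String query = "Rep\u00fablica"; // "República"
        System.out.println(score("Rep\u00fablica", query));
        System.out.println(score("Banana Republica", query));
    }
}
```

In this model both documents still match, so recall is unchanged, but the document that contains the accented form gets the additional boosted contribution. An unaccented query adds no exact-field clause and scores both documents equally.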


Also, the accent case was the easiest to explain, but I want to apply this to other cases as well, such as misspellings. For example, if there are two documents in the index with the values

James Clarke
David Clarke

And I search for

Dave Clarke

I would like David Clarke to score higher than James Clarke, because the first name is nearly the same, but at the moment they both score the same because they only match on the second value. I don't want to introduce synonyms or wildcard searches because I think they would return far too many false positives, and the search is also not restricted to Latin charsets. But having done a search that returns both

James Clarke
David Clarke

I can then safely adjust the scores. Maybe I should just try my original idea.
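One possible reading of "adjust the scores" is a post-processing step that breaks ties among hits with equal scores by how close each hit's text is to the query. The sketch below is only that, a sketch in plain Java: it assumes Levenshtein edit distance as the closeness measure, compares whitespace-separated tokens, and uses invented names (`TieBreaker`, `closeness`, `rerankTies`). It does not touch matching, so it cannot add false positives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical re-ranker applied to hits that already have the same score.
final class TieBreaker {

    // Classic two-row Levenshtein distance over UTF-16 code units.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // 1.0 for identical tokens, approaching 0.0 as they diverge.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Average, over query tokens, of the best similarity to any hit token.
    static double closeness(String query, String hit) {
        String[] queryTokens = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        String[] hitTokens = hit.toLowerCase(Locale.ROOT).trim().split("\\s+");
        double total = 0.0;
        for (String q : queryTokens) {
            double best = 0.0;
            for (String h : hitTokens) best = Math.max(best, similarity(q, h));
            total += best;
        }
        return total / queryTokens.length;
    }

    // Returns a copy of the tied hits, closest to the query first.
    static List<String> rerankTies(String query, List<String> tiedHits) {
        List<String> sorted = new ArrayList<>(tiedHits);
        sorted.sort(Comparator.comparingDouble((String h) -> closeness(query, h)).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("James Clarke", "David Clarke");
        System.out.println(rerankTies("Dave Clarke", hits));
    }
}
```

Because the distance is computed character by character, the approach is not tied to Latin text, although characters outside the Basic Multilingual Plane would count as two units in this simple version.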


Paul
