Re: AW: Best practices for multiple languages?

Dominique Bejean Thu, 20 Jan 2011 01:46:44 -0800

Hi,

During a recent Solr project we needed to index documents in a lot of languages. The natural solution with Lucene and Solr is to define one field per language. Each field is configured in the schema.xml file to use language-specific processing (tokenizing, stop words, stemmer, ...). This is really not easy to manage if you have a lot of languages, and it means that 1) the search interface needs to know which language you are searching in, and 2) the search interface can't search in all languages at the same time.

So, I decided that the only solution was to index all languages in a single field.

Obviously, each language needs to be processed specifically. For this, I developed an analyzer that is in charge of redirecting content to the correct tokenizer, filters and stemmer according to its language. This analyzer is also used at query time. If the user specifies the language of the query, the query is processed by the appropriate tokenizer, filters and stemmer; otherwise the query is processed by a default tokenizer, filters and stemmer.
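The routing described above can be illustrated with a minimal Python sketch. This is not the analyzer from the project and it does not use the Lucene API; the stop-word lists, suffix lists and the names `analyze`, `language_chain` and `default_chain` are all hypothetical, chosen only to show content being dispatched to a per-language chain with a default fallback.

```python
import re

# Hypothetical per-language resources; real stop-word lists and stemmers
# would be far more complete than these toy tables.
STOP_WORDS = {
    "en": {"the", "a", "of", "and", "in"},
    "fr": {"le", "la", "les", "de", "des", "et"},
}
SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "fr": ("es", "s", "e"),
}

TOKEN_RE = re.compile(r"\w+")


def default_chain(text):
    """Fallback chain: lowercase and split on non-word characters only."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def language_chain(lang):
    """Build a chain that tokenizes, drops stop words, then strips one suffix."""
    stop_words = STOP_WORDS[lang]
    suffixes = SUFFIXES[lang]

    def stem(token):
        for suffix in suffixes:
            # Keep at least three characters so short words survive intact.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                return token[: -len(suffix)]
        return token

    def chain(text):
        return [stem(t) for t in default_chain(text) if t not in stop_words]

    return chain


CHAINS = {lang: language_chain(lang) for lang in STOP_WORDS}


def analyze(text, lang=None):
    """Route text to the chain for `lang`, or to the default chain."""
    return CHAINS.get(lang, default_chain)(text)
```

The same `analyze` function would be called at index time with the document's language and at query time with the language given by the user, if any. An unknown or missing language simply goes through the default chain.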


With this solution:

1. I only need one field (or two if I want both stemmed and unstemmed processing)

2. The user can search in all documents regardless of their language

I hope this helps.

Dominique
www.zoonix.fr
www.crawl-anywhere.com



On 20/01/11 00:29, Bill Janssen wrote:

Paul Libbrecht <p...@hoplahup.net> wrote:

I made several changes of this sort, and the precision and recall
measures improved, in particular in the presence of language-indication
failures, which happened to be very common in our authoring environment.

There are two kinds of failures:  no language, or wrong language.

For no language, I fall back to StandardAnalyzer, so I should have
results similar to yours.  For wrong language, well, I'm using OTS
trigram-based language guessers, and they're pretty good these days.
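To make the trigram idea concrete, here is a toy Python sketch of a guesser with a fallback when no language matches. It is not one of the off-the-shelf guessers mentioned above: the training samples, the `min_overlap` threshold and the names `guess_language` and `choose_analyzer` are assumptions for illustration, and a real guesser is trained on far larger corpora.

```python
from collections import Counter

# Tiny hypothetical training samples, one per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs through the forest",
    "fr": "le renard brun rapide saute par dessus le chien paresseux et court dans la foret",
}


def trigrams(text):
    """Count character trigrams of the lowercased text, padded with spaces."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def guess_language(text, min_overlap=0.3):
    """Return the best-matching language code, or None when nothing matches well."""
    if not text.strip():
        return None
    counts = trigrams(text)
    total = sum(counts.values())
    best_lang, best_score = None, 0.0
    for lang, profile in PROFILES.items():
        # Fraction of the text's trigrams that also occur in the profile.
        score = sum(n for gram, n in counts.items() if gram in profile) / total
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang if best_score >= min_overlap else None


def choose_analyzer(text):
    """Use the guessed language, falling back to a standard analyzer name."""
    lang = guess_language(text)
    return lang if lang is not None else "standard"
```

Here `"standard"` stands in for the StandardAnalyzer fallback used in the "no language" case; the "wrong language" case corresponds to the guesser returning a confident but incorrect code.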

Wouldn't it be better to prefer precise matches (a field that is
analyzed with StandardAnalyzer, for example) but also allow matches that
are stemmed?

Yes, I think it might improve things, but again, by how much?  Stemming is
better than no stemming, in terms of recall.  But this approach would also
improve precision.
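The trade-off discussed here, preferring exact matches while still accepting stemmed ones, can be sketched in Python as follows. This is only an illustration under stated assumptions, not Lucene scoring: `crude_stem` is a hypothetical stand-in for a real stemmer, and the boost values are arbitrary.

```python
def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip one common English suffix."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_document(text):
    """Return the two views of a document: exact tokens and stemmed tokens."""
    exact = text.lower().split()
    return {"exact": set(exact), "stemmed": {crude_stem(t) for t in exact}}


def score(query, doc, exact_boost=2.0, stemmed_boost=1.0):
    """Sum per-term scores, weighting exact matches above stemmed-only matches."""
    total = 0.0
    for term in query.lower().split():
        if term in doc["exact"]:
            total += exact_boost
        elif crude_stem(term) in doc["stemmed"]:
            total += stemmed_boost
    return total
```

With these assumed boosts, a query for "indexing" ranks a document containing "indexing" above one containing only "indexed", while the second document is still returned, which is the recall benefit of stemming.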

Bill

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

