Isn't this approach somewhat bad for term-frequency?

Words that appear in several languages would be a lot more frequent 
(and hence less significant).

I still prefer the split-field method with proper query expansion: that way, 
term frequency is evaluated within the corpus of a single language (see the 
sketch below the list).

Dominique, in your case, at least on the web, you have:
- the user's preferred language (if defined in a profile)
- the list of languages the browser says it accepts
That list can easily be limited to around eight languages and still cover any 
language the user expects to search in.
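
Roughly, the expansion could look like the following untested sketch (the 
body_xx field naming is an assumption, not anyone's actual schema, and the 
analyzerFor() helper is a trivial stand-in):

  import java.util.List;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.fr.FrenchAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;

  class SplitFieldExpansion {
    // Stand-in for a real language -> analyzer mapping.
    static Analyzer analyzerFor(String lang) {
      return "fr".equals(lang) ? new FrenchAnalyzer() : new StandardAnalyzer();
    }

    // Expand the user's query over one field per candidate language.
    // Each clause is parsed with that language's analyzer, so stemming
    // and term statistics stay inside a single language's corpus.
    static Query expand(String userQuery, List<String> langs) throws Exception {
      BooleanQuery.Builder bq = new BooleanQuery.Builder();
      for (String lang : langs) {
        QueryParser parser = new QueryParser("body_" + lang, analyzerFor(lang));
        bq.add(parser.parse(userQuery), BooleanClause.Occur.SHOULD);
      }
      return bq.build();
    }
  }

A DisjunctionMaxQuery over the per-language clauses would also work, if you 
don't want a document that matches in several language fields to add up its 
scores.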

paul


On 20 Jan 2011, at 10:46, Dominique Bejean wrote:

> Hi,
> 
> During a recent Solr project we needed to index documents in many 
> languages. The natural solution with Lucene and Solr is to define one field 
> per language. Each field is configured in the schema.xml file to use 
> language-specific processing (tokenizing, stop words, stemmer, ...). This is 
> really not easy to manage if you have a lot of languages, and it means that 
> 1) the search interface needs to know which language you are searching in, 
> and 2) the search interface can't search all languages at the same time.
> 
> So, I decided that the only solution was to index all languages in only one 
> field.
> 
> Obviously, each language needs to be processed specifically. For this, I 
> developed an analyzer that redirects content to the correct tokenizer, 
> filters, and stemmer according to its language. This analyzer is also used 
> at query time: if the user specifies the language of the query, the query is 
> processed by the appropriate tokenizer, filters, and stemmer; otherwise it 
> is processed by a default tokenizer, filters, and stemmer.
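> 
> In rough Java, the indexing side of this idea looks something like the 
> sketch below (simplified and untested, against current Lucene APIs, not my 
> actual code; the language map and the "body" field name are illustrative):
> 
>   import java.util.Map;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.en.EnglishAnalyzer;
>   import org.apache.lucene.analysis.fr.FrenchAnalyzer;
>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>   import org.apache.lucene.document.Document;
>   import org.apache.lucene.document.TextField;
> 
>   class MultiLangDocs {
>     // One analyzer per language, plus a default for unknown languages.
>     static final Map<String, Analyzer> ANALYZERS = Map.of(
>         "en", new EnglishAnalyzer(),
>         "fr", new FrenchAnalyzer());
>     static final Analyzer FALLBACK = new StandardAnalyzer();
> 
>     // Every language goes into the same "body" field; only the analysis
>     // differs per document. Passing a ready-made TokenStream to TextField
>     // bypasses the IndexWriter's own analyzer for that field.
>     static Document makeDoc(String text, String lang) {
>       Analyzer a = ANALYZERS.getOrDefault(lang, FALLBACK);
>       Document doc = new Document();
>       doc.add(new TextField("body", a.tokenStream("body", text)));
>       return doc;
>     }
>   }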
> 
> With this solution:
> 
> 1. I only need one field (or two if I want both stemmed and unstemmed 
> processing)
> 2. Users can search all documents regardless of their language
> 
> I hope this helps.
> 
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
> 
> 
> 
> On 20/01/11 00:29, Bill Janssen wrote:
>> Paul Libbrecht <p...@hoplahup.net> wrote:
>> 
>>> I made several changes of this sort, and the precision and recall
>>> measures improved, particularly in the presence of language-indication
>>> failures, which happened to be very common in our authoring environment.
>> There are two kinds of failures:  no language, or wrong language.
>> 
>> For no language, I fall back to StandardAnalyzer, so I should have
>> results similar to yours.  For wrong language, well, I'm using
>> off-the-shelf trigram-based language guessers, and they're pretty good
>> these days.
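>> 
>> One such off-the-shelf option, just as an illustration (not necessarily
>> the one I use), is Tika's n-gram-based identifier:
>> 
>>   import org.apache.tika.language.LanguageIdentifier;
>> 
>>   // Guess the language of a document's text from character n-grams.
>>   LanguageIdentifier id = new LanguageIdentifier(text);
>>   String lang = id.getLanguage();  // ISO code, e.g. "fr"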
>> 
>>>>> Wouldn't it be better to prefer precise matches (a field that is
>>>>> analyzed with StandardAnalyzer for example) but also allow matches
>>>>> that are stemmed?
>> Yes, I think it might improve things, but again, by how much?  Stemming is
>> better than no stemming, in terms of recall.  But this approach would also
>> improve precision.
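>> 
>> A sketch of that combination (untested; the field names body_exact and
>> body_stemmed and the boost value are hypothetical, the exact field being
>> one analyzed with StandardAnalyzer, the stemmed one with a stemmer):
>> 
>>   import org.apache.lucene.index.Term;
>>   import org.apache.lucene.search.BooleanClause;
>>   import org.apache.lucene.search.BooleanQuery;
>>   import org.apache.lucene.search.BoostQuery;
>>   import org.apache.lucene.search.Query;
>>   import org.apache.lucene.search.TermQuery;
>> 
>>   // Prefer the precise (unstemmed) match but still accept the stemmed
>>   // one: an exact hit scores higher, a stem-only hit still matches.
>>   Query preferExact(String word, String stem) {
>>     BooleanQuery.Builder b = new BooleanQuery.Builder();
>>     b.add(new BoostQuery(new TermQuery(new Term("body_exact", word)), 2.0f),
>>           BooleanClause.Occur.SHOULD);
>>     b.add(new TermQuery(new Term("body_stemmed", stem)),
>>           BooleanClause.Occur.SHOULD);
>>     return b.build();
>>   }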
>> 
>> Bill
>> 
> 

