Re: Best practices for multiple languages?

Paul Libbrecht Wed, 19 Jan 2011 10:17:02 -0800

Because StandardAnalyzer does not stem: it does not find "junks" when you
search for "junk". Or "chevaux" when you search for "cheval".
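
For what it's worth, here is a minimal sketch of the effect in plain Java, with no Lucene dependency. The naiveStem helper is a made-up toy that only strips a trailing "s"; it is not StandardAnalyzer, Snowball, or any real stemmer, and it would not handle cheval/chevaux, which needs a French-aware stemmer.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: a naive plural stripper stands in for a real stemmer.
class StemmingDemo {

    // Hypothetical helper: lowercases and strips one trailing "s".
    static String naiveStem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    // Terms as a non-stemming analysis would roughly produce them: lowercased only.
    static Set<String> indexUnstemmed(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            terms.add(token);
        }
        return terms;
    }

    // Terms after the toy stemmer has been applied to every token.
    static Set<String> indexStemmed(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("\\s+")) {
            terms.add(naiveStem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        String doc = "Old junks in the harbour";
        // Without stemming, the query term "junk" is not among the indexed terms.
        System.out.println(indexUnstemmed(doc).contains("junk")); // false
        // With the same stemming at index and query time, it matches.
        System.out.println(indexStemmed(doc).contains(naiveStem("junk"))); // true
    }
}
```

The point is only that the same stemming has to happen at index time and at query time, and that it has to fit the language.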


paul


On 19 Jan 2011, at 18:59, Luca Rondanini wrote:

> Why not just use the StandardAnalyzer? It works pretty well even with
> Asian languages!
> 
> 
> 
> On Wed, Jan 19, 2011 at 12:23 AM, Shai Erera <ser...@gmail.com> wrote:
> 
>> If you index documents that are each in a different language, but all the
>> fields of a given document are in the same language, then you can do the
>> following:
>> 
>> Create separate indexes per language
>> -------------------------------------------------------
>> This will work and is not too hard to set up. It requires some maintenance
>> code (e.g. directing a search request to the relevant index), but nothing
>> too complicated. The advantage of this approach is that you don't risk
>> issues like searching for "die" when the language is German and finding
>> documents in English indexed with that word. So your searches are
>> "language safe". A disadvantage is that if you ever need a cross-language
>> operation, like searching two languages at once, you need search
>> federation, which is less good. Maintenance also becomes a slight pain,
>> because you need to, e.g., optimize multiple indexes and make sure they
>> don't all optimize at once, resulting in a sudden burst of IO.
>> 
>> Create one index
>> -------------------------
>> Here, you'd use the IndexWriter.addDocument(doc, analyzer) method and pass
>> the proper Analyzer for the doc's language. That way, all your documents
>> are located in the same index, so administration is really simple. They
>> also don't step on each other's toes: each document is analyzed exactly as
>> it should be. You might get into weird situations like the "die" example
>> (fetching a document in the wrong language), but that's easily solved by
>> indexing a "language" field for each document and using it as a Filter
>> during the search. You can cache that Filter so that its posting list is
>> traversed only once instead of for every query.
>> 
>> We use the second approach and we're required to support 32 languages.
>> While in most deployments the number never exceeds 3-4 languages, I know
>> of some that handle more than 10. If you're careful enough, it just works.
>> 
>> Hope this helps.
>> 
>> Shai
>> 
>> On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht <p...@hoplahup.net> wrote:
>> 
>>> 
>>> But for this, you need a skillfully designed:
>>> - set of fields
>>> - multiplexing analyzer
>>> - query expansion
>>> In one of my projects, we do not split languages by field and it's a
>>> pain... I keep having issues in one direction or the other.
>>> - the "die" example that Otis mentioned is a good one: stop-word in
>>> German, essential verb in English
>>> - I recently had issues with contributions containing the word Fourier
>>> (as in Fourier series): in English it stays fourier, in French it becomes
>>> fouri. So: if the resource is contributed in French, the indexed value is
>>> fouri and English seekers won't find it; if the resource is contributed
>>> in English, French seekers won't find it.
>>> So my last lesson: always have a whitespace-lowercase unstemmed field
>>> also at hand and prefer it over the others in your query expansion.
>>> 
>>> A wiki page should probably be made.
>>> 
>>> paul
>>> 
>>> 
>>> On 19 Jan 2011, at 07:53, Vinaya Kumar Thimmappa wrote:
>>>> I think we should be using Lucene with the Snowball jars, which means one
>>>> index for all languages (of course, the size of the index is always a
>>>> concern).
>>>> 
>>>> Hope this helps.
>>>> -vinaya
>>>> 
>>>> On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
>>>>> What is the "best practice" for supporting multiple languages, i.e.
>>>>> Lucene Documents that have content/fields in multiple languages?
>>>>> Should
>>>>> a) each language be indexed in a separate index/directory, or should
>>>>> b) the Documents (in a single directory) hold the various localized
>>>>> fields?
>>>>> 
>>>>> We will most often be searching "language dependent", which (at least
>>>>> performance-wise) mandates one directory per language...
>>>>> 
>>>>> Any (Lucene-specific) white papers on this topic?
>>>>> 
>>>>> Thx in advance
>>>>> Clemens
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>> 
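
Shai's single-index approach quoted above (one index, a "language" field per document, used as a filter at search time) can be sketched roughly as follows. This is plain Java over an in-memory map, not Lucene code: the class LanguageFilterDemo and its methods are made up for illustration, and the per-language document set only plays the role that a cached Filter would play in Lucene.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy single index with a per-document language field (illustrative only).
class LanguageFilterDemo {
    // term -> ids of documents containing it, regardless of language
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // language -> ids of documents in that language; built once, reused per query
    private final Map<String, Set<Integer>> docsByLanguage = new HashMap<>();

    void add(int docId, String language, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        docsByLanguage.computeIfAbsent(language, k -> new TreeSet<>()).add(docId);
    }

    // Returns ids of documents that contain the term AND are in the given language.
    Set<Integer> search(String term, String language) {
        Set<Integer> termDocs = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        Set<Integer> languageDocs = docsByLanguage.getOrDefault(
                language, Collections.<Integer>emptySet());
        Set<Integer> hits = new TreeSet<>(termDocs);
        hits.retainAll(languageDocs); // the language "filter"
        return hits;
    }

    public static void main(String[] args) {
        LanguageFilterDemo index = new LanguageFilterDemo();
        index.add(1, "en", "never say die");
        index.add(2, "de", "die Katze");
        System.out.println(index.search("die", "de")); // [2]
        System.out.println(index.search("die", "en")); // [1]
    }
}
```

Both documents contain "die", but each query only sees the document in its own language, which is the "language safe" behaviour without needing separate indexes.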


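The Fourier lesson quoted above (keep a whitespace-lowercase unstemmed field and prefer it in the query expansion) might look like this in a toy form. Everything here is an assumption made for illustration: the stem method is a fake rule that merely reproduces the fourier/fouri behaviour reported in the thread, and the weights 2 and 1 are arbitrary stand-ins for "prefer the unstemmed field".

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of a document indexed into an exact field and a stemmed field.
class UnstemmedFieldDemo {

    // Fake per-language stemmer: for "fr" it strips a trailing "er" from longer
    // words, which turns "fourier" into "fouri"; other languages are left alone.
    static String stem(String token, String lang) {
        String t = token.toLowerCase(Locale.ROOT);
        if (lang.equals("fr") && t.length() > 4 && t.endsWith("er")) {
            return t.substring(0, t.length() - 2);
        }
        return t;
    }

    static final class IndexedDoc {
        final Set<String> exact = new HashSet<>();   // whitespace + lowercase only
        final Set<String> stemmed = new HashSet<>(); // stemmed with the doc's language

        IndexedDoc(String text, String lang) {
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                exact.add(token);
                stemmed.add(stem(token, lang));
            }
        }
    }

    // Expanded query: an exact-field match counts more than a stemmed-field match.
    static int score(IndexedDoc doc, String query, String queryLang) {
        String q = query.toLowerCase(Locale.ROOT);
        int score = 0;
        if (doc.exact.contains(q)) {
            score += 2;
        }
        if (doc.stemmed.contains(stem(q, queryLang))) {
            score += 1;
        }
        return score;
    }

    public static void main(String[] args) {
        IndexedDoc doc = new IndexedDoc("series de Fourier", "fr");
        System.out.println(score(doc, "Fourier", "en")); // 2: found via the exact field only
        System.out.println(score(doc, "Fourier", "fr")); // 3: exact and stemmed both match
    }
}
```

Without the exact field, the English query would score 0 against the French-indexed document, which is the failure described in the thread.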