Re: Best practices for multiple languages?

Shai Erera Wed, 19 Jan 2011 00:24:32 -0800

If each document you index is in a single language (i.e. all of its fields
are in that same language), then you have two options:


Create separate indexes per language
-------------------------------------------------------
This works and is not too hard to set up. It requires some maintenance code
(e.g. directing each search request to the relevant index), but nothing too
complicated. The advantage of this approach is that your searches are
"language safe": you don't risk searching for "die" with German as the
language and getting back English documents indexed with that word. A
disadvantage is that if you ever need a cross-language operation, like
searching two languages at once, you have to federate the search across
indexes, which is more cumbersome than searching a single index. Also,
maintenance becomes a slight pain, because you need, for example, to optimize
multiple indexes and make sure they don't all optimize at once, which would
cause a sudden burst of IO.
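
To make the "maintenance code" concrete, here is a minimal sketch of the
routing part in plain Java. It is not Lucene API: the class name, the method
names and the index paths are all hypothetical, and the paths merely stand in
for whatever you would open a searcher or writer on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: maps a language code to the location of its index. */
class LanguageIndexRouter {
    // LinkedHashMap keeps registration order, which maintenanceOrder() relies on.
    private final Map<String, String> indexPathByLanguage =
            new LinkedHashMap<String, String>();

    void register(String language, String indexPath) {
        indexPathByLanguage.put(normalize(language), indexPath);
    }

    /** Single-language search: exactly one index is consulted. */
    String route(String language) {
        String path = indexPathByLanguage.get(normalize(language));
        if (path == null) {
            throw new IllegalArgumentException("No index for language: " + language);
        }
        return path;
    }

    /** Cross-language search: the caller must query each index and merge the hits. */
    List<String> routeAll(List<String> languages) {
        List<String> paths = new ArrayList<String>();
        for (String language : languages) {
            paths.add(route(language));
        }
        return paths;
    }

    /** Indexes in the order they should be optimized, one after another, never in parallel. */
    List<String> maintenanceOrder() {
        return new ArrayList<String>(indexPathByLanguage.values());
    }

    private static String normalize(String language) {
        return language.trim().toLowerCase(Locale.ROOT);
    }
}
```

A search for German content would call route("de") and open only that index,
while a two-language search would call routeAll(...) and merge the results
itself, which is the federation cost mentioned above.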

Create one index
-------------------------
Here, you'd use the IndexWriter.addDocument(doc, analyzer) method and pass the
proper Analyzer for the doc's language. That way, all your documents are
located in the same index, so administration is really simple. They also
don't step on each other's toes: each document is analyzed exactly as it
should be. You might get into weird situations like the "die" example
(fetching a document in the wrong language), but that's easily solved by
indexing a "language" field for each document and using it as a Filter during
the search. You can cache that Filter so that its posting list is traversed
only once rather than for every query.
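
To illustrate the idea of a cached language filter, here is a toy in-memory
sketch in plain Java. It deliberately does not use Lucene classes: the
SingleIndexSketch class, its methods and the filterBuilds counter are all
made up for illustration, and the BitSet per language only stands in for what
a cached Lucene Filter on the "language" field would give you.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy single index: term postings plus a per-document language value. */
class SingleIndexSketch {
    private final Map<String, BitSet> postings = new HashMap<String, BitSet>();
    private final List<String> docLanguages = new ArrayList<String>();
    private final Map<String, BitSet> filterCache = new HashMap<String, BitSet>();
    int filterBuilds = 0; // counts how often a language filter was computed

    /** Adds a document (already analyzed into terms) and returns its doc id. */
    int addDocument(String language, String... terms) {
        int docId = docLanguages.size();
        docLanguages.add(language);
        for (String term : terms) {
            BitSet bits = postings.get(term);
            if (bits == null) {
                bits = new BitSet();
                postings.put(term, bits);
            }
            bits.set(docId);
        }
        filterCache.remove(language); // the cached filter for this language is stale
        return docId;
    }

    /** Builds the language filter once and reuses it until the index changes. */
    private BitSet languageFilter(String language) {
        BitSet cached = filterCache.get(language);
        if (cached == null) {
            cached = new BitSet();
            for (int i = 0; i < docLanguages.size(); i++) {
                if (docLanguages.get(i).equals(language)) {
                    cached.set(i);
                }
            }
            filterCache.put(language, cached);
            filterBuilds++;
        }
        return cached;
    }

    /** Returns the ids of documents containing the term, restricted to one language. */
    BitSet search(String term, String language) {
        BitSet hits = new BitSet();
        BitSet termDocs = postings.get(term);
        if (termDocs != null) {
            hits.or(termDocs);
            hits.and(languageFilter(language));
        }
        return hits;
    }
}
```

With an English document containing "die" in the index, a search for "die"
restricted to "de" comes back empty, and repeated searches in the same
language reuse the same filter instead of rebuilding it.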

We use the second approach and we're required to support 32 languages. While
in most deployments the number never exceeds 3-4 languages, I know of some
that handle > 10. If you're careful enough, it just works.

Hope this helps.

Shai

On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht <p...@hoplahup.net> wrote:

>
> But for this, you need a skillfully designed:
> - set of fields
> - multiplexing analyzer
> - query expansion
> In one of my projects, we do not split language by fields and it's a
> pain... I'm having recurring issues in one sense or the other.
> - the "die" example that Otis mentioned is a good one: stop-word in
> German, essential verb in English
> - I recently had issues with the contribution of the word Fourier (as in
> Fourier series): in English it stays fourier, in French it becomes fouri.
> So: if the resource is contributed in French, the indexed value is fouri,
> English seekers won't find it; if the resource is contributed in English,
> French seekers won't find it.
> So my last lesson: always have a whitespace-lowercase unstemmed field also
> at hand and prefer it over the others in your query expansion.
>
> A wiki page should probably be made.
>
> paul
>
>
> Le 19 janv. 2011 à 07:53, Vinaya Kumar Thimmappa a écrit :
> > I think we should be using Lucene with the Snowball jars, which means one
> > index for all languages (of course, the size of the index is always a
> > concern).
> >
> > Hope this helps.
> > -vinaya
> >
> > On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
> >> What is the "best practice" to support multiple languages, i.e.
> >> Lucene-Documents that have multiple language content/fields?
> >> Should
> >> a) each language be indexed in a separate index/directory or should
> >> b) the Documents (in a single directory) hold the diverse localized
> >> fields?
> >>
> >> We most often will be searching "language dependent" which (at least
> >> performance wise) mandates one-directory-per-language...
> >>
> >> Any (lucene specific) white papers on this topic?
> >>
> >> Thx in advance
> >> Clemens
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >>
