If each document you index is written in a single language (all of its fields share that language), but different documents can be in different languages, then you have two options:

Create separate indexes per language
------------------------------------

This works and is not too hard to set up. It requires some maintenance code (e.g. directing a search request to the relevant index), but nothing too complicated.

The advantage of this approach is that you don't risk issues like searching for "die" when the language is German and getting back English documents indexed with that word. So your searches are "language safe".

A disadvantage is that if you ever need a cross-language operation, like searching two languages at once, you have to federate the search across indexes, which is less good. Maintenance also becomes a slight pain: for example, you need to optimize multiple indexes and make sure they don't all optimize at once, which would cause a sudden burst of IO.

Create one index
----------------

Here you'd use the IndexWriter.addDocument(doc, analyzer) method and pass the proper Analyzer for the doc's language. That way all your documents live in the same index, so administration is really simple. They also don't step on each other's toes: each document is analyzed exactly as it should be.

You might get into weird situations like the "die" example (fetching a document in the wrong language), but that's easily solved by indexing a "language" field for each document and using it as a Filter during the search. You can cache that Filter so that its posting list is traversed only once instead of for every query.

We use the second approach and we're required to support 32 languages. While in most deployments the number never exceeds 3-4 languages, I know of some that handle > 10. If you're careful enough, it just works.

Hope this helps.

Shai

On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht <p...@hoplahup.net> wrote:
>
> But for this, you need a skillfully designed:
> - set of fields
> - multiplexing analyzer
> - query expansion
> In one of my projects, we do not split language by fields and it's a
> pain...
> I'm having recurring issues in one sense or the other.
> - the "die" example that Otis mentioned is a good one: stop-word in
> German, essential verb in English
> - I recently had issues with the contribution of the word Fourier (for the
> name of the series): in English it stays fourier, in French it becomes fouri.
> So: if the resource is contributed in French, the indexed value is fouri,
> English seekers won't find it; if the resource is contributed in English,
> French seekers won't find it.
> So my last lesson: always have a whitespace-lowercase unstemmed field also
> at hand and prefer it over the others in your query expansion.
>
> A wiki page should probably be made.
>
> paul
>
>
> On 19 Jan 2011, at 07:53, Vinaya Kumar Thimmappa wrote:
> > I think we should be using lucene with snowball jars, which means one
> > index for all languages (of course size of index is always a matter of
> > concern).
> >
> > Hope this helps.
> > -vinaya
> >
> > On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote:
> >> What is the "best practice" to support multiple languages, i.e.
> >> Lucene-Documents that have multiple language content/fields?
> >> Should
> >> a) each language be indexed in a separate index/directory or should
> >> b) the Documents (in a single directory) hold the diverse localized
> >> fields?
> >>
> >> We most often will be searching "language dependent" which (at least
> >> performance wise) mandates one-directory-per-language...
> >>
> >> Any (lucene specific) white papers on this topic?
> >>
> >> Thx in advance
> >> Clemens
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
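
For readers who want to see the single-index idea from the top of this thread in miniature, here is a rough sketch. It is deliberately NOT Lucene code: SingleIndexSketch, its stop-word lists and its methods are all made up for illustration, and the "analyzer" only lowercases, splits on whitespace and drops stop words. In Lucene the same roles would be played by IndexWriter.addDocument(doc, analyzer), a "language" field, and a cached Filter on that field. The sketch assumes Java 9+ (for Map.of / Set.of).

```java
import java.util.*;

/** Toy in-memory stand-in for one multi-language index (hypothetical, not Lucene). */
class SingleIndexSketch {
    // Made-up per-language stop words; a real Analyzer does far more than this.
    static final Map<String, Set<String>> STOP_WORDS = Map.of(
        "en", Set.of("the", "a"),
        "de", Set.of("die", "der", "das"));

    // term -> ids of the documents containing it (a minimal posting list)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // value of the "language" field, indexed by document id
    private final List<String> languageOf = new ArrayList<>();
    // cached language filters: built once, reused by later queries
    private final Map<String, Set<Integer>> filterCache = new HashMap<>();

    /** Analyze text according to the document's language. */
    static List<String> analyze(String text, String language) {
        Set<String> stops = STOP_WORDS.getOrDefault(language, Set.of());
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!t.isEmpty() && !stops.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    /** Add a document, analyzed per its own language; returns its id. */
    int addDocument(String text, String language) {
        int id = languageOf.size();
        languageOf.add(language);
        for (String term : analyze(text, language)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        filterCache.clear(); // the index changed, so cached filters are stale
        return id;
    }

    /** Ids of all documents in the given language, computed once per language. */
    private Set<Integer> languageFilter(String language) {
        return filterCache.computeIfAbsent(language, lang -> {
            Set<Integer> ids = new TreeSet<>();
            for (int id = 0; id < languageOf.size(); id++) {
                if (languageOf.get(id).equals(lang)) ids.add(id);
            }
            return ids;
        });
    }

    /** Search for a single term; pass null as language to search unfiltered. */
    Set<Integer> search(String term, String language) {
        Set<Integer> hits = new TreeSet<>(
            postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of()));
        if (language != null) hits.retainAll(languageFilter(language));
        return hits;
    }
}
```

With an English document "They die young" and a German document "Die Katze ist hier" in the same index, an unfiltered search for "die" returns the English document, while the same search restricted to "de" does not. That is the "die" problem and the language-filter fix in a nutshell.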