Because it does not find "junks" when you search "junk". Or... chevaux when you search cheval.
paul Le 19 janv. 2011 à 18:59, Luca Rondanini a écrit : > why not just using the StandardAnalyzer? it works pretty well even with > Asian languages! > > > > On Wed, Jan 19, 2011 at 12:23 AM, Shai Erera <ser...@gmail.com> wrote: > >> If you index documents, each in a different language, but all its fields >> are >> of the same language, then what you can do is the following: >> >> Create separate indexes per language >> ------------------------------------------------------- >> This will work and is not too hard to set up. Requires some maintenance >> code >> (e.g. directing a search request against the relevant index) but nothing >> too >> complicated. The advantage of using this approach is that you don't risk >> running into issues like search for "die" when the language is German, yet >> you will find documents in English indexed w/ that word. So your searches >> are "language safe". A disadvantage is that if you ever require to do >> cross-language operation, like search two languages, you need to do search >> federation which is less good. Also, maintenance becomes a slight pain, >> because you e.g. need to optimize multiple indexes, make sure they don't >> try >> to optimize at once, resulting in a sudden burst of IO. >> >> Create one index >> ------------------------- >> Here, you'd use IndexWriter.addDocument(doc, analyzer) method and pass the >> proper Analyzer per the doc's language. That way, all your documents are >> located in the same index so administration is really simple. They also >> don't step on each other toes - each document is analyzed exactly as it >> should. You might get into weird situations like the "die" example >> (fetching >> a document in incorrect language), but that's easily solvable by indexing >> for each document a "language" field and use it as a Filter during the >> search. You can cache that Filter so that its posting list isn't traversed >> for every query but instead only once. >> >> We use the second approach and we're required to support 32 languages. >> While >> in most deployments the number never exceeds 3-4 languages, I know of some >> that handle > 10. If you're careful enough, it just works. >> >> Hope this helps. >> >> Shai >> >> On Wed, Jan 19, 2011 at 9:44 AM, Paul Libbrecht <p...@hoplahup.net> wrote: >> >>> >>> But for this, you need a skillfully designed: >>> - set of fields >>> - multiplexing analyzer >>> - query expansion >>> In one of my projects, we do not split language by fields and it's a >>> pain... I'm having recurring issues in one sense or the other. >>> - the "die" example that Oti s mentioned is a good one: stop-word in >>> German, essential verb in English >>> - I had recently issues with the contribution of the word Fourier (for >> the >>> name of series): in English it stays fourier, in French in becomes fouri. >>> So: if the resource is contributed in French, the indexed value is fouri, >>> English seekers won't find it; if the resource is contributed in English, >>> French seekers won't find it. >>> So my last lesson: always have a whitespace-lowercase unstemmed field >> also >>> at hand and prefer it over the others in your query expansion. >>> >>> A wiki page should probably be made. >>> >>> paul >>> >>> >>> Le 19 janv. 2011 à 07:53, Vinaya Kumar Thimmappa a écrit : >>>> I think we should be using lucene with snowball jar's which means one >>> index for all languages (ofcourse size of index is always a matter of >>> concerns). >>>> >>>> Hope this helps. >>>> -vinaya >>>> >>>> On Tuesday 18 January 2011 11:23 PM, Clemens Wyss wrote: >>>>> What is the "best practice" to support multiple languages, i.e. >>> Lucene-Documents that have multiple language content/fields? >>>>> Should >>>>> a) each language be indexed in a seperate index/directory or should >>>>> b) the Documents (in a single directory) hold the diverse localized >>> fields? >>>>> >>>>> We most often will be searching "language dependent" which (at least >>> performance wise) mandates one-directory-per-language... >>>>> >>>>> Any (lucene specific) white papers on this topic? >>>>> >>>>> Thx in advance >>>>> Clemens >>>>> >>>>> --------------------------------------------------------------------- >>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>>> >>>>> >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>> >>> >>> --------------------------------------------------------------------- >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>> >>> >> --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org