Hi Ivan and Robert,

>> sounds like you should talk to Tom Burton-West!

Ok, I'll bite.
A few questions: Are you planning to have separate fields for each
language, or the same fields with contents in different languages? If
the latter, are you planning to have a field to indicate the language so
you can do filter queries? (A rough sketch of that approach is at the
end of this message.) Do you need to accommodate searches where you
don't know what language the user is searching in?

>> 2. What about mixing Latin and non-Latin languages? We ran tests on
>> English and Chinese collections mixed together and didn't see any
>> negative impact (precision/recall).

Interesting. I've wondered whether mixing languages would cause any
issues with the idf stats in the ranking formula, especially if the
number of documents in each language is significantly different. This
may not be relevant to your use case.

We found that dirty OCR combined with multiple languages can produce a
very large number of unique terms. If you have a large enough index,
this can make multiterm queries (prefix, wildcard, etc.) computationally
expensive. It can also seriously increase memory use. We started by
changing the termInfosIndexDivisor to deal with this at search time
(http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again),
but then, when we were re-indexing, we discovered that the
termInfosIndexDivisor doesn't currently affect the IndexReader opened
during indexing, so we changed the termIndexInterval from 128 to 1024.
This took our memory use from over 18GB down to under 4GB and also
eliminated long stop-the-world garbage collection pauses. (Index size is
about 350GB.) Both settings are sketched in code at the end of this
message.

Tom

-----Original Message-----
On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan <ivan.prova...@cengage.com> wrote:

> We are planning to ingest some non-English content into our
> application. All content is OCR'ed, and there are a lot of
> misspellings and garbage terms because of this. Each document has one
> primary language, with some exceptions (e.g. a few English terms mixed
> in with primarily non-English document terms).

sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based
> languages in the same index directory in Lucene (e.g.
> Spanish/French/English)?

I think it depends upon the application. If the user is specifying the
language via the UI somehow, then it's probably simplest to just use
different indexes for each collection (a possible layout is sketched at
the end of this message).

> 2. What about mixing Latin and non-Latin languages? We ran tests on
> English and Chinese collections mixed together and didn't see any
> negative impact (precision/recall). Any other potential issues?

Right, none of the terms would overlap here... the only "issue" would be
a skewed maxDoc, but this is probably not a big deal at all. But what's
the benefit to mixing them?

> 3. Any recommendations for an Urdu analyzer?

You can always start with StandardAnalyzer, as it will tokenize it...
you might be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm
and
http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm
as a stoplist; one way to wire that up is sketched below.
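A minimal sketch of that wiring, assuming Lucene 3.x (current when this
thread was written). "urdu-stopwords.txt" is a hypothetical UTF-8 file
with one word per line, e.g. saved from the CRULP lists above:

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.Set;

    import org.apache.lucene.analysis.WordlistLoader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class UrduAnalyzerFactory {
        public static StandardAnalyzer create() throws Exception {
            // Read the hypothetical stoplist explicitly as UTF-8;
            // WordlistLoader expects one word per line.
            Set<String> urduStopwords = WordlistLoader.getWordSet(
                new InputStreamReader(
                    new FileInputStream("urdu-stopwords.txt"), "UTF-8"));
            // StandardAnalyzer will tokenize Arabic-script text; the
            // custom stopword set replaces the default English one.
            return new StandardAnalyzer(Version.LUCENE_31, urduStopwords);
        }
    }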
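Going back to the questions at the top of the thread, here is a sketch
of the single-index, language-field approach, again against Lucene 3.x.
The field name "lang", the codes, and the class are made up for
illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    public class LanguageFiltering {
        // Index time: tag each document with its primary language.
        public static Document makeDoc(String text, String langCode) {
            Document doc = new Document();
            doc.add(new Field("lang", langCode,
                    Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));
            doc.add(new Field("text", text,
                    Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        }

        // Search time: restrict any query to one language via a filter.
        public static TopDocs searchInLanguage(IndexSearcher searcher,
                Query query, String langCode) throws Exception {
            Filter langFilter = new QueryWrapperFilter(
                    new TermQuery(new Term("lang", langCode)));
            return searcher.search(query, langFilter, 10);
        }
    }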
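And on the one-index-per-language suggestion: a hypothetical layout,
with a MultiReader shown as one possible way (not from the thread) to
still search everything when the user's language is unknown. Directory
paths are placeholders:

    import java.io.File;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class PerLanguageIndexes {
        public static void main(String[] args) throws Exception {
            // One index per collection/language (paths are placeholders).
            IndexReader spanish = IndexReader.open(
                    FSDirectory.open(new File("/indexes/spa")), true);
            IndexReader french = IndexReader.open(
                    FSDirectory.open(new File("/indexes/fra")), true);

            // Known language: search just that index.
            IndexSearcher spanishSearcher = new IndexSearcher(spanish);

            // Unknown language: search all indexes through one
            // MultiReader (cleanup omitted for brevity).
            IndexSearcher allSearcher =
                    new IndexSearcher(new MultiReader(spanish, french));
        }
    }

Note that a combined search like this reintroduces the cross-language
idf question raised earlier in the thread.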
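Finally, the two term-index settings Tom mentions, sketched against
Lucene 3.1. The path is a placeholder and the divisor value 8 is an
arbitrary example; only the interval change (128 -> 1024) and the memory
numbers come from the message above:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class TermIndexSettings {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/path/to/index"));

            // Index time: write a sparser term index than the default
            // interval of 128 -- the change described above that took
            // memory use from over 18GB to under 4GB.
            IndexWriterConfig cfg = new IndexWriterConfig(
                    Version.LUCENE_31,
                    new StandardAnalyzer(Version.LUCENE_31));
            cfg.setTermIndexInterval(1024);
            IndexWriter writer = new IndexWriter(dir, cfg);
            writer.commit();  // ensure a commit exists before reading
            writer.close();

            // Search time: a termInfosIndexDivisor of 8 loads only every
            // 8th indexed term into RAM (null = default deletion policy,
            // true = read-only reader).
            IndexReader reader = IndexReader.open(dir, null, true, 8);
            reader.close();
        }
    }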