We are planning to ingest some non-English content into our application. All content is OCR'ed, so it contains many misspellings and garbage terms. Each document has one primary language, with some exceptions (e.g. a few English terms mixed into a primarily non-English document).

1. Does it make sense to mix two or more different Latin-based languages in the same index directory in Lucene (e.g. Spanish/French/English)?
2. What about mixing Latin and non-Latin languages? We ran tests on English and Chinese collections mixed together and didn't see any negative impact (precision/recall). Any other potential issues?
3. Any recommendations for an Urdu analyzer?

Thank you,
Ivan Provalov