Non-English Languages Search

Provalov, Ivan Mon, 09 May 2011 14:33:30 -0700

We are planning to ingest some non-English content into our application.  All
content is OCR'ed, so it contains a lot of misspellings and garbage terms.
Each document has one primary language, with some exceptions (e.g. a few
English terms mixed in with primarily non-English document terms).


1. Does it make sense to mix two or more different Latin-based languages in the 
same index directory in Lucene (e.g. Spanish/French/English)?  
2. What about mixing Latin and non-Latin languages?  We ran tests on English
and Chinese collections mixed together and didn't see any negative impact
on precision or recall.  Are there any other potential issues?
3. Any recommendations for an Urdu analyzer?

Thank you,

Ivan Provalov
