Hi All!
I'm new to Lucene, so forgive me if this question has been asked before. I have a database with records in the same table in many different languages (up to 70): all the Western European languages, Arabic, Eastern languages, CJK, Cyrillic, you name it. From what I've read, the standard analyzer should do fine with most Unicode languages, but there are quite a few exceptions. Here is a recently updated Lucene JIRA issue on the topic: https://issues.apache.org/jira/browse/LUCENE-1488

My question is: what would be the safest bet for me in terms of analyzers/tokenizers? Do I really have to write my own for the languages that aren't supported? Has anyone already solved a problem similar to mine? I'm sure someone has :)

And yes, I've looked at the Lucene sandbox analyzers, and that just adds more confusion. For example, why are there analyzers for DE and FR? Wouldn't the standard analyzer (which is Unicode compliant, as I understood it) handle European languages just fine?

Thanks in advance for any advice :)
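To make my confusion concrete, here is a toy sketch (plain Java, no Lucene at all; the class and method names are just my own invention) of the two failure modes I think I'm running into with a purely language-agnostic approach: whitespace-based tokenization produces nothing useful for CJK text, and even where tokenization works, inflected forms like German plurals won't match their base form without a language-specific stemmer:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Naive whitespace tokenizer, roughly the behavior a
    // language-agnostic analyzer gives you for European text.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // English: whitespace splitting yields sensible tokens.
        List<String> en = whitespaceTokens("the quick brown fox");
        System.out.println(en.size()); // 4 tokens

        // Chinese has no spaces between words, so the whole sentence
        // comes back as one "token" -- presumably why CJK needs a
        // dedicated (n-gram or dictionary-based) tokenizer.
        List<String> zh = whitespaceTokens("我喜欢读书");
        System.out.println(zh.size()); // 1 token

        // German: tokenization itself works, but the inflected form
        // "Häuser" (houses) will never match a query for "Haus"
        // without stemming -- which I assume is what the DE analyzer
        // is for.
        System.out.println(whitespaceTokens("Häuser").contains("Haus")); // false
    }
}
```

Am I understanding the problem correctly, or is there more to it than tokenization plus stemming?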