Hi All!
I'm new to Lucene, so forgive me if this question has been asked before. I have a database with records in the same table in many different languages (up to 70): all the Western European languages, Arabic, Eastern languages, CJK, Cyrillic, you name it. From what I've read, the standard analyzer should do fine with most Unicode languages, but there are quite a few exceptions. Here is a recently updated Lucene JIRA issue on the topic: https://issues.apache.org/jira/browse/LUCENE-1488

My question is: what would be the safest bet for me in terms of analyzers/tokenizers? Do I really have to write my own for the languages that aren't supported? Has anyone already solved a problem similar to mine? I'm sure someone has :)

And yes, I've looked at the Lucene sandbox analyzers, and that just adds more confusion. For example, why are there analyzers for DE and FR? Wouldn't the standard analyzer (which is Unicode compliant, as I understood it) handle European languages just fine?

Thanks in advance for any advice :)
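To make my confusion concrete, here is a toy sketch (plain Java, no Lucene at all; the class and method names are just my own invention) of the two failure modes I think I'm running into with a purely language-agnostic approach: whitespace-based tokenization produces nothing useful for CJK text, and even where tokenization works, inflected forms like German plurals won't match their base form without a language-specific stemmer:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Naive whitespace tokenizer, roughly the behavior a
    // language-agnostic analyzer gives you for European text.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // English: whitespace splitting yields sensible tokens.
        List<String> en = whitespaceTokens("the quick brown fox");
        System.out.println(en.size()); // 4 tokens

        // Chinese has no spaces between words, so the whole sentence
        // comes back as one "token" -- presumably why CJK needs a
        // dedicated (n-gram or dictionary-based) tokenizer.
        List<String> zh = whitespaceTokens("我喜欢读书");
        System.out.println(zh.size()); // 1 token

        // German: tokenization itself works, but the inflected form
        // "Häuser" (houses) will never match a query for "Haus"
        // without stemming -- which I assume is what the DE analyzer
        // is for.
        System.out.println(whitespaceTokens("Häuser").contains("Haus")); // false
    }
}
```

Am I understanding the problem correctly, or is there more to it than tokenization plus stemming?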