Re: How international languages are supported in Lucene

Daniel Noll Mon, 09 Jun 2008 17:09:07 -0700

On Tuesday 10 June 2008 07:49:29 Otis Gospodnetic wrote:
> Hi Daniel,
>
> What makes you say that about language detection?  Wouldn't that depend on
> the language detection approach or tool one uses and on the type and amount
> of content one trains language detector on?  And what is the threshold for
> "reliable enough" that you have in mind?


I can't come up with a number, of course, but I can say for certain that
ICU's detector is unusable for detecting languages.  It's barely good enough
to identify the charset correctly: if you create a simple test in one
charset, it often detects it as another.  If you then re-encode the text in
that charset, it detects it as yet another, and so forth.
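
The round trip described above could be sketched roughly as follows.  This is
only an illustration, not ICU code: `detect` is a hypothetical callable that
wraps whatever detector is under test and returns a charset name, and
`charset_round_trip` and `always_utf8` are made-up names.

```python
from typing import Callable, List


def charset_round_trip(text: str,
                       start_charset: str,
                       detect: Callable[[bytes], str],
                       max_steps: int = 5) -> List[str]:
    """Encode, detect, re-encode in the detected charset, and repeat.

    Returns the starting charset followed by every charset the detector
    reported. A trustworthy detector should settle immediately, giving
    [start, start]. `detect` is an assumption: any function from bytes to
    a charset name that Python's codecs recognise.
    """
    chain = [start_charset]
    current = start_charset
    for _ in range(max_steps):
        # Characters the charset cannot represent are replaced, not fatal.
        data = text.encode(current, errors="replace")
        detected = detect(data)
        chain.append(detected)
        if detected.lower() == current.lower():
            break  # the detector agrees with the encoding actually used
        current = detected
    return chain


def always_utf8(data: bytes) -> str:
    """Stand-in detector for demonstration; it ignores its input."""
    return "utf-8"
```

Calling `charset_round_trip(sample, "iso-8859-1", always_utf8)` with the
stand-in returns the chain of charsets visited; a chain longer than two
entries means the detector disagreed with the encoding at least once.  If
the detector reports a name Python does not know, `str.encode` raises
`LookupError`.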

If you know of any better [open source] libraries for the same purpose, I'd
love to hear of them.

Additionally, I also consider anything the developer or user has to train
unreliable.  If a detector has to be trained, it should be trained by the
people distributing it.  Not everyone has a corpus of every language in the
world with which to train such a thing. :-/

Daniel
