On Tuesday 10 June 2008 07:49:29 Otis Gospodnetic wrote: > Hi Daniel, > > What makes you say that about language detection? Wouldn't that depend on > the language detection approach or tool one uses and on the type and amount > of content one trains language detector on? And what is the threshold for > "reliable enough" that you have in mind?
I can't come up with a number of course, but I can say for certain that ICU's detector is unusable for detecting languages. It's barely good enough to correctly identify the charset; if you create a simple test in one charset it often detects it as another. If you then re-encode the text in that charset, it detects it as being yet another, and so forth. If you know of any better [open source] libraries for the same purpose, I'd love to hear of it. Additionally, anything the developer or user has to train I consider unreliable also. If a detector has to be trained, it should be trained by the ones who are distributing it. Not everyone has a corpus of every language in the world in order to train such a thing. :-/ Daniel --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]