Aha, I see. I wasn't referring to character-encoding-based lang ID. That is probably good enough if you need to know if the text is English or if it's Chinese or Japanese or Korean or Russian Cyrillic or Arabic or...
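To make that coarse kind of identification concrete, here is a minimal sketch. It is an assumption on my part, not how any particular library does it: it works on already-decoded Unicode text and looks at Unicode character names, rather than guessing from raw bytes the way a charset detector would. The names `SCRIPT_PREFIXES` and `dominant_script` are made up for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical mapping from the first word of a Unicode character name
# to a coarse script label. Only a handful of scripts are covered.
SCRIPT_PREFIXES = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "HANGUL": "Hangul",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "CJK": "Han",
}


def dominant_script(text):
    """Return the most frequent script label among the letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace
        # unicodedata.name() returns e.g. "CYRILLIC SMALL LETTER A";
        # the default "" covers characters that have no name.
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        label = SCRIPT_PREFIXES.get(prefix)
        if label is not None:
            counts[label] += 1
    return counts.most_common(1)[0][0] if counts else None


if __name__ == "__main__":
    for sample in ["Hello world", "Привет мир", "안녕하세요", "これはテストです"]:
        print(sample, "->", dominant_script(sample))
```

Note the limits: "Latin" says nothing about English vs. French vs. Croatian, and "Han" alone does not separate Chinese from Japanese text written mostly in kanji. Anything finer than script needs a statistical detector.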
I think there is a bit missing in your statement about training. You can't just train on, say, "English". There are all kinds of "English" out there. The distribution of terms in WSJ is probably a lot different from what you might find in some corpus of medical texts or botany texts or contemporary art.

As for not everyone having access to a corpus, I've come to love Wikipedia for exactly this thing. :) For an NLP class I took recently we used Croatian Wikipedia while developing an unsupervised morphological analyzer for highly inflected languages (Croatian being one of them - 7 cases, a pile of case/number/gender-dependent suffixes...). The analyzer ended up matching the state-of-the-art results. :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Daniel Noll <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Tuesday, June 10, 2008 2:09:10 AM
> Subject: Re: How international languages are supported in Lucene
>
> On Tuesday 10 June 2008 07:49:29 Otis Gospodnetic wrote:
> > Hi Daniel,
> >
> > What makes you say that about language detection? Wouldn't that depend on
> > the language detection approach or tool one uses and on the type and amount
> > of content one trains language detector on? And what is the threshold for
> > "reliable enough" that you have in mind?
>
> I can't come up with a number of course, but I can say for certain that ICU's
> detector is unusable for detecting languages. It's barely good enough to
> correctly identify the charset; if you create a simple test in one charset it
> often detects it as another. If you then re-encode the text in that charset,
> it detects it as being yet another, and so forth.
>
> If you know of any better [open source] libraries for the same purpose, I'd
> love to hear of it.
>
> Additionally, anything the developer or user has to train I consider
> unreliable also. If a detector has to be trained, it should be trained by
> the ones who are distributing it. Not everyone has a corpus of every
> language in the world in order to train such a thing. :-/
>
> Daniel
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]