Re: NGram Language Categorization Source

2005-08-22 Thread Andrzej Bialecki
Kevin Burton wrote: A lot depends on the reference profiles (which in turn depend on the quality of your training corpus - in this case, your corpus is not the best choice, because each text contains a lot of foreign words). I realize that my corpus isn't the best. That's one of the reasons

Re: NGram Language Categorization Source

2005-08-21 Thread Otis Gospodnetic
Hello, Sounds like that LI acronym was confusing: Language Identification. Otis > > It was also found that the way you create ngram profiles (e.g. with or without surrounding spaces, single length or mixed length) affects the LI performance. > LI??? > I haven't benchmarked
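
For readers unfamiliar with the profile options mentioned in the quoted text (with or without surrounding spaces, single length or mixed length), here is a minimal sketch of how such profiles could be built. It is not the code from Nutch, ngramj, or the library announced in this thread. The class name NGramProfileSketch, the methods countNGrams and profile, and the use of '_' as the word-boundary marker are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the thread.
public class NGramProfileSketch {

    /**
     * Counts character n-grams of every length from minLen to maxLen.
     * Set minLen == maxLen for single-length profiles, or a range for mixed-length ones.
     * When padWords is true, each word is wrapped in '_' (an assumed boundary marker)
     * so that word-initial and word-final n-grams are distinguished.
     */
    public static Map<String, Integer> countNGrams(String text, int minLen, int maxLen, boolean padWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on runs of non-letter characters so punctuation and digits are ignored.
        for (String word : text.toLowerCase().split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            String token = padWords ? "_" + word + "_" : word;
            for (int n = minLen; n <= maxLen; n++) {
                for (int i = 0; i + n <= token.length(); i++) {
                    counts.merge(token.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /**
     * Builds a ranked profile: n-grams sorted by descending count
     * (ties broken alphabetically), truncated to at most 'size' entries.
     */
    public static List<String> profile(Map<String, Integer> counts, int size) {
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(size, grams.size())));
    }

    public static void main(String[] args) {
        String sample = "the cat and the hat";
        // Mixed-length (1 to 3) profile with word-boundary padding.
        System.out.println(profile(countNGrams(sample, 1, 3, true), 10));
        // Single-length (trigram) profile without padding.
        System.out.println(profile(countNGrams(sample, 3, 3, false), 10));
    }
}
```

Running both variants over the same training text and comparing the resulting profiles is one way to see how these choices change what a classifier has to work with.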

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> * A Nutch implementation: > http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/ > > * A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763 A step in the right direction. It doesn't have other language categories created though. > * JTextCat (http

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
> Erhm... Not to rain on your parade, but Googling for "ngram java" gives > a lot of hits. http://sourceforge.net/projects/ngramj and also > "languageidentifier" in Nutch are two examples of Open Source Java > implementations. Each can be used with Lucene. I think I've played with ngramj and found

Re: NGram Language Categorization Source

2005-08-21 Thread Kevin Burton
>ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf > > > > Linguini: Language Identification for Multilingual Documents > > John M. Prager > > Prager also uses an n-gram approach, so you might

Re: NGram Language Categorization Source

2005-08-20 Thread Tom White
Hi Kevin, On 8/19/05, Kevin Burton <[EMAIL PROTECTED]> wrote: > Hey lucene guys. > > I know for a fact that a bunch of you have been curious about language > categorization for a long time now and Java has lacked a solid way to > solve this problem. > > Anyway. This new library that I just rele

Re: NGram Language Categorization Source

2005-08-20 Thread Andrzej Bialecki
Kevin Burton wrote: Hey lucene guys. I know for a fact that a bunch of you have been curious about language categorization for a long time now and Java has lacked a solid way to solve this problem. Anyway. This new library that I just released should be easy to tie into your lucene indexers.

Re: NGram Language Categorization Source

2005-08-20 Thread Ken Krugler
Hi Kevin, I know for a fact that a bunch of you have been curious about language categorization for a long time now and Java has lacked a solid way to solve this problem. Anyway. This new library that I just released should be easy to tie into your lucene indexers. Just use the library on a t

NGram Language Categorization Source

2005-08-20 Thread Kevin Burton
Hey lucene guys. I know for a fact that a bunch of you have been curious about language categorization for a long time now and Java has lacked a solid way to solve this problem. Anyway. This new library that I just released should be easy to tie into your lucene indexers. Just use the library o