Kevin Burton wrote:
Hey lucene guys.

I know for a fact that a bunch of you have been curious about language
categorization for a long time now and Java has lacked a solid way to
solve this problem.

Anyway.  This new library that I just released should be easy to tie
into your Lucene indexers.  Just run the library on the text (strip the
HTML first), store the result in a new Lucene field called LANG (or
something), and then apply a filter at search time with JUST that
language code.

I'd love some help with filling out missing languages if anyone has
some spare time.  That would help make up for all the hard work I've
done here (nudge.. nudge)

I did a full survey of the language categorization space for Java and
I think this is basically the only library out there.

Erhm... Not to rain on your parade, but Googling for "ngram java" gives a lot of hits. http://sourceforge.net/projects/ngramj and also "languageidentifier" in Nutch are two examples of Open Source Java implementations. Each can be used with Lucene.

A lot depends on the reference profiles (which in turn depend on the quality of your training corpus - in this case, your corpus is not the best choice, because each text contains a lot of foreign words). It was also found that the way you create ngram profiles (e.g. with or without surrounding spaces, single length or mixed length) affects language identification (LI) performance. For documents with mixed languages it was also found that methods that combine ngrams with stopwords work better.
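
To make the profile options above concrete, here is a minimal sketch in plain Java (no Lucene, ngramj or Nutch code involved). The class name NgramProfileSketch, the method name profile, and the use of '_' as the padding character are all assumptions made for illustration only; real libraries may tokenize, pad and rank ngrams differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration class; not part of any existing library.
final class NgramProfileSketch {

    /**
     * Counts character ngrams of every length from minN to maxN.
     * If padWords is true, each word is wrapped in '_' so that ngrams
     * at word boundaries ("surrounding spaces") are counted separately.
     */
    public static Map<String, Integer> profile(String text, int minN, int maxN,
                                               boolean padWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter (assumption: words only).
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token on leading separators
            }
            String token = padWords ? "_" + word + "_" : word;
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= token.length(); i++) {
                    counts.merge(token.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Single length (bigrams only), with and without boundary padding.
        System.out.println(profile("the cat", 2, 2, true));
        System.out.println(profile("the cat", 2, 2, false));
        // Mixed length (1- to 3-grams) with padding.
        System.out.println(profile("the cat", 1, 3, true));
    }
}
```

With padding, the bigram profile of "the cat" also contains boundary ngrams such as "_t" and "t_"; without it, only the word-internal bigrams remain. That is why the two variants can rank candidate languages differently.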

Additionally, simple methods based on cosine similarity (or delta ranking) don't give correct results for documents with mixed languages. In such cases input texts are chunked, and each chunk is analyzed separately, and then the scores are combined... etc, etc... millions of ways you can do this - and of course no method is perfect. :-)
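
A rough sketch of the chunk-and-combine idea, again in plain Java. Everything here is an assumption for illustration: the class name ChunkedLanguageSketch, fixed-size character chunks, raw trigram counts as profiles, and summing per-chunk cosine scores as the combination rule. It is one of the many possible variants, not the method used by any particular library.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of any existing library.
final class ChunkedLanguageSketch {

    /** Raw character-trigram counts over the lower-cased text. */
    public static Map<String, Integer> trigrams(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two count vectors; 0.0 if either is empty. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /**
     * Splits the text into chunks of chunkSize characters, scores each chunk
     * against every reference profile, and sums the scores per language.
     */
    public static Map<String, Double> scoreByChunks(
            String text, Map<String, Map<String, Integer>> references, int chunkSize) {
        Map<String, Double> totals = new TreeMap<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(text.length(), start + chunkSize);
            Map<String, Integer> chunkProfile = trigrams(text.substring(start, end));
            for (Map.Entry<String, Map<String, Integer>> ref : references.entrySet()) {
                totals.merge(ref.getKey(), cosine(chunkProfile, ref.getValue()), Double::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Toy reference profiles; a real system needs a proper training corpus.
        Map<String, Map<String, Integer>> refs = new HashMap<>();
        refs.put("en", trigrams("the quick brown fox jumps over the lazy dog and the cat"));
        refs.put("de", trigrams("der schnelle braune fuchs springt den faulen hund an und die katze"));
        System.out.println(scoreByChunks("the dog and the fox", refs, 10));
    }
}
```

Keeping the per-chunk scores (instead of only the sum) would also let you report which parts of a mixed document belong to which language, which a single whole-document score cannot do.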

So, there is still a lot to do in this area, if you come up with some unique way of improving LI performance...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

