Kevin Burton wrote:
Hey lucene guys.
I know for a fact that a bunch of you have been curious about language
categorization for a long time now and Java has lacked a solid way to
solve this problem.
Anyway. This new library that I just released should be easy to tie
into your lucene indexers. Just run the library on the text (strip the
HTML first), store the result in a new Lucene field called LANG (or
something), and then apply a filter before you search so that only
documents with JUST that language code are returned.
I'd love some help with filling out missing languages if anyone has
some spare time. That would help make up for all the hard work I've
done here (nudge.. nudge)
I researched the language categorization space for Java thoroughly and
I think this is basically the only library out there.
Erhm... Not to rain on your parade, but Googling for "ngram java" gives
a lot of hits. http://sourceforge.net/projects/ngramj and also
"languageidentifier" in Nutch are two examples of Open Source Java
implementations. Each can be used with Lucene.
A lot depends on the reference profiles (which in turn depend on the
quality of your training corpus - in this case, your corpus is not the
best choice, because each text contains a lot of foreign words). It was
also found that the way you create ngram profiles (e.g. with or without
surrounding spaces, single length or mixed length) affects language
identification (LI) performance. For documents with mixed languages it
was also found that methods that combine ngrams with stopwords work
better.
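To make the profile-building choices above concrete, here is a minimal sketch in plain Java. It is not code from ngramj or Nutch; the class name NgramProfileSketch, the methods countNgrams and rankedProfile, and the use of '_' as the word-boundary marker are all assumptions made for illustration. It only shows how padding and the n-gram length range change what ends up in a profile.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of any existing library.
class NgramProfileSketch {

    // Counts character n-grams of every length from minLen to maxLen.
    // With padWords = true each word is wrapped in '_' (an assumed marker),
    // so word-initial and word-final n-grams become distinguishable.
    // minLen == maxLen gives a single-length profile, otherwise mixed length.
    static Map<String, Integer> countNgrams(String text, int minLen, int maxLen,
                                            boolean padWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) {
                continue;
            }
            String token = padWords ? "_" + word + "_" : word;
            for (int n = minLen; n <= maxLen; n++) {
                for (int i = 0; i + n <= token.length(); i++) {
                    counts.merge(token.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Orders n-grams by descending count (ties broken alphabetically)
    // and keeps at most topK of them.
    static List<String> rankedProfile(Map<String, Integer> counts, int topK) {
        List<String> grams = new ArrayList<>(counts.keySet());
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(topK, grams.size())));
    }

    public static void main(String[] args) {
        String sample = "the cat sat on the mat";
        // Mixed length (1 to 3) with padding.
        System.out.println(rankedProfile(countNgrams(sample, 1, 3, true), 10));
        // Single length (3 only) without padding.
        System.out.println(rankedProfile(countNgrams(sample, 3, 3, false), 10));
    }
}
```

Running both variants over the same training text and comparing the resulting profiles is a cheap way to see how much these choices matter for a given corpus.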
Additionally, simple methods based on cosine similarity (or delta
ranking) don't give correct results for documents with mixed languages.
In such cases input texts are chunked, and each chunk is analyzed
separately, and then the scores are combined... etc, etc... millions of
ways you can do this - and of course no method is perfect. :-)
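As a rough illustration of the chunk-and-combine idea, the sketch below splits the input into fixed-size word chunks, scores each chunk against every reference profile with cosine similarity over character trigrams, and tallies the per-chunk winners. Everything here is an assumption for the sake of the example: the class ChunkedLanguageIdSketch, its method names, the tiny one-sentence "profiles", and the majority-style tally are hypothetical and far too small for real use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not taken from any existing library.
class ChunkedLanguageIdSketch {

    // Character trigram counts over lower-cased, whitespace-normalized text.
    static Map<String, Integer> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Label of the profile with the highest cosine score, or "unknown".
    static String best(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> grams = trigrams(text);
        String bestLang = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = cosine(grams, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                bestLang = p.getKey();
            }
        }
        return bestLang;
    }

    // Splits the text into chunks of chunkWords words and labels each chunk.
    static List<String> identifyChunks(String text, int chunkWords,
                                       Map<String, Map<String, Integer>> profiles) {
        String[] words = text.trim().split("\\s+");
        List<String> labels = new ArrayList<>();
        for (int start = 0; start < words.length; start += chunkWords) {
            int end = Math.min(start + chunkWords, words.length);
            String chunk = String.join(" ", Arrays.copyOfRange(words, start, end));
            labels.add(best(chunk, profiles));
        }
        return labels;
    }

    // One simple way to combine chunk results: count chunks per language.
    static Map<String, Integer> tally(List<String> labels) {
        Map<String, Integer> votes = new HashMap<>();
        for (String label : labels) {
            votes.merge(label, 1, Integer::sum);
        }
        return votes;
    }

    public static void main(String[] args) {
        // Toy reference profiles built from one sentence each (assumption).
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        profiles.put("en", trigrams("the quick brown fox jumps over the lazy dog and the cat"));
        profiles.put("de", trigrams("der schnelle braune fuchs springt und der hund und die katze schlafen"));

        String mixed = "the dog and the cat der hund und die katze";
        List<String> labels = identifyChunks(mixed, 5, profiles);
        System.out.println(labels);
        System.out.println(tally(labels));
    }
}
```

Scoring the whole text in one pass would force a single label onto a document that is half one language and half another, which is why the per-chunk labels are kept and only then combined. The chunk size and the combination rule are exactly the kind of knobs where results vary.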
So, there is still a lot to do in this area, if you come up with some
unique way of improving LI performance...
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com