[ANNOUNCEMENT] NLP-based Analyzer library for Lucene

Lars Buitinck Tue, 08 Feb 2011 08:51:32 -0800

Dear all,

For anyone wanting to add some NLP abilities to Lucene, I've released
a small library at
https://github.com/larsmans/lucene-stanford-lemmatizer. This library
performs part-of-speech tagging (determining word categories such as
noun or verb), filtering based on part of speech, and lemmatization
(reducing words to their base form).


In other words: this is an NLP-based replacement for a stemmer and a
stop list, implemented as a Lucene analyzer. It requires the Stanford
POS Tagger.

lucene-stanford-lemmatizer can be used to index or query lemmas as
well as the terms as they appear in the text, and/or to filter out
terms before indexing/querying based on their part of speech. By
default, it filters out pronouns, determiners (the, a) and several
other non-informative word categories.
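
As a rough illustration of the behaviour described above, here is a
minimal, self-contained sketch. It is not the library's code and does
not use the Lucene or Stanford APIs: the TaggedToken type, the
indexTerms method and the list of dropped tags are assumptions made up
for this example. The tags are Penn Treebank tags, which the Stanford
tagger's English models emit. The input is assumed to be already tagged
and lemmatized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PosFilterSketch {
    /** Hypothetical holder for a token that was already tagged and lemmatized. */
    static final class TaggedToken {
        final String term;   // surface form as it appears in the text
        final String tag;    // Penn Treebank part-of-speech tag
        final String lemma;  // base form of the word

        TaggedToken(String term, String tag, String lemma) {
            this.term = term;
            this.tag = tag;
            this.lemma = lemma;
        }
    }

    // Illustrative subset only: determiners (DT) and pronouns (PRP, PRP$).
    // The library's real default list is not reproduced here.
    static final Set<String> DROPPED_TAGS =
            new HashSet<>(Arrays.asList("DT", "PRP", "PRP$"));

    /** Drops tokens with an unwanted tag; emits each kept term plus its lemma. */
    static List<String> indexTerms(List<TaggedToken> tokens) {
        List<String> out = new ArrayList<>();
        for (TaggedToken t : tokens) {
            if (DROPPED_TAGS.contains(t.tag)) {
                continue; // filtered out by part of speech
            }
            String term = t.term.toLowerCase(Locale.ROOT);
            out.add(term);
            if (!t.lemma.equals(term)) {
                out.add(t.lemma); // index the lemma next to the surface form
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<TaggedToken> sentence = Arrays.asList(
                new TaggedToken("The", "DT", "the"),
                new TaggedToken("dogs", "NNS", "dog"),
                new TaggedToken("chased", "VBD", "chase"),
                new TaggedToken("it", "PRP", "it"));
        System.out.println(indexTerms(sentence)); // prints [dogs, dog, chased, chase]
    }
}
```

In the sketch the lemma is only added when it differs from the
lowercased term, so a query for "dog" or "dogs" would both match the
example sentence without duplicating unchanged words.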

I've seen this code improve search quality, even on very noisy data.
The software is designed for English, but it does a pretty good job of
detecting non-English words and leaving those alone (in contrast to
the Porter/Snowball stemmers).

Regards,
Lars Buitinck

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
