Dear all,

For anyone wanting to add some NLP abilities to Lucene, I've released a small library at https://github.com/larsmans/lucene-stanford-lemmatizer . This library performs part-of-speech tagging (determining word categories such as noun or verb), filtering based on part of speech, and lemmatizing (reducing words to their base form).
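
To make the three steps concrete, here is a minimal, self-contained Java sketch of the idea. It does not use the library's API, Lucene, or the Stanford POS Tagger. The class name PosFilterSketch, the hard-coded Penn Treebank tags, the set of dropped tags and the tiny lemma table are all assumptions made up for illustration; a real pipeline would get tags and lemmas from the tagger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class PosFilterSketch {

    /** A word paired with a part-of-speech tag (hypothetical helper type). */
    public static final class Tagged {
        public final String word;
        public final String tag;

        public Tagged(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    // Illustrative guess at "non-informative" Penn Treebank tags:
    // determiners, pronouns, prepositions, conjunctions.
    static final Set<String> DROPPED = Set.of("DT", "PRP", "PRP$", "IN", "CC");

    // Toy lemma table standing in for a real lemmatizer.
    static final Map<String, String> LEMMAS =
            Map.of("dogs", "dog", "barked", "bark", "mice", "mouse");

    /** Lowercases the word and maps it to its lemma; unknown words pass through. */
    public static String lemma(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(lower, lower);
    }

    /**
     * Drops tokens whose tag is in DROPPED and emits the lemma of each
     * remaining token, followed by the surface form when keepSurface is
     * true and the surface form differs from the lemma.
     */
    public static List<String> indexTerms(List<Tagged> tokens, boolean keepSurface) {
        List<String> out = new ArrayList<>();
        for (Tagged t : tokens) {
            if (DROPPED.contains(t.tag)) {
                continue; // filtered out by part of speech
            }
            String surface = t.word.toLowerCase(Locale.ROOT);
            String lemma = lemma(t.word);
            out.add(lemma);
            if (keepSurface && !lemma.equals(surface)) {
                out.add(surface);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> sentence = List.of(
                new Tagged("The", "DT"),
                new Tagged("dogs", "NNS"),
                new Tagged("barked", "VBD"),
                new Tagged("at", "IN"),
                new Tagged("them", "PRP"));
        System.out.println(indexTerms(sentence, true)); // prints [dog, dogs, bark, barked]
    }
}
```

Passing false as the second argument yields only the lemmas, which corresponds to using lemmas in place of stemmed terms.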

In other words: this is an NLP-based replacement for a stemmer and a stop list, implemented as a Lucene analyzer. It requires the Stanford POS Tagger.

lucene-stanford-lemmatizer can be used to index or query lemmas as well as the terms as they appear in the text, and/or to filter out terms by part of speech before indexing or querying. By default, it filters out pronouns, determiners ("the", "a") and several other non-informative word categories.

I've seen this code improve search quality, even on very noisy data. The software is designed for English, but does a pretty good job of detecting non-English words and leaving those alone (in contrast to the Porter/Snowball stemmer).

Regards,
Lars Buitinck