As far as I know, Lucene's stemming is based on Snowball (http://snowball.tartarus.org/),
and Snowball, designed by Martin Porter, includes an implementation of his
stemming algorithm (http://www.tartarus.org/~martin/PorterStemmer/), so, if I'm
not wrong, you should refer to those pages.
I have tried both HtmlParser v1.5 and NekoHTML. With the former, my
implementation doesn't work: for example, it also extracts text from
JavaScript blocks. I followed the hint from
http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/TextExtractingVisitor.html
The following is my NOT working implement
d the HtmlParser coming with Nutch but I wasn't able to
make it work without adjusting Nutch's global XML configuration;
perhaps that is the only way to make such a plugin work? Does Lucene expose
any good HTML parser in the contrib section to parse web pages found
in the wild?
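
To make the requirement concrete: the goal is the visible text of a page, with the contents of script and style elements left out. The following is only a rough sketch of that behaviour using java.util.regex from the JDK. It is not how HtmlParser or NekoHTML work, the class and method names (HtmlTextSketch, extractText) are made up for illustration, and regular expressions will misread malformed markup that a real parser would handle.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene, Nutch, HtmlParser or NekoHTML.
final class HtmlTextSketch {
    // A whole <script>...</script> or <style>...</style> element, matched
    // case-insensitively and across line breaks. The back-reference \1 ties
    // the closing tag name to the opening one.
    private static final Pattern NON_VISIBLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // HTML comments, which may also hide script fragments.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Any remaining tag. Assumption: no literal '<' or '>' in the text itself.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns the text left after dropping script/style elements, comments
    // and tags. Character entities such as &amp; are NOT decoded here.
    static String extractText(String html) {
        String s = NON_VISIBLE.matcher(html).replaceAll(" ");
        s = COMMENT.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        return WHITESPACE.matcher(s).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><script type=\"text/javascript\">var x = 1;</script></head>"
                + "<body><p>Hello <b>Lucene</b></p></body></html>";
        System.out.println(extractText(html)); // prints "Hello Lucene"
    }
}
```

The order matters: script and style elements are removed as whole units first, so their contents never reach the tag-stripping step. The resulting string could then be handed to a Lucene field for indexing.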
Best regards,
G
As Lucene's native language is Java, it would be more natural to access its
functionality through JSP; still, the idea of accessing Lucene from PHP
seems interesting, as PHP is perhaps the most widely deployed
server-side scripting language.
I think that the way to provide access to Lucene AP