Hello!

On Fri, Mar 11, 2011 at 12:03 PM, shrinath.m <shrinat...@webyog.com> wrote:
> I am trying to index content within certain HTML tags, how do I index it?
> Which is the best parser/tokenizer available to do this?

As a general HTML parser I would recommend "Jericho HTML Parser" - http://jericho.htmlparser.net/docs/index.html

It is pretty robust (this is usually very important) and easy to use (see the provided examples). At the bottom of that page there is a short overview of a few other HTML parsers.

I haven't compared its speed with other parsers (I was only looking for a robust parser). Note that this parser is neither an event-based nor a tree-based parser (so even automata theory can help us here).

If you need something pretty specific, like extracting all links from a page, I would recommend using simple regular expressions.

Regards,
Ivan Krišto
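
To illustrate the regular-expression suggestion for the link-extraction case, here is a minimal sketch in Java using only java.util.regex. The class name LinkExtractor, the method extractLinks, and the pattern itself are assumptions made for this example and do not come from Jericho or Lucene. The pattern only handles quoted href values on <a> tags, so treat it as a starting point rather than a complete solution.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Jericho or Lucene.
final class LinkExtractor {

    // Assumed pattern: an <a ...> tag with a single- or double-quoted href value.
    // Group 1 is the quote character, group 2 is the href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\shref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the href values of all matching <a> tags, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://lucene.apache.org/\">Lucene</a> and "
                + "<A class='x' HREF='/docs/index.html'>docs</A></p>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

The extracted strings could then be put into whatever field you index. Unquoted attribute values, links built by scripts, and badly broken markup are not covered by this sketch; for those cases a robust parser is probably the safer choice.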