On Fri, Mar 11, 2011 at 10:03 PM, shrinath.m <shrinat...@webyog.com> wrote:
> I am trying to index content within certain HTML tags. How do I index it?
> Which is the best parser/tokenizer available to do this?

This doesn't really answer the question, but I think it will help...

The features you want to look for:

1. A StAX-like "pull parsing" API. This makes it easier to implement
   Reader, since Reader is also a pull API.
2. Doesn't try to store the entire HTML file in memory in any form.
   This keeps it from bombing on gigantic HTML files, which do occur
   in reality.

A specific counterexample which fails to satisfy both of these rules is
HTMLParser (htmlparser.sf.net), but be cautious of any library which
doesn't satisfy both.

TX
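
To illustrate the two points above, here is a rough sketch of a Reader layered on a pull parser. It uses the JDK's own StAX API (javax.xml.stream), which only handles well-formed XML, so it assumes the input is XHTML; for real-world tag soup you would swap in an HTML-tolerant pull parser with a similar event loop. The class name TagTextReader and its constructor arguments are made up for this example and are not part of any library.

```java
import java.io.IOException;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: exposes only the text found inside <tag>...</tag>
// as a Reader, pulling parser events on demand instead of building a tree.
class TagTextReader extends Reader {
    private final XMLStreamReader xml;
    private final String tag;
    private int depth = 0;        // > 0 while inside the target element
    private String pending = "";  // current text chunk not yet handed out
    private int pos = 0;          // read position within 'pending'

    TagTextReader(Reader source, String tag) throws XMLStreamException {
        // Assumption: the input is well-formed XHTML.
        this.xml = XMLInputFactory.newInstance().createXMLStreamReader(source);
        this.tag = tag;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        try {
            // Pull events only until there is some text to return.
            while (pos >= pending.length()) {
                if (!xml.hasNext()) {
                    return -1; // end of document
                }
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && xml.getLocalName().equals(tag)) {
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals(tag) && depth > 0) {
                    depth--;
                    if (depth == 0) {
                        pending = " "; // separate consecutive matches
                        pos = 0;
                    }
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && depth > 0) {
                    pending = xml.getText();
                    pos = 0;
                }
            }
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        try {
            xml.close();
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }
}
```

Only the current text chunk is held at any time, so memory use does not grow with the size of the file. Something like new TagTextReader(fileReader, "h1") could then be passed to anything that consumes a Reader.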