I suggest the Jsoup HTML parser, which is fast, easy to use, and simple. I have used many HTML parsers, and Jsoup is the one I am most comfortable with.
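
For offline pages, something like the following might be a starting point. It is only a minimal sketch, assuming the jsoup jar is on the classpath; the class name HtmlTextExtractor and the method bodyText are made up for illustration, and the sample HTML string is invented. It parses the markup and returns the visible body text with the tags stripped, which you could then hand to whatever analyzer or tokenizer you index with.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class; the names are illustrative, not part of jsoup.
public class HtmlTextExtractor {

    // Parses an HTML string and returns the text of <body> with tags removed.
    static String bodyText(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().text();
    }

    // Returns the content of the <title> element (empty string if absent).
    static String title(String html) {
        return Jsoup.parse(html).title();
    }

    public static void main(String[] args) {
        // Invented sample page standing in for a stored offline HTML file.
        String html = "<html><head><title>Sample</title></head>"
                + "<body><p>Hello <b>Lucene</b> users.</p></body></html>";
        System.out.println(title(html));
        System.out.println(bodyText(html));
    }
}
```

For files on disk, Jsoup.parse(File, String charsetName) should work the same way, though it throws IOException and you need to know (or guess) the encoding.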
http://jsoup.org/

IBM ICU provides the best tokenizers.

On 3/11/11, Bill Janssen <jans...@parc.com> wrote:
> shrinath.m <shrinat...@webyog.com> wrote:
>
>> Consider we've offline HTML pages, no parsing while crawling, now what ?
>> Any tokenizer someone has built for this ?
>
> In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
> by selecting only text between certain tags, before indexing them.
> These are offline Web pages, as in your application. Take a look at
> <http://uplib.parc.com/hg/uplib/file/2a204fc2dd1a/extensions/FilterWebPage.py>.
>
> Bill
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--
*********************************
Sreejith.S

http://sreejiths.emurse.com/
http://srijiths.wordpress.com/
tweet2sree@twitter
*********************************
ILUGCBE
http://ilugcbe.techstud.org