Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

Trejkaz Sat, 12 Mar 2011 02:00:34 -0800

On Fri, Mar 11, 2011 at 10:03 PM, shrinath.m <shrinat...@webyog.com> wrote:
> I am trying to index content within certain HTML tags, how do I index it ?
> Which is the best parser/tokenizer available to do this ?


This doesn't really answer the question, but I think it will help...

The features you want to look for:
1. A StAX-like "pull parsing" API - this makes it easier to implement
a java.io.Reader on top of the parser, since Reader is also a pull
API (the caller asks for the next chunk of text).
2. Doesn't try to store the entire HTML file in memory in any form -
this keeps it from bombing on gigantic HTML files, which do occur in
reality.
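
To make the two points concrete, here is a rough sketch of a Reader
layered over a pull parser. It uses the JDK's own StAX API
(javax.xml.stream) as a stand-in, which is an assumption on my part: StAX
only accepts well-formed XML/XHTML, so for real-world HTML you would swap
in an HTML-aware pull parser with a similar event loop. The class name
TagTextReader and the "\n" separator between elements are made up for
illustration.

```java
import java.io.IOException;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/**
 * Hypothetical Reader that exposes only the text found inside elements
 * with a given name. It pulls one parser event at a time, so it never
 * holds more than the current text chunk in memory.
 */
final class TagTextReader extends Reader {
    private final XMLStreamReader xml;
    private final String tag;
    private int depth = 0;        // > 0 while inside a wanted element
    private String pending = "";  // text pulled from the parser, not yet returned
    private int pos = 0;          // read position within pending

    TagTextReader(Reader source, String tag) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs; we only care about element text here.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        this.xml = factory.createXMLStreamReader(source);
        this.tag = tag;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        try {
            // Pull parser events until we have text to hand out or hit the end.
            while (pos >= pending.length()) {
                if (!xml.hasNext()) {
                    return -1;
                }
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && tag.equals(xml.getLocalName())) {
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && tag.equals(xml.getLocalName())) {
                    depth--;
                    if (depth == 0) {
                        // Separate adjacent elements so their words do not merge.
                        pending = "\n";
                        pos = 0;
                    }
                } else if (depth > 0
                        && (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA)) {
                    pending = xml.getText();
                    pos = 0;
                }
            }
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        try {
            xml.close();
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }
}
```

Because both sides are pull APIs, read() just asks the parser for the
next event whenever its small buffer runs dry. A Reader like this can
then be handed to whatever consumes a Reader on the indexing side.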

A specific counterexample, which fails to satisfy both of these rules,
is HTMLParser (htmlparser.sf.net); be cautious of any library that
doesn't satisfy both.

TX

