I also use htmlparser, which is rather good. I've had to customize it, though, to parse strings containing HTML source rather than accepting URLs of resources to fetch. It also crashes on meta tags that don't have name attributes (something I discovered only a couple of days ago).
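
To illustrate the two points above (parsing from a string instead of a URL, and tolerating meta tags that have no name attribute), here is a minimal sketch. It does not use htmlparser's own API; it swaps in Python's standard-library html.parser purely for illustration, and the names TextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and named <meta> values from an HTML string.

    Hypothetical helper for illustration; not part of htmlparser.
    """

    SKIPPED = {"script", "style"}  # element content that is not visible text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []       # visible text fragments, in document order
        self.meta = {}         # lower-cased meta name -> content
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "meta":
            attributes = dict(attrs)
            # .get() returns None for tags such as <meta http-equiv=...>,
            # so a missing name attribute is skipped instead of raising.
            name = attributes.get("name")
            if name is not None:
                self.meta[name.lower()] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.chunks.append(text)


def extract_text(html_source):
    """Parse HTML held in a string; no URL is fetched."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    return " ".join(parser.chunks), parser.meta
```

Calling extract_text(page_source) returns the extracted text together with a dictionary of the named meta values; meta tags without a name attribute are simply ignored.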

Simon

Daniel Noll wrote:
John Wang wrote:
Can someone please suggest a HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.

We use this library to do our HTML parsing work:

http://htmlparser.sourceforge.net/

It's fairly resilient to bad code, including things like false assumptions about nesting HTML inside script (e.g. document.write("</script>");).

Daniel



--
Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
Email: [EMAIL PROTECTED]   Web: http://users.cscs.wmin.ac.uk/~courtes | http://www.sse.wmin.ac.uk


