Re: HTML text extraction

Simon Courtenage Wed, 21 Jun 2006 00:14:44 -0700

I also use htmlparser, which is rather good. I've had to customize it, though, to parse strings containing HTML source rather than accept URLs of resources to fetch, etc. Also, it crashes on meta tags that don't have name attributes (something I discovered only a couple of days ago).
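
To illustrate the two points above (parsing from an in-memory string, and not falling over on a meta tag that lacks a name attribute), here is a minimal sketch. Note that it does not use htmlparser or show my customization of it; it swaps in the JDK's built-in Swing HTML parser (ParserDelegator) so that it runs without any extra jars. The class name HtmlStringExtractor and its Result holder are hypothetical names made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: extracts text and named meta tags from an HTML string.
// Uses the JDK's Swing HTML parser, NOT the htmlparser library discussed above.
public class HtmlStringExtractor {

    // Simple holder for what was pulled out of the document.
    public static final class Result {
        public final String text;
        public final Map<String, String> meta;

        Result(String text, Map<String, String> meta) {
            this.text = text;
            this.meta = meta;
        }
    }

    public static Result extract(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        final Map<String, String> meta = new LinkedHashMap<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate text chunks with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META) {
                    return;
                }
                // Either attribute may be absent (e.g. <meta http-equiv=...>),
                // so check for null instead of assuming "name" is present.
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null) {
                    meta.put(name.toString(), content.toString());
                }
            }
        };

        // The input is a string already in memory; no URL is fetched.
        // The final "true" tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return new Result(text.toString(), meta);
    }
}
```

Calling HtmlStringExtractor.extract(someHtmlString) returns the collected text together with a map of the meta tags that do have a name; meta tags without one are simply skipped rather than causing a failure.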


Simon

Daniel Noll wrote:
> John Wang wrote:
>> Can someone please suggest a HTML text extraction library? In the Lucene
>> book, it recommends Tidy. Seems jtidy is not really being maintained.
>
> We use this library to do our HTML parsing work:
>
>   http://htmlparser.sourceforge.net/
>
> It's fairly resilient to bad code, including things like false
> assumptions about nesting HTML inside script
> (e.g. document.write("</script>");).
>
> Daniel



--
Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
Email: [EMAIL PROTECTED]   Web: http://users.cscs.wmin.ac.uk/~courtes | 
http://www.sse.wmin.ac.uk

