Re: HTML text extraction

Daniel Noll Tue, 20 Jun 2006 22:46:16 -0700

John Wang wrote:

Can someone please suggest a HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.


We use this library to do our HTML parsing work:

http://htmlparser.sourceforge.net/

It's fairly resilient to bad code, including things like falseassumptions about nesting HTML inside script. (e.g.document.write("</script>");


Daniel

--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: HTML text extraction

Reply via email to