I found that CyberNeko left style and script content in the extracted text, and that JTidy produced better output. Both of them use DOM, though, and were therefore subject to OutOfMemoryError on large documents (JTidy being worse than CyberNeko). I've since moved over to TagSoup, which I needed to customise to strip style and script (a simple tweak), but which "kept on trucking" with any size of document.
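
For anyone wanting to try the streaming approach, here is a rough sketch of the idea, not the actual tweak I made inside TagSoup. It assumes you filter at the SAX handler level instead: a ContentHandler that ignores character data while inside style or script elements. The class name TextExtractor and the extract() helper are made up for illustration. The demo feeds well-formed markup through the JDK's own SAX parser so that it runs without extra jars; for real-world HTML you would pass in a TagSoup reader (org.ccil.cowan.tagsoup.Parser implements XMLReader) instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects text from SAX events, skipping script/style.
class TextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside <script> or <style>

    private static boolean isSkipped(String name) {
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isSkipped(qName)) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isSkipped(qName)) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        } else if (skipDepth == 0) {
            // Crude word separator at every element boundary; this also
            // splits text around inline tags such as <b>.
            text.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        // Collapse runs of whitespace into single spaces.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    // Works with any XMLReader, e.g. new org.ccil.cowan.tagsoup.Parser().
    static String extract(XMLReader reader, String markup) throws Exception {
        TextExtractor handler = new TextExtractor();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        // JDK parser used here only so the sketch runs standalone.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String page = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>";
        System.out.println(extract(reader, page));
    }
}
```

Because the handler only sees one event at a time, no document tree is ever built, which is what avoids the DOM memory problem. The sketch still buffers the extracted text in a StringBuilder; for very large documents you would probably write to a Writer instead.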

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: 21 June 2006 07:37
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction

John,

I also wrote about using NekoHTML, I think. I prefer that to JTidy. That also tells you what Simpy.com uses.

Otis

----- Original Message ----
From: John Wang <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, June 21, 2006 1:39:41 AM
Subject: HTML text extraction

Can someone please suggest a HTML text extraction library? In the Lucene book, it recommends Tidy. Seems jtidy is not really being maintained.

Otis, what do you guys use at Simpy?

Thanks

-john

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]