I found that CyberNeko left style and script content in the extracted text,
and that JTidy produced better output. Both build a DOM, though, so both were
prone to OutOfMemoryError on large documents (JTidy more so than CyberNeko).
I've since moved over to TagSoup, which I had to customise to strip style and
script content (a simple tweak), but which has "kept on trucking" with
documents of any size.
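
For anyone wanting the same effect without patching the parser, here is a
minimal sketch of a SAX handler that drops everything inside style and script
elements. It is not the TagSoup customisation mentioned above, just one way to
get a similar result at the handler level. The class and method names
(TextExtractor, extractText, extractXhtml) are made up for illustration. The
demo uses the JDK's built-in SAX parser, so it only accepts well-formed XHTML;
for real-world HTML the assumption is that you would pass in TagSoup's
org.ccil.cowan.tagsoup.Parser, which implements XMLReader, instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects character data, skipping <script> and <style>.
public class TextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a script or style element

    private static boolean isSkipped(String localName, String qName) {
        // Non-namespace-aware parsers leave localName empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isSkipped(localName, qName)) skipDepth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isSkipped(localName, qName) && skipDepth > 0) skipDepth--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) text.append(ch, start, length);
    }

    // Collapse runs of whitespace so the result is easier to index.
    public String getText() {
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    // Works with any SAX XMLReader; the document is streamed, no DOM is built.
    public static String extractText(XMLReader reader, InputSource in) throws Exception {
        TextExtractor handler = new TextExtractor();
        reader.setContentHandler(handler);
        reader.parse(in);
        return handler.getText();
    }

    // Demo convenience using the JDK parser (well-formed input only).
    // Checked exceptions are wrapped to keep the call site short.
    public static String extractXhtml(String xhtml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return extractText(reader, new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
                + "<body><p>Hello</p> <script>var x = 1;</script> <p>world</p></body></html>";
        System.out.println(extractXhtml(html)); // prints "Hello world"
    }
}
```

Because the handler only ever holds the extracted text, memory use is driven
by the amount of text rather than by the size of a document tree.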

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 21 June 2006 07:37
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction

John,

I also wrote about using NekoHTML, I think.  I prefer that to JTidy.  That
also tells you what Simpy.com uses.

Otis

----- Original Message ----
From: John Wang <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Wednesday, June 21, 2006 1:39:41 AM
Subject: HTML text extraction

Can someone please suggest an HTML text extraction library? The Lucene
book recommends Tidy, but it seems JTidy is not really being maintained.

Otis, what do you guys use at Simpy?

Thanks

-john



