That's what NekoHTML is for. It's also a good fit for Xerces, since it's built on the Xerces Native Interface (XNI) and is essentially an extension of the parser. It's actively developed as well.

http://nekohtml.sourceforge.net/
http://sourceforge.net/projects/nekohtml/

Jake

On Wed, 22 Jul 2009 07:52:17 -0700 (PDT)
 Derek Alexander <d.alexan...@lse.ac.uk> wrote:

Thanks for the reply. I had looked at the JTidy project. Unfortunately their
current stable release removes empty tags, which is no good for me, and the
latest source (which includes a config option for keeping empty tags, if I
understand correctly) reports too many errors when I try to build it. It
seems I'll have to write something to pre-parse the docs.
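
A pre-parse step of that kind could be sketched roughly as below. This is only a minimal sketch, assuming the only damage in the documents is unescaped ampersands. The class and method names (AmpersandRepair, escapeBareAmpersands, parseRepaired) are made up for illustration, and it uses the JAXP parser bundled with the JDK rather than a standalone Xerces build.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AmpersandRepair {
    // Matches an '&' that does NOT start a numeric character reference
    // (&#38; or &#x26;) or a name followed by ';' (&amp;, &lt;, ...).
    private static final Pattern BARE_AMP = Pattern.compile(
        "&(?!#[0-9]+;|#[xX][0-9a-fA-F]+;|[A-Za-z_:][A-Za-z0-9_.:-]*;)");

    // Hypothetical helper: escape every bare ampersand in the raw markup.
    static String escapeBareAmpersands(String markup) {
        return BARE_AMP.matcher(markup).replaceAll("&amp;");
    }

    // Hypothetical helper: repair the text first, then hand it to the parser.
    static Document parseRepaired(String markup) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        String repaired = escapeBareAmpersands(markup);
        return builder.parse(new InputSource(new StringReader(repaired)));
    }

    public static void main(String[] args) throws Exception {
        String broken = "<a href=\"http://some.server/path?blah&foo=baa\">link</a>";
        Document doc = parseRepaired(broken);
        // Prints the attribute value as the parser sees it after the repair:
        // http://some.server/path?blah&foo=baa
        System.out.println(doc.getDocumentElement().getAttribute("href"));
    }
}
```

Two limits of this sketch: anything that already looks like a reference (for example &nbsp; with no DTD declaring it) is left alone and will still fail to parse, and ampersands inside CDATA sections or comments are rewritten as well, which may not be wanted.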

Regards,
Derek



keshlam wrote:

The closest thing I can think of is the W3C's "tidy" tool, which repairs some of the common/obvious errors.

______________________________________
"... Three things see no end: A loop with exit code done wrong,
A semaphore untested, And the change that comes along. ..."
  -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
  (http://www.ovff.org/pegasus/songs/threes-rev-11.html)



Derek Alexander <d.alexan...@lse.ac.uk> wrote on 07/22/2009 09:55 AM:
Please respond to: j-users@xerces.apache.org
To: j-users@xerces.apache.org
Subject: repairing document while parsing?

Hi,

Is there any way with Xerces (or any other XML parser you know of) to plug
in some kind of error handler that can attempt to repair the document being
parsed, rather than just log errors?

The specific case I have is XHTML documents that may have attribute values that
aren't escaped properly, e.g., href="http://some.server/path?blah&foo=baa"

What I want to do is catch the error that &foo is not a known entity and
replace it with &amp;foo as it ought to be, and have the parser carry on
with that.
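
For reference, the standard SAX ErrorHandler callbacks receive the exception but offer no way to hand replacement text back to the parser. The sketch below, using the JAXP parser bundled with the JDK (an assumption; a standalone Xerces build may be configured differently), shows the handler being notified and the parse then being abandoned. The class name ReportOnlyHandler and the helper parses are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class ReportOnlyHandler implements ErrorHandler {
    // Messages of the fatal errors the parser reported to this handler.
    final List<String> fatalMessages = new ArrayList<>();

    @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
    @Override public void error(SAXParseException e) { /* ignored in this sketch */ }

    @Override public void fatalError(SAXParseException e) {
        // The handler can record or log the problem, but the callback has
        // no return value through which a corrected fragment could be supplied.
        fatalMessages.add(e.getMessage());
    }

    // Hypothetical helper: true if the markup parsed, false if the parser gave up.
    static boolean parses(String markup, ErrorHandler handler) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(handler);
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ReportOnlyHandler handler = new ReportOnlyHandler();
        boolean ok = parses(
            "<a href=\"http://some.server/path?blah&foo=baa\"/>", handler);
        System.out.println("parsed: " + ok);
        System.out.println("fatal errors reported: " + handler.fatalMessages.size());
    }
}
```

So with the default configuration the handler is a place to observe the bare ampersand, not to fix it; any repair has to happen before or outside the parse.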

Cheers,
Derek


--
View this message in context: http://www.nabble.com/repairing-document-while-parsing--tp24607002p24607002.html

Sent from the Xerces - J - Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
For additional commands, e-mail: j-users-h...@xerces.apache.org





