You can use Xerces to parse an HTML file only if the HTML file is also well-formed XML, i.e., XHTML. In many cases, you can use HTML Tidy<https://www.html-tidy.org/> to convert an HTML file to well-formed XML that can then be parsed by Xerces.
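
A rough sketch of that approach in Java, in case it helps. The class and method names here are hypothetical, and it assumes that a `tidy` executable is on the PATH and that Xerces-J (or the JDK's built-in Xerces-derived parser) is the JAXP implementation that gets picked up. Turning off external DTD loading is my own choice, so that the parser does not try to fetch the XHTML DTD over the network; that is also why Tidy is asked for numeric entities.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class XhtmlParseSketch {

    // Parses markup with whichever JAXP DOM parser is configured
    // (Xerces-J if it is on the classpath, otherwise the JDK default).
    static Document parseXml(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Xerces feature: do not fetch the DTD named in the DOCTYPE.
        factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // DefaultHandler rethrows fatal errors instead of printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // True if the markup is well-formed XML, false if the parser rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            parseXml(markup);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Runs the external tidy tool (assumed to be installed) and returns XHTML.
    // -numeric makes Tidy emit numeric entities, so no DTD is needed to resolve them.
    static String tidyToXhtml(File htmlFile) throws IOException, InterruptedException {
        ProcessBuilder pb =
                new ProcessBuilder("tidy", "-q", "-asxhtml", "-numeric", "-utf8");
        pb.redirectInput(htmlFile);
        pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();
        try (InputStream out = process.getInputStream()) {
            String xhtml = new String(out.readAllBytes(), StandardCharsets.UTF_8);
            int exit = process.waitFor();
            if (exit > 1) { // tidy exits with 1 for warnings only, 2 for errors
                throw new IOException("tidy failed with exit code " + exit);
            }
            return xhtml;
        }
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE html><html><body><p>Hello<br></body></html>";
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>Hello<br/></p></body></html>";
        System.out.println("plain HTML parses as XML: " + isWellFormedXml(html));
        System.out.println("XHTML parses as XML: " + isWellFormedXml(xhtml));
    }
}
```

With this, something like `XhtmlParseSketch.parseXml(XhtmlParseSketch.tidyToXhtml(new File("Simple.html")))` should give you a DOM `Document`, provided Tidy manages to repair the input.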

From: Joseph Kesselman <kesh...@alum.mit.edu.INVALID>
Date: Tuesday, May 27, 2025 at 2:34 PM
To: j-users@xerces.apache.org <j-users@xerces.apache.org>
Subject: Re: Parsing HTML

Supporting an HTML DOM, and being able to serialize to HTML, does not necessarily imply being able to parse HTML. As far as I know, that last is not supported by Xerces.

I was able to (ab)use the W3C's _tidy_ tool to do some basic HTML parsing. Inelegant but it sufficed for what I needed.

--
   /_  Joe Kesselman (he/him/his)
-/ _)  My Alexa skill for New Music/New Sounds fans:
  /    https://www.amazon.com/dp/B09WJ3H657/

Caveat: Opinionated old geezer with overcompensated writer's block. May be redundant, verbose, prolix, sesquipedalian, didactic, officious, or redundant.
________________________________
From: Olivier Cailloux <olivier.caill...@dauphine.psl.eu>
Sent: Tuesday, May 27, 2025 1:08:55 PM
To: j-users@xerces.apache.org <j-users@xerces.apache.org>
Subject: Parsing HTML

Dear list,

Apache Xerces-J says that it implements DOM Level 1 HTML. I asked recently about the bootstrapping support, which did not yield answers, so let me broaden the question.

Can anyone point me towards some way of reading HTML (non XML<https://www.w3.org/TR/DOM-Level-1/introduction.html#ID-E7C3082>) files using Xerces-J? I tried various things using org.apache.xerces.parsers.DOMParserImpl but parsing this file<https://github.com/oliviercailloux/JARiS/blob/main/src/test/resources/io/github/oliviercailloux/jaris/xml/Html/Simple.html> for example (valid according to Nu validator<https://validator.nu/>) fails.

I am ready to use a way that does not follow the W3C bootstrapping one, if required.

Thanks a lot!