Re: Parsing HTML

Joseph Kesselman Tue, 27 May 2025 11:27:29 -0700

Supporting an HTML DOM, and being able to serialize to HTML, does not 
necessarily imply being able to parse HTML.  As far as I know, parsing HTML 
is not supported by Xerces.


I was able to (ab)use the W3C's _tidy_ tool to do some basic HTML parsing. 
Inelegant, but it sufficed for what I needed.
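
For what it's worth, here is a minimal sketch of the distinction: an XML parser 
accepts markup only if it is well-formed XML, so HTML with unclosed tags is 
rejected while the XHTML equivalent goes through. It uses the JDK's standard 
JAXP API rather than any Xerces-specific class; the class name HtmlVsXml, the 
method parsesAsXml, and the sample strings are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class HtmlVsXml {

    /** Returns true if a standard JAXP (XML) parser accepts the markup. */
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler
            // print them to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical samples: the same page as XHTML and as plain HTML.
        String xhtml = "<html><head><title>t</title></head>"
            + "<body><p>Hi<br/></p></body></html>";
        String html = "<!DOCTYPE html><html><head><title>t</title></head>"
            + "<body><p>Hi<br></body></html>";

        System.out.println(parsesAsXml(xhtml)); // true
        System.out.println(parsesAsXml(html));  // false: <br> and <p> are never closed
    }
}
```

Running the HTML through tidy first (I believe its -asxhtml option is the one 
that emits well-formed XHTML) should produce input of the first kind, which 
can then be handed to an ordinary XML parser.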

--
   /_  Joe Kesselman (he/him/his)
-/ _) My Alexa skill for New Music/New Sounds fans:
   /   https://www.amazon.com/dp/B09WJ3H657/

Caveat: Opinionated old geezer with overcompensated writer's block. May be 
redundant, verbose, prolix, sesquipedalian, didactic, officious, or redundant.
________________________________
From: Olivier Cailloux <olivier.caill...@dauphine.psl.eu>
Sent: Tuesday, May 27, 2025 1:08:55 PM
To: j-users@xerces.apache.org <j-users@xerces.apache.org>
Subject: Parsing HTML

Dear list,

Apache Xerces-J says that it implements DOM Level 1 HTML.  I asked recently 
about the bootstrapping support, which did not yield answers, so let me broaden 
the question.

Can anyone point me towards some way of reading HTML (non 
XML<https://www.w3.org/TR/DOM-Level-1/introduction.html#ID-E7C3082>) files 
using Xerces-J? I tried various things using 
org.apache.xerces.parsers.DOMParserImpl but parsing this 
file<https://github.com/oliviercailloux/JARiS/blob/main/src/test/resources/io/github/oliviercailloux/jaris/xml/Html/Simple.html>
 for example (valid according to Nu validator<https://validator.nu/>) fails.

I am ready to use a way that does not follow the W3C bootstrapping one, if 
required.

Thanks a lot!
