Re: Parsing HTML

Paul Kinnucan Tue, 27 May 2025 11:59:57 -0700

You can use Xerces to parse an HTML file only if the HTML file is also valid 
XML, i.e., XHTML. In many cases, you can use HTML 
Tidy<https://www.html-tidy.org/> to convert an HTML file to valid XML that can 
then be parsed by Xerces.
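
As a rough illustration of the point above, here is a minimal sketch. It 
assumes the standard JAXP entry point (DocumentBuilderFactory), which picks up 
Xerces when it is on the classpath and otherwise falls back to the JDK's 
built-in parser. The class name and both sample strings are made up for this 
example. The first string is well-formed XHTML and should parse; the second 
leaves <br> and <p> unclosed, which HTML allows but an XML parser rejects.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XhtmlParseSketch {

    // Hypothetical sample inputs, not taken from the thread.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<head><title>t</title></head>"
        + "<body><p>Hello<br/>world</p></body></html>";

    static final String HTML_ONLY =
        "<html><head><title>t</title></head>"
        + "<body><p>Hello<br>world</body></html>";

    /** Parses markup with the JAXP XML parser; throws SAXException if it is not well-formed XML. */
    static Document parse(String markup)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    /** Returns true if the XML parser accepts the markup, false if it reports a parse error. */
    static boolean isWellFormed(String markup) {
        try {
            parse(markup);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("XHTML parses: " + isWellFormed(XHTML));
        System.out.println("HTML-only parses: " + isWellFormed(HTML_ONLY));
    }
}
```

Real-world XHTML often carries a DOCTYPE and named entities such as &nbsp;, 
which may cause the parser to fetch DTDs; the sketch sidesteps that by using 
neither.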

From: Joseph Kesselman <kesh...@alum.mit.edu.INVALID>
Date: Tuesday, May 27, 2025 at 2:34 PM
To: j-users@xerces.apache.org <j-users@xerces.apache.org>
Subject: Re: Parsing HTML
Supporting an HTML DOM, and being able to serialize to HTML, does not necessarily 
imply being able to parse HTML.  As far as I know, that last is not supported 
by Xerces.

I was able to (ab)use the W3C's _tidy_ tool to do some basic HTML parsing. 
Inelegant but it sufficed for what I needed.

--
   /_  Joe Kesselman (he/him/his)
-/ _) My Alexa skill for New Music/New Sounds fans:
   /   
https://www.amazon.com/dp/B09WJ3H657/

Caveat: Opinionated old geezer with overcompensated writer's block. May be 
redundant, verbose, prolix, sesquipedalian, didactic, officious, or redundant.
________________________________
From: Olivier Cailloux <olivier.caill...@dauphine.psl.eu>
Sent: Tuesday, May 27, 2025 1:08:55 PM
To: j-users@xerces.apache.org <j-users@xerces.apache.org>
Subject: Parsing HTML

Dear list,

Apache Xerces-J says that it implements DOM Level 1 HTML.  I asked recently 
about the bootstrapping support, which did not yield answers, so let me broaden 
the question.

Can anyone point me towards some way of reading HTML (non 
XML<https://www.w3.org/TR/DOM-Level-1/introduction.html#ID-E7C3082>) files 
using Xerces-J? I tried various things using 
org.apache.xerces.parsers.DOMParserImpl but parsing this 
file<https://github.com/oliviercailloux/JARiS/blob/main/src/test/resources/io/github/oliviercailloux/jaris/xml/Html/Simple.html>
 for example (valid according to Nu validator<https://validator.nu/>) fails.

I am ready to use a way that does not follow the W3C bootstrapping one, if 
required.

Thanks a lot!
