Tue, 27 May 2025 17:08:55 +0000, /Olivier Cailloux/:
> Can anyone point me towards some way of reading HTML (non XML) files
> using Xerces-J? I tried various things using
> org.apache.xerces.parsers.DOMParserImpl but parsing this file for
> example (valid according to Nu validator) fails.
Haven't used it myself – have you tried NekoHTML?
* https://nekohtml.sourceforge.net/
* https://central.sonatype.com/artifact/net.sourceforge.nekohtml/nekohtml
> I am ready to use a way that does not follow the W3C bootstrapping
> one, if required.
Using the TagSoup parser:
* https://web.archive.org/web/20160815081758/http://home.ccil.org/~cowan/XML/tagsoup/
* https://central.sonatype.com/artifact/org.ccil.cowan.tagsoup/tagsoup
The following (non-Xerces-specific) appears to work for me:

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;

import org.ccil.cowan.tagsoup.Parser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// The class name is arbitrary; it is only here so the snippet compiles.
public class TagSoupToDom {

    public static void main(String[] args) throws Exception {
        // An identity transform copies the SAX events emitted by the
        // TagSoup parser into a DOM tree.
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer identity = tf.newTransformer();

        SAXSource source = new SAXSource(new Parser(), new InputSource(
                "https://raw.githubusercontent.com/oliviercailloux/JARiS/refs/heads/main/src/test/resources/io/github/oliviercailloux/jaris/xml/Html/Simple.html"));
        DOMResult result = new DOMResult();
        identity.transform(source, result);

        // Inspect the resulting DOM: root element name, its namespace,
        // and the number of <meta> elements.
        Document doc = (Document) result.getNode();
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
        System.out.println(doc.getElementsByTagName("meta").getLength());
    }
}

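
In case it helps to see why this is not tied to Xerces or to TagSoup: the
identity transform only needs some org.xml.sax.XMLReader as the event
source. Below is a rough sketch of the same pipeline using only the JDK,
with the default (strict XML) reader and a small in-memory XHTML string.
The class name SaxToDomSketch, the helper names toDom and parseXhtml, and
the sample markup are made up for illustration; real-world HTML would
still need a lenient reader such as the TagSoup Parser above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical class name, for illustration only.
public class SaxToDomSketch {

    // Pipes the SAX events produced by the given reader into a DOM tree
    // using an identity transform.
    static Document toDom(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, input), result);
        return (Document) result.getNode();
    }

    // Convenience wrapper using the JDK's default namespace-aware reader.
    // This reader is a strict XML parser, so the input must be well-formed.
    static Document parseXhtml(String markup) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        return toDom(reader, new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input, not the file from the original question.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><meta charset=\"utf-8\"/><title>t</title></head>"
                + "<body><p>Hello</p></body></html>";

        Document doc = parseXhtml(xhtml);
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getElementsByTagName("meta").getLength());
    }
}
```

Passing new org.ccil.cowan.tagsoup.Parser() as the reader argument of
toDom should give the TagSoup variant shown above, provided the TagSoup
jar is on the classpath.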
--
Stanimir