Re: Parsing HTML

Stanimir Stamenkov Thu, 29 May 2025 09:03:04 -0700

Tue, 27 May 2025 17:08:55 +0000, /Olivier Cailloux/:

Can anyone point me towards some way of reading HTML (non-XML) files using Xerces-J? I tried various things using org.apache.xerces.parsers.DOMParserImpl, but parsing this file, for example (valid according to the Nu validator), fails.
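
For context, one common reason a strict XML parser rejects valid HTML is void elements such as an unclosed `<meta>`. Whether that is what trips up the file in question is an assumption; the markup strings, class name, and helper below are made up for illustration. A minimal JDK-only sketch:

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StrictXmlOnHtml {

    /** Returns true if the markup is accepted by the JDK's default (strict) XML parser. */
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and keeps the parser from
        // printing its own message to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Not well-formed XML (e.g. an element that is never closed).
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: valid HTML, but <meta> has no end tag.
        String html = "<!DOCTYPE html><html><head><meta charset=\"utf-8\">"
                + "<title>t</title></head><body><p>Hi</p></body></html>";
        // Same content written as well-formed XML.
        String xhtml = "<html><head><meta charset=\"utf-8\"/>"
                + "<title>t</title></head><body><p>Hi</p></body></html>";

        System.out.println("HTML accepted:  " + parsesAsXml(html));
        System.out.println("XHTML accepted: " + parsesAsXml(xhtml));
    }
}
```

An HTML-tolerant SAX or DOM parser sidesteps this by repairing such markup before it reaches the XML tooling.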


Haven't used it myself – have you tried NekoHTML?

* https://nekohtml.sourceforge.net/
* https://central.sonatype.com/artifact/net.sourceforge.nekohtml/nekohtml

I am ready to use a way that does not follow the W3C bootstrapping one, if required.


Using the TagSoup parser:

* https://web.archive.org/web/20160815081758/http://home.ccil.org/~cowan/XML/tagsoup/
* https://central.sonatype.com/artifact/org.ccil.cowan.tagsoup/tagsoup

The following (non-Xerces-specific) appears to work for me:

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMResult;
    import javax.xml.transform.sax.SAXSource;

    import org.xml.sax.InputSource;
    import org.w3c.dom.Document;
    import org.ccil.cowan.tagsoup.Parser;

    public class TagSoupToDom {

        public static void main(String[] args) throws Exception {
            // Identity transformer: copies the SAX events to the result unchanged.
            TransformerFactory tf = TransformerFactory.newInstance();
            Transformer identity = tf.newTransformer();

            // TagSoup's Parser is a SAX XMLReader that accepts non-XML HTML.
            SAXSource source = new SAXSource(new Parser(), new InputSource(
                    "https://raw.githubusercontent.com/oliviercailloux/JARiS/refs/heads/main/src/test/resources/io/github/oliviercailloux/jaris/xml/Html/Simple.html"));

            // With no target node supplied, DOMResult builds a new Document.
            DOMResult result = new DOMResult();
            identity.transform(source, result);

            Document doc = (Document) result.getNode();
            System.out.println(doc.getDocumentElement().getTagName());
            System.out.println(doc.getDocumentElement().getNamespaceURI());
            System.out.println(doc
                    .getElementsByTagName("meta").getLength());
        }
    }
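
Since the pattern above is not tied to a particular parser, the plumbing can be checked without TagSoup on the classpath. The following is a rough sketch that substitutes the JDK's own SAX parser and a well-formed XHTML string; the class name, the `toDom` helper, and the sample markup are illustrative assumptions, not part of any library:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class SaxToDom {

    /** Pipes the SAX events produced by the given reader into a new DOM Document. */
    static Document toDom(XMLReader reader, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(reader, input), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the TagSoup Parser: the JDK's namespace-aware SAX reader.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();

        // Hypothetical input; a strict XML reader needs well-formed markup.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><meta charset=\"utf-8\"/><title>t</title></head>"
                + "<body><p>Hello</p></body></html>";

        Document doc = toDom(reader, new InputSource(new StringReader(xhtml)));
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
        System.out.println(doc.getElementsByTagName("meta").getLength());
    }
}
```

Passing a TagSoup `Parser` instance as the `reader` argument should give the HTML-tolerant behaviour, since it is also a SAX `XMLReader`.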

--
Stanimir

