DOMNormalizer question

Jacob Kjome Fri, 15 Dec 2006 10:55:19 -0800

Based on something Michael Glavassevich said about validating an HTML document in memory using normalizeDocument() [1] (to get "id" attributes registered as type "ID", for optimized getElementById() lookup), I tried an experiment. I parsed an HTML document using the Xerces DOMParser, providing it with the NekoHTML HTMLConfiguration. First I tried validating against the HTML 4.01 DTD, but since it's totally malformed (XHTML Basic 1.0 DTD is invalid and now this? who writes these flippin things????), I took the XHTML 1.0 Strict DTD and changed all the elements to be declared in uppercase (and removed "xmlns" and "xml:space" stuff) and obtained the local URL via a Catalog-based entity resolver. I set the following parameters...


    config.setParameter("validate", Boolean.TRUE);
    config.setParameter("schema-type", javax.xml.XMLConstants.XML_DTD_NS_URI);
    config.setParameter("schema-location", url.toExternalForm());
    config.setParameter("namespaces", Boolean.FALSE);
    config.setParameter("well-formed", Boolean.FALSE);
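
For context, here is a minimal sketch of how parameters like these can be applied through the standard DOM Level 3 API (Document.getDomConfig() and DOMConfiguration) before calling normalizeDocument(). The class name, the helper methods, and the DTD location are made up for illustration. It also uses an empty document from the JDK's default builder, whereas the setup described here used the Xerces DOMParser with NekoHTML. Whether each parameter is accepted depends on the DOM implementation, so the sketch checks canSetParameter() first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of Xerces or NekoHTML.
class DtdNormalizeSketch {

    // Applies the five parameters to the document's DOMConfiguration.
    // Returns the names of the parameters the implementation accepted.
    static List<String> configure(Document doc, String dtdUrl) {
        Map<String, Object> wanted = new LinkedHashMap<>();
        wanted.put("validate", Boolean.TRUE);
        wanted.put("schema-type", XMLConstants.XML_DTD_NS_URI);
        wanted.put("schema-location", dtdUrl);
        wanted.put("namespaces", Boolean.FALSE);
        wanted.put("well-formed", Boolean.FALSE);

        DOMConfiguration config = doc.getDomConfig();
        List<String> applied = new ArrayList<>();
        for (Map.Entry<String, Object> e : wanted.entrySet()) {
            // Support for a given parameter/value pair is implementation-dependent.
            if (config.canSetParameter(e.getKey(), e.getValue())) {
                config.setParameter(e.getKey(), e.getValue());
                applied.add(e.getKey());
            }
        }
        return applied;
    }

    // Creates an empty document with the JDK's default builder (an assumption
    // for this sketch; the original document came from a NekoHTML parse).
    static Document newEmptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newEmptyDocument();
        // Placeholder DTD location; the real one came from a catalog resolver.
        System.out.println(configure(doc, "file:///path/to/local.dtd"));
        // With a real document and DTD, doc.normalizeDocument() would follow here.
    }
}
```

The call to normalizeDocument() is left as a comment because it only makes sense once a real document and a reachable DTD are in place.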

It all loads up just fine, but fails because of a NullPointerException in HTMLElementImpl when calling getAttributeNodeNS() inside DOMNormalizer.startElement() (see line 1790)...


for (int i = 0; i < attrCount; i++) {
    attributes.getName(i, fAttrQName);
    Attr attr = null;

    attr = currentElement.getAttributeNodeNS(fAttrQName.uri, fAttrQName.localpart);

        ....
        ....
}

This is because HTMLElementImpl, on line 158, calls toLowerCase() on the localName...


return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );

The reason why the localName is null in this case is that the "for" loop above loops over *all* possible attributes of the element without checking attributes.isSpecified() before calling getAttributeNodeNS(). If the attribute is not specified, of course it is going to be null, so why bother calling it?

I worked around this by modifying HTMLElementImpl.getAttributeNodeNS() to return null if the provided 'localName' is null, avoiding the inevitable NullPointerException upon the toLowerCase() call. The in-memory validation works after this change! Yippie!
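
The shape of that workaround, sketched on a simplified stand-in rather than the real HTMLElementImpl (the class, its map-backed storage, and the String return type are all hypothetical; the real method returns an Attr):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an element that stores attribute names in lowercase.
class LowercasingElementSketch {
    private final Map<String, String> attributes = new HashMap<>();

    void setAttribute(String name, String value) {
        attributes.put(name.toLowerCase(Locale.ENGLISH), value);
    }

    // Returns null for a null localName instead of letting
    // toLowerCase() throw a NullPointerException.
    String getAttributeNS(String namespaceURI, String localName) {
        if (localName == null) {
            return null;
        }
        return attributes.get(localName.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        LowercasingElementSketch element = new LowercasingElementSketch();
        element.setAttribute("ID", "main");
        System.out.println(element.getAttributeNS(null, "Id"));  // found via lowercase key
        System.out.println(element.getAttributeNS(null, null));  // no exception
    }
}
```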

So, the question is, where is this properly fixed? I suppose it would be smart for HTMLElementImpl to be checking for null before attempting to manipulate the string to put it in all lowercase, so maybe that should be patched regardless. However, shouldn't the first line in the "for" loop of DOMNormalizer.startElement() be....


if (!attributes.isSpecified(i)) continue;

If the attribute isn't specified, why attempt to get the attribute node? It's already known that it's going to be null, isn't it? Wouldn't this even be a minor optimization? Is there a good reason not to do this?
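
To make the proposal concrete, here is a small sketch of a loop that skips unspecified entries before doing any lookup. AttrEntry is a hypothetical, simplified record, not the Xerces XMLAttributes type, and the sketch takes it as given (as assumed above) that the unspecified entries are the ones carrying a null local name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpecifiedFilterSketch {

    // Hypothetical attribute record: a local name plus a "specified" flag.
    static final class AttrEntry {
        final String localName;
        final boolean specified;

        AttrEntry(String localName, boolean specified) {
            this.localName = localName;
            this.specified = specified;
        }
    }

    // Collects the names that would be looked up on the element,
    // skipping unspecified entries before touching the name at all.
    static List<String> namesToLookUp(List<AttrEntry> attributes) {
        List<String> names = new ArrayList<>();
        for (AttrEntry attribute : attributes) {
            if (!attribute.specified) {
                continue;
            }
            names.add(attribute.localName);
        }
        return names;
    }

    public static void main(String[] args) {
        List<AttrEntry> attributes = Arrays.asList(
                new AttrEntry("id", true),
                new AttrEntry(null, false),
                new AttrEntry("class", true));
        System.out.println(namesToLookUp(attributes));
    }
}
```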



Jake

[1] http://issues.apache.org/jira/browse/XERCESJ-1200

