Based on something Michael Glavassevich said about validating an HTML document in memory using normalizeDocument() [1] (so that "id" attributes get registered as type "ID", for optimized getElementById() lookup), I tried an experiment. I parsed an HTML document using the Xerces DOMParser, providing it with the NekoHTML HTMLConfiguration. First I tried validating against the HTML 4.01 DTD, but the parser rejects it as totally malformed (the XHTML Basic 1.0 DTD is invalid, and now this? Who writes these flippin' things?). So I took the XHTML 1.0 Strict DTD, changed all the element declarations to upper case, removed the "xmlns" and "xml:space" stuff, and obtained the local URL of the modified DTD via a Catalog-based entity resolver. I set the following parameters...

    config.setParameter("validate", Boolean.TRUE);
    config.setParameter("schema-type", javax.xml.XMLConstants.XML_DTD_NS_URI);
    config.setParameter("schema-location", url.toExternalForm());
    config.setParameter("namespaces", Boolean.FALSE);
    config.setParameter("well-formed", Boolean.FALSE);

It all loads up just fine, but normalization fails with a NullPointerException in HTMLElementImpl when getAttributeNodeNS() is called inside DOMNormalizer.startElement() (see line 1790)...

    for (int i = 0; i < attrCount; i++) {
        attributes.getName(i, fAttrQName);
        Attr attr = null;

        attr = currentElement.getAttributeNodeNS(fAttrQName.uri, fAttrQName.localpart);
        ....
        ....
    }

This is because HTMLElementImpl, on line 158, calls toLowerCase() on the localName...

    return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );


The reason the localName is null in this case is that the "for" loop above iterates over *all* possible attributes of the element without checking attributes.isSpecified(i) before calling getAttributeNodeNS(). If the attribute is not specified, of course it is going to be null, so why bother calling it?

I worked around this by modifying HTMLElementImpl.getAttributeNodeNS() to return null if the provided 'localName' is null, avoiding the inevitable NullPointerException on the toLowerCase() call. The in-memory validation works after this change! Yippee!
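For anyone who wants to see the shape of that workaround, here is a stand-alone sketch. It is not the Xerces source: the class AttrLookupSketch, its map-backed attribute store and the String return value are stand-ins made up for illustration. Only the null guard in front of toLowerCase(Locale.ENGLISH) mirrors the change described above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an HTML element that stores attribute
// names in lower case, as HTMLElementImpl is described as doing above.
public class AttrLookupSketch {
    private final Map<String, String> attrs = new HashMap<>();

    public void setAttribute(String name, String value) {
        attrs.put(name.toLowerCase(Locale.ENGLISH), value);
    }

    // The namespace URI is ignored in this sketch; the point is the guard.
    public String getAttributeNodeNS(String namespaceURI, String localName) {
        if (localName == null) {
            // Without this check, the toLowerCase() call below would
            // throw a NullPointerException.
            return null;
        }
        return attrs.get(localName.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        AttrLookupSketch element = new AttrLookupSketch();
        element.setAttribute("ID", "main");
        System.out.println(element.getAttributeNodeNS(null, "Id"));  // prints "main"
        System.out.println(element.getAttributeNodeNS(null, null));  // prints "null"
    }
}
```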

So, the question is, where should this properly be fixed? I suppose it would be smart for HTMLElementImpl to check for null before attempting to lower-case the string, so maybe that should be patched regardless. However, shouldn't the first line in the "for" loop of DOMNormalizer.startElement() be....

    if (!attributes.isSpecified(i)) continue;

If the attribute isn't specified, why attempt to get the attribute node? It's already known that it's going to be null, isn't it? Wouldn't this even be a minor optimization? Is there a good reason not to do this?
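To make the proposal concrete, here is a small self-contained sketch of the loop with that guard in place. It does not use the real Xerces types: SkipUnspecifiedSketch and its Entry class are hypothetical, and the null local name on the unspecified entry is just my observation from above, encoded as an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipUnspecifiedSketch {
    // Hypothetical stand-in for one slot of the parser's attribute list.
    public static final class Entry {
        final String localName;   // assumed null when the attribute is unspecified
        final boolean specified;

        public Entry(String localName, boolean specified) {
            this.localName = localName;
            this.specified = specified;
        }
    }

    // Collects the names that would actually be passed on to
    // getAttributeNodeNS(); unspecified entries never get that far.
    public static List<String> namesToLookUp(List<Entry> attributes) {
        List<String> names = new ArrayList<>();
        for (Entry entry : attributes) {
            if (!entry.specified) continue;  // the proposed guard
            names.add(entry.localName);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Entry> attributes = Arrays.asList(
                new Entry("id", true),
                new Entry(null, false),
                new Entry("class", true));
        System.out.println(namesToLookUp(attributes));  // prints "[id, class]"
    }
}
```

With the guard, the null local name never reaches the lower-casing code, so the element implementation would not need its own null check for this particular path.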


Jake


[1] http://issues.apache.org/jira/browse/XERCESJ-1200
