Based on something Michael Glavassevich said about validating an HTML
document in memory using normalizeDocument() [1] (to get "id"
attributes registered as type "ID", for optimized getElementById()
lookup), I tried an experiment. I parsed an HTML document using the
Xerces DOMParser, providing it with the NekoHTML
HTMLConfiguration. First I tried validating against the HTML 4.01
DTD, but it's an SGML DTD and totally malformed as an XML DTD (XHTML
Basic 1.0 DTD is invalid and now this? who writes these flippin
things????). So I took the XHTML 1.0 Strict DTD, changed all the
element declarations to upper case, removed the "xmlns" and
"xml:space" stuff, and obtained its local URL via a Catalog-based
entity resolver. I set the following parameters...
    config.setParameter("validate", Boolean.TRUE);
    config.setParameter("schema-type", javax.xml.XMLConstants.XML_DTD_NS_URI);
    config.setParameter("schema-location", url.toExternalForm());
    config.setParameter("namespaces", Boolean.FALSE);
    config.setParameter("well-formed", Boolean.FALSE);
It all loads up just fine, but fails because of a
NullPointerException in HTMLElementImpl when calling
getAttributeNodeNS() inside DOMNormalizer.startElement() (see line 1790)...
    for (int i = 0; i < attrCount; i++) {
        attributes.getName(i, fAttrQName);
        Attr attr = null;
        attr = currentElement.getAttributeNodeNS(fAttrQName.uri,
                                                 fAttrQName.localpart);
        ....
        ....
    }
This is because HTMLElementImpl, on line 158, calls toLowerCase() on
the localName...
    return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );
The reason why the localName is null in this case is that the "for"
loop above loops over *all* possible attributes of the element
without checking attributes.isSpecified(i) before calling
getAttributeNodeNS(). If the attribute is not specified, of course
the lookup is going to come back null, so why bother calling it?
I worked around this by modifying
HTMLElementImpl.getAttributeNodeNS() to return null if the provided
'localName' is null, avoiding the inevitable NullPointerException
upon the toLowerCase() call. The in-memory validation works after
this change! Yippee!
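
For anyone who wants to see the shape of that workaround without
digging through the Xerces source, here is a minimal stand-alone
sketch. It is not the actual HTMLElementImpl code: the class name
NullSafeLookupSketch and the Map that stands in for the element's
attribute storage are made up for illustration. Only the null guard
ahead of toLowerCase() mirrors what I changed.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an HTML element that stores attribute
// names in lower case; NOT the real Xerces HTMLElementImpl.
class NullSafeLookupSketch {
    // Assumption: a plain map replaces the real attribute node storage.
    private final Map<String, String> attrs = new HashMap<>();

    void setAttribute(String name, String value) {
        attrs.put(name.toLowerCase(Locale.ENGLISH), value);
    }

    // Mirrors the workaround: return null for a null localName instead
    // of letting toLowerCase() throw a NullPointerException.
    String getAttributeNodeNS(String namespaceURI, String localName) {
        if (localName == null) {
            return null;
        }
        return attrs.get(localName.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        NullSafeLookupSketch element = new NullSafeLookupSketch();
        element.setAttribute("ID", "main");
        System.out.println(element.getAttributeNodeNS(null, "Id"));
        System.out.println(element.getAttributeNodeNS(null, null));
    }
}
```

The guard only hides the symptom on the element side; whether the
caller should be passing a null name at all is a separate question.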
So, the question is, where should this properly be fixed? I suppose
it would be smart for HTMLElementImpl to check for null before
attempting to lowercase the string, so maybe that should be patched
regardless. However, shouldn't the first line in the "for" loop of
DOMNormalizer.startElement() be....
    if (!attributes.isSpecified(i)) continue;
If the attribute isn't specified, why attempt to get the attribute
node? It's already known that it's going to be null, isn't
it? Wouldn't this even be a minor optimization? Is there a good
reason not to do this?
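
To make the proposal concrete, here is a rough stand-alone sketch of
the loop with the early "continue". It does not use the real
DOMNormalizer or XMLAttributes classes: SkipUnspecifiedSketch,
AttrEntry and lookedUp() are hypothetical names, and the sketch
simply models my assumption that unspecified entries carry a null
local name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the attribute loop; NOT the real DOMNormalizer.
class SkipUnspecifiedSketch {
    // Stand-in for one entry in the validator's attribute list.
    static final class AttrEntry {
        final String localpart;   // assumed null when not specified
        final boolean specified;

        AttrEntry(String localpart, boolean specified) {
            this.localpart = localpart;
            this.specified = specified;
        }
    }

    // Returns the lower-cased names that would actually be looked up.
    static List<String> lookedUp(List<AttrEntry> attributes) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < attributes.size(); i++) {
            AttrEntry entry = attributes.get(i);
            // Proposed guard: skip entries that were not specified,
            // so the null localpart never reaches toLowerCase().
            if (!entry.specified) {
                continue;
            }
            names.add(entry.localpart.toLowerCase(Locale.ENGLISH));
        }
        return names;
    }
}
```

With the guard in place the loop never touches the null name; take
the guard out and the same input throws the NullPointerException.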
Jake
[1] http://issues.apache.org/jira/browse/XERCESJ-1200