No matter what the root cause here is, I think it would still make sense to check for null in the getAttribute*() methods in HTMLElement. Whatever DOM normalization issues continue to exist, this simple change allows normalization to succeed. The rest of the issues can be addressed as they are discovered.
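To illustrate the kind of guard I mean, here is a minimal, self-contained sketch. It is not the actual Xerces source: the class LowerCaseAttrs and its methods are hypothetical and only mimic the "lowercase the name, then look it up" pattern described in the quoted message below.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical stand-in for an element that stores attribute names in
 * lower case. Not a Xerces class; for illustration only.
 */
class LowerCaseAttrs {
    private final Map<String, String> attrs = new HashMap<>();

    void setAttribute(String localName, String value) {
        attrs.put(localName.toLowerCase(Locale.ENGLISH), value);
    }

    /** Unguarded lookup: throws NullPointerException when localName is null. */
    String getAttributeUnguarded(String localName) {
        return attrs.get(localName.toLowerCase(Locale.ENGLISH));
    }

    /** Guarded lookup: a null name cannot match anything, so return null early. */
    String getAttribute(String localName) {
        if (localName == null) return null;
        return attrs.get(localName.toLowerCase(Locale.ENGLISH));
    }
}
```

With the guard in place, a caller that passes a null name simply gets null back instead of an exception, which is all the normalizer needs in order to carry on.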
So, the first line of these methods would be, essentially...

    if (localName == null) return null;

Jake

Quoting Michael Glavassevich <[EMAIL PROTECTED]>:

> Hi Jake,
>
> The code you found in DOMNormalizer is looping over the attributes in the
> document not all of the possible attributes in the DTD. If a defaulted
> attribute is missing from the DOM then there's probably a bug somewhere
> else in the class which wouldn't surprise me. Around this time last year
> [1] in memory DTD validation was completely broken. I spent a couple weeks
> fixing most of the major issues [2][3][4][5][6][7][8] but I didn't get
> through all of them and haven't found the time to clear up the rest.
>
> Thanks.
>
> [1] http://marc.theaimsgroup.com/?l=xerces-j-dev&m=113285279019052&w=2
> [2] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113333032523512&w=2
> [3] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113338115425840&w=2
> [4] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113402200124272&w=2
> [5] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113337500722384&w=2
> [6] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113389841006312&w=2
> [7] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113399680924552&w=2
> [8] http://marc.theaimsgroup.com/?l=xerces-cvs&m=113330014128271&w=2
>
> Michael Glavassevich
> XML Parser Development
> IBM Toronto Lab
> E-mail: [EMAIL PROTECTED]
>
> Jacob Kjome <[EMAIL PROTECTED]> wrote on 12/15/2006 01:55:38 PM:
>
> > Based on something Michael Glavassevich said about validating an HTML
> > document in memory using normalizeDocument() [1] (to get "id"
> > attributes registered as type "ID", for optimized getElementById()
> > lookup), I tried an experiment. I parsed an HTML document using the
> > Xerces DOMParser, providing it with the NekoHTML
> > HTMLConfiguration. First I tried validating against the HTML 4.01
> > DTD, but since it's totally malformed (XHTML Basic 1.0 DTD is invalid
> > and now this? who writes these flippin things????), I took the XHTML
> > 1.0 Strict DTD and changed all the elements to be declared in upper
> > case (and removed "xmlns" and "xml:space" stuff) and obtained the
> > local URL via a Catalog-based entity resolver. I set the following
> > parameters...
> >
> >     config.setParameter("validate", Boolean.TRUE);
> >     config.setParameter("schema-type", javax.xml.XMLConstants.XML_DTD_NS_URI);
> >     config.setParameter("schema-location", url.toExternalForm());
> >     config.setParameter("namespaces", Boolean.FALSE);
> >     config.setParameter("well-formed", Boolean.FALSE);
> >
> > It all loads up just fine, but fails because of a
> > NullPointerException in HTMLElementImpl when calling
> > getAttributeNodeNS() inside DOMNormalizer.startElement() (see line 1790)...
> >
> >     for (int i = 0; i < attrCount; i++) {
> >         attributes.getName(i, fAttrQName);
> >         Attr attr = null;
> >
> >         attr = currentElement.getAttributeNodeNS(fAttrQName.uri,
> >                 fAttrQName.localpart);
> >         ....
> >         ....
> >     }
> >
> > This is because HTMLElementImpl, on line 158, calls toLowerCase() on
> > the localName...
> >
> >     return super.getAttributeNode( localName.toLowerCase(Locale.ENGLISH) );
> >
> > The reason why the localName is null in this case is that the "for"
> > loop above loops over *all* possible attributes of the element
> > without checking for attribute.isSpecified() before calling
> > getAttributeNodeNS(). If the attribute is not specified, of course
> > it is going to be null, so why bother calling it?
> >
> > I worked around this by modifying
> > HTMLElementImpl.getAttributeNodeNS() to return null if the provided
> > 'localName' is null, avoiding the inevitable NullPointerException
> > upon the toLowerCase() call. The in memory validation works after
> > this change! Yippie!
> >
> > So, the question is, where is this properly fixed? I suppose it
> > would be smart for HTMLElementImpl to be checking for null before
> > attempting to manipulate the string to put it in all lowercase, so,
> > maybe that should be patched regardless. However, shouldn't the
> > first line in the "for" loop of DOMNormalizer.startElement() be....
> >
> >     if (!attributes.isSpecified(i)) continue;
> >
> > If the attribute isn't specified, why attempt to get the attribute
> > node? It's already known that it's going to be null, isn't
> > it? Wouldn't this even be a minor optimization? Is there a good
> > reason not to do this?
> >
> > Jake
> >
> > [1] http://issues.apache.org/jira/browse/XERCESJ-1200

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]