Re: Problem with parsing HTML

Michael Glavassevich Sun, 13 May 2012 21:37:28 -0700

You may already know this, but NekoHTML is maintained by a separate
community on SourceForge, which has its own tracker [1].


Thanks.

[1] http://sourceforge.net/tracker/?group_id=195122&atid=952178

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

"Yizhou Z." <westward.zh...@gmail.com> wrote on 13/05/2012 11:13:00 PM:

> Just tried out parsing some other HTML files, and found Xerces 
> worked well for the "input" tags in these HTML files. The previous 
> problem seems to have something to do with NekoHTML's parser.

> On Sun, May 13, 2012 at 1:22 PM, Yizhou Z. <westward.zh...@gmail.com> wrote:
> The NekoHTML parser uses Xerces' HTML DOM implementation, and it seems 
> to return the appropriate HTML DOM element objects for other element 
> types. But for <input />, I found that it returns an object of type 
> "org.apache.xerces.dom.ElementNSImpl". I wonder if this is a bug in 
> the version of Xerces that I use.
> 
> Thanks.
> 

> On Sun, May 13, 2012 at 5:34 AM, Michael Glavassevich <mrgla...@ca.ibm.com> wrote:
> Have you tried setting the 'document-class-name' property [1] so 
> that it points to Xerces' HTML DOM implementation? 
> 
> Thanks. 
> 
> [1] http://xerces.apache.org/xerces2-j/properties.html#dom.document-class-name
> 
> Michael Glavassevich
> XML Technologies and WAS Development
> IBM Toronto Lab
> E-mail: mrgla...@ca.ibm.com 
> E-mail: mrgla...@apache.org 
> 
> "Yizhou Z." <westward.zh...@gmail.com> wrote on 12/05/2012 11:40:23 AM:
> 
> 
> > Hi. I am using NekoHTML to parse a piece of HTML code which includes
> > an input element:
> 
> > <input type="password" name="pw" maxlength="20" class="password" 
> > id="Password1" /> 
> > 
> > My program for parsing HTML is below. 
> > 
> > DOMParser parser = new DOMParser();
> > parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "UTF-8");
> > parser.setProperty("http://cyberneko.org/html/properties/filters",
> >   new XMLDocumentFilter[] { new DefaultFilter() {
> >     public void startElement(QName element, XMLAttributes attrs, Augmentations augs)
> >         throws XNIException {
> >       element.uri = null;
> >       super.startElement(element, attrs, augs);
> >     }
> >   } });
> > BufferedReader in = new BufferedReader(new FileReader("./test.html"));
> > parser.parse(new InputSource(in));
> > HTMLDocument d = (HTMLDocument) parser.getDocument();
> > System.out.println(d.getElementById("Password1").getClass());
> > 
> > The program above prints "class 
> > org.apache.xerces.dom.ElementNSImpl" rather than "class 
> > org.apache.html.dom.HTMLInputElementImpl", which puzzles me. Is 
> > there anything I did wrong? 
> > 
> > Thanks!
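
A possible way to narrow down a problem like the one described in this thread is to list the implementation class behind every element in the parsed document, so that any element falling back to a generic class stands out. The sketch below is only an illustration: the class name ElementClassReport and the method report are hypothetical, and it uses the JDK's built-in parser on a small well-formed snippet as a stand-in so that it runs without NekoHTML or Xerces on the classpath.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Xerces or NekoHTML.
public class ElementClassReport {

    /** Maps each element name in the document to the class implementing it. */
    public static Map<String, String> report(Document doc) {
        Map<String, String> classes = new TreeMap<>();
        collect(doc.getDocumentElement(), classes);
        return classes;
    }

    // Depth-first walk over element children only.
    private static void collect(Element e, Map<String, String> classes) {
        classes.put(e.getNodeName(), e.getClass().getName());
        for (Node n = e.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                collect((Element) n, classes);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in input; the JDK parser needs well-formed markup.
        String xhtml = "<html><body><form>"
                + "<input type=\"password\" id=\"Password1\"/>"
                + "</form></body></html>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        report(doc).forEach((name, cls) -> System.out.println(name + " -> " + cls));
    }
}
```

With NekoHTML, the same report method could presumably be given the result of parser.getDocument() from the program quoted above. The class names printed depend on which DOM implementation produced the document.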
