Nuno Santos wrote:
> I have just started using libxml2dom to read html files and I have some
> questions I hope you guys can answer me.
>
> The page I am working on (teste.htm):
> <html>
> <head>
> <title>
> Title
> </title>
> </head>
> <body bgcolor = 'FFFFF'>
> <table>
> <tr bgcolor="#EEEEEE">
> <td nowrap="nowrap">
> <font size="2" face="Tahoma, Arial"> <a name="1375048"></a>
> </font>
> </td>
> <td nowrap="nowrap">
> <font size="-2" face="Verdana"> 8/15/2009</font>
> </td>
> </tr>
> </table>
> </body>
> </html>
>
> >>> import libxml2dom
> >>> foo = open('teste.htm', 'r')
> >>> str1 = foo.read()
> >>> doc = libxml2dom.parseString(str1, html=1)
> >>> html = doc.firstChild
> >>> html.nodeName
> u'html'
> >>> head = html.firstChild
> >>> head.nodeName
> u'head'
> >>> title = head.firstChild
> >>> title.nodeName
> u'title'
> >>> body = head.nextSibling
> >>> body.nodeName
> u'body'
> >>> table = body.firstChild
> >>> table.nodeName
> u'text' #?! Why!? Shouldn't it be a table? (1)
> >>> table = body.firstChild.nextSibling #why this works? is there a
> text element hidden? (2)
> >>> table.nodeName
> u'table'
> >>> tr = table.firstChild
> >>> tr.nodeName
> u'tr'
> >>> td = tr.firstChild
> >>> td.nodeName
> u'td'
> >>> font = td.firstChild
> >>> font.nodeName
> u'text' # (1)
> >>> font = td.firstChild.nextSibling # (2)
> >>> font.nodeName
> u'font'
> >>> a = font.firstChild
> >>> a.nodeName
> u'text' #(1)
> >>> a = font.firstChild.nextSibling #(2)
> >>> a.nodeName
> u'a'
>
>
> It seems like sometimes there are some text elements 'hidden'. This is
> probably a standard in DOM I simply am not familiar with this and I
> would very much appreciate if anyone had the kindness to explain me this.

Without a schema or something similar, a parser can't tell whether
whitespace is significant or not. So if you have

<root>
  <child/>
</root>

you will have not 2 but 4 nodes: root, a text node containing a newline
plus 2 spaces, child, and another text node containing a newline.

The same thing happens in your page: the whitespace between <body> and
<table> becomes a text node, so body.firstChild is that text node and
the table is its nextSibling.

You have to skip over the nodes you are not interested in, or use a
different XML library such as ElementTree (e.g. in the form of lxml),
which takes a different approach to text nodes: text is stored in the
.text and .tail attributes of elements instead of in separate nodes.

Diez
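
To make the whitespace point concrete, here is a minimal sketch using only
the standard library, since libxml2dom may not be installed everywhere. It
uses xml.dom.minidom as a stand-in DOM implementation (an assumption: it
reports text nodes as '#text' where libxml2dom reports 'text') and
xml.etree.ElementTree for comparison. The helper element_children is a
hypothetical name introduced for this example, not part of either library.

```python
from xml.dom import Node, minidom
import xml.etree.ElementTree as ET

XML = "<root>\n  <child/>\n</root>"


def element_children(node):
    """Return only the element children of a DOM node, skipping text nodes."""
    return [c for c in node.childNodes if c.nodeType == Node.ELEMENT_NODE]


# DOM view: the whitespace between the tags is kept as text nodes.
dom_root = minidom.parseString(XML).documentElement
dom_names = [c.nodeName for c in dom_root.childNodes]
print(dom_names)                        # ['#text', 'child', '#text']
print(repr(dom_root.firstChild.data))   # '\n  '

# Skipping the text nodes leaves just the element we care about.
print([c.nodeName for c in element_children(dom_root)])  # ['child']

# ElementTree view: iterating an element yields child elements only;
# the same whitespace lives in .text and .tail instead.
et_root = ET.fromstring(XML)
print([c.tag for c in et_root])   # ['child']
print(repr(et_root.text))         # '\n  '
print(repr(et_root[0].tail))      # '\n'
```

With a DOM API, a filter like this (or a loop over nextSibling that checks
nodeType) replaces the firstChild.nextSibling guesswork, which breaks as
soon as the whitespace in the file changes.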