Re: getroot() problem

Dave Angel Sun, 23 Oct 2011 18:26:21 -0700

On 10/23/2011 09:06 PM, ???????? wrote:

C:\Documents and Settings\peng>cd c:\python32

C:\Python32>python
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>> sfile='http://finance.yahoo.com/q/op?s=A+Options'
>>> root=lxml.html.parse(sfile).getroot()

There is no problem parsing:

http://finance.yahoo.com/q/op?s=A+Options

Why can't I parse http://frux.wikispaces.com/ ?

>>> import lxml.html
>>> sfile='http://frux.wikispaces.com/'
>>> root=lxml.html.parse(sfile).getroot()


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load external entity "http://frux.wikispaces.com/"'

Double-spacing makes your message much harder to read. I can only comment in a general way, in any case. Most HTML is malformed, and not legal HTML. Although I don't have any experience with parsing it, I do with XML, which has similar problems.

The first thing I'd do is to separate the loading of the byte string from the website from the parsing of those bytes. Further, I'd make a local copy of those bytes, so you can do testing repeatably. For example, you could run the wget utility to copy the bytes locally and create a file.
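
If you would rather stay in Python than use wget, the same separation could look roughly like the sketch below. It is untested against that site; the function names and the file name frux.html are made up for illustration, and it assumes lxml.html.fromstring accepts the saved bytes. The point is only that a failure will now show whether the download or the parse is at fault.

```python
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Download the raw bytes at url; no parsing happens here."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def save_copy(data, path):
    """Write the bytes to a local file so later tests are repeatable."""
    target = Path(path)
    target.write_bytes(data)
    return target


def parse_copy(path):
    """Parse the saved bytes; this step needs the third-party lxml package."""
    import lxml.html  # imported here so the download step works without lxml
    return lxml.html.fromstring(Path(path).read_bytes())


# Hypothetical usage, one step at a time:
#   raw = fetch_bytes('http://frux.wikispaces.com/')
#   local = save_copy(raw, 'frux.html')
#   root = parse_copy(local)
```

Once frux.html exists you can rerun parse_copy as often as you like without touching the network, and you can open the file in an editor to see what the server actually sent.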

--

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list
