On 10/23/2011 09:06 PM, ???????? wrote:
C:\Documents and Settings\peng>cd c:\python32

C:\Python32>python
Python 3.2.2 (default, Sep  4 2011, 09:51:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
import lxml.html
sfile='http://finance.yahoo.com/q/op?s=A+Options'
root=lxml.html.parse(sfile).getroot()

There is no problem parsing:

http://finance.yahoo.com/q/op?s=A+Options

Why can't I parse:

http://frux.wikispaces.com/ ?

import lxml.html
sfile='http://frux.wikispaces.com/'
root=lxml.html.parse(sfile).getroot()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\site-packages\lxml\html\__init__.py", line 692, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "lxml.etree.pyx", line 2942, in lxml.etree.parse (src/lxml/lxml.etree.c:54187)
  File "parser.pxi", line 1528, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:79485)
  File "parser.pxi", line 1557, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:79768)
  File "parser.pxi", line 1457, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:78843)
  File "parser.pxi", line 997, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:75698)
  File "parser.pxi", line 564, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:71739)
  File "parser.pxi", line 645, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:72614)
  File "parser.pxi", line 583, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:71927)
IOError: Error reading file 'b'http://frux.wikispaces.com/'': b'failed to load external entity "http://frux.wikispaces.com/"'

Double-spacing makes your message much harder to read. In any case, I can only comment in a general way. Most HTML is malformed and is not legal HTML. I don't have any experience parsing it, but I do with XML, which has similar problems.

The first thing I'd do is separate loading the byte string from the website from parsing those bytes. Further, I'd make a local copy of those bytes so you can test repeatably. For example, you could run the wget utility to copy the bytes into a local file.
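
If you'd rather stay in Python than use wget, the same split might look roughly like the sketch below. The function names fetch_to_file and parse_bytes and the file name frux.html are made up for illustration, and the code is untested against that particular site. The download uses only the standard library; the parse step hands the saved bytes to lxml.html.fromstring, so no network access happens during parsing.

```python
import os
import urllib.request


def fetch_to_file(url, path, timeout=30):
    """Download the raw bytes at `url` once and keep a copy in `path`.

    Later calls reuse the local copy, so every test run sees the same bytes.
    (Hypothetical helper, not part of lxml or urllib.)
    """
    if not os.path.exists(path):
        # A failure here is a download problem, not a parsing problem:
        # urllib raises urllib.error.URLError or HTTPError for it.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
        with open(path, "wb") as f:
            f.write(data)
    with open(path, "rb") as f:
        return f.read()


def parse_bytes(data):
    """Parse bytes that were already downloaded; the network is not touched."""
    # Imported here so the download step can be run even without lxml.
    import lxml.html
    return lxml.html.fromstring(data)
```

You would call fetch_to_file('http://frux.wikispaces.com/', 'frux.html') first and pass the result to parse_bytes(). If the first call raises, the trouble is in fetching the page; if only the second one does, it is in the HTML itself, and you have the exact bytes on disk to examine.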
--

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list