At Thursday 9/11/2006 20:23, i80and wrote:

I'm working on a basic web spider, and I'm having problems with the
urlparser.
[...]
            SpliceStart = Website.find('<a href="', (i+1))
            SpliceEnd = (Website.find('">', SpliceStart))

            ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
            robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
-----
Traceback (most recent call last):
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
    FindLinks(Website)
  File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
    robotparser.read()
  File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
    f = opener.open(self.url)
  File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
    return getattr(self, name)(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
    return self.open_local_file(url)
  File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'

Note the last line 'en.wikipedia.org\\robots.txt'.  I want
'en.wikipedia.org/robots.txt'!  What am I doing wrong?

No, you don't want 'en.wikipedia.org/robots.txt'; you want 'http://en.wikipedia.org/robots.txt'. Without a scheme, urllib treats the former as a file: request, hence the \\ in the normalized path. You are parsing the link and then building a new URI using ONLY the hostname part; that's wrong, because the scheme is lost. Use urljoin(url, '/robots.txt') instead, where url is the original link string rather than the parsed result.
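As a minimal sketch of the urljoin approach: the example below assumes Python 3, where the functions live in urllib.parse (in the Python 2.5 of the original post they were in the urlparse module). The helper name robots_url and the sample link are made up for illustration.

```python
from urllib.parse import urljoin, urlparse


def robots_url(link):
    """Return the robots.txt URL for the site that `link` points to.

    `robots_url` is a hypothetical helper, not part of any library.
    """
    # A reference starting with '/' replaces the whole path of the base URL,
    # while the scheme and host are taken from `link`.
    return urljoin(link, "/robots.txt")


link = "http://en.wikipedia.org/wiki/Main_Page"  # sample link, assumed
print(robots_url(link))

# Without a scheme, the host name is parsed as part of the path,
# which is why the original code ended up with a local file request.
bare = urlparse("en.wikipedia.org/robots.txt")
print(repr(bare.scheme), repr(bare.netloc), bare.path)
```

The string returned by robots_url() is what should be passed to the robot parser's set_url() before calling read().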

You may also want to try Beautiful Soup for better HTML parsing.
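To show what parser-based link extraction looks like compared with slicing the page by hand using find(), here is a rough sketch. Note that it does not use Beautiful Soup, which is a third-party package; it swaps in the standard library's html.parser (Python 3) so that it runs without installing anything. The names LinkCollector and find_links, and the sample HTML, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser passes tag and attribute names in lower case.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def find_links(html, base_url):
    """Return all link targets in `html` as absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Sample page content and base URL, assumed for illustration.
page = '<p><a href="/wiki/Python">Python</a> <a class="ext" href="http://example.org/a">A</a></p>'
for url in find_links(page, "http://en.wikipedia.org/wiki/Main_Page"):
    print(url)
```

Unlike searching for the literal text '<a href="', this also copes with anchors whose href is not the first attribute.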

--
Gabriel Genellina
Softlab SRL
-- 
http://mail.python.org/mailman/listinfo/python-list
