Thank you! Fixed my problem perfectly!

Gabriel Genellina wrote:
> At Thursday 9/11/2006 20:23, i80and wrote:
>
> >I'm working on a basic web spider, and I'm having problems with the
> >urlparser.
> >[...]
> >      SpliceStart = Website.find('<a href="', (i+1))
> >      SpliceEnd = (Website.find('">', SpliceStart))
> >
> >      ParsedURL = urlparse((Website[SpliceStart+9:(SpliceEnd+1)]))
> >      robotparser.set_url(ParsedURL.hostname + '/' + 'robots.txt')
> >-----
> >Traceback (most recent call last):
> >   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 120, in <module>
> >     FindLinks(Website)
> >   File "C:/Documents and Settings/Andrew/Desktop/ScoutCode-0.09.py", line 84, in FindLinks
> >     robotparser.read()
> >   File "C:\Program Files\Python25\lib\robotparser.py", line 61, in read
> >     f = opener.open(self.url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 190, in open
> >     return getattr(self, name)(url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 451, in open_file
> >     return self.open_local_file(url)
> >   File "C:\Program Files\Python25\lib\urllib.py", line 465, in open_local_file
> >     raise IOError(e.errno, e.strerror, e.filename)
> >IOError: [Errno 2] The system cannot find the path specified: 'en.wikipedia.org\\robots.txt'
> >
> >Note the last line 'en.wikipedia.org\\robots.txt'. I want
> >'en.wikipedia.org/robots.txt'! What am I doing wrong?
>
> No, you don't want 'en.wikipedia.org/robots.txt'; you want
> 'http://en.wikipedia.org/robots.txt'
> urllib treats the former as a file: request, here the \\ in the
> normalized path.
> You are parsing the link and then building a new URI using ONLY the
> hostname part; that's wrong. Use urljoin(ParsedURL, '/robots.txt') instead.
>
> You may try Beautiful Soup for a better HTML parsing.
>
> --
> Gabriel Genellina
> Softlab SRL
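
For anyone finding this thread later, here is a minimal sketch of the fix
described above. It is written for Python 3, where urlparse and urljoin live
in urllib.parse (the thread itself uses the Python 2.5 module layout). The
function name robots_url and the sample link are hypothetical, and the sketch
assumes urljoin is given the link as a string rather than the parsed result.

```python
from urllib.parse import urljoin, urlparse


def robots_url(link):
    """Return the robots.txt URL for the site that `link` points to.

    `robots_url` is a hypothetical helper name, not part of any library.
    """
    parts = urlparse(link)
    # Without a scheme and host, urllib would treat the string as a local
    # file path, which is what produced the IOError in the traceback.
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an absolute URL, got %r" % (link,))
    # urljoin takes the URL string; the leading slash replaces the whole
    # path while keeping the original scheme and host.
    return urljoin(link, "/robots.txt")


print(robots_url("http://en.wikipedia.org/wiki/Main_Page"))
```

The result can then be passed to set_url() on a
urllib.robotparser.RobotFileParser instance before calling read().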
-- http://mail.python.org/mailman/listinfo/python-list