On Feb 8, 8:06 pm, Björn Steinbrink <[EMAIL PROTECTED]> wrote:
> On Thu, 08 Feb 2007 10:20:56 -0800, k0mp wrote:
> > On Feb 8, 6:54 pm, Leif K-Brooks <[EMAIL PROTECTED]> wrote:
> >> k0mp wrote:
> >> > Is there a way to retrieve a web page and, before it is entirely
> >> > downloaded, begin to test whether a specific string is present,
> >> > and if so stop the download?
> >> > I believe that urllib.urlopen(url) will retrieve the whole page
> >> > before the program goes on to the next statement.
>
> >> Use urllib.urlopen(), but call .read() with a smallish argument, e.g.:
>
> >> >>> foo = urllib.urlopen('http://google.com')
> >> >>> foo.read(512)
> >> '<html><head> ...
>
> >> foo.read(512) will return as soon as 512 bytes have been received. You
> >> can keep calling it until it returns an empty string, indicating that
> >> there's no more data to be read.
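As an aside, here is a minimal sketch of what that early-exit loop can
look like, using Python 2's urllib2 as in the test below. fetch_until
and its parameters are just illustrative names. It also keeps a small
tail of the previous chunk, because a naive per-chunk search would miss
a target string that happens to straddle a chunk boundary:

import urllib2

def fetch_until(url, target, chunksize=512):
    # Read `url` in small chunks and stop downloading as soon as
    # `target` is seen. A tail of len(target) - 1 bytes from the
    # previous data is kept so that a match straddling a chunk
    # boundary is still found.
    f = urllib2.urlopen(url)
    try:
        tail = ''
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                return False    # page exhausted, target not found
            if target in tail + chunk:
                return True     # found it; stop reading here
            tail = (tail + chunk)[-(len(target) - 1):] if len(target) > 1 else ''
    finally:
        f.close()

print fetch_until('http://google.com', '<html>')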
> > Thanks for your answer :)
>
> > I'm not sure that read() works as you say.
> > Here is a test I've done:
>
> > import urllib2
> > import re
> > import time
>
> > CHUNKSIZE = 1024
>
> > print 'f.read(CHUNK)'
> > print time.clock()
>
> > for i in range(30):
> >     f = urllib2.urlopen('http://google.com')
> >     while True:  # read the page using a loop
> >         chunk = f.read(CHUNKSIZE)
> >         if not chunk: break
> >         m = re.search('<html>', chunk)
> >         if m is not None:
> >             break
>
> > print time.clock()
>
> > print
>
> > print 'f.read()'
> > print time.clock()
>
> > for i in range(30):
> >     f = urllib2.urlopen('http://google.com')
> >     m = re.search('<html>', f.read())
> >     if m is not None:
> >         break
>
> A fair comparison would use "pass" here, or a while loop as in the
> other case. The way it is, it compares 30 calls to read(CHUNKSIZE)
> against a single read().
>
> Björn

That's right, my test was flawed. I've replaced http://google.com with
http://aol.com, and the 'break' in the second loop with 'continue'
(because once the string is found I don't want the rest of the page to
be parsed). I now obtain this:

f.read(CHUNK)
0.1
0.17

f.read()
0.17
0.23

That is roughly 0.07 s for the 30 chunked fetches against 0.06 s for
the 30 full reads, so f.read() is still marginally faster than
f.read(CHUNK).
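A caveat on these figures: time.clock() measures CPU time on Unix (it
gives wall-clock time only on Windows), so for 30 HTTP fetches it is
mostly measuring parsing overhead rather than download time. A more
symmetric way to run the comparison might look like the sketch below;
fetch_chunked and fetch_whole are just illustrative names, and a plain
substring test stands in for re.search since the target is a literal:

import time
import urllib2

URL = 'http://aol.com'   # the page used in the corrected test
TARGET = '<html>'
CHUNKSIZE = 1024

def fetch_chunked():
    # Read in chunks, stopping at the first chunk that contains TARGET.
    f = urllib2.urlopen(URL)
    while True:
        chunk = f.read(CHUNKSIZE)
        if not chunk or TARGET in chunk:
            break
    f.close()

def fetch_whole():
    # Read the whole page, then search it once.
    f = urllib2.urlopen(URL)
    found = TARGET in f.read()
    f.close()

for func in (fetch_chunked, fetch_whole):
    start = time.time()   # wall-clock time, unlike time.clock()
    for i in range(30):
        func()
    print '%s: %.2f s' % (func.__name__, time.time() - start)

Both variants now perform the same 30 fetches, so whatever difference
remains comes from how much of each page is read and scanned.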