Dave Brueck wrote:
> If you're tossing images that are too _small_, is there any benefit to
> not downloading the whole image, checking it, and then throwing it away?

It's a 'webscraper' app that downloads images based on search criteria. The user may want only images above 640x480, although the general case will be something like 200x200, to avoid downloading thumbnails.

> Checking just the first 1K probably won't save you too much time unless
> you're over a modem. Are you using a byte-range HTTP request to pull
> down the images or just a normal GET (via e.g. urllib)? If you're not
> using a byte-range request, then all of the data is already on its way
> so maybe you could go ahead and get it all.

I'm not familiar with byte-range requests. Is this a standard feature of web servers? I know there will be more than one K in the pipeline if I do a read, but if I close the file object from urllib it will stop the download if there is data remaining, won't it?

> But hey, if your current approach works... :) It _is_ a bit
> unconventional, so to reduce the risk you could test it on a decent mix
> of image types (normal JPEG, progressive JPEG, normal & progressive GIF,
> png, etc.) - just to make sure PIL is able to handle partial data for
> all different types you might encounter.
>
> Also, if PIL can't handle the partial data, can you reliably detect that
> scenario? If so, you could detect that case and use the
> download-it-all-and-check approach as a failsafe.

The PIL code worked with most of the images I threw at it (just JPEGs). If there was no 'size' attribute, I just continued to download the entire image. It may have caused a memory leak though: with this code in place, memory usage increased continuously.

Actually, this may all be moot now. Originally I looked at reading the image dimensions from the JPEG header, but that turned out to be non-trivial and I gave up. Fortunately I found some Perl code that does it, and converted it to Python (and I don't even know Perl!). Here's the code if anyone is interested:

import struct

def GetJpegSize(data):
    idata = iter(data)
    width = None
    height = None
    try:
        # A JPEG stream must start with the SOI marker, 0xFF 0xD8.
        B1 = ord(idata.next())
        B2 = ord(idata.next())
        if B1 != 0xFF or B2 != 0xD8:
            return -1, -1

        while True:
            # Scan forward to the next marker, skipping 0xFF fill bytes.
            byte = ord(idata.next())
            while byte != 0xFF:
                byte = ord(idata.next())
            while byte == 0xFF:
                byte = ord(idata.next())

            if byte >= 0xc0 and byte <= 0xc3:
                # Frame header: skip segment length (2 bytes) and
                # sample precision (1 byte), then read height and width.
                idata.next()
                idata.next()
                idata.next()
                # A list comprehension lets StopIteration reach the
                # except clause if the data is truncated here.
                height, width = struct.unpack(
                    '>HH', "".join([idata.next() for b in range(4)]))
                break
            else:
                # Any other segment: read its length and skip the payload.
                offset = struct.unpack('>H', idata.next() + idata.next())[0] - 2
                for _ in xrange(offset):
                    idata.next()
    except StopIteration:
        pass

    return width, height

if __name__ == "__main__":
    first_k = file("test.jpg", "rb").read(1024)
    print GetJpegSize(first_k)

Returns (-1, -1) for a non-JPEG, (None, None) if the size wasn't contained in the data supplied (some JPEGs have embedded thumbnails), or (width, height) if the dimensions were found.

And the original source: http://wiki.tcl.tk/757

Thanks,

Will

-- 
http://www.willmcgugan.com
"".join( [ {'*':'@','^':'.'}.get(c,None) or chr(97+(ord(c)-84)%26) for c in "jvyy*jvyyzpthtna^pbz" ] )
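
On the byte-range question raised in the thread: the Range header is a standard part of HTTP/1.1, though not every server honours it. Below is a rough Python 3 sketch, not the code from this post, that pairs a Range request with the same header-walking idea as GetJpegSize. The names first_bytes_request and jpeg_size are made up for illustration, and the 1024-byte default is only an assumption carried over from the "first 1K" discussion.

```python
import struct
import urllib.request


def first_bytes_request(url, nbytes=1024):
    """Build a GET request asking for only the first nbytes of url.

    Hypothetical helper: a server that supports ranges answers with
    206 Partial Content; one that ignores the header sends the whole body.
    """
    headers = {"Range": "bytes=0-%d" % (nbytes - 1)}
    return urllib.request.Request(url, headers=headers)


def jpeg_size(data):
    """Return (width, height) read from the JPEG frame header in data.

    Returns (-1, -1) if data does not start with the JPEG SOI marker,
    or (None, None) if no frame header lies within the bytes supplied.
    """
    if data[:2] != b"\xff\xd8":
        return -1, -1
    pos = 2
    try:
        while True:
            # Scan to the next marker, skipping any 0xFF fill bytes.
            while data[pos] != 0xFF:
                pos += 1
            while data[pos] == 0xFF:
                pos += 1
            marker = data[pos]
            pos += 1
            if 0xC0 <= marker <= 0xC3:
                # Skip segment length (2 bytes) and precision (1 byte).
                height, width = struct.unpack(">HH", data[pos + 3:pos + 7])
                return width, height
            # Other segments: the length field counts itself, so jump by it.
            (length,) = struct.unpack(">H", data[pos:pos + 2])
            pos += length
    except (IndexError, struct.error):
        # Ran off the end of the supplied bytes before finding dimensions.
        return None, None
```

To use it, pass the request to urllib.request.urlopen, read the response body, and hand the bytes to jpeg_size. If the result is (None, None), falling back to a full download, as described above for the PIL approach, still applies.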