On 23/05/2006 10:19 AM, Brian wrote:
> First off, I am sorry for cluttering this group with my inept
> questions, but I am stuck again despite a few hours of hair pulling.
>
> I have a function (below) that takes a list of html pages that have
> images on them (not porn but boats). This function then (supposedly)
> goes through and extracts the links to those images and puts them into
> a list, appending with each iteration of the for loop. The list of
> html pages is 82 items long and each page has multiple image links.
> When the function gets to item 77 or so, the list gets all funky.
> Sometimes it goes empty, and others it is a much more abbreviated list
> than I expect - it should have roughly 750 image links.

The list (not a tuple!!) found by findall is empty or smaller than
expected when the webmaster has used .jpg instead of .jpeg. Pages 27,
77, and 79-82 at the moment all have .jpg, as you would have found out
had you inspected the actual data you are operating on instead of
guessing. The print statement is your friend; use it. Your browser's
"view source" functionality (Ctrl-U in Firefox) is also handy.

However, if you mean that your foundPics list becomes empty, then either
you haven't posted the code that you actually used, or the pixies from
the bottom of the garden have been rearranging it for you :-)

> When I looked at it while running, it appears as if my regex is
> actually appending a tuple (I think) of the results it finds to the
> list.

No, read the manual. findall returns a list. *You* are appending that
list to your list.

> My best guess is that the list is getting too big and croaks.

Very unlikely. In any case you would have seen evidence, like an
exception and a traceback ... or maybe just your swap disk going into
overdrive :-)

> Since one of the objects of the function is also to be able to count
> the items in the list, I am getting some strange errors there as well.

And what were the strange errors that you perceived?
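The append-versus-extend point is easy to see in isolation. A minimal sketch (written in modern Python 3 syntax, with a made-up two-line HTML snippet rather than the real cetacea pages):

```python
import re

# John's pattern; '.' does not match newlines, so with one <img> per
# line it yields one match per line rather than one greedy over-match
pics = re.compile(r"images/.*\.jpeg")
html = '<img src="images/a.jpeg">\n<img src="images/b.jpeg">'

matches = pics.findall(html)
print(matches)  # a list of strings, not a tuple

found_append = []
found_append.append(matches)   # one element: the whole inner list

found_extend = []
found_extend.extend(matches)   # two elements: the individual links
```

Here `len(found_append)` is 1 while `len(found_extend)` is 2, which is exactly why counting items in a list built with append understates the number of pictures.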
> Here is the code:

[snip]

Here is mine:

import re, urllib

def countPics():
    foundPics = []
    links_count = 0
    pics_count = 0
    # for better results, change jpeg to jpe?g
    pics = re.compile(r"images/.*\.jpeg")
    for link in ["cetaceaPage%02d.html" % x for x in range(1, 83)]:
        picPage = urllib.urlopen(
            "http://continuouswave.com/whaler/cetacea/" + link)
        links_count += 1
        html = picPage.read()
        picPage.close()
        findall_result = pics.findall(html)
        pics_count += len(findall_result)
        print links_count, pics_count, link, findall_result
        foundPics.append(findall_result)
    print("done")

countPics()

You may wish to change that append to extend, but then you will lose
track of which pictures are on which page, if that matters to you.

HTH,
John
--
http://mail.python.org/mailman/listinfo/python-list
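The jpe?g tweak John suggests in the comment makes the "e" optional, so the pattern catches both spellings the webmaster used. A quick check (Python 3 syntax, hypothetical filenames):

```python
import re

# 'e?' means zero or one 'e', so both .jpg and .jpeg match
pics = re.compile(r"images/.*\.jpe?g")
html = ('<img src="images/whale01.jpeg">\n'
        '<img src="images/whale02.jpg">\n')  # made-up page snippet

print(pics.findall(html))
```

With the original .jpeg-only pattern, the second link would be silently dropped, which is precisely what happened on pages 27, 77, and 79-82.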