Re: Oh what a twisted thread we weave....

GregM Mon, 31 Oct 2005 11:32:44 -0800

Tom,

Thanks for the reply and sorry for the delay in getting back to you.
Thanks for pointing out my logic problem. I had added the 2nd part of
the if statement at the last minute...


Yes, I have a single-threaded version. It's several hundred lines and
uses COM to write the results out to an Excel spreadsheet. I was trying
to better understand threading and queues before I started hacking on my
current code... maybe that was a mistake... hey, I'm still learning, and
I learn a lot just by reading stuff posted to this group. I hope at
some point I can help others in the same way.

Here are the relevant parts of the code (no COM stuff).

Here is a summary:
# see if url exists
# if exists then
#       hit page
#       get text of page
#       see if text of page contains search terms
#       if it does then
#               update appropriate counters and lists
#       else update static line and do the next one
# when done with Links list
#       - calculate totals and times
#       - write info to xls file
# end.
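
The middle steps of the summary (look at the page text and decide which list the url belongs in) can be sketched as a small function that needs no network access. This is only an illustration: the name classify_page and the labels it returns are hypothetical, and the search strings are copied from the code in this post.

```python
# Search strings copied from the post.
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

def classify_page(webpg):
    """Return a label for the page text (hypothetical helper)."""
    if ST_zeroMatch in webpg or ST_zeroMatch2 in webpg:
        return 'zeromatch'      # search ran but returned nothing
    if ST_errorConnecting in webpg:
        return 'error'          # the site served its error page
    return 'ok'
```

Keeping this step free of globals and I/O makes it easy to call from either the single-threaded loop or a worker thread.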

# utils are functions and classes that I wrote
from utils import HttpExists2   # PrintStatic is only needed for the commented-out status line
#
# My version of 'easyExcel' with extensions and improvements.
# import excelled
import urllib2
import time
import socket
import os
#import msvcrt         # for printstatic
from datetime import datetime
import pythoncom
from sys import exc_info, stdout, argv, exit

# search terms to use for matching.
#primarySearchTerm = 'Narrow your'
ST_lookingFor = 'Looking for Something'
ST_errorConnecting = 'there has been an error connecting'
ST_zeroMatch = 'You found 0 products'
ST_zeroMatch2 = 'There are no products matching your selection'

#initialize Globals
timeout = 90                    # sets timeout for urllib2.urlopen()
failedlinks = []                # list for failed urls
zeromatch = []                  # list for 0 result searches
pseudo404 = []                  # list for shop.com 404 pages
t = 0                           # used to store starting time for getting a page.
count = 0                               # number of tests so far
pagetime = 0                    # time it took to load page
slowestpage = 0                 # slowest page time
fastestpage = 10                # fastest page time
cumulative = 0                  # total time to load all pages (used to calc. avg)

#version number of the program
version = 'B2.9'

def ShopCom404(testUrl):
        """ checks url for shop.com 404 url
                shop.com 404 url -- returns status 200
                http://www.shop.com/amos/cc/main/404/ccsyn/260
        """
        if '404' in testUrl:
                return True
        else:
                return False

##### main program #####

try:
        links = open(testfile).readlines()
except:
        exc, err, tb = exc_info()
        print 'There is a problem with the file you specified. Check the file and re-run the program.\n'
        #print str(exc)
        print str(err)
        print
        exit()

# timeout in seconds
socket.setdefaulttimeout(timeout)
totalNumberTests = len(links)
print 'URLCheck ' + version + ' by Greg Moore (c) 2005 Shop.com\n\n'
# asctime() returns a human readable time stamp whereas time() doesn't
startTimeStr = time.asctime()
start = datetime.today()
for url in links:
        count = count + 1
        #HttpExists2 - checks to see if URL exists and detects redirection.
        # handles 404's and exceptions better. Returns tuple depending on results:
        # if found: true and final url. if not found: false and attempted url
        pgChk = HttpExists2(url)
        if pgChk[0] == False:
                #failed url Exists
                failedlinks.append(pgChk[1])
        elif ShopCom404(pgChk[1]):
                #Our version of a 404
                pseudo404.append(url)
        if pgChk[0] and not ShopCom404(pgChk[1]):
                #if valid page not a 404 then get the page and check it.
                try:
                        t = time.time()
                        urlObj = urllib2.urlopen(url)
                        pagetime = time.time() - t
                        webpg = urlObj.read()
                        if (ST_zeroMatch in webpg) or (ST_zeroMatch2 in webpg):
                                zeromatch.append(url)
                        elif ST_errorConnecting in webpg:
                        # for some reason we got the error page
                        # so add it to the failed urls
                                failmsg = 'Error Connecting Page with: ' + url
                                failedlinks.append(failmsg)
                except:
                        print 'exception with: ' + url
        #figure page times
        cumulative += pagetime
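        # keep the time and its url in separate names so the next
        # comparison is still number against number
        if pagetime > slowestpage:
                slowestpage = pagetime
                slowesturl = url.strip()
        elif pagetime < fastestpage:
                fastestpage = pagetime
                fastesturl = url.strip()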
        msg = 'testing ' + str(count) + ' of ' + str(totalNumberTests) + \
                '. Current runtime: ' + str(datetime.today() - start)
        # status message that updates the same line.
        #PrintStatic(msg)

### Now write out results
end = datetime.today()
finished = datetime.today()
finishedTimeStr = time.asctime()
avg = cumulative/totalNumberTests
failed = len(failedlinks)
nomatches = len(zeromatch)

#setup COM connection to Excel and write the spreadsheet.

If I understand what I've read about threading, I need to convert much
of the above into a function and then call threading.Thread's start() or
run() to fire off each thread. But where and how to do that, and how to
limit it to X number of threads, is the part I get lost on. The examples
I've seen using queues and threads never show a list (sequence) as the
source data, and I'm not sure where I'd use the Queue stuff, or for that
matter whether I'm just complicating the issue.
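
For reference, one common shape for this is a fixed pool of worker threads that all pull from a single queue, with the list of links simply poured into that queue up front. The sketch below is a minimal illustration under stated assumptions, not a drop-in replacement for the code above: check_url is a hypothetical stand-in for the per-url work (HttpExists2, urlopen, term matching), NUM_WORKERS is an arbitrary cap, and the import fallback only exists so the same text runs on both old and new Pythons.

```python
import threading
try:
    import queue                # Python 3 name
except ImportError:
    import Queue as queue       # Python 2 name

NUM_WORKERS = 4                 # hypothetical cap on concurrent threads

def check_url(url):
    """Hypothetical stand-in for the real per-url work.
    It only strips the line so the sketch runs without a network."""
    return url.strip()

def worker(jobs, results, lock):
    # Each thread loops until it pulls a None sentinel off the queue.
    while True:
        url = jobs.get()
        if url is None:
            break
        outcome = check_url(url)
        lock.acquire()          # the results list is shared between threads
        try:
            results.append(outcome)
        finally:
            lock.release()

def run_all(links, num_workers=NUM_WORKERS):
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()
    for url in links:           # the source list goes straight into the queue
        jobs.put(url)
    for _ in range(num_workers):
        jobs.put(None)          # one sentinel per worker so every thread exits
    threads = [threading.Thread(target=worker, args=(jobs, results, lock))
               for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()               # wait for all workers before reporting
    return results

links = ['http://example.com/a\n', 'http://example.com/b\n',
         'http://example.com/c\n']
checked = run_all(links, num_workers=2)
```

The number of threads is limited by how many workers are started, not by the length of the list, and results arrive in whatever order the threads finish, so anything order-sensitive should be sorted afterwards.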

Once again thanks for the help.
Greg.

-- 
http://mail.python.org/mailman/listinfo/python-list
