I have a parser that needs to process 7 million files. After running for 2 days, it had processed only 1.5 million. I want the script to parse several files at once using multiple threads: one for each file currently being analyzed.
My code iterates through all of the directories within a top-level directory, and at each directory, iterates through each file in that directory. I structured my code something like this (I think I might be misunderstanding how to use threads):

    mythreads = []
    for directory in dirList:
        # some processing...
        for file in fileList:
            p = Process(currDir, directory, file)  # class that extends threading.Thread
            mythreads.append(p)
            p.start()
    for thread in mythreads:
        thread.join()

The actual class that extends threading.Thread is below:

    import os
    import re
    import threading
    from xml.dom import minidom

    class Process(threading.Thread):
        vlock = threading.Lock()

        def __init__(self, currDir, directory, file):
            # thread constructor
            threading.Thread.__init__(self)
            self.currDir = currDir
            self.directory = directory
            self.file = file

        def run(self):
            redirect = re.compile(r'#REDIRECT', re.I)
            xmldoc = minidom.parse(os.path.join(self.currDir, self.file))
            try:
                markup = xmldoc.firstChild.childNodes[-2].childNodes[-2].childNodes[-2].childNodes[0].data
            except:
                # An error occurred; log the bad file under the lock
                Process.vlock.acquire()
                BAD = open("bad.log", "a")
                BAD.write(self.file + "\n")
                BAD.close()
                Process.vlock.release()
                print "Error."
                return
            # if successful, do more processing...

I did an experiment with a variety of thread counts and there is no performance gain: the code takes the same amount of time to process 1000 files as it does without threads. Any ideas on what I am doing wrong?
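For reference, this is roughly the timing harness I used for the experiment (a simplified sketch: parse_batch is a stand-in for the per-file work done in Process.run() above, and I varied nthreads by hand):

    import time
    import threading
    from xml.dom import minidom

    def parse_batch(paths):
        # Stand-in: do the same per-file work as Process.run()
        # above, sequentially, over one chunk of the file list.
        for path in paths:
            minidom.parse(path)

    def time_with_threads(paths, nthreads):
        # Split the path list into nthreads interleaved chunks
        # and hand each chunk to its own thread.
        chunks = [paths[i::nthreads] for i in range(nthreads)]
        threads = [threading.Thread(target=parse_batch, args=(c,))
                   for c in chunks]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return time.time() - start

Running something like "for n in (1, 2, 4, 8): print n, time_with_threads(first_1000_files, n)" gave essentially the same elapsed time for every value of n.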