On Jul 17, 5:31 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Zaki wrote:
> > On Jul 17, 2:49 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> >> Zaki wrote:
> >>> Hey all,
> >>> I'm really new to Python and this may seem like a really dumb
> >>> question, but basically, I wrote a script to do the following;
> >>> however, the processing time/memory usage is not what I'd like it
> >>> to be. Any suggestions?
> >>>
> >>> Outline:
> >>> 1. Read tab-delimited files from a directory; the files are of 3
> >>> types: install, update, and q. All 3 types contain ID values that
> >>> are the only part of interest.
> >>> 2. Using set() and set.add(), generate a set of unique IDs from the
> >>> install and update files.
> >>> 3. Using the set created in (2), check the q files to see if there
> >>> are matches for IDs. Keep all matches, and add any non-matches
> >>> (which only occur once in the q file) to a queue of lines to be
> >>> removed from the q files.
> >>> 4. Remove the queued lines from each q file. (I haven't quite
> >>> written the code for this, but I was going to implement it using
> >>> csv.writer, rewriting all the lines in the file except for the
> >>> ones in the removal queue.)
> >>>
> >>> Now, I've tried running this and it takes much longer than I'd
> >>> like. I was wondering if there might be a better way to do things
> >>> (I thought generator expressions might be a good way to attack
> >>> this problem, as you could generate the set, and then check to see
> >>> if there's a match, and write each line that way).
> >>
> >> Why are you checking and removing lines in 2 steps? Why not copy the
> >> matching lines to a new q file and then replace the old file with
> >> the new one (or, maybe, delete the new q file if no lines were
> >> removed)?
> >
> > That's what I've done now.
> >
> > Here is the final code that I have running. It's very much 'hack'-type
> > code, not at all efficient or optimized, and any help in optimizing
> > it would be greatly appreciated.
> >
> > import csv
> > import sys
> > import os
> > import time
> >
> > begin = time.time()
> >
> > # Check minutes elapsed
> > def timeElapsed():
> >     current = time.time()
> >     elapsed = current - begin
> >     return round(elapsed/60)
> >
> > # USAGE: python logcleaner.py <input_dir> <output_dir>
> >
> > inputdir = sys.argv[1]
> > outputdir = sys.argv[2]
> >
> > logfilenames = os.listdir(inputdir)
> >
> > IDs = set()  # IDs from update and install logs
> > foundOnceInQuery = set()
> > #foundTwiceInQuery = set()
> > #IDremovalQ = set()  # Unnecessary, duplicate of foundOnceInQuery:
> > #    queue of IDs to remove from query logs (IDs found only once in
> > #    query logs)
> >
> > # Generate filename queues for install/update logs and query logs
> > iNuQ = []
> > queryQ = []
> >
> > for filename in logfilenames:
> >     if filename.startswith(("par1.install", "par1.update")):
> >         iNuQ.append(filename)
> >     elif filename.startswith("par1.query"):
> >         queryQ.append(filename)
> >
> > totalfiles = len(iNuQ) + len(queryQ)
> > print "Total # of Files to be Processed:", totalfiles
> > print "Install/Update Logs to be processed:", len(iNuQ)
> > print "Query logs to be processed:", len(queryQ)
> >
> > # Process install/update queue to generate the set of valid IDs
> > currentfile = 1
> > for file in iNuQ:
> >     print "Processing", currentfile, "install/update log out of", len(iNuQ)
> >     print timeElapsed()
> >     reader = csv.reader(open(inputdir + file), delimiter='\t')
> >     for row in reader:
> >         IDs.add(row[2])
> >     currentfile += 1
>
> Best not to call it 'file'; that's a built-in name.
>
> Also, you could use 'enumerate', and joining file paths is safer with
> os.path.join():
>
> for currentfile, filename in enumerate(iNuQ, start=1):
>     print "Processing", currentfile, "install/update log out of", len(iNuQ)
>     print timeElapsed()
>     current_path = os.path.join(inputdir, filename)
>     reader = csv.reader(open(current_path), delimiter='\t')
>     for row in reader:
>         IDs.add(row[2])
>
> > print "Finished processing install/update logs"
> > print "Unique IDs found:", len(IDs)
> > print "Total Time Elapsed:", timeElapsed()
> >
> > currentfile = 1
> > for file in queryQ:
>
> Similar remarks to above ...
>
> >     print "Processing", currentfile, "query log out of", len(queryQ)
> >     print timeElapsed()
> >     reader = csv.reader(open(inputdir + file), delimiter='\t')
> >     outputfile = csv.writer(open(outputdir + file), 'w')
>
> ... and also here. (Note too that the 'w' is in the wrong place: it's
> being passed to csv.writer as the dialect, and the output file is being
> opened for reading. It should be csv.writer(open(outputdir + file, 'w')).)
>
> >     for row in reader:
> >         if row[2] in IDs:
> >             ouputfile.writerow(row)
>
> Should be 'outputfile'.
>
> >         else:
> >             if row[2] in foundOnceInQuery:
> >                 foundOnceInQuery.remove(row[2])
>
> You're removing the ID here ...
>
> >                 outputfile.writerow(row)
> >                 #IDremovalQ.remove(row[2])
> >                 #foundTwiceInQuery.add(row[2])
> >             else:
> >                 foundOnceInQuery.add(row[2])
>
> ... and adding it again here!
>
> >                 #IDremovalQ.add(row[2])
> >
> >     currentfile += 1
>
> For safety you should close the files after use (or open them with a
> 'with' statement so that they're closed automatically).
>
> > print "Finished processing query logs and writing new files"
> > print "# of Query log entries removed:", len(foundOnceInQuery)
> > print "Total Time Elapsed:", timeElapsed()
>
> Apart from that, it looks OK.
>
> How big are the q files? If they're not too big and most of the time
> you're not removing rows, you could put the output rows into a list and
> then create the output file only if rows have been removed; otherwise
> just copy the input file, which might be faster.
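>
> For instance, something along these lines (an untested sketch, not a
> drop-in replacement: the names kept_rows/removed/in_path/out_path are
> mine, it leaves out your foundOnceInQuery bookkeeping for brevity and
> simply drops every row whose ID isn't in IDs, and the 'with'
> statements need Python 2.6, or 'from __future__ import with_statement'
> on 2.5):
>
> import shutil
>
> for currentfile, filename in enumerate(queryQ, start=1):
>     print "Processing", currentfile, "query log out of", len(queryQ)
>     print timeElapsed()
>     in_path = os.path.join(inputdir, filename)
>     out_path = os.path.join(outputdir, filename)
>     kept_rows = []
>     removed = False
>     # 'with' closes the file even if an exception occurs.
>     with open(in_path) as infile:
>         for row in csv.reader(infile, delimiter='\t'):
>             if row[2] in IDs:
>                 kept_rows.append(row)
>             else:
>                 removed = True  # at least one row will be dropped
>     if removed:
>         # Only rewrite the file when something was actually removed.
>         with open(out_path, 'w') as outfile:
>             csv.writer(outfile, delimiter='\t').writerows(kept_rows)
>     else:
>         # Nothing removed: a straight copy avoids re-parsing and
>         # re-writing the rows.
>         shutil.copyfile(in_path, out_path)
>
> The buffering trades memory for the saved rewrite, so it's only a win
> if the q files comfortably fit in memory.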
MRAB, could you please repost what I sent to you here, as I meant to
post it in the main discussion?
--
http://mail.python.org/mailman/listinfo/python-list