About 8 years ago, on PC hardware, I was reading two 5 MB files and doing a 'fancy' diff between them in about 60 seconds. Granted, your file is likely bigger, but modern hardware is faster too, and 20 minutes does seem a bit high.
Can't talk about the rest of your code, but some parts of it may be optimized:

    def parseValue(line, col):
        s = line[col.start:col.end+1]
        # no switch in python
        if col.format == ColumnFormat.DATE:
            return Format.parseDate(s)
        if col.format == ColumnFormat.UNSIGNED:
            return Format.parseUnsigned(s)

How about taking the big if clause out? That would require making all the formatters into functions, rather than in-lining some of them, but it may clean things up.

    # prebuilding a lookup of functions vs. expected formats...
    # This is done once.
    # Remember, you have to position this dict's computation _after_ all the
    # Format.parseXXX declarations. Don't worry, Python _will_ complain if you don't.
    dict_format_func = {
        ColumnFormat.DATE: Format.parseDate,
        ColumnFormat.UNSIGNED: Format.parseUnsigned,
        ....
    }

    def parseValue(line, col):
        s = line[col.start:col.end+1]
        # get applicable function, apply it to s
        return dict_format_func[col.format](s)

Also...

    if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or rstrip(s)

Watch out! If 'col.strip' is a method (like str.strip) rather than a boolean attribute you set yourself, then it is not the result of stripping the column; it is the strip _function_ itself, bound to the col object, so it will always be true. I get caught by those things all the time :-(

I agree that taking out the dot.dot.dots would help, but I wouldn't expect it to matter that much, unless it was in an incredibly tight loop. It might be that.

    if s.startswith('999999') or s.startswith('000000'): return -1

would be better as...

    # outside of loop, define a set of values for which you want to return -1
    set_return = set(['999999','000000'])

    # lookup first 6 chars in your set
    def parseDate(s):
        if s[0:6] in set_return:
            return -1
        return int(mktime(strptime(s, "%y%m%d")))

Bottom line: Python's built-in data objects, such as dictionaries and sets, are very much optimized. Relying on them, rather than writing a lot of ifs and doing weird data structure manipulations in Python itself, is a good approach to try.
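To pull the dict and set suggestions together, here is a rough, self-contained sketch in current Python 3. I don't have your ColumnFormat or Format classes, so the constants, the Column tuple, the parse_* helpers and the sample line are all made-up stand-ins. Treat it as an illustration of the idea, not as your actual code.

```python
from collections import namedtuple
from time import mktime, strptime

# Hypothetical stand-ins for the ColumnFormat constants (not the real class).
DATE, UNSIGNED, STRING = "date", "unsigned", "string"

# Hypothetical column descriptor: inclusive start/end offsets plus a format tag.
Column = namedtuple("Column", "start end format")

# Built once, outside any loop: 6-char prefixes that mean "no date".
NO_DATE_PREFIXES = frozenset(["999999", "000000"])


def parse_date(s):
    # One set lookup replaces the two startswith() calls.
    if s[:6] in NO_DATE_PREFIXES:
        return -1
    # mktime() interprets the parsed date in the local timezone.
    return int(mktime(strptime(s, "%y%m%d")))


def parse_unsigned(s):
    return int(s)


def parse_string(s):
    return s.rstrip()


# Built once, after the parse_* functions exist: format tag -> parser.
FORMAT_FUNCS = {
    DATE: parse_date,
    UNSIGNED: parse_unsigned,
    STRING: parse_string,
}


def parse_value(line, col):
    s = line[col.start:col.end + 1]
    # Look up the applicable function and apply it to the slice.
    return FORMAT_FUNCS[col.format](s)


# Made-up fixed-width record: 6-char date, 6-char number, 10-char name.
line = "070115000042widget    "
columns = [Column(0, 5, DATE), Column(6, 11, UNSIGNED), Column(12, 21, STRING)]
values = [parse_value(line, col) for col in columns]

# A bound method is truthy no matter what the string holds.
method_is_truthy = bool("".strip)
```

With this made-up record, the second and third values come out as 42 and 'widget'; the first is a timestamp that depends on your local timezone. The last line shows the truthiness trap mentioned above: a bound method counts as true even on an empty string.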
Try to build those objects outside of your main processing loops.

Cheers

Douhet-did-suck