On Wed, 12 Dec 2007 16:44:01 -0800, igor.tatarinov wrote:

> Here is some of my code. Tell me what's wrong with it :)
>
> def loadFile(inputFile, loader):
>     # .zip files don't work with zlib
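(As an aside on that comment: .zip archives indeed aren't plain zlib streams, but there is no need to shell out to zcat either. A minimal sketch of reading lines straight from a compressed file with the standard library -- the helper name iter_lines and the assumption that a .zip archive holds a single member are mine, not the original poster's:)

```python
import gzip
import io
import zipfile

def iter_lines(path):
    """Yield text lines from a .gz, .zip or plain text file,
    without spawning an external 'zcat' process."""
    if path.endswith('.gz'):
        # gzip.open handles the gzip framing that raw zlib does not
        with gzip.open(path, 'rt') as f:
            for line in f:
                yield line
    elif path.endswith('.zip'):
        # .zip archives need the zipfile module, not zlib directly
        with zipfile.ZipFile(path) as zf:
            name = zf.namelist()[0]  # assumption: one member per archive
            with io.TextIOWrapper(zf.open(name)) as f:
                for line in f:
                    yield line
    else:
        with open(path) as f:
            for line in f:
                yield line
```

This keeps the loading loop pure Python (`for line in iter_lines(inputFile): loader.handleLine(line)`) and portable to systems without zcat.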
Pardon?

>     f = popen('zcat ' + inputFile)
>     for line in f:
>         loader.handleLine(line)

Do you really need to compress the file? Five million lines isn't a
lot. It depends on the length of each line, naturally, but I'd be
surprised if it were more than 100MB.

> ...
>
> In Loader class:
>     def handleLine(self, line):
>         # filter out 'wrong' lines
>         if not self._dataFormat(line): return

Who knows what the _dataFormat() method does? How complicated is it?
Why is it a private method?

>         # add a new output record
>         rec = self.result.addRecord()

Who knows what this does? How complicated is it?

>         for col in self._dataFormat.colFormats:

Hmmm... a moment ago, _dataFormat seemed to be a method, or at least a
callable. Now it has grown a colFormats attribute. Complicated and
confusing.

>             value = parseValue(line, col)
>             rec[col.attr] = value
>
> And here is parseValue (will using a hash-based dispatch make it much
> faster?):

Possibly, but not enough to reduce 20 minutes to one or two.

But you know something? Your code looks like a bad case of
over-generalisation. I assume it's a translation of your C++ code --
no wonder it takes an entire minute to process the file! (Oh lord, did
I just say that???)

Object-oriented programming is a useful tool, but sometimes you don't
need a HyperDispatcherLoaderManagerCreator, you just need a hammer.

In your earlier post, you gave the data specification: "The text file
has a fixed format (like a punchcard). The columns contain integer,
real, and date values. The output files are the same values in
binary."

Easy-peasy. First, some test data:

fp = open('BIG', 'w')
for i in xrange(5000000):
    anInt = i % 3000
    aBool = ['TRUE', 'YES', '1', 'Y', 'ON',
             'FALSE', 'NO', '0', 'N', 'OFF'][i % 10]
    aFloat = ['1.12', '-3.14', '0.0', '7.42'][i % 4]
    fp.write('%s %s %s\n' % (anInt, aBool, aFloat))
    if i % 45000 == 0:
        # Write a comment and a blank line.
        fp.write('# this is a comment\n \n')
fp.close()

Now let's process it:

import struct

# Define converters for each type of value to binary.

def fromBool(s):
    """String to boolean byte."""
    s = s.upper()
    if s in ('TRUE', 'YES', '1', 'Y', 'ON'):
        return struct.pack('b', True)
    elif s in ('FALSE', 'NO', '0', 'N', 'OFF'):
        return struct.pack('b', False)
    else:
        raise ValueError('not a valid boolean')

def fromInt(s):
    """String to integer bytes."""
    return struct.pack('l', int(s))

def fromFloat(s):
    """String to floating point bytes."""
    return struct.pack('f', float(s))

# Assume three fields...
DEFAULT_FORMAT = [fromInt, fromBool, fromFloat]

# And three files...
OUTPUT_FILES = ['ints.out', 'bools.out', 'floats.out']

def process_line(s, format=DEFAULT_FORMAT):
    s = s.strip()
    fields = s.split()  # I assume the fields are whitespace separated
    assert len(fields) == len(format)
    return [f(x) for (x, f) in zip(fields, format)]

def process_file(infile, outfiles=OUTPUT_FILES):
    out = [open(f, 'wb') for f in outfiles]
    for line in file(infile, 'r'):
        # ignore leading/trailing whitespace and comments
        line = line.strip()
        if line and not line.startswith('#'):
            fields = process_line(line)
            # now write the fields to the files
            for x, fp in zip(fields, out):
                fp.write(x)
    for f in out:
        f.close()

And now let's use it and see how long it takes:

>>> import time
>>> s = time.time(); process_file('BIG'); time.time() - s
129.58465385437012

Naturally if your converters are more complex (e.g. date-time), or if
you have more fields, it will take longer to process, but then I've
made no effort at all to optimize the code.

-- 
Steven.
-- 
http://mail.python.org/mailman/listinfo/python-list
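(A postscript: a quick way to sanity-check the binary files Steven's process_file() writes is to read the packed values back with struct.unpack. This helper, read_back, is my addition, not part of the original post; it assumes every value in a given file was packed with the same native format code, matching the 'l', 'b' and 'f' used above:)

```python
import struct

def read_back(path, fmt):
    """Read back a stream of values that were each written with
    struct.pack(fmt, value), using the native size for fmt."""
    size = struct.calcsize(fmt)
    values = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(size)
            if len(chunk) < size:  # EOF (or a truncated trailing record)
                break
            values.append(struct.unpack(fmt, chunk)[0])
    return values
```

For example, read_back('ints.out', 'l') should reproduce the integer column, and read_back('floats.out', 'f') the floats (to single precision).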