On Wed, 12 Dec 2007 14:48:03 -0800, igor.tatarinov wrote:

> Hi, I am pretty new to Python and trying to use it for a relatively
> simple problem of loading a 5 million line text file and converting it
> into a few binary files. The text file has a fixed format (like a
> punchcard). The columns contain integer, real, and date values. The
> output files are the same values in binary. I have to parse the values
> and write the binary tuples out into the correct file based on a given
> column. It's a little more involved but that's not important.

I suspect that this actually is important, and that your slowdown has
everything to do with the stuff you dismiss and nothing to do with
Python's object model or execution speed.

> I have a C++ prototype of the parsing code and it loads a 5 Mline file
> in about a minute. I was expecting the Python version to be 3-4 times
> slower and I can live with that. Unfortunately, it's 20 times slower and
> I don't see how I can fix that.

I've run a quick test on my machine with a mere 1GB of RAM, reading the
entire file into memory at once, and then doing some quick processing on
each line:

>>> def make_big_file(name, size=5000000):
...     fp = open(name, 'w')
...     for i in xrange(size):
...         fp.write('here is a bunch of text with a newline\n')
...     fp.close()
...
>>> make_big_file('BIG')
>>>
>>> def test(name):
...     import time
...     start = time.time()
...     fp = open(name, 'r')
...     for line in fp.readlines():
...         line = line.strip()
...         words = line.split()
...     fp.close()
...     return time.time() - start
...
>>> test('BIG')
22.53150200843811

Twenty-two seconds to read five million lines and split them into words.
I suggest the other nineteen minutes and forty-odd seconds your code is
taking has something to do with your code and not Python's execution
speed.

Of course, I wouldn't normally read all 5M lines into memory in one big
chunk. Replace the code

    for line in fp.readlines():

with

    for line in fp:

and the time drops from 22 seconds to 16.

--
Steven
--
http://mail.python.org/mailman/listinfo/python-list
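
For anyone who wants to repeat the readlines() versus plain iteration comparison on their own machine, here is a minimal sketch as a script rather than an interactive session. It is not the code that produced the timings above: the helper names (make_file, count_words_readlines, count_words_iter, timed, compare) are made up for illustration, the default file is much smaller than 5M lines, it uses range and the with statement so it assumes Python 2.6 or later (it also runs on Python 3), and it counts words so the two loops can be checked against each other.

```python
import os
import tempfile
import time


def make_file(path, size=100000):
    # Same filler line as the test in the post, but a much smaller default size.
    with open(path, 'w') as fp:
        for _ in range(size):
            fp.write('here is a bunch of text with a newline\n')


def count_words_readlines(path):
    # readlines() builds a list of every line before the loop starts.
    total = 0
    with open(path, 'r') as fp:
        for line in fp.readlines():
            total += len(line.strip().split())
    return total


def count_words_iter(path):
    # Iterating over the file object yields one line at a time.
    total = 0
    with open(path, 'r') as fp:
        for line in fp:
            total += len(line.strip().split())
    return total


def timed(func, path):
    # Returns (result, elapsed seconds) for a single call.
    start = time.time()
    result = func(path)
    return result, time.time() - start


def compare(size=100000):
    # Writes a temporary file, times both loops on it, then cleans up.
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        make_file(path, size)
        return {
            'readlines': timed(count_words_readlines, path),
            'iter': timed(count_words_iter, path),
        }
    finally:
        os.remove(path)


if __name__ == '__main__':
    for name, (words, seconds) in sorted(compare().items()):
        print('%-10s %d words in %.3f s' % (name, words, seconds))
```

The absolute numbers will depend on the machine, the Python version and disk caching, so treat a single run as a rough indication only and pass a larger size to compare() if the times are too small to tell apart.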