Hi,

I am pretty new to Python and trying to use it for a relatively simple problem: loading a 5-million-line text file and converting it into a few binary files. The text file has a fixed format (like a punch card). The columns contain integer, real, and date values, and the output files hold the same values in binary. I have to parse the values and write the binary tuples out into the correct file based on a given column. It's a little more involved than that, but the details aren't important.
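In case it helps, here is roughly what the inner loop looks like. The column offsets, the struct format, the input file name, and the out_<key>.bin naming below are all made up for illustration; the real layout has many more fields:

import struct
import time

outputs = {}   # one binary output file per value of the routing column

for line in open('input.txt'):
    key  = line[0:4]                               # column that picks the output file
    ival = int(line[4:12])                         # an integer column
    rval = float(line[12:24])                      # a real column
    tval = int(time.mktime(time.strptime(line[24:36], "%m%d%y%H%M%S")))  # a date column

    out = outputs.get(key)
    if out is None:
        out = outputs[key] = open('out_%s.bin' % key.strip(), 'wb')
    out.write(struct.pack('=idi', ival, rval, tval))   # the binary tuple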
I have a C++ prototype of the parsing code and it loads a 5-million-line file in about a minute. I was expecting the Python version to be 3-4 times slower, and I could live with that. Unfortunately, it's 20 times slower and I don't see how to fix that. The fundamental difference is that in C++ I create a single object (a line buffer) that is reused for every input line, and the column values are extracted straight from that buffer without creating new string objects. In Python, new objects must be created and destroyed by the million, which must incur serious memory-management overhead. Correct me if I am wrong, but:

1) for line in file: ... creates a new string object for every input line;
2) line[start:end] creates yet another new string object;
3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S"))) creates a whole struct_time (nine fields) plus the final int, i.e. ten-odd objects just to decode one timestamp;
4) a simple membership test like line[i:j] + line[m:n] in hash creates three new strings, and there is no way to avoid that.

I thought the array module would help, but I can't load an array without creating a string first: there is no array(line, start, end) style constructor. I hope I am missing something. I really like Python, but if there is no way to process data like this efficiently, that seems like a real problem.

Thanks,
igor
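P.S. This is the kind of thing I tried with the array module (a toy sketch with made-up offsets, using Python 2's 'c' typecode); the temporary string still gets built:

from array import array

line = "0042  3.1415 010203040506\n"

# What I was hoping for: fill the array straight from a slice of the
# existing line, something like array('c', line, 6, 12), but that
# call signature is imaginary; no such constructor exists.
col = array('c', line[6:12])   # so line[6:12] becomes a new str anyway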