I'm trying to use Python to work with large pipe ('|') delimited data files. The files range in size from 25 MB to 200 MB.
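For illustration, a single record splits on the pipe into its fields roughly like this (the sample line is made up; the real files are just '|'-separated values):

# splitting one record into its fields (sample data is invented)
line = "12345|Smith|John|2006-01-15\n"
fields = line.rstrip("\n").split("|")
print fields    # ['12345', 'Smith', 'John', '2006-01-15']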
Since each line corresponds to a record, what I'm trying to do is create an object from each record. However, it seems that doing this makes the memory overhead go up two or three times. See the two examples below: run on the same input file, Example 2 uses about 3x the memory of Example 1. (Memory usage is checked using top.) This happens with both Python 2.4.3 on Gentoo Linux (64bit) and Python 2.3.4 on CentOS 4.4 (64bit).

Is this "just the way it is" or am I overlooking something obvious?

Thanks,
Matt

Example 1: read lines into a list:

# begin readlines.py
import sys, time

filedata = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0:
        break                      # EOF
    filedata.append(line)
file.close()

print "data read; sleeping 20 seconds..."
time.sleep(20)                     # gives time to check top
# end readlines.py

Example 2: read lines into objects:

# begin readobjects.py
import sys, time

class FileRecord:
    def __init__(self, line):
        self.line = line

records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0:
        break                      # EOF
    rec = FileRecord(line)
    records.append(rec)
file.close()

print "data read; sleeping 20 seconds..."
time.sleep(20)                     # gives time to check top
# end readobjects.py
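One variant I could try, in case the per-instance __dict__ is what eats the extra memory (this is only a guess on my part, not something I've measured), is giving the record class __slots__:

# begin readobjects_slots.py -- untested sketch, same class/field names as above
import sys, time

class FileRecord(object):          # must be a new-style class for __slots__ to apply
    __slots__ = ('line',)          # fixed slot instead of a per-instance __dict__
    def __init__(self, line):
        self.line = line

records = list()
file = open(sys.argv[1])
while True:
    line = file.readline()
    if len(line) == 0:
        break                      # EOF
    records.append(FileRecord(line))
file.close()

print "data read; sleeping 20 seconds..."
time.sleep(20)                     # gives time to check top
# end readobjects_slots.py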