Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files.

Looks like a job for the csv module (in the standard lib).

> The files range in size from 25 MB to 200 MB.
>
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record. However, it seems that doing this
> causes the memory overhead to go up two or three times.
>
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2. (Memory usage is
> checked using top.)

Just for the record, *everything* in Python is an object, so the problem is not about 'using objects'. Of course, a complex object might eat up more space than a simple one...

Python has two simple types for structured data: tuples (like database rows) and dicts (associative arrays). You can use the csv module to parse a csv-like format into either lists of fields (easily turned into tuples) or dicts. If you want to save memory, tuples may be the best choice.

> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
>
> Is this "just the way it is" or am I overlooking something obvious?

What are you doing with your records? Do you *really* need to keep the whole list in memory? If not, you can just work line by line:

    source = open(sys.argv[1])
    for line in source:
        do_something_with(line)
    source.close()

This will avoid building a huge in-memory list.

While we're at it, your snippets are definitely unpythonic and overcomplicated:

(snip)
> filedata = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     filedata.append(line)
> file.close()
(snip)

    filedata = open(sys.argv[1]).readlines()

> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:

    class FileRecord(object):

>     def __init__(self, line):
>         self.line = line

If this is your real code, I don't see any reason why this should eat up 3 times more space than the original version.

> records = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     rec = FileRecord(line)
>     records.append(rec)
> file.close()

    records = map(FileRecord, open(sys.argv[1]).readlines())

--
http://mail.python.org/mailman/listinfo/python-list
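
For what it's worth, the csv suggestion and the line-by-line advice above could be combined roughly as follows. This is only a sketch under a few assumptions: it uses Python 3 syntax (the thread itself is about Python 2.3/2.4), the names iter_records and count_fields are made up for illustration, and the "work" done per record is a placeholder for whatever you actually need.

```python
import csv


def iter_records(path, delimiter="|"):
    """Yield each line of a delimited file as a tuple of fields.

    Hypothetical helper: only one record is held in memory at a time.
    """
    # newline="" is what the csv documentation recommends for files
    # that are handed to csv.reader.
    with open(path, newline="") as source:
        for row in csv.reader(source, delimiter=delimiter):
            # csv.reader produces lists; convert to the more compact tuple.
            yield tuple(row)


def count_fields(path):
    """Placeholder per-record work: count records and fields in one pass."""
    records = 0
    fields = 0
    for record in iter_records(path):
        records += 1
        fields += len(record)
    return records, fields
```

Calling count_fields(sys.argv[1]) walks the file once without ever building the full list. If you really do need every record at once, list(iter_records(path)) gives you a list of tuples.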