On 24/03/2007 8:11 AM, Matt Garman wrote:
> I'm trying to use Python to work with large pipe ('|') delimited data
> files.  The files range in size from 25 MB to 200 MB.
>
> Since each line corresponds to a record, what I'm trying to do is
> create an object from each record.

An object with only 1 attribute and no useful methods seems a little
pointless; I presume you will elaborate it later.

> However, it seems that doing this
> causes the memory overhead to go up two or three times.
>
> See the two examples below: running each on the same input file
> results in 3x the memory usage for Example 2.  (Memory usage is
> checked using top.)
>
> This happens for both Python 2.4.3 on Gentoo Linux (64bit) and Python
> 2.3.4 on CentOS 4.4 (64bit).
>
> Is this "just the way it is" or am I overlooking something obvious?
>
> Thanks,
> Matt
>
>
> Example 1: read lines into list:
> # begin readlines.py

Interesting name for the file :-)
How about using the file.readlines() method?
Why do you want all 200 MB in memory at once anyway?

> import sys, time
> filedata = list()
> file = open(sys.argv[1])

You have just clobbered the builtin file() function/type. In this case
it doesn't matter, but you should lose the habit, quickly.

> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     filedata.append(line)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top

How about using raw_input('Hit the Any key...') ?

> # end readlines.py
>
>
> Example 2: read lines into objects:
> # begin readobjects.py
> import sys, time
> class FileRecord:
>     def __init__(self, line):
>         self.line = line
> records = list()
> file = open(sys.argv[1])
> while True:
>     line = file.readline()
>     if len(line) == 0: break # EOF
>     rec = FileRecord(line)
>     records.append(rec)
> file.close()
> print "data read; sleeping 20 seconds..."
> time.sleep(20) # gives time to check top
> # end readobjects.py

After all that, you still need to split the lines into the more-than-one
fieldS (plural) that one would expect in a record.

A possibly faster alternative to (fastest_line_reader_so_far,
line.split('|')) is to use the csv module, as in the following example,
which also shows one way of making an object out of a row of data.

C:\junk>type readpipe.py
import sys, csv

class Contacts(object):
    __slots__ = ['first', 'family', 'email']
    def __init__(self, row):
        for attrname, value in zip(self.__slots__, row):
            setattr(self, attrname, value)

def readpipe(fname):
    if hasattr(fname, 'read'):
        f = fname
    else:
        f = open(fname, 'rb')
        # 'b' is in case you'd like your script to be portable
    reader = csv.reader(
        f,
        delimiter='|',
        quoting=csv.QUOTE_NONE,
        # Set quotechar to a char that you don't expect in your data
        # e.g. the ASCII control char BEL (0x07). This is necessary
        # for Python 2.3, whose csv module used the quoting arg only when
        # writing, otherwise your " characters may get stripped off.
        quotechar='\x07',
        skipinitialspace=True,
        )
    for row in reader:
        if row == ['']: # blank line
            continue
        c = Contacts(row)
        # do something useful with c, e.g.
        print [(x, getattr(c, x)) for x in dir(c)
            if not x.startswith('_')]

if __name__ == '__main__':
    if sys.argv[1:2]:
        readpipe(sys.argv[1])
    else:
        print '*** Testing ***'
        import cStringIO
        readpipe(cStringIO.StringIO('''\
Biff|Bloggs|[EMAIL PROTECTED]
Joseph ("Joe")|Blow|[EMAIL PROTECTED]
"Joe"|Blow|[EMAIL PROTECTED]
Santa|Claus|[EMAIL PROTECTED]
'''))

C:\junk>\python23\python readpipe.py
*** Testing ***
[('email', '[EMAIL PROTECTED]'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', '[EMAIL PROTECTED]'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', '[EMAIL PROTECTED]'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', '[EMAIL PROTECTED]'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>\python25\python readpipe.py
*** Testing ***
[('email', '[EMAIL PROTECTED]'), ('family', 'Bloggs'), ('first', 'Biff')]
[('email', '[EMAIL PROTECTED]'), ('family', 'Blow'), ('first', 'Joseph ("Joe")')]
[('email', '[EMAIL PROTECTED]'), ('family', 'Blow'), ('first', '"Joe"')]
[('email', '[EMAIL PROTECTED]'), ('family', 'Claus'), ('first', 'Santa')]

C:\junk>

HTH,
John
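
For comparison with the csv version, here is a rough sketch of the plain
line.split('|') route that also avoids holding the whole file in memory
at once. It is not code from this thread: the names Contact and
iter_contacts are made up for illustration, the three field names are
borrowed from the example above, and it assumes every non-blank line has
exactly three fields. It is written to run under Python 2.6+ as well as
Python 3.

```python
from __future__ import print_function
import io


class Contact(object):
    # __slots__ means instances carry no per-object __dict__,
    # which keeps each record object small.
    __slots__ = ('first', 'family', 'email')

    def __init__(self, first, family, email):
        self.first = first
        self.family = family
        self.email = email


def iter_contacts(lines):
    """Yield one Contact per non-blank pipe-delimited line.

    `lines` can be an open file or any iterable of strings; records
    are produced one at a time rather than collected in a list.
    """
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:  # skip blank lines
            continue
        # Assumption: exactly three fields; anything else raises ValueError.
        first, family, email = line.split('|')
        yield Contact(first, family, email)


if __name__ == '__main__':
    sample = io.StringIO(
        u'Biff|Bloggs|[EMAIL PROTECTED]\n'
        u'\n'
        u'Santa|Claus|[EMAIL PROTECTED]\n'
    )
    for c in iter_contacts(sample):
        print(c.first, c.family, c.email)
```

With a real file you would pass the open file object straight to
iter_contacts() and process each record inside the loop, so only one
line's worth of data is alive at any moment.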