On Fri, Mar 13, 2009 at 1:13 PM, psaff...@googlemail.com
<psaff...@googlemail.com> wrote:
> Thanks for all the replies.
> [snip]
>
> The numpy solution does work, but it uses more than 1GB of memory for
> one of my 130MB files. I'm using
>
> np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6',
> 'i4', 'f8']})
>
> so shouldn't it use 18 bytes per line? The file has 5832443 lines,
> which by my arithmetic is around 100MB...?
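The arithmetic in the quoted question can be checked directly. This is only a sketch: it rebuilds the dtype from the quoted message and multiplies its item size by the quoted line count, assuming numpy's default packed (unaligned) layout for the record. The variable names are illustrative.

```python
import numpy as np

# Record layout from the quoted message: 6-byte string, 4-byte int, 8-byte float.
dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
               'formats': ['S6', 'i4', 'f8']})

n_lines = 5832443  # line count given in the quoted message

# With the default align=False the fields are packed with no padding.
bytes_per_record = dt.itemsize
total_bytes = n_lines * bytes_per_record

print(bytes_per_record)        # bytes per record
print(total_bytes / 1e6)       # size in MB (10**6 bytes)
print(total_bytes / 2.0**20)   # size in MiB (2**20 bytes)
```

The same total reads as roughly 105 in units of 10**6 bytes and roughly 100 in units of 2**20 bytes, so the "around 100MB" estimate and a figure near 105MB can describe the same array.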
I made a mock-up file with 5832443 lines, each line consisting of

abcdef 100 100.0

and ran the g2arr() function with 'S6' for the string. While it ran (which took a really long time), memory usage on my computer spiked to around 800MB, but once g2arr() returned, it dropped to around 200MB. The array itself consumes 105MB (from arr.nbytes).

From looking at the loadtxt routine in numpy, it looks like a zillion objects are created inside the routine (string objects from splitting each line, temporary ints, floats and strings for type conversions, etc.), and these are garbage collected upon return. I'm not well versed in Python's internal memory management, but from what I understand, practically all of that memory is either returned to the OS or held by Python for reuse by other objects after the routine returns. The only memory still in use by the array is the ~100MB for the raw data.

Making 5 copies of the array (using numpy.copy(arr)) bumps total memory usage (from top) up to 700MB, which is 117MB or so per array across the six arrays (the original plus the five copies). Summing arr.nbytes over the same six arrays gives 630MB (105MB / array), so there isn't that much memory wasted.

Basically, the numpy solution will pack the data into an array of C structs with the fields as indicated by the dtype parameter.

Perhaps a database solution, as mentioned in other posts, would suit you better. If the temporary spike in memory usage is unacceptable, you could try to roll your own loadtxt function that would be leaner and meaner. I suggest the numpy solution for its ease and efficient use of memory.

Kurt
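P.S. A minimal sketch of what a "leaner" hand-rolled loader could look like, under stated assumptions: the file has exactly three whitespace-separated columns matching the dtype above, and the function name load_lean is hypothetical (it is not part of numpy and is not the g2arr() function mentioned earlier). The idea is to count the lines first, allocate the array once, and then fill it one record at a time, so only one line's worth of temporary objects is alive at any moment. It has not been timed or measured against the 130MB file.

```python
import numpy as np

DTYPE = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                  'formats': ['S6', 'i4', 'f8']})


def load_lean(path, dtype=DTYPE):
    """Hypothetical two-pass loader: count lines, preallocate, fill in place."""
    # Pass 1: count non-blank lines so the array can be allocated exactly once.
    with open(path, 'rb') as f:
        n = sum(1 for line in f if line.strip())

    arr = np.empty(n, dtype=dtype)

    # Pass 2: parse one line at a time and store it straight into the array.
    with open(path, 'rb') as f:
        i = 0
        for line in f:
            fields = line.split()
            if not fields:
                continue  # skip blank lines, matching the count in pass 1
            # Binary mode yields bytes, which fit the 'S6' field directly;
            # int() and float() accept bytes input as well.
            arr[i] = (fields[0], int(fields[1]), float(fields[2]))
            i += 1
    return arr
```

Reading the file twice trades some I/O for a flat memory profile; whether that trade is worthwhile depends on how the temporary spike compares with the extra read time on the machine in question.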