Thanks for all the replies.

First of all, can anybody recommend a good way to show memory usage? I tried heapy, but I couldn't make much sense of the output, and it didn't seem to change much for different usages - maybe I was just making the h.heap() call in the wrong place. I also tried getrusage() in the resource module, but that gave 0 for the shared and unshared memory sizes no matter what I did; I was calling it right after the function call that filled up the lists. The memory figures I give in this message come from top.
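For reference, this is roughly what I was doing with resource (report_memory is just a little wrapper I wrote for these tests, not anything standard):

import resource

def report_memory(label=""):
    # ru_maxrss is the peak resident set size, in kilobytes on Linux.
    # The shared/unshared fields (ru_ixrss, ru_idrss, ru_isrss) aren't
    # maintained on Linux, which presumably explains the zeros I saw.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    print("%s: peak RSS %.1f MB" % (label, usage.ru_maxrss / 1024.0))

report_memory("after filling the lists")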
The numpy solution does work, but it uses more than 1GB of memory for one of my 130MB files. I'm using np.dtype({'names': ['chromo', 'position', 'dpoint'], 'formats': ['S6', 'i4', 'f8']}), so shouldn't it use 18 bytes per line? The file has 5832443 lines, which by my arithmetic comes to around 100MB...? My previous solution - a Python array for the numbers and a list of tuples for the coordinates - uses about 900MB, and the dictionary solution suggested by Tim got this down to 650MB. If I just ignore the coordinates, it comes down to less than 100MB, so I feel sure the list mechanics for storing the coordinates are what's killing me here.

As to "work smarter", you could be right, but it's tricky. The 28 files are in 4 groups of 7, and since each file is about 6 million lines, each group contains about 42 million data points. First, I need to divide every point by the median of its group, and then z-score the whole group. After this preparation, I need to file each point, based on its coordinates, into other data structures: the genome itself is divided into bins that each cover a range of coordinates, and each point goes into the bin for the coordinate region it overlaps. Then there are operations that combine the values from various bins, with the relevant coordinates for these combinations coming from more enormous CSV files. I've already done all this analysis on smaller datasets, so I'm hoping I won't have to make huge changes just to fit the data into memory.

Yes, I'm also finding out how much it will cost to upgrade to 32GB of memory :)

Sorry for the long message...

Peter
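P.S. In case the description above is too vague, here's a rough sketch of the arithmetic and of the per-group preparation step (the loading line and the names group_files and normalise are just placeholders, not my real code):

import numpy as np

dtype = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                  'formats': ['S6', 'i4', 'f8']})
print(dtype.itemsize)            # 18 bytes per record
print(dtype.itemsize * 5832443)  # ~105MB for one file, by my arithmetic

# One group = seven files concatenated into a single structured array,
# e.g. (placeholder loading step):
# group = np.concatenate([np.loadtxt(f, dtype=dtype) for f in group_files])

def normalise(group):
    vals = group['dpoint']        # a view onto the 'dpoint' field
    vals /= np.median(vals)       # divide every point by the group median
    vals -= vals.mean()           # then z-score the whole group in place
    vals /= vals.std()
    return group

# Filing into genome bins would then be along the lines of
#   bin_index = group['position'] // bin_size   (per chromosome)
# instead of keeping a Python list of tuples per bin.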