On Fri, Mar 13, 2009 at 10:59 AM, psaff...@googlemail.com
<psaff...@googlemail.com> wrote:
> I'm reading in some rather large files (28 files each of 130MB). Each
> file is a genome coordinate (chromosome (string) and position (int))
> and a data point (float). I want to read these into a list of
> coordinates (each a tuple of (chromosome, position)) and a list of
> data points.
>
> This has taught me that Python lists are not memory efficient, because
> if I use lists it gets through 100MB a second until it hits the swap
> space and I have 8GB physical memory in this machine. I can use Python
> or numpy arrays for the data points, which is much more manageable.
> However, I still need the coordinates. If I don't keep them in a list,
> where can I keep them?

Assuming your data is in a plaintext file something like
'genomedata.txt' below, the following will load it into a numpy array
with a customized dtype. You can access the different fields by name
('chromo', 'position', and 'dpoint' -- change to your liking). I don't
know whether this will work for your case, but it might be worth a try.

===============================================
[186]$ cat genomedata.txt
gene1 120189 5.34849
gene2 84040 903873.1
gene3 300822 -21002.2020

[187]$ cat g2arr.py
import numpy as np

def g2arr(fname):
    # The 'S100' should be modified to be large enough for your string
    # field. It is a fixed-width byte string, so every record pays for
    # the full width.
    # Builtin int/float are used because the np.int and np.float
    # aliases were removed in NumPy 1.24.
    dt = np.dtype({'names': ['chromo', 'position', 'dpoint'],
                   'formats': ['S100', int, float]})
    return np.loadtxt(fname, delimiter=' ', dtype=dt)

if __name__ == '__main__':
    arr = g2arr('genomedata.txt')
    print(arr)
    print(arr['chromo'])
    print(arr['position'])
    print(arr['dpoint'])
=================================================

Take a look at the np.loadtxt and np.dtype documentation.

Kurt
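
P.S. Here is a rough sketch of how you could check what each record costs in a structured array before loading the real files. The sample rows, the 'S10' width, and the explicit 64-bit field types are my assumptions for illustration, not something taken from your data; it only reads dtype.itemsize and arr.nbytes.

```python
import io
import numpy as np

# Hypothetical sample rows in the same layout as genomedata.txt.
SAMPLE = (
    "gene1 120189 5.34849\n"
    "gene2 84040 903873.1\n"
    "gene3 300822 -21002.2020\n"
)

# 'S10' is an assumed width; size it to your longest chromosome name.
dt = np.dtype([('chromo', 'S10'),
               ('position', np.int64),
               ('dpoint', np.float64)])

# np.loadtxt accepts a file-like object as well as a filename.
arr = np.loadtxt(io.StringIO(SAMPLE), delimiter=' ', dtype=dt)

bytes_per_record = dt.itemsize   # 10 + 8 + 8 = 26 (fields are packed, not aligned)
total_bytes = arr.nbytes         # 3 records * 26 bytes = 78

print(bytes_per_record, total_bytes)
```

The cost is fixed per record, so multiplying dtype.itemsize by your row count gives an estimate of the total before you load all 28 files. A list of (chromosome, position) tuples holds a separate Python object for each tuple, string, and int, which is where the extra memory goes.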