On Feb 18, 12:56 am, Carl Banks <pavlovevide...@gmail.com> wrote:
> On Feb 17, 3:08 pm, Lionel <lionel.ke...@gmail.com> wrote:
> > Hello all,
> >
> > On a previous thread (http://groups.google.com/group/comp.lang.python/browse_thread/thread/64da35b811e8f69d/67fa3185798ddd12?hl=en&lnk=gst&q=keene#67fa3185798ddd12)
> > I was asking about reading in binary data. Briefly, my data consists
> > of complex numbers, 32-bit floats for the real and imaginary parts.
> > The data is stored as 4 bytes Real1, 4 bytes Imaginary1, 4 bytes
> > Real2, 4 bytes Imaginary2, etc., in row-major format. I needed to
> > read the data in as two separate numpy arrays, one for real values
> > and one for imaginary values.
> >
> > There were several very helpful performance tips offered, and one in
> > particular I've started looking into. The author suggested a
> > "numpy.memmap" object may be beneficial. It was suggested I use it as
> > follows:
> >
> > descriptor = dtype([("r", "<f4"), ("i", "<f4")])
> > data = memmap(filename, dtype=descriptor, mode='r').view(recarray)
> > print "First 100 real values:", data.r[:100]
> >
> > I have two questions:
> > 1) What is "recarray"?
>
> Let's look:
>
> [GCC 4.3.2] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy
> >>> numpy.recarray
> <class 'numpy.core.records.recarray'>
> >>> help(numpy.recarray)
> Help on class recarray in module numpy.core.records:
>
> class recarray(numpy.ndarray)
>  |  recarray(shape, dtype=None, buf=None, **kwds)
>  |
>  |  Subclass of ndarray that allows field access using attribute
>  |  lookup.
>  |
>  |  Parameters
>  |  ----------
>  |  shape : tuple
>  |      shape of record array
>  |  dtype : data-type or None
>  |      The desired data-type. If this is None, then the data-type is
>  |      determined by the *formats*, *names*, *titles*, *aligned*, and
>  |      *byteorder* keywords.
>  |  buf : [buffer] or None
>  |      If this is None, then a new array is created of the given
>  |      shape and data-type. If this is an object exposing the buffer
>  |      interface, then the array will use the memory from an existing
>  |      buffer. In this case, the *offset* and *strides* keywords can
>  |      also be used.
>  ...
>
> So there you have it. It's a subclass of ndarray that allows field
> access using attribute lookup. (IOW, you're creating a view of the
> memmap'ed data of type recarray, which is the type numpy uses to
> access structures by name. You need to create the view because
> regular numpy arrays, which numpy.memmap creates, can't access fields
> by attribute.)
>
> help() is a nice thing to use, and numpy is one of the better
> libraries when it comes to docstrings, so learn to use it.
>
> > 2) The documentation for numpy.memmap claims that it is meant to be
> > used in situations where it is beneficial to load only segments of a
> > file into memory, not the whole thing. This is definitely something
> > I'd like to be able to do, as my files are frequently >1 Gb. I don't
> > really see in the documentation how portions are loaded, however.
> > The examples seem to create small arrays and then assign the entire
> > array (i.e. the file) to the memmap object. Let's assume I have a
> > binary data file of complex numbers in the format described above,
> > and let's assume that the size of the complex data array (that is,
> > the entire file) is 100x100 (rows x columns). Could someone please
> > post a few lines showing how to load the top-left 50 x 50 quadrant
> > and the lower-right 50 x 50 quadrant into memmap objects? Thank you
> > very much in advance!
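To make the quoted suggestion concrete, here is a minimal, self-contained
sketch of that approach. The file name "data.bin" and the scratch-file
writing step are assumptions made purely for the demonstration; only the
last two lines mirror the snippet quoted above.

    import numpy as np

    # Interleaved 32-bit float pairs: 4 bytes real, 4 bytes imaginary.
    descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])

    # Write a small scratch file so the example runs on its own
    # (hypothetical file name "data.bin", 100x100 samples).
    sample = np.zeros(100 * 100, dtype=descriptor)
    sample["r"] = np.arange(100 * 100, dtype=np.float32)
    sample.tofile("data.bin")

    # Map the file and view it as a recarray so the fields are
    # reachable as attributes (data.r, data.i) rather than only
    # as data["r"] and data["i"].
    data = np.memmap("data.bin", dtype=descriptor, mode="r").view(np.recarray)
    print "First 100 real values:", data.r[:100]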
> You would memmap the whole region in question (in this case the whole
> file), then take a slice. Actually, you could get away with memmapping
> just the last 50 rows (the bottom half). The offset into the file
> would be 50*100*8 bytes, so:
>
> data = memmap(filename, dtype=descriptor, mode='r',
>               offset=(50*100*8)).view(recarray)
> reshaped_data = reshape(data, (50, 100))
> interesting_data = reshaped_data[:, 50:100]
>
> A word of caution: every instance of numpy.memmap creates its own mmap
> of the whole file (even if it only creates an array from part of the
> file). The implications of this are A) you can't use numpy.memmap's
> offset parameter to get around file-size limitations, and B) you
> shouldn't create many numpy.memmaps of the same file. To work around
> B, you should create a single memmap, and dole out views and slices.
>
> Carl Banks
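Putting Carl's pieces together, a sketch covering both quadrants might
look like the following. It assumes the 100x100 layout from the question
and reuses the hypothetical "data.bin" scratch file from the earlier
sketch; the variable names are mine.

    import numpy as np

    descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])
    nrows, ncols = 100, 100          # layout assumed in the question
    recsize = descriptor.itemsize    # 8 bytes per complex sample

    # Top-left 50x50: only the first 50 rows need to be mapped.
    top = np.memmap("data.bin", dtype=descriptor, mode="r",
                    shape=(nrows // 2, ncols)).view(np.recarray)
    top_left = top[:, :ncols // 2]

    # Lower-right 50x50: start the mapping at row 50 via the offset.
    bottom = np.memmap("data.bin", dtype=descriptor, mode="r",
                       offset=(nrows // 2) * ncols * recsize).view(np.recarray)
    lower_right = bottom.reshape(nrows // 2, ncols)[:, ncols // 2:]

    print top_left.r.shape, lower_right.r.shape   # -> (50, 50) (50, 50)

Per Carl's caution about multiple mmaps of the same file, in real code
it may be better to create one memmap of the whole file and slice both
quadrants out of it, rather than opening two separate maps as above.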
Thanks Carl, I like your solution. Am I correct in my understanding
that memory is allocated at the slicing step in your example, i.e.
when "reshaped_data" is sliced using "interesting_data =
reshaped_data[:, 50:100]"? In other words, given a huge (say 1 Gb)
file, a memmap object is constructed that maps the entire file, some
relatively small amount of memory is allocated for the memmap
operation itself, and the bulk memory allocation occurs only when I
generate my final numpy sub-array by slicing; is that what accounts
for the memory efficiency of using memmap?
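For readers wondering the same thing: basic slicing in numpy returns a
view rather than a copy, so the slice itself does not trigger a bulk
read; the operating system pages data in lazily as elements are
actually accessed, and an in-memory copy only appears if you ask for
one. A quick check, reusing the hypothetical "data.bin" from the
sketches above:

    import numpy as np

    descriptor = np.dtype([("r", "<f4"), ("i", "<f4")])
    data = np.memmap("data.bin", dtype=descriptor, mode="r").view(np.recarray)
    reshaped_data = data.reshape(100, 100)
    interesting_data = reshaped_data[:, 50:100]

    # The slice is still a view over the same mapped buffer; nothing
    # has been read from disk in bulk at this point.
    print interesting_data.base is not None   # True: it's a view

    # Copying is what actually materializes the values in RAM.
    in_memory = np.array(interesting_data)
    print in_memory.base is None              # True: owns its memory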