sturlamolden wrote:

> oyekomova wrote:
> > Thanks for your help. I compared the following code in NumPy with
> > csvread in Matlab for a very large csv file. Matlab read the file in
> > 577 seconds. On the other hand, the code below kept running for over
> > 2 hours. Can this program be made more efficient? FYI - the csv file
> > was a simple 6-column file with a header row and more than a million
> > records.
> >
> > import csv
> > from numpy import array
> > import time
> >
> > t1 = time.clock()
> > file_to_read = file('somename.csv', 'r')
> > read_from = csv.reader(file_to_read)
> > read_from.next()  # skip the header row
> > datalist = [map(float, row[:]) for row in read_from]
>
> I'm willing to bet that this is your problem. Python lists are arrays
> under the hood!
>
> Try something like this instead:
>
> # read the whole file in one chunk
> lines = file_to_read.readlines()
> # count the number of columns
> n = 1
> for c in lines[1]:
>     if c == ',': n += 1
> # count the number of rows
> m = len(lines[1:])

Please consider using m = len(lines) - 1 here; len(lines[1:]) builds a
copy of the whole million-entry list just to measure its length.
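While we are counting, the comma loop can be a one-liner:

    n = lines[1].count(',') + 1   # commas in the first data row, plus one

(Note that the snippets in this post also assume
from numpy import empty, arange, which is never shown.)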
> # allocate
> data = empty((m, n), dtype=float)
> # create csv reader, skip header
> reader = csv.reader(lines[1:])

lines[1:] again? That slice copies the whole list a second time. The OP
set you an example:

    read_from.next()

so you could use:

    reader = csv.reader(lines)
    _unused = reader.next()

> # read
> for i in arange(0, m):
>     data[i,:] = map(float, reader.next())
>
> And if this is too slow, you may consider vectorizing the last loop:
>
> data = empty((m, n), dtype=float)
> newstr = ",".join(lines[1:])
> flatdata = data.reshape((n*m))  # flatdata is a view of data, not a copy
> reader = csv.reader([newstr])
> flatdata[:] = map(float, reader.next())
>
> I hope this helps!
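For completeness, a third option, if your NumPy is new enough to ship
loadtxt (an untimed sketch, so no promises that it beats the vectorized
version; 'somename.csv' is the OP's file name):

    from numpy import loadtxt

    # parses the floats and skips the header row in one call
    data = loadtxt('somename.csv', delimiter=',', skiprows=1)

Worth timing all three against the Matlab figure; with a million rows
the constant factors add up.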